AI4Bharat has launched IndicLLMSuite, a suite of resources intended to improve the representation of Indian languages in large language model (LLM) training. The suite is tailored to the challenges that low- and mid-resource languages face in LLM development, and it aims to democratize access to advanced NLP technologies across India's diverse linguistic landscape.
Empowering linguistic diversity
IndicLLMSuite spans 22 Indian languages and totals 251 billion tokens and 74.8 million instruction-response pairs. The corpus draws on several kinds of sources, including manually verified URLs, existing multilingual corpora, and large-scale translations. Collecting data this broadly is meant to give each language more robust representation in model training.
The suite comprises several essential components designed to facilitate the creation and refinement of Language Models tailored to Indian languages:
Sangraha is the foundational component of IndicLLMSuite: a large pre-training dataset aggregated from diverse linguistic sources. With 251 billion tokens spanning 22 languages, it provides the raw material needed to train language models.
Setu is a Spark-based distributed pipeline custom-built for Indian languages. It extracts content from a range of sources, including websites, PDFs, and videos, and its built-in stages for cleaning, filtering, toxicity removal, and deduplication are meant to keep the extracted data clean and high in quality.
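To make the cleaning, filtering, toxicity removal, and deduplication stages concrete, here is a minimal single-machine sketch in plain Python. It is not Setu's code and does not use Spark; the function names, the `BLOCKLIST` contents, and the `min_words` threshold are all hypothetical choices made for illustration.

```python
import hashlib
import re
import unicodedata

# Hypothetical blocklist; a real pipeline would rely on curated per-language resources.
BLOCKLIST = {"badword"}


def clean(text: str) -> str:
    """Normalize Unicode to NFC and collapse runs of whitespace."""
    text = unicodedata.normalize("NFC", text)
    return re.sub(r"\s+", " ", text).strip()


def keep(text: str, min_words: int = 5) -> bool:
    """Reject very short documents and documents containing blocklisted words."""
    words = text.lower().split()
    if len(words) < min_words:
        return False
    return not any(word in BLOCKLIST for word in words)


def deduplicate(docs):
    """Drop exact duplicates by hashing each document, preserving first-seen order."""
    seen = set()
    for doc in docs:
        digest = hashlib.sha256(doc.encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            yield doc


def run_pipeline(raw_docs):
    """Apply the clean, filter, and deduplicate stages in order."""
    cleaned = (clean(doc) for doc in raw_docs)
    filtered = (doc for doc in cleaned if keep(doc))
    return list(deduplicate(filtered))
```

Each stage here is a plain function over an iterable, so the same structure could in principle be mapped onto a distributed framework. A production pipeline would also need near-duplicate detection and language-aware toxicity handling rather than the exact-match hashing and word blocklist assumed above.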
IndicAlign-Instruct is a collection of 74.7 million prompt-response pairs across 20 languages. The pairs were assembled in several ways: compiling existing Instruction Fine-Tuning (IFT) datasets, translating English datasets, generating discussions from India-centric Wikipedia articles, and crowd-sourcing through the Anudesh platform. In addition, a new IFT dataset drawn from IndoWordNet is included to help models learn language and grammar.
IndicAlign-Toxic addresses safety alignment by providing a curated dataset of 123K pairs of toxic prompts and non-toxic responses. The pairs were produced with open-source English LLMs and then translated into 14 Indian languages, with the goal of making Indic language models safer and more reliable.
Collaborative endeavors in language technology
The unveiling of IndicLLMSuite underscores a collaborative effort within the Indian AI landscape to advance the development of language technologies. Partnering with Sarvam AI and IIT Madras, AI4Bharat recently introduced IndicVoices, a comprehensive speech dataset aimed at fostering inclusivity and diversity in speech recognition applications. With 7,348 hours of natural speech from 16,237 speakers across 145 Indian districts and 22 languages, IndicVoices complements the efforts of IndicLLMSuite in enriching the linguistic ecosystem of India.
The introduction of IndicLLMSuite marks a pivotal moment in the journey towards inclusive language technology development in India. By democratizing access to resources and fostering collaboration among stakeholders, AI4Bharat reinforces its commitment to promoting linguistic diversity and empowering Indian languages in the digital age. As the landscape of NLP continues to evolve, initiatives like IndicLLMSuite serve as catalysts for innovation and progress, paving the way for a more inclusive and accessible linguistic future.