Published on Apr 20, 2026

India’s multilingual AI push is bringing local languages into AI systems, but without community stewardship, inclusion risks proceeding without representation

Language Stewardship in India’s AI Ecosystem

This essay is part of the series: World Creativity and Innovation Day 2026: Sparks and Shields


AI systems worldwide are predominantly English-centric. Among the nearly 7,000 languages spoken globally, fewer than 100 are significantly represented in major AI training corpora. This mirrors the language skew of the internet, where English accounts for 49 percent of all content while English speakers make up less than 20 percent of the world’s population. A February 2026 analysis by the United Nations Development Programme (UNDP) found that AI systems in low-resource languages are up to five times more expensive to source and process than their English-language equivalents. This is not solely a technical limitation. It is also a failure of governance.

India has one of the most ambitious multilingual AI programmes in the world, seeking to cover its 22 scheduled languages, hundreds of tribal dialects, and over a billion citizens for whom English is not a first language. The IndiaAI Mission, BharatGen, Bhashini, and Adi-Vaani together aim to ensure that AI training data and output reflect India’s linguistic reality. As these systems scale, a critical governance issue demands attention: ensuring that when AI is trained on the local languages of Indian communities, mechanisms are in place for those communities to act as stewards of the process rather than merely as its sources.

India’s Multilingual AI Ambition

The IndiaAI Mission, approved in March 2024, has funded, among other activities, BharatGen, a large language model (LLM) initiative led by IIT Bombay and a consortium of national research institutions. BharatGen includes models such as Param-1 for text processing, Shrutam for speech recognition, and Sooktam for text-to-speech, spanning all of India’s scheduled languages.

In parallel, the Office of the Principal Scientific Advisor has issued a white paper on Foundation Models, identifying small language models (SLMs) as a strategic instrument for linguistic inclusion. Adi-Vaani, an AI translation tool introduced in 2025, extends this ambition to tribal languages. It supports Santali, Bhili, Mundari, and Gondi, languages that rely on oral transmission and have a limited digital presence. However, the governance of their linguistic data has become a matter of concern.

The Data Stewardship Question

The Government of India has confirmed that digitised language data from SPPEL (Scheme for Protection and Preservation of Endangered Languages) and Sanchika’s government archives enrich training for the AI models BharatGen and Bhashini. SPPEL recordings were collected under a preservation mandate. Speakers of Gondi, Mundari, and Santali have participated in language archiving exercises without knowing that the material collected might someday serve as training data for AI systems whose outputs may be commercialised. While BharatGen describes its data practices as statutory-compliant, it does not address whether target communities have given consent or have any say in how the language is represented in the resulting model. Moreover, if a Santali corpus captures only one dialect, that dialect could become the de facto standard for all subsequent AI systems trained on it.

The Stewardship Gap in India’s Current Framework

The Digital Personal Data Protection Act, 2023, protects personal data, that is, data about identifiable individuals. India’s AI Governance Guidelines, released in November 2025, establish transparency and accountability principles across the AI lifecycle. However, neither instrument addresses collective linguistic data. Privacy frameworks are built around the individual as the unit of protection. A corpus of Santali folk songs identifies no individual person. A dataset of Gondi agricultural vocabulary names no individual. Yet the harm from their misuse is collective and lasting, and current law has no mechanism to address it. Nor is there any existing provision for communities to legally contest or safeguard against linguistic data extraction.

Precedents and Models

India could consider reframing language resources as collective assets whose use requires structured consent. The institutional precedent for this exists. The Traditional Knowledge Digital Library (TKDL) was developed in 2001 by the Government of India to protect traditional medicinal knowledge from unauthorised commercial use, not by restricting access outright, but by documenting that knowledge in a form that gave communities legal standing when it was misused. The same principle could be applied to linguistic data: for example, a Santali speech corpus could be documented in this way before it enters the commercial pipeline.

Similarly, Canada’s FirstVoices platform, launched in 2003 by the First Peoples’ Cultural Council, functions as a community-governed digital language repository hosting words, phrases, audio recordings, songs, and stories across more than 65 Indigenous language sites. Its governance architecture is instructive: nations retain full ownership and control of their data, each language site is managed by community-designated administrators, and communities determine which content is publicly accessible and which is restricted. The platform follows Canada’s OCAP® principles (Ownership, Control, Access, Possession), which explicitly extend to languages, cultures, knowledge, stories, songs, and ceremonies. AI developers cannot access protected language data without community consent, so language enters digital systems only on the community’s terms.

A possible institutional vocabulary for adapting such a model to the Indian context may be derived from the Community-in-the-Loop[1] (CITL) AI governance framework. The CITL’s second pillar, Community-Based Data Stewardship, identifies three mechanisms through which communities could govern their data collectively. Data Trusts place a fiduciary intermediary between the community and the data user, protecting community rights and ensuring fair sharing of benefits. Data Cooperatives vest democratic governance in community members, enabling them to reclaim the value of data generated from their participation. Civic Data Commons establishes a shared open infrastructure that facilitates innovation while retaining community oversight. The CITL framework draws on the Data Stewardship Ladder, which maps the transition from extraction to stewardship.

New Zealand’s Kaitiakitanga approach offers a further international reference. In 2022, Te Hiku Media, a Māori broadcaster, declined to release its speech recognition model as open source. Instead, it created the Kaitiakitanga licence, a guardianship instrument requiring users to demonstrate genuine care for the Māori community before gaining access. The core principle potentially transferable to the Indian context is that stewardship of language in AI is a community responsibility, not a technical one. India’s own Fifth and Sixth Schedules and the PESA Act already recognise community self-governance over shared resources. A language data stewardship architecture could build on this foundation.

A Data Stewardship Architecture

The following recommendations outline a governance and licensing framework for the use of language as training data.

  • Mandate Data Declaration Records: The Ministry of Electronics and Information Technology (MeitY) should require, as a condition of IndiaAI Mission funding, that every supported model publish a Data Declaration Record before open-source release or government deployment. The record would disclose which languages are in the corpus, the source of each dataset, any dialectal variants excluded, and what community consultation was conducted. This draws on CITL's concept of the Dataset Nutrition Label, enabling practitioners and communities to assess dataset suitability for specific applications.
  • Pilot Community Language Data Trusts: MeitY, the Ministry of Culture, and the Ministry of Tribal Affairs should jointly pilot Language Data Trusts for selected low-resource languages, such as Santali, Gondi, Bodo, Maithili, and Mizo. Each trust would serve as a statutory advisory body with elected community representatives as the majority voice, linguistic scholars in advisory roles, and a government nominee for accountability. Its mandate would cover corpus design, dialectal validation, and review of model outputs, thereby creating stewardship bodies rather than veto bodies.

This approach is grounded in demonstrated precedent. A 2025 Community-in-the-Loop pilot in rural Maharashtra invited farmers to co-design flood prediction models, integrating sensor data with local ecological knowledge. Accuracy improved by 15 percent, while agreement that the AI aligned with local needs rose from 5 percent to 35 percent. Participants also refused permission to record sessions until data stewardship agreements were in place, a concern that language data trusts could directly address.

  • Develop Community-Verified Language Data Commons: Building on the TKDL’s documentation architecture and the FirstVoices model, MeitY and the Ministry of Culture should develop language data commons for low-resource scheduled languages. These commons would function as shared digital infrastructure to host community-verified corpora, maintain full provenance records, and make them available to AI developers under licensing terms that include benefit-sharing obligations and representation audits.
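
As an illustration only, the Data Declaration Record proposed in the first recommendation could be published in machine-readable form. The Python sketch below is a hypothetical schema, not a MeitY or IndiaAI specification: the class names, field names, and the `missing_disclosures` helper are assumptions introduced here, and the sample values are placeholders rather than descriptions of any real corpus.

```python
from dataclasses import dataclass, field


@dataclass
class DatasetEntry:
    """One dataset in a model's training corpus (hypothetical schema)."""
    language: str
    source: str  # where the dataset came from, e.g. an archive name
    excluded_dialects: list[str] = field(default_factory=list)
    community_consultation: str = ""  # summary of consultation held, if any


@dataclass
class DataDeclarationRecord:
    """Disclosure published before release or deployment (hypothetical)."""
    model_name: str
    datasets: list[DatasetEntry] = field(default_factory=list)

    def languages(self) -> list[str]:
        """Languages present in the corpus, sorted and de-duplicated."""
        return sorted({d.language for d in self.datasets})

    def missing_disclosures(self) -> list[str]:
        """List datasets whose source or consultation field is left blank."""
        gaps = []
        for d in self.datasets:
            if not d.source.strip():
                gaps.append(f"{d.language}: source not declared")
            if not d.community_consultation.strip():
                gaps.append(f"{d.language}: consultation not declared")
        return gaps


# Placeholder values for illustration only.
record = DataDeclarationRecord(
    model_name="example-model",
    datasets=[
        DatasetEntry(
            language="Santali",
            source="example archive",
            excluded_dialects=["example dialect"],
            community_consultation="example consultation summary",
        ),
        DatasetEntry(language="Gondi", source="example archive"),
    ],
)

print(record.languages())
print(record.missing_disclosures())
```

A record with no gaps would show only that the fields were filled in, not that consultation was adequate, which is why the recommendations pair disclosure with community review through Language Data Trusts.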

Conclusion

India’s multilingual AI ambition will ultimately be judged not only by the number of languages its systems can process, but by whether the communities that speak those languages are meaningfully represented in how they are encoded. Without institutional mechanisms for stewardship, inclusion risks becoming extractive rather than participatory. The frameworks outlined here provide a pathway to align technological expansion with community agency. Embedding such safeguards early will be critical to ensuring that India’s AI ecosystem evolves in a way that is both inclusive and accountable.


Purushraj Patnaik is a Research Assistant with the Centre for Digital Societies at the Observer Research Foundation.


[1]The concept of Community-in-the-Loop governance embeds structured community participation into AI design, training, and audit, treating communities as data stewards with the locus standi to refuse or impose conditions for access, not as end-users to be consulted after decisions are made. Its three pillars cover community co-design, participatory data validation, and public algorithmic auditing.

The views expressed above belong to the author(s).
