Expert Speak Digital Frontiers
Published on Jul 21, 2021
Deploying open government data for AI-enabled public interest technologies

As exhibited at the Government of India’s inaugural Artificial Intelligence (AI) conference, RAISE 2020, there is a growing recognition that AI is poised to significantly accelerate progress towards improved developmental and socioeconomic outcomes. In fact, studies published by Nature and the McKinsey Global Institute demonstrate that AI could enable improvements in 100+ targets across the United Nations’ 17 Sustainable Development Goals (SDGs). This progress is generally expected to come from AI-driven technological interventions.

However, the potential of AI-driven ‘public interest’ technologies in fields like agriculture, health, and education depends largely on the availability of suitable datasets. If more governmental data in areas of social consequence were available to the public in a high-quality, machine-readable format, it would greatly enhance the prospects for the domestic development of public interest AI. Refined datasets are required to train autonomous algorithms (what we refer to as AI) that in turn power enhanced analytics, perception, decision-making, and prediction technologies. The more extensive and higher-quality the data fed to a model, the more accurate, effective, and impactful an AI-enabled technology can become.

Presently, the majority of well-known AI-enabled technologies rely on proprietary datasets owned by large technology corporations and their subsidiaries. These companies can leverage their enormous troves of user data or pay large sums to acquire datasets in order to fuel internal research and development, which gives them a significant competitive advantage. In contrast, cash-strapped startups often struggle to collect, procure, and refine a sufficient amount of data. As a result, the lack of adequate data is one of the biggest barriers experienced by Indian AI startups, as well as by public agencies and civic organisations attempting to harness AI for social good.

The scope and value of open (government) data

Each year, the Union and State governments combined spend hundreds of billions of dollars to provide services like healthcare, food, insurance, education, skill training, housing, and fuel to more than 500 million Indian citizens. Moreover, they play a pivotal role in shaping the nation’s conditions for energy, infrastructure, transportation, and environmental stewardship. In the course of these operations, they organically collect tremendous amounts of data. At this time, the government holds more data in sectors like agriculture, rural development, education, migration, and energy than any other stakeholder in India by a wide margin.

In many countries with highly developed technology ecosystems, startups turn to government data repositories, also referred to as Open Data platforms, to source datasets in the aforementioned sectors. In India, however, efforts to provide the public with access to government data remain significantly constrained. Presently, the Union government’s primary Open Data initiative, the Open Government Data Platform, hosts hundreds of thousands of datasets. Yet it is compromised by issues of quality, inconsistent schemas, a lack of metadata standardisation, and a shortage of high-value data.

Given that data preparation and engineering tasks already comprise approximately 80 percent of time spent on AI projects, these circumstances make it challenging for AI startups to leverage the Open Government Data Platform’s datasets. In some cases, Indian technologists opt to use data from nations with mature Open Data programmes. However, given the significant variation in demographic, socioeconomic, epidemiological, and climatic circumstances, data collected from these geographies has limited use in informing AI models deployed in India’s diverse locales.

While the Open Government Data Platform boasts a significant repository, the Union government’s National Strategy for AI (NSAI) asserts that the “Government of India has large amounts of data lying in silos across ministries.” If the vast majority of India’s Union and State government datasets in areas of social consequence were available to the public in a high-quality, machine-readable format, it would greatly enhance the prospect for domestic development of public interest AI. This strongly aligns with the government’s socially-oriented AI strategy. Captured by the mantra of ‘AI for All,’ India’s NSAI commits to leveraging the technology to drive inclusive growth and progress in fields like agriculture, health, and education.

As the Union government prepares to launch the forthcoming National Programme for AI (also known as the AI Mission), a nodal initiative informed by the NSAI to foster the nation’s AI ecosystem, it is critical that government, industry, and civil society address the fundamental bottlenecks and inefficiencies impeding the nation’s Open Data efforts. Furthermore, they should launch an initiative dedicated to fostering Open Data in a way that maximises its potential use by Indian startups, corporations, and other beneficiaries.

Building institutional capacity and improving the budget allocation for Open Data

The technical nature of the Open Data mandate entails the collection, processing, and publication of data at each government institution. These functions demand highly skilled human resources, ample capacity, and sophisticated Standard Operating Procedures. However, Open Data is presently assigned as an additional responsibility to government personnel, who are already constrained by, and sometimes overworked with, their primary responsibilities.

Given the limited bandwidth of existing staff, the government should consider augmenting institutional capacity by providing a budget to onboard dedicated data management and Open Data personnel in each government agency. Just as IT systems require regular maintenance from specialised teams with sufficient manpower and time, institutional data requires personnel with the time and skills to effectively collect, refine, and curate datasets. The insertion of dedicated data cadres into government institutions would almost certainly expand the Indian bureaucracy’s ability to store data internally and release relevant datasets to the public.

Simultaneously, the government should invest in educational and training programmes aimed at enhancing the general capacity of existing Open Data personnel. Presently, training is largely limited to guidance for contributing datasets to and operating the Open Government Data Platform. Programmes should be expanded to include technical skills training as well as support for the creation or refining of institution-specific Standard Operating Procedures relevant to the collection, aggregation, and engineering of datasets. This effort would facilitate the publication of an increased number of high-quality and high-value datasets, as well as heighten awareness of emerging sector-specific privacy, security, and confidentiality requirements.

Codifying the open government data mandate into law

Under the 2012 National Data Sharing and Accessibility Policy (NDSAP), where India’s Open Data mandate originated, each government agency’s Chief Data Officer receives broad discretion in curating the datasets their institution will contribute to the Open Government Data Platform. However, this has led to non-transparent, inconsistent, and untargeted data sharing practices across institutions. Furthermore, NDSAP neglected to create any meaningful incentive or accountability mechanisms for Open Data personnel, leading to concerning gaps in motivation and performance.

To overcome these impediments, the Government of India should consider enacting comprehensive Open Data legislation. An Open Data Act should incorporate a robust framework of checks and balances for Open Data personnel, delineate extensive criteria for the identification and selection of datasets, and expand India’s Open Data mandate to State governments (presently, it is only applicable to Union government entities).

Consolidating ecosystem-wide Open Data initiatives by enabling interoperability and promoting collaboration

Since the Union government launched the Open Government Data Platform in 2012, a range of governmental, academic, and civil society organisations have launched separate Open Data platforms. Some notable examples include the India Urban Data Exchange (Ministry of Housing and Urban Affairs’ Smart Cities Mission), Pune DataStore (Municipality of Pune), The India Observatory (Foundation for Ecological Security), Open Budgets India (Centre for Budget and Governance Accountability), The India Data Portal (Indian School of Business), and the forthcoming National Data Analytics Platform (NITI Aayog), amongst others.

While these efforts are a testament to the combined will and capacity of the broader Indian Open Data community, they also present critical gaps and inefficiencies. At present, the aforementioned platforms are not interoperable, meaning that they act as siloed data repositories and do not share information with each other. Even if a technologist is familiar with the Government of India’s flagship platform, they are likely to be unaware of many other existing Open Data initiatives in India. This means that the vast stores of data housed in other repositories remain largely untapped by innovators in the AI community.

In the interest of amplifying the use of datasets hosted across various platforms, each institution hosting an Open Data platform should strive to retroactively incorporate broad data interoperability. This would allow datasets hosted on each platform to be automatically pushed to, and accessible via, the others, allowing technologists and other beneficiaries to make better use of the collective pool of Open Data. The Union government could lead such an effort by creating and open-sourcing software communication infrastructure that connects different Indian Open Data platforms. A precedent for such an effort can be drawn from the India Urban Data Exchange, which intends to facilitate the transfer of urban-focused data between various government-operated repositories. This model could be expanded to include non-governmental data platforms and a larger range of sectors.
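
To make the idea of interoperability concrete, the following is a minimal illustrative sketch in Python, not a description of how any existing Indian platform works. It assumes two hypothetical platforms (`platform_a` and `platform_b`) that describe their datasets with different metadata field names, and shows how a shared metadata schema would let their catalogues be merged into a single searchable index. Every name, field, and sample entry below is invented for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DatasetRecord:
    """Common metadata record shared by all platforms (hypothetical schema)."""
    platform: str
    identifier: str
    title: str
    sector: str
    file_format: str


# Hypothetical mapping from each platform's own field names to the common schema.
FIELD_MAPS = {
    "platform_a": {"identifier": "id", "title": "name",
                   "sector": "category", "file_format": "format"},
    "platform_b": {"identifier": "dataset_id", "title": "title",
                   "sector": "theme", "file_format": "file_type"},
}


def normalise(platform: str, raw: dict) -> DatasetRecord:
    """Translate one platform-specific catalogue entry into the common record."""
    mapping = FIELD_MAPS[platform]
    return DatasetRecord(
        platform=platform,
        identifier=str(raw[mapping["identifier"]]),
        title=raw[mapping["title"]].strip(),
        sector=raw[mapping["sector"]].strip().lower(),
        file_format=raw[mapping["file_format"]].strip().upper(),
    )


def build_federated_index(catalogues: dict) -> dict:
    """Merge every platform's catalogue into one index grouped by sector."""
    index = {}
    for platform, entries in catalogues.items():
        for raw in entries:
            record = normalise(platform, raw)
            index.setdefault(record.sector, []).append(record)
    return index


# Invented sample catalogue entries, for illustration only.
catalogues = {
    "platform_a": [
        {"id": 101, "name": "District crop yields ",
         "category": "Agriculture", "format": "csv"},
    ],
    "platform_b": [
        {"dataset_id": "ag-7", "title": "Soil moisture readings",
         "theme": "agriculture", "file_type": "json"},
        {"dataset_id": "ed-2", "title": "School enrolment",
         "theme": "Education", "file_type": "CSV"},
    ],
}

index = build_federated_index(catalogues)
for sector in sorted(index):
    print(sector, [record.title for record in index[sector]])
```

In this sketch, a technologist searching the merged index for agriculture data would find datasets from both platforms without needing to know that the second platform exists. A real effort would also need agreed identifiers, licensing metadata, and a transport mechanism between platforms, none of which is modelled here.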

In addition, as is demonstrated by a range of public-private co-creation initiatives, there is vast potential to leverage public-private-civic partnerships to improve the status of key governmental Open Data programmes. For example, CivicDataLab has been at the forefront of upskilling state-level officials to open up government data in sectors like finance and budgeting.

Streamlining focus on Open Data for AI

Alongside the nation’s forthcoming National Programme for AI, the government should constitute a committee drawn from industry, academia, and civil society dedicated to creating the conditions necessary to leverage the nation’s Open Data to fuel AI innovation. In the United States, generally considered the world’s leading AI ecosystem, the federal government likewise recently launched a task force to make more government data available to AI innovators and researchers. Echoing similar sentiments, the European Commission (the EU’s executive body) has identified Open Data as a “critical asset for the development of new technologies, such as artificial intelligence (AI), which require the processing of vast amounts of high-quality data.”

These sentiments underpin GAIA-X, a European initiative to develop infrastructure for a data ecosystem that promotes innovation while also meeting privacy, transparency, security, and rights standards.

In India, this effort should begin with a demand-side assessment of the Open Data ecosystem via consultations with AI innovators at startups, domestic corporates, and multinationals. After evaluating where stakeholders in AI perceive the specific gaps and limitations in Open Data, the government can develop a roadmap for prioritised improvements. Moreover, this information could help officials in the National Informatics Centre and/or the Department of Science and Technology to define informed criteria for the identification and publication of high-value Open Data for use in AI. These criteria could then be translated into operational guidelines, which could be instituted in the Open Data practices of government personnel through incorporation in Standard Operating Procedures.

In the wake of the devastation wreaked by the COVID-19 pandemic, a unified campaign to leverage Open Data for AI, coupled with the aforementioned fundamental reforms required to bolster Open Data, could precipitate meaningful social and developmental impacts in India by unlocking the potential for disruptive technological innovation in sectors like health, agriculture, and education. Meanwhile, it is important that Personal Data Protection legislation be enacted by the Indian Parliament as soon as possible to institute measures that safeguard the privacy, security, and rights of individuals, in particular with respect to the release of (anonymised) sensitive data.

The views expressed above belong to the author(s).

Contributor

Sam Neufeld

Sam Neufeld is a project manager at the Open Data Working Group co-hosted by the International Innovation Corp.
