Expert Speak Digital Frontiers
Published on Sep 01, 2023

As the biotechnology industry evolves, integrating synthetic data and model-based reasoning will remain crucial for addressing complex biological challenges and improving human health

Synthetic data in biotechnology research

In August 2023, the Indian government passed the Digital Personal Data Protection Act, 2023. While the Act was critiqued in Bill form for its many gaps, a primary concern is the continuing lack of healthcare and biological and biometric data protection.

The field of biotechnology has witnessed tremendous growth in recent years, driven by advances in data science, artificial intelligence (AI), and machine learning (ML). Two approaches have emerged as potential game-changers in the field: synthetic data and computational modelling. Both can help address the lack of data privacy and protection in these sectors and reduce the biases that afflict small or incomplete datasets.

Synthetic Data in Biotechnology Research 

Using real patient data in biotechnology research often raises privacy and security concerns. Synthetic data refers to artificially generated data that imitates the statistical characteristics of real-world data but does not contain any sensitive or personally identifiable information. In biotechnology, synthetic data holds significant potential because patient data and experimental results are sensitive and private. The Ministry of Electronics and Information Technology (MeitY) has also outlined the importance of synthetic data under its India AI initiative, highlighting many fields of technology application that would benefit, including biological data.

Synthetic data can mimic the structure and distributions present in real datasets while safeguarding patient privacy and avoiding the risks associated with individual identification. It is typically produced by building a generative model based on real data and then sampling from that model. Such models capture the relevant relationships and distributions either through rules hand-coded by experts or by inference from real data, for example using Bayesian networks,[1] also known as belief networks (BNs).
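The idea of inferring a Bayesian network from real data and then sampling from it can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the two variables (`condition` and `biomarker`), the eight toy records, and the fixed network structure in which `condition` is the parent of `biomarker`. A real generator would learn a far larger structure, and sampling from a fitted model is not by itself a privacy guarantee.

```python
import random
from collections import Counter, defaultdict

# Toy "real" records, invented for illustration: (condition, biomarker).
REAL_RECORDS = [
    ("diabetic", "high"), ("diabetic", "high"), ("diabetic", "normal"),
    ("healthy", "normal"), ("healthy", "normal"), ("healthy", "normal"),
    ("healthy", "high"), ("diabetic", "high"),
]


def fit_network(records):
    """Estimate P(condition) and P(biomarker | condition) by counting."""
    parent_counts = Counter(cond for cond, _ in records)
    child_counts = defaultdict(Counter)
    for cond, marker in records:
        child_counts[cond][marker] += 1
    total = len(records)
    p_parent = {c: n / total for c, n in parent_counts.items()}
    p_child = {
        c: {m: n / parent_counts[c] for m, n in counts.items()}
        for c, counts in child_counts.items()
    }
    return p_parent, p_child


def sample_synthetic(p_parent, p_child, n, seed=0):
    """Draw n synthetic rows: sample the parent first, then the child."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        cond = rng.choices(list(p_parent), weights=list(p_parent.values()))[0]
        dist = p_child[cond]
        marker = rng.choices(list(dist), weights=list(dist.values()))[0]
        rows.append((cond, marker))
    return rows


p_parent, p_child = fit_network(REAL_RECORDS)
synthetic = sample_synthetic(p_parent, p_child, n=1000)
```

The synthetic rows follow the estimated probabilities rather than copying any single record, which is the property the paragraph above describes. Any skew in the toy records would be reproduced in the output as well.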

Synthetic data thus offers a solution by enabling researchers to work with data that retains the essential statistical properties required for analysis without compromising individuals' privacy. Additionally, it enhances data availability by creating access to large, diverse, and well-curated datasets while prioritising privacy, which is crucial for robust ML models in biotechnology research and databases. Synthetic data can augment existing datasets, especially in cases where obtaining real data is challenging due to limited samples or data restrictions, such as medical data or any data that deals with sensitive information that may be unethical to procure or retain. Further, synthetic data can expedite model development and testing, as researchers can use this data to simulate various scenarios and validate the performance of AI-driven models before applying them to real-world datasets.

In one example, researchers from the University of Southern California used AI to generate synthetic brain-wave data, with applications in improving accessibility for people with disabilities.

Model-Based Reasoning in Biotechnology Research 

Model-based reasoning uses computational models and datasets to simulate and predict biological processes, interactions, and outcomes. These models can be based on mathematical equations, physics-based simulations, or even neural networks. Model-based reasoning can accelerate drug discovery by simulating how potential drug compounds interact with target molecules, predicting their efficacy, and optimising molecular structures for desired properties. It can also advance personalised medicine, which aims to create tailor-made treatments for individuals based on their genetic makeup. Both fields draw on systems biology, in which researchers model complex biological systems in order to study them in new, more detailed ways.
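As a small illustration of a model based on a mathematical equation, the sketch below uses the standard single-site equilibrium binding relation, in which the fraction of target molecules occupied equals L / (L + Kd), to rank candidate compounds. The compound names, the dissociation constants, and the ligand concentration are hypothetical values chosen for the example, not data from any study, and real drug discovery models are far more elaborate.

```python
def fractional_occupancy(ligand_conc_nm, kd_nm):
    """Fraction of target bound at equilibrium for single-site binding."""
    return ligand_conc_nm / (ligand_conc_nm + kd_nm)


# Hypothetical candidates with assumed dissociation constants (Kd) in nM.
# A lower Kd means tighter binding to the target.
CANDIDATES = {"compound_a": 5.0, "compound_b": 50.0, "compound_c": 500.0}


def rank_candidates(candidates, ligand_conc_nm):
    """Sort candidates from highest to lowest predicted occupancy."""
    scored = {
        name: fractional_occupancy(ligand_conc_nm, kd)
        for name, kd in candidates.items()
    }
    return sorted(scored.items(), key=lambda item: item[1], reverse=True)


ranking = rank_candidates(CANDIDATES, ligand_conc_nm=50.0)
for name, occupancy in ranking:
    print(f"{name}: {occupancy:.2f}")
```

Even a model this simple lets a researcher compare candidates in simulation before committing laboratory resources, which is the sense in which such reasoning can shorten early screening.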

Impact and Synergy of Synthetic Data and Model-Based Reasoning 

As mentioned before, combining synthetic data with real-world datasets provides a more comprehensive and diverse set of training samples and enhances the robustness and generalisation capabilities of ML models. The advantages include accelerated drug discovery, greater scope for personalised medicine, and reduced data and algorithmic biases.

Additionally, certain rare diseases or specific patient cohorts may have limited available data. Synthetic data generation can help address data scarcity issues, enabling more inclusive research and analysis.

In Finland, for example, synthetic data assisted in protecting the privacy of individuals afflicted by COVID-19 while permitting data sharing that augmented medical research.

Using synthetic data also has ethical advantages beyond privacy. Preclinical research and testing that draws on synthetic data and simulation can reduce the need for animal testing and human trials, leading to more ethical research practices and potential reductions in regulatory burdens.

Risk of Biases 

However, despite well-established techniques such as Bayesian networks for creating high-fidelity synthetic patient data, and despite the vast datasets available, biases can persist and be carried over into the data generators. Biases in data have proven to be a significant challenge in applying AI techniques, risking the replication and even amplification of human biases, particularly those affecting protected groups.

Applying synthetic data generators to biased data can produce synthetic data that lacks specific cohorts of patients, for instance because of cultural sensitivities or standardised data collection procedures within particular communities. These factors can also result in structurally missing data or incorrect correlations and distributions in the synthetic data generated from biased ground truth datasets. Medical datasets are often imbalanced, with specific patient groups under-represented.

Currently, three main approaches assist in de-biasing synthetic datasets: reweighing,[2] adversarial de-biasing,[3] and reject option classification.[4]
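Of these three approaches, reweighing is the simplest to sketch. In its usual formulation, each record receives the weight P(group) × P(outcome) / P(group, outcome), so that group membership and outcome become statistically independent in the weighted data. The records, group names, and outcome labels below are invented for illustration and are not drawn from any real dataset.

```python
from collections import Counter

# Toy records, invented for illustration: (group, outcome), where outcome 1
# is the favourable label. group_a receives it far more often than group_b.
RECORDS = [
    ("group_a", 1), ("group_a", 1), ("group_a", 1), ("group_a", 0),
    ("group_b", 1), ("group_b", 0), ("group_b", 0), ("group_b", 0),
]


def reweighing_weights(records):
    """Weight per (group, outcome): expected count divided by observed count."""
    n = len(records)
    group_counts = Counter(g for g, _ in records)
    label_counts = Counter(y for _, y in records)
    joint_counts = Counter(records)
    return {
        (g, y): (group_counts[g] * label_counts[y]) / (n * joint_counts[(g, y)])
        for (g, y) in joint_counts
    }


def weighted_positive_rate(records, weights, group):
    """Share of a group's total weight that carries the favourable outcome."""
    total = sum(weights[(g, y)] for g, y in records if g == group)
    positive = sum(weights[(g, y)] for g, y in records if g == group and y == 1)
    return positive / total


weights = reweighing_weights(RECORDS)
rate_a = weighted_positive_rate(RECORDS, weights, "group_a")
rate_b = weighted_positive_rate(RECORDS, weights, "group_b")
```

In the unweighted toy data, the favourable outcome occurs in three of four `group_a` records but only one of four `group_b` records. After reweighing, both groups have the same weighted rate, so a generator or model trained with these weights would no longer see the outcome as tied to group membership.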

While there are still risks to using synthetic data and model-based reasoning, these can be mitigated by the aforementioned de-biasing techniques. Synthetic data and model-based reasoning have thus emerged as powerful tools in the biotechnology landscape.

While India still grapples with its approach to data privacy, especially regarding biological data, synthetic data addresses privacy concerns while enhancing data availability and accelerating model development. Model-based reasoning enables simulation and prediction of biological processes, advancing drug discovery, personalised medicine, and systems biology research. When the two are combined, their impact is amplified, leading to more efficient biotechnological solutions, faster drug development, and ethical advancements in biomedical research. As the biotechnology industry evolves, integrating synthetic data and model-based reasoning will remain crucial for addressing complex biological challenges and improving human health.


Shravishtha Ajaykumar is an Associate Fellow with the Centre for Security, Strategy and Technology at the Observer Research Foundation


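[1] A Bayesian network can be used to generate synthetic data or to fill in missing data. It has two components: a graphical model of the dependencies among variables, and a set of conditional probability distributions that together define the joint probability distribution.

[2] Modifies the training data by computing and assigning weights to the records collected or generated, so that under-represented combinations of group and outcome count for more.

[3] Trains the model alongside an adversary that tries to infer the protected attribute from the model's predictions; the model is penalised whenever the adversary succeeds.

[4] Adjusts automated classifications after the fact: for predictions close to the decision boundary, the favourable outcome is assigned to the disadvantaged group and the unfavourable outcome to the advantaged group.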

The views expressed above belong to the author(s).
