How synthetic data is boosting AI at scale

Artificial intelligence (AI) relies heavily on large, diverse and meticulously labeled datasets to train machine learning (ML) algorithms. In the modern era, data has become the lifeblood of AI, and obtaining the right data is considered the most critical and challenging aspect of developing robust AI systems.

However, collecting and labeling vast datasets with millions of elements sourced from the real world is time-consuming and expensive. As a result, those training ML models have started to rely heavily on synthetic data, or data that is artificially generated rather than produced by real-world events.

Synthetic data has soared in popularity in recent years, presenting a viable solution to the data-quality problem and offering the potential to reshape large-scale ML deployments. According to a Gartner study, synthetic data is expected to account for 60% of all data used in the development of AI by 2024.

Turbocharging AI/ML with synthetic data

The concept is elegantly simple. Synthetic data allows practitioners to generate the data they need digitally, on demand and in any desired volume, tailored to their precise specifications. Researchers can now even turn to synthetic datasets created from 3D models of scenes, objects and humans to produce action clips quickly, without encountering the copyright issues or ethical concerns associated with real data.

“Using synthetic data for machine learning training allows companies to build models for scenarios that were previously out of reach due to the needed data being private, too low-quality or simply not existing at all,” Forrester analyst Rowan Curran told VentureBeat. “Creating synthetic datasets uses techniques like generative adversarial networks (GANs) to take a dataset of a few thousand individuals and transform it into a dataset that performs the same when training the ML model — but doesn’t have any of the personally identifiable information (PII) of the original dataset.”
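
As a rough illustration of the workflow Curran describes, the sketch below swaps in a far simpler generator than a GAN: it fits a normal distribution and a linear trend to two numeric columns, then samples new rows that were never in the original table. The records, the column names (name, age, income) and the helpers fit_generator and sample_synthetic are all hypothetical. Note that dropping the name column does not by itself guarantee privacy.

```python
import random
import statistics

def fit_generator(rows):
    """Fit a toy model to (age, income) pairs: a normal distribution over age,
    plus a linear trend with normally distributed residuals for income."""
    ages = [a for a, _ in rows]
    incomes = [i for _, i in rows]
    age_mu, age_sd = statistics.mean(ages), statistics.stdev(ages)
    inc_mu = statistics.mean(incomes)
    cov = sum((a - age_mu) * (i - inc_mu) for a, i in rows) / (len(rows) - 1)
    slope = cov / age_sd ** 2
    intercept = inc_mu - slope * age_mu
    residuals = [i - (intercept + slope * a) for a, i in rows]
    return {"age_mu": age_mu, "age_sd": age_sd, "slope": slope,
            "intercept": intercept, "resid_sd": statistics.stdev(residuals)}

def sample_synthetic(model, n, rng):
    """Draw n brand-new (age, income) rows from the fitted model."""
    rows = []
    for _ in range(n):
        age = rng.gauss(model["age_mu"], model["age_sd"])
        noise = rng.gauss(0.0, model["resid_sd"])
        rows.append((age, model["intercept"] + model["slope"] * age + noise))
    return rows

rng = random.Random(0)

# Hypothetical "real" records; the name field stands in for PII.
real = []
for k in range(2000):
    age = rng.uniform(20, 70)
    real.append({"name": f"person-{k}", "age": age,
                 "income": 20000 + 800 * age + rng.gauss(0, 5000)})

# Only the non-identifying columns are passed to the generator.
model = fit_generator([(r["age"], r["income"]) for r in real])
synthetic = sample_synthetic(model, 5000, rng)
```

The synthetic table can be larger than the source table and keeps the age-income relationship a downstream model would learn from, which is the property the quote points to. A GAN plays the same role for data too complex to describe with a handful of fitted parameters.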

Proponents point to a variety of benefits to choosing synthetic datasets. For one thing, using synthetic data can significantly reduce the cost of generating training data. It can also address privacy concerns related to potentially sensitive data obtained from the real world.

Synthetic data can also help mitigate bias, because real data may not accurately represent the full range of information about the real world. Synthetic datasets can account for greater diversity by incorporating rare cases that represent realistic possibilities but are difficult to obtain from genuine data.

Curran explained that synthetic datasets also supply training data in situations where the needed data does not exist because the scenario in which it would be collected occurs too infrequently.

“A healthcare provider wanted to do a better job catching early-stage lung cancer, but little imagery data was available. So to build their model, they created a synthetic dataset that used healthy lung imagery combined with early-stage tumors to build a new training dataset that would function as if it were the same data collected from the real world,” said Curran.
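
The compositing idea in that example can be sketched in a few lines. This is a toy under stated assumptions, not the provider's pipeline: the "scan" is a small grid of brightness values, the "tumor" is a Gaussian-shaped bright spot, and make_synthetic_scan with its radius and intensity parameters is a hypothetical helper. Real medical image synthesis is considerably more involved.

```python
import math
import random

def make_synthetic_scan(healthy, rng, radius=3.0, intensity=0.6):
    """Blend a Gaussian-shaped bright spot into a copy of a healthy image.

    Returns (image, mask), where mask marks the pixels within `radius`
    of the spot's center and serves as a ready-made label.
    """
    h, w = len(healthy), len(healthy[0])
    cy, cx = rng.randrange(h), rng.randrange(w)
    image = [row[:] for row in healthy]       # leave the original untouched
    mask = [[0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            d2 = (y - cy) ** 2 + (x - cx) ** 2
            bump = intensity * math.exp(-d2 / (2 * radius ** 2))
            image[y][x] = min(1.0, image[y][x] + bump)
            if d2 <= radius ** 2:
                mask[y][x] = 1
    return image, mask

rng = random.Random(7)

# Hypothetical healthy scan: 32x32 pixels of low, slightly noisy brightness.
healthy = [[rng.uniform(0.2, 0.4) for _ in range(32)] for _ in range(32)]

# Each call yields a new positive example plus its pixel-level label.
dataset = [make_synthetic_scan(healthy, rng) for _ in range(100)]
```

Because the generator places the spot itself, every synthetic image arrives with an exact segmentation mask, with no human labeling step.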

He said synthetic data is also finding traction in other secure industries, such as financial services. These companies have significant restrictions on how they can use and move their data, particularly to the cloud.

Synthetic data has the potential to enhance software development, accelerate research and development, facilitate the training of ML models, enable organizations to gain a deeper understanding of their internal data and products, and improve business processes. These benefits, in turn, can promote the growth of AI on a large scale.

How does it function in the real world of AI?

But the question remains: Can artificially generated data be as effective as real data? How well does a model trained with synthetic data perform when classifying real actions?

Yashar Behzadi, CEO and founder of synthetic data platform Synthesis AI, says that companies often use synthetic and real-world data together to train their models and ensure they are optimized for the best performance.

“Synthetic data is often used to augment and extend real-world data, ensuring more robust and performant models,” he told VentureBeat. For example, he said Synthesis AI is working with a handful of tier 1 auto manufacturers and software companies.

“We keep hearing that the available training data is either too low-res or there isn’t enough of it — and they don’t have their customers’ consent to train computer vision models with it either way,” he said. “Synthetic data solves all three challenges — quality, quantity and privacy.”

Companies also turn to synthetic data when they cannot obtain certain annotations from human labelers, such as depth maps, surface normals, 3D landmarks, detailed segmentation maps and material properties, he explained.

“Bias in AI models is well documented, and related to incomplete training data that lack the necessary diversity related to ethnicity, skin tone or other demographics,” he said. “As a result, AI bias disproportionately impacts underrepresented demographics and leads to less inclusive applications and products.” Using synthetic data, he continued, companies can explicitly define the training dataset to minimize bias and ensure more inclusive, human-centered models without breaching consumer privacy.

Replacing even a small portion of real-world training data with synthetic data makes it possible to accelerate and streamline the training and deployment of AI models of all scales.

At IBM, for instance, researchers have used the ThreeDWorld simulator and its corresponding Task2Sim platform to generate simulated images of realistic scenes and objects, which can be used to pretrain image classifiers. These synthetic images reduce the amount of genuine training data required, and they have been found to be equally effective in pretraining models for tasks such as detecting cancer in medical scans.

In addition, supplementing authentic data with artificially generated data can reduce the risk that a model pretrained on raw data scraped from the internet exhibits racist or sexist tendencies. Custom-made artificial data is pre-vetted to minimize the presence of biases, reducing the risk of such unwanted behaviors in models.

“Doing as much as we can with synthetic data before we start using real-world data has the potential to clean up that Wild West mode we’re in,” said David Cox, codirector of the MIT-IBM Watson AI Lab and head of exploratory AI research.

Image source: Forrester

Synthetic data and model quality

Alp Kucukelbir, cofounder and chief scientist of factory optimization platform Fero Labs and an adjunct professor at Columbia University, said that although synthetic data can complement real-world data for training AI models, it comes with a big caveat: You need to know what gap you're plugging in your real-world dataset.

“Say you are using AI to decarbonize a steel mill. You want to use AI to unravel and expose the specific operation of that mill (e.g., precisely how machines at a specific factory work together) and not to rediscover the basic metallurgy you can find in a textbook. In this case, to use synthetic data, you would have to simulate the precise operation of a steel mill beyond our knowledge of textbook metallurgy,” explained Kucukelbir. “If you had such a simulator, you wouldn’t need AI to begin with.”

Machine learning is good at interpolating, but could stand improvement at extrapolating from training datasets. However, artificially generated data allows researchers and practitioners to provide “corner-case” data to an algorithm, and could eventually accelerate R&D efforts, added Julian Sanchez, director of emerging technologies at John Deere.

“We have tried synthetic data in an experimental fashion at John Deere, and it shows some promise. The general set of examples involve agriculture, where you are likely to have a very low occurrence rate of specific corner cases,” Sanchez told VentureBeat. “Synthetic data provides AI/ML algorithms with the required reference points through data and gives researchers a chance to understand how the trained [model] could handle the different use cases. It will be an important aspect of how AI/ML scales.”

Sebastian Thrun, ex-Google VP and current chairman and cofounder of online learning platform Udacity, likewise says that synthetic data is usually unrealistic along some dimensions. Simulations built on synthetic data are a quick and safe way to accelerate learning, but they typically have known shortcomings.

“This is specifically the case for data in perception (camera images, speech, etc.). But the right strategy is usually to combine real-world data with synthetic data,” Thrun told VentureBeat. “During my time at Google’s self-driving car project Waymo, we used a combination of both. Synthetic data will play a big role in situations we never want to experience in the real world.”

Challenges of using synthetic data for AI

Michael Rinehart, VP of AI at multicloud data security platform Securiti AI, says that there’s a tradeoff between synthetic data's usefulness and the privacy it affords.

“Finding the appropriate tradeoff is a challenge because it is company-dependent, much like any risk-reward assessment,” said Rinehart. “This challenge is further compounded by the fact that quantitative estimates of privacy are imperfect, and more privacy may actually be afforded by the synthetic dataset than the estimate suggests.”

He explained that, consequently, looser controls or processes might be applied to this kind of data. For instance, companies may skip known synthetic data files during sensitive data scans, losing visibility into their proliferation. Data science teams may even train large models on those files, models capable of memorizing and regenerating the synthetic data, and then disseminate those models.

“If synthetic data or any of its derivatives are meant to be shared or exposed, companies should ensure it protects the privacy of any customers it represents by, for example, leveraging differential privacy with it,” advised Rinehart. “High-quality differentially-private synthetic data ensures that teams can run experiments with realistic data that does not expose sensitive information.”
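
One simple way to produce differentially private synthetic data, offered here as a minimal sketch rather than as the approach Rinehart or Securiti AI uses, is to add Laplace noise to a histogram of the real records and sample new records from the noisy counts. The plan categories, the epsilon value and the helper names below are hypothetical. The sketch assumes each customer contributes exactly one record and that the category list is fixed in advance rather than read from the data.

```python
import random
from collections import Counter

def laplace_noise(scale, rng):
    """Laplace(0, scale) noise, drawn as the difference of two exponentials."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)

def dp_synthetic(records, categories, epsilon, n_out, rng):
    """Sample n_out synthetic records from a Laplace-noised histogram.

    Adding or removing one person changes one count by 1, so noise with
    scale 1/epsilon on every count gives epsilon-differential privacy.
    Clamping and sampling afterward are post-processing and keep that guarantee.
    """
    counts = Counter(records)
    noisy = [max(0.0, counts[c] + laplace_noise(1.0 / epsilon, rng))
             for c in categories]
    if sum(noisy) == 0:                       # degenerate case: fall back to uniform
        noisy = [1.0] * len(categories)
    return rng.choices(categories, weights=noisy, k=n_out)

rng = random.Random(42)
categories = ["basic", "plus", "premium"]     # fixed ahead of time, not data-derived
real_plans = ["basic"] * 700 + ["plus"] * 250 + ["premium"] * 50
synthetic_plans = dp_synthetic(real_plans, categories, epsilon=1.0,
                               n_out=1000, rng=rng)
```

A smaller epsilon means more noise and stronger privacy but a less faithful dataset, which is the usefulness-versus-privacy tradeoff described above.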

Fernando Lucini, global lead for data science and machine learning engineering at Accenture, adds that generating synthetic data is a highly complex process, requiring people with specialized skills and truly advanced knowledge of AI.

“A company needs very specific and sophisticated frameworks and metrics to validate that it created what it intended,” he explained.
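
A basic example of such a metric, given as an illustrative sketch and not as the framework Lucini has in mind, is the two-sample Kolmogorov-Smirnov statistic: the largest gap between the empirical distributions of a real column and its synthetic counterpart. The sample data and the ks_statistic helper are hypothetical.

```python
import bisect
import random

def ks_statistic(a, b):
    """Two-sample Kolmogorov-Smirnov statistic.

    Returns the largest absolute gap between the two empirical CDFs:
    0.0 means identical distributions, 1.0 means no overlap at all.
    """
    a, b = sorted(a), sorted(b)
    gap = 0.0
    for x in a + b:
        cdf_a = bisect.bisect_right(a, x) / len(a)
        cdf_b = bisect.bisect_right(b, x) / len(b)
        gap = max(gap, abs(cdf_a - cdf_b))
    return gap

rng = random.Random(3)
real = [rng.gauss(50, 10) for _ in range(1000)]
good_synthetic = [rng.gauss(50, 10) for _ in range(1000)]   # matches the real column
bad_synthetic = [rng.gauss(60, 10) for _ in range(1000)]    # shifted by one std dev

good_gap = ks_statistic(real, good_synthetic)
bad_gap = ks_statistic(real, bad_synthetic)
```

A per-column check like this catches only marginal mismatches. A fuller validation would also compare relationships between columns and the accuracy of models trained on each dataset.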

What’s next for synthetic data in AI?

Lucini believes synthetic data is a boon for researchers and will soon become a standard tool in every organization’s tech stack for scaling their AI/ML models’ prowess.

“Utilizing synthetic data provides not only an opportunity to work on more interesting problems for researchers and accelerate solutions, but also has the potential to develop far more innovative algorithms that may unlock new use cases we hadn’t previously thought possible,” Lucini added. “I expect synthetic data to become a part of every machine learning, AI and data science workflow and thereby of any company’s data solution.”

For his part, Synthesis AI’s Behzadi says the generative AI boom has been, and will continue to be, a huge catalyst for synthetic data.

“There has been explosive growth in just the past few months, and pairing generative AI with synthetic data will only further adoption,” he said.

When generative AI is coupled with visual effects pipelines, the diversity and quality of synthetic data will drastically improve, he said. “This will further drive the rapid adoption of synthetic data across industries. In the coming years, every computer vision team will leverage synthetic data.”