AI is facing several critical challenges. Not only does it need huge amounts of data to deliver accurate results, but it also needs to be able to ensure that data isn’t biased, and it needs to comply with increasingly restrictive data privacy regulations. We have seen several solutions proposed over the last couple of years to address these challenges — including various tools designed to identify and reduce bias, tools that anonymize user data, and programs to ensure that data is only collected with user consent. But each of these solutions is facing challenges of its own.
Now we’re seeing a new industry emerge that promises to be a saving grace: synthetic data. Synthetic data is artificial, computer-generated data that can stand in for data obtained from the real world.
A synthetic dataset must have the same mathematical and statistical properties as the real-world dataset it is replacing but does not explicitly represent real individuals. Think of this as a digital mirror of real-world data that is statistically reflective of that world. This enables training AI systems in a completely virtual realm. And it can be readily customized for a variety of use cases ranging from healthcare to retail, finance, transportation, and agriculture.
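As a toy illustration of that "digital mirror" idea, the sketch below fits a simple distribution to a sensitive real-world column and then samples a brand-new dataset from the fitted model. The ages and distribution parameters are invented for illustration; real synthetic-data tools use far richer generative models, but the principle is the same: the synthetic rows match the statistics without corresponding to any real individual.

```python
import numpy as np

rng = np.random.default_rng(seed=42)

# Stand-in for a sensitive real-world column, e.g. customer ages
# (values are invented for this example).
real_ages = rng.normal(loc=38, scale=9, size=10_000)

# Fit simple distribution parameters to the real data ...
mu, sigma = real_ages.mean(), real_ages.std()

# ... and draw a fresh synthetic sample from the fitted model.
# No row in `synthetic_ages` corresponds to a real person, but the
# aggregate statistics mirror the original dataset.
synthetic_ages = rng.normal(loc=mu, scale=sigma, size=10_000)

print(round(real_ages.mean(), 1), round(synthetic_ages.mean(), 1))
```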
There’s significant movement happening on this front. More than 50 vendors have already developed synthetic data solutions, according to research last June by StartUs Insights. I will outline some of the leading players in a moment. First, though, let’s take a closer look at the problems they’re promising to solve.
The trouble with real data
Over the last few years, there has been increasing concern about how inherent biases in datasets can unwittingly lead to AI algorithms that perpetuate systemic discrimination. In fact, Gartner predicts that through 2022, 85% of AI projects will deliver erroneous outcomes due to bias in data, algorithms, or the teams responsible for managing them.
The proliferation of AI algorithms has also led to growing concerns over data privacy. In turn, this has led to stronger consumer data privacy and protection laws in the EU with GDPR, as well as in U.S. jurisdictions including California and, most recently, Virginia.
These laws give consumers more control over their personal data. For example, the Virginia law grants consumers the right to access, correct, delete, and obtain a copy of personal data as well as to opt out of the sale of personal data and to deny algorithmic access to personal data for the purposes of targeted advertising or profiling of the consumer.
By restricting access to this information, a certain amount of individual protection is gained but at the cost of the algorithm’s effectiveness. The more data an AI algorithm can train on, the more accurate and effective the results will be. Without access to ample data, the upsides of AI, such as assisting with medical diagnoses and drug research, could also be limited.
One alternative often used to offset privacy concerns is anonymization. Personal data, for example, can be anonymized by masking or eliminating identifying characteristics such as removing names and credit card numbers from ecommerce transactions or removing identifying content from healthcare records. But there is growing evidence that even if data has been anonymized from one source, it can be correlated with consumer datasets exposed from security breaches. In fact, by combining data from multiple sources, it is possible to form a surprisingly clear picture of our identities even if there has been a degree of anonymization. In some instances, this can even be done by correlating data from public sources, without a nefarious security hack.
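The re-identification risk described above can be sketched in a few lines. Suppose two "anonymized" datasets are released separately, each stripped of names or card numbers in its own right, but both retaining quasi-identifiers like ZIP code and birth year. All records below are invented for illustration:

```python
# Two independently released datasets. Direct identifiers (names,
# card numbers) have been removed from the purchase data.
purchases = [
    {"zip": "30301", "birth_year": 1984, "item": "insulin"},
    {"zip": "94110", "birth_year": 1990, "item": "novel"},
]

# A separate public dataset, e.g. a voter roll, that does include names.
voter_roll = [
    {"zip": "30301", "birth_year": 1984, "name": "J. Doe"},
    {"zip": "94110", "birth_year": 1971, "name": "A. Smith"},
]

def reidentify(purchases, roll):
    """Link records across datasets that share the same quasi-identifiers."""
    matches = []
    for p in purchases:
        for v in roll:
            if p["zip"] == v["zip"] and p["birth_year"] == v["birth_year"]:
                matches.append((v["name"], p["item"]))
    return matches

print(reidentify(purchases, voter_roll))  # → [('J. Doe', 'insulin')]
```

Neither dataset leaks anything on its own; joined on the shared quasi-identifiers, the "anonymized" purchase is tied back to a named individual.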
Synthetic data’s solution
Synthetic data promises to deliver the advantages of AI without the downsides. Not only does it take our real personal data out of the equation, but a general goal for synthetic data is to perform better than real-world data by correcting bias that is often ingrained in the real world.
Although ideal for applications that use personal data, synthetic information has other use cases, too. One example is complex computer vision modeling where many factors interact in real time. Synthetic video datasets leveraging advanced gaming engines can be created with hyper-realistic imagery to portray all the possible eventualities in an autonomous driving scenario, whereas trying to shoot photos or videos of the real world to capture all these events would be impractical, maybe impossible, and likely dangerous. These synthetic datasets can dramatically speed up and improve training of autonomous driving systems.
(Above image: Synthetic images are used to train autonomous vehicle algorithms. Source: synthetic data provider Parallel Domain.)
Perhaps ironically, one of the primary tools for building synthetic data is the same one used to create deepfake videos. Both make use of generative adversarial networks (GANs), a pair of neural networks. One network, the generator, produces the synthetic data, while the second, the discriminator, tries to detect whether each sample is real. The two are trained in a loop, with the generator improving the quality of the data until the discriminator can no longer tell the difference between real and synthetic.
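The adversarial loop can be sketched at toy scale. The example below, a deliberately minimal one-dimensional "GAN" with hand-derived gradients (real GANs use deep networks and autodiff frameworks), pits a linear generator against a logistic-regression discriminator. The "real" data here is just samples from a normal distribution with mean 4; every parameter choice is an assumption for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# "Real" data: scalars from N(4, 1).  Generator: G(z) = a*z + b.
# Discriminator: D(x) = sigmoid(w*x + c).  All toy choices.
a, b = 1.0, 0.0          # generator starts by producing N(0, 1)
w, c = 0.1, 0.0          # discriminator parameters
lr, batch = 0.05, 64

for step in range(3000):
    real = rng.normal(4.0, 1.0, batch)
    z = rng.normal(0.0, 1.0, batch)
    fake = a * z + b

    # Discriminator update: logistic regression, real=1, fake=0.
    x = np.concatenate([real, fake])
    y = np.concatenate([np.ones(batch), np.zeros(batch)])
    p = sigmoid(w * x + c)
    w -= lr * np.mean((p - y) * x)
    c -= lr * np.mean(p - y)

    # Generator update: non-saturating loss, maximize log D(G(z)).
    p_fake = sigmoid(w * fake + c)
    grad_out = -(1.0 - p_fake) * w   # dL/dG(z) for L = -log D(G(z))
    a -= lr * np.mean(grad_out * z)
    b -= lr * np.mean(grad_out)

# After training, the generator's mean (b) should have drifted toward
# the real data's mean of 4, fooling the discriminator.
print(round(b, 1))
```

The same tug-of-war, scaled up to deep networks and images, is what lets synthetic-data vendors produce photo-realistic scenes.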
The emerging ecosystem
Forrester Research recently identified several critical technologies, including synthetic data, that will comprise what they deem “AI 2.0,” advances that radically expand AI possibilities. By more completely anonymizing data and correcting for inherent biases, as well as creating data that would otherwise be difficult to obtain, synthetic data could become the saving grace for many big data applications.
Synthetic data also comes with some other big benefits: You can create datasets quickly and often with the data labeled for supervised learning. And it does not need to be cleaned and maintained the way real data does. So, theoretically at least, it comes with some large time and cost savings.
Several well-established companies are among those that generate synthetic data. IBM describes this as data fabrication, creating synthetic test data to eliminate the risk of confidential information leakage and address GDPR and regulatory issues. AWS has developed in-house synthetic data tools to generate datasets for training Alexa on new languages. And Microsoft has developed a tool in collaboration with Harvard with a synthetic data capability that allows for increased collaboration between research parties. Notwithstanding these examples, it is still early days for synthetic data, and the developing market is being led by startups.
To wrap up, let’s take a look at some of the early leaders in this emerging industry. The list is constructed based on my own research and industry research organizations including G2 and StartUs Insights.
- AiFi — Uses synthetically generated data to simulate retail stores and shopper behavior.
- AI.Reverie — Generates synthetic data to train computer vision algorithms for activity recognition, object detection, and segmentation. Work has included wide-scope scenes like smart cities, rare plane identification, and agriculture, along with smart-store retail.
- Anyverse — Simulates scenarios to create synthetic datasets using raw sensor data, image processing functions, and custom LiDAR settings for the automotive industry.
- Cvedia — Creates synthetic images that reduce the need to source large volumes of labeled real-world visual data. The simulation platform employs multiple sensors to synthesize photo-realistic environments, resulting in empirical dataset creation.
- DataGen — Interior-environment use cases, like smart stores, in-home robotics, and augmented reality.
- Diveplane — Creates synthetic ‘twin’ datasets for the healthcare industry with the same statistical properties as the original data.
- Gretel — Aiming to be the GitHub equivalent for data, the company produces synthetic datasets for developers that retain the same insights as the original data source.
- Hazy — Generates datasets to boost fraud and money laundering detection to combat financial crime.
- Mostly AI — Focuses on insurance and finance sectors and was one of the first companies to create synthetic structured data.
- OneView — Develops virtual synthetic datasets for analysis of earth observation imagery by machine learning algorithms.
Gary Grossman is the Senior VP of Technology Practice at Edelman and Global Lead of the Edelman AI Center of Excellence.