Why synthetic data may be better than the real thing

To deploy successful AI, organizations need data to train models.

But high-quality data isn’t always easy to access, which creates a major hurdle for organizations launching AI initiatives.

This is where synthetic data can be so useful.

As opposed to data that is collected from and measured in the real world, synthetic data is generated in the digital world by computer simulations, algorithms, simple rules, statistical modeling, and other techniques. It is an alternative to real-world data, but it reflects that data mathematically and statistically.
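As a rough illustration of the statistical-modeling route, the sketch below fits a simple normal distribution to a small real sample and then draws synthetic values from it. The sample values, function names, and the choice of a Gaussian are assumptions made for this example only; production generators typically use far richer models.

```python
import random
import statistics

def fit_gaussian(real_values):
    """Estimate the mean and standard deviation of a real-world sample."""
    return statistics.mean(real_values), statistics.stdev(real_values)

def sample_synthetic(mean, stdev, n, seed=0):
    """Draw n synthetic values from a normal distribution with the fitted parameters."""
    rng = random.Random(seed)  # fixed seed so the run is reproducible
    return [rng.gauss(mean, stdev) for _ in range(n)]

# Hypothetical "real" transaction amounts, invented for this example.
real_amounts = [42.10, 18.75, 63.40, 27.99, 51.20, 35.60]

mean, stdev = fit_gaussian(real_amounts)
synthetic_amounts = sample_synthetic(mean, stdev, n=1000)

print(f"real mean={mean:.2f}, synthetic mean={statistics.mean(synthetic_amounts):.2f}")
```

None of the synthetic values belongs to a real record, yet the set as a whole tracks the statistics of the original sample, which is the sense in which synthetic data "reflects" real data.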

Some experts even contend that synthetic data is better than data collected from real-world people, places, and things when it comes to training AI models. Constraints in using sensitive and regulated data are removed or reduced; datasets can be tailored to certain conditions that might otherwise be unobtainable; insights can be gained much more quickly; and training is less cumbersome and much more effective.

To that point, Gartner projects synthetic data to completely overshadow real data in AI models by 2030.

“The fact is you won’t be able to build high-quality, high-value AI models without synthetic data,” according to the Gartner report.

Leaders in synthetic data

To support accelerating demand, a growing number of companies are offering synthetic models – top and emerging companies in the space include Mostly AI, AI.Reverie, Sky Engine, and Datagen. Leading data engineering company Innodata has also entered the market, today launching an e-commerce portal where customers can purchase on-demand synthetic datasets and immediately train models.

“The kind of datasets we’re going after reflect real-world problems that CIOs and customers have come back to us with,” said CPO Rahul Singhal. “We began looking at: How do we create large amounts of training data that machines need?”

The Innodata AI Data Marketplace has been developed by in-house experts specifically for building and training AI/ML models. The data packs are off-the-shelf, easily previewable, unbiased, diverse, thorough, and secure, according to Singhal. Innodata is initially releasing 17 data packs in four languages that home in on financial services. These packs are textual, meaning they include invoices, purchase orders, and banking and credit card statements.

“One of the big needs in AI is diversity of data,” said Singhal. “We need lots of diverse ways that invoice can be created, we need visibility. It seems very easy, but it’s actually really complicated.”

The marketplace complements Innodata’s open-source repository of more than 4,000 datasets. These help in the prototyping of supervised and unsupervised ML projects.

The new synthetic datasets take that to the next level based on real-world information. “Machines learn by seeing real-world examples,” Singhal said.

For instance, he pointed to the many ways in which a credit card statement could be structured – one could have names listed on the right side; another on the left; one could use a table format; another a column format. To be accurate, machines have to be provided with those variations, and in both quality and quantity. Innodata models have been provided with hundreds of templates to allow for such variations and to replicate true scenarios.
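The idea of template-driven variation can be sketched in a few lines. This is a minimal illustration, not Innodata's method: the layout options, field names, and sample transactions are all hypothetical, and a real system would draw on hundreds of templates rather than four combinations.

```python
# Hypothetical layout choices, mirroring the variations described above.
NAME_POSITIONS = ["left", "right"]
LAYOUTS = ["table", "column"]

def render_statement(name, transactions, name_position, layout, width=40):
    """Render one plain-text statement for a given name position and layout."""
    header = name.ljust(width) if name_position == "left" else name.rjust(width)
    if layout == "table":
        # One row per transaction, fields separated by pipes.
        rows = [f"{date} | {desc} | {amount:.2f}" for date, desc, amount in transactions]
    else:
        # One field per line, transactions stacked vertically.
        rows = [f"{date}\n{desc}\n{amount:.2f}" for date, desc, amount in transactions]
    return "\n".join([header] + rows)

def generate_variations(name, transactions):
    """Produce every combination of name position and layout."""
    return [
        render_statement(name, transactions, position, layout)
        for position in NAME_POSITIONS
        for layout in LAYOUTS
    ]

# Invented sample data for illustration only.
transactions = [("2022-06-01", "Coffee shop", 4.50), ("2022-06-03", "Bookstore", 23.99)]
variations = generate_variations("Jane Doe", transactions)
print(f"{len(variations)} layout variations generated")
```

Each added template dimension multiplies the number of distinct documents a model sees, which is how a modest set of templates can yield variation in both quality and quantity.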

“Machine learning (ML) depends on a diversity of datasets,” Singhal said. “We create real-world data sets as much as possible and replicate what real-world document types will look like.”

Why synthetic data?

Among their many advantages, synthetic datasets are free from personal data and therefore not subject to compliance restrictions or other privacy protection laws, Singhal pointed out. This also shields against security breaches. Biases are removed to help automate workflows and enable predictive modeling. Singhal noted that “things in the real world are not pristine,” and that people can smudge banking statements or accidentally or purposely obfuscate things.

Ultimately, synthetic data will be an important tool in driving the adoption of AI, Singhal said.

The eventual intent with Innodata’s marketplace is to expand to third-party AI training data sets, as well as beyond documents to images, video, audio, and speech (the latter in response to the growth in conversational AI). These datasets will also span industries – telecom and utilities, transportation and logistics, energy services, pharmaceuticals, hospitality, insurance, retail, healthcare – and will be provided in an expanding number of languages so that data scientists can build from a global perspective.

“Our goal is to create a vibrant marketplace where companies can contribute datasets and monetize data sets,” Singhal said. “This has the potential of democratizing data for AI.”
