The multi-billion-dollar potential of synthetic data

Synthetic data is set to become a huge industry within five to 10 years. Gartner, for one, estimates that by 2024, 60% of data for AI applications will be synthetic. Both this type of data and the tools used to create it have significant untapped investment potential. Here's why.

Synthetic data can feed data-hungry AI/ML

We are on the cusp of a revolution in how machine learning (ML) and artificial intelligence (AI) grow and find new applications across sectors and industries.

We live in an era of skyrocketing demand for ML algorithms in every aspect of our lives, from fun face-masking applications such as filters on Instagram or Snapchat to deeply useful applications designed to improve our work and living experiences, such as assisting in diagnosing illness or recommending treatment. Among the prime opportunities are emotion and engagement recognition, better homeland security features and better anomaly detection in industrial contexts.

At the same time, while people and businesses are hungry for ML/AI-based products, algorithms are hungry for data to train on. That means we will inevitably see an ever wider range of data needs, and entirely manufactured data is the key to meeting them.

From Grand Theft Auto to Google

Heard about self-driving cars learning the rules of the road by playing games like Grand Theft Auto V to study virtual traffic? That was an early version of ML through synthetic data. Similarly, many in tech may have come across synthetic “scanned documents,” which have been used to train text recognition and data extraction models.

Banking and finance is one sector that already leans heavily on synthetic data for certain processes, while tech giants like Google and Facebook are also using it, drawn by the extraordinary efficiency it can bring to the work of project managers and data scientists.

In fact, we expect the number of synthetic images and data points to increase tenfold over the next year and many hundredfold in the next few years.

Constraints of real-world data

Those at the cutting edge of ML are increasingly turning to synthetic data to circumvent the numerous constraints of original or real-world data. For instance, Synthesis AI offers a cloud-based generation platform that delivers millions of perfectly labeled and diverse images of artificial people. Synthesis AI has been able to overcome many of the challenges that come with the messy reality of original data. For a start, the company makes the data cheaper: it can be too expensive for an organization to collect real-world data in the quantity and diversity it needs.

For example, could you get photos of someone from every conceivable angle, wearing every possible combination of clothing in every possible light condition? It would be an unimaginable amount of work to do that in real life, but synthetic data can be designed to account for endless variations.
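To make the scale of "endless variations" concrete, here is a minimal Python sketch of how a generator might enumerate scene configurations as a Cartesian product of parameters. The parameter names and values (camera_angles_deg, outfits, lighting) are hypothetical and chosen only for illustration; they do not describe any particular vendor's platform.

```python
from itertools import product

# Hypothetical scene parameters; a real generator would expose far more.
camera_angles_deg = range(0, 360, 15)  # 24 viewpoints around the subject
outfits = ["t-shirt", "jacket", "coat", "uniform"]
lighting = ["daylight", "overcast", "indoor", "night"]


def scene_configs():
    """Yield one configuration per combination of the parameters above."""
    for angle, outfit, light in product(camera_angles_deg, outfits, lighting):
        yield {"camera_angle_deg": angle, "outfit": outfit, "lighting": light}


configs = list(scene_configs())
print(len(configs))  # 24 * 4 * 4 = 384 distinct scenes
```

Even this toy setup yields 384 scenes from three short lists, and each extra parameter multiplies the total. Photographing the same coverage in real life would mean staging every one of those shots.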

That also means much easier labeling of data. Imagine trying to pinpoint the source of light, its brightness, and its distance from an object in photos to train a shadow development algorithm. It would be pretty much impossible. With synthetic data, you have those labels by default, because each image was generated from exactly those parameters.
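The idea that labels come "by default" can be sketched in a few lines of Python. This is an illustrative toy, not how any specific product works: the LightLabel fields, the sampling ranges and the shadow_length helper are all assumptions, and the shadow formula treats the light as directional (height divided by the tangent of the elevation angle) as a stand-in for a real renderer.

```python
import math
import random
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class LightLabel:
    """Hypothetical ground-truth parameters for one synthetic scene."""
    azimuth_deg: float
    elevation_deg: float
    brightness: float
    distance_m: float


def sample_scene(rng):
    # The generator picks the parameters, so they are known exactly.
    return LightLabel(
        azimuth_deg=rng.uniform(0.0, 360.0),
        elevation_deg=rng.uniform(5.0, 85.0),
        brightness=rng.uniform(0.1, 1.0),
        distance_m=rng.uniform(0.5, 10.0),
    )


def shadow_length(object_height_m, label):
    # Stand-in for rendering: shadow cast on flat ground by a directional light.
    return object_height_m / math.tan(math.radians(label.elevation_deg))


def make_dataset(n, seed=0):
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        label = sample_scene(rng)
        row = asdict(label)  # the labels are simply the sampled parameters
        row["shadow_length_m"] = shadow_length(1.0, label)
        rows.append(row)
    return rows


for row in make_dataset(3):
    print(row)
```

Nobody annotates anything here. Every row carries the light's direction, brightness and distance because those values were the inputs that produced the scene.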

Furthermore, companies must contend with stringent restrictions on the use of real-world data. In the past, companies shared data without the layers of cybersecurity expected now. Today, GDPR and other data regulations make it complex and challenging, and sometimes illegal, for companies to share real-world data with partners and vendors.

In other cases, it may not even be possible or safe to generate the data. The real-time 3D engine producer Unigine counts Daedalean, which is working on urban air mobility, as a client. Daedalean has started to train its autonomous flying cars in Unigine virtual worlds. This makes complete sense: the company doesn’t yet have a safe real-world environment in which to test its products extensively and generate the deep datasets it needs. A similar case is the CarMaker software by IPG Automotive. Its 10.0 release introduced upgraded 3D visualization powered by UNIGINE 2 Sim, featuring physically based rendering and real-world camera parameters.

Synthetic people and synthetic objects have been much more widely used by tech giants recently. Amazon used synthetic data to train Alexa, Facebook acquired synthetic data generator AI.Reverie, and Nvidia released NVIDIA Omniverse Replicator, a powerful synthetic-data-generation engine that produces physically simulated synthetic data for training deep neural networks.

Combating bias in data

The challenges of real-world data don’t end there. In some fields, huge historical bias pollutes data sets. This is how we end up with global tech behemoths landing in hot water because their algorithms don’t recognize Black faces properly. Even now, with ML experts acutely aware of the bias issue, it can be challenging to collate a real-world dataset entirely free of bias.

Even if a real-world dataset could overcome all of the above challenges, which is hard to imagine in practice, models need to be improved and tweaked constantly to stay unbiased and avoid degradation over time. That means a constant need for fresh data.

Understanding the opportunity

Synthetic data is in the relatively early stages of growth, and it’s not a panacea for every use case. It continues to face technical challenges and limitations, and its tools and practices have not yet been standardized.

Nonetheless, synthetic data is definitely an accelerator for ML/AI-based products as they continue to expand into every industry and sector, and we’ll certainly see a lot of new companies and deals in the area. For anyone who wants to dive deeper into the topic, the Open Synthetic Data Community is a hub for synthetic datasets, papers, code, and the people pioneering their use in machine learning.

Sergey Toporov is a partner at Leta Capital.
