
Data scientists are increasingly using synthetic data to develop their AI systems. Indeed, a 2019 survey of the field calls the use of synthetic data “one of the most promising general techniques on the rise in [AI], especially computer vision.” Gartner predicts that 60% of the data used for the development of AI and analytics projects will be synthetically generated by 2024.

With the global AI training dataset market expected to be worth $4.8 billion by 2027, according to Grand View Research, it’s perhaps unsurprising that new startups are emerging to meet the demand. In January, Mostly AI, a company that uses AI to create synthetic data for enterprises, raised $25 million in venture capital. Synthetic data company Synthesis AI emerged from stealth in April. And Facebook acquired synthetic data startup AI.Reverie last October.

Another company, Synthetaic, goes a step beyond most synthetic data startups in claiming that its platform can eliminate the need for data labeling. Synthetaic, which today announced that it raised $13 million in Series A financing, says its technology has already been deployed for rare tumor diagnosis, tracking endangered species, and extracting insights from geospatial data.

Data labeling

In the enterprise, the most common type of AI system relies on supervised learning during the development process. Supervised learning involves recruiting people to annotate data — whether text, images, audio, or otherwise — so that an AI model can learn to associate certain annotations (i.e., labels) with characteristics of the data. For example, a supervised learning system that’s fed a large library of pictures of cats with annotations for each breed will eventually “learn” to distinguish between bobtails and shorthairs.
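As a rough illustration of the labeling-then-learning loop described above, here is a minimal sketch of a nearest-centroid classifier in Python. The feature values, the two measurements per cat, and the function names `train` and `predict` are all hypothetical and chosen only for illustration; real image classifiers learn from pixels with far larger models and datasets.

```python
from collections import defaultdict
from math import dist

# Hypothetical human-labeled examples: (tail_length_cm, coat_length_cm) -> breed.
LABELED = [
    ((5.0, 2.0), "bobtail"),
    ((6.0, 2.5), "bobtail"),
    ((28.0, 1.5), "shorthair"),
    ((30.0, 1.0), "shorthair"),
]

def train(examples):
    """Average the feature vectors of each label into one centroid per label."""
    grouped = defaultdict(list)
    for features, label in examples:
        grouped[label].append(features)
    return {
        label: tuple(sum(column) / len(column) for column in zip(*rows))
        for label, rows in grouped.items()
    }

def predict(centroids, features):
    """Return the label whose centroid is closest to the given features."""
    return min(centroids, key=lambda label: dist(centroids[label], features))

centroids = train(LABELED)
print(predict(centroids, (7.0, 2.0)))  # a short-tailed cat the model has not seen
```

Every prediction here depends on the annotations in `LABELED`, which is why the cost and quality of human labeling matter so much in supervised learning.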



Synthetaic, which was founded in 2019 by Corey Jaskolski, claims to eliminate the need for labeling through the use of synthetic data. Synthetic data — which comes with auto-generated labels — can be used in place of real-world data in cases where the real-world data is scarce or difficult to obtain, Synthetaic asserts, enabling organizations to create AI systems quickly and cheaply.

“Jaskolski started Synthetaic after working with National Geographic in conservation efforts to preserve the Sumatran Rhino. His work there led to the realization that generative AI was the answer to the lack of data the AI models used in impactful applications such as conservation, security, and medical imaging, where good data is hard to come by,” a company spokesperson told VentureBeat via email. “We have a technology that can democratize AI and apply AI to projects or applications that have previously been inaccessible.”

Synthetaic espouses the benefits of synthetic data in healthcare, noting that the data it creates isn’t constrained by regulations like HIPAA, the U.S. law that governs the release of sensitive patient information. In partnership with Michigan Medicine, the University of Michigan-owned academic medical center, Synthetaic claims to have helped to boost the accuracy of a brain tumor-detecting computer vision model from 68% to 96%.

For another client, National Geographic, Synthetaic says that it helped to create an AI-powered platform that identifies and detects poachers and other “dangerous anomalies,” like illegal harvesting and signs of environmental impact (particularly missing trees and coastline changes), from satellite images. Synthetaic also worked with the U.S. Air Force to “demonstrate how [the company’s] technology can rapidly speed up AI-powered object detection in geospatial data,” according to Jaskolski.

“[Our platform] is different than most other AI tools in that it does not require a traditional trained model to be effective,” the spokesperson continued. “Using [it,] a user can find things such as all the full parking lots in Milwaukee, specific vehicles in full motion video, or photos in which someone is holding a pistol. In each of these examples, [the platform] can provide initial AI results in minutes without labeled data from a single example image. This allows for easy AI experimentation … and allows enterprise customers to develop AI models without needing to send their company’s data out for human labeling.”

Accuracy questions

It’s not just Synthetaic and rivals who’ve heralded synthetic data as the solution to some of the major problems plaguing AI. For example, Nvidia researchers have explored a way to use synthetic data created in virtual environments to train robots to pick up objects like cans of soup, a mustard bottle, and a box of Cheez-Its in the real world. Institutions including the U.S. Department of Veterans Affairs are using synthetic medical histories for thousands of fake patients in order to study disease patterns and treatment paths.

“All of the AI being developed today is data-hungry, and feeding AI with high-quality labeled data is a vast challenge regardless of the environment. Our flagship technology … automates the analysis of large, unstructured, multidimensional datasets,” the Synthetaic spokesperson added. “Synthetaic introduces new technology that solves AI’s data problem by building models in minutes instead of months, vastly reducing the time to insight … [Our platform] eliminates the need for time-intensive human labeling or expensive labeled data troves, which is, perhaps, the single-largest barrier to unlocking practical AI.”

In a survey of executives applying AI, 89% said synthetic data will be essential for their organizations to stay competitive. But there’s a downside: Some evidence suggests that synthetic data can perpetuate biases in both data and the AI systems developed using them.

In a January 2020 study, researchers at Arizona State University showed that an AI system trained on a dataset of images of professors could create highly realistic synthetic faces, but those faces were mostly male and white. The system amplified biases contained in the original dataset, which captured mostly male and white professors.

“There are several different approaches to [synthetic data generation], but in some ways, the data ethics risks are greater [with approaches like Synthetaic’s] … because they rely heavily on additional text attributes beyond the class name itself,” Bernard Koch, a Ph.D. student at the University of California, Los Angeles studying the intersection of science, culture, and machine learning, told VentureBeat via email. “After training, the idea is that you can learn to predict a truck without seeing one before, [for example] because you know that it is not a car or a bus but has attributes in common with cars and buses. From an ethics perspective, any socially insensitive or under-representation annotation issues that can occur with class labels can now occur with descriptive attributes as well.”
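The attribute-based idea Koch describes can be sketched in a few lines of Python. This is not Synthetaic's method, which the article does not detail; it is a toy version of attribute-based zero-shot classification, and the attribute names, the class descriptions, and the `classify` function are hypothetical. It assumes a separate model, trained only on cars and buses, has already predicted the attributes of a new image.

```python
# Hypothetical binary attributes, in order:
# (has_wheels, carries_cargo, seats_many, open_bed)
CLASS_ATTRIBUTES = {
    "car":   (1, 0, 0, 0),
    "bus":   (1, 0, 1, 0),
    "truck": (1, 1, 0, 1),  # no training images; known only by its description
}

def classify(predicted_attributes, class_attributes):
    """Pick the class whose attribute description has the fewest mismatches."""
    def mismatches(label):
        described = class_attributes[label]
        return sum(a != b for a, b in zip(predicted_attributes, described))
    return min(class_attributes, key=mismatches)

# Attributes an upstream model (assumed, not shown) predicted for a new image.
print(classify((1, 1, 0, 1), CLASS_ATTRIBUTES))
```

The class descriptions in `CLASS_ATTRIBUTES` are themselves annotations written by people, so an insensitive or unrepresentative attribute propagates to every prediction that relies on it, which is the risk Koch points to.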

Jaskolski claims that the company has taken steps to mitigate bias in the systems that it creates.

“While it is true that synthetic data can introduce bias in AI models, at Synthetaic, we use synthetic data in a novel manner in that we don’t actually create synthetic data for the purpose of training. Rather, we use synthetic data models to power our … product which can rapidly build an AI model on real data without millions of human labeled samples,” he said. “This actually reduces another major source of bias, which is human label error, while also removing the large time and budgetary downside of human labeling.”
