Datagen emerges from stealth to create synthetic datasets for computer vision models

Datagen, a Tel Aviv, Israel-based startup offering a platform for creating synthetic training data for computer vision systems, today emerged from stealth with $18.5 million in funding from TLV Partners and Viola Ventures. The company says the proceeds will be put toward growing its R&D lab while it expands into new markets globally.

Datagen, which Ofir Chakon and Gil Elbaz founded in 2018, leverages computer graphics and data generation to simulate the real world with datasets that include 2D and 3D annotations. By combining generative adversarial networks (GANs) with reinforcement learning-driven humanoid motion algorithms within a physical simulator, Datagen says it can deliver photorealistic, scalable datasets suitable for augmented and virtual reality, internet of things, smart store, robotics, and smart car use cases.

GANs are two-part AI models consisting of a generator that creates samples and a discriminator that attempts to differentiate between the generated samples and real-world samples. As for reinforcement learning, it's a technique that allows AI models to learn how to make decisions automatically through trial and error.
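To make the trial-and-error idea concrete, here is a minimal sketch of one of the simplest reinforcement learning setups, an epsilon-greedy "multi-armed bandit." It is an illustration only and is not Datagen's humanoid motion algorithm; the function name `run_bandit`, the reward probabilities, and the parameter values are all assumptions chosen for the example.

```python
import random


def run_bandit(true_reward_probs, steps=5000, epsilon=0.1, seed=0):
    """Learn which action pays off best purely by trying actions and observing rewards."""
    rng = random.Random(seed)
    n_arms = len(true_reward_probs)
    counts = [0] * n_arms      # how often each action was tried
    values = [0.0] * n_arms    # running estimate of each action's average reward

    for _ in range(steps):
        if rng.random() < epsilon:
            # Explore: occasionally try a random action.
            arm = rng.randrange(n_arms)
        else:
            # Exploit: pick the action that currently looks best.
            arm = max(range(n_arms), key=lambda a: values[a])

        # The environment returns a reward of 1 or 0 (hidden probabilities).
        reward = 1.0 if rng.random() < true_reward_probs[arm] else 0.0

        # Incrementally update the running average for the chosen action.
        counts[arm] += 1
        values[arm] += (reward - values[arm]) / counts[arm]

    return values, counts


# Hypothetical environment: three actions with different success rates.
estimates, pulls = run_bandit([0.2, 0.5, 0.8])
best_arm = max(range(len(estimates)), key=lambda a: estimates[a])
print("estimated values:", [round(v, 2) for v in estimates])
print("times each action was tried:", pulls)
print("action judged best:", best_arm)
```

The agent is never told which action is best. It mostly repeats what has worked so far and occasionally experiments, which is the same explore-and-refine loop that larger reinforcement learning systems build on.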

Collecting and labeling training data can be expensive for enterprises. For example, self-driving vehicle companies alone spend billions of dollars per year collecting and labeling training data, according to estimates.

Third-party contractors enlist hundreds of thousands of human data labelers to draw and trace the annotations machine learning models need to learn. (A properly labeled dataset provides a ground truth that the models use to check their predictions for accuracy and continue refining their algorithms.) Curating these datasets to include the right distribution and frequency of samples becomes exponentially more difficult as performance requirements increase. And the pandemic has underscored how vulnerable these practices are, as contractors have been increasingly forced to work from home, prompting some companies to turn to synthetic data as an alternative.
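As a rough illustration of what checking predictions against ground truth looks like in computer vision, the sketch below scores a predicted bounding box against a human-labeled one using intersection over union (IoU), one commonly used overlap measure. The box coordinates and the function name `iou` are assumptions made for the example, not details from Datagen or its customers.

```python
def iou(box_a, box_b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b

    # Width and height of the overlapping region (zero if the boxes don't overlap).
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    intersection = inter_w * inter_h

    area_a = (ax2 - ax1) * (ay2 - ay1)
    area_b = (bx2 - bx1) * (by2 - by1)
    union = area_a + area_b - intersection

    return intersection / union if union > 0 else 0.0


# Hypothetical labeled box (ground truth) and a model's predicted box.
ground_truth = (0, 0, 10, 10)
prediction = (5, 0, 15, 10)
print("IoU:", iou(ground_truth, prediction))
```

A score of 1.0 means the prediction matches the label exactly and 0.0 means no overlap. The quality of the label itself bounds how meaningful that score is, which is why careful annotation is so costly.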

To create synthetic training data, Datagen works with customers to establish requirements like camera lens specifications, lighting, environmental factors, demographic distributions, and annotations and metadata. The process begins with 3D base models of people and objects scanned from the real world or designed with computer graphics software. Datagen's platform creates representations of these models with meshes and textures as well as semantic metadata. Lastly, Datagen employs GANs to sample from these representations and synthesize unique models, building libraries of millions of 3D assets that are then subjected to physics-based algorithms that simulate motion and help to scale rendering.
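One way to picture the requirements step is as a specification from which individual scenes are drawn. The sketch below is purely hypothetical: `DatasetSpec`, its fields, and the example values are invented for illustration and do not reflect Datagen's actual platform or API.

```python
import random
from dataclasses import dataclass


@dataclass
class DatasetSpec:
    """Hypothetical customer requirements for a synthetic dataset."""
    focal_length_mm: float   # camera lens specification
    lighting: list           # allowed lighting conditions
    annotations: list        # labels to export with every image
    group_weights: dict      # desired distribution across subject groups

    def sample_scene(self, rng):
        """Draw one scene description that respects the requested distribution."""
        groups = list(self.group_weights)
        weights = [self.group_weights[g] for g in groups]
        return {
            "focal_length_mm": self.focal_length_mm,
            "lighting": rng.choice(self.lighting),
            "group": rng.choices(groups, weights=weights, k=1)[0],
            "annotations": list(self.annotations),
        }


# Example values are assumptions chosen only to make the sketch runnable.
spec = DatasetSpec(
    focal_length_mm=35.0,
    lighting=["daylight", "indoor", "low_light"],
    annotations=["2d_keypoints", "3d_mesh", "segmentation"],
    group_weights={"group_a": 0.5, "group_b": 0.3, "group_c": 0.2},
)
scene = spec.sample_scene(random.Random(0))
print(scene)
```

Because every scene is generated from the specification, the annotations come for free with each sample rather than being drawn by hand afterward.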

For example, Datagen says that its platform can capture hand data that could power gesture-based interactions with headsets. Beyond creating meshes and skeletal models for a range of human hands, the company claims its technology can accurately mimic real-world hand-to-object and hand-to-hand interactions.

"Computer vision can be an amazing tool for defect and risk detection -- things like errors on an assembly line or rust or cracks that threaten the structural integrity of a building," Chakon told VentureBeat via email. "Simulated data can supercharge this application by simulating extreme cases that would be dangerous to capture manually in a data set or are extremely rare. It also allows enterprises to create environmental variations to strengthen performance, like different lighting conditions, robotic attachments, or tools."

The AI training dataset market is anticipated to be worth $4.8 billion by 2027, according to Grand View Research, and Datagen has rivals in a number of startups. Parallel Domain also taps AI and machine learning to create synthetic computer vision datasets. There's also Cvedia and AI Reverie, both of which are developing simulators targeting applications across data generation, labeling, and enhancement.

However, unlike many of its competitors, Datagen makes privacy one of its focuses. Chakon points out that Gartner estimates 65% of the world's population will have their data protected by privacy laws and regulations by 2023. This stands to make collecting AI training data in the real world less straightforward and the alternative -- synthetic datasets that don't sweep up data like faces or license plates -- more attractive.

"Many new products not yet in production -- smart appliances, robotics, and more -- will have specific camera types and orientations. In many cases, this means datasets need to reflect the specific nuances of that hardware in order to be effective," Chakon continued. "But, if the hardware is not in the hands of consumers or is highly secretive, it can be impossible to efficiently collect the data you need. Simulated data can imitate these specifications, allowing teams to develop software solutions that are perfectly attuned to hardware that is still in development."

Of course, synthetic data isn't a panacea in the absence of real-world data. For example, in the autonomous vehicle domain, simulations and running vehicles on test routes can help to prove that cars meet specific compliance needs. But public roads present complex, real-world dynamics that even the best simulators can't consistently deliver, including different weather conditions and a range of pedestrian and driver behaviors.

That's why Chakon advises Datagen's customers, which include the AI research arms of several manufacturing giants, that a mix of synthetic and real-world data is the best approach. "The real-world implication is that, once deployed, you can be sure it's going to work well in different domains, with different ethnicities, in different geographic locations, or any environment you can imagine," he said.

Existing investor Spider Capital also participated in the round announced today, the 40-employee company's first public fundraising, as did individual investors Kaggle CEO Anthony Goldbloom and UC Berkeley AI Research Lab founder Trevor Darrell.
