Unity's Danny Lange explains why synthetic data is better than the real thing at Transform 2021

Synthetic data is one of the latest transformative technologies that could solve the biggest problems organizations face: collecting the right data, structuring it, cleaning it, and ensuring that it's bias-free and privacy-law compliant, said Danny Lange, senior VP of AI and machine learning at Unity, during Transform 2021.

AI and machine learning are data-driven, Lange said, and now we train computers instead of programming them. The algorithms are still important, but what’s much more important today is the data, which determines the outcome of a learning algorithm.

Up to this point, real-world data has been labeled by hand and fed into a computer, a process that takes an enormous amount of time and labor.

"What if I could re-create all that data in a synthetic way?" Lange said. "That’s what we’re doing at Unity. We’re using the Unity engine to re-create three-dimensional worlds with objects in there. Then we can generate synthetic images that look very much like what they would look like in the real world, perfectly labeled."

Creating synthetic data

Synthetic data is built from assets, Lange said. If you want to build a system that can, for instance, automate checkout in a grocery store, you need a data set of grocery products. You could purchase the products, cut them into pieces, scan them, and re-create them as 3D assets that you bring into the Unity system to generate image data. You could also use CAD files or other files from manufacturers, or use other tools and technologies that scan objects directly into a 3D format.

Once you have these assets, you can start manipulating them within the system: rotate them, change the lighting, juxtapose them with other objects, and place them on a store shelf. The combinations are endless. You can add the objects to any context and experiment with light, shadow, and any kind of backdrop to improve the training data for your machine learning system.
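
The kind of manipulation described above can be sketched in a few lines of Python. This is a minimal illustration, not Unity's tooling: the product names, backdrops, value ranges, and the `SceneSpec` structure are all assumptions made up for the example. Each spec describes one scene a renderer could draw, and because the generator picks the product, the label is known before any image exists.

```python
import random
from dataclasses import dataclass

# Hypothetical catalog and backdrops; the names are illustrative only.
PRODUCTS = ["cereal_box", "soup_can", "milk_carton"]
BACKDROPS = ["store_shelf", "checkout_belt", "cafeteria_tray"]


@dataclass
class SceneSpec:
    product: str            # the target object, which doubles as the label
    yaw_degrees: float      # rotation around the vertical axis
    light_intensity: float  # relative brightness, 1.0 = nominal (assumed scale)
    backdrop: str
    neighbors: list         # other products placed next to the target


def random_scene(rng: random.Random) -> SceneSpec:
    """Draw one randomized scene description."""
    product = rng.choice(PRODUCTS)
    others = [p for p in PRODUCTS if p != product]
    return SceneSpec(
        product=product,
        yaw_degrees=rng.uniform(0.0, 360.0),
        light_intensity=rng.uniform(0.3, 1.5),
        backdrop=rng.choice(BACKDROPS),
        neighbors=rng.sample(others, k=rng.randint(0, len(others))),
    )


def generate_specs(count: int, seed: int = 0) -> list:
    """Generate a reproducible batch of scene descriptions."""
    rng = random.Random(seed)
    return [random_scene(rng) for _ in range(count)]


if __name__ == "__main__":
    for spec in generate_specs(5):
        print(spec)
```

Seeding the generator keeps a batch reproducible, which helps when a training run needs to be repeated on exactly the same scenes.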

"Real-world data is really just a snapshot of the situation," Lange said. "What you can do with the synthetic data is augment that real world with special use cases, special situations, special events. You can improve the diversity of your data by adding synthetic data to your data set."

By augmenting real-world data with synthetic data, Unity’s customers have been able to improve some of their object recognition rates from 70% or 80% to almost 100%, because the synthetic data adds much more diversity and many more scenarios to the training data.

The use case for synthetic data

Because synthetic data is generated, it comes perfectly labeled, and because a computer creates it, you can produce a great many images per second. That speed also means the cost is many times lower.
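
One way to see why generated images come labeled: the generator already knows where every object sits in the scene, so an annotation such as a bounding box can be computed rather than drawn by hand. The sketch below assumes a simple pinhole camera and an axis-aligned box in camera coordinates. The focal length, image size, and function names are illustrative assumptions and do not describe how Unity computes its labels.

```python
from itertools import product


def project_point(point, focal_px, cx, cy):
    """Project a camera-space 3D point (z > 0) onto the image plane."""
    x, y, z = point
    return (focal_px * x / z + cx, focal_px * y / z + cy)


def bounding_box_label(center, size, focal_px=800.0, image_size=(640, 480)):
    """Return (x_min, y_min, x_max, y_max) in pixels for an axis-aligned box.

    center and size are (x, y, z) tuples in camera coordinates, in meters.
    """
    cx, cy = image_size[0] / 2, image_size[1] / 2
    half = [s / 2 for s in size]
    # Enumerate the eight corners of the box.
    corners = [
        tuple(c + sign * h for c, sign, h in zip(center, signs, half))
        for signs in product((-1, 1), repeat=3)
    ]
    pixels = [project_point(p, focal_px, cx, cy) for p in corners]
    xs = [p[0] for p in pixels]
    ys = [p[1] for p in pixels]
    return (min(xs), min(ys), max(xs), max(ys))


if __name__ == "__main__":
    # A 20 cm x 30 cm x 10 cm product placed 2 m in front of the camera.
    print(bounding_box_label(center=(0.0, 0.0, 2.0), size=(0.2, 0.3, 0.1)))
```

The same scene state that drives the renderer drives the label, so the two cannot drift apart the way a hand-drawn annotation can.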

Synthetic data lends itself best to use cases like smart checkout. Consider a large cafeteria: the customer slides a tray under a camera, which recognizes what they selected for breakfast and charges for those items. There are also smart stores where cameras track customers, who purchase items simply by taking them off the shelf.

But there are many other areas, including robots, Lange said.

"Robots are leaving their cages in automotive manufacturing and other manufacturing facilities, becoming more 'co-bots,' or 'collaborative robots,'" he explained. "They’ll have to have many more skills based on vision. They’ll be able to interact with you when they can see you."

There's also the AR space: augmented reality applications can get much smarter if they know what they’re looking at. Finally, synthetic data can be used for security and safety, with cameras that detect dangerous or risky behavior.

For example, safety applications trained on synthetic data can detect people doing dangerous things, whether in the workplace or at an amusement park. Normally, that kind of data would be very difficult to get because collecting it would be dangerous, Lange said. Synthetic data can depict humans doing dangerous things without any risk. It can also help remove bias by ensuring that attributes like skin color, hair, and clothing are evenly represented.

Implementing synthetic data

When integrating synthetic data into your systems, you still need to ensure you have a real-world baseline, Lange said. You also need to define events that you can imagine happening in the real world.

The other key aspect is understanding the data, he added. The synthetic data you create needs to cover those real-world events, and that’s where you use data analytics. You map out the distributions of events, the angles from which the object needs to be detected, lighting conditions, and so on, and then, since the data is synthetic, you can add randomization on top of that.
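
As a rough illustration of that workflow, the Python sketch below counts how a real-world baseline is spread across camera-angle and lighting conditions, works out how many synthetic samples each condition still needs, and then randomizes within each condition. The bins, the lighting categories, the per-cell target, and the baseline values are all assumptions chosen for the example, not figures from Unity.

```python
import random
from collections import Counter

# Assumed condition grid: camera angle bins in degrees and lighting levels.
ANGLE_BINS = [(0, 30), (30, 60), (60, 90)]
LIGHTING = ["dim", "normal", "bright"]


def coverage(samples):
    """Count samples per (angle bin, lighting) cell."""
    counts = Counter()
    for angle, lighting in samples:
        for low, high in ANGLE_BINS:
            if low <= angle < high:
                counts[((low, high), lighting)] += 1
                break
    return counts


def synthetic_plan(samples, target_per_cell):
    """Number of synthetic samples each cell needs to reach the target."""
    counts = coverage(samples)
    return {
        (angle_bin, light): max(0, target_per_cell - counts[(angle_bin, light)])
        for angle_bin in ANGLE_BINS
        for light in LIGHTING
    }


def sample_synthetic(plan, seed=0):
    """Draw randomized (angle, lighting) conditions according to the plan."""
    rng = random.Random(seed)
    drawn = []
    for ((low, high), light), needed in plan.items():
        for _ in range(needed):
            drawn.append((rng.uniform(low, high), light))
    return drawn


if __name__ == "__main__":
    # Hypothetical real-world baseline: mostly low angles in normal light.
    baseline = [(10, "normal"), (15, "normal"), (20, "bright"), (45, "normal")]
    plan = synthetic_plan(baseline, target_per_cell=3)
    synthetic = sample_synthetic(plan)
    print(len(synthetic), "synthetic samples planned")
    print(coverage(baseline + synthetic))
```

The real baseline decides which conditions matter, and the synthetic samples fill in the conditions the baseline rarely captured.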

"What we can do here, compared to the real world, is we can scale," Lange said. "We can create improbable situations, because it’s not going to cost us anything in milliseconds, rather than trying to stage them in reality. The ease with which you can create all these scenarios is driving the use of synthetic data."

"I believe that the vast majority of training data will be synthetic," Lange continued. "You have to have the real world as a baseline, but synthetic data eliminates privacy concerns, because there are no real people involved. You can eliminate bias. You can do your data analytics and ensure that your data represents the real world in a very even way, better than the real world does."
