Why Unity claims synthetic data sets can improve computer vision models

When the pandemic hit, one of the casualties was the autonomous vehicle market. With fleets largely grounded, AV companies couldn't put in the long, real-world miles to collect the massive amounts of data they need to improve the cars' perception capabilities, thus slowing down progress on reaching new levels of autonomy and moving into new pilots and markets. It was a gut punch, but AV makers turned to using synthetic data and simulations to continue training as much as possible.

The forced pivot could be a mixed blessing. Real-world data is invaluable, but in a presentation at Transform 2020, Unity's principal ML engineer Cesar Romero made the case for using synthetic data to train autonomous vehicles, robots, and more. Unity is widely known for its eponymous game engine, but the company also offers tools for the transportation, film, architecture, engineering, and construction industries. While noting that all of these systems require a lot of data and large collections of examples, he pointed to the inherent challenges with real-world data and juxtaposed them with the relative upsides of synthetic data.

"First we have regulatory concerns such as GDPR, for example," he said. "These kinds of regulations try to emphasize that the data belongs to the individual, and not to the entity that collects it -- in which case it might make it hard for us to collect all that data blindly and just use it to learn from it." But simulated data obviates that concern entirely; it's completely synthetic, so there's no privacy to violate or ownership to question.

Real-world data also often suffers from bias, or there simply may not be enough of it. "That data that you might need to train your system might not naturally occur frequently enough in the real world," Romero noted. And even if you can acquire a sufficient amount of it, the collection and data annotation processes take time -- which is to say, money. And, Romero said, those problems don't go away with scale.

For example, computer vision systems for AVs learn from road events like car accidents, which are (fortunately) so rare that it's difficult to collect enough examples to train models. But, he said, you can "create a simulation in Unity where you can actually add multiple pedestrians and cars to the intersection and see how they interact with each other and intentionally simulate car accidents or near misses, and use those as examples to train the computer vision model."

He illustrated how, over the course of just a few years, the need for complexity within computer vision data has grown sharply. He said that in 2012, ImageNet was a revelation, but what it offered is considered simple by today's standards. A photo of a busy intersection would get a single label, like "cars." "Just knowing that there are cars in this image is not sufficient for any autonomous system to make a decision -- it's not sufficient [enough] to tell a car that there are other cars here," he said, "so other tasks become more relevant over time."

He illustrated the layers and levels of tasks that AV systems need from an image like this one: The next step is object detection, where there are bounding boxes around each item in the image, so the system knows that there are things it needs to avoid. But that's a more complex labeling challenge than "cars." Next is semantic segmentation, where every pixel in the image gets a label according to what the object represents; in the sample image, for example, cars are blue, pedestrians are magenta, and so on. Then there's instance segmentation, which shows you how many individual cars, pedestrians, and other objects there are. From there we get into panoptic segmentation, where every pixel is labeled according to both instance and class. "This is closer to implicitly what humans do, and what you might want a system like an autonomous vehicle to be able to do in real time as they make decisions," Romero said.
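To make that progression concrete, here is a minimal Python sketch. It is an illustration only, not something Romero presented: the tiny grid, the class names, and the convention that instance id 0 marks uncountable "stuff" like road are all assumptions. It starts from a panoptic annotation, where every pixel carries both a class and an instance, and derives each of the simpler label types from it.

```python
from collections import defaultdict

# Hypothetical 4x6 panoptic annotation: each pixel is (class_name, instance_id).
# Assumption: instance id 0 marks "stuff" such as road, which is not counted.
R = ("road", 0)
C1 = ("car", 1)
C2 = ("car", 2)
P1 = ("pedestrian", 1)

panoptic = [
    [R, C1, C1, R, R, R],
    [R, C1, C1, R, C2, C2],
    [R, R, R, P1, C2, C2],
    [R, R, R, P1, R, R],
]


def image_labels(pan):
    """Image-level classification: which classes appear anywhere."""
    return sorted({cls for row in pan for cls, _ in row})


def semantic_mask(pan):
    """Semantic segmentation: one class per pixel, instances ignored."""
    return [[cls for cls, _ in row] for row in pan]


def instances(pan):
    """Instance segmentation: pixel coordinates grouped per countable object."""
    groups = defaultdict(list)
    for y, row in enumerate(pan):
        for x, (cls, inst) in enumerate(row):
            if inst != 0:
                groups[(cls, inst)].append((x, y))
    return dict(groups)


def bounding_boxes(pan):
    """Object detection: (x_min, y_min, x_max, y_max) per instance."""
    boxes = {}
    for key, pixels in instances(pan).items():
        xs = [x for x, _ in pixels]
        ys = [y for _, y in pixels]
        boxes[key] = (min(xs), min(ys), max(xs), max(ys))
    return boxes
```

Each simpler annotation falls out of the panoptic one, but not the other way around, which is one way to see why the richer labels cost more to produce by hand.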

He said that because each successive task is more difficult, each takes more time to label and audit, and therefore the cost of annotation grows.

And, of course, a real-world image limits you to a single view of an object and of the scene, or "world," it's in. But because a simulation is rendered, the game engine knows the entirety of the object and the world it's in. "It knows exactly what each pixel is because it is rendered itself. So we can use this information to generate data sets," Romero said.
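The idea that a renderer already knows what every pixel is can be shown with a toy example. The sketch below is not Unity's API; it is a hypothetical, pure-Python "renderer" that paints rectangles into a color buffer and, in the same pass, writes the object's name into a label buffer. The function name, the dictionary keys, and the scene itself are assumptions made for illustration.

```python
def render_scene(width, height, objects):
    """Paint axis-aligned rectangles into a color buffer and a label buffer.

    `objects` is a list of dicts with hypothetical keys: name, color, and
    box=(x0, y0, x1, y1), where x1 and y1 are exclusive.
    Later objects are drawn on top of earlier ones.
    """
    background = (0, 0, 0)
    image = [[background] * width for _ in range(height)]
    labels = [["background"] * width for _ in range(height)]
    for obj in objects:
        x0, y0, x1, y1 = obj["box"]
        for y in range(max(y0, 0), min(y1, height)):
            for x in range(max(x0, 0), min(x1, width)):
                image[y][x] = obj["color"]   # what the camera would see
                labels[y][x] = obj["name"]   # ground truth, produced for free
    return image, labels


# Blue car partly hidden behind a magenta pedestrian.
scene = [
    {"name": "car", "color": (0, 0, 255), "box": (0, 1, 3, 3)},
    {"name": "pedestrian", "color": (255, 0, 255), "box": (2, 0, 4, 4)},
]
image, labels = render_scene(5, 4, scene)
```

Because the label buffer is filled by the same loop that draws the pixels, occlusion is handled automatically and no human annotation step is needed.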

Those data sets built from the scene can be rich with variation because of "limitless domain randomization," which is when you can change colors, materials, and lighting within a given simulation to provide more data from the same scene. For example, you could change the lighting in a scene from morning to afternoon to night, and each change produces additional data. Achieving the same in real-life data, Romero said, is too hard and expensive (that is, if it's even possible, which it isn't in some cases).

And rendered objects aren't just flat 2D images; they can be 3D objects, which opens up myriad ways to manipulate them. "If you start from a single 3D model of a product, you can arbitrarily rotate it, change the background, change the distance between the object and the camera that is capturing the image, change the blur or focus or color of the light and then you might end up with millions of images," he said.
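A rough sketch of what that kind of randomization might look like in code follows. It only samples the parameters Romero lists (rotation, background, camera distance, blur, light color); it does not render anything. The class name, field names, value ranges, and background list are all hypothetical choices for illustration, not settings from Unity or from the talk.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class SceneParams:
    """One randomized configuration for rendering a single 3D model."""
    rotation_deg: tuple        # rotation about x, y, z
    camera_distance_m: float
    background: str
    light_color: tuple         # RGB, 0-255
    blur_radius_px: float


# Assumed background choices; a real pipeline would use its own assets.
BACKGROUNDS = ["shelf", "street", "warehouse", "plain"]


def sample_params(rng):
    """Draw one random configuration from the assumed ranges."""
    return SceneParams(
        rotation_deg=tuple(rng.uniform(0.0, 360.0) for _ in range(3)),
        camera_distance_m=rng.uniform(0.5, 5.0),
        background=rng.choice(BACKGROUNDS),
        light_color=tuple(rng.randint(128, 255) for _ in range(3)),
        blur_radius_px=rng.uniform(0.0, 3.0),
    )


def generate_variations(n, seed=0):
    """Return n configurations; a fixed seed makes the data set reproducible."""
    rng = random.Random(seed)
    return [sample_params(rng) for _ in range(n)]


variations = generate_variations(1000)
```

Each configuration would be handed to a renderer to produce one labeled image, so the size of the data set is limited mainly by compute rather than by how many photos someone can take.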

In a chart, he calculated the difference in cost between synthetic and real-world data sets. According to Romero, although a real-world data set may incur only two thirds of the cost of a synthetic one, the former has a significantly higher cost per image. In the end, you'd have 1,500 images from a real-world data set versus more than a million synthetic images.
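The per-image gap implied by those figures can be worked out with a few lines of arithmetic. The sketch below rests on two assumptions: the synthetic data set's total cost is normalized to 1 (the article gives no dollar amounts), and "more than a million" is treated as exactly 1,000,000 images, which makes the result a lower bound.

```python
from fractions import Fraction


def cost_per_image_ratio(real_total, real_images, synth_total, synth_images):
    """How many times more a real image costs than a synthetic one."""
    real_per_image = Fraction(real_total) / real_images
    synth_per_image = Fraction(synth_total) / synth_images
    return real_per_image / synth_per_image


# Real-world set: two thirds of the synthetic total cost, 1,500 images.
# Synthetic set: total cost normalized to 1, at least 1,000,000 images.
ratio = cost_per_image_ratio(Fraction(2, 3), 1_500, 1, 1_000_000)
print(f"A real image costs roughly {float(ratio):.0f}x a synthetic one")
```

Under these assumptions, each real-world image works out to roughly 444 times the cost of a synthetic one, even though the real-world data set is cheaper overall.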

Romero offered a few examples of studies in favor of synthetic data. In the SYNTHIA data set for autonomous vehicles, training results showed that a combination of real-world data and synthetic data performed better than real-world data alone.

A 2017 paper called "Domain Randomization for Transferring Deep Neural Networks from Simulation to the Real World" trained a model for improving robot arm grabbing accuracy entirely on synthetic data and found that the images didn't need to be especially photorealistic. "[The researchers] intentionally randomize aspects of the image that don't particularly matter for the test that the model needs to perform," Romero said. In this case, they needed the robot arm to pick up a cube. "It doesn't matter what the color is, you know when a cube is a cube."

And in a third example, Google Cloud AI researchers trained an object detection model on synthetic data -- supermarket items -- that they said outperformed one that was trained on real data.

For those looking to get started with simulations and synthetic data, Romero said, Unity offers SynthDet, which can actually generate assets and label frames. You can run simulations locally on your own hardware, and you can use the company's cloud service, Unity Simulation, for large-scale simulations.

Though some of Romero's points about the advantages of synthetic data are seemingly indisputable, when it comes to autonomous vehicle training, some posit that real-world data is indispensable. For instance, in an earlier interview with VentureBeat, Waymo product lead for simulation and automation Jonathan Karmel said, "If you just focus on synthetic miles and don't start bringing in some of the realism that you have from driving in the real world, it actually becomes very difficult to know where you are on that curve of realism." But, he added, "That said, what we're trying to do is learn as much as we can -- we're still getting thousands of years of experience during this period of time."
