AI experts refute Cvedia's claim its synthetic data eliminates bias

Most of AI's critical challenges aren't actually about AI; they're about data. It's biased. It's collected and used without regard for privacy and consent. And machine learning systems require astronomical amounts of it. Now, as privacy laws proliferate, it will also be harder to come by.

Enterprises are increasingly considering synthetic data to power their AI. Digitally generated as a stand-in for real-world data, it is touted as truly anonymous and bias-free. And because it's supposed to be free from all the issues of messy real-world data, much less of it would be required. But that's all easier said than done. While enterprises across industries are already using synthetic data to train voice recognition, computer vision, and other systems, serious issues persist. We know the original training data isn't always truly obscured, and there's currently little evidence to suggest synthetic data can effectively mitigate bias. On top of that, performance has been mixed compared to systems trained on real-world data.

Recently, synthetic data and computer vision company Cvedia announced it has "officially solved the 'domain adaptation gap'" with a proprietary synthetic data pipeline it claims performs better than algorithms trained on real data. The company is also claiming its system is free of bias, built on "zero data," and will enable customers to "sidestep the entire data process." If true, such advancements could strengthen the case for the use of synthetic data in AI, but experts say Cvedia lacks sufficient evidence and has oversold its work.

"It's not solving the entire domain gap, nor is it eliminating bias from the systems," Mike Cook, an AI researcher at Queen Mary University of London, told VentureBeat. "It's definitely good. Like I say, I've seen similar techniques elsewhere. But it's not doing all the amazing things being claimed here."

The domain gap

The "domain gap" or "domain adaptation gap" refers to the way AI trained on a specific type of data struggles to transfer its knowledge to a different type of data. Beyond comparisons between synthetic and real-world performance, this performance issue often happens with the deployment of AI systems in general, since they're inherently moving from a clean environment into real scenarios. There's often also a domain gap when applying AI to a new task. Cook said "it's definitely a big problem" but that it's not the type that can be "solved."

In its announcement, Cvedia doesn't clearly spell out what it has actually accomplished. Other than one vague metric -- precision improvement of 170% while sustaining a gain of 160% on recall over benchmarks -- the company didn't release any information about the data or its processes. Cvedia cofounder and CEO Arjan Wijnveen told VentureBeat the data is mainly for EO/IR sensors used in various types of cameras, specifically for detection, classification, and regression algorithms. But he wouldn't share any information about tests and trials, which both Cook and Os Keyes, an AI researcher at the University of Washington, agreed are needed to support the claims.

Wijnveen declined to share such information with VentureBeat, calling it proprietary. But he did say the metric released and overall claims are based on just one use case -- defense supplier FLIR Systems, which provided the statistic from its own evaluation. Cook and Keyes agree that even if the company has seen performance success with one system, that's a far cry from solving the domain gap problem. They became especially skeptical upon hearing that Cvedia is funded by FLIR Systems and the defense company's CTO, Pierre Boulanger, is also one of Cvedia's two legal advisors (Wijnveen is the other).

Data is data

Synthetic data is typically created by digitally regenerating real-world data so it's still mathematically representative. But in its press release, Cvedia claims it didn't use any data at all. Wijnveen later explained it differently to VentureBeat, saying "it is simply created out of thin air" and that this "goes against all the things data scientists stand for, but for us it really does work."

Specifically, he explained the company tapped a team of 50 artists to create 3D models of various objects found in the real world, which the company then sells to be used for training AI systems. He added that labeling is "fully automated" and that a 3D engine "simply generates data with the labels and signs." For these reasons, he claims AI built on these models is free of bias. But models represent data, even if they're created internally rather than collected. And someone had to design every part of the systems that made all this happen. Wijnveen also admitted there are some exceptions, where real photos were used and annotations were done by hand. Overall, Cook called the belief that the technique eliminates bias an "eyebrow-raising claim."

"Generating your own data is definitely a useful approach, but in no way would anyone consider that free of bias," he said. "Who are these artists? Which objects did they model? Who chose them? Suppose this is a targeting AI for a military drone and I'm going to teach it to identify civilian targets from military ones. The artists still need to be told what to model. If they're told to model mosques as potential military targets and American bases as civilian ones, we wouldn't say that was unbiased simply because they're 3D models."

Keyes agreed, citing how limitations act as a bias in this scenario: "Whether you have 50 photographers out on the street taking pictures or 50 CAD artists in a basement making them up, those 50 people are still going to be limited in what objects they can see and imagine."

Defining bias

Even when presented with these points, Wijnveen argued that systems trained on Cvedia's synthetic data are free of bias. He doubled down with regard to bias around race and face detection, saying "these are not biases we suffer from." It turns out he was using his own definition of bias.

"There's always going to be tradeoffs, right, so it'll never be a perfect solution. But very often, depending on the jurisdiction of the application on top of it, you're still going to get suitable results," Wijnveen said. "So it really is about having a productive commercial level application that will work in the field, and not so much from a scientific, data science point of view." He went on to say "there are nuances" and we need to "redefine [bias] in terms of academic versus productive-ized biases."

But bias is no trivial issue when it comes to machine learning and AI. It plagues the field and can creep into algorithms in several ways. Cook said bias elimination is an "extremely strong claim" and that it makes the press release "transparent as a piece of PR." He added that saying you've eliminated bias means something specific to people, who overwhelmingly view the issue through a lens Wijnveen calls "academic." Keyes compared the claim to a doctor declaring they had cured cancer after treating one melanoma.

"No academic researcher worth their salt genuinely believes one can entirely eliminate bias because of how contextual to the use case 'bias' is," Keyes said. "The thing that makes this not academic is that there's zero actual detail or evidence. If an academic researcher tried to make a claim like this, they would be required to explain precisely what they were doing, how they were defining bias, what the system was for. He's done none of that. He's just declaring 'We fixed the problem! Please don't ask us how; it's proprietary.'"

Maintaining AI realism

In spite of issues with Cvedia's work and with synthetic data at large, the general approach may hold promise. Keyes and Cook agree the company's work could be interesting, and DeepMind has been working on something similar since 2018. If synthetic data truly could obscure its origins and perform as well as systems trained on real-world data, that would be a step forward, particularly when sensitive information is involved.

But as more enterprises consider using synthetic data and move to implement AI in various forms, caution is warranted. While there are practical strategies for mitigating bias, enterprises should remain highly skeptical of claims that it has been eliminated. And tools meant to help spot and mitigate bias often underdeliver, as these issues run deep and elude easy fixes.

"It's important to take steps to improve how we build AI systems, but we also need to be realistic about the process and realize it requires attacking from multiple angles," Cook said. "There's no silver bullet."
