Massive data sets are not a commodity for AI

Those of you who follow advancements in artificial intelligence (AI) have probably heard data called the "new currency" or the "crown jewels." At the same time, a contingent of people believe data is a commodity. But it can’t be both a highly prized proprietary possession and an interchangeable good.

So which is it?

Commodities 101

With college economics far behind many of us, here’s a quick refresher on what makes a commodity a commodity.

It’s fungible. The quality of a given commodity may vary slightly, but it’s essentially consistent across producers. Consider gold: The purity may differ across samples -- 9 karats versus 24 karats, for instance -- but gold is gold.
The market dictates the price, and it’s traded openly. The intersection of supply and demand serves as the pricing determinant of commodities. You can trade everything from pork futures to oil and precious metals on open markets like the Chicago Mercantile Exchange.
A commodity usually generates low margins. With no major differentiation from one product to the next, there’s minimal wiggle room and strong competition.
It has common standards, which makes it easy to exchange for goods of the same type.
A commodity is (usually) an input into another finished good. Oil on its own is essentially worthless; its value lies in its use as an energy source to fuel other products.

Is data a commodity?

Spoiler alert: It’s not.

Data -- accurate, precise training data that teaches models to discover predictive relationships -- does offer the keys to the kingdom. Yet it is far from being a commodity, and here’s why.

First, training data is not fungible. Consider the datasets we need to build autonomous navigation systems. A stoplight is a stoplight, so for a car to recognize one, you may think all you need is a series of positive and negative images to train a classifier. It might as well be the Not Hotdog app from HBO’s Silicon Valley. Except it’s not that simple. Stoplights don’t look the same in every country. Then there is the question of how the data was captured. What type of camera did the car use? Where was it mounted? What was the angle of capture, and is the stoplight (partially) obstructed? Was it a sunny day or a rainy night? Something as seemingly straightforward as labeling a stoplight is actually quite complex.

Second, is the market the sole determinant of price, and is data openly traded? There is no open market for training data. I suspect there never will be, because many organizations closely guard data as some of their most valuable intellectual property. Let’s stick with our autonomous driving example. Companies in this industry are in a race to get to Level 4 autonomy, where cars drive on their own. It’s not likely the automakers will share their proprietary data in the midst of competition this fierce. Nor will banks, insurance providers, ecommerce merchants, advertisers, or, given the choice, many of the rest of us.

Third, we should consider the low margins. Is the differentiation between the methodologies of generating training data minimal? That’s debatable, and it’s hard to say if anyone really knows. We haven’t yet seen controlled experiments to determine if one method is truly better than another. As someone who works with dozens of leading AI practitioners around the world, I can tell you that AI teams will pay more for accurate and consistent training data to help scale their models. There are plenty of traditional crowd-sourcing and outsourcing tools that offer more commoditized human knowledge as an API, but in that case, you’re getting what you pay for: lots of messy, imprecise data.

Fourth, there are no common standards. Let’s take autonomous vehicles again. A few regulations exist within regional and national governments, but they’re weak and inconsistent. In the U.S., the National Highway Traffic Safety Administration released guidance for vehicle performance for manufacturers and suppliers building autonomous vehicles, but it’s not a set of enforceable laws. The International Organization for Standardization issued the ISO 26262 standard in 2011, but it provides dated guidance with no enforcement teeth. And states have tremendous authority in creating laws and regulations that apply to traffic locally. I think we’re a long way off from seeing any common standards for autonomous driving data in the U.S., let alone globally.

And finally, is training data an input into a finished product? We can affirmatively check this box. It’s probably the most important input into an AI model, which is why so many people -- myself included -- liken it to the crown jewels.

Anyone who thinks training data is a commodity is in for a surprise. Companies are at vastly different stages of maturity in terms of building AI applications. And it’s not enough to have petabytes of raw data; you need it labeled consistently so that models trained on it deliver reliable precision and recall. Only then does it become the catalyst that differentiates and accelerates your AI. This makes applying AI harder in the short term but more interesting and valuable in the long term.

Oh, and it also means that we pesky carbon life forms are likely to continue providing the "crown jewels" that allow silicon to simulate and enhance our cognition for a long, long time to come.

Matt Bencke is the founder and CEO of Mighty AI, a startup that helps train artificial intelligence models.
