Data labeling will fuel the AI revolution

This article was contributed by Frederik Bussler, consultant and analyst.

AI fuels modern life, from the way we commute to how we order online and how we find a date or a job. Billions of people use AI-powered applications every day, counting Facebook and Google users alone. This is just the tip of the iceberg when it comes to AI’s potential.

OpenAI, which recently made headlines again for making its models generally available, uses labeled data to “improve language model behavior,” or to make its AI fairer and less biased. This is an important example, as OpenAI’s models were long criticized for being toxic and racist.

Many of the AI applications we use day-to-day require a particular dataset to function well. To create these datasets, we need to label data for AI.

Why does AI need data labeling?

The term artificial intelligence is something of a misnomer. AI is not actually intelligent. It takes in data and uses algorithms to make predictions based on that data. This process requires a large amount of labeled data.
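To make the idea of learning from labeled data concrete, here is a minimal sketch of a nearest-centroid classifier. It is an illustration only, not how any product mentioned in this article works: the two numeric features, the "ok" and "violation" labels, and the function names are all hypothetical.

```python
import math
from collections import defaultdict


def train_centroids(examples):
    """Average the feature vectors of each label to get one centroid per label."""
    sums = {}
    counts = defaultdict(int)
    for features, label in examples:
        if label not in sums:
            sums[label] = [0.0] * len(features)
        for i, value in enumerate(features):
            sums[label][i] += value
        counts[label] += 1
    return {
        label: [total / counts[label] for total in vector]
        for label, vector in sums.items()
    }


def predict(centroids, features):
    """Return the label whose centroid is closest to the given features."""
    return min(centroids, key=lambda label: math.dist(centroids[label], features))


# Hypothetical labeled data: (feature vector, human-assigned label).
labeled_examples = [
    ([1.0, 1.2], "ok"),
    ([0.8, 1.0], "ok"),
    ([3.0, 3.2], "violation"),
    ([3.4, 2.9], "violation"),
]

centroids = train_centroids(labeled_examples)
print(predict(centroids, [1.1, 0.9]))
print(predict(centroids, [3.1, 3.0]))
```

Without the human-assigned labels in `labeled_examples`, the algorithm would have nothing to learn from. Real systems use far more data and far richer models, but the dependency on labels is the same.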

This is particularly the case when it comes to challenging domains like healthcare, content moderation, or autonomous vehicles. In many instances, human judgment is still required to ensure the models are accurate.

Consider the example of sarcasm in social media content moderation. A Facebook post might read, “Gosh, you’re so smart!” However, that could be sarcastic in a way that a machine would miss. More perniciously, a language model trained on biased data can be sexist, racist, or otherwise toxic. For instance, the GPT-3 model once associated Muslims and Islam with terrorism. That changed only after labeled data was used to improve the model’s behavior.

As long as human bias is handled as well, “supervised models allow for more control over bias in data selection,” a 2018 TechCrunch article stated. OpenAI’s newer models are a good example of using labeled data to control bias. Controlling bias with data labeling is vitally important, because low-quality AI models have even landed companies in court. In one case, a firm that attempted to use AI as a screen reader later agreed to a settlement when the model didn’t work as advertised.

The importance of high-quality AI models is making its way into regulatory frameworks as well. For example, the European Commission’s regulatory framework proposal on artificial intelligence would subject some AI systems to obligations including “high quality of the datasets feeding the system to minimize risks and discriminatory outcomes.”

Standardized language and tone analysis are also critical in content moderation. It’s not uncommon for people to have different definitions of the word “literally” or how literally they should take something such as “It was like banging your head against a wall!” To decide which posts are violating community standards, we need to analyze these types of subtleties.

Similarly, the AI startup Handl uses labeled data to more accurately convert documents to structured text. We’ve all heard of OCR (Optical Character Recognition), but with AI powered by labeled data, it’s being taken to a whole new level.

To give another example, to train an algorithm to analyze medical images for signs of cancer, you would need a large dataset of medical images labeled with the presence or absence of cancer. A finer-grained version of this task, commonly referred to as image segmentation, requires labeling the individual regions within each image, which can mean tens of thousands of labeled pixels per image. The more high-quality labeled data you have, the better your model will generally be at making accurate predictions.

It is possible to train AI algorithms on unlabeled data, but this can lead to biased results, which could have serious implications in many real-world cases.

Applications using data labeling

Data labeling is vital for applications across search, computer vision, voice assistants, content moderation, and more.

Search was one of the first major AI use cases relying on human judgment to determine relevance. With labeled data, search results can be extremely accurate. For instance, Yandex turned to human “annotators” from Toloka to help improve its search engine.

Some of the most popular uses of AI in health care include helping to diagnose skin conditions and diabetic retinopathy, boosting recall rates for medication compliance reviews, and analyzing radiologist reports to detect eye conditions like glaucoma.

Content moderation has also seen significant advances thanks to AI applied to large quantities of labeled data. This is especially true for sensitive topics like violence or threats of violence. For example, people may post videos on YouTube threatening suicide, which need to be immediately detected and differentiated from informational videos about suicide.

Another important use of data labeling for AI is understanding voices with any accent or tone, for voice assistants like Alexa or Siri. This requires training an algorithm to recognize a wide range of speech patterns, including male and female voices, based on large volumes of labeled audio.

Human computing for labeling at scale

All this raises the question: How do you create labeled data at scale?

Manually labeling data for AI is an extremely labor-intensive process. It can take weeks or months to label even a few hundred samples this way, and accuracy is often poor, particularly for niche labeling tasks. Additionally, companies need to keep updating their datasets and build bigger datasets than their competitors in order to remain competitive.

The best way to scale data labeling is with a combination of machine learning and human expertise. Companies like Toloka, Appen, and others use AI to match the right people with the right tasks, so the experts do the work that only they can do. This allows firms to scale their labeling efforts. Further, AI can weight the answers from different respondents according to the quality of their responses. This makes it more likely that each label is accurate.
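As a rough illustration of weighting respondents by quality, the sketch below combines several annotators' answers with a weighted vote. It is a simplified assumption, not the method used by Toloka, Appen, or any other vendor: the annotator names, the quality scores, and the `weighted_vote` function are hypothetical.

```python
from collections import defaultdict


def weighted_vote(answers, quality, default_quality=0.5):
    """Pick the label with the highest total annotator quality.

    answers: dict mapping annotator id -> label they chose
    quality: dict mapping annotator id -> quality score (assumed 0..1)
    Returns (winning label, its share of the total weight).
    """
    if not answers:
        raise ValueError("at least one answer is required")
    totals = defaultdict(float)
    for annotator, label in answers.items():
        # Annotators with no known score get a neutral default weight.
        totals[label] += quality.get(annotator, default_quality)
    total_weight = sum(totals.values())
    if total_weight == 0:
        raise ValueError("all annotator weights are zero")
    best = max(totals, key=totals.get)
    return best, totals[best] / total_weight


# Hypothetical example: one reliable annotator disagrees with two weaker ones.
answers = {"ann_a": "sarcastic", "ann_b": "sincere", "ann_c": "sincere"}
quality = {"ann_a": 0.95, "ann_b": 0.40, "ann_c": 0.45}

label, confidence = weighted_vote(answers, quality)
print(label, round(confidence, 2))
```

In this example a plain majority vote would pick "sincere", while the weighted vote sides with the single high-quality annotator. The low share of total weight behind the winning label is a signal that the item may deserve another review.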

With techniques like these, labeled data is fueling a new AI revolution. By combining AI with human judgment, companies can create accurate models of their data. These models can then be used to make better decisions that have a measurable impact on businesses.

Frederik Bussler is a consultant and analyst, with experience across innovative AI platforms such as Commerce.AI, Obviously.AI, and Apteo, as well as investment offices such as Supercap Digital, Maven 11 Capital, and Invictus Capital. He is featured in Forbes, Yahoo, and other outlets, and has presented for audiences including IBM and Nikkei.
