DefinedCrowd raises $50.5 million for AI data set curation

Seattle-based DefinedCrowd, which describes itself as a "smart" data curation platform, today announced that it has raised $50.5 million in equity financing. CEO and founder Daniela Braga says the proceeds will be used to expand the company's existing solutions, launch subscription-based offerings, and grow DefinedCrowd's international reach.

Training AI algorithms typically requires high-quality labeled data, which is why crafting corpora can take nearly as long as -- and oftentimes longer than -- developing the models that ingest them. It's a problem DefinedCrowd aims to solve with a bespoke model-training service for clients in customer service, automotive, retail, health care, and other enterprise segments.

Braga, who holds a Ph.D. in speech technology, is familiar with the ins and outs of data set curation. Prior to founding DefinedCrowd, she oversaw a $14 million effort to improve Microsoft's AI-powered Cortana voice assistant, which she described as an uphill battle. Roughly 18 months of every product development cycle went to procuring data to refresh the underlying models.

DefinedCrowd's approach employs a community -- via Neevo -- of more than 290,000 contributors (up from 45,000 two years ago) in 195 countries who complete paid jobs that involve labeling, typing, and speaking words and phrases. They supply well over 500,000 samples a day to the data sets available through DefinedCrowd's natural language processing, voice recognition, and computer vision tools.

Via APIs and a web interface, DefinedCrowd's customers can filter demographics, specifying the age, location, and gender of Neevo members and even their proficiency in a language for applications like transcription, voice emotion tagging, text sentiment and semantic annotation, question and answer collection, and spontaneous speech. The platform supports over 50 languages and 79 dialects, or about 90% of the world's most widely spoken languages, with a claimed labeling accuracy of up to 98%.

DefinedCrowd's real value proposition is arguably its extensibility. Customers can use the platform not only to train models from scratch within budgetary constraints, but also to augment existing models with data sets tailored to specific technical needs. Those with simpler requirements can take advantage of specialized workflows, templates, and off-the-shelf solutions or upload their own proprietary data sets, all while getting live cost estimates and a dashboard for viewing real-time progress.

For instance, developers of a news curation skill on Amazon's Alexa platform could use DefinedCrowd to generate multiple data sets to improve the algorithm's performance across markets.

DefinedCrowd, which saw 656% year-over-year revenue growth last year, counts Fortune 500 companies like BMW, Mastercard, Nuance, and Yahoo Japan among its clientele. The company's staff of over 100 people is spread across offices in Portugal, Seattle, and Japan, and DefinedCrowd plans to grow its workforce to 500 and open additional R&D labs by 2021.

This latest round brings DefinedCrowd's total raised to $63.4 million, following an $11.8 million raise in July 2018, and included participation from new investors Semapa Next and Hermes GPE. Existing investors Evolution Equity Partners, Kibo Ventures, Portugal Ventures, Bynd Venture Capital, EDP Ventures, and IronFire Ventures also participated. They joined long-term backers that include Amazon Alexa Fund, Sony Innovation Fund, and Mastercard.

It's worth noting that DefinedCrowd isn't the only startup vying for a slice of the over $5 billion data annotation tools market. There's Scale AI, which recently raised $100 million for its extensive suite of data labeling services, and CloudFactory, which last November nabbed $65 million for its data processing and prep tools. That's not to mention Mighty AI, Hive, Appen, and Alegion.
