DefinedCrowd raises $11.8 million to create bespoke datasets for AI model training

Gathering data on which to train machine learning models is no walk in the park. Well-trained algorithms require well-labeled, high-quality sources, which is why collating a dataset can take nearly as long as -- and oftentimes longer than -- developing the models that will eventually ingest it.

That's a problem DefinedCrowd aims to solve. The three-year-old Seattle-based startup, which describes itself as a "smart" data curation platform, offers a bespoke model-training service to clients in customer service, automotive, retail, health care, and other enterprise sectors. Today it announced that it has raised $11.8 million in a funding round led by Evolution Equity Partners, Mastercard, Kibo Ventures, and Energias de Portugal (EDP), with additional capital from existing investors Sony, Portugal Ventures, Amazon, and Busy Angels.

"Data needs to be of high quality -- it can hurt the brand if it isn't," Daniela Braga, CEO of DefinedCrowd, told VentureBeat in a phone interview. "Simply put, we make easy the process of collecting and annotating high-quality training data for model training."

Braga, who holds a Ph.D. in speech technology, is intimately aware of data collection's Sisyphean nature. Prior to founding DefinedCrowd, she oversaw a $14 million effort to improve Cortana, Microsoft's AI-powered voice assistant, which she described as an uphill battle. Roughly 18 months of every product development cycle was spent procuring data to refresh the underlying models, she said.

"We were never at the place we needed to be in-house, which is when I realized there was a gap for enterprise corporations," she said. "They needed a partner who could [generate] large amounts of data at scale with high quality."

Braga found her answer in crowdsourcing. DefinedCrowd relies on a community called Neevo, made up of more than 45,000 human contributors who complete jobs involving labeling, typing, and speaking words and phrases. They upload more than 500,000 units of data per day to the datasets that populate DefinedCrowd's natural language processing, voice recognition, and computer vision tools.

Through APIs and a web interface, these tools let DefinedCrowd's customers filter contributors by demographics in fine detail -- they can specify the age, location, and gender of contributing members, and even their proficiency in a given language. The platform supports 46 languages, or about 90 percent of the world's most widely spoken languages, with up to 98 percent accuracy.
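To make the demographic filtering described above concrete, here is a minimal sketch in Python. It is not DefinedCrowd's API, which this article does not document; the `Contributor` and `DemographicFilter` classes, the field names, and the 1-to-5 proficiency scale are all hypothetical, chosen only to illustrate selecting contributors by age, location, gender, and language proficiency.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Contributor:
    """Hypothetical contributor record; not DefinedCrowd's actual schema."""
    contributor_id: str
    age: int
    country: str
    gender: str
    # Language code -> self-reported proficiency on an assumed 1-5 scale.
    proficiency: dict = field(default_factory=dict)


@dataclass
class DemographicFilter:
    """Hypothetical filter spec; a criterion left as None is not applied."""
    min_age: Optional[int] = None
    max_age: Optional[int] = None
    countries: Optional[set] = None
    genders: Optional[set] = None
    language: Optional[str] = None
    min_proficiency: int = 1

    def matches(self, c: Contributor) -> bool:
        if self.min_age is not None and c.age < self.min_age:
            return False
        if self.max_age is not None and c.age > self.max_age:
            return False
        if self.countries is not None and c.country not in self.countries:
            return False
        if self.genders is not None and c.gender not in self.genders:
            return False
        if self.language is not None:
            # A contributor with no entry for the language counts as level 0.
            if c.proficiency.get(self.language, 0) < self.min_proficiency:
                return False
        return True


def select_contributors(pool, spec):
    """Return the contributors in `pool` that satisfy every set criterion."""
    return [c for c in pool if spec.matches(c)]


# Made-up sample data, for illustration only.
pool = [
    Contributor("c1", 31, "PT", "female", {"pt": 5, "en": 3}),
    Contributor("c2", 52, "BR", "male", {"pt": 5}),
    Contributor("c3", 27, "US", "female", {"en": 5, "pt": 2}),
]

# Fluent Portuguese speakers aged 25 to 40 located in Portugal or Brazil.
spec = DemographicFilter(
    min_age=25,
    max_age=40,
    countries={"PT", "BR"},
    language="pt",
    min_proficiency=4,
)
selected = select_contributors(pool, spec)
selected_ids = [c.contributor_id for c in selected]
```

In this made-up pool only `c1` satisfies every criterion: `c2` is outside the age range and `c3` is outside the requested countries.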

But its real value proposition is its flexibility, Braga said. Customers can use DefinedCrowd's platform not only to train machine learning models from scratch, but to augment existing models with datasets tailored to their specific needs. Those with simpler requirements, meanwhile, can take advantage of specialized workflows, templates, and off-the-shelf solutions.

Picture this: A news curation skill on Amazon's Alexa platform has a large contingent of international users, and so its developers need to train a voice recognition model that's equally accurate across markets. With DefinedCrowd's tools, they could generate multiple datasets to improve the algorithm's performance.

DefinedCrowd, which has grown by a factor of six year-over-year, counts Fortune 500 companies including BMW, Mastercard, Nuance, and Yahoo Japan among its lengthy list of clients. Its staff of more than 40 people is spread among offices in Portugal, Seattle, and Japan, and it hopes to hire an additional 40 by the end of this year.

The company will use the funding to expand its product offerings, grow its developer and sales teams, and increase its global footprint.
