Want to cut out sugar or add super-foods to your diet? Now app developers can help. Data curator Factual just added ingredient lists for over 350,000 of the most popular consumer packaged goods (CPGs) and nutrition parameters for over 150,000 of them to its Global Products API.
“It’s been a fun and difficult project,” said Eva Ho, a Factual vice president and a self-confessed health nut. “Nutritional data and ingredients are really difficult to normalize. After tackling CPG products, which we think will power a load of really wonderful couponing and commerce applications, we are moving on to things like electronics.”
When it comes to data, both Ho and founder Gil Elbaz know their stuff. Their last company, Applied Semantics, provided the basic technology for Google’s AdSense, which earns Google close to $10 billion a year.
The new ingredients list allows developers to exclude products containing high fructose corn syrup or find those with green tea, a group that, rather surprisingly, includes everything from shampoo to dog treats. Developers can also query calories, saturated fat, cholesterol, sugars, sodium, and eight other parameters. If you want to avoid a mid-morning sugar slump, check out the sugar content of popular breakfast cereal brands as shown on the chart below. Factual has also added EAN-13, a 13-digit code used to identify retail products worldwide, alongside UPC, the US standard.
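For developers handling both barcode formats, here is a minimal Python sketch of how an EAN-13 code can be checked and how a 12-digit UPC-A maps onto it. It uses the standard EAN-13 check-digit rule and is not taken from Factual's API; the function names are illustrative only.

```python
def ean13_check_digit(first12: str) -> int:
    """Return the check digit for the first 12 digits of an EAN-13 code."""
    if len(first12) != 12 or not (first12.isascii() and first12.isdigit()):
        raise ValueError("expected exactly 12 digits")
    # Counting from the left, odd positions weigh 1 and even positions weigh 3.
    total = sum(int(d) * (3 if i % 2 else 1) for i, d in enumerate(first12))
    return (10 - total % 10) % 10


def is_valid_ean13(code: str) -> bool:
    """True if code is 13 digits and its last digit matches the checksum."""
    if len(code) != 13 or not (code.isascii() and code.isdigit()):
        return False
    return ean13_check_digit(code[:12]) == int(code[12])


def upc_a_to_ean13(upc: str) -> str:
    """A 12-digit UPC-A becomes an EAN-13 by prepending a zero."""
    if len(upc) != 12 or not (upc.isascii() and upc.isdigit()):
        raise ValueError("expected a 12-digit UPC-A code")
    return "0" + upc
```

Because a UPC-A is simply an EAN-13 with a leading zero, an app can normalize every scanned code to 13 digits before looking a product up.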
Factual currently provides clean and structured data sets on places (Facebook was the company’s first customer), restaurants, healthcare providers, hotels, and consumer products. The company recently added real-time updates, which allow anyone to contribute data to a particular data set and have it validated by Factual’s machine learning algorithms.
“You can say, ‘This product is not kosher.’ We will then make our best guess on how that new piece of data affects the model,” says Ho. “The update gets pushed out into the production-ready data set in less than 10 seconds.”
Factual spent three years developing its automated processes for structuring and cleaning up data sets. The data comes from partners (with which Factual has many “data swap” deals), users, and the web.
“We look at millions of web pages and extract facts using machine learning and other techniques,” Ho explains. “These signals are not always obvious. From a Foursquare check-in you can deduce hours of operation. We are not just scraping opening hours from a website.” Each new input is assessed based on its source, with trusted sources given higher weight. The raw data is turned into facts by mapping it to a known semantic data type, such as a phone number or a zip code. Each new fact is matched against the existing database to eliminate duplication and is stored if it is supported by multiple trusted sources.
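The steps Ho describes can be sketched roughly as follows. This is a toy illustration, not Factual's actual system: the source names, trust weights, regular expressions, and thresholds are all hypothetical, and simple pattern matching stands in for the machine learning the company uses.

```python
import re
from collections import defaultdict

# Hypothetical trust weights per source; the article gives no actual values.
SOURCE_TRUST = {"partner_feed": 0.9, "user_update": 0.5, "web_crawl": 0.3}

# Hypothetical semantic types: a pattern to recognize the raw value and a
# normalizer so that differently formatted duplicates collapse to one fact.
SEMANTIC_TYPES = {
    "zip_code": (re.compile(r"^\d{5}(-\d{4})?$"), lambda s: s[:5]),
    "phone": (re.compile(r"^\(?\d{3}\)?[ .-]?\d{3}[ .-]?\d{4}$"),
              lambda s: re.sub(r"\D", "", s)),
}


def to_fact(raw):
    """Map a raw string to (type_name, normalized_value), or None."""
    raw = raw.strip()
    for type_name, (pattern, normalize) in SEMANTIC_TYPES.items():
        if pattern.match(raw):
            return type_name, normalize(raw)
    return None


def accept_facts(observations, min_sources=2, min_trust=1.0):
    """Keep facts backed by several sources with enough combined trust.

    observations: iterable of (entity_id, raw_value, source) tuples.
    """
    support = defaultdict(dict)  # (entity, type, value) -> {source: weight}
    for entity_id, raw, source in observations:
        fact = to_fact(raw)
        if fact is None or source not in SOURCE_TRUST:
            continue  # unrecognized data type or unknown source
        support[(entity_id, *fact)][source] = SOURCE_TRUST[source]
    return {
        key for key, sources in support.items()
        if len(sources) >= min_sources and sum(sources.values()) >= min_trust
    }
```

In this sketch, a phone number reported as "(310) 555-0100" by a partner feed and as "310-555-0100" by a web crawl normalizes to the same fact and is accepted, while a zip code seen in only one source is held back.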
Factual’s data sets are currently used most often in local search, applications with geographical components, and e-health. Within large companies, they are used for CRM, credit scoring, supply chain analysis, and customer targeting. “What’s really fun is seeing a large financial services company using it both on the enterprise side for analytics as well as building, say, a restaurant application,” says Ho. “So we are powering their digital initiatives as well as internal operational efficiencies. That’s been a little surprising for us.”
So what’s the next big trend in the data business? “Companies will realize that by pooling their information, they will create a new data platform on which lots of new applications can sit,” says Ho. “In the past, the CPG manufacturers only had to worry about their data being published in Amazon and drugstore.com.
“Now there are thousands of mobile apps out there and their data is completely misrepresented. The pictures are wrong, the descriptions are wrong. If you want your data to be accurate, go to one central hub which holds it and cleans it.”
Factual wants to be the Switzerland of data: tidy, neutral, and very likely rich. It may just get there.