Yahoo is announcing today that it has open-sourced algorithms for doing very quick and efficient computations on streams of data. The Java-based Data Sketches algorithms are now available for download on GitHub under an Apache License.
It’s a type of technology that researchers have talked about more and more in the academic literature, usually with different names. But they share some key qualities. For one, they can handle streaming data because they touch the data just one time. And they are additive, so you can add up or merge these calculations together. Most interestingly, they are approximate. There are great benefits to this.
“The whole science is based on a very fundamental function: If you can tolerate a little bit of error in your results, then you can improve the speed of the computation immensely,” Lee Rhodes, an architect for Yahoo’s Advertising and Data Platforms division, told VentureBeat in an interview.
Imagine that you want to count things, such as the number of people who visited both Yahoo Finance and Yahoo Sports in a single day. If you try to count exactly how many people visited, you can get an answer — as long as you have plenty of disk space, memory, and time on your hands. It’s very difficult. Yahoo naturally wanted to optimize this type of calculation.
In addition to speeding up counts, Data Sketches can also do certain types of joins more quickly than exact computations. An exact count of 100 million values that Rhodes showed me took two and a half minutes, while a sketch with one of his algorithms took just 2.7 seconds.
The sketching technology is working inside of many Yahoo properties. Yahoo-owned Flurry uses it to calculate real-time counts. Yahoo’s email service and search engine use it, as do some Yahoo audience data services. So these are production tools, not “toy” algorithms documented only in papers, Rhodes said.
And now it’s open source, meaning that others can try it out and build their own systems based on it. (Like other web companies, Yahoo regularly releases open source code — earlier this week it open-sourced the Anthelion web crawler for structured data.)
Some people might be concerned about what happens if you make approximate calculations. But Rhodes isn’t worried about that.
“They have constant relative accuracy independent of how big the input stream is — say, plus or minus 1.5 percent,” he said. All the while, he said, the algorithms will use around 100KB of memory and no disk space.
The library integrates with Hive and Pig, two tools in the Hadoop open source big data software ecosystem, as well as the Druid open-source data store. And it works easily with the Maven build management tool.