MIT CSAIL's machine learning algorithm helps predict patterns in large data streams

Ever heard of the "Britney Spears problem"? Contrary to what it sounds like, it's got nothing to do with the dalliances of the rich and famous. Rather, it's a computing puzzle related to data tracking: Precisely tailoring a data-rich service, like a search engine or fiber internet connection, to individual users hypothetically requires tracking every packet sent to and from the service provider, which needless to say isn't practical. To get around this, most companies leverage algorithms that make guesses about the frequency of data exchanged by hashing it (i.e., divvying it up into pieces). But this necessarily sacrifices nuance -- telling patterns that emerge naturally in large data volumes fly under the radar.

Luckily, researchers at MIT's Computer Science and Artificial Intelligence Laboratory (CSAIL) believe they've devised a viable alternative that relies on machine learning. In a newly published paper ("Learning-Based Frequency Estimation Algorithms"), they describe a system -- dubbed LearnedSketch, because of the way it "sketches" data in a data stream -- that predicts if specific data elements will appear more frequently than others and, if they in fact do, autonomously separates them from the rest of the hashed portions.

The paper's authors say it's the first machine learning-based approach not only for frequency estimation, but for streaming algorithms, a class of algorithms in which input data is presented as a sequence and can be examined only in a few passes. They're popularly used in security systems and natural language processing pipelines, among many applications.

"[S]treaming algorithms typically assume generic data and do not leverage useful patterns or properties of their input," the team explains. "For example, in text data, the word frequency is known to be inversely correlated with the length of the word. Analogously, in network data, certain applications tend to generate more traffic than others. If such properties can be harnessed, one could design frequency estimation algorithms that are much more efficient than the existing ones."

In experiments, LearnedSketch showed an aptitude for detecting and isolating rich bits of data. For instance, trained on 210 million data packets from a Tier 1 ISP, it outperformed existing approaches for estimating the amount of internet traffic in a network, achieving upwards of 57 percent less error. And given 3.8 million unique AOL queries, it managed to estimate the number of queries for an internet search term with upwards of 71 percent less error.

Moreover, LearnedSketch was highly generalizable; the structures it learned could be applied to items it hadn't seen before. In one experiment that tasked it with determining which internet connections had the most traffic, it clustered different connections by the prefix of their destination IP address, indicating an awareness of the rule that internet subscribers which generate large traffic tend to share a particular prefix.

The researchers believe that LearnedSketch (or an AI system like it) might someday be used to track trending topics on social media, or to identify troublesome spikes in web traffic and improve ecommerce sites' product recommendations. But really, said PhD student and coauthor Chen-Yu Hsu, the sky's the limit.

"These kinds of results show that machine learning is very much an approach that could be used alongside the classic algorithmic paradigms like 'divide and conquer' and dynamic programming," Hsu added. "We combine the model with classical algorithms so that our algorithm inherits worst-case guarantees from the classical algorithms naturally."

The research is scheduled to be presented in May at the International Conference on Learning in New Orleans.