AI uses Google Trends data to predict how many people will get the flu

Influenza results in over 31.4 million outpatient visits and more than 200,000 trips to emergency rooms and hospitals each year, according to estimates. The flu outbreak occurring between 2017 and 2018 alone -- one of the longest and most severe in recent years -- caused roughly 80,000 deaths and nearly 1 million hospitalizations.

Needless to say, there is plenty of incentive to predict the scope and severity of flu outbreaks, and researchers investigating AI-augmented forecasting are making headway. In a newly published paper ("Sequence to Sequence with Attention for Influenza Prevalence Prediction using Google Trends") on arXiv.org, scientists from the University of Tokyo describe a system that taps data from Google Trends, a tool that analyzes the popularity of top search queries in Google Search, to improve forecast accuracy. They report that their approach achieves state-of-the-art results in preliminary tests.

"The prediction of influenza in its early stages reduces its impact, along with determining the number of vaccines and other anti-influenza drugs that help the medical personnel to make the correct decision," wrote the paper's coauthors. "Various studies have been conducted to predict the number of influenza-infected people. However, they are not highly accurate, especially in the distant future."

The team leveraged a type of AI model known as sequence-to-sequence with attention, which processes input data selectively based on internal signals. Like most machine learning systems, sequence-to-sequence models consist of layers of mathematical functions -- neurons -- that ingest data and pass it along to subsequent layers; during training, the strength of the connections among neurons (weights) is adjusted. An encoder component outputs encoded vectors (mathematical representations) corresponding to inputs, while a decoder decodes those vectors and predicts the outputs for the next time steps.
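To make the attention step more concrete, here is a minimal sketch in plain Python. It assumes dot-product scoring, which is one common choice; the article does not say which scoring function the paper uses, and the function names and toy vectors below are purely illustrative.

```python
import math

def softmax(scores):
    # Subtract the maximum before exponentiating for numerical stability.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attend(decoder_state, encoder_states):
    # Score each encoder vector against the current decoder state
    # (dot-product scoring is an assumption, not the paper's stated method).
    scores = [sum(d * e for d, e in zip(decoder_state, h)) for h in encoder_states]
    weights = softmax(scores)
    # The context vector is the attention-weighted sum of encoder vectors.
    dim = len(encoder_states[0])
    context = [sum(w * h[i] for w, h in zip(weights, encoder_states)) for i in range(dim)]
    return weights, context

# Toy encoder outputs for three input time steps (made-up values).
encoder_states = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
decoder_state = [2.0, 0.0]
weights, context = attend(decoder_state, encoder_states)
```

The weights sum to 1, and the encoder steps most similar to the decoder state receive the largest share, which is the sense in which the model reads its input "selectively."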

As for the aforementioned Google Trends data, the researchers used it to gauge people's interest in the flu at any given point in time. Specifically, they homed in on the retrieval frequency of the word "influenza" as supplemental information for the model, which helped to compensate for dark data (data that's acquired but not used to derive insights) in a corpus of influenza-like illnesses compiled from hospitals by the U.S. Centers for Disease Control and Prevention.

All told, the team used the unweighted percentage of people infected with flu-like illnesses across six states (New York, Oregon, California, Illinois, Texas, and Georgia) selected for their climate diversity. The researchers combined the figures with state-targeted Google Trends data from October 10, 2010 to December 30, 2018 (430 weeks). About 67% of the data was used to train the AI model and the remaining 33% or so to test it.
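As a rough illustration of how a weekly series like this might be prepared, the sketch below splits 430 weeks in time order, with about 67% for training, and cuts the training portion into input/target windows. The chronological split, the eight-week input length, and the helper names are assumptions made for the example; only the 430-week span, the 67% figure, and the one-to-four-week horizon come from the article.

```python
def chronological_split(series, train_fraction=0.67):
    # Keep time order: earlier weeks train the model, later weeks test it.
    cut = int(len(series) * train_fraction)
    return series[:cut], series[cut:]

def make_windows(series, input_len, horizon):
    # Pair each run of `input_len` past weeks with the `horizon` weeks after it.
    pairs = []
    for start in range(len(series) - input_len - horizon + 1):
        past = series[start:start + input_len]
        future = series[start + input_len:start + input_len + horizon]
        pairs.append((past, future))
    return pairs

# Placeholder week indices stand in for the real weekly flu percentages.
weeks = list(range(430))
train, test = chronological_split(weeks)

# Hypothetical setup: 8 weeks of history to forecast the next 4 weeks.
train_pairs = make_windows(train, input_len=8, horizon=4)
```

Splitting by time rather than at random keeps future weeks out of the training set, which matters when the goal is forecasting.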

The paper's coauthors say that in tests the sequence-to-sequence model with attention had a "significantly higher" Pearson correlation (.996) -- a measure of the linear correlation between two variables -- than baseline models for all six states over a prediction period of one to four weeks. Additionally, they note that it showed a root mean squared error of 0.67, indicating that the data was relatively concentrated around the line of best fit.
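For readers unfamiliar with the two metrics, the following sketch computes both from their standard definitions. The sample values are invented for illustration and are not taken from the paper.

```python
import math

def pearson(xs, ys):
    # Covariance of the two series divided by the product of their spreads.
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)

def rmse(actual, predicted):
    # Square root of the average squared difference between the series.
    squared = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared) / len(squared))

# Made-up weekly values: the forecast tracks the trend but runs 0.5 too high.
actual = [1.0, 2.0, 3.0, 4.0]
predicted = [1.5, 2.5, 3.5, 4.5]

r = pearson(actual, predicted)
error = rmse(actual, predicted)
```

In this toy case the correlation is essentially 1 while the error is 0.5, which shows why the two figures are reported together: correlation captures whether the forecast moves with the real numbers, and root mean squared error captures how far off it is.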

The researchers caution that the peak value shifted downward as prediction time increased because the peak time couldn't be accurately predicted from the learning data. However, they believe the addition of a leading indicator -- which they leave to future work -- might address the problem by further improving accuracy.
