How AI researchers used Bing search results to reveal disease knowledge gaps in Africa

What can you learn about the health needs of a population through search query data? Where are there opportunities to serve communities with little reliable data, and how can AI play a role?

Those are the questions Rediet Abebe has attempted to answer through web-based data like search engine queries and social media data.

To dive deeper into how data and AI can help address U.S. public health emergencies -- like the nation's disproportionately high maternal mortality rate -- Abebe currently serves on a 12-member body advising the National Institutes of Health on how machine learning can be better integrated into biomedical and clinical research.

"They want us to envision what kind of stuff we'd do to create real bridges between AI and biomedical and public health research," Abebe said. "I'm really excited about the broad set of techniques we have and the unique style of doing research that the AI community has and using that to help address problems that impact underserved and marginalized communities."

She's joined on the advisory committee exploring interdisciplinary approaches by Google AI senior research scientist Greg Corrado, Intel principal engineer Michael McManus, Verily engineering director David Glazer, and AI Now Institute cofounder Kate Crawford, as well as professors from Stanford University, MIT, and other universities.

The group will deliver interim findings in June and final thoughts to NIH director Francis Collins in December.

A cofounder of Black in AI who grew up in Ethiopia, Abebe is passionate about combining AI and data to help marginalized communities. Her work has grown from a 2016 project at Microsoft Research to explore the health needs of people in Africa.

How search results suggest answers

Using topic models and natural language processing, Abebe combed through 18 months of Bing search results for all 54 nations on the African continent to assess queries related to HIV/AIDS, malaria, and tuberculosis. Automation then created categories based on subject matter. The total number of queries included in the paper was not disclosed.
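To make the filtering and grouping step concrete, here is a minimal sketch in Python. It is not the researchers' pipeline: a hand-written keyword lookup stands in for the learned topic model, and the term lists, theme names, and most of the sample queries are made up for illustration.

```python
from collections import Counter, defaultdict

# Hypothetical disease terms and theme keywords, chosen for illustration only.
# The study used a topic model to discover themes; this lookup is a stand-in.
DISEASE_TERMS = {
    "hiv": "HIV/AIDS",
    "aids": "HIV/AIDS",
    "malaria": "malaria",
    "tuberculosis": "tuberculosis",
    "tb": "tuberculosis",
}
THEME_KEYWORDS = {
    "symptoms": {"symptom", "symptoms", "signs"},
    "natural cures": {"cure", "garlic", "moringa", "herbal"},
    "stigma": {"stigma", "discrimination", "boss", "fire"},
}


def tokenize(query):
    """Lowercase the query and split it on anything that is not a letter or digit."""
    cleaned = "".join(ch.lower() if ch.isalnum() else " " for ch in query)
    return cleaned.split()


def categorize(queries):
    """Count queries per (disease, theme), skipping queries that name no disease."""
    counts = defaultdict(Counter)
    for query in queries:
        tokens = set(tokenize(query))
        diseases = {DISEASE_TERMS[t] for t in tokens if t in DISEASE_TERMS}
        themes = [name for name, words in THEME_KEYWORDS.items() if tokens & words]
        for disease in diseases:
            # Queries that match no theme keyword fall into an "other" bucket.
            for theme in themes or ["other"]:
                counts[disease][theme] += 1
    return counts


# Made-up sample queries (not data from the study).
sample_queries = [
    "can garlic cure HIV",
    "malaria symptoms in children",
    "I'm HIV positive, can my boss fire me?",
    "football scores",
    "tb signs",
]
result = categorize(sample_queries)
for disease, themes in sorted(result.items()):
    print(disease, dict(themes))
```

A real topic model would learn the themes from the queries themselves rather than from a fixed keyword list, which is what lets unexpected concerns surface.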

Separately, Facebook AI researchers have used artificial intelligence to create population density maps of Africa for public health and aid organizations.

Results were published last year in a paper coauthored by Shawndra Hill and Jennifer Vaughan of Microsoft Research, along with Peter Small and Andrew Schwartz of Stony Brook University. The paper was recently accepted for publication by the International AAAI Conference on Web and Social Media, scheduled to take place in June in Munich, Germany.

The AI also categorizes words and topics most associated with specific diseases. For example, women were more interested in questions related to pregnancy or breastfeeding, while men were more interested in news stories about people who say they've been cured of HIV.

Search results demonstrated that women and users aged 18-24 are more concerned about stigma than other groups, and natural cure searches were highest in the 35-49 age group. Cure myths that often appear in search results include the prayers of Nigerian prophets, moringa seed oil, and garlic.

The results also highlight such questions as: "I'm HIV positive, can my boss fire me?" and "What are my legal protections?" Or "What ways can you mitigate stigma in social settings?"

Annotators with graduate level experience were then invited to examine topics like natural cures, symptoms, stigma, and drugs to assess the objectivity, accuracy, and relevance of results.

"What we found was that for searches related to natural cures and remedies, people were getting web pages that have serious issues with accuracy, effectiveness, and relevance," Abebe said.

A correlation was found between the rate of stigma-related searches and high rates of HIV.
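The article does not say which correlation measure the researchers used. As one common option, the sketch below computes a Pearson correlation coefficient in plain Python; the function name `pearson_r` and the per-country numbers are placeholders invented for illustration, not figures from the study.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length with at least 2 values")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined when one sequence is constant")
    return cov / math.sqrt(var_x * var_y)


# Toy placeholder values, NOT figures from the study:
# share of HIV queries about stigma, and HIV prevalence (%), per hypothetical country.
stigma_share = [0.02, 0.04, 0.05, 0.08]
hiv_prevalence = [1.0, 2.5, 3.0, 6.0]

r = pearson_r(stigma_share, hiv_prevalence)
print(f"Pearson r = {r:.2f}")
```

A high coefficient on numbers like these would only describe an association across countries; it says nothing about cause and is not a way to estimate prevalence from searches.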

People with medical experience were asked to participate in this portion of the work in order to compensate for the lack of relevant health experience among AI researchers.

'Grain of salt' data

Abebe likes the use of search query results because, unlike a survey that asks pointed questions, search results are open-ended and can provide insights into people's concerns and their lived experiences.

As with any data derived from the internet, there are a number of caveats: the majority of searches are in English; internet connection rates are rising fast, but sizable portions of African nations still lack web access; and people self-identified by using the name of a disease in their searches.

The study also makes no attempt to follow subsequent searches to trace the evolution of search patterns.

So while Abebe shares results with public health officials in countries like Ethiopia, Ghana, Nigeria, and South Africa, she cautions that the results should never replace manually collected ground truth data and that any attempt to do so could be dangerous.

"It's really more a guideline than it is ground truth that you can really rely on," she said.

Attempts to create predictive systems based on search query results could also have an adverse impact on public health.

When sharing the results of the study with health officials in Africa, Abebe said, it became clear that some officials are well aware of the prevalence of results claiming Nigerian prophets, moringa seed oil, or garlic can cure HIV, but they were less informed about the questions related to discrimination or stigma.

Abebe also cautions against the creation of predictive systems to go alongside query results.

"The obvious questions that you could ask as a computer scientist is [whether] we can use search data for the HIV prevalence rate. But that's not really the question we tried to answer here, because you don't even have the ground truth data. We know that HIV prevalence rates are misreported in many countries," she said.

The most famous example of predictive AI based on search results gone wrong is likely the Google Flu Trends project, which was shut down in 2015 after producing wildly inaccurate results.
