How Wikimedia is using machine learning to spot missing citations

One of the more compelling use cases for AI is automating mission-critical tasks that humans don't want to do, or can't do. Wikipedia ran into just such a problem with its citations. With crowdsourced content, citations are crucial to providing accuracy and reliability in the site's vast ocean of articles, but according to a blog post from the Wikimedia Foundation, around 25% of Wikipedia's English-language articles lack a single citation. "This suggests that while around 350,000 articles contain one or more 'citation needed' flags, we are probably missing many more," reads the post.

Citations needed

Anyone who's spent time on Wikipedia has seen that more citations, generally, would be helpful, especially considering the site's verifiability policy, which states in part, "All quotations, and any material whose verifiability has been challenged or is likely to be challenged, must include an inline citation that directly supports the material." In an email interview, Jonathan Morgan, senior design researcher and coauthor of Wikimedia's "Citation Needed" study, noted that accuracy isn't the only advantage. "Citations not only allow Wikipedia readers and editors to fact-check information, they also provide jumping-off points for people who want to learn more about a topic," he said.

The challenge for Wikipedia is not merely adding more citations, though; it's understanding where citations are needed in the first place. That's a laborious process in and of itself. To solve this twofold problem, Wikimedia developed a twofold solution. Step one was to build a framework for deciding where citations need to go and to create a data set. Step two was to train a machine learning classifier to scan and flag those items across Wikipedia's hundreds of thousands of articles.

How they got there

A roster of 36 English, Italian, and French Wikipedia editors was given text samples and asked to put together a taxonomy of reasons why a statement would need a citation, and reasons why it wouldn't. For example, if "the statement contains statistics or data" or "the statement contains technical or scientific claims," you'd need a citation. If "the statement only contains common knowledge" or "the statement is about a plot or character of a book/movie that is the main subject of the article," you would not.

With a set of guidelines in place, Wikimedia's researchers created a data set upon which to train a recurrent neural network (RNN). In the blog post, the researchers said, "We created a data set of English Wikipedia's 'featured' articles, the encyclopedia's designation for articles that are of the highest quality -- and also the most well-sourced with citations." The setup for the training was fairly simple: When a line in a given featured article had a citation, it was marked as "positive," and a line that did not have a citation was marked as "negative." Then, based on the sequence of words in a given sentence, the RNN was able to classify citation needs with 90% accuracy, according to Wikimedia's post.
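To make that labeling step concrete, here is a minimal Python sketch of how sentences might be turned into positive and negative training examples. It is an illustration only, not Wikimedia's actual pipeline: the citation pattern (bracketed numbers such as [3], or <ref> tags), the function names, and the sample sentences are all assumptions made for the example.

```python
import re

# Assumption: a citation shows up either as a bracketed number like [3]
# or as a <ref>...</ref> / <ref ... /> tag. Real article markup is messier.
CITATION_PATTERN = re.compile(r"\[\d+\]|<ref[^>]*/>|<ref[^>]*>.*?</ref>")


def label_sentence(sentence):
    """Return (clean_text, label): label is 1 if the sentence had a citation, else 0."""
    has_citation = CITATION_PATTERN.search(sentence) is not None
    # Strip the marker so a model would have to learn from the words themselves.
    clean_text = CITATION_PATTERN.sub("", sentence).strip()
    return clean_text, int(has_citation)


def build_dataset(sentences):
    """Label every sentence taken from a (hypothetical) featured article."""
    return [label_sentence(s) for s in sentences]


# Hypothetical sample sentences, invented for this sketch.
examples = [
    "The city's population was estimated at 2.1 million in 2010.[3]",
    "The novel opens with the narrator arriving in town.",
]
dataset = build_dataset(examples)
for text, label in dataset:
    print(label, text)
```

Removing the citation marker before training matters in a setup like this; otherwise a classifier could simply learn to spot the marker instead of the wording that signals a claim needing support.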

For linguistics nerds, the analysis is particularly interesting. The model understood that the word "claimed" was likely an opinion statement, and that within the topic of statistics, the word "estimated" indicated a need for a citation.

To take the process a step further, Wikimedia's researchers created a second model that could add reasons to its citation classifications. Using Amazon's Mechanical Turk, they pulled in human minds for the task and gave participants some 4,000 sentences that had citations. The participants were asked to apply one of eight labels -- like "historical" or "opinion" -- to show the reason why a citation was needed. With that data in hand, the researchers modified their RNN so that it assigned an unsourced sentence to one of those eight categories.
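Crowdsourced labels usually have to be reconciled before they can serve as training data. The article does not say how the researchers did this, so the Python sketch below assumes a common approach, a simple majority vote across several participants per sentence. The function name, the sample sentences, and the votes are hypothetical; only the "historical" and "opinion" labels come from the study as described above.

```python
from collections import Counter


def majority_label(votes):
    """Return the most frequent label, or None if there are no votes or a tie for first."""
    ranked = Counter(votes).most_common()
    if not ranked:
        return None
    if len(ranked) > 1 and ranked[0][1] == ranked[1][1]:
        return None  # no clear winner; such a sentence could be set aside
    return ranked[0][0]


# Hypothetical annotations: each sentence labeled by three participants.
annotations = {
    "Critics called the film a masterpiece.": ["opinion", "opinion", "historical"],
    "The treaty was signed in 1648.": ["historical", "historical", "historical"],
    "The plan was widely seen as a turning point.": ["opinion", "historical"],
}

reason_labels = {sentence: majority_label(votes) for sentence, votes in annotations.items()}
for sentence, label in reason_labels.items():
    print(label, "|", sentence)
```

Sentences without a clear winner come back as None here, so they could be dropped or sent out for more labels rather than being forced into a category.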

What's next

So far, the model is trained only on English-language Wikipedia content, but Wikimedia is working on expanding it to more languages. Given how the data acquisition was performed, there are some obvious potential challenges with other languages that are structured differently than English. "We don't have to start from scratch, but the amount of work may vary by language," said Miriam Redi, research scientist at the Wikimedia Foundation and lead author on the paper. "To train our models, we use 'word-vectors,' namely language characteristics of the article text and structure. These word vectors can be easily extracted from text of any language existing in Wikipedia."

Redi said that in some cases, they would need to collect new samples from other "featured articles" and would have to rely on the Wikipedia editors who work in those languages. Morgan added that they have processes for "translating English words that we know are associated with sentences that are likely to need citations into other languages."

Even with some AI involved, the lion's share of the work falls on the shoulders of a group of dedicated volunteer Wikipedia editors. Creating a mass of hundreds of thousands of accurate citation flags is informative, but humans will need to tackle them all one at a time. But at least now they know where to start.

Ideally, the researchers believe, this AI can help Wikipedia editors understand where information needs to be verified and why, and show readers what content is especially trustworthy. Once the code is open-sourced, they hope it will encourage other volunteer software developers to make more tools that can increase the quality of Wikipedia articles.

But there are larger implications, said Morgan: "Outside the Wikimedia movement, we hope that other researchers (such as members of the Credibility Coalition) use our code and data to develop tools for detecting claims in other online news and information sources that need to be backed up with evidence."
