On Wikipedia, not all languages are created equal.
Wikipedia holds around 40 million entries in total, consuming around 30 terabytes of data, and English dominates the online encyclopedia with more than five million articles. Swedish, which claims more than three million articles, is next, then Cebuano (2.19 million, though most of these are created by a bot), German (1.95 million), Dutch (1.86 million), and French (1.75 million).
As English is the world’s lingua franca, it’s perhaps unsurprising that it is so prevalent on Wikipedia. But the property’s parent organization, the Wikimedia Foundation, is making moves to plug the inherent knowledge gap that results from this linguistic imbalance.
And as Wikipedia is a crowdsourced encyclopedia, it will come as little surprise that it has long operated a crowdsourced translation program. This means that someone’s expertise in a niche topic can more easily transcend the language barrier. But finding out which topics or articles are in particularly short supply in specific languages is a challenge, which is why Wikimedia is partnering with Stanford University researchers to design a new recommendation system. This will rank Wikipedia articles in order of priority across languages. The ranking is based on a number of factors, including editor interests (using contribution history data), language proficiency, and anticipated popularity if an article were translated. For example, a native Swahili speaker is unlikely to care about the history of a U.K. baking business, but they may care about WrestleMania 32.
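To make the ranking idea concrete, here is a minimal sketch in Python of how the three factors named above could be combined into a single priority score. The article does not say how the researchers actually weigh these factors, so the weighted sum, the weights, the `Candidate` fields, and the example values are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    """A source-language article that could be translated (hypothetical structure)."""
    title: str
    interest_match: float        # 0..1, fit with the editor's contribution history
    language_proficiency: float  # 0..1, editor's proficiency in the source language
    predicted_popularity: float  # 0..1, anticipated readership if translated


def score(candidate, weights=(0.4, 0.2, 0.4)):
    """Combine the three factors with a weighted sum (the weights are illustrative)."""
    w_interest, w_language, w_popularity = weights
    return (
        w_interest * candidate.interest_match
        + w_language * candidate.language_proficiency
        + w_popularity * candidate.predicted_popularity
    )


def rank(candidates, weights=(0.4, 0.2, 0.4)):
    """Return candidates ordered from highest to lowest priority."""
    return sorted(candidates, key=lambda c: score(c, weights), reverse=True)


# Made-up values for one editor, echoing the article's example.
candidates = [
    Candidate("History of a U.K. baking business", 0.1, 0.9, 0.05),
    Candidate("WrestleMania 32", 0.7, 0.9, 0.8),
]
ranked_titles = [c.title for c in rank(candidates)]
```

With these invented numbers, the wrestling article outranks the baking history because both its interest match and its predicted popularity are higher.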
Wikimedia’s research arm worked with Stanford to conduct a controlled test of its new system using the French version of Wikipedia, and compared recommendations that were personalized with those that were not. Results showed that tailored recommendations “tripled the rate at which editors create articles,” without any discernible impact on the quality.
As a result of this initial test, software developers were beckoned to the proverbial table to build a prototype article-recommendation tool that filters suggested translations by language combination. It adopts a simpler version of the original algorithm and uses pageview and search data to identify “trending” articles that exist in one language but not another.
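The core of that simpler approach, finding articles that exist in one language but not another and ordering them by how much attention they get, can be sketched in a few lines of Python. This is an illustration rather than the tool's actual code: the function name, the use of shared article identifiers across languages, and the pageview figures are all assumptions.

```python
def find_missing_articles(source_articles, target_articles, pageviews, limit=10):
    """Return articles present in the source language but absent from the target,
    most-viewed first.

    source_articles, target_articles: sets of language-independent article IDs
        (assumed here to be shared across language editions).
    pageviews: dict mapping article ID to recent source-language view counts.
    """
    missing = source_articles - target_articles
    # Sort by descending views; break ties alphabetically for a stable order.
    ordered = sorted(missing, key=lambda a: (-pageviews.get(a, 0), a))
    return ordered[:limit]


# Invented sample data for a source/target language pair.
source = {"Universe", "Asteroid", "Football"}
target = {"Football"}
views = {"Universe": 900, "Asteroid": 300, "Football": 5000}

suggestions = find_missing_articles(source, target, views)
```

Here "Football" is skipped despite its high view count because it already exists in the target language, which leaves "Universe" and "Asteroid" as the suggested translations.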
Wikipedia GapFinder is still a beta product, but it gives a glimpse of what Wikipedia is trying to achieve here. You select the source and target tongues, and the algorithm does the rest. You can also narrow it down manually by topic, using keywords such as “football.”
When you click on an article that is of interest, you’ll see the original Wikipedia article with a “Translate” button at the bottom. When you click that, you’ll be whisked away to the main Wikipedia interface where you’ll have to log in.
Interestingly, and perhaps crucially, the tool is open-source, with an application programming interface (API) available to encourage developers to use the same underlying recommendation engine in third-party apps.
“Over the coming months, we will be monitoring the tool closely to learn more about how it’s being used by editors and how it can be further improved,” said the Wikimedia Foundation, in a blog post. “We are particularly interested in seeing how the tool can be used by larger groups participating in edit-a-thons, meetups, or other outreach events, as a handy solution to generate lists of missing articles.”
So this tool is perhaps less about generating articles than it is about helping existing editors find the best articles to translate — but these recommendations should organically lead to more articles being converted into new languages. And that is the ultimate goal.
According to Wikimedia: “The French Wikipedia may have more than 20,000 articles on individual asteroids, but if you are one of 27 million people speaking Hausa as a first language, Wikipedia doesn’t yet have an entry on the universe.”
And that tells its own story.
The Wikimedia Foundation has already been pushing to surface the best articles and make Wikipedia a go-to destination for readers, so recommending articles to translate seems the next logical step.