Google's AI lets users search language-agnostic knowledge bases in their native tongue

Entity linking plays a key role in grounded language understanding. Given a text mention of an entity (e.g., the word "helpful"), an algorithm identifies the entity's corresponding entry in a knowledge base (such as a Wikipedia article). To extend its usefulness, researchers at Google propose a new technique in which language-specific mentions resolve to a language-agnostic knowledge base. They describe a single entity retrieval model that covers over 100 languages and 20 million entities while ostensibly outperforming results from more limited cross-lingual tasks.

Multilingual entity linking involves linking a text snippet in some context to the corresponding entity in a language-agnostic knowledge base. Knowledge bases are essentially databases comprising information about entities -- people, places, and things. In 2012, Google launched a knowledge base, the Knowledge Graph, to enhance search results with hundreds of billions of facts gathered from sources including Wikipedia, Wikidata, and the CIA World Factbook. Microsoft provides a knowledge base with over 150,000 articles created by support professionals who have resolved issues for its customers.

Knowledge bases in multilingual entity linking may include textual information like names and descriptions about each entity in one or more languages. But they make no prior assumption about the relationship between these knowledge base languages and the mention-side language.
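
To make the idea of a language-agnostic knowledge base concrete, here is a minimal Python sketch. The `Entity` schema, the `lookup_by_name` helper, and the two sample entries are assumptions made for illustration, not the researchers' data format, and exact name matching is only a placeholder for a real linker.

```python
from dataclasses import dataclass, field


@dataclass
class Entity:
    """One language-agnostic knowledge base entry (hypothetical schema)."""
    entity_id: str                                    # WikiData-style identifier
    names: dict = field(default_factory=dict)         # language code -> name
    descriptions: dict = field(default_factory=dict)  # language code -> text


# Illustrative entries: names and descriptions exist only in some languages.
kb = {
    "Q64": Entity(
        entity_id="Q64",
        names={"en": "Berlin", "de": "Berlin", "ja": "ベルリン"},
        descriptions={"en": "capital of Germany"},
    ),
    "Q90": Entity(
        entity_id="Q90",
        names={"en": "Paris", "fr": "Paris", "ru": "Париж"},
        descriptions={"fr": "capitale de la France"},
    ),
}


def lookup_by_name(kb, mention):
    """Return IDs of entities with a name equal to the mention in any language."""
    needle = mention.casefold()
    return sorted(
        eid
        for eid, entity in kb.items()
        if any(name.casefold() == needle for name in entity.names.values())
    )


print(lookup_by_name(kb, "ベルリン"))
```

The point of the sketch is that the mention's language never has to match the language of any particular description: the result is an identifier, not a page in a given language.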

The Google researchers used what are called enhanced dual encoder retrieval models and WikiData as their knowledge base, which covers a large set of diverse entities. WikiData contains names and short descriptions, but through its close integration with all Wikipedia editions, it also connects entities to rich descriptions (and other features) drawn from the corresponding language-specific Wikipedia pages.
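
The article does not spell out the model architecture, but the general dual encoder idea can be sketched in a few lines of Python: one encoder maps a mention to a vector, another maps each entity to a vector in the same space, and retrieval returns the entities whose vectors are most similar to the mention's. Everything below is an assumption for illustration. In particular, `encode` is a toy hashed character-trigram function standing in for the learned neural encoders a real system would use, and a single function is shared by both sides for brevity.

```python
import math
import zlib

DIM = 64  # size of the toy embedding space (arbitrary choice)


def encode(text, dim=DIM):
    """Toy stand-in for a learned encoder: hashed trigram counts, L2-normalised."""
    vec = [0.0] * dim
    padded = f"  {text.casefold()}  "
    for i in range(len(padded) - 2):
        trigram = padded[i:i + 3].encode("utf-8")
        # crc32 is deterministic across runs, unlike Python's built-in hash().
        vec[zlib.crc32(trigram) % dim] += 1.0
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]


def retrieve(mention_text, entity_vectors, k=1):
    """Return the k entity IDs whose vectors score highest against the mention."""
    query = encode(mention_text)
    scored = [
        (sum(q * e for q, e in zip(query, vec)), eid)  # dot product = cosine here
        for eid, vec in entity_vectors.items()
    ]
    scored.sort(reverse=True)
    return [eid for _, eid in scored[:k]]


# Entity side: vectors are computed once, ahead of time (illustrative entries).
entity_texts = {
    "Q64": "Berlin, capital of Germany",
    "Q90": "Paris, capital of France",
    "Q84": "London, capital of the United Kingdom",
}
entity_vectors = {eid: encode(text) for eid, text in entity_texts.items()}

print(retrieve("She flew to Berlin, the capital of Germany", entity_vectors, k=2))
```

Because the entity vectors are precomputed, only the mention has to be encoded at query time, which is what makes this style of retrieval practical over millions of entities.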

The researchers extracted a large-scale dataset of 684 million mentions in 104 languages linked to WikiData entities, which they say is at least six times larger than datasets used in prior English-only linking work. In addition, the coauthors created a matching dataset -- Mewsli-9 -- that spans a diverse set of languages and entities, including 289,087 entity mentions appearing in 58,717 news articles from WikiNews. (Some 11% of the 82,162 distinct target entities in Mewsli-9 have no English Wikipedia page, which caps what systems restricted to English Wikipedia entities can achieve on the dataset.)

The researchers say the results show that entity linking can better reflect the real-world challenges of rare entities and/or low resource languages. "Operationalized through Wikipedia and WikiData, our experiments using enhanced dual encoder retrieval models and frequency-based evaluation provide compelling evidence that it is feasible to perform this task with a single model covering over a 100 languages," they wrote. "Our automatically extracted Mewsli-9 dataset serves as a starting point for evaluating entity linking beyond the entrenched English benchmarks and under the expanded multilingual setting."

It's unclear whether the researchers' models exhibit demographic bias, however. In a paper published earlier this year, Twitter researchers claimed to have found evidence of prejudice in popular named entity recognition models, particularly with respect to Black and other "non-white" names. But the Google coauthors leave the door open to using non-expert human raters to improve the quality of the training dataset and incorporate relational knowledge.
