Google releases TyDi QA, a data set that aims to capture the uniqueness of languages

Google hopes to spur the development of AI capable of understanding the ways in which languages express different meanings. To this end, company researchers today detailed a data set -- TyDi QA, a question-answering data set covering 11 languages -- inspired by typological diversity, or the notion that different languages express meaning in structurally unique ways.

TyDi QA is something of a complement to the English-language Natural Questions corpus Google released last year, and it attempts to capture the idiosyncrasies and features of tongues like Japanese and Arabic. The researchers point out, for instance, that English changes words to indicate one object (“book”) versus many (“books”), and that Arabic has a third form to indicate if there are two of something ("كتابان", kitaban) beyond just singular ("كتاب", kitab) or plural ("كتب", kutub).

"Because we selected a set of languages that are typologically distant from each other for this corpus, we expect models performing well on this dataset to generalize across a large number of the languages in the world," wrote Google Research scientist Jonathan Clark in a blog post.

TyDi QA includes over 200,000 question-answer pairs from languages representing a "diverse range" of linguistic phenomena and data challenges, many of which use non-Latin alphabets (such as Arabic, Bengali, Korean, Russian, Telugu, and Thai) and form words in complex ways (including Arabic, Finnish, Indonesian, Kiswahili, and Russian). The languages also range from those with an abundance of available data on the web (English and Arabic) to those with very little (Bengali and Kiswahili).

The questions were collected from people who wanted an answer but didn't yet know it, which helps avoid questions that simply reuse the words of the answer. To inspire questions, the researchers showed contributors a passage from Wikipedia written in their native language. They then had them ask a question -- any question -- as long as it wasn't answered by the passage and they actually wanted to know the answer. (E.g., "Does a passage about ice make you think about popsicles in summer? Great! Ask who invented popsicles.") Importantly, the questions were written directly in each language, not translated, so many questions are unlike those seen in an English-first corpus. (E.g., সফেদা ফল খেতে কেমন?, or "What does sapodilla taste like?")

For each of the questions, the researchers performed a Google Search for the best-matching Wikipedia article in the appropriate language and asked a person to find and highlight the answer within that article. In some languages, they found that words were represented very differently in question and answer -- so differently that they expect it will be a challenge to design a system that can successfully select an answer out of a Wikipedia article.
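
To make the mismatch concrete, here is a minimal sketch of why exact word matching struggles when a question and its answer text use different forms of the same word. It is illustrative only: the function name `token_overlap`, the whitespace tokenization, and the toy strings are assumptions for this example, not part of TyDi QA or Google's tooling. The Arabic transliterations are the ones quoted above.

```python
def token_overlap(question: str, passage: str) -> float:
    """Fraction of distinct question tokens found verbatim in the passage.

    Hypothetical helper: splits on whitespace and compares lowercase
    tokens exactly, with no stemming or morphological analysis.
    """
    question_tokens = set(question.lower().split())
    passage_tokens = set(passage.lower().split())
    if not question_tokens:
        return 0.0
    shared = question_tokens & passage_tokens
    return len(shared) / len(question_tokens)


# Toy strings, not real TyDi QA data.
# Half of the question's words appear unchanged in the passage.
english_score = token_overlap("book store", "the book is here")

# "kitab" (singular) never matches "kitaban" (dual) or "kutub" (plural),
# even though all three are forms of the same word.
arabic_score = token_overlap("kitab", "kitaban kutub")

print(english_score, arabic_score)
```

A system that relies on this kind of surface matching scores zero on the second pair, which is why languages that form words in complex ways call for models that look beyond identical strings.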

To track the community's progress, they've established a leaderboard where participants can evaluate the quality of their machine learning systems. "It is our hope that this dataset will push the research community to innovate in ways that will create more helpful question-answering systems for users around the world," wrote Clark.
