Amazon will release a conversational and knowledge data set of more than 4 million words

No, this isn't an April Fools' Day prank: Amazon plans to make available a massive number of data samples targeting natural language processing research. The Seattle company today said that in September 2019, it'll release the Topical Chat data set, a corpus of crowdsourced human conversations provided to teams competing in the annual Alexa Prize Socialbot Grand Challenge.

The Topical Chat data set consists of more than 210,000 utterances, or over 4,100,000 words, Amazon says, making it one of the largest public social conversation and knowledge data sets. Each conversation and conversation turn in the corpus is linked to the knowledge provided to crowd workers, and that knowledge is collected from a range of "unstructured" and "loosely structured" text resources relating to a set of entities.

Amazon senior principal scientist Dilek Hakkani-Tur made it clear in a blog post that none of the conversations are interactions with Alexa customers.

"The goal of this collection is to enable the next steps of research in knowledge-grounded neural response generation systems, tackling hard challenges in natural conversation that are not addressed by other publicly available datasets," Hakkani-Tur said. "This will allow researchers to focus on the way humans transition between topics, knowledge-selection and enrichment, and integration of fact and opinion into dialogue ... [and support] the publication of high quality, repeatable research."

Amazon says that teams competing for the Alexa Prize will have access to an expanded version of the data set -- the aptly named Extended Topical Chat dataset -- which includes the results of ongoing collections and annotations.

Today's announcement comes roughly six months after Amazon open-sourced a data set that could be used to train AI models to identify names across languages and script types. Described as a "transliteration multilingual named-entity transliteration system," it comprises nearly 400,000 names in languages like Arabic, English, Hebrew, Japanese Katakana, and Russian scraped from Wikipedia.
