Amazon open-sources its Topical Chat data set of over 4.7 million words

Back in April, Amazon announced plans to release Topical Chat, a data set of crowdsourced human conversations, to teams competing in the annual Alexa Prize Socialbot Grand Challenge. It made good on that promise today with the release on GitHub of more than 235,000 utterances containing over 4,700,000 words, which it says will support "high-quality" and "repeatable" dialogue systems research.

"The goal of Topical Chat is to enable innovative research in knowledge-grounded neural response-generation systems by tackling hard challenges that are not addressed by other publicly available data sets," wrote Dilek Hakkani-Tür, a senior principal scientist in Amazon's Alexa AI group, in a blog post. "Those challenges, which we have seen universities begin to tackle in the Alexa Prize Socialbot Grand Challenge, include transitioning between topics in a natural manner, knowledge selection and enrichment, and integration of fact and opinion into dialogue."

To compile the corpus, Hakkani-Tür and colleagues identified 300 named entities (i.e., people, places, or things) in eight different topic categories that came up frequently in conversations with Alexa Prize chatbots. The entities were clustered into groups of three based on their co-occurrence in information sources. For each entity in a cluster, several additional sources of information were collected and divided up by cluster.

The data was then passed along to pairs of crowdsourced workers on Amazon's Mechanical Turk, who sometimes received the same information and other times got only a subset of it. In some cases, the Alexa AI team divvied up the data so that paired workers were left with complementary knowledge.
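As a rough illustration of the three arrangements described above (identical knowledge, a subset, and complementary knowledge), here is a minimal Python sketch. It is not Amazon's code: the function name `split_knowledge`, the mode labels, and the random selection strategy are all assumptions made for the example.

```python
import random
from typing import List, Sequence, Tuple


def split_knowledge(
    facts: Sequence[str], mode: str, rng: random.Random
) -> Tuple[List[str], List[str]]:
    """Return (worker_a_facts, worker_b_facts) for one hypothetical split mode."""
    facts = list(facts)
    if len(facts) < 2:
        raise ValueError("need at least two facts to split")
    if mode == "same":
        # Both workers see the full knowledge set.
        return facts[:], facts[:]
    if mode == "subset":
        # Worker A sees everything; worker B sees a random proper subset.
        k = rng.randint(1, len(facts) - 1)
        return facts[:], rng.sample(facts, k)
    if mode == "complementary":
        # Each fact goes to exactly one worker, so the two sets do not overlap.
        shuffled = facts[:]
        rng.shuffle(shuffled)
        mid = len(shuffled) // 2
        return shuffled[:mid], shuffled[mid:]
    raise ValueError(f"unknown mode: {mode}")
```

Calling `split_knowledge(facts, "complementary", random.Random(0))` on a list of facts would hand each worker a disjoint share, which is one way a pair could end up needing each other's knowledge to carry the conversation.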

The Mechanical Turk workers carried on instant-messaging conversations about the knowledge sets they'd received, as instructed by the researchers. For each of their own messages, they were asked to indicate the source of their information and to gauge the message's overall sentiment (e.g., happy, sad, curious, or fearful). For each of their chat partner's messages, they were asked to assess its quality (i.e., whether it was conversationally appropriate).
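The per-message annotations described above could be modeled along these lines. This is only a sketch: the class name, field names, and the helper function are hypothetical and do not reflect the actual schema of the released files.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class AnnotatedMessage:
    speaker: str
    text: str
    knowledge_source: str  # self-reported by the message's author
    sentiment: str  # self-reported, e.g. "happy", "sad", "curious", "fearful"
    partner_rated_appropriate: Optional[bool] = None  # set by the chat partner


def appropriate_fraction(messages: List[AnnotatedMessage]) -> float:
    """Share of partner-rated messages judged conversationally appropriate."""
    rated = [m for m in messages if m.partner_rated_appropriate is not None]
    if not rated:
        return 0.0
    return sum(1 for m in rated if m.partner_rated_appropriate) / len(rated)
```

Separating the author's self-annotations from the partner's rating mirrors the two annotation roles in the article; a summary such as `appropriate_fraction` is just one example of what those ratings would make possible.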

The conversations were then whittled down through a combination of manual and automatic review.

"[Our hope is that this] will allow researchers to focus on the way humans transition between topics, knowledge-selection and enrichment, and integration of fact and opinion into dialogue ... [and support] the publication of high quality, repeatable research," said Hakkani-Tür.

This week's announcement comes roughly a year after Amazon open-sourced a data set that could be used to train AI models to identify names across languages and script types. Called a "transliteration multilingual named-entity transliteration system," it comprises nearly 400,000 names in languages like Arabic, English, Hebrew, Japanese Katakana, and Russian scraped from Wikipedia.
