The cocktail party problem, alternatively known as the dinner party problem, is the difficulty automated systems encounter when tasked with isolating audio in noisy, multisource environments. It’s widely studied, and a number of academic teams, startups, and corporate giants claim to have solved it with sophisticated machine learning algorithms. But Amazon believes there’s room for improvement, and to this end, it’s releasing a data set — the Dinner Party Corpus, or DiPCo — intended to spur research on the topic.

According to Zaid Ahmed, a senior technical program manager in the Alexa Speech group, the corpus was created with the help of Amazon volunteers who simulated a dinner-party scenario in the lab. Over the course of multiple sessions (each involving four participants), the volunteers served themselves food from a buffet table and spoke over music piped into the room. Each was outfitted with a headset microphone that captured a speaker-specific signal, and five devices, each with seven microphones, were placed strategically around the room and fed their signals directly to a laptop.

DiPCo contains the raw audio recorded by each of the seven microphones in each of the five devices, along with the headset signals. The headset signals provide references that can be used to gauge the success of speech separation systems. The data set also includes detailed transcriptions of each volunteer’s utterances.
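To illustrate how a headset recording could serve as a reference, here is a minimal sketch of one commonly used separation metric, the scale-invariant signal-to-distortion ratio (SI-SDR). The article does not say which metric Amazon uses, so the choice of SI-SDR, the function name `si_sdr`, and the synthetic sine-wave signals are all assumptions made for illustration.

```python
import math


def si_sdr(estimate, reference, eps=1e-12):
    """Scale-invariant signal-to-distortion ratio in dB (higher is better).

    `reference` stands in for a close-talk headset signal and `estimate`
    for a separation system's output; both are equal-length lists of floats.
    This is an illustrative sketch, not the corpus authors' evaluation code.
    """
    if len(estimate) != len(reference):
        raise ValueError("signals must have the same length")
    n = len(reference)
    # Remove any constant offset from both signals.
    est_mean = sum(estimate) / n
    ref_mean = sum(reference) / n
    est = [x - est_mean for x in estimate]
    ref = [x - ref_mean for x in reference]
    # Project the estimate onto the reference to get the "target" part.
    dot = sum(e * r for e, r in zip(est, ref))
    ref_energy = sum(r * r for r in ref)
    scale = dot / (ref_energy + eps)
    target = [scale * r for r in ref]
    # Everything left over counts as distortion.
    residual = [e - t for e, t in zip(est, target)]
    target_energy = sum(t * t for t in target)
    residual_energy = sum(x * x for x in residual)
    return 10.0 * math.log10((target_energy + eps) / (residual_energy + eps))


# Synthetic stand-ins: a "speaker" tone and an equally loud interfering tone.
speaker = [math.sin(2 * math.pi * 5 * t / 100) for t in range(100)]
interferer = [math.sin(2 * math.pi * 13 * t / 100) for t in range(100)]

quiet_copy = [0.5 * s for s in speaker]                      # rescaled only
overlapped = [s + i for s, i in zip(speaker, interferer)]    # unseparated mix

print(si_sdr(quiet_copy, speaker) > si_sdr(overlapped, speaker))
```

Because the metric ignores overall gain, a quieter but otherwise clean estimate scores far higher than one that still contains an interfering source of equal energy, which lands near 0 dB.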

Above: The layout of the simulated dinner party space.

Image Credit: Amazon

“The division of the data into segments with and without background music enables researchers to combine clean and noisy training data in whatever way necessary to extract optimal performance from their machine learning systems,” explained Ahmed.
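As a rough sketch of what combining the two kinds of segments might look like in practice, the snippet below draws a training set with a chosen share of with-music segments. The segment identifiers, the `mix_segments` helper, and the 30% ratio are hypothetical; the article does not describe how DiPCo labels or stores its segments.

```python
import random


def mix_segments(clean, noisy, noisy_fraction, total, seed=0):
    """Pick `total` segments, with `noisy_fraction` of them from `noisy`.

    `clean` and `noisy` are lists of hypothetical segment identifiers
    (without and with background music). Sampling is without replacement.
    """
    if not 0.0 <= noisy_fraction <= 1.0:
        raise ValueError("noisy_fraction must be between 0 and 1")
    n_noisy = round(total * noisy_fraction)
    n_clean = total - n_noisy
    if n_noisy > len(noisy) or n_clean > len(clean):
        raise ValueError("not enough segments for the requested mix")
    rng = random.Random(seed)  # fixed seed keeps the selection repeatable
    picked = rng.sample(clean, n_clean) + rng.sample(noisy, n_noisy)
    rng.shuffle(picked)        # interleave the two kinds of segments
    return picked


# Hypothetical identifiers, for illustration only.
clean_ids = [f"clean_{i}" for i in range(10)]
music_ids = [f"music_{i}" for i in range(10)]

training_set = mix_segments(clean_ids, music_ids, noisy_fraction=0.3, total=10)
print(sum(seg.startswith("music_") for seg in training_set))
```

A researcher could sweep `noisy_fraction` and compare results on held-out recordings to find the mix that works best for a given system.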

The release of the DiPCo follows on the heels of FEVER, an open source corpus compiled by researchers at Amazon and the University of Sheffield intended to promote the development of fact verification systems. Separately, Amazon in September published the Topical Chat Dataset, a text-based collection of more than 235,000 utterances designed to help support high-quality, repeatable research in the field of dialogue systems.

On a somewhat related note, DiPCo’s release also comes a week after Amazon improved Alexa’s ability to interpret multiple languages and suss out emotion from voice snippets. Now, the intelligent assistant can automatically detect both Spanish and English in the U.S., French and English in Canada, and Hindi and English in India. And starting early next year in the music domain, Alexa will apologetically offer an alternative action when it detects frustration in a customer’s voice as a result of a mistake that it made.