How do assistants like Alexa discern sound? The answer lies in two Amazon research papers scheduled to be presented at this year’s International Conference on Acoustics, Speech, and Signal Processing in Brighton, United Kingdom. Ming Sun, a senior speech scientist in the Alexa Speech group, detailed them this morning in a blog post.

“We develop[ed] a way to better characterize media audio by examining longer-duration audio streams versus merely classifying short audio snippets,” he said, “[and] we used semisupervised learning to train a system developed from an external dataset to do audio event detection.”

The first paper addresses the problem of media detection — that is, recognizing when voices captured by an assistant originate from a TV or radio rather than from a live human speaker. To tackle this, Sun and colleagues devised a machine learning model that identifies characteristics common to media sound, regardless of content, in order to distinguish it from live speech.

Their system consists of several recurrent neural networks (RNNs) — AI models that process sequential data in order, so that each output factors in the preceding inputs and outputs — and a feature-extracting convolutional neural network. Uniquely, the RNNs are “stacked” on top of each other in a pyramidal fashion, such that each layer has only a fraction as many components as the one beneath it.

For every five-second snippet of audio processed, the RNNs generate a single output in vector form (i.e., as a mathematical representation) giving the probability that the snippet belongs to each of several sound categories. Meanwhile, yet another neural network — also an RNN — tracks relationships among successive snippets.
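Neither paper ships code, but a minimal PyTorch sketch of that per-snippet pipeline can make the shape of the model concrete. Everything specific here (the log-mel input, the GRU cells, the layer widths, and the four sound categories) is an illustrative assumption, not a detail taken from the paper.

```python
# Illustrative sketch of the media-detection front end: a CNN feature
# extractor followed by a pyramidal stack of RNNs, each layer a fraction
# of the size of the one beneath it. All sizes, cell types, and the number
# of sound categories are assumptions for illustration.
import torch
import torch.nn as nn

class PyramidalSnippetClassifier(nn.Module):
    def __init__(self, n_mels=64, n_classes=4):
        super().__init__()
        # Convolutional feature extractor over log-mel frames.
        self.cnn = nn.Sequential(
            nn.Conv1d(n_mels, 128, kernel_size=5, padding=2), nn.ReLU(),
            nn.Conv1d(128, 128, kernel_size=5, padding=2), nn.ReLU(),
        )
        # Pyramidal stack: 128 -> 64 -> 32 hidden units.
        self.rnn_bottom = nn.GRU(128, 128, batch_first=True)
        self.rnn_middle = nn.GRU(128, 64, batch_first=True)
        self.rnn_top = nn.GRU(64, 32, batch_first=True)
        # One score vector per snippet, over the sound categories.
        self.classifier = nn.Linear(32, n_classes)

    def forward(self, snippet):
        # snippet: (batch, n_mels, time) log-mel frames for one ~5 s snippet
        feats = self.cnn(snippet).transpose(1, 2)  # (batch, time, 128)
        h, _ = self.rnn_bottom(feats)
        h, _ = self.rnn_middle(h)
        h, _ = self.rnn_top(h)
        # Summarize the snippet with the last time step, then classify.
        return self.classifier(h[:, -1, :])        # (batch, n_classes)

# Two 5-second snippets at 100 log-mel frames per second.
scores = PyramidalSnippetClassifier()(torch.randn(2, 64, 500))
print(scores.shape)  # torch.Size([2, 4])
```

Halving the hidden size at each level is what gives the stack its pyramidal shape; the final, narrow layer yields a compact per-snippet summary for the higher-level RNN described below.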

The team experimented with a design that placed the higher-level RNN — the RNN responsible for making the final decision about whether media sound is present — between the middle and top layers of the other RNNs so that it received input from the middle layer and passed its output to the top layer. In experiments, this was their best-performing architecture, with a reported 24% reduction in error rate.
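To make that placement concrete, the earlier sketch can be extended so that the snippet-level decision RNN consumes the middle layer’s per-snippet summaries and, in turn, feeds the top layer. As before, the cell types and sizes are assumptions rather than the paper’s configuration.

```python
# Illustrative sketch of the best-performing placement: the higher-level
# decision RNN sits between the middle and top layers, reading per-snippet
# summaries from the middle layer and passing its output up to the top layer.
# Sizes and cell types are assumptions for illustration.
import torch
import torch.nn as nn

class MidPlacedMediaDetector(nn.Module):
    def __init__(self, feat_dim=128):
        super().__init__()
        self.rnn_bottom = nn.GRU(feat_dim, 128, batch_first=True)
        self.rnn_middle = nn.GRU(128, 64, batch_first=True)
        # Higher-level RNN that runs across snippets, not within them.
        self.rnn_across_snippets = nn.GRU(64, 64, batch_first=True)
        self.media_head = nn.Linear(64, 1)   # media vs. non-media decision
        self.rnn_top = nn.GRU(64, 32, batch_first=True)

    def forward(self, snippets):
        # snippets: (batch, n_snippets, time, feat_dim) CNN features
        b, s, t, d = snippets.shape
        h, _ = self.rnn_bottom(snippets.reshape(b * s, t, d))
        h, _ = self.rnn_middle(h)
        # One vector per snippet, taken from the middle layer.
        per_snippet = h[:, -1, :].reshape(b, s, -1)    # (b, s, 64)
        # The decision RNN tracks relationships among successive snippets.
        across, _ = self.rnn_across_snippets(per_snippet)
        media_logits = self.media_head(across)         # (b, s, 1)
        # Its output is what the top layer receives as input.
        top, _ = self.rnn_top(across)
        return media_logits, top

logits, _ = MidPlacedMediaDetector()(torch.randn(2, 6, 500, 128))
print(logits.shape)  # torch.Size([2, 6, 1])
```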

Sussing out sound

The other paper proposes a novel approach to semisupervised learning — a technique involving training on a small amount of labeled data and a larger set of unlabeled data — in audio event detection.

Semisupervised learning tends to improve machine learning models’ predictions, Sun notes, but it can also compound errors, because the labels the AI system assigns to the unlabeled data aren’t always correct.

To mitigate this, he and colleagues used a “tri-training” technique in which they created three different training sets — 39,000 examples in total — by randomly sampling data from a corpus. They then trained a separate AI model on each of the three data sets and saved copies of all three, which they used to label an additional 5.4 million samples. A machine-labeled sample was used to retrain a given model only if the other two models agreed on its label.
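The team’s models are neural audio classifiers, but the agreement rule itself is simple enough to sketch in a few lines. In the sketch below, the scikit-learn classifiers, the random placeholder features, and the data set sizes are all stand-ins for illustration, not anything from the paper.

```python
# Rough sketch of the tri-training agreement rule: three models are trained
# on three randomly sampled labeled sets, then each model is retrained with
# machine-labeled samples only where the other two models agree on the label.
# Classifiers, features, and data sizes are placeholders for illustration.
import numpy as np
from sklearn.base import clone
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
X_labeled = rng.normal(size=(3000, 20))           # stand-in audio features
y_labeled = rng.integers(0, 2, size=3000)         # stand-in event labels
X_unlabeled = rng.normal(size=(20000, 20))        # stand-in unlabeled pool

# 1. Train three models, each on its own random sample of the labeled corpus.
models = [LogisticRegression(max_iter=1000),
          RandomForestClassifier(n_estimators=50, random_state=0),
          DecisionTreeClassifier(random_state=0)]
for m in models:
    idx = rng.choice(len(X_labeled), size=len(X_labeled), replace=True)
    m.fit(X_labeled[idx], y_labeled[idx])

# 2. Label the unlabeled pool, then retrain a copy of each model using only
#    the samples on which the OTHER two models agree.
preds = [m.predict(X_unlabeled) for m in models]
retrained = []
for i, m in enumerate(models):
    j, k = [x for x in range(3) if x != i]
    agree = preds[j] == preds[k]
    X_aug = np.vstack([X_labeled, X_unlabeled[agree]])
    y_aug = np.concatenate([y_labeled, preds[j][agree]])
    retrained.append(clone(m).fit(X_aug, y_aug))
    print(f"model {i}: adopted {int(agree.sum())} machine-labeled samples")
```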

Finally, the researchers used seven models in total to classify the examples in the test set: the three initial models, the three retrained models, and a seventh model trained to mimic the aggregate results of the first six. On samples of three sounds — dog sounds, baby cries, and gunshots — pooling the results of all six models led to reductions in error of 16%, 26%, and 19%, respectively, over a standard self-trained model. Meanwhile, the seventh model reduced error rates on the same three sample sets by 11%, 18%, and 6%, respectively.
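Continuing that sketch (it reuses the models, retrained, and X_unlabeled variables from the block above), the pooled decision and the seventh, mimicking model could look roughly like this; the simple probability averaging and the MLP student are assumptions, not the paper’s exact setup.

```python
# Continuation of the tri-training sketch above: pool the six models'
# predicted probabilities, and train a seventh model to mimic the pooled
# result (a distillation-style student). The averaging scheme and the MLP
# student are illustrative assumptions.
import numpy as np
from sklearn.neural_network import MLPClassifier

six_models = models + retrained                    # three initial + three retrained
probs = np.mean([m.predict_proba(X_unlabeled) for m in six_models], axis=0)
pooled_labels = probs.argmax(axis=1)               # aggregate decision of the six

# Seventh model: learns to reproduce the pooled decision from raw features,
# so a single model can stand in for the whole ensemble at run time.
mimic = MLPClassifier(hidden_layer_sizes=(64,), max_iter=300, random_state=0)
mimic.fit(X_unlabeled, pooled_labels)
print("agreement with the pooled ensemble:",
      (mimic.predict(X_unlabeled) == pooled_labels).mean())
```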