Alexa researchers cut AI error rates up to 30% by reducing data imbalance

Imbalanced training data is a major hurdle for classifiers -- that is, machine learning systems that sort inputs into classes. (Think object-detecting security cameras and smart speakers that distinguish among various speakers.) When one category of samples disproportionately contributes to a corpus, the classifier naturally encounters it more often than others and so runs the risk of becoming biased toward it.

Researchers at Amazon's Alexa division say they've developed a technique that can reduce error rates in some data-imbalanced systems by up to 30 percent. They describe it in a recently published paper ("Deep Embeddings for Rare Audio Event Detection with Imbalanced Data") scheduled to be presented at the International Conference on Acoustics, Speech, and Signal Processing in Brighton this spring.

Typically, data scientists address the unrepresentative sample problem by "overweighting" data in underrepresented classes -- i.e., assigning more importance to it. (For instance, if a particular class has one-third as much training data as another, each of its examples would count 3 times as much.) But Ming Sun, a senior speech scientist in the Alexa Speech group and lead author of the paper, advocates a different approach. He and colleagues trained an AI system to produce embeddings for each category in the form of vectors (mathematical representations of data) and to maximize the distance between those vectors.
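The conventional overweighting scheme described above can be illustrated with a short Python sketch. It is not taken from the paper; the function name `inverse_frequency_weights` and the sample labels are hypothetical, and it simply assumes that each class is weighted by the size of the largest class divided by its own size.

```python
from collections import Counter

def inverse_frequency_weights(labels):
    """Weight each class by (largest class count) / (its own count).

    Illustrative helper (hypothetical name): a class with one-third as
    many examples as the largest class gets a weight of 3.
    """
    counts = Counter(labels)
    largest = max(counts.values())
    return {cls: largest / n for cls, n in counts.items()}

# Made-up label counts: "dog_bark" has one-third as much data as "background".
labels = ["background"] * 300 + ["dog_bark"] * 100
weights = inverse_frequency_weights(labels)
print(weights)
```

In a training loop, weights like these would typically multiply each example's loss so that errors on the rare class count for more.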

To prevent imbalance in the embeddings, the larger data classes were split into clusters roughly the size of the smallest class. And to shorten the time it took to measure the distance between data items, the system was designed to keep a running measurement of each cluster's centroid, the point that minimizes the average squared distance to all points in the cluster.
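A rough Python sketch of those two ideas follows. The article does not say how the larger classes were partitioned, so the random split here is an assumption, and the names `split_into_clusters` and `RunningCentroid` are hypothetical. The centroid is kept up to date with the standard incremental-mean update, so no stored points need to be revisited.

```python
import random

def split_into_clusters(items, target_size, seed=0):
    """Randomly partition items into chunks of about target_size each.

    Assumption: a random split stands in for whatever clustering the
    researchers actually used.
    """
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    n_clusters = max(1, round(len(shuffled) / target_size))
    # Striding by n_clusters spreads items evenly across the chunks.
    return [shuffled[i::n_clusters] for i in range(n_clusters)]

class RunningCentroid:
    """Keeps the mean of all vectors seen so far without storing them."""

    def __init__(self, dim):
        self.count = 0
        self.mean = [0.0] * dim

    def update(self, vector):
        # Incremental mean: new_mean = old_mean + (x - old_mean) / n
        self.count += 1
        self.mean = [m + (v - m) / self.count
                     for m, v in zip(self.mean, vector)]

# A class of 10 items split to match a smallest class of about 3 items.
clusters = split_into_clusters(range(10), target_size=3)

centroid = RunningCentroid(dim=2)
centroid.update([0.0, 0.0])
centroid.update([2.0, 4.0])
print([len(c) for c in clusters], centroid.mean)
```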

"With each new embedding, our algorithm measures its distance from the centroids of the clusters, a much more efficient computation than exhaustively measuring pairwise distances," Sun explained in a blog post.
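The efficiency argument in Sun's quote can be made concrete with a small Python sketch. The cluster names, centroid values, and the function `distances_to_centroids` are hypothetical, and Euclidean distance is an assumption; the article does not specify the distance measure. The point is that the cost per new embedding grows with the number of clusters rather than with the number of stored data items.

```python
import math

def distances_to_centroids(embedding, centroids):
    """Return the distance from one embedding to each cluster centroid.

    This takes one distance computation per cluster, rather than one per
    stored data item as an exhaustive pairwise comparison would.
    """
    return {name: math.dist(embedding, c) for name, c in centroids.items()}

# Hypothetical two-dimensional centroids for two clusters.
centroids = {"dog_bark": [0.0, 0.0], "background_1": [3.0, 4.0]}
dists = distances_to_centroids([0.0, 0.0], centroids)
print(dists)
```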

The outputs of the fully trained embedding AI were used as training data for a classifier that applied labels to input data. Tests were then run on four types of sounds from an "industry-standard" dataset: dog barks, baby cries, gunshots, and background sounds. Experiments with the embeddings involving a long short-term memory (LSTM) network showed a performance improvement of 15 percent to 30 percent, and 22 percent overall. On a larger and slower but more accurate convolutional neural net (CNN), Sun and coauthors recorded 6 percent to 19 percent error reduction, depending on the ratio of the data classes.
