Much can be gleaned from the tone of someone’s voice, which is a natural conduit for emotion. And emotion has a range of applications: It can aid in health-monitoring by helping detect early signs of dementia or heart attack, and it has the potential to make conversational AI systems more engaging and responsive. Someday, emotion might even provide implicit feedback that could help voice assistants like Google Assistant, Apple’s Siri, and Amazon’s Alexa learn from their mistakes.
Emotion-classifying AI isn’t anything new, but traditional approaches are supervised, meaning that they ingest training data labeled according to speakers’ emotional states. Scientists at Amazon recently took a different approach, which they describe in a paper scheduled to be presented at the International Conference on Acoustics, Speech, and Signal Processing. Rather than sourcing an exhaustively annotated “emotion” corpus to teach a system, they fed an adversarial autoencoder a publicly available data set containing 10,000 utterances from 10 different speakers. The result? The neural network was up to 4% more accurate at judging valence, or emotional value, in people’s voices.
The research builds on the Amazon Alexa team’s ongoing effort to reliably determine users’ mood or emotional state from the sound of their voices.
As paper coauthor and Alexa Speech group senior applied scientist Viktor Rozgic explained in a blog post, adversarial autoencoders are two-part models comprising an encoder, which learns to produce a compact (or latent) representation of input speech encoding all properties of the training example, and a decoder, which reconstructs the input from the compact representation.
The researchers’ emotion representation consists of three network nodes, one for each of three emotional measures: valence, activation (whether the speaker is alert and engaged or passive), and dominance (whether the speaker feels in control of the situation). Training is conducted in three phases, the first of which involves individually training the encoder and decoder using data without labels. In the second phase, adversarial training, a technique in which adversarial discriminators attempt to distinguish real representations produced by the encoder from artificial representations, is used to tune the encoder. And in the third phase, the encoder is tuned to ensure that the latent emotion representation predicts the emotional labels of the training data.
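To make the three phases more concrete, here is a minimal sketch of the kind of objective each phase might minimize. It is not the researchers’ code: the function names, the choice of mean squared error and binary cross-entropy, and the convention that the three emotion nodes come first in the latent vector are all assumptions made for illustration.

```python
import math

# Assumed layout: the first three latent nodes hold the emotion measures.
EMOTION_DIMS = ("valence", "activation", "dominance")

# Phase name and what the article says is trained or tuned in that phase.
PHASES = [
    ("reconstruction", "encoder and decoder"),
    ("adversarial", "encoder"),
    ("emotion prediction", "encoder"),
]

def reconstruction_loss(x, x_hat):
    """Phase 1 (assumed MSE): how far the decoder output is from the input."""
    return sum((a - b) ** 2 for a, b in zip(x, x_hat)) / len(x)

def discriminator_loss(p_encoder_code, p_artificial_code, eps=1e-12):
    """Phase 2 (assumed binary cross-entropy) for one discriminator.

    Both arguments are the discriminator's estimated probability that a
    code was produced by the encoder rather than generated artificially.
    """
    return -(math.log(p_encoder_code + eps)
             + math.log(1.0 - p_artificial_code + eps)) / 2

def encoder_fooling_loss(p_encoder_code, eps=1e-12):
    """Phase 2, encoder side (assumed): reward codes the discriminator
    cannot tell apart from artificial ones."""
    return -math.log(1.0 - p_encoder_code + eps)

def emotion_loss(latent, labels):
    """Phase 3 (assumed squared error): emotion nodes versus the labels."""
    nodes = latent[:len(EMOTION_DIMS)]
    return sum((z - labels[name]) ** 2
               for z, name in zip(nodes, EMOTION_DIMS)) / len(EMOTION_DIMS)

# Toy usage with made-up numbers.
x, x_hat = [0.2, 0.4, 0.6, 0.8], [0.2, 0.5, 0.6, 0.7]
latent = [0.5, -0.2, 0.1, 0.9]
labels = {"valence": 0.5, "activation": -0.2, "dominance": 0.1}
for name, tuned in PHASES:
    print(f"{name}: tunes {tuned}")
print(reconstruction_loss(x, x_hat), emotion_loss(latent, labels))
```

A discriminator that outputs 0.5 for every code is guessing, which is the point at which the encoder’s representations have become indistinguishable from the artificial ones.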
During experiments involving sentence-level feature representations “hand-engineered” to capture information about speech signals, the researchers report that their AI system achieved 3% better accuracy in assessing valence than a conventionally trained network. Moreover, they say that when the network was supplied a sequence of representations for the acoustic characteristics of 20-millisecond frames, or audio snippets, the improvement was 4%.
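For a sense of what a sequence of 20-millisecond frames looks like, the sketch below cuts a block of audio samples into consecutive frames. The 16 kHz sample rate, the absence of overlap between frames, and the function name are assumptions; the article does not say how the researchers framed their audio.

```python
def split_into_frames(samples, sample_rate_hz=16000, frame_ms=20):
    """Split samples into consecutive, non-overlapping frames.

    Any trailing samples that do not fill a whole frame are dropped.
    """
    frame_len = sample_rate_hz * frame_ms // 1000  # samples per frame
    if frame_len <= 0:
        raise ValueError("frame length must be at least one sample")
    n_frames = len(samples) // frame_len
    return [samples[i * frame_len:(i + 1) * frame_len]
            for i in range(n_frames)]

# One second of silence at the assumed 16 kHz sample rate.
one_second = [0.0] * 16000
frames = split_into_frames(one_second)
print(len(frames), len(frames[0]))  # 50 frames of 320 samples each
```

Under these assumptions, a one-second utterance becomes 50 frames, and the network would receive one acoustic representation per frame rather than a single sentence-level summary.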
It’s worth noting that Amazon isn’t the only company investigating improved speech-based emotion detection. MIT Media Lab spinoff Affectiva recently demonstrated a neural network, SoundNet, that can classify anger from audio data in as little as 1.2 seconds (just over the time it takes for humans to perceive anger), regardless of the speaker’s language. Meanwhile, startup Cogito’s AI is used by the U.S. Department of Veterans Affairs to analyze the voices of military veterans with PTSD to determine if they need immediate help.