Amazon’s Alexa can detect whispered speech, which is how it knows when to whisper back. But what about AI that’s capable of sussing out frustration? Enter MIT Media Lab spinoff Affectiva’s neural network, SoundNet, which can classify anger from audio data in as little as 1.2 seconds regardless of the speaker’s language. That is only slightly longer than it takes humans to perceive anger.
Affectiva’s researchers describe it (“Transfer Learning From Sound Representations For Anger Detection in Speech”) in a newly published paper on the preprint server arXiv.org. It builds on the company’s wide-ranging efforts to establish emotional profiles from both speech and facial data, which this year spawned an AI in-car system codeveloped with Nuance that detects signs of driver fatigue from camera feeds. In December 2017, the company launched its Speech API, which uses voice to recognize things like laughing, anger, and other emotions, along with voice volume, tone, speed, and pauses.
“[A] significant problem in harnessing the power of deep learning networks for emotion recognition is the mismatch between a large amount of data required by deep networks and the small size of emotion-labeled speech datasets,” the paper’s coauthors wrote. “[O]ur trained anger detection model improves performance and generalizes well on a variety of acted, elicited, and natural emotional speech datasets. Furthermore, our proposed system has low latency suitable for real-time applications.”
SoundNet consists of a convolutional neural network — a type of neural network commonly applied to analyzing visual imagery — trained on a video dataset. To get it to recognize anger in speech, the team first sourced a large amount of general audio data — two million videos, or just over a year’s worth — with ground truth produced by another model. Then, they fine-tuned it with a smaller dataset, IEMOCAP, containing 12 hours of annotated audiovisual emotion data including video, speech, and text transcriptions.
To test the AI model’s generalizability, the team evaluated its English-trained model on Mandarin Chinese speech emotion data (the Mandarin Affective Speech Corpus, or MASC). They report that it not only generalized well to English speech data but was also effective on the Chinese data, albeit with a slight degradation in performance.
The researchers say their results show that an “effective” and “low-latency” speech emotion recognition model can be significantly improved with transfer learning, a technique that uses an AI system trained on a large dataset of previously annotated samples to bootstrap training in a new domain where data is sparse. In this case, the starting point was an AI system trained to classify general sounds.
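To make the transfer learning idea concrete, here is a minimal sketch in plain Python. It is not Affectiva's code or the paper's architecture: the `pretrained_features` function is a hypothetical stand-in for a frozen network pretrained on general sounds, the "clips" are synthetic noise, and the small classifier trained on top is ordinary logistic regression. It only illustrates the pattern of reusing fixed features and fitting a small head on a handful of labeled examples.

```python
import math
import random


def pretrained_features(clip):
    """Hypothetical stand-in for a frozen, pretrained audio network.

    A real system would pass the clip through layers learned on a large
    general-sound dataset. Here we compute two fixed statistics instead:
    mean absolute amplitude and zero-crossing rate.
    """
    energy = sum(abs(x) for x in clip) / len(clip)
    crossings = sum(1 for a, b in zip(clip, clip[1:]) if a * b < 0)
    return [energy, crossings / (len(clip) - 1)]


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def train_head(features, labels, epochs=500, lr=0.5):
    """Fit a logistic-regression head on top of the frozen features (SGD)."""
    weights = [0.0] * len(features[0])
    bias = 0.0
    for _ in range(epochs):
        for x, y in zip(features, labels):
            p = sigmoid(sum(w * v for w, v in zip(weights, x)) + bias)
            err = p - y  # gradient of the log loss with respect to the logit
            weights = [w - lr * err * v for w, v in zip(weights, x)]
            bias -= lr * err
    return weights, bias


def predict(weights, bias, clip):
    """Return True if the head labels the clip as the positive class."""
    x = pretrained_features(clip)
    return sigmoid(sum(w * v for w, v in zip(weights, x)) + bias) >= 0.5


def make_clip(rng, loud, n=200):
    """Synthetic toy clip (assumption): 'angry' clips are just louder noise."""
    scale = 1.0 if loud else 0.2
    return [rng.uniform(-scale, scale) for _ in range(n)]


rng = random.Random(0)

# Small labeled set, standing in for a scarce emotion-labeled corpus.
train_labels = [1, 0] * 10
train_clips = [make_clip(rng, bool(y)) for y in train_labels]
train_feats = [pretrained_features(c) for c in train_clips]
weights, bias = train_head(train_feats, train_labels)

# Held-out clips to check that the head generalizes on this toy data.
test_labels = [1, 0] * 5
test_clips = [make_clip(rng, bool(y)) for y in test_labels]
correct = sum(
    predict(weights, bias, c) == bool(y)
    for c, y in zip(test_clips, test_labels)
)
accuracy = correct / len(test_labels)
print(f"held-out accuracy: {accuracy:.2f}")
```

Only the two weights and the bias are learned here; the feature extractor stays fixed. That is why this pattern can work with few labeled samples, though real fine-tuning may also update some of the pretrained layers.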
“This result is promising because while emotion speech datasets are small and expensive to obtain, massive datasets for natural sound events are available, such as the dataset used to train SoundNet or Google’s AudioSet. These two datasets alone have about 15 thousand hours of labeled audio data,” the team wrote. “[Anger classification] has many useful applications, including conversational interfaces and social robots, interactive voice response (IVR) systems, market research, customer agent assessment and training, and virtual and augmented reality.”
They leave to future work tapping other large publicly available corpora, and training AI systems for related speech-based tasks, such as recognizing other types of emotions and affective states.
Affectiva’s not the only company investigating speech-based emotion detection. Startup Cogito’s AI is used by the U.S. Department of Veterans Affairs to analyze the voices of military veterans with PTSD to determine if they need immediate help.