When you’re watching the Super Bowl between the Rams and Patriots on Sunday, your Echo speakers won’t inadvertently respond to the wake word “Alexa” when someone on the television shouts it out — even during Amazon’s splashy new Alexa commercial starring Harrison Ford and Forest Whitaker. It might seem like a small thing, but the science behind getting Amazon’s intelligent assistant to ignore its name is a bit more involved than you might think. (It wasn’t that long ago, after all, that a Burger King ad prompted smart speakers around the country to search for the ingredients in a Whopper sandwich.)
In a blog post this morning, Amazon details acoustic fingerprinting, a technique Alexa AI research scientists at the Seattle company use to “teach” Alexa what individual instances of its name sound like, so that the assistant can ignore them. The work is applied on the fly to detect when multiple Alexa-enabled devices around the globe hear the same command at roughly the same time — which Mike Rodehorst, machine learning scientist at Amazon’s Alexa Speech division, says is needed to prevent Alexa from responding to pranks, references to people named Alexa, and other TV mentions Amazon isn’t made aware of in advance.
“Our approach to matching audio recordings is based on classic acoustic-fingerprinting algorithms, like that of [Jaap] Haitsma and [Ton] Kalker in their 2002 paper ‘A Highly Robust Audio Fingerprinting System’,” he said. “Such algorithms are designed to be robust to audio distortion and interference, such as those introduced by TV speakers, the home environment, and our microphones.”
So how does it work? Fingerprinting starts by deriving log filter-bank energies, or LFBEs, for an acoustic signal: a grid whose cells represent the amount of energy in overlapping frequency bands across a series of overlapping time windows. An algorithm steps through the grid in two-by-two blocks, computing a 2D gradient for each block as it goes. The sign of each gradient, positive or negative, summarizes that block in a single bit, and together those bits constitute the acoustic fingerprint.
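The bit-extraction step can be sketched in a few lines of NumPy. The article doesn't give Amazon's exact gradient definition, so this follows the Haitsma and Kalker scheme it cites; the band count and frame count are illustrative:

```python
import numpy as np

def fingerprint_bits(lfbe):
    """Turn a grid of log filter-bank energies into fingerprint bits.

    ``lfbe`` is a 2D array of shape (n_bands, n_frames): rows are
    overlapping frequency bands, columns are overlapping time windows.
    Following Haitsma & Kalker (2002), each two-by-two block yields one
    bit: the sign of E[b, t] - E[b+1, t] - (E[b, t-1] - E[b+1, t-1]).
    """
    # 2D gradient over every two-by-two block of adjacent bands/frames.
    grad = (lfbe[:-1, 1:] - lfbe[1:, 1:]) - (lfbe[:-1, :-1] - lfbe[1:, :-1])
    # One bit per block: 1 if the gradient is positive, else 0.
    return (grad > 0).astype(np.uint8)
```

A 20-band, 50-frame energy grid thus yields a 19-by-49 array of bits, one per two-by-two block.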
When two fingerprints differ in a small enough fraction of their bits, they're deemed to match, and Alexa ignores the wake word.
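The comparison itself reduces to a bit-error rate. Amazon doesn't disclose its threshold; the 0.35 default below is the value from Haitsma and Kalker's paper, used here purely as an illustrative stand-in:

```python
import numpy as np

def is_match(fp_a, fp_b, max_bit_error_rate=0.35):
    """Decide whether two equal-shape fingerprint bit arrays match.

    The threshold is an assumption borrowed from Haitsma & Kalker
    (2002); the article doesn't reveal Amazon's actual value.
    """
    # Fraction of positions where the two fingerprints disagree.
    bit_error_rate = np.mean(fp_a != fp_b)
    return bool(bit_error_rate < max_bit_error_rate)
```

Identical fingerprints have a bit-error rate of zero and always match; unrelated audio tends toward a rate near 0.5, well above the threshold.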
Amazon fingerprints entire audio samples when they’re provided in advance, and the results are stored in the cloud. The company also “builds up” acoustic fingerprints piecemeal with audio that’s streaming to the cloud from Alexa-enabled devices, repeatedly comparing these fingerprints to others as they grow. (Clean audio is easier to process than audio with lots of background noise, Amazon says; the latter can yield a match, but it requires more data.)
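The piecemeal build-up might look something like the sketch below. The article only says that growing fingerprints are repeatedly compared as audio streams in; the class interface, prefix-comparison strategy, and minimum-length guard here are all assumptions:

```python
import numpy as np

class StreamingFingerprint:
    """Hypothetical sketch of a fingerprint built up from streamed audio."""

    def __init__(self, n_rows=19):
        self.bits = np.empty((n_rows, 0), dtype=np.uint8)

    def extend(self, new_bits):
        # Append bits derived from the latest chunk of streamed audio.
        self.bits = np.concatenate([self.bits, new_bits], axis=1)

    def matches(self, reference, max_bit_error_rate=0.35, min_frames=10):
        # Compare only the prefix both fingerprints cover so far.
        # Noisier audio needs more data before a match is trustworthy,
        # so require a minimum number of frames (value illustrative).
        n = min(self.bits.shape[1], reference.shape[1])
        if n < min_frames:
            return False
        ber = np.mean(self.bits[:, :n] != reference[:, :n])
        return bool(ber < max_bit_error_rate)
```

Each time more audio arrives, `extend` grows the fingerprint and `matches` can be re-run against candidate fingerprints, mirroring the repeated comparisons the article describes.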
Every incoming audio request to Alexa that starts with a wake word is checked in two ways. First, it's compared to a database of known, pre-fingerprinted instances of "Alexa," a comparison that also draws on the audio that follows the wake word. Then it's checked against a fraction of the other requests reaching Alexa devices around the same time. A request whose audio matches requests from at least two other customers is identified as a "media event" and given increased scrutiny (and potentially declared a match). Matches feed a small cache of fingerprints, which lets Alexa keep ignoring the same wake word even when later requests don't arrive simultaneously.
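Put together, the two-stage cloud check might be sketched as follows. The function name, the "respond"/"ignore" outcomes, the peer count, and the threshold are all illustrative assumptions, not Amazon's actual parameters or data structures:

```python
import numpy as np

def differ_fraction(a, b):
    # Fraction of bits on which two fingerprints disagree.
    return np.mean(a != b)

def check_request(fp, known_media, concurrent, cache,
                  threshold=0.35, min_peers=2):
    """Illustrative sketch of the two-stage check the article describes."""
    # Stage 1: known, pre-fingerprinted media (e.g. planned ads).
    if any(differ_fraction(fp, k) < threshold for k in known_media):
        return "ignore"
    # Stage 2: a sample of requests arriving at about the same time.
    peers = sum(differ_fraction(fp, r) < threshold for r in concurrent)
    if peers >= min_peers:
        # Matching at least two other customers marks a "media event";
        # cache it so later, non-simultaneous repeats are also ignored.
        cache.append(fp)
        return "ignore"
    # Finally, consult the cache of recently identified media events.
    if any(differ_fraction(fp, c) < threshold for c in cache):
        return "ignore"
    return "respond"
```

A request that triggers a media event seeds the cache, so an identical wake word arriving a minute later, with no concurrent peers, is still suppressed.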
These fingerprinting methods — for which Amazon has patents — will together prevent as many as 80 to 90 percent of devices from responding to TV-originated Alexa statements, the company says. And they’re not the only precautionary measures in place.
On most Echo devices, every time the wake word “Alexa” is detected, the audio is compared to a small set of known instances of Alexa being mentioned in commercials. Rodehorst says that the set is generally restricted to ads the Alexa team expects to be currently airing, due to the limits of the smart speakers’ CPUs.
“Ideally, a device will identify media audio using locally stored fingerprints, so it does not wake up at all,” he says. “If it does wake up, and we match the media event in the cloud, the device will quickly and quietly turn back off.”
Separately, Amazon scientists continue to refine machine learning techniques that help Alexa distinguish sounds produced by televisions and other devices from those produced by people in the flesh. In research published last year, they describe an AI system that learns the frequency characteristics of different types of sounds and analyzes the time it takes for sounds to reach multiple microphones within an Echo speaker. This enables the system to tell the difference between moving sound sources and stationary ones, for example.
In tests, a system trained using 311 hours of recordings from volunteers improved Alexa's media audio recognition by between 8 percent and 37 percent, depending on the audio type, Amazon says. The system reportedly performed best on audio combinations that involved singing.