Unless you’ve been living under a rock, you’ve probably run across the Vocabulary.com audio clip that kicked off a social media “Laurel” or “Yanny” firestorm this week. Perhaps you even weighed in, offering your two cents on the elocution of the opera singer (a member of the original Broadway cast of Cats, as it turns out) in the recording. But you probably didn’t consult artificial intelligence for a second opinion. Well, not to worry: Nuance and Voxbone have saved you the trouble.
Nuance Communications, a company that specializes in natural language processing, fed its Dragon speech platform the “Laurel” or “Yanny” audio clip to put an end to the debate once and for all. According to Nils Lenke, senior director of research at Nuance, it heard “Laurel.”
The software that Voxbone tested, which included speech recognition engines from VoiceBase, CallMiner, and Gridspace, didn’t recognize “Laurel” or “Yanny,” even after three tests in a row. The first time around, the engines transcribed the audio as “well, well, well” or “yeah, yeah, yeah.” Engineers tried changing the dialog setting from English to Irish, Spanish, and other languages, but to no avail: the engines still heard the clip as “well, well, well.” (Eventually, the team got the algorithms to recognize “Laurel.”)
What do you hear?! Yanny or Laurel pic.twitter.com/jvHhCbMc8I
— Cloe Feldman (@CloeCouture) May 15, 2018
In my testing, some voice assistants fared better than others. The Google Assistant (running on a Motorola Moto G5 Plus) interpreted the word as “Mary, Mary” and “yeah, yeah,” while Microsoft’s Cortana (on my PC) recognized “Laurel” immediately. (I didn’t have an iPhone handy, so the jury is out on Siri.)
Others had better luck. Joe Murphy, CEO and founder of virtual assistant analytics startup vocalize.ai, got the Google Assistant (on a Google Home speaker) and Amazon Alexa (on an Echo speaker) to detect “Laurel.”
“In the lab, we had to set up the direction of sound and timing of utterance just right, but when we got it both Alexa and Google detected Laurel,” Murphy wrote in an email.
Why the disparity between the platforms? Assuming all else is equal, it has to do with the way voice recognition algorithms work. Transcription software like Nuance’s Dragon and the engines Voxbone tested, not to mention voice assistants like Apple’s Siri, Google Assistant, and Microsoft’s Cortana, break human speech down into small units of sound called phonemes. Algorithms analyze the order of these phonemes to pair spoken words with text, taking into account the syntax and context of those words in ambiguous cases.
Simple enough, right? Not so fast. In some voice recognition setups, programmers have to manually connect the speech patterns of words with text. The algorithms, then, are only as good as their word bank: If a word or word association isn’t in the database, it won’t be transcribed properly. (Such was likely the case with Voxbone’s system.)
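To make the word bank problem concrete, here is a minimal Python sketch of a phoneme-to-word lookup. Everything in it is an assumption for illustration: the tiny lexicon, the ARPAbet-style phoneme spellings (including the guess at how “Yanny” might be segmented), and the nearest-match fallback. It is not how Nuance’s Dragon or the engines Voxbone tested actually work; production systems rely on statistical acoustic and language models rather than a simple dictionary.

```python
# Toy pronunciation lexicon: ARPAbet-style phoneme tuples mapped to words.
# The entries and phoneme spellings are illustrative assumptions.
LEXICON = {
    ("L", "AO", "R", "AH", "L"): "laurel",
    ("W", "EH", "L"): "well",
    ("Y", "AE"): "yeah",
    ("M", "EH", "R", "IY"): "mary",
}


def edit_distance(a, b):
    """Levenshtein distance between two phoneme sequences."""
    prev = list(range(len(b) + 1))
    for i, pa in enumerate(a, start=1):
        curr = [i]
        for j, pb in enumerate(b, start=1):
            cost = 0 if pa == pb else 1
            curr.append(min(prev[j] + 1,          # drop a phoneme from a
                            curr[j - 1] + 1,      # add a phoneme from b
                            prev[j - 1] + cost))  # match or substitute
        prev = curr
    return prev[-1]


def transcribe(phonemes, lexicon=LEXICON):
    """Return the lexicon word whose pronunciation is closest to the input."""
    phonemes = tuple(phonemes)
    if phonemes in lexicon:
        return lexicon[phonemes]  # exact pronunciation is in the word bank
    # Out-of-vocabulary input: fall back to the nearest known pronunciation.
    closest = min(lexicon, key=lambda known: edit_distance(phonemes, known))
    return lexicon[closest]


print(transcribe(["L", "AO", "R", "AH", "L"]))  # word is in the lexicon
print(transcribe(["Y", "AE", "N", "IY"]))       # "yanny" is not in the lexicon
```

Because this toy lexicon has no entry for “Yanny,” the second call can only return a word it already knows, and it lands on “yeah,” the closest pronunciation in the table. Remove the “laurel” entry and the first call would also come back as some other word, which is the failure mode described above in miniature.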
It just goes to show that algorithms, like humans, bring their own biases to the table.