May 18, 2018

Unless you’ve been living under a rock, you’ve probably run across the Vocabulary.com audio clip that kicked off a social media “Laurel” or “Yanny” firestorm this week. Perhaps you even weighed in, offering your two cents on the elocution of the opera singer (a member of the original Broadway cast of Cats, as it turns out) in the recording. But you probably didn’t consult artificial intelligence for a second opinion. Well, not to worry: Nuance and Voxbone saved you the trouble.

Nuance Communications, a company that specializes in natural language processing, fed its Dragon speech platform the “Laurel” or “Yanny” audio clip to put an end to the debate once and for all. According to Nils Lenke, senior director of research at Nuance, it heard “Laurel.”

Voxbone’s software recognized neither “Laurel” nor “Yanny,” even after three tests in a row. The first time around, its speech recognition tech transcribed the audio as “well, well, well” or “yeah, yeah, yeah.” Engineers tried changing the language setting from English to Irish, Spanish, and other languages, but to no avail: it still heard the clip as “well, well, well.”

What do you hear?! Yanny or Laurel pic.twitter.com/jvHhCbMc8I — Cloe Feldman (@CloeCouture) May 15, 2018

In my informal testing, some voice assistants fared better than others. The Google Assistant (running on a Motorola Moto G5 Plus) interpreted the word as “mary mary” and “yeah yeah,” while Microsoft’s Cortana (on my PC) recognized “yanny” immediately. (I didn’t have an iPhone handy, so the jury is out on Siri.)

Why the disparity between the platforms? Assuming all else is equal, it has to do with the way voice recognition algorithms work. Transcription apps from Nuance and Voxbone, not to mention voice assistants like Apple’s Siri, the Google Assistant, and Microsoft’s Cortana, break human speech down into phonemes, the smallest units of sound that distinguish one word from another. Algorithms analyze the order of these phonemes to pair spoken words with text, taking into account the syntax and context of those words in ambiguous cases.

Simple enough, right? Not so fast. In some voice recognition setups, programmers have to manually connect the speech patterns of words with text. The algorithms, then, are only as good as their word bank: if a word or word association isn’t in the database, it won’t be transcribed properly. (Such was likely the case with Voxbone’s system.)
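To make the word bank idea concrete, here is a minimal Python sketch. It is a toy, not how Nuance, Voxbone, or any voice assistant actually works: the lexicon, the phoneme spellings (loosely ARPAbet-style), and the `transcribe` function are all hypothetical, and the similarity fallback is a stand-in for the statistical models real systems use. It only illustrates why a word missing from the database can come out as something else entirely.

```python
from difflib import SequenceMatcher

# Hypothetical toy word bank: phoneme sequences mapped to words.
# Neither "laurel" nor "yanny" is in it.
LEXICON = {
    ("W", "EH", "L"): "well",
    ("Y", "EH"): "yeah",
    ("M", "EH", "R", "IY"): "mary",
}


def transcribe(phonemes, lexicon=LEXICON):
    """Return the word whose phoneme sequence best matches the input."""
    key = tuple(phonemes)
    if key in lexicon:
        # Exact match: the word is in the word bank.
        return lexicon[key]
    # Unknown word: fall back to the most similar known entry.
    best = max(
        lexicon,
        key=lambda known: SequenceMatcher(None, known, key).ratio(),
    )
    return lexicon[best]


# Assumed phoneme spelling of "laurel", for illustration only.
LAUREL = ["L", "AO", "R", "AH", "L"]

print(transcribe(LAUREL))  # prints "well"

# Once the word is added to the word bank, it is recognized.
bigger = dict(LEXICON)
bigger[tuple(LAUREL)] = "laurel"
print(transcribe(LAUREL, bigger))  # prints "laurel"
```

With the small word bank, the closest known entry to the “Laurel” phonemes is “well,” so that is what comes out. Add the word, and the same audio is transcribed correctly.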

It just goes to show that algorithms, not just humans, bring their own biases to the table.