Why companies like Amazon manually review voice data

Last week, Bloomberg revealed unsavory details about Alexa's ongoing development that were known within some circles but hadn't previously been reported widely: Amazon employs thousands of contract workers in Boston, Costa Rica, India, Romania, and other locations to annotate thousands of hours of audio each day from devices powered by its assistant. "We take the security and privacy of our customers' personal information seriously," an Amazon spokesman told the publication, adding that customers can opt not to supply their voice recordings for feature development.

Bloomberg notes that Amazon doesn't make clear in its marketing and privacy policy materials that it reserves some audio recordings for manual review. But what about other companies?

Manual review: a necessary evil?

Today, most speech recognition systems are aided by deep neural networks -- layers of neuron-like mathematical functions that self-improve over time -- that predict phonemes, or perceptually distinct units of sound. Unlike automatic speech recognition (ASR) techniques of old, which relied on hand-tuned statistical models, deep neural nets translate sound in the form of segmented spectrograms, or representations of the spectrum of frequencies of sound, into characters.
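
To make the idea of a segmented spectrogram more concrete, here is a minimal sketch that slices a signal into overlapping frames and measures how much energy each frame carries at each frequency. The function name, the frame size, the hop length, and the synthetic 1,000 Hz test tone are all assumptions chosen for illustration; production systems rely on optimized FFT libraries rather than the naive transform used here.

```python
import cmath
import math


def spectrogram(samples, frame_size=64, hop=32):
    """Return one magnitude spectrum per overlapping frame of the signal."""
    frames = []
    for start in range(0, len(samples) - frame_size + 1, hop):
        frame = samples[start:start + frame_size]
        # Taper the frame edges with a Hann window to reduce spectral leakage.
        windowed = [
            s * 0.5 * (1 - math.cos(2 * math.pi * n / (frame_size - 1)))
            for n, s in enumerate(frame)
        ]
        # Naive discrete Fourier transform, keeping non-negative frequencies.
        spectrum = []
        for k in range(frame_size // 2 + 1):
            acc = sum(
                x * cmath.exp(-2j * math.pi * k * n / frame_size)
                for n, x in enumerate(windowed)
            )
            spectrum.append(abs(acc))
        frames.append(spectrum)
    return frames


# Illustrative input: a pure 1,000 Hz tone sampled at 8,000 Hz.
sample_rate = 8000
tone = [math.sin(2 * math.pi * 1000 * t / sample_rate) for t in range(256)]

spec = spectrogram(tone)
peak_bin = max(range(len(spec[0])), key=lambda k: spec[0][k])
peak_hz = peak_bin * sample_rate / 64  # each bin spans sample_rate / frame_size Hz
```

Each row of `spec` is one time slice and each column is one frequency band. In a pipeline like the one described above, a grid of this kind, rather than the raw waveform, would be what the neural network receives.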

Joe Dumoulin, chief technology innovation officer at Next IT, told Ars Technica in an interview that it takes 30-90 days to build a query-understanding module for a single language, depending on how many intents it needs to cover. That's because during a typical chat with an assistant, users often invoke multiple voice apps in successive questions, and these apps repurpose variables like "town" and "city." If someone asks for directions and follows up with a question about a restaurant's location, a well-trained assistant needs to be able to suss out which thread to reference in its answer.

Moreover, most speech synthesis systems tap a database of phones -- distinct speech sounds -- strung together to verbalize words. Concatenation, as it's called, requires capturing the complementary diphones (units of speech comprising two connected halves of phones) and triphones (phones with half of a preceding phone at the beginning and a succeeding phone at the end) in lengthy recording sessions. The number of speech units can easily exceed a thousand; in a recent experiment, researchers at Alexa developed an acoustic model using 7,000 hours of manually annotated data. The open source LibriSpeech corpus contains over 1,000 hours of spoken English derived from audiobook recordings, while Mozilla's Common Voice data set comprises over 1,400 hours of speech from 42,000 volunteer contributors across 18 languages.
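
A rough sketch shows why the number of units climbs so quickly. The helpers below list the diphone and triphone contexts for a single word; the ARPAbet-style phone symbols, the "sil" boundary marker, and the figure of roughly 40 phones for English are assumptions made for illustration, not details from any particular system.

```python
def diphones(phones, boundary="sil"):
    """Pair each phone with its successor, padding the word with silence."""
    padded = [boundary] + list(phones) + [boundary]
    return [f"{a}-{b}" for a, b in zip(padded, padded[1:])]


def triphones(phones, boundary="sil"):
    """Describe each phone together with its left and right neighbors."""
    padded = [boundary] + list(phones) + [boundary]
    return [
        f"{left}-{center}+{right}"
        for left, center, right in zip(padded, padded[1:], padded[2:])
    ]


# Illustrative transcription of the word "cat".
cat = ["k", "ae", "t"]
cat_diphones = diphones(cat)    # ['sil-k', 'k-ae', 'ae-t', 't-sil']
cat_triphones = triphones(cat)  # ['sil-k+ae', 'k-ae+t', 'ae-t+sil']

# Upper bounds on distinct units, assuming an inventory of about 40 phones.
inventory_size = 40
max_diphones = inventory_size ** 2   # 1,600
max_triphones = inventory_size ** 3  # 64,000
```

Not every combination occurs in a real language, so these figures are ceilings rather than counts, but they suggest how an inventory of a few dozen sounds turns into thousands of units to record and label.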

"As much as we want to believe that there have been breakthrough advances in Artificial Intelligence many of the most advanced implementations of this technology, like Alexa, require a human in the loop," University of Washington assistant professor Nicholas Weber told VentureBeat in an email. "Of course, human intervention is necessary for verification and validation of the AI's reasoning. Many of us implicitly know this, but there are large numbers of the population that don't know AI's limitations."

Viewed through the lens of privacy, though, the difference between openly published corpora like LibriSpeech and Common Voice and the voice samples Amazon's contract workers handle is quite stark, according to Mayank Varia, a research associate professor at Boston University. In an email exchange with VentureBeat, he said that Amazon's description of those samples stretches the definition of "anonymized."

"When [an] Amazon spokesperson says 'employees do not have direct access to information that can identify the person,' what they likely mean is that when Amazon provides the worker with a copy of your audio recording, they do not also provide your Amazon username or any other identifier along with the sound clip," he said via email. "But in some sense this is inconsequential: The sound clip probably reveals more about you than your Amazon username would. In particular, you could be having a conversation in which you say your name.

"I highly doubt Amazon would bother to scrub that from the audio before handing it to their workers," Varia added.

Privacy-preserving ways to collect speech data

Some companies handle voice collection more delicately than others, clearly. But is it necessary to begin with? Might there be a better, less invasive means of improving automatic voice recognition models? Varia believes so.

"It is possible (and increasingly somewhat feasible) to transform any existing automated system into a privacy-preserving and automated system, using technologies like secure multiparty computation (MPC) or homomorphic encryption," he said.

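One of the simplest building blocks of secure multiparty computation is additive secret sharing, sketched below under stated assumptions: three non-colluding servers, a fixed prime modulus, and two made-up private values. The function names are hypothetical, and real MPC frameworks layer many more protocols on top of this idea.

```python
import secrets

MODULUS = 2**61 - 1  # assumed modulus; any value larger than the data works


def share(value, n_parties=3):
    """Split a value into random-looking shares that sum to it modulo MODULUS."""
    shares = [secrets.randbelow(MODULUS) for _ in range(n_parties - 1)]
    shares.append((value - sum(shares)) % MODULUS)
    return shares


def reconstruct(shares):
    """Recombine shares; fewer than all of them reveal nothing about the value."""
    return sum(shares) % MODULUS


# Two users split their private numbers across three servers.
a_shares = share(42)
b_shares = share(100)

# Each server adds the two shares it holds without seeing either input.
sum_shares = [(a + b) % MODULUS for a, b in zip(a_shares, b_shares)]

total = reconstruct(sum_shares)  # 142
```

No single server ever sees 42 or 100, yet the combined result is exact. Addition is the easy case; multiplication and comparisons require extra rounds of communication between the parties, which is part of why Varia describes the approach as only "increasingly somewhat feasible."
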
There's been some progress on that front. In March, Google debuted TensorFlow Privacy, an open source library for its TensorFlow machine learning framework that's designed to make it easier for developers to train AI models with strong privacy guarantees. Specifically, it optimizes models by using a modified stochastic gradient descent technique -- the iterative method for optimizing the objective functions in AI systems -- that clips each of the updates induced by training data examples, averages the clipped updates together, then adds anonymizing noise to the final average.
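
The clip, average, and add-noise recipe can be sketched in a few lines of plain Python. This is an illustration of the general technique, not TensorFlow Privacy's API: the function names, the clipping norm, and the noise multiplier are assumptions, and the Gaussian noise is added to the sum of clipped updates before dividing by the batch size, which is equivalent to adding proportionally smaller noise to the average.

```python
import math
import random


def clip(grad, max_norm):
    """Scale a gradient down so its L2 norm is at most max_norm."""
    norm = math.sqrt(sum(g * g for g in grad))
    scale = min(1.0, max_norm / norm) if norm > 0 else 1.0
    return [g * scale for g in grad]


def private_average(per_example_grads, max_norm, noise_multiplier, rng):
    """Clip each example's gradient, sum, add Gaussian noise, then average."""
    clipped = [clip(g, max_norm) for g in per_example_grads]
    dim = len(clipped[0])
    summed = [sum(g[i] for g in clipped) for i in range(dim)]
    # The noise scale is tied to max_norm, the most any one example can add.
    noisy = [s + rng.gauss(0.0, noise_multiplier * max_norm) for s in summed]
    return [v / len(clipped) for v in noisy]


rng = random.Random(0)
grads = [[3.0, 4.0], [0.0, 0.5]]  # made-up per-example gradients

# With the noise switched off, only the clipping is visible.
no_noise = private_average(grads, max_norm=1.0, noise_multiplier=0.0, rng=rng)

# With noise on, any single example's influence is masked.
with_noise = private_average(grads, max_norm=1.0, noise_multiplier=1.1, rng=rng)
```

Clipping bounds how far any single recording can pull the model, and the noise is calibrated to that bound, so the trained model looks statistically similar whether or not a given example was included.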

TensorFlow Privacy can prevent the memorization of rare details, Google says, and guarantee that two machine learning models are indistinguishable whether or not a user's data was used in their training.

In a somewhat related development, late last year Intel open-sourced HE-Transformer, a "privacy-preserving" tool that allows AI systems to operate on sensitive data. It's a backend for nGraph, Intel's neural network compiler, and it's based on Microsoft Research's Simple Encrypted Arithmetic Library (SEAL).
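
To show what operating on encrypted data means in practice, the sketch below uses the Paillier cryptosystem, a classic additively homomorphic scheme. This is a deliberate substitution for illustration: SEAL is built on different, lattice-based schemes, and nothing here reflects HE-Transformer's or SEAL's actual interfaces. The function names are hypothetical and the tiny primes are insecure by design.

```python
import math
import secrets


def keygen(p=1009, q=1013):
    """Build a toy Paillier key pair from two small primes (insecure sizes)."""
    n = p * q
    lam = (p - 1) * (q - 1) // math.gcd(p - 1, q - 1)  # lcm(p - 1, q - 1)
    mu = pow(lam, -1, n)  # modular inverse; valid because the generator is n + 1
    return n, (lam, mu)


def encrypt(n, m):
    """Encrypt integer m (0 <= m < n) under public key n."""
    n2 = n * n
    while True:
        r = secrets.randbelow(n - 1) + 1
        if math.gcd(r, n) == 1:
            break
    return (pow(n + 1, m, n2) * pow(r, n, n2)) % n2


def decrypt(n, private_key, c):
    """Recover the plaintext from ciphertext c."""
    lam, mu = private_key
    x = pow(c, lam, n * n)
    return ((x - 1) // n * mu) % n


def add_encrypted(n, c1, c2):
    """Multiplying ciphertexts adds the underlying plaintexts."""
    return (c1 * c2) % (n * n)


n, private_key = keygen()
c_sum = add_encrypted(n, encrypt(n, 20), encrypt(n, 22))
result = decrypt(n, private_key, c_sum)  # 42
```

The party performing the addition never holds the private key and never sees 20, 22, or 42. Schemes of the kind SEAL implements extend this idea to both addition and multiplication, which is what makes running a neural network on encrypted inputs possible, at a considerable cost in speed.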

But Varia says that these and other crypto technologies aren't a magic bullet.

"[T]hey cannot transform a manual process into a computerized one," he said. "If Amazon believes that computers have already failed to classify these particular audio samples, then privacy-preserving computers won't fare any better."

Weber says that regardless, companies should be more transparent about their data collection and review processes, and that they should offer explanations for the limitations of their AI systems. Consumers agree, it would seem -- based on a survey of 4,500 people Episerver conducted late last year, 43% said they'd refrain from using voice-assisted devices like Alexa due to security concerns, and OpenVPN reports that 35% don't use an intelligent assistant because they feel it invades their privacy.

"We should understand when a human intervention is required, and on what grounds that decision is justified. We should not have to depend on a close reading of a terms of service document," Weber said. "[F]inally, technology companies should be proactive about AI that depends upon human-in-the-loop decision making -- even if that decision making is about quality assurance. They should offer [...] justifications rather than creating black box technologies and waiting for investigative journalists to uncover their [AI's] inner workings."

It's clear that manual annotation is here to stay -- at least for now. It's how data scientists at conglomerates like Amazon, Microsoft, and Apple improve the performance of voice assistants such as Alexa, Cortana, and Siri, and how they develop new features for those assistants and expand their language support. But even after privacy-preserving techniques like homomorphic encryption become the norm, transparency will remain the best policy. Without it, there can't be trust, and without trust, the smart speaker sitting on your kitchen counter becomes a little creepier than it was before.