In a study accepted to the 2020 International Conference on Machine Learning last week, researchers at the Chalmers University of Technology and the RISE Research Institutes of Sweden propose a privacy-preserving technique that learns to obfuscate attributes like gender in speech data. They use a model that’s trained to filter sensitive information in recordings and then generate new and private information independent of the filtered details, ensuring sensitive information remains hidden without sacrificing realism or utility.

Maintaining privacy with voice assistants is challenging, given that state-of-the-art AI techniques have been used to infer attributes like intention, gender, emotional state, and identity from timbre, pitch, and speaker style. Recent reporting revealed that accidental voice assistant activations exposed private conversations; the risk is such that law firms, including Mishcon de Reya, have advised staff to mute smart speakers when they talk about client matters at home. Google Assistant, Siri, Cortana, and other major voice recognition platforms allow users to delete recorded data, but this requires some effort, and in several cases substantial effort.

The researchers’ solution employs a generative adversarial network (GAN) called PCMelGAN, a two-part AI model consisting of a generator that creates samples and a discriminator that attempts to differentiate between the generated samples and real-world samples. It maps speech recordings to mel spectrograms, or representations of the spectrum of frequencies of the audio signal as it varies over time, and passes them through a filter that removes sensitive information and a generator that adds synthetic information in its place. PCMelGAN then inverts the mel spectrogram output into audio in the form of a raw waveform.
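
To make the first step of that pipeline concrete, here is a minimal sketch of how a waveform can be turned into a mel spectrogram. It is not the researchers' code: the function names, the frame size, the hop length, the number of mel bands, and the synthetic 1 kHz test tone are all assumptions chosen for illustration, and the Hz-to-mel formula is one common convention among several. In practice a signal processing library such as librosa would normally handle this step.

```python
import cmath
import math


def hz_to_mel(hz):
    """Convert Hz to mels using a common (HTK-style) formula."""
    return 2595.0 * math.log10(1.0 + hz / 700.0)


def mel_to_hz(mel):
    """Inverse of hz_to_mel."""
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)


def power_spectrum(frame):
    """Naive DFT power spectrum of one frame (bins 0 .. N/2)."""
    n = len(frame)
    spectrum = []
    for k in range(n // 2 + 1):
        acc = sum(frame[t] * cmath.exp(-2j * math.pi * k * t / n)
                  for t in range(n))
        spectrum.append(abs(acc) ** 2)
    return spectrum


def mel_filterbank(n_mels, n_fft, sample_rate):
    """Triangular filters whose centers are evenly spaced on the mel scale."""
    max_mel = hz_to_mel(sample_rate / 2)
    # n_mels + 2 points: left edge, n_mels centers, right edge
    edges_hz = [mel_to_hz(max_mel * i / (n_mels + 1))
                for i in range(n_mels + 2)]
    bin_hz = [k * sample_rate / n_fft for k in range(n_fft // 2 + 1)]
    bank = []
    for m in range(1, n_mels + 1):
        lo, center, hi = edges_hz[m - 1], edges_hz[m], edges_hz[m + 1]
        row = []
        for f in bin_hz:
            if lo < f <= center:
                row.append((f - lo) / (center - lo))   # rising slope
            elif center < f < hi:
                row.append((hi - f) / (hi - center))   # falling slope
            else:
                row.append(0.0)
        bank.append(row)
    return bank


def mel_spectrogram(signal, sample_rate, n_fft=64, hop=32, n_mels=8):
    """Return a list of frames, each a list of n_mels band energies."""
    bank = mel_filterbank(n_mels, n_fft, sample_rate)
    frames = []
    for start in range(0, len(signal) - n_fft + 1, hop):
        frame = signal[start:start + n_fft]
        # Hann window to reduce spectral leakage
        windowed = [x * (0.5 - 0.5 * math.cos(2 * math.pi * i / n_fft))
                    for i, x in enumerate(frame)]
        power = power_spectrum(windowed)
        frames.append([sum(w * p for w, p in zip(row, power))
                       for row in bank])
    return frames


# Hypothetical input: a short 1 kHz tone sampled at 8 kHz
sample_rate = 8000
tone = [math.sin(2 * math.pi * 1000 * t / sample_rate) for t in range(256)]
spec = mel_spectrogram(tone, sample_rate)
print(len(spec), "frames x", len(spec[0]), "mel bands")
```

Each row of `spec` is one time step and each column one mel band, which is the time-by-frequency grid the article describes. The filter and generator stages would operate on that grid, and a final stage would invert it back to a waveform.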

In experiments, the researchers trained PCMelGAN on 10,000 samples from the open source AudioMNIST data set, which comprises 30,000 audio recordings of the digits zero through nine spoken in the English language. They measured privacy by determining whether a classifier could predict a speaker’s gender with better than 50% accuracy after five runs on the spectrograms and the raw audio.
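
The privacy check described above can be sketched as follows: score an adversarial classifier's gender predictions over several runs and compare the average accuracy with chance. This is an illustrative sketch rather than the researchers' evaluation code. The function names, the labels, and the predictions are all hypothetical, and the 50% baseline assumes a balanced two-class label set.

```python
import statistics


def accuracy(predictions, labels):
    """Fraction of predictions that match the true labels."""
    correct = sum(p == y for p, y in zip(predictions, labels))
    return correct / len(labels)


def privacy_report(runs, labels, chance=0.5):
    """Summarize adversary accuracy across runs relative to chance."""
    accs = [accuracy(preds, labels) for preds in runs]
    mean_acc = statistics.mean(accs)
    spread = statistics.stdev(accs) if len(accs) > 1 else 0.0
    return {
        "per_run": accs,
        "mean_accuracy": mean_acc,
        "std": spread,
        # Close to zero means the adversary does no better than guessing
        "gap_over_chance": mean_acc - chance,
    }


# Hypothetical data: true gender labels (0/1) for ten filtered clips,
# and an adversary's predictions from five independent runs.
labels = [0, 1, 0, 1, 0, 1, 0, 1, 0, 1]
runs = [
    [0, 1, 0, 1, 0, 0, 1, 0, 1, 0],
    [0, 1, 0, 1, 0, 1, 1, 0, 1, 0],
    [1, 0, 1, 0, 1, 1, 0, 1, 0, 1],
    [0, 1, 0, 1, 1, 0, 1, 0, 1, 0],
    [0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
]

report = privacy_report(runs, labels)
print(report)
```

A mean accuracy near the chance level suggests the adversary cannot recover the attribute from the filtered audio, while a large positive gap would indicate that the attribute is still leaking.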

[Embedded audio samples: a recording of someone saying “four,” followed by PCMelGAN’s output, and a recording of someone saying “six,” followed by PCMelGAN’s output.]

According to the researchers, the results show PCMelGAN makes it empirically difficult for adversaries to, for example, infer a speaker’s gender but retains qualities such as intonation and content. “The proposed method can successfully obfuscate sensitive attributes in speech data and generates realistic speech independent of the sensitive input attribute. Our results for censoring the gender attribute on the AudioMNIST data set demonstrate that the method can maintain a high level of utility,” they wrote. “As more data is collected in various settings across organizations, companies, and countries, there has been an increase in the demand [for] user privacy.”

