Imperial College London researchers claim they’ve developed a voice analysis method that supports applications like speech recognition and identification while removing sensitive attributes such as emotion, gender, and health status. Their framework receives voice data and privacy preferences as auxiliary information and uses the preferences to filter out sensitive attributes which could otherwise be extracted from recorded speech.

Voice signals are a rich source of data, containing linguistic and paralinguistic information including age, likely gender, health status, personality, mood, and emotional state. This raises concerns in cases where raw data is transmitted to servers; attacks like attribute inference can reveal attributes not intended to be shared. In fact, the researchers assert attackers could use a speech recognition model to learn further attributes about users, leveraging the model’s outputs to train attribute-inferring classifiers. They posit such attackers could achieve attribute inference accuracy ranging from 40% to 99.4%, or three to four times better than guessing at random, depending on the acoustic conditions of the inputs.
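
To make the attack concrete, here is a minimal sketch of an attribute-inferring classifier trained on a model's outputs. It is not the researchers' method: the embeddings are synthetic toy data, the nearest-centroid classifier is a stand-in chosen for brevity, and every name in it (`make_toy_embeddings`, `fit_centroids`, `predict`) is hypothetical. It only illustrates how accuracy is compared with the chance level of one over the number of classes.

```python
import random

def make_toy_embeddings(n_per_class, n_classes, dim, rng):
    """Synthetic stand-ins for model outputs; each class shifts the mean (assumption)."""
    data = []
    for label in range(n_classes):
        for _ in range(n_per_class):
            vec = [rng.gauss(label * 3.0, 1.0) for _ in range(dim)]
            data.append((vec, label))
    return data

def fit_centroids(train, n_classes, dim):
    """Average the training vectors of each class to get one centroid per class."""
    sums = [[0.0] * dim for _ in range(n_classes)]
    counts = [0] * n_classes
    for vec, label in train:
        counts[label] += 1
        for i, v in enumerate(vec):
            sums[label][i] += v
    return [[s / counts[c] for s in sums[c]] for c in range(n_classes)]

def predict(centroids, vec):
    """Return the index of the centroid closest to vec (squared Euclidean distance)."""
    def sq_dist(c):
        return sum((a - b) ** 2 for a, b in zip(c, vec))
    return min(range(len(centroids)), key=lambda k: sq_dist(centroids[k]))

rng = random.Random(0)          # fixed seed so the toy run is repeatable
n_classes, dim = 4, 8           # e.g. four emotion labels; values are illustrative
train = make_toy_embeddings(40, n_classes, dim, rng)
test = make_toy_embeddings(10, n_classes, dim, rng)

centroids = fit_centroids(train, n_classes, dim)
accuracy = sum(predict(centroids, v) == y for v, y in test) / len(test)
chance = 1 / n_classes          # accuracy of guessing uniformly at random
print(f"accuracy={accuracy:.2f}, chance={chance:.2f}")
```

Because the toy classes are deliberately well separated, the classifier beats chance by a wide margin here; real-world results would depend on the acoustic conditions, as the researchers note.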

The team aims to limit the success of inference attacks with a two-phase approach. In the first phase, users set their privacy preferences, each of which is associated with tasks (for example, speech recognition) that can be performed on voice data. In the second phase, the framework learns disentangled representations in the voice data to derive dimensions reflecting the independent factors for a particular task. The framework can generate three output types: speech embeddings (i.e., numerical representations of speech), speaker embeddings (numerical representations of users), or speech reconstructions produced by concatenating the speech embeddings with synthetic identities.
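
The three output types can be sketched roughly as follows. This is an illustration under stated assumptions, not the framework's actual interface: the task names, the `release` and `synthetic_identity` functions, and the use of a random vector as a "synthetic identity" are all hypothetical, and a real system would pass the concatenated vector to a decoder that synthesizes audio.

```python
import random

def synthetic_identity(dim, seed):
    """Hypothetical stand-in: a random vector replacing the real speaker embedding."""
    rng = random.Random(seed)
    return [rng.gauss(0.0, 1.0) for _ in range(dim)]

def release(task, speech_emb, speaker_emb, seed=0):
    """Return only the representation that the chosen task needs (assumed mapping)."""
    if task == "speech_recognition":
        # Content only: speaker-related dimensions are withheld.
        return list(speech_emb)
    if task == "speaker_recognition":
        # Identity only: linguistic content is withheld.
        return list(speaker_emb)
    if task == "reconstruction":
        # Content concatenated with a synthetic identity instead of the real one.
        return list(speech_emb) + synthetic_identity(len(speaker_emb), seed)
    raise ValueError(f"unknown task: {task}")

# Toy embeddings; real ones would come from the disentangling model.
speech_emb = [0.2, -0.1, 0.7]
speaker_emb = [0.9, 0.4]

shared = release("reconstruction", speech_emb, speaker_emb)
print(len(shared))  # length of speech embedding plus length of identity vector
```

The point of the design is that the real speaker embedding never leaves the device in the reconstruction case; only the speech embedding and a substitute identity are combined.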

In experiments, the researchers used five public data sets (IEMOCAP, RAVDESS, SAVEE, LibriSpeech, and VoxCeleb), recorded for purposes including speech recognition, speaker recognition, and emotion recognition, to train, validate, and test the framework. They found they could achieve high speech recognition accuracy while hiding a speaker’s identity using the framework, but that recognition accuracy decreased slightly depending on the preferences specified. Even so, the coauthors expressed confidence this could be addressed with constraints in future work.

“It is clear that [things like the] change in the energy located in each pitch class for each frame reflects the success of the proposed framework in changing the prosodic representation related to the user’s emotion [and other attributes] to maintain his or her privacy,” the researchers wrote in a preprint paper. “Protecting users’ privacy where speech analysis is concerned continues to be a particularly challenging task. Yet, our experiments and findings indicate that it is possible to achieve a fair level of privacy while maintaining a high level of functionality for speech-based systems.”

The researchers plan to focus on extending their framework to provide controls depending on the devices and services with which users are interacting. They also intend to explore privacy-preserving, interpretable, and customizable applications enabled by disentangled representations.

This latest study follows a paper by researchers at Chalmers University of Technology and the RISE Research Institutes of Sweden proposing a privacy-preserving technique that learns to obfuscate attributes like gender in speech data. Like the Imperial College London team, they used a model that’s trained to filter sensitive information in recordings and then generate new and private information independent of the filtered details, ensuring that sensitive information remains hidden without sacrificing realism or utility.