PINs and text messages can be inferred from smart speaker recordings, study shows

Malicious attackers can extract PIN codes and text messages from audio recorded by smart speakers located up to 1.6 feet away from the device being typed on. That's according to a new study authored by researchers at the University of Cambridge, which showed that it's possible to capture virtual keyboard taps with microphones located on nearby devices, like Alexa- and Google Assistant-powered speakers.

Amazon Echo, Google Home, and other smart speakers pack microphones that are always on in the sense that they process audio they hear in order to detect wake-up phrases like "OK Google" and "Alexa." These wake-phrase detectors occasionally send audio data to remote servers. Studies have found that up to a minute of audio can be uploaded to servers without any keywords present, either by accident or absent privacy controls. Reporting has revealed that accidental activations have exposed contract workers to private conversations, and researchers say these activations could reveal sensitive information like passwords if a victim happens to be within range of the speaker.

The researchers assume for the purposes of the study that a malicious attacker has access to the microphones on a speaker. (They might make a call or tamper with the speaker, for instance, or gain access to the speaker's raw audio logs.) They also assume the device from which the attacker wishes to extract information is held close to the speaker's microphones and that the make and model are known to the attacker.

In experiments, the researchers used a ReSpeaker, a six-microphone accessory for the Raspberry Pi designed to run Alexa on the Pi while providing access to raw audio. As the authors note, the setup is similar to the Amazon Echo minus the center microphone, which all of the Echo models lack.

Taps on the "victim" devices (in this case an HTC Nexus 9 tablet, a Nokia 5.1 smartphone, and a Huawei Mate20 Pro) show up in microphone recordings as a short one- to two-millisecond spike with frequencies between 1,000Hz and 5,500Hz, followed by a longer burst of frequencies around 500Hz, according to the authors. The sound waves propagate both through solid material like smartphone screens and through the air, making them easy for a microphone to pick up.
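To make that acoustic signature concrete, here is a minimal Python sketch of how such a spike might be flagged in a recording. It is not the researchers' detector: the function names, the 16kHz sample rate, the probe frequencies, the threshold factor, and the synthetic test recording are all assumptions chosen for illustration, and the sketch only looks for the short 1,000Hz-to-5,500Hz spike, not the 500Hz burst that follows it.

```python
import math
import random
import statistics


def goertzel_power(frame, sample_rate, freq):
    """Signal power of `frame` at `freq` Hz, via the Goertzel recurrence."""
    coeff = 2.0 * math.cos(2.0 * math.pi * freq / sample_rate)
    s_prev, s_prev2 = 0.0, 0.0
    for x in frame:
        s = x + coeff * s_prev - s_prev2
        s_prev2, s_prev = s_prev, s
    return s_prev ** 2 + s_prev2 ** 2 - coeff * s_prev * s_prev2


def detect_taps(signal, sample_rate, factor=50.0):
    """Return start times (seconds) of 2 ms frames with unusually high
    energy between 1,000 Hz and 5,500 Hz. `factor` is an assumed threshold
    relative to the median frame energy."""
    frame_len = round(sample_rate * 0.002)  # 2 ms frames
    probes = range(1000, 5501, 500)         # assumed probe frequencies
    powers = []
    for start in range(0, len(signal) - frame_len + 1, frame_len):
        frame = signal[start:start + frame_len]
        powers.append(sum(goertzel_power(frame, sample_rate, f) for f in probes))
    threshold = factor * statistics.median(powers)

    taps = []
    previous_hit = False
    for i, power in enumerate(powers):
        hit = power > threshold
        if hit and not previous_hit:  # merge runs of adjacent loud frames
            taps.append(i * frame_len / sample_rate)
        previous_hit = hit
    return taps


def synth_recording(sample_rate=16000, duration=0.5, tap_time=0.2, seed=0):
    """Synthetic stand-in for a recording: faint noise plus one fake tap
    (2 ms at 3,000 Hz, then 20 ms at 500 Hz). Purely illustrative."""
    rng = random.Random(seed)
    start = round(sample_rate * tap_time)
    spike_len = round(sample_rate * 0.002)
    tail_len = round(sample_rate * 0.020)
    signal = []
    for n in range(round(sample_rate * duration)):
        x = rng.gauss(0.0, 0.01)
        k = n - start
        if 0 <= k < spike_len:
            x += math.sin(2.0 * math.pi * 3000 * k / sample_rate)
        elif spike_len <= k < spike_len + tail_len:
            x += 0.3 * math.sin(2.0 * math.pi * 500 * k / sample_rate)
        signal.append(x)
    return signal


if __name__ == "__main__":
    print(detect_taps(synth_recording(), 16000))
```

On the synthetic recording, the sketch should flag a single tap at roughly the 0.2-second mark. Real recordings would be far messier, which is presumably why the researchers relied on a trained model rather than a fixed threshold.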

The team trained an AI model to pick out candidate taps in recordings and separate actual taps from false positives. Then they created a separate set of classifiers to identify potential digits and letters from the taps detected by the first classifier. Given just 10 guesses, the results suggest that five-digit PINs can be guessed up to 15% of the time and that text can be inferred with 50% accuracy.
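The "10 guesses" framing can be illustrated with a short Python sketch. It assumes, hypothetically, that a classifier outputs a probability for each candidate digit at each tap, then ranks whole PINs by the product of those probabilities and keeps the 10 most likely. The function name and the probabilities below are invented for illustration and are not taken from the paper.

```python
import heapq
import itertools
import math


def top_pin_guesses(digit_probs, k=10):
    """Rank candidate PINs by the product of per-tap digit probabilities.

    `digit_probs` holds one dict per tap, mapping a digit character to its
    (hypothetical) classifier probability. Digits missing from a dict are
    treated as impossible for that tap.
    """
    candidates = itertools.product(*(probs.items() for probs in digit_probs))

    def score(combo):
        return math.prod(prob for _, prob in combo)

    best = heapq.nlargest(k, candidates, key=score)
    return ["".join(digit for digit, _ in combo) for combo in best]


# Invented classifier output for a five-tap PIN entry.
example_probs = [
    {"4": 0.6, "1": 0.3, "7": 0.1},
    {"0": 0.7, "8": 0.3},
    {"7": 0.4, "4": 0.35, "8": 0.25},
    {"1": 0.8, "2": 0.2},
    {"9": 0.55, "6": 0.45},
]

if __name__ == "__main__":
    for guess in top_pin_guesses(example_probs):
        print(guess)
```

Under this framing, an attack counts as a success if the real PIN appears anywhere in the 10-entry list, which is why even an imperfect per-tap classifier can be enough to get through a lock screen that allows several attempts.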

The researchers note that their proposed attack might not be possible on Alexa and Google Assistant devices because neither Amazon nor Google allows third-party skills to access raw audio recordings. Moreover, phone cases or screen protectors could alter the tap acoustics and provide some measure of protection against snooping. But they assert that their work demonstrates how any device with a microphone and audio log access could be exploited to collect sensitive information.

"This shows that remote keyboard-inference attacks are not limited to physical keyboards but extend to virtual keyboards too," they wrote in a paper describing their work. "As our homes become full of always-on microphones, we need to work through the implications."