Google today launched Looking-to-Listen, a new audiovisual speech enhancement feature in YouTube Stories captured on iOS devices. Powered by machine learning, it lets creators record better selfie videos by automatically boosting their voices and reducing background noise, the company says.

While smartphone video quality continues to improve with every generation, audio quality has largely stagnated. Little attention has been paid, for example, to making speech intelligible in videos where multiple people talk over background noise; such clips often come out muddled, distorted, and difficult to understand.

That’s why two years ago, Google developed a machine learning technology that employs both visual and audio cues to isolate the speech of a video’s subject. By training the model on a large-scale collection of YouTube content, researchers at the company were able to capture correlations between speech and visual signals like mouth movements and facial expressions. These correlations can be used to separate one person’s speech in a video from another’s or to separate speech from loud background noises.
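Looking to Listen, as published, frames this as mask-based source separation: the network ingests a spectrogram of the noisy audio mixture alongside per-frame visual embeddings of the target speaker, and predicts a mask that keeps that speaker's time-frequency bins while suppressing the rest. The NumPy sketch below illustrates just that masking step; the shapes, weights, and function names are illustrative stand-ins, not Google's actual model:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy inputs: a noisy mixture spectrogram (time x freq) and a
# per-frame visual embedding for the on-screen speaker.
T, F, D = 100, 257, 1024          # time frames, freq bins, embed dim
mixture_spec = np.abs(rng.normal(size=(T, F)))
visual_embed = rng.normal(size=(T, D))

def predict_mask(spec, visual, w_audio, w_visual):
    """Stand-in for the fused audiovisual network: projects both
    streams, sums them, and squashes to a (0, 1) mask per bin."""
    fused = spec @ w_audio + visual @ w_visual   # (T, F)
    return 1.0 / (1.0 + np.exp(-fused))          # sigmoid mask

# Random weights stand in for trained parameters.
w_audio = rng.normal(size=(F, F)) * 0.01
w_visual = rng.normal(size=(D, F)) * 0.01

mask = predict_mask(mixture_spec, visual_embed, w_audio, w_visual)
enhanced_spec = mask * mixture_spec   # keep the target speaker's energy

assert enhanced_spec.shape == mixture_spec.shape
assert np.all(enhanced_spec <= mixture_spec)  # a mask can only attenuate
```

Conditioning the mask on visual features is what lets the model pick the on-screen speaker out of a mixture, rather than just denoising: mouth movements tell it *whose* speech energy to keep.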

According to Google software engineer Inbar Mosseri and Google Research scientist Michael Rubinstein, getting this technology into YouTube Stories was no easy feat. Over the past year, the Looking-to-Listen team worked with YouTube creators to learn how and in what scenarios they'd like to use the feature, and what balance of speech and background sounds they'd want their videos to retain. The model also had to be streamlined to run efficiently on mobile devices; all processing happens on-device within the YouTube app to minimize processing time and preserve privacy. Finally, the technology had to be tested extensively to ensure it performed consistently well across different recording conditions.


Above: Looking-to-Listen’s system architecture.

Image Credit: Google

Looking-to-Listen first isolates thumbnail images containing the speaker's face from the video stream. As the video is being recorded, a dedicated component extracts visual features, learned specifically for speech enhancement, from those face thumbnails. Once recording completes, the audio and the computed visual features are streamed to an audiovisual separation model that produces the isolated and enhanced speech.
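That two-stage flow — lightweight per-frame visual feature extraction during recording, a single separation pass afterward — can be sketched as follows. The function names, shapes, and stand-in bodies are illustrative assumptions, not the actual on-device API:

```python
import numpy as np

rng = np.random.default_rng(1)

def extract_face_features(frame):
    """Stage 1 (during recording): detect the speaker's face thumbnail
    in a frame and embed it. A random vector stands in for the real
    visual encoder here."""
    return rng.normal(size=64)

def separate_speech(audio, visual_features):
    """Stage 2 (after recording): the audiovisual separation model.
    A pass-through stands in for the real network, which would return
    the enhanced target-speaker audio."""
    return audio

# During recording: compute visual features frame by frame.
frames = [np.zeros((128, 128, 3)) for _ in range(30)]  # fake 1 s of video
features = [extract_face_features(f) for f in frames]

# After recording: run separation once over the whole clip.
audio = rng.normal(size=16000)                          # fake 1 s @ 16 kHz
enhanced = separate_speech(audio, np.stack(features))

assert np.stack(features).shape == (30, 64)
assert enhanced.shape == audio.shape
```

Splitting the work this way is what makes the feature feel instant: the expensive per-frame vision work is amortized across the recording itself, so only the separation pass remains once the creator taps stop.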

Mosseri and Rubinstein say a series of architectural optimizations and improvements cut Looking-to-Listen's running time from 10 times real-time on a desktop machine to 0.5 times real-time on an iPhone processor alone, while shrinking the model from 120MB to 6MB. As a result, enhanced speech is available within seconds of a YouTube Stories recording finishing.
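Those "10×" and "0.5×" figures are real-time factors: the ratio of processing time to clip duration. The arithmetic below works through what each factor means for a hypothetical 15-second Story (the clip length is an illustrative assumption):

```python
def processing_seconds(clip_seconds, realtime_factor):
    """Real-time factor = processing time / clip duration."""
    return clip_seconds * realtime_factor

clip = 15.0  # a hypothetical 15-second YouTube Story

before = processing_seconds(clip, 10.0)  # desktop, pre-optimization
after = processing_seconds(clip, 0.5)    # on-device, optimized

assert before == 150.0  # 2.5 minutes to process 15 s of video
assert after == 7.5     # enhanced speech ready within seconds
```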

Looking-to-Listen doesn't remove all background noise; Google says the users it surveyed preferred to keep some ambient sound. The company also claims the technology treats speakers of different appearances fairly: in a series of tests, the Looking-to-Listen team found the feature performed well across speakers of different ages, skin tones, spoken languages, voice pitches, degrees of face visibility, head poses, facial hair, and accessories like glasses.

YouTube creators eligible for YouTube Stories creation can record a video on iOS and select “Enhance speech” from the volume controls editing tool, which will immediately apply speech enhancement to the audio track and play back the enhanced speech in a loop. They can then compare the original video with the enhanced version.