How Google built its AI-powered Hum to Search feature

In October, Google announced it would let users search for songs by simply humming or whistling melodies, initially in English on iOS and in more than 20 languages on Android. At the time, the search giant only hinted at how the new Hum to Search feature worked. But in a blog post today, Google detailed the underlying systems that enable Google Search to find songs using only hummed renditions.

Identifying songs from humming is a longstanding challenge in AI. With lyrics, background vocals, and a range of instruments, the audio of a musical or studio recording can be quite different from a hummed version. When someone hums their interpretation of a song, the pitch, key, tempo, and rhythm often vary slightly or significantly from the original. That's why so many existing approaches to query by humming match the hummed tune against a database of preexisting hummed or melody-only versions of a song instead of identifying the song directly.

By contrast, Google's Hum to Search matches a hummed melody directly to the original recordings without relying on a database of recordings paired with hummed versions of each. Google notes that this approach allows Hum to Search to be refreshed with millions of original recordings from across the world, including the latest releases.

This is just one example of how Google is applying AI to improve the Search experience. A recent algorithmic enhancement to Google's spellchecker feature enabled more accurate and precise spelling suggestions. Search now leverages AI to capture the nuances of the webpage content it indexes. And Google says it is using computer vision to highlight notable points in videos within Search, like a screenshot comparing different products or a key step in a recipe.

Matching melodies

Hum to Search builds on Google's extensive work in music recognition. In 2017, the company launched Now Playing with its Pixel smartphone lineup, which uses an on-device, offline machine learning algorithm and a database of song fingerprints to recognize music playing nearby. As it identifies a song, Now Playing records the track name in an on-device history. And if a Pixel is idle and charging while connected to Wi-Fi, a Google server sometimes invites it to join a "round" of computation with hundreds of other Pixel phones. The result enables Google engineers to improve the Now Playing song database without any phone revealing which songs were heard.

Google refined this technology in Sound Search, which provides a server-based recognition service to let users more quickly and accurately find over 100 million songs. Sound Search was built before the widespread use of machine learning algorithms, but Google revamped it in 2018 using scaled-up versions of the AI models powering Now Playing. Google also began weighting Sound Search's index based on popularity, lowering the matching threshold for popular songs and raising it for obscure songs.
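To make the popularity weighting concrete, here is a minimal sketch of a match threshold that drops for popular songs and rises for obscure ones. The function names, the 0-to-1 popularity scale, and every constant are illustrative assumptions; Google has not published the values or the formula Sound Search uses.

```python
def match_threshold(popularity, base=0.80, spread=0.15):
    """Return the minimum similarity score needed to accept a match.

    popularity is a hypothetical value in [0, 1]. Popular songs get a
    lower threshold and obscure songs a higher one. The constants are
    made up for illustration.
    """
    if not 0.0 <= popularity <= 1.0:
        raise ValueError("popularity must be in [0, 1]")
    # popularity 0.5 maps to `base`; the extremes move it by +/- `spread`.
    return base + spread * (1.0 - 2.0 * popularity)


def accept(score, popularity):
    """Accept a candidate song if its score clears its own threshold."""
    return score >= match_threshold(popularity)
```

Under these assumed numbers, a candidate scoring 0.7 would be accepted for a chart-topping song but rejected for an obscure one.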

But matching hummed tunes with songs required a novel approach. As Google explains, it had to develop a model that could learn to focus on the dominant melody of a song while ignoring vocals, instruments, and voice timbre, as well as differences stemming from background noise and room reverberation.

A humming model

For Hum to Search, Google modified the music recognition models leveraged in Now Playing and Sound Search to work with hummed recordings. Google trained these retrieval models using pairs of hummed or sung audio with recorded audio to produce embeddings (i.e., numerical representations) for each input. In practice, the modified models produce embeddings with pairs of audio containing the same melody close to each other (even if they have different instrumental accompaniment and singing voices) and pairs of audio containing different melodies far apart. Finding the matching song is a matter of searching for similar embeddings from Google's database of recordings.
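The retrieval step described above can be sketched as a nearest-neighbor search over embeddings. This is a toy illustration, not Google's system: the three-dimensional vectors, the song names, and the choice of cosine similarity are all assumptions, and a production index would use far larger embeddings and an approximate search structure.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_matches(query, index, k=3):
    """Return the k (title, score) pairs most similar to the query."""
    scored = [(cosine_similarity(query, emb), title) for title, emb in index.items()]
    scored.sort(reverse=True)  # highest similarity first
    return [(title, score) for score, title in scored[:k]]


# Hypothetical embeddings of original recordings.
SONG_INDEX = {
    "song_a": [0.9, 0.1, 0.0],
    "song_b": [0.0, 1.0, 0.2],
    "song_c": [0.1, 0.2, 0.9],
}

# Hypothetical embedding of a hummed query that carries song_a's melody.
hummed_query = [0.8, 0.2, 0.1]

best_title, best_score = top_matches(hummed_query, SONG_INDEX, k=1)[0]
```

Because the model is trained to place the same melody close together regardless of accompaniment, the hummed query lands nearest to the recording it imitates.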

Because training the models required song pairs -- recorded songs and sung songs -- the first barrier was obtaining enough training data. Google says its initial dataset consisted of mostly sung music segments (very few of which contained humming) and that it made the models more robust by augmenting the audio during training. It did this by varying the pitch or tempo of the sung input randomly, for example.
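As a rough sketch of that kind of augmentation, the snippet below randomly shifts the pitch and tempo of a melody. It is a simplification under stated assumptions: Google augmented raw audio, whereas this example works on a hypothetical list of (frequency, duration) notes, and the shift ranges are invented for illustration.

```python
import random


def augment_melody(notes, rng, max_semitones=2.0, tempo_range=(0.8, 1.25)):
    """Randomly transpose and time-stretch a melody.

    notes is a list of (frequency_hz, duration_s) tuples. One random
    pitch shift and one random tempo factor are applied to the whole
    melody, so the intervals between notes are preserved.
    """
    semitones = rng.uniform(-max_semitones, max_semitones)
    pitch_factor = 2.0 ** (semitones / 12.0)  # equal-temperament ratio
    tempo_factor = rng.uniform(*tempo_range)  # > 1 means faster playback
    return [(freq * pitch_factor, dur / tempo_factor) for freq, dur in notes]


# Hypothetical sung fragment: A4, B4, C#5.
melody = [(440.0, 0.5), (493.88, 0.5), (554.37, 1.0)]
augmented = augment_melody(melody, random.Random(0))
```

The melody's contour survives the transformation, which is the point: the model sees the same tune at a different key and speed and should still map it to a nearby embedding.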

The resulting models worked well enough for people singing, but not for those humming or whistling. To rectify this, Google generated additional training data by simulating "hummed" melodies from the existing audio dataset using SPICE, a pitch extraction model developed by the company's wider team as part of the FreddieMeter project. FreddieMeter uses on-device machine learning models developed by Google to see how close a person's vocal timbre, pitch, and melody are to the artist Freddie Mercury.

SPICE extracts the pitch values from given audio, which researchers at Google used to generate a melody consisting of discrete tones. The company later refined this approach by replacing the simple tone synthesizer with a model that generates audio resembling an actual hummed or whistled tune.
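The first version of that idea, turning a pitch track into discrete tones, might look something like the sketch below. The pitch values here are a hand-written stand-in for SPICE output rather than a call to the real model, and the frame length, sample rate, and semitone quantization are assumptions made for the example.

```python
import math

SAMPLE_RATE = 16000  # samples per second (assumed)


def quantize_to_semitone(freq_hz, ref_hz=440.0):
    """Snap a frequency to the nearest equal-temperament semitone."""
    semitones = round(12 * math.log2(freq_hz / ref_hz))
    return ref_hz * 2 ** (semitones / 12)


def synthesize_tones(pitch_track_hz, frame_seconds=0.05, sample_rate=SAMPLE_RATE):
    """Render one sine-wave frame per pitch value.

    pitch_track_hz holds one estimated pitch per frame; None or a
    non-positive value marks an unvoiced frame and produces silence.
    """
    samples = []
    phase = 0.0  # carried across frames to avoid clicks at boundaries
    frame_len = round(frame_seconds * sample_rate)
    for freq in pitch_track_hz:
        if freq is None or freq <= 0:
            samples.extend([0.0] * frame_len)
            continue
        step = 2 * math.pi * quantize_to_semitone(freq) / sample_rate
        for _ in range(frame_len):
            samples.append(math.sin(phase))
            phase += step
    return samples


# Hypothetical pitch track: slightly sharp A4, a pause, then roughly B4.
audio = synthesize_tones([445.0, None, 490.0])
```

A plain sine wave like this carries the melody but sounds nothing like a person humming, which is presumably why Google moved on to a learned generator.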

Here's generated humming:

[audio wav="https://venturebeat.com/wp-content/uploads/2020/11/humming_generated.wav"][/audio]

And here's generated whistling:

[audio wav="https://venturebeat.com/wp-content/uploads/2020/11/whistling_generated.wav"][/audio]

As a final step, Google researchers compared training data by mixing and matching the audio. For example, if there was a similar clip from two different singers, they would align those two clips with their preliminary models. This enabled the researchers to show the model an additional pair of audio clips that represent the same melody.

"We've found that we could improve the accuracy of our model by taking [this] additional training data ... into account, namely by formulating a general notion of model confidence across a batch of examples," Google explained. "This helps the machine improve learning behavior, either when it finds a different melody that is too easy ... or because it is too hard in that, given its current state of learning."
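The article does not name the loss function involved. One common way to train embeddings so that matching melodies sit close together and different melodies sit far apart is a triplet margin loss, used here purely as an illustrative assumption to show what "too easy" and "too hard" negatives mean. The vectors and the margin are made up.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def triplet_loss(anchor, positive, negative, margin=0.5):
    """Zero once the negative is at least `margin` farther than the positive."""
    return max(0.0, euclidean(anchor, positive) - euclidean(anchor, negative) + margin)


anchor = [0.0, 0.0]         # embedding of a hummed clip
positive = [0.1, 0.0]       # the same melody from the original recording
easy_negative = [2.0, 0.0]  # a different melody, already far away
hard_negative = [0.2, 0.0]  # a different melody that sits very close

easy_loss = triplet_loss(anchor, positive, easy_negative)
hard_loss = triplet_loss(anchor, positive, hard_negative)
```

In this sketch the easy negative contributes no loss, so the model learns nothing from it, while the hard negative produces a large loss. Balancing those two cases across a batch is the kind of problem Google's notion of model confidence is meant to address.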

Hum to Search taps all these techniques to show the most likely matches based on a given tune. Users can select the best match and explore information about the song and artist, view any accompanying music videos, or listen to the song on their favorite music app. They can also find the lyrics, read analysis, and check out other recordings of the song when available.
