Even casual music fans can distinguish songs by category without great difficulty, but that’s not the case for computers. Most audio-based music classification and tagging systems use categorical supervised learning — in other words, learning a function that maps songs to genres based on example pairs — with a fixed set of labels that intrinsically can’t handle unseen labels, such as newly added genres.

That’s why a team of scientists at Naver Corp, an internet content service company headquartered in South Korea, investigated a zero-shot alternative in a paper (“Zero-Shot Learning for Audio-based Music Classification and Tagging”) published on the preprint server arXiv.org. Their AI classification system learns to associate songs with labels it never encountered during training by taking into account side information about those labels, such as the musical instruments involved and the words used in descriptions of songs.

The researchers settled on two types of side information at the outset of the study: human-labeled attribute information and general word semantic information. The former, they note, can be used as binary outputs to train a classifier and infer unseen classes based on a learned hierarchy or other relationships. A general word semantic space, on the other hand, covers a large vocabulary, so an unseen label can be predicted as long as its word has a vector in that space.

The team’s AI model ingested audio mel-spectrograms (representations of the short-term power spectrum of a sound) and passed them to a convolutional neural network, which was trained directly with semantic embeddings of the ground truth annotations. Essentially, the model took audio in one module and, in another, randomly selected words from that audio’s annotations, mapping each word to a vector through a semantic lookup table built from either the human-labeled attribute data or the general word semantic space.
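The matching step described above can be illustrated with a minimal sketch. It is not the researchers' implementation: the three-dimensional vectors, the `label_vectors` table, the `rank_labels` helper, and the choice of cosine similarity as the scoring function are all assumptions made for illustration, and the audio embedding is a hand-written stand-in for the output of a trained convolutional network.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Hypothetical semantic lookup table: label -> side-information vector.
# In practice these would be attribute vectors or word embeddings.
label_vectors = {
    "rock": [0.9, 0.1, 0.0],
    "jazz": [0.1, 0.9, 0.2],
    "metal": [0.8, 0.0, 0.3],  # assume this label was unseen during training
}

def rank_labels(audio_embedding, table):
    """Return labels sorted from most to least similar to the audio."""
    scores = {label: cosine(audio_embedding, vec) for label, vec in table.items()}
    return sorted(scores, key=scores.get, reverse=True)

# Stand-in for the CNN's projection of a mel-spectrogram (assumed values).
audio_embedding = [0.85, 0.05, 0.25]
ranking = rank_labels(audio_embedding, label_vectors)
print(ranking)
```

Because the audio and the labels live in the same vector space, a label that had no training examples can still be ranked, provided it has an entry in the lookup table.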

In the first of several experiments, the researchers tapped two data sets, Free Music Archive and OpenMIC-2018, containing audio files and genre annotations. They filtered the audio files to keep those with both genre and instrument annotations (e.g., “bass,” “acoustic,” “vocal”) and randomly split the labels into seen and unseen ones. Then they took the annotations for 20 different instruments in the OpenMIC-2018 data set to create instrument vectors (mathematical representations) of songs according to their genre labels.
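The data preparation described above can be sketched roughly as follows. The toy tracks, the instrument names, the helper functions `split_labels` and `genre_instrument_vectors`, and the choice to average instrument presence per genre are all assumptions for illustration; the article does not specify how the researchers aggregated the annotations.

```python
import random

def split_labels(labels, unseen_fraction, seed):
    """Randomly partition labels into (seen, unseen) lists."""
    rng = random.Random(seed)  # fixed seed keeps the split reproducible
    shuffled = sorted(labels)
    rng.shuffle(shuffled)
    n_unseen = int(len(shuffled) * unseen_fraction)
    return shuffled[n_unseen:], shuffled[:n_unseen]

def genre_instrument_vectors(tracks, instruments):
    """Average instrument presence over all tracks of each genre."""
    sums, counts = {}, {}
    for genre, present in tracks:
        vec = sums.setdefault(genre, [0.0] * len(instruments))
        for i, instrument in enumerate(instruments):
            if instrument in present:
                vec[i] += 1.0
        counts[genre] = counts.get(genre, 0) + 1
    return {g: [x / counts[g] for x in v] for g, v in sums.items()}

# Toy data (hypothetical): each track has a genre and a set of instruments.
instruments = ["guitar", "piano", "drums"]
tracks = [
    ("rock", {"guitar", "drums"}),
    ("rock", {"guitar"}),
    ("jazz", {"piano", "drums"}),
]
vectors = genre_instrument_vectors(tracks, instruments)

genres = ["rock", "jazz", "folk", "hip-hop"]
seen, unseen = split_labels(genres, unseen_fraction=0.25, seed=0)
print(vectors, seen, unseen)
```

In a setup like this, only the seen genres would supply training examples, while the instrument vectors of the unseen genres would be held back and used solely to score audio at test time.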

In a second test, the team used a publicly available pretrained machine learning model with the Million Song Dataset and Last.fm tag annotations (e.g., “punk” and “metal” from Nirvana’s “Smells Like Teen Spirit”), the latter of which they randomly divided into seen and unseen labels.

The researchers claim that in both tests the model managed to associate music audio with unseen labels using side information. They say this allows the model to use a “rich vocabulary of words” to describe music, and they leave to future work the use of lyrics as side information and the option of training AI models to incorporate more musical context (like text descriptions of playlists or music articles).