Researchers propose LEAF, a learnable frontend for audio classification

In machine learning, mel-filterbanks (fixed, hand-engineered representations of sound) are often used as the input features for algorithms that classify sound. Decades after mel-filterbanks were designed, research shows that they exhibit desirable mathematical properties for representation learning; in other words, they make strong audio features. But their design also carries built-in biases. The mel scale packs filters densely at low frequencies and spreads them sparsely at high ones, and this can be detrimental for tasks that require fine-grained resolution at high frequencies.
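To make the high-frequency bias concrete, the sketch below spaces 40 triangular filters evenly on the mel scale between 0 and 8 kHz and measures how wide each one is in hertz. It assumes the common HTK-style formula m = 2595 log10(1 + f/700); other mel variants exist, and the filter count and frequency range are arbitrary illustrative choices.

```python
import numpy as np


def hz_to_mel(f_hz):
    # HTK-style mel formula (one common variant among several)
    return 2595.0 * np.log10(1.0 + f_hz / 700.0)


def mel_to_hz(m):
    # Inverse of hz_to_mel
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)


def mel_band_edges(n_filters, f_min, f_max):
    # Edge frequencies for n_filters triangular filters: filter i rises from
    # edges[i], peaks at edges[i + 1], and falls back to zero at edges[i + 2].
    # The points are equally spaced on the mel axis, not in hertz.
    mels = np.linspace(hz_to_mel(f_min), hz_to_mel(f_max), n_filters + 2)
    return mel_to_hz(mels)


edges = mel_band_edges(40, 0.0, 8000.0)
widths = edges[2:] - edges[:-2]  # full width of each triangle in Hz
print(f"lowest filter width:  {widths[0]:.1f} Hz")
print(f"highest filter width: {widths[-1]:.1f} Hz")
```

With these settings the highest filter spans roughly ten times as many hertz as the lowest, so two tones a few hundred hertz apart near 7 kHz can fall into the same band, while the same gap at low frequencies spans several bands.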

In a step toward a learnable alternative, researchers at Google developed LEAF, a frontend that breaks mel-filterbanks down into three components (filtering, pooling, and compression/normalization) and makes each of them learnable, with the aim of producing audio classification models with minimal built-in biases. The researchers claim that LEAF can learn a single set of parameters that outperforms mel-filterbanks, suggesting it can serve as a general-purpose frontend for audio classification tasks.
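The three-stage decomposition can be sketched in NumPy. This is a simplified, fixed-parameter illustration, not the researchers' implementation: it assumes, as one plausible set of choices, complex Gabor-style bandpass filters for the filtering stage, a Gaussian lowpass window for pooling, and per-channel energy normalization (PCEN) for compression. All function names and default values are hypothetical. In a learnable frontend, quantities such as each filter's center frequency and width, the pooling width, and the PCEN coefficients would be trained by gradient descent together with the classifier instead of being fixed as they are here.

```python
import numpy as np


def gabor_filters(center_freqs, sigmas, length):
    """Complex Gabor impulse responses: a Gaussian envelope times a complex sinusoid.

    center_freqs are in radians per sample (0 to pi); sigmas are envelope widths
    in samples. In a learnable frontend these two arrays would be trained.
    """
    t = np.arange(length) - (length - 1) / 2
    envelopes = np.exp(-0.5 * (t[None, :] / sigmas[:, None]) ** 2)
    envelopes /= np.sqrt(2 * np.pi) * sigmas[:, None]
    return envelopes * np.exp(1j * center_freqs[:, None] * t[None, :])


def gaussian_lowpass_pool(energy, sigma, window_len, stride):
    """Smooth each channel with a normalized Gaussian window, then downsample."""
    t = np.arange(window_len) - (window_len - 1) / 2
    window = np.exp(-0.5 * (t / sigma) ** 2)
    window /= window.sum()
    smoothed = np.stack([np.convolve(ch, window, mode="same") for ch in energy])
    return smoothed[:, ::stride]


def pcen(E, alpha=0.96, delta=2.0, r=0.5, s=0.04, eps=1e-6):
    """Per-channel energy normalization of a (channels, frames) energy array."""
    M = np.empty_like(E)
    M[:, 0] = E[:, 0]
    for t in range(1, E.shape[1]):
        # First-order IIR smoother that tracks each channel's energy over time
        M[:, t] = (1 - s) * M[:, t - 1] + s * E[:, t]
    # Divide by the smoothed energy (gain control), then root-compress
    return (E / (eps + M) ** alpha + delta) ** r - delta ** r


def leaf_like_frontend(waveform, filters, pool_sigma=40.0, pool_len=401, stride=160):
    # 1. Filtering: complex bandpass filters, then squared modulus -> energy
    filtered = np.stack([np.convolve(waveform, h, mode="same") for h in filters])
    energy = np.abs(filtered) ** 2
    # 2. Pooling: Gaussian lowpass along time, then downsampling by `stride`
    pooled = gaussian_lowpass_pool(energy, pool_sigma, pool_len, stride)
    # 3. Compression/normalization
    return pcen(pooled)


rng = np.random.default_rng(0)
waveform = rng.standard_normal(16000)  # 1 s of noise at an assumed 16 kHz rate
filters = gabor_filters(np.linspace(0.05, 3.0, 40), np.full(40, 50.0), 401)
features = leaf_like_frontend(waveform, filters)
print(features.shape)  # (40, 100)
```

Each stage here is governed by a handful of per-channel numbers rather than large weight matrices, which is why a frontend built along these lines can stay very small even once those numbers are made trainable.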

LEAF could have real-world implications: the global sound recognition market was valued at $66.5 million in 2018, according to Grand View Research. Beyond voice and speech recognition, hearing has become an essential sense for AI systems; sound helps AI understand context and distinguish among the events occurring in an environment. For example, in the event of an intrusion, a home event management system with AI-powered sound sensing could turn on the lights and play loud music to deter the intruder, while also sending alerts to the homeowners. LEAF could make it easier to build products like these without painstakingly handcrafting sound representations.

In experiments, the researchers used LEAF to train independent single-task supervised models on eight distinct classification problems: acoustic scene classification, birdsong detection, emotion recognition, speaker identification, musical instrument detection, pitch detection, keyword spotting, and language identification. They report that the models built with LEAF either outperformed the alternative frontends or matched their accuracy.

In the near future, the team plans to release the source code for their models and baselines as well as pretrained frontends. "In this work, we argue that a credible alternative to mel-filterbanks for classification should be evaluated across many tasks, and propose the first extensive study of learnable frontends for audio over a wide and diverse range of audio signals, including speech, music, audio events, and animal sounds," they wrote in a paper describing their work. "By breaking down mel-filterbanks into three components ... we propose LEAF, a novel frontend that is fully learnable in all its operations, while being controlled by just a few hundred parameters. [T]hese findings are replicated when training a different model for each individual task. We also confirm these results on a challenging, large-scale benchmark."
