TikTok maker's new AI SALMONN understands all audio, not just music and voices

Researchers from Tsinghua University and ByteDance have developed a new artificial intelligence system called SALMONN that allows machines to understand and reason about audio inputs like speech, sounds, and music.

In a research paper published on arXiv, the scientists describe SALMONN as "a large language model (LLM) enabling speech, audio event, and music inputs." The system connects two specialized AI models, one for processing speech and one for general audio, to a single LLM that can generate text responses to audio prompts.
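As a rough illustration of the dual-model idea described above, the toy Python sketch below combines the outputs of two audio encoders and hands them to a language model alongside a text prompt. It is not the researchers' code: the function names, the tiny hand-made "features," and the choice to join the two streams frame by frame are all assumptions made for illustration only.

```python
from typing import List

Vector = List[float]


def speech_encoder(frames: List[Vector]) -> List[Vector]:
    """Hypothetical stand-in for a speech encoder: two toy features per frame."""
    return [[sum(f) / len(f), max(f)] for f in frames]


def audio_encoder(frames: List[Vector]) -> List[Vector]:
    """Hypothetical stand-in for a general-audio encoder: two toy features per frame."""
    return [[min(f), max(f) - min(f)] for f in frames]


def fuse(speech_feats: List[Vector], audio_feats: List[Vector]) -> List[Vector]:
    """Concatenate the two feature streams frame by frame (an assumed fusion rule)."""
    if len(speech_feats) != len(audio_feats):
        raise ValueError("both encoders must emit the same number of frames")
    return [s + a for s, a in zip(speech_feats, audio_feats)]


def embed_prompt(text: str, dim: int) -> List[Vector]:
    """Toy text embedding: one vector per word, filled with the word length."""
    return [[float(len(word))] * dim for word in text.split()]


def build_llm_input(audio_tokens: List[Vector], prompt_tokens: List[Vector]) -> List[Vector]:
    """Place the fused audio features ahead of the embedded text prompt."""
    return audio_tokens + prompt_tokens


# Two made-up audio frames, each with three raw values.
frames = [[0.0, 1.0, 2.0], [3.0, 3.0, 3.0]]
fused = fuse(speech_encoder(frames), audio_encoder(frames))
llm_input = build_llm_input(fused, embed_prompt("What is this sound", dim=4))
print(len(llm_input), "vectors of width", len(llm_input[0]))
```

In this sketch the language model would receive one sequence in which audio-derived vectors come before the question, which is one simple way to let a text model "hear" a clip before answering.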

"Instead of speech-only input or audio-event-only input, SALMONN can perceive and understand all kinds of audio inputs and therefore obtains emerging capabilities such as multilingual speech recognition & translation and audio-speech co-reasoning," the paper states. "This can be regarded as giving the LLM 'ears' and cognitive hearing abilities."

An AI Model That Hears and Understands

Image credit: arxiv.org

The researchers demonstrated SALMONN's abilities on a range of audio inputs, including clips of speech, gunshots, duck noises and music. When prompted with each sound clip, the system generated appropriate descriptive text responses, showcasing an understanding of the audio content.

"The text prompt is used to instruct SALMONN to answer open-ended questions about the general audio inputs and the answers are in the LLM text responses," explains the paper.

According to the researchers, this technique of cognitive audio question-answering represents a major leap over traditional AI speech and audio systems, which are limited to narrower tasks such as transcription and audio captioning.

“Compared with traditional speech and audio processing tasks such as speech recognition and audio caption, SALMONN leverages the general knowledge and cognitive abilities of the LLM to achieve a cognitively oriented audio perception, which dramatically improves the versatility of the model and the richness of the task,” the paper states.

The researchers suggest SALMONN also possesses cross-modal abilities, such as following spoken instructions, even though it was trained only on text-based instructions.

“SALMONN only uses training data based on textual commands, listening to spoken commands is also a cross-modal emergent ability,” they write.

While the current capabilities are promising, the researchers acknowledge the model has limitations in terms of reasoning depth. However, they are optimistic about the future potential, stating that SALMONN “makes a step towards hearing-enabled artificial general intelligence.”

Potential Impact of SALMONN on Enterprise Data Analysis

For technical decision makers, this development could point toward voice-activated data analysis and business intelligence. The ability of SALMONN to understand and interpret a wide range of audio inputs could change how businesses interact with data, reducing reliance on traditional text-based input and opening up new possibilities for voice-activated analytics and data-driven decision making.

Furthermore, the team has released a web-based demo, allowing users to experience the capabilities of SALMONN firsthand. The model is also available on Hugging Face, a leading platform for hosting and sharing machine learning models.

In the rapidly evolving world of artificial intelligence, the unveiling of SALMONN serves as an interesting glimpse into the future of machine learning and cognitive computing. It underscores the commitment of ByteDance and Tsinghua University to push the boundaries of what AI can achieve. As we move closer to a world where AI can not only "see" through computer vision but also "hear" through cognitive audio processing, the implications for businesses and consumers alike are profound.
