U.S. tech giants are facing a reckoning from the East.

Even as Nvidia pledged today to invest a staggering $100 billion into its own customer OpenAI's data centers — a move that raised eyebrows across tech and business spheres — Chinese e-commerce giant Alibaba's Qwen team of AI researchers debuted what may be its most impressive model yet: Qwen3-Omni, an open source large language model (LLM) that the company bills as the first "natively end-to-end omni-modal AI unifying text, image, audio & video in one model."

To be clear: Qwen3-Omni can accept and analyze inputs of text, image, audio and video from a user, but it only outputs text and audio — still a very impressive feat.

Of course, OpenAI's GPT-4o started the trend of "omni" models when it debuted back in 2024, but that model only unified text, image, and audio.

Google's Gemini 2.5 Pro from March 2025 can also analyze video, but, like OpenAI's GPT-4o, it is proprietary ("closed source"): its weights aren't available to download, and access means paying for the API or a subscription. Qwen3-Omni, by contrast, can be downloaded, modified, and deployed for free under an enterprise-friendly Apache 2.0 license — even for commercial applications.

Google's open-weights Gemma 3n from May 2025 is probably the closest competitor — it also accepts video, audio, text, and images as input — but it only outputs text, and it ships under Google's own Gemma license rather than Apache 2.0.

Unlike earlier systems that bolted speech or vision onto text-first models, Qwen3-Omni integrates all modalities from the start, allowing it to process inputs and generate outputs while maintaining real-time responsiveness.

Alibaba Cloud has introduced three distinct versions of Qwen3-Omni-30B-A3B, each serving different purposes.

  • The Instruct model is the most complete, combining both the Thinker and Talker components to handle audio, video, and text inputs and to generate both text and speech outputs.

  • The Thinking model focuses on reasoning tasks and long chain-of-thought processing; it accepts the same multimodal inputs but limits output to text, making it more suitable for applications where detailed written responses are needed.

  • The Captioner model is a fine-tuned variant built specifically for audio captioning, producing accurate, low-hallucination text descriptions of audio inputs.

Together, these three versions allow developers to select between broad multimodal interaction, deep reasoning, or specialized audio understanding, depending on their needs.

Qwen3-Omni is available now on Hugging Face and GitHub, and via Alibaba's API as a faster "Flash" variant.
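
For developers trying the hosted version, the API is reachable through an OpenAI-compatible interface. The Python sketch below shows one way to assemble a multimodal chat message; the model name (`qwen3-omni-flash`), endpoint, and content-part field names follow OpenAI-style conventions and are assumptions to verify against Alibaba's API documentation.

```python
# Sketch: composing a multimodal request for an OpenAI-compatible endpoint.
# Model name, endpoint, and part schemas are assumptions; check Alibaba's docs.

def build_multimodal_message(text, image_url=None, audio_b64=None):
    """Compose a user message mixing text, image, and audio content parts."""
    content = [{"type": "text", "text": text}]
    if image_url:
        content.append({"type": "image_url", "image_url": {"url": image_url}})
    if audio_b64:
        content.append({"type": "input_audio",
                        "input_audio": {"data": audio_b64, "format": "wav"}})
    return {"role": "user", "content": content}

msg = build_multimodal_message(
    "What appliance is shown, and what error is it displaying?",
    image_url="https://example.com/printer.jpg",  # hypothetical URL
)

# With the `openai` client installed, the request could then be sent as:
# client = OpenAI(base_url="<Alibaba compatible-mode endpoint>", api_key="...")
# client.chat.completions.create(model="qwen3-omni-flash", messages=[msg])
```

The same message shape extends naturally to video or audio parts, which is the point of an omni-modal model: one request format for every modality.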

Architecture and Design

At its core, Qwen3-Omni uses a Thinker–Talker architecture: a "Thinker" component handles reasoning and multimodal understanding, while the "Talker" generates natural-sounding speech. Both rely on Mixture-of-Experts (MoE) designs to support high concurrency and fast inference.

Talker is decoupled from Thinker’s text representations and instead conditions directly on audio and visual features. This enables more natural audio-video-coordinated speech, such as maintaining prosody and timbre during translation.

It also means external modules like retrieval or safety filters can intervene in the Thinker’s outputs before Talker renders them as speech.

Speech generation is supported by a multi-codebook autoregressive scheme and a lightweight Code2Wav ConvNet, which together reduce latency while preserving vocal detail. Streaming performance is central: Qwen3-Omni achieves theoretical end-to-end first-packet latencies of 234 milliseconds for audio and 547 milliseconds for video, and keeps the real-time factor (RTF) below 1 even with multiple concurrent requests.
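
To make the streaming claim concrete: real-time factor is simply processing time divided by media duration, and staying below 1 means output is produced faster than the media plays back. A minimal illustration (function and constant names are mine; the figures come from the text above):

```python
# Real-time factor (RTF) = processing time / media duration.
# RTF < 1 means generation keeps ahead of playback.

FIRST_PACKET_LATENCY_MS = {"audio": 234, "video": 547}  # figures quoted above

def real_time_factor(processing_seconds: float, media_seconds: float) -> float:
    return processing_seconds / media_seconds

# A 10-second clip processed in 6 seconds stays comfortably real time:
rtf = real_time_factor(6.0, 10.0)  # 0.6
```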

The model supports 119 languages in text, 19 for speech input, and 10 for speech output, covering major world languages as well as dialects like Cantonese.

Context and Limits

  • Context length: 65,536 tokens in Thinking Mode; 49,152 tokens in Non-Thinking Mode

  • Maximum input: 16,384 tokens

  • Maximum output: 16,384 tokens

  • Longest reasoning chain: 32,768 tokens

  • Free quota: 1 million tokens (across all modalities), valid for 90 days after activation
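
The limits above are easy to enforce client-side before sending a request. This is a sketch (the helper and dictionary names are mine; the numbers are the published limits):

```python
# Token limits from the list above, per mode.
LIMITS = {
    "thinking": {"context": 65_536, "max_input": 16_384,
                 "max_output": 16_384, "max_reasoning": 32_768},
    "non_thinking": {"context": 49_152, "max_input": 16_384,
                     "max_output": 16_384},
}

def fits(mode: str, input_tokens: int, output_tokens: int) -> bool:
    """Check a planned request against the published per-mode limits."""
    lim = LIMITS[mode]
    return (input_tokens <= lim["max_input"]
            and output_tokens <= lim["max_output"]
            and input_tokens + output_tokens <= lim["context"])
```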

Pricing via API

Through Alibaba's API, billing is calculated per 1,000 tokens. Thinking Mode and Non-Thinking Mode share the same pricing, although audio output is only available in Non-Thinking Mode.

Input Costs:

  • Text input: $0.00025 per 1K tokens (≈ $0.25 per 1M tokens)

  • Audio input: $0.00221 per 1K tokens (≈ $2.21 per 1M tokens)

  • Image/Video input: $0.00046 per 1K tokens (≈ $0.46 per 1M tokens)

Output Costs:

  • Text output:

    • $0.00096 per 1K tokens (≈ $0.96 per 1M tokens) if input is text only

    • $0.00178 per 1K tokens (≈ $1.78 per 1M tokens) if input includes image or audio

  • Text + Audio output:

    • $0.00876 per 1K tokens (≈ $8.76 per 1M tokens) — audio portion only; text is free
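
Putting the rate card into code makes cost estimates for a given workload straightforward. The helper names below are mine; the prices mirror the list above:

```python
# Per-1,000-token prices in USD, from the rate card above. Audio output
# is billed on the audio portion only; the accompanying text is free.
PRICE_PER_1K = {
    "text_in": 0.00025,
    "audio_in": 0.00221,
    "image_video_in": 0.00046,
    "text_out_text_only": 0.00096,
    "text_out_multimodal": 0.00178,
    "audio_out": 0.00876,
}

def cost_usd(kind: str, tokens: int) -> float:
    return PRICE_PER_1K[kind] * tokens / 1000

# Example: 10K tokens of text in, 2K tokens of text out (text-only input)
total = cost_usd("text_in", 10_000) + cost_usd("text_out_text_only", 2_000)
# roughly $0.0044 for the whole exchange
```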

How Qwen3-Omni Was Built

Training Qwen3-Omni involved both large-scale pretraining and extensive post-training.

Audio Transformer (AuT) serves as the audio encoder. Built from scratch, AuT was trained on 20 million hours of supervised audio: 80% Chinese and English ASR data, 10% ASR in other languages, and 10% audio-understanding tasks. The result is a 0.6B-parameter encoder optimized for both real-time caching and offline tasks.
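
That 80/10/10 split works out as follows (a trivial breakdown; integer arithmetic keeps the figures exact):

```python
# AuT's supervised audio mix, per the figures above: 20M hours total.
TOTAL_HOURS = 20_000_000
MIX_PERCENT = {"zh_en_asr": 80, "other_asr": 10, "audio_understanding": 10}

hours = {task: TOTAL_HOURS * pct // 100 for task, pct in MIX_PERCENT.items()}
# zh_en_asr: 16,000,000 hours; other_asr and audio_understanding: 2,000,000 each
```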

Pretraining followed three stages:

  1. Encoder Alignment (S1): Vision and audio encoders were trained separately while the LLM was frozen, preventing degradation of perception.

  2. General Training (S2): A dataset of around 2 trillion tokens was used, spanning 0.57T text, 0.77T audio, 0.82T images, and smaller shares of video and audio-video data.

  3. Long Context (S3): The maximum token length was extended from 8,192 to 32,768, with more long audio and video included, strengthening the model’s ability to handle extended sequences.

Post-training also involved multiple steps. For Thinker, this included supervised fine-tuning, strong-to-weak distillation, and GSPO optimization using both rule-based and LLM-as-a-judge feedback. Talker underwent a four-stage training process that combined hundreds of millions of multimodal speech samples with continual pretraining on curated data, aimed at reducing hallucinations and improving speech quality.

Benchmark Results

Across 36 benchmarks, Qwen3-Omni achieves state-of-the-art performance on 22 and leads all open-source models on 32.

Qwen3-Omni benchmarks. Credit: Alibaba Qwen Team

  • Text and Reasoning: It posts 65.0 on AIME25, far above GPT-4o (26.7), and 76.0 on ZebraLogic, exceeding Gemini 2.5 Flash (57.9). WritingBench results show it at 82.6, above GPT-4o (75.5).

  • Speech and Audio: On WenetSpeech, Qwen3-Omni records 4.69 and 5.89 WER, well ahead of GPT-4o’s 15.30 and 32.27. Librispeech-other drops to 2.48 WER, the lowest among peers. Music benchmarks also highlight its strength: GTZAN at 93.0 and RUL-MuchoMusic at 52.0, both above GPT-4o.

  • Image and Vision: HallusionBench scores reach 59.7, MMMU_pro comes in at 57.0, and MathVision_full at 56.3, all higher than GPT-4o.

  • Video: On MLVU, Qwen3-Omni achieves 75.2, surpassing Gemini 2.0 Flash at 71.0 and GPT-4o at 64.6.

Together, these results highlight Qwen3-Omni’s balance of maintaining text and vision quality while excelling in speech and multimodal tasks.

Applications and Use Cases

Alibaba Cloud highlights a broad set of application scenarios for Qwen3-Omni. These range from multilingual transcription and translation to audio captioning, OCR, music tagging, and video understanding.

Imagine a tech support AI agent that can review live video streamed from a customer's webcam or phone in near real time, then provide automated guidance on troubleshooting their device (a printer, refrigerator, or dishwasher, for example) or an application ("How do I cancel my subscription?").

More interactive possibilities include video navigation and audiovisual dialogue, where the model draws on both sound and imagery for real-time interaction.

Through system prompts, developers can fine-tune how Qwen3-Omni behaves, from conversation style to persona. This flexibility supports deployment in consumer-facing assistants, enterprise transcription systems, and domain-specific analysis tools.
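
In practice, that steering lives in the system prompt. A minimal sketch of a persona-shaped request follows; the model name is an assumption carried over from the hosted "Flash" variant, and the message shape follows OpenAI-style conventions:

```python
# Sketch: shaping persona and conversation style via the system prompt.
def make_request(system_prompt: str, user_text: str) -> dict:
    return {
        "model": "qwen3-omni-flash",  # assumed hosted model name
        "messages": [
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": user_text},
        ],
    }

req = make_request(
    "You are a concise bilingual support agent; reply in the user's language.",
    "My printer shows error E04. What should I do?",
)
```

Swapping the system prompt is all it takes to retarget the same model from, say, a consumer assistant to an enterprise transcription reviewer.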

Licensing and Enterprise Impact

Qwen3-Omni is released under the Apache 2.0 license, a permissive framework that allows enterprises to adopt and adapt the technology freely. Apache 2.0 grants rights for commercial use, modification, and redistribution, without requiring derivative works to be open-sourced.

It also includes a patent license, reducing legal risks when integrating the model into proprietary systems.

For businesses, this means Qwen3-Omni can be embedded in products or workflows without licensing fees or copyleft obligations. Enterprises can fine-tune the models for specific industries or local regulations while benefiting from continued community contributions.

What's next for Qwen?

Qwen3-Omni represents Alibaba Cloud’s push to broaden multimodal AI beyond research into enterprise-ready contexts.

With its Thinker–Talker design, extensive training pipeline, and Apache 2.0 licensing, the system offers both technical performance and practical accessibility.

As Qwen team leader Junyang Lin put it, “This might bring some changes to the landscape of opensource Omni models! Hope you enjoy it!” By combining real-time interaction with open availability, Qwen3-Omni signals a new stage for multimodal AI adoption, one where enterprises and developers alike can integrate powerful multimodal systems into their workflows without barriers.