DeepSeek, who?
When it comes to Chinese AI startups, Alibaba’s Qwen team of AI researchers is still going strong, releasing powerful, cutting-edge open-source large language models (LLMs) that virtually any enterprise or researcher around the globe can take, modify, or use as-is for commercial activities, all for free.
The latest is Qwen3-Next, a new pair of LLMs released by the researchers this week, following a blockbuster summer of new open-source models that approach the performance of US leaders OpenAI, Google, and Anthropic at a fraction of the cost, with far more control and optionality for enterprises and developers.
The Qwen3-Next release includes two post-trained variants — Instruct and Thinking — both permissively licensed under Apache 2.0, both available now on Hugging Face and to use directly through Qwen Chat (its ChatGPT/Claude rival).
Cutting-edge new ML techniques
But Qwen3-Next is special for multiple reasons: it marks the team’s first departure from the earlier Qwen3 architecture, and introduces a hybrid design that blends Gated DeltaNet with Gated Attention.
The first technique can be thought of as a “fast reader.” Instead of rereading everything word for word, it updates its understanding gradually as new text comes in. This makes it much more efficient for handling very long passages. In Qwen3-Next, about three-quarters of the model’s layers use this faster style of processing.
Meanwhile, the second technique, Gated Attention, plays the role of a “careful checker.” It uses a more traditional approach that looks at the relationships between words in greater detail. Qwen’s researchers added a gate that helps filter out noise, which makes this process more stable and accurate, especially for tricky reasoning tasks. Only about one-quarter of the layers use this method, so the model isn’t slowed down too much.
By combining these two, Qwen3-Next avoids the pitfalls of choosing one extreme or the other. If the model only used the fast method, it might miss important details; if it only used the careful method, it would be too slow on long documents. Together, the hybrid gives the model both speed and recall.
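To make the split concrete, here is a minimal sketch, in plain Python rather than the actual Qwen3-Next code, of how a roughly 3:1 interleaving of "fast reader" and "careful checker" layers might be laid out. The layer names and the every-fourth-layer pattern are illustrative assumptions based on the ratios described above.

```python
# Illustrative sketch of a 3:1 hybrid layer layout, NOT the real
# Qwen3-Next implementation: three "fast reader" (Gated DeltaNet-style
# linear attention) layers for every one "careful checker"
# (gated full-attention) layer.

def hybrid_layer_plan(num_layers: int) -> list[str]:
    """Return a layer-type schedule where every 4th layer is full attention."""
    plan = []
    for i in range(num_layers):
        if (i + 1) % 4 == 0:                 # every fourth layer
            plan.append("gated_attention")   # careful checker (~25%)
        else:
            plan.append("gated_deltanet")    # fast reader (~75%)
    return plan

plan = hybrid_layer_plan(48)
print(plan[:4])   # ['gated_deltanet', 'gated_deltanet', 'gated_deltanet', 'gated_attention']
print(plan.count("gated_attention") / len(plan))   # 0.25
```

The exact interleaving pattern in the shipped model may differ; the point is that most layers use the cheap update while a minority retain full pairwise attention.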
Qwen researcher Junyang Lin explained on X that the team has been experimenting with hybrid models and linear attention for about a year, describing the process as “a lot of trials and errors,” and noting that the attention gate turned out to be “something just like a free lunch to get benefits.”
Sparse, efficient, and more affordable
Qwen3-Next also pushes sparsity further than before, activating only 3 billion of its 80 billion parameters per token.
A token is the model’s basic unit of language — it might be a whole word, part of a word, a number, or even a chunk of code. Think of it as the alphabet in which the model “reads” and “writes.”
Parameters, by contrast, are the model’s internal switches and weights that guide how it interprets those tokens and decides what to say next. In general, more parameters can mean a more capable model. But Qwen’s design shows that you don’t need to turn them all on at once.
In fact, by reducing the number of parameters needed to handle any input or output token, Qwen is achieving much higher efficiency, lowering the energy and compute requirements (and therefore costs) needed to run the model.
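The arithmetic behind that efficiency claim is simple enough to sketch: with 3 billion of 80 billion parameters active, under 4% of the network fires for any single token.

```python
# Back-of-the-envelope sketch of Qwen3-Next's sparsity: only ~3B of
# 80B parameters activate per token, so roughly 3.75% of the network
# does work for any single prediction.

total_params = 80e9    # total parameters in the model
active_params = 3e9    # parameters activated per token

active_fraction = active_params / total_params
print(f"{active_fraction:.2%} of parameters active per token")  # 3.75%
```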
Lin acknowledged the heavy costs of testing such architectures, which require full pre-training and reinforcement learning, but said the team proved it works: “We have been doing experiments on hybrid models and linear attention for about a year. We believe that our solution should be at least a stable and solid solution to a new model architecture for super long context.”
On top of that, the models support a native 256,000-token context window — equivalent to OpenAI's GPT-5, or a 600-800 page novel in terms of the volume of information that can be exchanged in any one input/output interaction with a user — with validation extending up to 1 million tokens using RoPE scaling methods.
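For teams estimating costs, the per-million-token rates quoted translate directly into a workload budget. The sketch below uses a hypothetical daily workload as an example; the rates are the Alibaba Cloud list prices cited above, and real bills depend on actual usage.

```python
# Rough cost sketch using the quoted Alibaba Cloud per-million-token
# rates. The 10M-input / 2M-output daily workload is a made-up example.

def cost_usd(input_tokens: int, output_tokens: int,
             in_rate: float, out_rate: float) -> float:
    """Cost in USD given per-million-token input and output rates."""
    return input_tokens / 1e6 * in_rate + output_tokens / 1e6 * out_rate

thinking = cost_usd(10_000_000, 2_000_000, in_rate=0.5, out_rate=6.0)
instruct = cost_usd(10_000_000, 2_000_000, in_rate=0.5, out_rate=2.0)
print(thinking)   # 17.0 (reasoning variant)
print(instruct)   # 9.0 (non-reasoning variant)
```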
Pricing is another differentiator. On Alibaba Cloud, per-token rates are $0.5/$6 per million input/output tokens for the reasoning variant and $0.5/$2 for the non-reasoning variant. That represents at least a 25% reduction compared with Qwen3-235B, making Qwen3-Next not just more efficient to train and run, but also cheaper to deploy at scale.
The models are available now through Hugging Face, ModelScope, Kaggle and Alibaba Cloud.
Sparse yet stable Mixture-of-Experts design
Qwen3-Next employs an ultra-sparse MoE structure, expanding to 512 experts compared to Qwen3’s 128. For each token, ten routed experts plus one shared expert are activated, balancing computational efficiency with performance. This setup steadily reduces training loss while maintaining stable results.
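A toy sketch of how such sparse routing works, loosely mirroring the numbers above (512 experts, 10 routed plus 1 shared per token). This is not Qwen's router code; a real MoE router scores experts with a learned projection, while here random scores stand in.

```python
import random

# Toy sparse-MoE routing sketch: pick the top-k scoring experts per
# token, with one shared expert always participating. Illustrative
# only, NOT Qwen3-Next's actual router.

NUM_EXPERTS = 512
TOP_K = 10

def route(scores: list[float], k: int = TOP_K) -> list[int]:
    """Pick the k experts with the highest router scores for this token."""
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return ranked[:k]

random.seed(0)
scores = [random.gauss(0, 1) for _ in range(NUM_EXPERTS)]  # stand-in for learned scores
routed = route(scores)
print(len(routed) + 1)   # 11: ten routed experts plus the shared expert
```

Only those 11 experts run their feed-forward computation for the token; the other 501 stay idle, which is where the compute savings come from.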
The team highlights several adjustments aimed at training stability. Qwen3-Next replaces QK-Norm with Zero-Centered RMSNorm and applies weight decay to normalization weights. MoE routers are normalized during initialization to prevent early bias in expert selection. These changes help both small-scale experiments and large-scale training runs proceed more reliably.
Another feature is native multi-token prediction (MTP), designed to support speculative decoding with higher acceptance rates. Optimizations for multi-step inference further improve real-world decoding efficiency.
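The core idea behind using MTP for speculative decoding can be sketched in a few lines: a draft head proposes several tokens at once, and the main model keeps the longest prefix it agrees with. Higher acceptance rates mean longer kept prefixes and fewer slow verification passes. This is a toy illustration of the general technique, not Qwen's implementation.

```python
# Toy sketch of speculative decoding's acceptance step: the draft
# (e.g. an MTP head) proposes tokens, and the verifier keeps the
# longest agreeing prefix. Illustrative only.

def accept_prefix(draft: list[str], verified: list[str]) -> list[str]:
    """Keep draft tokens up to the first disagreement with the verifier."""
    accepted = []
    for d, v in zip(draft, verified):
        if d != v:
            break
        accepted.append(d)
    return accepted

print(accept_prefix(["the", "cat", "sat"], ["the", "cat", "ran"]))  # ['the', 'cat']
```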
Performance gains
The Qwen3-Next-80B-A3B base model activates only a fraction of its parameters during inference, yet outperforms Qwen3-32B on most benchmarks.
Training efficiency is emphasized: the new model was trained on 15 trillion tokens, a subset of Qwen3’s 36 trillion-token corpus, using less than 10% of the compute cost required for Qwen3-32B.
In inference, Qwen3-Next delivers significant speed improvements. At context lengths of 32,000 tokens and beyond, throughput is over 10 times higher than Qwen3-32B in both prefill and decode stages.
On reasoning and coding tasks, Qwen3-Next shows competitive or superior results. The Qwen3-Next-80B-A3B-Thinking model outperforms Qwen3-30B-A3B-Thinking and Qwen3-32B-Thinking, while surpassing the closed-source Gemini-2.5-Flash-Thinking on multiple benchmarks.
The Instruct variant performs close to Qwen3’s flagship 235-billion parameter model in long-context scenarios, natively handling up to 256,000 tokens and validated to 1 million tokens with scaling techniques.
Independent benchmarks support Qwen's gains. According to third-party AI benchmarking firm Artificial Analysis, Qwen3-Next’s reasoning-focused variant scores 54 on the Artificial Analysis Intelligence Index, placing it alongside DeepSeek V3.1 (Reasoning) in intelligence, but with far fewer active parameters.
The non-reasoning version scores 45, in line with models such as gpt-oss-20B and Llama Nemotron Super 49B v1.5.
Artificial Analysis also points out that the models are text-only, with no multimodal capabilities. Still, at 80B parameters in FP8 precision, they can fit on a single Nvidia H200 GPU — an accessibility win for enterprises and labs without massive compute clusters.
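The single-GPU claim follows from simple arithmetic: FP8 stores one byte per weight, and the H200 offers 141 GB of HBM3e. A sketch of that math (weights only; real deployments also need headroom for the KV cache and activations):

```python
# Why an 80B-parameter model in FP8 fits on one Nvidia H200:
# FP8 = 1 byte per weight, H200 = 141 GB of HBM3e. Weights-only
# sketch; KV cache and activations need additional memory.

params = 80e9
bytes_per_param_fp8 = 1
weight_gb = params * bytes_per_param_fp8 / 1e9

print(weight_gb)        # 80.0 GB of weights
print(weight_gb < 141)  # True: fits within the H200's 141 GB
```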
Developer access and licensing
Both Qwen3-Next-80B-A3B-Instruct and Qwen3-Next-80B-A3B-Thinking are released under the Apache 2.0 license, providing broad rights for modification and commercial use.
They are integrated into Hugging Face Transformers and supported by inference frameworks such as SGLang and vLLM, both of which enable OpenAI-compatible API endpoints.
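In practice, an OpenAI-compatible endpoint means clients send the standard `/v1/chat/completions` request shape to a local vLLM or SGLang server. The sketch below just builds that JSON body; the model ID matches the Hugging Face repo name mentioned above, but check the model card for the exact identifier your server expects.

```python
import json

# Sketch of the request body an OpenAI-compatible endpoint (e.g. a
# local vLLM server) expects. The model ID is taken from the Hugging
# Face release name; verify it against your deployment.

def chat_request(prompt: str,
                 model: str = "Qwen/Qwen3-Next-80B-A3B-Instruct") -> str:
    """Build the JSON body for a /v1/chat/completions call."""
    body = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": 256,
    }
    return json.dumps(body)

payload = chat_request("Summarize gated attention in one sentence.")
print(json.loads(payload)["model"])  # Qwen/Qwen3-Next-80B-A3B-Instruct
```

Because the wire format matches OpenAI's, existing client libraries can point at the local server simply by overriding the base URL.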
The Qwen team also highlights integration with Qwen-Agent, which streamlines tool use in applications.
Looking ahead
Qwen3-Next represents a pivot toward architectures designed for both efficiency and scalability.
By reducing the number of active parameters and optimizing for long contexts, the Qwen team positions this release as a practical step forward for developers.
Work on Qwen3.5 is already planned, with the goal of building on this architecture to achieve higher levels of performance and productivity.
