The generative AI era began for most people with the launch of OpenAI's ChatGPT in late 2022, but the underlying technology — the "Transformer" neural network architecture that allows AI models to weigh the importance of different words in a sentence (or pixels in an image) differently and train on information in parallel — dates back to Google's seminal 2017 paper "Attention Is All You Need."
Yet while Transformers deliver unparalleled model quality and have underpinned most of the major generative AI models used today, they are computationally gluttonous. They are burdened by quadratic compute and linear memory demands that make large-scale inference an expensive, often prohibitive, endeavor. Hence, the desire by some researchers to improve on them by developing a new architecture, Mamba, in 2023, which has gone on to be included in hybrid Mamba-Transformer models like Nvidia's Nemotron 3 Super.
Now, the same researchers behind the original Mamba architecture, including project leads Albert Gu of Carnegie Mellon and Tri Dao of Princeton, have released the latest version of their architecture, Mamba-3, as a language model under a permissive Apache 2.0 open source license — making it immediately available to developers, including enterprises, for commercial purposes. A technical paper has also been published on arXiv.org.
This model signals a paradigm shift from training efficiency to an "inference-first" design. As Gu noted in the official announcement, while Mamba-2 focused on breaking pretraining bottlenecks, Mamba-3 aims to solve the "cold GPU" problem: the reality that during decoding, modern hardware often remains idle, waiting for memory movement rather than performing computation.
Perplexity (no, not the company) and the newfound efficiency of Mamba-3
Mamba, including Mamba-3, is a type of State Space Model (SSM).
An SSM is effectively a high-speed "summary machine" for AI. While many popular models (like the ones behind ChatGPT) have to re-examine every single word they’ve already seen to understand what comes next—which gets slower and more expensive the longer the conversation lasts—an SSM maintains a compact, ever-changing internal state. This state is essentially a digital "mental snapshot" of the entire history of the data.
As new information flows in, the model simply updates this snapshot instead of re-reading everything from the beginning. This allows the AI to process massive amounts of information, like entire libraries of books or long strands of DNA, with incredible speed and much lower memory requirements.
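The update-the-snapshot loop described above can be sketched in a few lines of Python. This is a toy linear SSM with made-up, fixed parameters; real Mamba layers learn input-dependent versions of the A, B, and C matrices:

```python
import numpy as np

def ssm_scan(inputs, state_dim=4, seed=0):
    """Process a sequence by updating a fixed-size state, one step at a time."""
    rng = np.random.default_rng(seed)
    A = 0.9 * np.eye(state_dim)               # decay: how much history is kept
    B = rng.standard_normal((state_dim, 1))   # how new input enters the state
    C = rng.standard_normal((1, state_dim))   # how the state is read out
    h = np.zeros((state_dim, 1))              # the fixed-size "mental snapshot"
    outputs = []
    for x in inputs:
        h = A @ h + B * x                     # update the snapshot; never re-read history
        outputs.append((C @ h).item())
    return outputs

ys = ssm_scan([1.0, 0.0, 0.0, 0.5])
print(len(ys))  # one output per input, with constant memory use
```

Note that memory stays at `state_dim` numbers no matter how long the input sequence grows, which is the core contrast with a Transformer's ever-growing attention cache.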
To appreciate the leap Mamba-3 represents, one must first understand perplexity, the primary metric used in the research to measure model quality.
In the context of language modeling, perplexity is a measure of how "surprised" a model is by new data.
Think of a model as a professional gambler. If a model has high perplexity, it is unsure where to place its bets; it sees many possible next words as equally likely.
A lower perplexity score indicates that the model is more "certain"—it has a better grasp of the underlying patterns of human language. For AI builders, perplexity serves as a high-fidelity proxy for intelligence.
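Concretely, perplexity is the exponential of the average negative log-likelihood a model assigns to the tokens that actually occur. A minimal illustration:

```python
import math

def perplexity(token_probs):
    """Perplexity = exp(average negative log-likelihood of the true tokens)."""
    nll = -sum(math.log(p) for p in token_probs) / len(token_probs)
    return math.exp(nll)

# A confident model assigns high probability to the words that actually appear:
confident = perplexity([0.9, 0.8, 0.95])   # low surprise -> low perplexity
uncertain = perplexity([0.1, 0.2, 0.05])   # high surprise -> high perplexity
print(confident < uncertain)  # True
```

A perfect model that assigns probability 1.0 to every true token would score a perplexity of exactly 1; the higher the score, the more "bets" the model is hedging at each step.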
The breakthrough reported in the Mamba-3 research is that it achieves comparable perplexity to its predecessor, Mamba-2, while using only half the state size. This means a model can be just as smart while being twice as efficient to run.
A new philosophy

The philosophy guiding Mamba-3 is a fundamental shift in how we think about AI "intelligence" versus the speed of the hardware it runs on. While the previous generation, Mamba-2, was designed to be trained at record-breaking speeds, Mamba-3 is an "inference-first" architecture — inference referring to the way AI models are served to end users, through websites like ChatGPT or Google Gemini, or through application programming interfaces (APIs).
Mamba-3's primary goal is to maximize every second the computer chip (GPU) is active, ensuring that the model is thinking as hard as possible without making the user wait for an answer.
In the world of language models, every point of accuracy is hard-won. At the 1.5-billion-parameter scale, the most advanced "MIMO" variant of Mamba-3 achieved a 57.6% average accuracy across benchmarks, representing a 2.2-percentage-point leap over the industry-standard Transformer.

While a two-point jump might sound modest, it actually represents a nearly 4% relative increase in language modeling capability compared to the Transformer baseline. Even more impressively, as alluded to above, Mamba-3 can match the predictive quality of its predecessor while using only half the internal "state size," effectively delivering the same level of intelligence with significantly less memory lag.
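The relative-improvement figure follows from simple arithmetic on the reported numbers (the Transformer baseline is inferred here as 57.6 minus the 2.2-point gain):

```python
mimo_acc = 57.6          # Mamba-3 (MIMO) average benchmark accuracy, in %
gain_pts = 2.2           # reported lead over the Transformer baseline, in points
baseline = mimo_acc - gain_pts          # implies a 55.4% Transformer baseline
relative = gain_pts / baseline * 100    # relative improvement, in %
print(round(relative, 1))  # -> 4.0
```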
For years, efficient alternatives to Transformers suffered from a "logic gap"—they often failed at simple reasoning tasks, like keeping track of patterns or solving basic arithmetic, because their internal math was too rigid. Mamba-3 solves this by introducing complex-valued states.
This mathematical upgrade acts like an internal compass, allowing the model to represent "rotational" logic. By using this "rotary" approach, Mamba-3 can near-perfectly solve logic puzzles and state-tracking tasks that its predecessors could only guess at, finally bringing the reasoning power of linear models on par with the most advanced systems.
The final piece of the puzzle is how Mamba-3 interacts with physical hardware. Most AI models today are "memory-bound," meaning the computer chip spends most of its time idle, waiting for data to move from memory to the processor.
Mamba-3 introduces a Multi-Input, Multi-Output (MIMO) formulation that fundamentally changes this dynamic. By performing up to four times more mathematical operations in parallel during each step, Mamba-3 utilizes that previously "idle" power. This allows the model to do significantly more "thinking" for every word it generates without increasing the actual time a user spends waiting for a response. More on these below.
Three new technological leaps
The appeal of linear models has always been their constant memory requirements and linear compute scaling.
However, as the Mamba-3 authors point out, there is "no free lunch." By fixing the state size to ensure efficiency, these models are forced to compress all historical context into a single representation—the exact opposite of a Transformer’s ever-growing KV cache. Mamba-3 pulls three specific levers to make that fixed state do more work.
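The fixed-state-versus-KV-cache trade-off can be made concrete with a back-of-envelope memory comparison. All dimensions below are illustrative placeholders, not Mamba-3's actual configuration:

```python
def transformer_kv_bytes(seq_len, layers=24, heads=16, head_dim=64, bytes_per=2):
    # The KV cache stores keys and values for every past token: grows with context
    return 2 * layers * heads * head_dim * seq_len * bytes_per

def ssm_state_bytes(layers=24, state_dim=128, channels=1024, bytes_per=2):
    # A fixed-size state, independent of how long the context is
    return layers * state_dim * channels * bytes_per

for ctx in (1_000, 100_000):
    print(f"{ctx:>7} tokens: KV cache {transformer_kv_bytes(ctx):,} B "
          f"vs. SSM state {ssm_state_bytes():,} B")
```

At short contexts the two are comparable; at long contexts the KV cache dominates memory traffic while the SSM state stays constant, which is exactly the compression bet the authors describe.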
1. Exponential-Trapezoidal Discretization
State Space Models are fundamentally continuous-time systems that must be "discretized" to handle the discrete sequences of digital data.
Previous iterations relied on "Exponential-Euler" discretization—a heuristic that provided only a first-order approximation of the system.
Mamba-3 introduces a generalized trapezoidal rule, providing second-order accurate approximation. This isn't just a mathematical refinement; it induces an "implicit convolution" within the core recurrence.
By combining this with explicit B and C bias terms, the researchers were able to remove the short causal convolution that has been a staple of recurrent architectures for years.
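The gap between first- and second-order accuracy is the classical Euler-versus-trapezoid distinction from numerical analysis. The toy comparison below integrates the test equation dh/dt = a*h with both rules; it illustrates the underlying principle, not Mamba-3's actual discretization formula:

```python
import math

def integrate(a=-1.0, T=1.0, n=10, method="euler"):
    """Integrate dh/dt = a*h from h(0)=1 with n steps of size dt = T/n."""
    dt, h = T / n, 1.0
    for _ in range(n):
        if method == "euler":
            h = h * (1 + a * dt)                         # first-order accurate
        else:  # trapezoidal
            h = h * (1 + a * dt / 2) / (1 - a * dt / 2)  # second-order accurate
    return h

exact = math.exp(-1.0)
err_euler = abs(integrate(method="euler") - exact)
err_trap = abs(integrate(method="trap") - exact)
print(err_trap < err_euler)  # True: trapezoidal tracks the true dynamics far more closely
```

With the same step count, the trapezoidal rule's error is roughly two orders of magnitude smaller here, which is why a second-order discretization lets a fixed-size state retain more faithful information about its input.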
2. Complex-Valued SSMs and the "RoPE Trick"
One of the most persistent criticisms of linear models has been their inability to solve simple state-tracking tasks, such as determining the parity of a bit sequence.
This failure stems from restricting the transition matrix to real numbers, which prevents the model from representing "rotational" dynamics. Mamba-3 overcomes this by viewing the underlying SSM as complex-valued.
Using what the team calls the "RoPE trick," they demonstrate that a complex-valued state update is mathematically equivalent to a data-dependent rotary embedding (RoPE) applied to the input and output projections.
This allows Mamba-3 to solve synthetic reasoning tasks that were impossible for Mamba-2.
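The intuition can be shown with a toy version of the parity task: a single complex state that rotates by pi radians for each 1 bit tracks parity exactly, whereas a real-valued, decay-only state cannot flip sign. This is a simplified illustration of rotational dynamics, not the actual Mamba-3 parameterization:

```python
import cmath

def parity_via_rotation(bits):
    """Track the parity of a bit stream with a single complex state.
    Each 1 rotates the state 180 degrees; parity is read off the final angle."""
    z = 1 + 0j
    for b in bits:
        if b:
            z *= cmath.exp(1j * cmath.pi)   # rotate by pi radians
    return 0 if z.real > 0 else 1

print(parity_via_rotation([1, 0, 1, 1]))  # 1 (odd number of ones)
print(parity_via_rotation([1, 1]))        # 0 (even number of ones)
```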
3. MIMO: Boosting Arithmetic Intensity
The most significant leap in inference efficiency comes from the transition from Single-Input, Single-Output (SISO) to Multi-Input, Multi-Output (MIMO) SSMs.
In a standard SSM, the state update is an outer-product operation that is heavily memory-bound. By switching to a matrix-multiplication-based state update, Mamba-3 increases the "arithmetic intensity" of the model—the ratio of FLOPs to memory traffic.
This allows the model to perform more computation during the memory-bound decoding phase. Essentially, Mamba-3 utilizes the "idle" compute cores of the GPU to increase model power for "free," maintaining the same decoding speed as its simpler predecessors.
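Schematically, the SISO-to-MIMO change swaps a rank-1 outer-product update for a small matrix multiplication over the same state traffic. The dimensions below are illustrative, chosen to mirror the "up to four times" figure:

```python
import numpy as np

N, r = 128, 4                      # state size; MIMO rank (illustrative numbers)
h = np.zeros(N)

# SISO step: rank-1 update, one multiply-add per state element
B1, x1 = np.random.randn(N), 1.0
h_siso = h + B1 * x1               # N MACs per N-element state write

# MIMO step: rank-r update over the SAME amount of state traffic
Br, xr = np.random.randn(N, r), np.random.randn(r)
h_mimo = h + Br @ xr               # N*r MACs per N-element state write

intensity_ratio = (N * r) / N      # r times more FLOPs per byte of state moved
print(intensity_ratio)  # 4.0
```

Because decoding is memory-bound, those extra multiply-adds ride along with memory traffic the GPU was already paying for, which is how the model gains compute "for free" at the same decoding speed.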
What Mamba-3 means for enterprises and AI builders
For enterprises, Mamba-3 represents a strategic shift in the total cost of ownership (TCO) for AI deployments.
Cost vs. Performance: At matched parameter counts, Mamba-3 (MIMO) matches the perplexity of Mamba-2 while using half the state size. For enterprise deployment, this effectively doubles the inference throughput for the same hardware footprint.
Agentic Workflows: As organizations move toward parallel, agentic workflows (like automated coding or real-time customer service agents), the demand for low-latency generation increases exponentially. Mamba-3 is designed specifically to prevent GPU hardware from sitting "cold" during these tasks.
The Hybrid Advantage: The researchers predict that the future of enterprise AI lies in hybrid models. By interleaving Mamba-3 with self-attention, organizations can combine the efficient "memory" of SSMs with the precise "database" storage of Transformers.
Availability, licensing, and usage
Mamba-3 is not merely a theoretical research paper; it is a fully realized, open-source release available for immediate use, with model code published on GitHub.
The project is released under the Apache-2.0 License. This is a permissive, business-friendly license that allows for free usage, modification, and commercial distribution without requiring the disclosure of proprietary source code.
This release is good for developers building long-context applications, real-time reasoning agents, or those seeking to reduce GPU costs in high-volume production environments.
Leading the State Space Models (SSM) revolution
The release was met with enthusiasm on social media, particularly regarding the "student-led" nature of the project. Gu, whose X/Twitter bio describes him as "leading the ssm revolution," gave full credit to the student leads, including Aakash Lahoti and Kevin Y. Li.
Gu’s thread highlighted the team’s satisfaction with the design:
"We’re quite happy with the final model design! The three core methodological changes are inspired by (imo) some elegant math and methods."
As agentic workflows push inference demand "through the roof," the arrival of Mamba-3 suggests that the future of AI may not just be about having the biggest model, but about having the most efficient one.
Mamba-3 has successfully re-aligned the SSM with the realities of modern hardware, proving that even in the age of the Transformer, the principles of classical control theory still have a vital role to play.
