Two trends have dominated AI large language model (LLM) releases in recent months: smaller models and reasoning models.

The former are advantageous because they can run on all kinds of hardware and don't need an internet connection, which is helpful for those concerned about privacy, security and high run-time costs. Plus, they can be more easily and rapidly customized to specific enterprise tasks.

Meanwhile, reasoning models have been all the rage since OpenAI's "o" series debuted last year and gained further momentum with DeepSeek R1 in January. The basic principle behind all of these LLMs is the same: they reflect on their outputs and try to self-correct before responding to a user, which improves accuracy and performance on harder, multi-step problems such as assembling a "deep research" report.

Now, French AI company Mistral has released a new pair of models that capitalize on both of these trends.

The company said on X that its new Magistral Small 1.2 and Magistral Medium 1.2 LLMs are "minor" updates to its Magistral 1.1 series.

But the updates may actually be more significant than this well-funded European AI darling is letting on: both models are equipped with a vision encoder, allowing them to analyze imagery submitted by users. And they both offer performance improvements on key benchmarks, as well as enhanced usability features.

Perhaps most impressively, as Hugging Face ML Growth Lead Ahsen Khaliq (@_akhaliq on X) pointed out, the 24-billion-parameter Magistral Small 1.2, when quantized (i.e., with its internal weights stored at lower bit precision, saving memory and energy in exchange for some accuracy), can even "be deployed locally, fitting within a single [Nvidia] RTX 4090 [GPU] or a 32GB RAM [Apple] MacBook..."
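A quick back-of-envelope calculation shows why the quantized model fits on consumer hardware. The sketch below counts only the memory needed to hold the weights; a real deployment needs extra headroom for the KV cache and activations:

```python
# Back-of-envelope memory estimate for a 24B-parameter model at various
# precisions. Weights only; real inference needs additional headroom.

def weight_memory_gb(params_billions: float, bits_per_param: float) -> float:
    """Memory needed just to hold the weights, in gigabytes."""
    bytes_total = params_billions * 1e9 * bits_per_param / 8
    return bytes_total / 1e9

for bits in (16, 8, 4):
    print(f"{bits}-bit: ~{weight_memory_gb(24, bits):.0f} GB")
# 16-bit: ~48 GB, 8-bit: ~24 GB, 4-bit: ~12 GB
```

At 4-bit precision the weights shrink to roughly 12 GB, which is why the model can fit on a 24 GB RTX 4090 or within 32 GB of unified MacBook memory.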

Khaliq further updated his Hugging Face vibe coding web application, AnyCoder, to make Magistral Medium 1.2 the default LLM powering the app.

The model weights are available now for download via Hugging Face, and you can chat directly with the models on the web at Le Chat, Mistral's chatbot website and ChatGPT competitor.

For developers, the models are available through the Mistral API under "magistral-small-2509" and "magistral-medium-2509", respectively.
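Calling these model identifiers through the API looks roughly like the sketch below, assuming the official `mistralai` Python SDK (the exact client interface may differ across SDK versions, so check Mistral's docs):

```python
# Minimal sketch of a chat request against the Magistral API model ids.
# Assumes the `mistralai` SDK (pip install mistralai) and a MISTRAL_API_KEY
# environment variable; the prompt is a placeholder.
import os

def build_request(model: str, prompt: str) -> dict:
    """Assemble a chat-completion payload for a given Magistral model id."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    }

payload = build_request("magistral-medium-2509", "Summarize this quarter's risks.")

if os.environ.get("MISTRAL_API_KEY"):  # only call out if a key is configured
    from mistralai import Mistral
    client = Mistral(api_key=os.environ["MISTRAL_API_KEY"])
    resp = client.chat.complete(**payload)
    print(resp.choices[0].message.content)
```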

Ultimately, whether you're a developer at a large enterprise, an indie developer, or a researcher, these models are worth considering for language, math, coding, reasoning and image analysis tasks, such as writing alt-text descriptions and captions when publishing content.

API pricing comparison

Mistral’s API pricing is $2 per million input tokens and $5 per million output tokens for Magistral Medium, and $0.50 input / $1.50 output per million tokens for Magistral Small. Here's how it compares to others.

  • Magistral Small ($0.50 input / $1.50 output) undercuts many mid‑tier offerings (e.g., Anthropic's Claude Sonnet) but costs more on input tokens than DeepSeek's cheapest "chat" mode. On output it's competitive, though DeepSeek is still cheaper in some modes.

  • Magistral Medium ($2 / $5) sits in the mid-range of most providers' lineups: much cheaper than top-tier models (Claude Opus, OpenAI's full-scale GPT-5, etc.), but noticeably more expensive than ultra-light options.

  • DeepSeek appears the most cost‑efficient on many fronts, especially for input tokens, where caching helps. If your usage is input-heavy (prompts, document context) but output is modest, DeepSeek's "chat" mode could be much cheaper.

  • Anthropic's high-end models (Opus, etc.) are quite expensive, especially on the output side.

  • Alibaba's Qwen pricing is more variable: some Qwen models offer relatively low input costs, but output pricing can scale up depending on the variant.
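Per-token pricing only becomes concrete against a workload. Using the Magistral rates quoted above, a simple cost calculator (the 10M-input / 2M-output monthly workload is an illustrative assumption):

```python
# Monthly cost estimate from the per-million-token prices quoted above (USD).
PRICES = {
    "magistral-small":  {"input": 0.50, "output": 1.50},
    "magistral-medium": {"input": 2.00, "output": 5.00},
}

def job_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Total cost in dollars for a given token volume."""
    p = PRICES[model]
    return (input_tokens * p["input"] + output_tokens * p["output"]) / 1_000_000

# Hypothetical workload: 10M input tokens, 2M output tokens per month.
for model in PRICES:
    print(f"{model}: ${job_cost(model, 10_000_000, 2_000_000):.2f}")
# magistral-small: $8.00, magistral-medium: $30.00
```

The nearly 4x gap between Small and Medium on the same workload is why input-heavy pipelines often route to the cheaper tier first.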

Following up on a midsummer launch

Mistral introduced the Magistral model family in June 2025 as its first step into reasoning-focused AI. The debut lineup included a proprietary Magistral Medium for enterprise clients and a fully open-source Magistral Small, licensed under Apache 2.0.

This dual-release strategy aimed to balance commercial viability with community access. Magistral Small served as a direct signal that Mistral was reaffirming its open-source commitment, especially after criticism surrounding earlier closed releases like Medium 3.

Benchmarks from the launch period showed the Magistral models to be competitive across coding and math tasks, with Magistral Medium scoring as high as 90%+ on AIME-24 using majority voting.

Its design emphasized traceable reasoning chains, multilingual fluency, and high token throughput—features targeted at industries where verifiability and performance matter.

Measurable performance gains backed by benchmarks

With the release of the 1.2 updates, Mistral AI is now providing more than just incremental gains—it’s achieving top-tier scores across a range of public benchmarks, as visualized in the latest benchmark comparison charts.

On the AIME24 mathematics benchmark, Magistral Medium 1.2 scores 91.82%, slightly edging out DeepSeek-R1 (91.40%) and outperforming Magistral Medium 1.0 (73.59%) by a wide margin.

Magistral Medium 1.2 benchmarks chart. Credit: Mistral

Similar improvements appear across other tasks:

  • AIME25: 83.48% for Magistral Medium 1.2, beating DeepSeek-R1 (79.4%) and Medium 1.0 (64.95%)

  • HMMT25: 76.66%, up from 64.95%

  • GPQA Diamond: 76.26%, vs. 70.83% in v1.0 and slightly behind DeepSeek-R1 (81%)

  • LiveCodeBench v5: 75.00%, a jump from 59.36% in v1.0

  • LiveCodeBench v6: 68.50%, significantly higher than Medium 1.0's 50.29%

  • HLE Text-only (Humanity's Last Exam): 11.76%, nearly 3 points above version 1.0 (8.99%)

While Qwen3-235B-A22B-Thinking leads narrowly in HMMT25, GPQA Diamond, and LiveCodeBench v6, Magistral Medium 1.2 holds its own and consistently outperforms its previous versions and most rivals on AIME and LiveCode coding tasks.

Magistral Small 1.2 also shows clear gains over its 1.0 and 1.1 versions and performs competitively against much larger models.

Magistral Small 1.2 performance benchmarks. Credit: Mistral

Key benchmark comparisons:

  • AIME24: 86.14%, compared to Qwen3-32B at 81.40%, and Small 1.0 at 70.68%

  • AIME25: 77.34%, versus Qwen3-32B at 72.90%, and Small 1.0 at 62.76%

  • GPQA Diamond: 70.07%, modestly ahead of Qwen3-32B (68.40%) and Small 1.0 (68.18%)

  • LiveCodeBench v5: 70.02%, up from 55.84%

  • LiveCodeBench v6: 61.60%, a rise from 47.36%

  • HLE Text-only: 7.46%, compared to 6.44% in the previous version

While the Qwen3-30B-A3B-Thinking model leads in some tasks, such as AIME25 and GPQA, Magistral Small 1.2 consistently outperforms its predecessor and rivals like Qwen3-32B, especially in code-related benchmarks like LiveCodeBench.

Multimodal inputs and vision reasoning

A major feature of the 1.2 updates is support for multimodal inputs.

Both models are now equipped with a vision encoder that allows them to interpret and reason across text and images.

This addition expands the range of tasks the models can handle, including visual question answering, code diagram interpretation, and layout analysis.
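In practice, a multimodal request interleaves text and image parts in a single message. The sketch below uses the OpenAI-style content-parts format that Mistral's chat API accepts; the base64 data-URL encoding shown is an assumption, so verify the exact shape against the API documentation:

```python
# Building a multimodal chat message that pairs a question with an image,
# in the content-parts format used by Mistral's chat API (assumed here).
import base64

def image_message(question: str, image_bytes: bytes) -> dict:
    """Return a user message containing a text part and an inline image part."""
    b64 = base64.b64encode(image_bytes).decode()
    return {
        "role": "user",
        "content": [
            {"type": "text", "text": question},
            {"type": "image_url", "image_url": f"data:image/png;base64,{b64}"},
        ],
    }

# Placeholder bytes stand in for a real PNG file read from disk.
msg = image_message("What does this architecture diagram show?", b"\x89PNG...")
```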

Improved reasoning, format and tool use

Mistral emphasizes enhanced reasoning structure and output formatting in 1.2. Responses are now more natural and concise, especially for simple prompts. Support for LaTeX and Markdown is improved, reducing friction for developers working on technical tasks.

The models are also more adept at using external tools such as web search, code interpreters, and image generators, with better logic around when and how to use these tools.

Both models introduce special [THINK] and [/THINK] tokens to enclose reasoning traces for easier developer review, a design that structures model outputs into internal reasoning followed by a final answer—useful for traceability and debugging.
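Because the reasoning trace is delimited by those tokens, separating it from the final answer is a simple string operation. A minimal sketch, assuming the trace appears once per response:

```python
# Split a Magistral response into its [THINK]...[/THINK] reasoning trace
# and the final answer, for logging or debugging.
import re

def split_reasoning(text: str) -> tuple[str, str]:
    """Return (reasoning, answer); reasoning is empty if no trace is present."""
    m = re.search(r"\[THINK\](.*?)\[/THINK\]", text, flags=re.DOTALL)
    if not m:
        return "", text.strip()
    reasoning = m.group(1).strip()
    answer = (text[:m.start()] + text[m.end():]).strip()
    return reasoning, answer

reasoning, answer = split_reasoning("[THINK]2+2 is 4[/THINK]The answer is 4.")
# reasoning == "2+2 is 4", answer == "The answer is 4."
```

Keeping the trace out of user-facing output while retaining it in logs is the usual pattern here.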

Licensing, deployment and integration

The Apache 2.0 license remains in place for both models, allowing unrestricted commercial and non-commercial use. That makes them a useful option for enterprises, especially those with security concerns or reservations about the wave of new Chinese open-source models that arrived over the summer.

Mistral also provides compatibility with several frameworks and tools including:

  • vLLM (recommended)

  • Transformers

  • llama.cpp

  • LM Studio

  • Kaggle

  • Axolotl and Unsloth for fine-tuning

Recommended parameters for optimal use include:

  • top_p: 0.95, temperature: 0.7, max_tokens: 131072
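Those settings can be bundled into every request. The sketch below applies them via an OpenAI-compatible client pointed at a local vLLM server, Mistral's recommended serving framework; the server URL and Hugging Face model path are placeholders for your own deployment:

```python
# Mistral's recommended sampling parameters, merged into each chat request.
RECOMMENDED = {"top_p": 0.95, "temperature": 0.7, "max_tokens": 131072}

def completion_args(model: str, prompt: str) -> dict:
    """Build chat-completion arguments with the recommended sampling settings."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        **RECOMMENDED,
    }

args = completion_args("mistralai/Magistral-Small-2509", "Explain quantization.")

# With vLLM serving the model locally, the call would look like:
#   from openai import OpenAI
#   client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")
#   resp = client.chat.completions.create(**args)
```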

Multilingual support

Magistral 1.2 models support over two dozen languages — including French, German, Arabic, Japanese and Chinese — and are designed to maintain quality outputs up to a 128k context window. However, Mistral notes performance is optimal under 40k tokens.

This enables use cases like document analysis, long code reviews and multilingual dialogue generation, further expanding the model’s potential applications.

With Magistral 1.2, Mistral continues its dual-path strategy: delivering open, efficient models for developers, while scaling enterprise-ready tools with measurable advantages in reasoning, performance and flexibility. The benchmark results speak to consistent progress—and a clearer competitive footing in the evolving LLM landscape.
