The whale hath returned!
DeepSeek, the Chinese AI startup spun off from quantitative trading firm High-Flyer Capital Management (and which uses a whale icon for its logo), is back today with a new large language model update: DeepSeek-V3.1-Terminus, an upgraded version of the V3.1 model it released about a month ago, designed to improve performance and reduce user-reported errors.
Available immediately via Hugging Face, the DeepSeek iOS and Android apps, and DeepSeek's application programming interface (API), and already being added to third-party open-source tools such as AnyCoder on Hugging Face and NovitaLabs' serverless API, Terminus delivers noticeably stronger performance on agentic tool-use tasks, including coding and search-based evaluations, and curbs its predecessor V3.1's tendency to insert Chinese words into English responses.
Background on DeepSeek V3, 3.1 and now, Terminus
Terminus is based on the DeepSeek V3 model family that initially debuted in December 2024 but was quickly overshadowed a month later by the release of DeepSeek R1 in January 2025, thanks to R1's high performance on third-party benchmarks, particularly in coding, math, and tasks requiring multi-step "reasoning," or thinking through a problem before responding.
While R1 excels at logic, math, and structured problem-solving, it costs more per use than V3 and can be slower, because it works through problems in more detail, trading speed for precision.
DeepSeek-V3, by contrast, is more of a workhorse for general business uses. It is efficient and performs strongly across many domains: writing, summarization, customer-facing chat, basic code, and general reasoning. It is less expensive to run, faster on simpler tasks, and more versatile across varied scenarios. But when a problem requires deep logic or multi-step reasoning, it does not reach the same level of precision as R1.
When the first update to the V3 family, DeepSeek V3.1, was released in August 2025, it quickly made headlines for its scale and accessibility.
As reported by VentureBeat, the 685-billion-parameter model matched or exceeded performance benchmarks of proprietary U.S.-based systems while remaining fully open-source under a permissive and enterprise-friendly MIT License, which allows for commercial usage.
The release was widely seen as a strategic challenge to closed-model approaches, and it signaled China’s growing influence in frontier AI development.
Now, DeepSeek-V3.1-Terminus pushes DeepSeek's general-purpose LLM further, offering reasoning as well, all still under a commercially viable MIT License.
Refinements Based on User Feedback
The Terminus release focuses on two key areas of improvement: language consistency and agentic tool effectiveness.
According to DeepSeek, previous models occasionally mixed Chinese and English text or produced abnormal characters, issues that the Terminus version aims to resolve.
The update also strengthens DeepSeek's own "Code Agent" and "Search Agent," both task-specific frameworks that allow users to focus the underlying Terminus LLM on generating code and searching/synthesizing information from the web, respectively.
These refinements are reflected in benchmark results shared by the company. In agentic tool use tasks, Terminus shows clear improvements.
The model outperforms its predecessor in SimpleQA (96.8 vs. 93.4), BrowseComp (38.5 vs. 30.0), SWE Verified (68.4 vs. 66.0), SWE-bench Multilingual (57.8 vs. 54.5), and Terminal-bench (36.7 vs. 31.3). These gains suggest enhanced performance in real-world use cases where models must interact with tools or external systems.

On pure reasoning tasks without tool use, the results are more nuanced. The model shows modest increases in GPQA-Diamond (80.7 vs. 80.1) and Humanity’s Last Exam (21.7 vs. 15.9), with negligible differences elsewhere.
Interestingly, a small drop is noted in the Codeforces benchmark (2046 vs. 2091), a test commonly used to evaluate coding proficiency.
Two Modes: Chat and Reasoner
DeepSeek-V3.1-Terminus is offered in two operational modes:
deepseek-chat (Non-thinking Mode)
deepseek-reasoner (Thinking Mode)
Both versions support a context length of 128,000 tokens, smaller than the frontier of 2 million from Grok 4 Fast and 1 million from Google Gemini 2.5 Pro, as well as the 256,000 from OpenAI's GPT-5, but still enough for roughly 300-400 pages of text to be exchanged in a single input/output interaction.
The chat mode offers function calling, FIM (Fill-in-the-Middle) completion, and JSON output, while the reasoner mode omits function calling and FIM, focusing instead on deeper contextual reasoning.
If a request to the reasoner model includes tool usage, it is automatically rerouted to the chat model.
The maximum output token lengths also vary:
Chat mode supports up to 8,000 tokens (default: 4,000)
Reasoner mode supports up to 64,000 tokens (default: 32,000)
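Taken together, the two modes map onto an OpenAI-style chat-completions request. The sketch below is illustrative only: the `build_request` helper and the payload shape are assumptions, while the model names and token limits come from the figures above.

```python
# Illustrative sketch (not official client code): building request payloads
# for the two DeepSeek-V3.1-Terminus modes. Model names ("deepseek-chat",
# "deepseek-reasoner") and token limits are those stated above; the helper
# itself is hypothetical.

def build_request(prompt: str, thinking: bool = False) -> dict:
    """Return a chat-completions-style payload for the chosen mode."""
    if thinking:
        # Reasoner mode: default 32,000 output tokens, capped at 64,000.
        model, max_tokens = "deepseek-reasoner", 32_000
    else:
        # Chat mode: default 4,000 output tokens, capped at 8,000.
        model, max_tokens = "deepseek-chat", 4_000
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }

chat_req = build_request("Summarize this changelog.")
reasoner_req = build_request("Work through this proof step by step.", thinking=True)
```

Note that, per the rerouting behavior described below, a payload that includes tool definitions would be served by the chat model even if it names the reasoner.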
API Pricing Structure
The pricing for both modes on DeepSeek's API is based on token usage and distinguishes between cache hits and misses: if part of the input has already been processed and stored (a cache hit), those tokens are billed at a lower rate than freshly processed ones (a cache miss):
1M Input Tokens (Cache Hit): $0.07
1M Input Tokens (Cache Miss): $0.56
1M Output Tokens: $1.68
Token billing is based on the sum of input and output tokens. If both a topped-up and granted balance are available, the granted balance is used first.
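To make those rates concrete, here is a back-of-the-envelope estimate in Python. The `estimate_cost` helper is hypothetical; it simply applies the per-million-token prices listed above to an example request.

```python
# Illustrative cost arithmetic using the API prices quoted above
# (per million tokens: $0.07 cache-hit input, $0.56 cache-miss input,
# $1.68 output). The helper is a sketch, not an official billing formula.

PRICE_PER_M = {"input_hit": 0.07, "input_miss": 0.56, "output": 1.68}

def estimate_cost(hit_tokens: int, miss_tokens: int, output_tokens: int) -> float:
    """Return the estimated USD cost of one request."""
    return (hit_tokens * PRICE_PER_M["input_hit"]
            + miss_tokens * PRICE_PER_M["input_miss"]
            + output_tokens * PRICE_PER_M["output"]) / 1_000_000

# Example: 50K cached input tokens, 10K new input tokens, 2K output tokens.
cost = estimate_cost(50_000, 10_000, 2_000)  # roughly $0.0125
```

Even a fairly large request stays at around a penny, which illustrates why caching-aware pricing matters for agentic workloads that repeatedly resend long contexts.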
However, given that DeepSeek is a China-based firm, enterprises in the U.S. and the West should perform due diligence before relying on the API. Of course, the model is also available for them to use, download, modify, and customize from Hugging Face, which largely removes concerns about security or data-usage policies. In that case, though, the company would need to host the model itself, or rent inference capacity, to serve it.
For developers interested in self-hosting, the model maintains the same architecture as DeepSeek-V3.1. Updated inference demo code is included in the repository to facilitate local deployment.
A known technical issue remains in the current checkpoint: the self_attn.o_proj parameter does not yet conform to the UE8M0 FP8 scale data format. DeepSeek notes this will be corrected in a future release.
What's Next for DeepSeek?
The release of DeepSeek-V3.1-Terminus reflects a continued effort by the company to iterate based on community input.
While most improvements are incremental, the enhanced agent performance and expanded feature set in the chat mode are likely to benefit developers and researchers seeking stable, tool-integrated language model capabilities.
As DeepSeek builds on the momentum from its earlier V3.1 release, the company continues to test the boundaries of both technical achievement and accessibility.
With ongoing interest from the global research and developer communities, its open-source approach remains a key differentiator in the evolving AI landscape.
Meanwhile, chatter has already started on social media that DeepSeek V4 is in the works. And of course, many are waiting for a successor to DeepSeek R1, the presumed DeepSeek R2. However, some commentators have alleged that DeepSeek's continued focus on its V3 series is evidence the company has hit development challenges in training more powerful models, despite this release and that of DeepSeek R1-0528 back in May 2025.
