A new test-time scaling technique from Meta AI and UC San Diego provides a set of dials that can help enterprises maintain the accuracy of large language model (LLM) reasoning while significantly reducing inference costs. The method, called DeepConf, leverages the internal confidence signals of models to dynamically filter out low-quality reasoning paths.
In tests on challenging reasoning benchmarks, DeepConf achieved up to 99.9% accuracy and reduced the number of generated tokens by as much as 84.7% compared to standard methods. Because it applies to open models, requires no additional training, and can be integrated into existing serving frameworks, DeepConf presents a useful tool for real-world applications built on open-source LLMs.
The high cost of thinking
To improve the reasoning abilities of LLMs on complex tasks, developers often use test-time scaling methods. One popular approach is “self-consistency with majority voting”: the model is run on the same prompt many times, and the most common answer is chosen as the final response. While effective, this technique carries significant computational overhead; generating hundreds of reasoning traces for a single prompt drives up inference costs and makes the approach impractical for many applications.
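The baseline is simple to state in code. This is a minimal sketch, where `generate` stands in for any function that runs the model once and returns a final answer (a hypothetical name, not from the paper):

```python
from collections import Counter

def self_consistency(generate, prompt, n=64):
    """Sample n independent reasoning traces for the same prompt and
    return the most common final answer (plain majority voting)."""
    answers = [generate(prompt) for _ in range(n)]  # one full trace per call
    return Counter(answers).most_common(1)[0][0]
```

The cost problem is visible here: every one of the `n` calls produces a complete reasoning trace, whether or not it ends up contributing to the winning answer.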
This approach can also yield diminishing returns as more traces are added. The core issue is that standard majority voting treats every reasoning path equally, regardless of its quality. This can lead to suboptimal results when a few low-quality but similar answers dominate the voting process and outvote the correct answer.
Previous research has explored calculating a "global confidence" score for an entire reasoning trace from the model’s internal signals, such as the probabilities it assigns to the tokens it generates. This helps filter out some low-quality paths, but it has its own limitations. A global score can obscure critical breakdowns that occur at specific steps in the reasoning process. More importantly, it requires the model to generate the complete trace before it can be evaluated, which rules out early termination and the cost savings that come with it.
Thinking with confidence
Deep Think with Confidence (DeepConf), the new technique by Meta AI and UCSD, uses a more nuanced, confidence-aware filtering system based on local confidence measurements. Instead of a single global score, the researchers introduced several alternative metrics to provide a more fine-grained assessment of an answer’s quality based on the model’s confidence in different parts of its response.

These metrics include "group confidence," which calculates confidence across different segments of tokens generated by the model, and "tail confidence," which focuses on the final portion of the reasoning. Another metric is "lowest group confidence," which identifies the single least-confident segment in a reasoning path. The intuition is that a line of reasoning is only as strong as its weakest link; a sharp drop in confidence often signals a critical error, even if the rest of the trace appears confident.
DeepConf operates in two primary modes. In "offline thinking," all reasoning traces are generated first. DeepConf then uses its confidence metrics to either weight the answers (giving more influence to high-confidence traces) or filter out the least confident traces entirely before a final vote is taken. This addresses the weakness of vanilla majority voting.
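A minimal sketch of the offline mode, assuming each trace has already been scored with one of the confidence metrics above; the `keep_ratio` filter and confidence-weighted vote are the two levers the section describes (the 10% default mirrors the aggressive setting reported later, but is otherwise a placeholder):

```python
from collections import Counter

def offline_vote(traces, keep_ratio=0.1):
    """traces: list of (answer, confidence) pairs, one per reasoning trace.
    Keep only the top fraction by confidence, then take a
    confidence-weighted majority vote over the survivors."""
    ranked = sorted(traces, key=lambda t: t[1], reverse=True)
    kept = ranked[:max(1, int(len(ranked) * keep_ratio))]
    votes = Counter()
    for answer, conf in kept:
        votes[answer] += conf  # weight each vote by trace confidence
    return votes.most_common(1)[0][0]
```

Filtering before voting is what prevents a cluster of low-quality but mutually agreeing traces from outvoting a correct, high-confidence answer.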
In "online thinking," the system evaluates trace quality in real-time as tokens are being generated. If the group confidence drops below a predetermined threshold, the model immediately stops generating that trace and moves on. This dynamic termination is especially valuable in resource-constrained environments or applications where quick responses are critical. The researchers provide two variants for this mode: DeepConf-low, which aggressively filters traces to maximize performance and savings, and DeepConf-high, a more conservative option that prioritizes maintaining baseline accuracy.
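The online mode amounts to watching a running window of confidence during decoding and aborting when it collapses. A minimal sketch, assuming `stream` yields `(token, confidence)` pairs as the model decodes (the threshold and window size are placeholders; DeepConf-low and DeepConf-high would correspond to stricter and looser thresholds):

```python
def generate_with_early_stop(stream, threshold, window=16):
    """Consume (token, confidence) pairs from a decoding stream and abort
    as soon as the mean confidence over the last `window` tokens drops
    below `threshold`. Returns (tokens, completed)."""
    tokens, confs = [], []
    for token, conf in stream:
        tokens.append(token)
        confs.append(conf)
        recent = confs[-window:]
        if len(recent) == window and sum(recent) / window < threshold:
            return tokens, False  # trace abandoned mid-generation
    return tokens, True  # trace ran to completion
```

Because the check runs token by token, a doomed trace is cut off as soon as its confidence dips, rather than after the full trace has been paid for.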

DeepConf in action
The researchers evaluated DeepConf on several recent open-source models, including DeepSeek-8B, Qwen3-32B, and the GPT-OSS series, across challenging mathematical and STEM reasoning benchmarks like AIME and HMMT. They compared its performance against standard single-trace generation and self-consistency with majority voting.
In offline tests, DeepConf consistently outperformed standard majority voting. The aggressive filtering strategy (keeping only the top 10% most confident traces) yielded the largest gains. For example, it boosted the accuracy of DeepSeek-8B on the AIME25 benchmark from 82.3% to 87.4%. Most notably, with GPT-OSS-120B, it achieved 99.9% accuracy on the same benchmark, effectively saturating it.
Online evaluations demonstrated significant efficiency gains by reducing the number of tokens the models generate. Compared to the baseline, the aggressive DeepConf-low setting reduced token generation by 43-79% across multiple benchmarks while matching or improving accuracy in most cases. For instance, on the AIME24 dataset, DeepConf-low improved DeepSeek-8B's accuracy by 5.8 percentage points while using 77.9% fewer tokens.

The researchers will soon release the code for DeepConf on GitHub.
For enterprise leaders wondering about the engineering lift, paper co-author and Meta AI research scientist Jiawei Zhao explained that DeepConf is designed for easy adoption. “It’s more like a plug-in layer on top of the existing serving stack,” he said. For teams already using parallel generation techniques such as majority voting, implementing the offline mode is a “drop-in change.” The more efficient online mode requires adding a “small hook during generation that looks at token probabilities in real time,” which he describes as a minor change, not a major rewrite. The team is actively working on integrations with popular inference frameworks like vLLM, with others like SGLang and TensorRT-LLM also exploring support.
However, it's important to apply DeepConf where it shines. “DeepConf is best suited for structured reasoning tasks like math, science, or coding, where the model’s internal confidence reflects the quality of its reasoning,” Zhao notes. For more subjective, open-ended tasks like document summarization or marketing content generation, he explains that the confidence signal “doesn’t translate as cleanly,” as there isn't a long reasoning chain to measure.
This focus on structured reasoning leads to a clear trade-off that enterprise leaders can manage. Zhao clarifies that the choice between its aggressive and conservative modes comes down to the risk tolerance of the use case. For high-stakes applications like financial analysis or legal review where reliability is paramount, he recommends DeepConf-high. This safer mode "still cuts about half of the generation cost, while staying very close to the model’s baseline accuracy."
In contrast, the more aggressive DeepConf-low, which can achieve 70-85% token savings, is better suited for lower-stakes tasks like internal knowledge base queries or generating first drafts, "where speed and cost matter more than absolute correctness." This provides businesses with a clear dial to tune the balance between cost and reliability for their specific needs.
As businesses increasingly rely on LLMs for complex reasoning, managing inference costs at scale becomes a critical challenge. By making reasoning more efficient without the need for costly retraining, DeepConf offers a practical path forward. The researchers conclude, “We hope this method highlights the potential of test-time compression as a practical and scalable solution for efficient LLM reasoning.”
But looking ahead, Zhao believes the concept's potential goes far beyond just saving tokens. He sees it as a foundational step toward more autonomous AI systems that can "pause when unsure, switch strategies, or even ask for clarification" when their confidence is low. This internal self-awareness could be key to building the next generation of more reliable and adaptive AI for the enterprise.
