Intuit has been on a multi-year journey building its Generative AI Operating System (GenOS) to power AI agents across its platforms, which include TurboTax, QuickBooks, Credit Karma and Mailchimp. GenOS already enables thousands of developers to build AI experiences and powers deployed agents such as QuickBooks' accounting assistant, which saves small businesses 12 hours per month.

Now the financial software giant is announcing major GenOS enhancements that reveal how enterprises can build domain-specific AI systems that outperform general-purpose alternatives. The latest upgrades focus on three key areas: custom financial large language models, seamless expert-in-the-loop capabilities and advanced agent evaluation frameworks.

The big breakthrough comes from Intuit's new custom-trained Financial LLMs that deliver 90% accuracy on transaction categorization. That represents a marked improvement over previous models while slashing latency by 50% compared to general-purpose LLMs. For a platform already processing tens of millions of AI interactions, those efficiency gains translate into substantial cost savings and dramatically better user experiences.

"What's so extraordinary about this is that as we go forward, we're going to be driving the costs down and the latency down for this model," Ashok Srivastava, Intuit's Chief AI Officer, told VentureBeat in an exclusive briefing. "But what we're also seeing simultaneously is that the quality of the model is improving, and this is something that I've been talking about for a long time, the so-called race to the bottom in terms of cost and latency, race to the top in terms of accuracy."

The technical breakthrough: Semantic understanding at scale

So how did Intuit make its Financial LLMs so much better?

The key innovation lies in how Intuit approached the semantic understanding problem that plagues many enterprise AI implementations. Traditional machine learning models learn direct mappings between transactions and categories. Intuit's Financial LLMs understand the contextual meaning behind financial terminology.

Mapping transactions to categories is something that financial software has done for some time, though typically with rigid pre-defined categories. That's where Intuit is now looking to do better.

"If the categories were predefined and Intuit just had to find those categories, then have all customers map to it...that's actually an easy problem," Srivastava said. "But in this case, everyone has their own set of categories, and we want that. We want that level of personalization."

Because Intuit's Financial LLMs have a deeper grasp of semantics, the system learns what each user's categories actually mean. That semantic understanding lets the models handle personalized categorization schemes — a critical capability for enterprise deployments, where different organizations have unique taxonomies and business rules.
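To make the distinction concrete, here is a minimal, hypothetical sketch of semantic categorization against a user's own taxonomy. Intuit has not published its implementation; the prompt format, the output guardrail and the `stub_llm` function below are illustrative assumptions, with a stub standing in for the fine-tuned model.

```python
# Illustrative sketch only -- not Intuit's actual API or prompt format.

def build_prompt(description: str, user_categories: list[str]) -> str:
    """Ask the model to map a raw transaction to one of the USER'S OWN
    categories (the personalization Srivastava describes), rather than
    to a fixed global taxonomy."""
    cats = ", ".join(user_categories)
    return (
        f"Transaction: {description}\n"
        f"This user's categories: {cats}\n"
        "Answer with the single best-fitting category."
    )

def categorize(description: str, user_categories: list[str], llm_call) -> str:
    prompt = build_prompt(description, user_categories)
    answer = llm_call(prompt).strip()
    # Guardrail: constrain the model's answer to the user's taxonomy.
    return answer if answer in user_categories else "Uncategorized"

# Stub model for demonstration; a real deployment would call the
# fine-tuned Financial LLM here.
def stub_llm(prompt: str) -> str:
    return "Software subscriptions" if "ADOBE" in prompt else "Uncategorized"

print(categorize("ADOBE *CREATIVE CLD",
                 ["Software subscriptions", "Travel"], stub_llm))
```

The key design point is that the user's categories travel with the request, so two customers with different taxonomies get different, equally valid answers for the same raw transaction string.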

The training approach starts with transaction data from banks that's been anonymized and scrubbed for personally identifiable information. Intuit then enhances the model through supervised fine-tuning and specialized guardrails built into the training process that improve semantic understanding.

The multi-step process represents months of innovation work by Srivastava's specialized financial LLM team. This methodical approach to domain-specific model training offers a template for other enterprises looking to build AI systems that outperform general-purpose alternatives in specialized domains.
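As a rough illustration of that pipeline, a supervised fine-tuning example might pair a PII-scrubbed transaction string with the category a user assigned to it. The scrubbing regex and the JSONL record shape below are assumptions for illustration, not Intuit's actual training format.

```python
import json
import re

def scrub_pii(text: str) -> str:
    # Toy anonymization: mask runs of 4+ digits (account/card fragments).
    # A production pipeline applies far more thorough PII removal.
    return re.sub(r"\d{4,}", "####", text)

def sft_record(raw_tx: str, user_category: str) -> str:
    """One supervised fine-tuning example: scrubbed transaction in,
    the user's own category out."""
    return json.dumps({
        "prompt": f"Categorize this transaction: {scrub_pii(raw_tx)}",
        "completion": user_category,
    })

print(sft_record("ACH DEPOSIT 000123456 GUSTO PAYROLL", "Payroll"))
```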

Advanced AI agent evaluation: Beyond accuracy to efficiency

Beyond improving the accuracy of its Financial LLMs, Intuit is also significantly expanding its GenOS Evaluation Service within the Agent Starter Kit. While basic evaluation capabilities have existed since GenOS's inception, the company is now making major investments in sophisticated frameworks that measure agent efficiency and decision quality under uncertainty.

The enhanced evaluation service addresses a critical gap in enterprise AI deployments. Most companies focus on whether AI agents produce accurate results but ignore whether those results represent optimal decisions.

"When you're working with agents, the agent is making decisions or proposing decisions on your behalf," Srivastava explained. "The question is, is that the right set of decisions or not? Number one. Number two, even if it accomplishes the task that you want, is that the best way to do it?"

He illustrated the point with a hypothetical scenario: an AI agent suggests a route from San Francisco to Los Angeles via Oklahoma City.

"That is a potential solution to this problem, but it's highly inefficient," he said. "When you're building agentic technology, you need to see, not only is it accurate, because, frankly, it is accurate, but is it efficient?"

For enterprise AI teams, this suggests moving beyond binary success metrics to evaluate decision paths under uncertainty. The key is building evaluation systems that can assess whether an agent chose the most efficient route to a goal while accounting for external factors and constraints.
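One way to operationalize that idea is to score an agent on two axes: accuracy (did it reach the goal?) and an efficiency ratio (best-known cost divided by the agent's cost), passing only when both clear a threshold. The sketch below is not Intuit's evaluation service; the mileage figures and the 0.8 threshold are illustrative assumptions.

```python
def evaluate(result, expected, agent_cost: float, optimal_cost: float,
             efficiency_threshold: float = 0.8) -> dict:
    """Judge an agent on two axes: did it get the right answer (accuracy),
    and did it get there by a reasonable path (efficiency)?"""
    accurate = result == expected
    efficiency = optimal_cost / agent_cost  # 1.0 means optimal
    return {
        "accurate": accurate,
        "efficiency": round(efficiency, 2),
        "passes": accurate and efficiency >= efficiency_threshold,
    }

# Srivastava's example: the detour does reach Los Angeles, so it is
# "accurate," but its cost dwarfs the direct route.
# (Rough driving miles, for illustration only.)
direct_sf_la = 380
via_okc = 1650 + 1330
verdict = evaluate("Los Angeles", "Los Angeles",
                   agent_cost=via_okc, optimal_cost=direct_sf_la)
print(verdict)
```

An agent that answers correctly but routes through Oklahoma City fails this check on efficiency alone — exactly the gap a binary success metric would miss.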

The strategic implications for enterprise AI

Intuit's GenOS evolution offers several lessons for enterprise AI teams.

Domain specialization beats generalization: Custom models trained on industry-specific data can significantly outperform general-purpose alternatives on specialized tasks, despite requiring more upfront investment.

Evaluation frameworks are competitive advantages: Sophisticated measurement of AI agent efficiency and decision quality under uncertainty separates successful enterprise AI implementations from failed experiments.

Human-AI orchestration requires infrastructure: Seamless expert-in-the-loop capabilities demand purpose-built routing and handoff systems. Ad hoc human oversight isn't sufficient.

Developer productivity compounds: Internal AI tooling investments create accelerating returns through improved developer velocity and code quality.

For enterprises looking to lead in AI adoption, Intuit's approach suggests a clear strategy: build specialized, domain-aware AI systems with sophisticated evaluation frameworks. Simply deploying general-purpose models isn't enough.