When the One Big Beautiful Bill arrived as a 900-page unstructured document — with no standardized schema, no published IRS forms, and a hard shipping deadline — Intuit's TurboTax team had a question: could AI compress a months-long implementation into days without sacrificing accuracy?
What they built is less a tax story than a template: a workflow combining commercial AI tools, a proprietary domain-specific language and a custom unit test framework that any domain-constrained development team can learn from.
Joy Shaw, director of tax at Intuit, has spent more than 30 years at the company and lived through both the Tax Cuts and Jobs Act and the OBBB. "There was a lot of noise in the law itself and we were able to pull out the tax implications, narrow it down to the individual tax provisions, narrow it down to our customers," Shaw told VentureBeat. "That kind of distillation was really fast using the tools, and then enabled us to start coding even before we got forms and instructions in."
How the OBBB raised the bar
When the Tax Cuts and Jobs Act passed in 2017, the TurboTax team worked through the legislation without AI assistance. It took months, and the accuracy requirements left no room for shortcuts.
"We used to have to go through the law and we'd code sections that reference other law code sections and try and figure it out on our own," Shaw said.
The OBBB arrived with the same accuracy requirements but a different profile. At 900-plus pages, it was structurally more complex than the TCJA. It came as an unstructured document with no standardized schema. The House and Senate versions used different language to describe the same provisions. And the team had to begin implementation before the IRS had published official forms or instructions.
The question was whether AI tools could compress the timeline without compromising the output. The answer required a specific sequence and tooling that did not exist yet.
From unstructured document to domain-specific code
The OBBB was still moving through Congress when the TurboTax team began working on it. Using large language models, the team summarized the House version, then the Senate version and then reconciled the differences. Both chambers referenced the same underlying tax code sections, a consistent anchor point that let the models draw comparisons across structurally inconsistent documents.
By signing day, the team had already filtered the provisions down to those affecting TurboTax customers, narrowing them to specific tax situations and customer profiles. Parsing, reconciliation and provision filtering moved from weeks to hours.
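Intuit has not published its reconciliation pipeline, but the anchoring idea the team describes — aligning two structurally different bill versions by the tax code sections they both reference — can be sketched in a few lines. Everything here is illustrative: the `Provision` type and `reconcile` function are hypothetical, and the summaries stand in for model-generated text.

```python
from dataclasses import dataclass

@dataclass
class Provision:
    irc_section: str   # tax code section the provision amends, e.g. "IRC 24"
    summary: str       # model-generated summary of that chamber's language

def reconcile(house: list[Provision], senate: list[Provision]) -> dict:
    """Align provisions from two bill versions by the tax code section
    they reference, flagging sections where the chambers' summaries
    diverge and sections that appear in only one version."""
    h = {p.irc_section: p.summary for p in house}
    s = {p.irc_section: p.summary for p in senate}
    return {
        "agree":       sorted(k for k in h.keys() & s.keys() if h[k] == s[k]),
        "diverge":     sorted(k for k in h.keys() & s.keys() if h[k] != s[k]),
        "house_only":  sorted(h.keys() - s.keys()),
        "senate_only": sorted(s.keys() - h.keys()),
    }
```

The shared section numbers act as the consistent key across inconsistent documents; the "diverge" bucket is what a human reviewer, or a follow-up model pass, would then reconcile.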
Those tasks were handled by ChatGPT and general-purpose LLMs. But those tools hit a hard limit when the work shifted from analysis to implementation. TurboTax does not run on a standard programming language. Its tax calculation engine is built on a proprietary domain-specific language maintained internally at Intuit. Any model generating code for that codebase has to translate legal text into syntax it was never trained on, and identify how new provisions interact with decades of existing code without breaking what already works.
Claude became the primary tool for that translation and dependency-mapping work. Shaw said it could identify what changed and what did not, letting developers focus only on the new provisions. "It's able to integrate with the things that don't change and identify the dependencies on what did change," she said. "That sped up the process of development and enabled us to focus only on those things that did change."
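The dependency-mapping step — finding which parts of a decades-old calculation graph a changed provision can reach — is, at its core, a reachability problem. The following is a minimal sketch of that idea, not Intuit's implementation: the dependency map, node names and `affected_nodes` function are all assumed for illustration.

```python
from collections import deque

def affected_nodes(deps: dict[str, list[str]], changed: set[str]) -> set[str]:
    """Given a dependency map (node -> the nodes it reads from) and a set
    of changed nodes, return every downstream node whose value could shift,
    so review effort can focus only on that subset."""
    # Invert the map: for each node, which nodes depend on it.
    dependents: dict[str, list[str]] = {}
    for node, inputs in deps.items():
        for inp in inputs:
            dependents.setdefault(inp, []).append(node)
    # Breadth-first walk from the changed nodes through their dependents.
    seen = set(changed)
    queue = deque(changed)
    while queue:
        cur = queue.popleft()
        for d in dependents.get(cur, []):
            if d not in seen:
                seen.add(d)
                queue.append(d)
    return seen - changed  # downstream nodes only
```

Everything outside the returned set is, by construction, untouched by the change — which is what lets developers ignore it and "focus only on those things that did change."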
Building tooling matched to a near-zero error threshold
General-purpose LLMs got the team to working code. Getting that code to shippable quality required two proprietary tools built during the OBBB cycle.
The first auto-generated TurboTax product screens directly from the law changes. Previously, developers curated those screens individually for each provision. The new tool handled the majority automatically, with manual customization only where needed.
The second was a purpose-built unit test framework. Intuit had always run automated tests, but the previous system produced only pass/fail results. When a test failed, developers had to manually open the underlying tax return data file to trace the cause. "The automation would tell you pass, fail, you would have to dig into the actual tax data file to see what might have been wrong," Shaw said. The new framework identifies the specific code segment responsible, generates an explanation and allows the correction to be made inside the framework itself.
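Intuit has not disclosed the framework's internals, but the pattern Shaw describes — attributing a failure to the responsible code segment and explaining it, rather than emitting a bare pass/fail — can be sketched by checking each intermediate result of a staged calculation. The step names, tax arithmetic and `run_with_attribution` function below are hypothetical, assumed only for illustration.

```python
def run_with_attribution(steps, inputs, expected):
    """Run a calculation as a sequence of named steps, checking each
    intermediate result against an expected value so a failure points
    at the step (code segment) responsible, with an explanation."""
    state = dict(inputs)
    for name, fn in steps:
        state[name] = fn(state)
        if name in expected and state[name] != expected[name]:
            return {
                "status": "fail",
                "segment": name,
                "explanation": f"{name}: expected {expected[name]}, got {state[name]}",
            }
    return {"status": "pass", "segment": None, "explanation": None}
```

Under the old system, a developer saw only "fail" and had to open the tax data file to find which stage diverged; here the framework itself names the stage and the expected-versus-actual values, which is what makes the fix possible in context.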
Shaw said accuracy for a consumer tax product has to be close to 100 percent. Sarah Aerni, Intuit's VP of technology for the Consumer Group, said the architecture has to produce deterministic results. "Having the types of capabilities around determinism and verifiably correct through tests — that's what leads to that sort of confidence," Aerni said.
The tooling handles the speed. But Intuit also uses LLM-based evaluation tools to validate AI-generated output, and even those require a human tax expert to assess whether the result is correct. "It comes down to having human expertise to be able to validate and verify just about anything," Aerni said.
Four components any regulated-industry team can use
The OBBB was a tax problem, but the underlying conditions are not unique to tax. Healthcare, financial services, legal tech and government contracting teams regularly face the same combination: complex regulatory documents, hard deadlines, proprietary codebases, and near-zero error tolerance.
Based on Intuit's implementation, four elements of the workflow are transferable to other domain-constrained development environments:
Use commercial LLMs for document analysis. General-purpose models handle parsing, reconciliation and provision filtering well. That is where they add speed without creating accuracy risk.
Shift to domain-aware tooling when analysis becomes implementation. General-purpose models generating code into a proprietary environment without understanding it will produce output that cannot be trusted at scale.
Build evaluation infrastructure before the deadline, not during the sprint. Generic automated testing produces pass/fail outputs. Domain-specific test tooling that identifies failures and enables in-context fixes is what makes AI-generated code shippable.
Deploy AI tools across the whole organization, not just engineering. Shaw said Intuit trained and monitored usage across all functions. AI fluency was distributed across the organization rather than concentrated in early adopters.
"We continue to lean into the AI and human intelligence opportunity here, so that our customers get what they need out of the experiences that we build," Aerni said.
