AI agents aren’t anything new in customer service — we’re all familiar with that automated voice that first greets us when we call an 800 number. Typically, they've handled so-called Tier 1 (general) and Tier 2 (low-level technical) inquiries. 

Now, though, at least for AI-driven customer service company Zendesk, agentic AI can handle more complex multi-turn inquiries and even execute multi-step processes like returns.

The provider of support center interaction and workflow-tracking software has seen significant gains in these capabilities since deploying GPT-5 in its Zendesk Resolution Platform.

The company has found that, for most customers, agents powered by GPT-5 can solve more than 50% of tickets — and in some cases, even 80 to 90% of them. Further, OpenAI’s latest iteration is faster, fails less often and can comprehend ambiguity better. 

“We talk about two personas of AI: There's the autonomous agent, and then the copilot agent,” Shashi Upadhyay, Zendesk’s president of product, engineering and AI, told VentureBeat. “The autonomous agent takes the first shot at it, and if it has to hand it off, the copilot agent helps the human agent solve the problem.”

Solving more problems, more quickly, and understanding complexity

With Zendesk, a customer’s first interaction is always with an autonomous AI agent; if it can’t solve the problem, it hands things over to a human agent. Even just a year ago, the tasks it could handle were limited to simple information retrieval — serving up a link to help a customer reinstall software on their iPhone, for example. 

But today’s agents don’t just provide those links; they summarize them and provide step-by-step instructions. 

“What we have found is that there are a lot of tasks, a lot of tickets, a lot of problems that the current generation of AI is able to solve pretty well, and it keeps getting better,” said Upadhyay. 

His team has been working with GPT-5 for a few months — previously the company used GPT-4o — testing out various service-oriented scenarios and integrations and providing feedback to OpenAI prior to the model’s launch in early August. 

A key finding: GPT-5 allows for medium reasoning with “significantly longer” context windows, which can be useful in multi-turn conversations (dialogues extending beyond simple question-answer), step-by-step procedure execution and generation of structured outputs from loosely worded inputs. 

The team’s main goal was to maintain conversational structure, accuracy and context window efficiency, and Upadhyay notes that GPT-5 performs reliably even with higher token loads, allowing for smoother automated service interactions with multiple turns and inputs. 

He identified top use cases for GPT-5: 

  • Long-context answer generation;

  • Intent clarification and disambiguation (identifying what the user wants even if they’re being vague); 

  • Agent reply generation in auto-assist scenarios (generating draft responses for human agents);

  • Procedure compilation and execution (translating high-level procedures into low-level instructions, then acting on them).
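The intent clarification use case above can be sketched in code. This is a minimal, hypothetical illustration — the function and field names are assumptions, not Zendesk's actual API, and the model call is stubbed with keyword matching so the example is self-contained.

```python
# Hypothetical sketch of intent disambiguation: mapping a loosely worded
# customer message to a structured intent. In production, classify_intent
# would call an LLM; here it is stubbed with keyword rules.
from dataclasses import dataclass

@dataclass
class Intent:
    name: str
    confidence: float
    needs_clarification: bool

def classify_intent(message: str) -> Intent:
    """Stand-in for an LLM-backed intent classifier."""
    text = message.lower()
    if "refund" in text or "return" in text:
        return Intent("initiate_return", 0.92, False)
    if "install" in text or "reinstall" in text:
        return Intent("software_install_help", 0.88, False)
    # Vague input: flag for a clarifying question instead of guessing
    return Intent("unknown", 0.30, True)

intent = classify_intent("hey, the chair I got last week is wobbly, can I return it?")
print(intent.name)  # initiate_return
```

The key design point mirrors the article: when the input is too vague to map to a known intent, the agent asks a clarifying question rather than hallucinating an action.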

Early results have been impressive. Notably, GPT-5 showed high execution reliability: 95%-plus on standard procedures, with a 30% reduction in failure on large flows. “That improvement is a very big deal in an enterprise setting,” Upadhyay explained. 

Essentially, execution measures how well a model handles instructions, he explained: When you ask it to do something, does it do that directly? Or does it do something else? Does it follow up? Does it just hang up? 

Upadhyay noted that gen AI agents have been “notoriously bad” at carrying out orders. “You can tell them, ‘Follow these five steps,’ but because they hallucinate and they try to be creative, they don't always follow all the five steps,” he said. In fact, the more steps a model is given, the more likely it is to hallucinate. 

Other notable improvements with GPT-5 include: 

  • Fewer fallback escalations: Reduced by more than 20%. “Meaning it was able to solve 20% more problems than the previous model,” said Upadhyay. “That's a huge lift in our world.”

  • Increased speed: 25 to 30% faster overall and supporting 3 to 4 more prompt iterations per minute.

  • Better ability to handle ambiguity and clarify customer input, allowing for increased coverage of automated flows in more than 65% of conversations. 

  • More complete responses with fewer missed details, reducing agent handoffs.

  • Maintaining structure across long workflows and adapting to “real-world service complexity” without losing context.

  • Higher quality assist: A 5-point lift in agent suggestion accuracy across four languages, providing more concise and contextually relevant responses aligned with tone guidelines. 

These improvements are critical for Zendesk, Upadhyay notes, as the company has introduced outcome-based pricing, meaning it gets paid only when AI actually resolves a problem. 

“The more of these workflows an AI agent is able to handle by itself, the more value that provides for our customers,” he said. 

A rigorous evaluation process

Zendesk takes a modular approach to AI: GPT-5 handles the conversation between the autonomous agent and the human agent, operating in conjunction with an intent classification and reasoning pipeline. Other models in the mix include Anthropic’s Claude, Google’s Gemini and Meta’s Llama. 

“We are always working with a collection of models,” said Upadhyay. “We test them, we pick the one that’s best suited for a particular task, considering performance versus cost trade-offs.” 

When evaluating new models, his team is not looking for “benchmark wins”; rather, they want to see whether the model produces tangible, accurate outcomes. Their finely honed process allows them to roll out new models in less than 24 hours, and is based on a five-factor evaluation framework: 

  • Precision: Can the model return accurate, complete answers grounded in trusted sources like help center articles?

  • Automated resolution: Does it increase the percentage of issues auto-resolved without human intervention?

  • Execution: Can it follow structured workflows with high fidelity?

  • Latency: Does it respond quickly enough in live support environments? 

  • Safety: Does it avoid hallucination and only act when confident? 
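The five-factor framework above can be sketched as a simple gating check. The thresholds and field names below are illustrative assumptions (only the 95%+ execution figure comes from the article); the point is the structure: a candidate model is promoted only if it clears every bar.

```python
# Illustrative sketch of a five-factor model evaluation gate.
# Threshold values are assumptions, not Zendesk's actual cutoffs.
FACTORS = ["precision", "automated_resolution", "execution", "latency", "safety"]

THRESHOLDS = {
    "precision": 0.90,             # accurate answers grounded in trusted sources
    "automated_resolution": 0.50,  # share of tickets auto-resolved
    "execution": 0.95,             # workflow fidelity (article cites 95%+)
    "latency": 0.80,               # fast enough for live support
    "safety": 0.99,                # acts only when confident, avoids hallucination
}

def passes_evaluation(scores: dict) -> bool:
    """A candidate model is promoted only if every factor clears its bar."""
    return all(scores.get(f, 0.0) >= THRESHOLDS[f] for f in FACTORS)

candidate_scores = {"precision": 0.93, "automated_resolution": 0.55,
                    "execution": 0.96, "latency": 0.85, "safety": 0.995}
print(passes_evaluation(candidate_scores))  # True
```

Gating on all five factors at once, rather than a single aggregate score, is what keeps a model that wins on benchmarks but fails on latency or safety out of production.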

As Upadhyay noted: “They need guardrails so they don't do stupid stuff.” 

Strong operational guardrails include real-time observability with structured logging of agent behavior; intent-layer pre-routing (routing based on intent rather than simply forwarding information) to reduce risk and improve clarity; triggered governance to prevent out-of-policy responses; and protocols that default to safe escalation or agent involvement. 

“We treat the model as a nondeterministic tool within a controlled system, not a standalone decision-maker,” said Upadhyay. “That is what enables us to deploy it in enterprise-grade environments.”

AI agents and human agents should be trained the same

Ultimately, AI agents should be treated just like human agents, Upadhyay emphasizes: Regularly trained and managed, and taught how to take action in a way that aligns with the enterprise’s mission.  

“They're raw, they're smart, but you have to teach them how to operate in a completely new environment, like an intern or a human getting a new job,” said Upadhyay. 

This is because out-of-the-box models are general-purpose and trained on a large corpus of internet data. More often than not, they’ve never seen data inside a particular enterprise; they haven’t seen what a support ticket looks like, what a sales call looks like. 

Zendesk’s approach is to map vague input to clear actions, then synthesize responses and execute multi-step workflows. Upadhyay’s team uses an internal test bench, iterating through examples and using knowledge graphs and action builders so that models can act. 

“We reset any data that's already available, then run the models on top of that and continue to change the process until we can get it right,” he explained. 

In production, layers include a quality assurance (QA) agent that monitors every conversation and an analytics agent. “Like a coach, it makes an assessment: ‘Was that a good interaction or not?’” Upadhyay explained. “That determination is then used to improve the performance of both human and AI agents.” 

As an 18-year-old company with 100,000 customers and operations in nearly 150 countries, Zendesk has an incredible amount of data at its disposal. 

“Over time, we have seen every possible support ticket, over every possible industry,” said Upadhyay. “We can fine-tune out-of-the-box models to an extremely high degree depending upon what industry we're talking about, or what language we're talking about.”

This data can help models understand what a good resolution looks like, or what a human agent could have done better in a specific situation. AI is tested and compared to identical human-led circumstances; it’s a continuous process of training, fine-tuning and narrowing responses to reduce hallucination rates and improve instruction-following. 

Accuracy is critical in enterprise settings, Upadhyay emphasized. “If you're right 90% of the time in a consumer setting, people are very impressed,” he said. “In an enterprise setting, you have to push that to 99% correctness, or better, over time.” 

From knowledge retrieval, to reasoning, to humans with superpowers

What differentiates GPT-5 and other newer models is their ability to reason and answer questions, not just retrieve data and generate content, Upadhyay noted. 

“AI agents have crossed a barrier where they handle more complex problems really easily now because they can reason,” he said. “They can use information, often coming from multiple sources, and give you a coherent answer.” 

So, for example, let’s say a customer bought a piece of furniture online and wants to return it. The process can require a complicated set of steps — the agent has to first determine that they are, in fact, the original purchaser by pulling data from the customer relationship management (CRM) system, then cross-validate with the order management platform, pull up policy documents, reason whether the return is valid, then initiate a credit or refund and arrange for item return. 

Reasoning models can tackle that multi-step process and have shown “real improvement” where they can take action, said Upadhyay. “The agent can decide that, yes, you're eligible for a return, but also take action to enable you to do the return,” he noted. “That is the next tier, and that's where we are today.” 
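The return flow described above can be sketched as a sequence of checks with escalation at each step. The data stores here are stubbed dictionaries standing in for the CRM, order management and policy systems; all names and the 30-day window are illustrative assumptions.

```python
# Hedged sketch of the multi-step return workflow described above.
# CRM, ORDERS and the return window are stubs, not real integrations.
CRM = {"cust-1": {"orders": ["ord-9"]}}
ORDERS = {"ord-9": {"item": "chair", "days_since_purchase": 12}}
RETURN_WINDOW_DAYS = 30  # assumed policy

def process_return(customer_id: str, order_id: str) -> str:
    # 1. Verify the requester is the original purchaser (CRM lookup)
    if order_id not in CRM.get(customer_id, {}).get("orders", []):
        return "escalate: purchaser could not be verified"
    # 2. Cross-validate with the order management platform
    order = ORDERS.get(order_id)
    if order is None:
        return "escalate: order not found"
    # 3. Check the return against policy
    if order["days_since_purchase"] > RETURN_WINDOW_DAYS:
        return "denied: outside return window"
    # 4. Initiate the refund and arrange the item return
    return f"approved: refund issued for {order['item']}, pickup scheduled"

print(process_return("cust-1", "ord-9"))
```

Each step either advances the workflow or escalates, which is the "take action, but only when eligible" behavior Upadhyay describes as the next tier.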

Zendesk is a “big believer” in autonomous AI agents, he said. Still, the future of enterprise support will be a combination of agentic AI and “humans with superpowers” assisted by copilot AI agents. Human roles will evolve not just to solve problems, but to become “very good AI supervisors.”

“It’s a very big opportunity, because it is going to create an entirely new category of jobs — high-value roles that are deep in both the product and the problem-solving, but also really good at managing,” said Upadhyay. “That’s going to be a massive transformation of support.”