Artificial intelligence observability service Arize has a new product that helps companies identify when prompt data causes large language models (LLMs) to go wrong or hallucinate. Designed for AI engineers, the tool provides information necessary to debug complex systems, isolating the problem areas sometimes created by a few lines of code.
"We are all prompt engineers — we've done prompts ourselves — but a lot of the applications in these systems are templatizing prompts to make it this thing that you can then apply to your data. So there's this template that you apply over and over again to your data. And it helps answer your questions," Arize co-founder and CEO Jason Lopatecki said in an interview with VentureBeat. "And those templates take... prompt variables, the data you extract from your system. And there are so many ways that the data causes the LLM to go wrong, hallucinate, [or break]."

Arize dashboard showing prompt variable trace monitoring. Image credit: Arize
A service like this makes sense because you don't want your chatbot or app to spread misinformation or embarrass your brand. Monitoring is probably easier if you're using a single LLM, but it's not uncommon for companies to use multiple models from OpenAI, Google, Meta, Anthropic, Mistral and others.
With prompt variable monitoring, engineers can observe the variables inserted into templates that direct the LLMs to do certain things. Because these models often power customer service or support chatbots, ensuring that inappropriate information doesn't appear in outputted responses is critical. Lopatecki calls misinformation the number one reason for hallucinations. Pinpointing where the system needs fixing is important, whether that's in the inputted data, the chosen prompt template, or elsewhere.
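The template-plus-variables pattern Lopatecki describes can be sketched in plain Python. This is an illustrative example only, not Arize's API; the template text, variable names, and `build_prompt` helper are all hypothetical:

```python
# A reusable prompt template: the static instructions stay fixed,
# while per-request data is inserted as prompt variables.
TEMPLATE = (
    "You are a support assistant for {company}.\n"
    "Answer the customer's question using only the context below.\n"
    "Context: {retrieved_context}\n"
    "Question: {customer_question}"
)

def build_prompt(company: str, retrieved_context: str, customer_question: str) -> str:
    """Fill the template with prompt variables pulled from the system.

    Bad or malformed values here are exactly what prompt variable
    monitoring is meant to surface: the template never changes, but
    the data inserted into it can cause the LLM to go wrong.
    """
    return TEMPLATE.format(
        company=company,
        retrieved_context=retrieved_context,
        customer_question=customer_question,
    )

prompt = build_prompt(
    "Acme Corp",
    "Refunds are processed within 5 business days.",
    "How long do refunds take?",
)
print(prompt)
```

The same template is applied over and over to different data, so a single bad variable value, such as a retrieval step returning irrelevant context, can corrupt the output even though the prompt "code" itself never changed.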

Dashboard of Arize's AI prompt variable analysis. Image credit: Arize
What is variability?
Variability is the range of outputs an AI model can generate from small or large changes in its input, or even from mistaken data being fed in. And it's not just one input-output decision that must be observed. Lopatecki says there are more complex pipelines in which AI outputs feed other AI decisions. "It's not one response that you're doing inside the chat that we're experiencing on a day-to-day basis. It's AI summarizing things, AI is making a decision on the summary. There are all these places where this very variation can snowball into more problems."
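The snowball effect Lopatecki describes can be sketched as a minimal chained pipeline. Everything here is hypothetical: `call_llm` is a stand-in stub, not a real model API, and the ticket-routing scenario is invented for illustration:

```python
# Illustrative sketch of chained AI steps: one model's output becomes
# the next step's input, so a small variation in the first response
# can snowball into a wrong decision downstream.

def call_llm(prompt: str) -> str:
    """Stub standing in for a real LLM call."""
    if prompt.startswith("Summarize:"):
        return "Customer reports a duplicate billing charge."  # step 1: summary
    return "route_to_billing"                                  # step 2: decision

def handle_ticket(ticket_text: str) -> str:
    # Step 1: an AI summarizes the raw ticket.
    summary = call_llm(f"Summarize: {ticket_text}")
    # Step 2: another AI decision is made on that summary, not on the
    # original text. Any error in the summary propagates into the decision.
    decision = call_llm(f"Decide queue for: {summary}")
    return decision

print(handle_ticket("I was charged twice for my subscription this month."))
```

Because the second call only ever sees the first call's output, observability at each intermediate step, rather than just the final answer, is what lets an engineer trace where a bad result actually originated.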

Arize's context window analysis dashboard. Image credit: Arize
For this reason, Arize is building tools for the AI engineer: the people who know how to use the latest LLMs and are building these AI systems. "They need these tools to help them build the new level of intelligence that we want to put into these apps. AI engineer will be a term you see everywhere, and you already see it, but it will be everywhere in a year here," Lopatecki claims.
He boasts that Arize wants to be the "Datadog for AI" and even considers the cloud monitoring platform a competitor. In fact, Datadog has made its own AI moves in the past, including launching monitoring support for OpenAI models like GPT-4. However, Lopatecki asserts his company has the advantage: "They weren't born in the AI space. [It's] moving so fast. They don't have much of a product that's there yet." He also highlighted open-source startups as potential rivals.
"What we see happening right now is this imperative to deliver...You deliver it, and you test it in some small set of test cases. There's just so much variability in what these systems can do, the outputs that can happen, and how they can respond to different data that when you put it into the real world, you get an immense number of problems," Lopatecki explained. "I think people are realizing the breadth of things that can go wrong and off the rails. How to debug is hard. But they'll push to get stuff out. It's been so big that the pain point is coming to a culmination right now."
