Arize AI wants to improve enterprise LLMs with 'Prompt Playground,' new data analysis tools

We all know enterprises are racing at varying speeds to analyze and reap the benefits of generative AI — ideally in a smart, secure and cost-effective way. Survey after survey over the last year has shown this to be true.

But once an organization identifies a large language model (LLM), or several, that it wishes to use, the hard work is far from over. Deploying the LLM in a way that benefits the organization requires understanding which prompts employees or customers can use to generate helpful results (without good prompts, the model is pretty much worthless), as well as what data from the organization or user to include in those prompts.

"You can't just take a Twitter demo [of an LLM] and put it into the real world," Aparna Dhinakaran, cofounder and chief product officer of Arize AI, said in an exclusive video interview with VentureBeat. "It's actually going to fail. And so how do you know where it fails? And how do you know what to improve? That's what we focus on."

Introducing 'Prompt Playground'

Three-year-old business-to-business (B2B) machine learning (ML) software provider Arize AI would know, as it has been focused since day one on making AI more observable (less technical and more understandable) to organizations.

Today, the VB Transform award-winning company announced at the Google Cloud Next '23 conference industry-first capabilities for optimizing the performance of LLMs deployed by enterprises. These include a new "Prompt Playground" for selecting between and iterating on stored prompts designed for enterprises, and a new retrieval augmented generation (RAG) workflow to help organizations understand which of their data would be helpful to include in an LLM's responses.

Almost a year ago, Arize debuted its initial platform in the Google Cloud Marketplace. Now it is augmenting its presence there with these powerful new features for its enterprise customers.

Prompt Playground and new workflows

Arize’s new prompt engineering workflows, including Prompt Playground, enable teams to uncover poorly performing prompt templates, iterate on them in real time and verify improved LLM outputs before deployment.

Screenshot of Arize AI's Prompt Playground tool. Credit: Arize AI

Prompt analysis is an important but often overlooked part of troubleshooting an LLM's performance, which can often be boosted simply by testing different prompt templates or iterating on one for better responses.

With these new workflows, teams can easily:

Uncover responses with poor user feedback or evaluation scores
Identify the underlying prompt template associated with poor responses
Iterate on the existing prompt template to improve coverage of edge cases
Compare responses across prompt templates in the Prompt Playground prior to implementation
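The first two steps in that list can be illustrated with a minimal sketch. This is not Arize's API: the record layout, the template names, the 0-to-1 feedback scale and the 0.5 threshold are all assumptions made for illustration. It simply groups logged responses by prompt template and flags templates whose average feedback is low.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical log records: (prompt_template_id, user_feedback_score in [0, 1]).
RESPONSES = [
    ("support_v1", 0.9),
    ("support_v1", 0.8),
    ("support_v2", 0.3),
    ("support_v2", 0.4),
    ("billing_v1", 0.7),
]


def rank_templates(responses, threshold=0.5):
    """Return (template_id, mean_score) pairs below threshold, worst first."""
    scores = defaultdict(list)
    for template_id, score in responses:
        scores[template_id].append(score)
    # Average the feedback per template, then keep only the weak ones.
    averages = {t: mean(s) for t, s in scores.items()}
    flagged = [(t, avg) for t, avg in averages.items() if avg < threshold]
    return sorted(flagged, key=lambda pair: pair[1])


flagged_templates = rank_templates(RESPONSES)
print(flagged_templates)
```

In this toy data only the hypothetical "support_v2" template falls below the threshold, so it would be the candidate for iteration and side-by-side comparison.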

As Dhinakaran explained, prompt engineering is absolutely key to staying competitive with LLMs in the market today. The company's new prompt analysis and iteration workflows help teams ensure their prompts cover necessary use cases and potential edge scenarios that may come up with real users.

"You've got to make sure that the prompt you're putting into your model is pretty damn good to stay competitive," said Dhinakaran. "What we launched helps teams engineer better prompts for better performance. That's as simple as it is: We help you focus on making sure that that prompt is performant and covers all of these cases that you need it to handle."

For example, prompts for an education LLM chatbot need to ensure no inappropriate responses, while customer service prompts should cover potential edge cases and nuances around services offered or not offered.

Understanding private data

Arize is also providing the industry's first insights into the private or contextual data that influences LLM outputs, which Dhinakaran called the "secret sauce" companies provide. The company uniquely analyzes embeddings to evaluate the relevance of private data fused into prompts.

"What we rolled out is a way for AI teams to now monitor, look at their prompts, make it better and then specifically understand the private data that's now being put into those prompts, because the private data part makes sense," Dhinakaran said.

Dhinakaran told VentureBeat that enterprises can deploy Arize's solutions on premises for security reasons, and that the solutions are SOC 2 compliant.

The importance of private organizational data

These new capabilities enable examination of whether the right context is present in prompts to handle real user queries. Teams can identify areas where they may need to add more content around common questions lacking coverage in the current knowledge base.

"No one else out there is really focusing on troubleshooting this private data, which is really like the secret sauce that companies have to influence the prompt," Dhinakaran noted.

Arize also launched complementary workflows using search and retrieval to help teams troubleshoot issues stemming from the retrieval component of RAG models.

These workflows will empower teams to pinpoint where they may need to add additional context into their knowledge base, identify cases where retrieval failed to surface the most relevant information, and ultimately understand why their LLM may have hallucinated or generated suboptimal responses.
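The distinction between a gap in the knowledge base and a retrieval failure can be sketched with embedding similarity. The following is a rough illustration only, not Arize's implementation: the three-dimensional vectors, the document names, the `diagnose` helper and the 0.7 similarity threshold are all hypothetical.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def diagnose(query_vec, doc_vecs, retrieved_id, min_sim=0.7):
    """Classify one query as 'knowledge_gap', 'retrieval_miss' or 'ok'."""
    sims = {doc_id: cosine(query_vec, vec) for doc_id, vec in doc_vecs.items()}
    best_id = max(sims, key=sims.get)
    if sims[best_id] < min_sim:
        # No document is close to the query: the knowledge base lacks coverage.
        return "knowledge_gap"
    if retrieved_id != best_id:
        # A closer document existed but was not the one retrieved.
        return "retrieval_miss"
    return "ok"


# Toy knowledge base with made-up embeddings.
DOCS = {"refunds": [1.0, 0.0, 0.0], "shipping": [0.0, 1.0, 0.0]}

print(diagnose([0.9, 0.1, 0.0], DOCS, "refunds"))
print(diagnose([0.9, 0.1, 0.0], DOCS, "shipping"))
print(diagnose([0.0, 0.0, 1.0], DOCS, "refunds"))
```

A real system would use embeddings from a model and would typically retrieve several documents per query, but the same comparison applies: low similarity everywhere suggests missing content, while a high-similarity document that was not retrieved points to the retrieval step.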

Understanding context and relevance — and where they are lacking

Dhinakaran gave an example of how Arize looks at query and knowledge base embeddings to uncover irrelevant retrieved documents that may have led to a faulty response.

Screenshot of Arize AI's embeddings analysis tool. Credit: Arize AI

"You can click on, let's say, a user question in our product, and it'll show you all of the relevant documents that it could have pulled, and which one it did finally pull to actually use in the response," Dhinakaran explained. Then "you can see where the model may have hallucinated or provided suboptimal responses based on deficiencies in the knowledge base."

This end-to-end observability and troubleshooting of prompts, private data and retrieval is designed to help teams optimize LLMs responsibly after initial deployment, when models invariably struggle to handle real-world variability.

Dhinakaran summarized Arize's focus: “We’re not just a day one solution; we help you actually ongoing get it to work.”

The company aims to provide the monitoring and debugging capabilities organizations are missing, so they can continuously improve their LLMs post-deployment. This allows them to move past theoretical value to real-world impact across industries.