Arize launches Phoenix, an open-source library to monitor LLM hallucinations

Arize AI, a California-headquartered company providing machine learning (ML) observability capabilities, today announced Phoenix, an open-source library to monitor large language models (LLMs) for hallucinations.

The software solution comes as the industry re-tools around LLMs and data scientists apply large foundational models to new use cases, including those involving medical and legal data — where even the slightest level of hallucination or bias can create a major problem in the real world.

It is designed to be a standalone offering delivering ML observability in a data science notebook environment where data scientists build models, the company said.

How exactly does Phoenix help with LLMs?

Large language models like OpenAI’s GPT-4 and Google’s Bard are all the rage today, with data scientists and ML engineers racing to build applications on top of them. These could be anything from virtual lawyer products providing legal advice to healthcare chatbots designed to summarize doctor-patient meetings or provide information about existing insurance coverage.

Now, while these applications can be very effective, the models running them remain susceptible to hallucination — in other words, producing false or misleading results. Phoenix, announced today at Arize AI’s Observe 2023 summit, targets this exact problem by visualizing complex LLM decision-making and flagging when and where models fail, go wrong, give poor responses or incorrectly generalize.

“Phoenix runs locally, in an environment that interfaces with Notebook cells on the Notebook server. Its library uses embeddings (vectors representing meaning and context of data points that the model processes and generates) and clustering of those embeddings as a method for data visualization and debugging,” Jason Lopatecki, CEO and co-founder of Arize, tells VentureBeat.

In the real world, this means a user just has to upload the chatbot conversation — complete with prompts and responses — and start the software. The library will automatically use the foundational embeddings (mapping out how they connect, how they are related and how they progress as sentences are generated) and LLM-assisted evaluation to generate scores for responses and visualize them to show where the bot gave a good response and where it failed.
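To make the embed, cluster and score idea above concrete, here is a minimal sketch in plain Python. It is not Phoenix's API or Arize's implementation. The function names (`kmeans`, `worst_cluster`), the two-dimensional toy embeddings and the per-response scores are all hypothetical, chosen only to show how clustering embeddings can surface a group of responses with poor evaluation scores.

```python
import math

def kmeans(points, k, iters=10):
    """Tiny k-means with deterministic farthest-point initialization (illustrative only)."""
    centroids = [list(points[0])]
    while len(centroids) < k:
        # Pick the point farthest from the centroids chosen so far.
        farthest = max(points, key=lambda p: min(math.dist(p, c) for c in centroids))
        centroids.append(list(farthest))
    labels = [0] * len(points)
    for _ in range(iters):
        # Assign each point to its nearest centroid.
        labels = [min(range(k), key=lambda c: math.dist(p, centroids[c])) for p in points]
        # Move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels

def worst_cluster(embeddings, scores, k=2):
    """Return (indices, mean score) for the cluster with the lowest average score."""
    labels = kmeans(embeddings, k)
    means = {}
    for c in set(labels):
        cluster_scores = [s for s, label in zip(scores, labels) if label == c]
        means[c] = sum(cluster_scores) / len(cluster_scores)
    worst = min(means, key=means.get)
    return [i for i, label in enumerate(labels) if label == worst], means[worst]

# Hypothetical data: one embedding and one evaluation score per chatbot response.
embeddings = [[0.0, 0.1], [0.1, 0.0], [0.05, 0.05], [5.0, 5.1], [5.1, 5.0], [4.9, 5.0]]
scores = [0.9, 0.8, 0.95, 0.2, 0.1, 0.3]

indices, mean_score = worst_cluster(embeddings, scores, k=2)
print(indices, round(mean_score, 2))
```

In a real workflow the embeddings would come from the model and the scores from an evaluation step; the returned indices identify the responses a user might pull out for troubleshooting.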

[Image: Phoenix visualizing LLM responses]

As the visualization is produced, the user can investigate, grab groups of responses representing a problem (like questions from Spanish-speaking end users where the LLM responded incorrectly) and troubleshoot for fine-tuning the model and improving its outcomes.

“Once in a notebook environment, the downloaded data can power observability workflows that are highly interactive. Phoenix can be used to find clusters of data problems and export those clusters back to the observability platform for use in monitoring and active learning workflows,” Lopatecki added.

It can also help surface issues like data drift for generative AI, LLMs, computer vision and tabular models, the company noted.
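The article does not say how Phoenix measures drift. As one simple illustration, drift between two sets of embeddings can be approximated by the distance between their centroids. The sketch below assumes that approach; `centroid`, `embedding_drift` and the sample vectors are hypothetical names and data, not part of any Arize library.

```python
import math

def centroid(vectors):
    """Component-wise mean of a list of equal-length vectors."""
    return [sum(col) / len(vectors) for col in zip(*vectors)]

def embedding_drift(baseline, current):
    """Euclidean distance between the centroids of two embedding sets."""
    return math.dist(centroid(baseline), centroid(current))

# Hypothetical embeddings: a reference set and a later production set.
baseline = [[0.0, 0.0], [2.0, 0.0]]
current = [[3.0, 4.0], [5.0, 4.0]]

print(embedding_drift(baseline, current))
```

A larger distance suggests that recent inputs occupy a different region of embedding space than the reference data, which is a cue to inspect them more closely.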

Rapidly evolving space

While Arize AI claims that Phoenix, which is available starting today, is the first software library designed to help with LLM evaluation and risk management, enterprises should keep in mind that this is a rapidly evolving space with new players cropping up almost every day.

“The current generation of AI models is a black box to almost everyone. Almost no one understands how they do what they do. Phoenix is the first step to building software that helps map out the internals of how these models think and what decisions they are making, designed for the users of LLMs,” the CEO said.

He added that over 100 users and researchers at different companies and organizations advised on the development of Phoenix, with initial feedback being quite positive.

Christopher Brown, CEO and co-founder of Decision Patterns and a former UC Berkeley lecturer, called the solution a “much-appreciated” advancement in model observability and production. He said the integration of observability utilities directly into the development process not only saves time but also encourages model development and production teams to actively think about model use and ongoing improvements before releasing to production.
