TruEra launches free tool for testing LLM apps for hallucinations

TruEra, a vendor providing tools to test, debug and monitor machine learning (ML) models, today expanded its product portfolio with the launch of TruLens, open-source software dedicated to testing applications built on large language models (LLMs) like the GPT series.

Available starting today for free, TruLens provides enterprises with a quick and easy way to evaluate and iterate on their LLM applications and reduce the chances of hallucination and bias in the production stage.

Currently, only a limited number of vendors offer tools to tackle this aspect of LLM app development, even as enterprises across sectors continue to explore the potential of generative AI for different use cases.

Why TruLens for LLM applications?

LLMs are all the rage, but when it comes to building applications based on these models, companies have to go through a tiring experimentation process that involves human-driven response scoring. Essentially, once the first version of an app is developed, teams have to manually test and review its answers, adjust prompts, hyperparameters and models, and then re-test over and over until a satisfactory result is achieved.

This not only takes a lot of time but is difficult to scale up.

With TruLens, TruEra is addressing this gap by introducing a programmatic method of evaluation called “feedback functions.” As the company explains, a feedback function scores the output of an LLM application for quality and efficacy by analyzing both the text generated from the LLM and the response’s metadata.

“Think of it as a way to log and assess direct and indirect feedback about the performance and quality of your LLM app. This helps developers to create credible and powerful LLM apps faster. You can use it for a wide variety of LLM use cases, like chatbot question answering, information retrieval and so on,” Anupam Datta, cofounder, president and chief scientist at TruEra, told VentureBeat.
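To make the idea concrete, here is a minimal Python sketch of what a feedback function could look like. It does not use the TruLens API; the names `AppRecord`, `verbosity`, `keyword_relevance`, `latency_ok` and `evaluate`, along with the thresholds, are hypothetical and chosen only for illustration. Each function takes one logged interaction (generated text plus metadata) and returns a score between 0 and 1.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict


@dataclass
class AppRecord:
    """One logged interaction with an LLM app (hypothetical structure)."""
    prompt: str
    response: str
    metadata: Dict[str, float] = field(default_factory=dict)


# A feedback function maps a record to a score in [0, 1].
FeedbackFn = Callable[[AppRecord], float]


def _words(text: str) -> set:
    """Lowercased words with surrounding punctuation stripped."""
    cleaned = (w.lower().strip(".,?!") for w in text.split())
    return {w for w in cleaned if w}


def verbosity(record: AppRecord, max_words: int = 50) -> float:
    """1.0 when the response fits the word budget, lower as it grows past it."""
    n = len(record.response.split())
    return 1.0 if n <= max_words else max_words / n


def keyword_relevance(record: AppRecord) -> float:
    """Fraction of distinct prompt words that reappear in the response.

    A crude stand-in for a real relevance check, which would more likely
    call a model than count word overlap.
    """
    prompt_words = _words(record.prompt)
    if not prompt_words:
        return 0.0
    return len(prompt_words & _words(record.response)) / len(prompt_words)


def latency_ok(record: AppRecord, budget_s: float = 2.0) -> float:
    """Scores response metadata rather than text: 1.0 if within the budget."""
    return 1.0 if record.metadata.get("latency_s", 0.0) <= budget_s else 0.0


def evaluate(record: AppRecord, feedbacks: Dict[str, FeedbackFn]) -> Dict[str, float]:
    """Run every feedback function on a record and collect the scores."""
    return {name: fn(record) for name, fn in feedbacks.items()}


if __name__ == "__main__":
    record = AppRecord(
        prompt="What is TruLens?",
        response="TruLens is open-source software for testing LLM apps.",
        metadata={"latency_s": 0.8},
    )
    scores = evaluate(
        record,
        {
            "verbosity": verbosity,
            "keyword_relevance": keyword_relevance,
            "latency_ok": latency_ok,
        },
    )
    print(scores)
```

Because every check shares the same signature, a team could rerun the full set after each prompt or model change instead of re-reading answers by hand.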

TruLens for LLMs: How it works

TruLens can be added to the development process with a few lines of code. Once it's up and running, users can create their own feedback functions — customized to specific use cases — or use the out-of-the-box options.

Currently, the software provides feedback functions that test for truthfulness, question-answering relevance, harmful or toxic language, user sentiment, language mismatch, response verbosity, and fairness and bias. Moreover, it also logs how much an LLM is being pinged within the app, giving an easy way to track usage costs.

“This helps you to also determine how to build the best version of the app at the lowest ongoing cost. All of those pings add up,” Datta noted.
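The usage logging Datta describes can be pictured with a small Python sketch. This is not TruLens code: `UsageTracker`, its fields and the per-token price are assumptions made up for the example, and real pricing varies by model and provider.

```python
from dataclasses import dataclass


@dataclass
class UsageTracker:
    """Counts LLM calls and tokens to estimate spend (hypothetical helper)."""
    price_per_1k_tokens: float  # assumed flat rate, for illustration only
    calls: int = 0
    tokens: int = 0

    def log_call(self, prompt_tokens: int, completion_tokens: int) -> None:
        """Record one request to the model."""
        self.calls += 1
        self.tokens += prompt_tokens + completion_tokens

    @property
    def estimated_cost(self) -> float:
        """Total tokens so far, priced at the assumed flat rate."""
        return self.tokens / 1000 * self.price_per_1k_tokens


if __name__ == "__main__":
    tracker = UsageTracker(price_per_1k_tokens=0.002)
    tracker.log_call(prompt_tokens=120, completion_tokens=80)
    tracker.log_call(prompt_tokens=300, completion_tokens=500)
    print(tracker.calls, tracker.tokens, tracker.estimated_cost)
```

Comparing this kind of tally across app versions is one way to weigh answer quality against ongoing cost.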

Other offerings for LLM applications

While testing LLM-driven applications for performance and response accuracy is the need of the hour, only a handful of players have launched solutions to deal with it. These include Datadog’s OpenAI model monitoring integration, Arize’s Phoenix solution, and Israel-based Mona Labs' just-launched generative AI monitoring solution.

TruEra, for its part, claims that TruLens is best used in the development phase of building LLM apps.

“This is actually the phase that most companies are in today — they are experimenting with development and really have an acute need for tools to help them iterate faster and home in on application versions that are both effective at their tasks and risk-minimizing. You can, of course, use it on both development and production models,” Datta said.

According to an Accenture survey, 98% of global executives agree that AI foundation models will play an important role in their organizations' strategies in the next three to five years. This signals that tools like TruLens will soon see increased demand from enterprises.