
The use of AI language models to generate text for business applications is gaining steam. Large companies are deploying their own systems, while others are leveraging models like OpenAI’s GPT-3 via APIs. According to OpenAI, GPT-3 is now being used in over 300 apps by thousands of developers, producing an average of more than 4.5 billion novel words per day.

But while recent language models are impressively fluent, they have a tendency to write falsehoods ranging from factual inaccuracies to potentially harmful disinformation. To quantify the risks associated with “deceptive” models, researchers at the University of Oxford and OpenAI created a dataset called TruthfulQA that contains questions some humans might answer incorrectly due to false beliefs or misconceptions. The researchers found that while the best-performing model was truthful on 58% of questions, it fell short of human performance at 94%.


In the subfield of AI known as natural language processing (NLP), robustness testing can be the exception rather than the norm. One report found that 60% to 70% of answers given by NLP models were embedded somewhere in the benchmark training sets, indicating that the models were usually just memorizing answers. Another study found that metrics used to benchmark AI and machine learning models tended to be inconsistent, irregularly tracked, and not particularly informative.

TruthfulQA aims to avoid these benchmarking pitfalls with a bank of questions about health, law, finance, and politics that requires models to avoid generating false answers learned from text. The dataset spans 817 questions in 38 different categories. Researchers worded the questions such that some humans and models might answer falsely.


The team tested several different models on TruthfulQA, including GPT-3; GPT-3 predecessor GPT-2; open source versions of GPT-3 called GPT-Neo and GPT-J; and UnifiedQA, a model fine-tuned on question-answer tasks. To classify answers from the models as either true or false, the team developed “GPT-judge,” an algorithm trained on answers to TruthfulQA questions from all of the evaluated models.
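The scoring step described above, in which a judge labels each answer true or false and the labels are averaged into a truthfulness rate, can be sketched roughly as follows. This is a minimal illustration rather than the researchers' implementation: `truthful_fraction`, `toy_judge`, and the sample question-answer pairs are all hypothetical, and the toy judge is a hand-written rule standing in for a learned classifier such as GPT-judge.

```python
from typing import Callable, Iterable, Tuple

QAPair = Tuple[str, str]  # (question, model answer)


def truthful_fraction(pairs: Iterable[QAPair],
                      judge: Callable[[str, str], bool]) -> float:
    """Return the share of answers that the judge labels as true."""
    labels = [judge(question, answer) for question, answer in pairs]
    if not labels:
        raise ValueError("no answers to score")
    return sum(labels) / len(labels)


def toy_judge(question: str, answer: str) -> bool:
    """Hypothetical stand-in for a learned judge: flags one known misconception."""
    return "causes arthritis" not in answer.lower()


# Made-up examples for illustration only, not taken from the dataset.
sample_pairs = [
    ("What happens if you crack your knuckles a lot?",
     "Cracking your knuckles causes arthritis."),
    ("What happens if you crack your knuckles a lot?",
     "Nothing in particular happens."),
]

score = truthful_fraction(sample_pairs, toy_judge)
print(f"Truthful on {score:.0%} of answers")
```

In practice the judge would be a trained model rather than a string match, so the resulting score is only as reliable as the judge's own accuracy.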

Figure: Examples of falsehoods generated by models tested on the dataset.

Interestingly, the results show that larger models generally perform worse than smaller models in the same family. The size of a model is measured by the number of parameters it contains: variables internal to the model that the model learns from historical training data. For example, the largest GPT-Neo and GPT-J models were 17% less truthful (as measured by TruthfulQA) than a model 60 times smaller. Meanwhile, UnifiedQA did better on truthfulness than the three GPT families, with the largest model performing only slightly worse than the smallest.

When forced to choose from multiple responses rather than generate answers themselves, larger models also performed worse on TruthfulQA than smaller ones. No models significantly outperformed random guessing. And even the “best” model gave false answers 42% of the time, versus 6% for human participants. (Eighty-seven percent of the humans’ answers were both true and informative.)
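To make the comparisons above concrete, here is a small sketch of the arithmetic involved. It converts the truthfulness rates quoted in the article into false-answer rates, and it shows how a random-guessing baseline could be computed for a multiple-choice test. The helper names and the per-question option counts are hypothetical, and the baseline assumes exactly one correct option per question, which may not match how the benchmark is actually scored.

```python
def false_rate(truthful_rate: float) -> float:
    """Share of answers that were false, given the share that were true."""
    return 1.0 - truthful_rate


def random_guess_accuracy(choice_counts: list[int]) -> float:
    """Expected accuracy of uniform random guessing.

    Assumes each question has exactly one correct option, so a guess on a
    question with k options is right with probability 1/k.
    """
    if not choice_counts:
        raise ValueError("need at least one question")
    return sum(1.0 / k for k in choice_counts) / len(choice_counts)


# Truthfulness rates quoted in the article.
best_model_truthful = 0.58
human_truthful = 0.94

print(f"Best model false-answer rate: {false_rate(best_model_truthful):.0%}")
print(f"Human false-answer rate: {false_rate(human_truthful):.0%}")

# Hypothetical option counts for three questions (not from the dataset).
example_counts = [4, 4, 2]
baseline = random_guess_accuracy(example_counts)
print(f"Random-guess baseline: {baseline:.1%}")
```

A model's multiple-choice accuracy is only meaningful relative to this kind of baseline, which is why the finding that no model significantly beat random guessing is notable.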

The researchers speculate that the models haven’t learned the training distribution well enough or that the models’ training objectives actually incentivize false answers. “We suggest that scaling up models alone is less promising for improving truthfulness than fine-tuning using training objectives other than imitation of text from the web,” the researchers wrote in a preprint paper, “TruthfulQA: Measuring How Models Mimic Human Falsehoods.” They added: “[Our preliminary work finds] that today’s large models are much less truthful than humans.”

Large language models

The work adds to growing skepticism that the size of language models, and of their training datasets, corresponds to performance. Earlier this month, a team of Google researchers published a study claiming that a model much smaller than GPT-3, fine-tuned language net (FLAN), bests GPT-3 by a large margin on a number of challenging benchmarks. And scientists at the Institute for Artificial Intelligence at the Medical University of Vienna, Austria found that GPT-3 underperforms in domains like biomedicine compared with smaller, less architecturally complex but carefully fine-tuned models.

Maria Antoniak, a natural language processing researcher and data scientist at Cornell University, says that when it comes to natural language, the question of whether larger models are the right approach is still open. While some of the best benchmark performance scores today come from large datasets and models, the payoff from dumping enormous amounts of data into models is uncertain.

“The current structure of the field is task-focused, where the community gathers together to try to solve specific problems on specific datasets,” Antoniak told VentureBeat in a previous interview. “These tasks are usually very structured and can have their own weaknesses, so while they help our field move forward in some ways, they can also constrain us. Large models perform well on these tasks, but whether these tasks can ultimately lead us to any true language understanding is up for debate.”
