ChatGPT can turn toxic just by changing its assigned persona, researchers say

ChatGPT can be inadvertently or maliciously set to turn toxic just by changing its assigned persona in the model's system settings, according to new research from the Allen Institute for AI.

The study — which the researchers say is the first large-scale toxicity analysis of ChatGPT — found that the large language model (LLM) carries inherent toxicity that is heightened up to six times when assigned a diverse range of personas (such as historical figures, profession, etc). Nearly 100 personas from diverse backgrounds were examined across over half a million ChatGPT output generations — including journalists, politicians, sportspersons and businesspersons, as well as different races, genders and sexual orientations.

Assigning personas can change ChatGPT output

These system settings to assign personas can significantly change ChatGPT output. "The responses can in fact be wildly different, all the way from the writing style to the content itself," Tanmay Rajpurohit, one of the study authors, told VentureBeat in an interview. And the settings can be accessed by anyone building on ChatGPT using OpenAI's API, so the impact of this toxicity could be widespread. For example, chatbots and plugins built on ChatGPT from companies such as Snap, Instacart and Shopify could exhibit toxicity.

The research is also significant because while many have assumed ChatGPT's bias is in the training data, the researchers show that the model can develop an "opinion" about the personas themselves, while different topics also elicit different levels of toxicity.

And they emphasized that assigning personas in the system settings is often a key part of building a chatbot. "The ability to assign [a] persona is very, very essential," said Rajpurohit, because the chatbot creator is often trying to appeal to a target audience of users who will be using it and expecting useful behavior and capabilities from the model.

There are other benign or positive reasons to use the system settings parameters, such as to constrain the behavior of a model — to tell the model not to use explicit content, for example, or to ensure it doesn't say anything politically opinionated.

System settings also makes LLM models vulnerable

But that same property that makes the generative AI work well as a dialogue agent also makes the models vulnerable. If it is used by a malicious actor, the study shows that "things can get really bad, really fast" in terms of toxic output, said Ameet Deshpande, one of the other study authors. "A malicious user can modify the system parameter to completely change ChatGPT to a system which can produce harmful outputs consistently." In addition, he said, even an unsuspecting person modifying a system parameter might modify it to something that changes ChatGPT's behavior and make it biased and potentially harmful.

The study found that toxicity in ChatGPT output varies considerably depending on the assigned persona. It seems that ChatGPT's own understanding about individual personas from its training data strongly influences how toxic the persona-assigned behavior is — which the researchers say could be an artifact of the underlying data and training procedure. For example, the study found that journalists are twice as toxic as businesspersons.

"One of the points we're trying to drive home is that because ChatGPT is is a very powerful language model, it can actually simulate behaviors of different personas," said Ashwin Kalyan, one of the other study authors. "So it's not just a bias of the whole model, it's way deeper than that, it's a bias of how the model interprets different personas and different entities as well. So it's a deeper issue than we've seen before."

And while the research only studied ChatGPT (not GPT-4), the analysis methodology can be applied to any large language model. "It wouldn't be really surprising if other models have similar biases," said Kalyan.

Assigning personas can change ChatGPT output

System settings also makes LLM models vulnerable

More