Language models that can search the web hold promise -- but also raise concerns

Language models -- AI systems that can be prompted to write essays and emails, answer questions, and more -- remain flawed in many ways. Because they "learn" to write from examples on the web, including problematic social media posts, they're prone to generating misinformation, conspiracy theories, and racist, sexist, or otherwise toxic language.

Another major limitation of many of today's language models is that they're "stuck in time," in a sense. Because they're trained once on a large collection of text from the web, their knowledge of the world -- which they gain from that collection -- can quickly become outdated depending on when they were deployed. (In AI, "training" refers to teaching a model to properly interpret data and learn from it to perform a task, in this case generating text.) For example, You.com's writing assistance tool -- powered by OpenAI's GPT-3 language model, which was trained in summer 2020 -- responds to the question "Who's the president of the U.S.?" with "The current President of the United States is Donald Trump."

The solution, some researchers propose, is giving language models access to web search engines like Google, Bing, and DuckDuckGo. The idea is that these models could simply search for the latest information about a given topic (e.g., the war in Ukraine) instead of relying on old, factually wrong data to come up with their text.

In a paper published early this month, researchers at DeepMind, the AI lab backed by Google parent company Alphabet, describe a language model that answers questions by using Google Search to find a top list of relevant, recent webpages. After condensing down the first 20 webpages into six-sentence paragraphs, the model selects the 50 paragraphs most likely to contain high-quality information; generates four "candidate" answers for each of the 50 paragraphs (for a total of 200 answers); and determines the "best" answer using an algorithm.
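The steps described above can be sketched in a few lines of Python. This is an illustration only, not DeepMind's code: the function names, the word-overlap relevance score, and the `generate` and `score_answer` callables are hypothetical stand-ins for the search results, the language model, and the answer-selection algorithm, none of which the description above specifies in detail.

```python
import re
from typing import Callable, List, Optional, Tuple


def split_into_paragraphs(page_text: str, sentences_per_paragraph: int = 6) -> List[str]:
    """Chunk a page into paragraphs of at most six sentences (naive sentence split)."""
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", page_text) if s.strip()]
    return [
        " ".join(sentences[i:i + sentences_per_paragraph])
        for i in range(0, len(sentences), sentences_per_paragraph)
    ]


def overlap_score(question: str, paragraph: str) -> float:
    """Assumed relevance measure: the share of question words found in the paragraph."""
    q_words = set(re.findall(r"\w+", question.lower()))
    p_words = set(re.findall(r"\w+", paragraph.lower()))
    return len(q_words & p_words) / len(q_words) if q_words else 0.0


def answer_with_search(
    question: str,
    pages: List[str],  # page texts, assumed already fetched from a search engine
    generate: Callable[[str, str, int], str],  # hypothetical language-model call
    score_answer: Callable[[str, str, str], float],  # hypothetical answer scorer
    max_pages: int = 20,
    top_paragraphs: int = 50,
    candidates_per_paragraph: int = 4,
) -> Tuple[Optional[str], int]:
    """Return the best-scoring answer and the number of candidates considered."""
    # Step 1: condense the first 20 pages into six-sentence paragraphs.
    paragraphs: List[str] = []
    for page in pages[:max_pages]:
        paragraphs.extend(split_into_paragraphs(page))

    # Step 2: keep the 50 paragraphs that look most relevant to the question.
    ranked = sorted(paragraphs, key=lambda p: overlap_score(question, p), reverse=True)
    selected = ranked[:top_paragraphs]

    # Step 3: generate four candidate answers per paragraph (up to 200 in total).
    candidates = [
        (generate(question, paragraph, i), paragraph)
        for paragraph in selected
        for i in range(candidates_per_paragraph)
    ]
    if not candidates:
        return None, 0

    # Step 4: pick the candidate that the scoring function rates highest.
    best_answer, _ = max(candidates, key=lambda c: score_answer(question, c[0], c[1]))
    return best_answer, len(candidates)
```

With 50 selected paragraphs and four candidates each, the final step compares 200 answers, which matches the total cited by the researchers. In practice, `generate` and `score_answer` would both be backed by the language model itself.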

While the process might sound convoluted, the researchers claim that it vastly improves the factual accuracy of the model's answers -- by as much as 30% -- for questions that can be answered using information found in a single paragraph. The accuracy improvements were lower for multi-hop questions, which require models to gather information from different parts of a webpage. But the coauthors note that their method can be applied to virtually any AI language model without much modification.

OpenAI's WebGPT performs a web search for answers to questions and cites its sources.

"Using a commercial engine as our retrieval system allows us to have access to up-to-date information about the world. This is particularly beneficial when the world has evolved and our stale language models have now outdated knowledge ... Improvements were not just confined to the largest models; we saw increases in performance across the board of model sizes," the researchers wrote, referring to the parameters in the models that they tested. In the AI field, models with a high number of parameters -- the parts of the model learned from historical training data -- are considered "large," while "small" models have fewer parameters.

The mainstream view is that larger models perform better than smaller models -- a view that's been challenged by recent work from labs including DeepMind. Could it be that, instead, all language models need is access to a wider range of information?

There's some outside evidence to support this. For example, researchers at Meta (formerly Facebook) developed a chatbot, BlenderBot 2.0, that improved on its predecessor by querying the internet for up-to-date information about things like movies and TV shows. Meanwhile, Google’s LaMDA, which was designed to hold conversations with people, "fact-checks" itself by querying the web for sources. Even OpenAI has explored the idea of models that can search and navigate the web -- the lab's "WebGPT" system used Bing to find answers to questions.

New risks

But while web searching opens up a host of possibilities for AI language systems, it also poses new risks.

The "live" web is less curated than the static datasets historically used to train language models and, by implication, less filtered. Most labs developing language models take pains to identify potentially problematic content in the training data to minimize potential future issues. For example, in creating an open source text dataset containing hundreds of gigabytes of webpages, research group EleutherAI claims to have performed "extensive bias analysis" and made "tough editorial decisions" to exclude data they felt were "unacceptably negatively biased" toward certain groups or views.

The live web can be filtered to a degree, of course. And as the DeepMind researchers note, search engines like Google and Bing use their own "safety" mechanisms to reduce the chances that unreliable content rises to the top of results. But these results can be gamed -- and aren't necessarily representative of the totality of the web. As a recent piece in The New Yorker notes, Google's algorithm prioritizes websites that use modern web technologies like encryption, mobile support, and schema markup. Many websites with otherwise quality content get lost in the shuffle as a result.

This gives search engines a lot of power over the data that might inform web-connected language models' answers. Google has been found to prioritize its own services in Search by, for example, answering a travel query with data from Google Places instead of a richer, more social source like TripAdvisor. At the same time, the algorithmic approach to search opens the door to bad actors. In 2020, Pinterest leveraged a quirk of Google's image search algorithm to surface more of its content in Google Image searches, according to The New Yorker.

Labs could instead have their language models use off-the-beaten-path search engines like Marginalia, which crawls the internet for less-frequented, usually text-based websites. But that wouldn't solve another big problem with web-connected language models: Depending on how the model is trained, it might be incentivized to cherry-pick data from sources that it expects users will find convincing -- even if those sources aren't objectively the strongest.

The OpenAI researchers ran into this while evaluating WebGPT, which they said led the model to sometimes quote from "highly unreliable" sources. WebGPT, they found, incorporated biases from the model on which its architecture was based (GPT-3), and this influenced the way in which it chose to search for -- and synthesize -- information on the web.

"Search and synthesis both depend on the ability to include and exclude material depending on some measure of its value, and by incorporating GPT-3’s biases when making these decisions, WebGPT can be expected to perpetuate them further," the OpenAI researchers wrote in a study. "[WebGPT's] answers also appear more authoritative, partly because of the use of citations. In combination with the well-documented problem of 'automation bias,' this could lead to overreliance on WebGPT’s answers."

Facebook's BlenderBot 2.0 searching the web for answers.

Automation bias, for context, is the propensity for people to trust data from automated decision-making systems. Too much transparency about a machine learning model and people become overwhelmed. Too little, and people make incorrect assumptions about the model -- giving them a false sense of confidence.

Solutions to the limitations of language models that search the web remain largely unexplored. But as the desire for more capable, more knowledgeable AI systems grows, the problems will become more urgent.
