Stanford study challenges assumptions about language models: Larger context doesn’t mean better understanding

A study released this month by researchers from Stanford University, UC Berkeley and Samaya AI has found that large language models (LLMs) often fail to access and use relevant information given to them in longer context windows.

In language models, a context window refers to the length of text a model can process and respond to in a given instance. It can be thought of as a working memory for a particular text analysis or chatbot conversation.

The study caught widespread attention last week after its release because many developers and other users experimenting with LLMs had assumed that the trend toward larger context windows would continue to improve LLM performance and their usefulness across various applications.

If an LLM could take an entire document or article as input for its context window, the conventional thinking went, the LLM could provide perfect comprehension of the full scope of that document when asked questions about it.

Assumptions around context window flawed

LLM companies like Anthropic have fueled excitement around the idea of longer context windows, where users can provide ever more input to be analyzed or summarized. Anthropic just released a new model called Claude 2, which provides a huge 100k token context window, and said it can enable new use cases such as summarizing long conversations or drafting memos and op-eds.

But the study shows that some assumptions around the context window are flawed when it comes to the LLM’s ability to search and analyze it accurately.

The study found that LLM performance is often highest “when relevant information occurs at the beginning or end of the input context, and significantly degrades when models must access relevant information in the middle of long contexts. Furthermore, performance substantially decreases as the input context grows longer, even for explicitly long-context models.”
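To make the position effect concrete, here is a minimal Python sketch of how a developer might build test contexts that place one answer-bearing passage at the start, middle or end of a list of distractor passages. It is an illustration only, not the researchers' code; the function names `place_relevant` and `probe_contexts` are hypothetical, and sending each context to a model and scoring the answer is left out.

```python
def place_relevant(relevant, distractors, position):
    """Return a new passage list with the relevant passage inserted at `position`."""
    docs = list(distractors)  # copy so the caller's list is not modified
    docs.insert(position, relevant)
    return docs


def probe_contexts(relevant, distractors):
    """Build three orderings of the same passages: relevant passage first, in the middle, last."""
    n = len(distractors)
    return {
        "start": place_relevant(relevant, distractors, 0),
        "middle": place_relevant(relevant, distractors, n // 2),
        "end": place_relevant(relevant, distractors, n),
    }


# Hypothetical example data: one relevant passage "R" among four distractors.
contexts = probe_contexts("R", ["d1", "d2", "d3", "d4"])
for name, docs in contexts.items():
    print(name, docs)
```

Asking a model the same question against each ordering, and comparing accuracy across the three, is one simple way to check whether a given application is affected by where the relevant text sits.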

Last week, industry insiders like Bob Wiederhold, COO of vector database company Pinecone, cited the study as evidence that stuffing entire documents into a context window for doing things like search and analysis won’t be the panacea many had hoped for.

Semantic search preferable to document stuffing

Vector databases like Pinecone help developers increase LLM memory by searching for relevant information to pull into the context window. Wiederhold pointed to the study as evidence that vector databases will remain viable for the foreseeable future, since the study suggests semantic search provided by vector databases is better than document stuffing.
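The difference between the two approaches can be sketched in a few lines of Python. This is a toy illustration under loud assumptions: the `embed` function below is a bag-of-words stand-in for a real embedding model, the in-memory list stands in for a vector database, and none of the names correspond to Pinecone's or any other product's API. The point is only that the prompt receives the few best-matching passages rather than every passage.

```python
import math
from collections import Counter


def embed(text):
    """Toy stand-in for an embedding model: a bag-of-words count vector."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def top_k(query, passages, k=2):
    """Return the k passages most similar to the query, best match first."""
    q = embed(query)
    ranked = sorted(passages, key=lambda p: cosine(q, embed(p)), reverse=True)
    return ranked[:k]


def build_prompt(query, passages, k=2):
    """Put only the retrieved passages into the context, instead of all of them."""
    context = "\n".join(top_k(query, passages, k))
    return f"Context:\n{context}\n\nQuestion: {query}"


# Hypothetical example passages.
passages = [
    "the cat sat on the mat",
    "stock prices rose sharply today",
    "the dog chased the cat",
]
print(build_prompt("cat mat", passages, k=1))
```

In a real system the value of `k` becomes a tuning decision, since pulling more passages into the context window makes the input longer.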

Stanford University’s Nelson Liu, study lead author, agreed that if you try to inject an entire PDF into a language model context window and then ask questions about the document, a vector database search will generally be more efficient to use.

“If you’re searching over large amounts of documents, you want to be using something that’s built for search, at least for now,” said Liu.

Liu cautioned, however, that the study isn’t necessarily claiming that sticking entire documents into a context window won’t work. Results will depend specifically on the sort of content contained in the documents the LLMs are analyzing. Language models are bad at differentiating between many things that are closely related or which seem relevant, Liu explained. But they are good at finding the one thing that is clearly relevant when most other things are not relevant.

“So I think it’s a bit more nuanced than ‘You should always use a vector database, or you should never use a vector database’,” he said.

Language models' best use case: Generating content

Liu said his study assumed that most commercial applications are operating in a setting where they use some sort of vector database to help return multiple possible results into a context window. The study found that having more results in the context window didn’t always improve performance.

As a specialist in language processing, Liu said he was surprised that people were thinking of using a context window to search for content, or to aggregate or synthesize it, although he said he could understand why people would want to. He said people should continue to think of language models as best used to generate content, and search engines as best used to search content.

“The hope that you can just throw everything into a language model and just sort of pray it works, I don’t think we’re there yet,” he said. “But maybe we’ll be there in a few years or even a few months. It’s not super clear to me how fast this space will move, but I think right now, language models aren’t going to replace vector databases and search engines.”
