Vector embeddings are the backbone of modern enterprise AI, powering everything from retrieval-augmented generation (RAG) to semantic search. But a new study from Google DeepMind reveals a fundamental mathematical limitation to this technology that could cause sophisticated AI systems to fail in unexpected ways.
This isn't a problem that can be solved with bigger models or more training data. The research suggests that as search and retrieval tasks become more complex, the standard single-vector embedding approach will hit a hard ceiling, unable to represent all the possible ways documents can be relevant to a query.
For AI product leaders, the key takeaway is that "embedding-based retrieval is not a panacea," as the authors told VentureBeat. Understanding this limitation is crucial for building the next generation of robust and reliable AI systems.
The growing strain on vector embeddings
At their core, vector embeddings are a way of turning unstructured data like text, images, or audio into numerical representations, or vectors, in a high-dimensional space. This allows AI models to understand the semantic relationships between different pieces of data. In "dense retrieval," used in most RAG systems, a query is also turned into a vector, and the system finds the most relevant documents by identifying the document vectors closest to the query vector in that space.
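The dense-retrieval step described above reduces to a nearest-neighbor search in vector space. A minimal sketch, using made-up 4-dimensional vectors in place of a real embedding model's output:

```python
import numpy as np

# Toy stand-ins for embedding-model outputs: each row is a document vector.
doc_vecs = np.array([
    [0.9, 0.1, 0.0, 0.1],   # "Alice likes apples"
    [0.1, 0.8, 0.2, 0.0],   # "Bob likes soccer"
    [0.8, 0.2, 0.1, 0.0],   # "Apples are a fruit"
], dtype=float)
query_vec = np.array([1.0, 0.0, 0.0, 0.0])  # "Who likes apples?"

# Cosine similarity: closeness in the embedding space stands in for relevance.
def cosine(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

scores = [cosine(query_vec, d) for d in doc_vecs]
ranked = np.argsort(scores)[::-1]  # most similar documents first
print(ranked[0])  # index of the closest document
```

Real systems replace the toy vectors with model outputs of hundreds or thousands of dimensions and use approximate nearest-neighbor indexes rather than a brute-force scan, but the scoring logic is the same.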
Recently, the rise of "instruction-following" retrieval has pushed this technology beyond simple keyword matching. The key measure of difficulty, the authors said, is "the extent to which queries may need to retrieve arbitrary subsets of documents." Failures can occur even with incredibly simple documents and queries (e.g., documents such as "Alice likes apples, bananas, ..." and "Bob likes soccer, grapes, ...," paired with queries such as "Who likes apples?"). These tasks force the model to connect documents based on abstract relationships, dramatically increasing the number of possible "relevant" document sets that a model must be able to represent.
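The paper's toy example can be made concrete in a few lines. The facts below are invented in the spirit of that example; the point is that each attribute induces a query whose answer set is an arbitrary subset of the corpus:

```python
# Hypothetical toy corpus modeled on the paper's "Alice likes apples" example.
docs = {
    "alice": {"apples", "bananas"},
    "bob": {"soccer", "grapes"},
    "carol": {"apples", "soccer"},
}

# Each attribute X induces the query "Who likes X?", whose relevant set is
# exactly the documents mentioning X.
def relevant(attr):
    return frozenset(name for name, attrs in docs.items() if attr in attrs)

all_attrs = set().union(*docs.values())
subsets = {relevant(a) for a in all_attrs}
print(sorted(len(s) for s in subsets))  # distinct answer-set sizes

# In the worst case, any of the 2**n subsets of n documents could be an
# answer set that the embedding geometry must be able to carve out.
n = len(docs)
print(2 ** n)
```

With only three documents there are already eight candidate answer sets; the count doubles with every document added, which is the combinatorial pressure the paper formalizes.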
Previous studies have pointed out limitations in vector embeddings, but the common assumption was that unrealistic or poorly formed queries caused these issues. This new paper argues that the limitation is intrinsic to the architecture itself. The researchers write, “Rather than proposing empirical benchmarks to gauge what embedding models can achieve, we seek to understand at a more fundamental level what the limitations are.”
Finding the breaking point in a perfect world
To prove this theoretical limit, the researchers designed an ideal experiment that represents the absolute best-case performance any embedding model could achieve. They bypassed language models entirely and directly optimized the numerical vectors to solve a retrieval task. This setup, which they call "free embedding optimization," removes any constraints imposed by natural language and isolates the geometric capacity of the vector space.
Even in this perfect scenario, they found a clear breaking point. For any given embedding dimension (the number of values in each vector), there is a "critical point" where the number of documents in the system becomes too large to represent all possible combinations of relevant results. The dimensionality of the embedding is simply too small to encode the complexity.
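The free-embedding setup can be sketched numerically. This is a minimal numpy illustration of the idea, not the paper's actual code: raw query and document vectors are fit by gradient descent to reproduce an arbitrary binary relevance matrix, with no language model in the loop, so any remaining failure is purely geometric.

```python
import numpy as np

rng = np.random.default_rng(0)
n_queries, n_docs, dim = 8, 6, 2   # a deliberately small embedding dimension
R = rng.integers(0, 2, size=(n_queries, n_docs)).astype(float)  # target relevance

# "Free embedding optimization": no encoder, just raw vectors trained so their
# dot products reproduce R.
Q = rng.normal(size=(n_queries, dim))
D = rng.normal(size=(n_docs, dim))

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

lr = 0.5
for _ in range(2000):
    P = sigmoid(Q @ D.T)        # predicted relevance probabilities
    G = (P - R) / R.size        # binary cross-entropy gradient w.r.t. logits
    Q -= lr * (G @ D)
    D -= lr * (G.T @ Q)

# Fraction of (query, document) relevance judgments the geometry can encode.
accuracy = float(((Q @ D.T > 0) == (R > 0.5)).mean())
print(round(accuracy, 2))
```

As the number of distinct relevance patterns grows relative to the dimension, no placement of the vectors can satisfy every judgment; that mismatch is the "critical point" the paper identifies.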
The paper extrapolates this finding to real-world scales, concluding that "for web-scale search, even the largest embedding dimensions with ideal test-set optimization are not enough to model all combinations." This means that even under perfect conditions, the underlying math of the single-vector approach has a built-in ceiling.
A simple test breaks today's top models

Current industry benchmarks often don't expose this weakness. The paper notes that a dataset like QUEST, with 325,000 documents, has more than 7.1e+91 (71 followed by 90 zeroes) possible combinations of 20 relevant documents. Yet, its 3,000 queries test only an infinitesimally small fraction of this space.
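The QUEST arithmetic is easy to verify with the standard binomial coefficient:

```python
import math

# Ways to choose 20 relevant documents from a 325,000-document corpus,
# matching the figure quoted from the paper.
combos = math.comb(325_000, 20)
print(f"{combos:.1e}")            # ~7.1e+91 possible relevant sets
print(f"{3_000 / combos:.1e}")    # fraction a 3,000-query benchmark can probe
```

A benchmark with a few thousand queries therefore exercises a vanishingly small corner of the space the model would need to represent.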
To demonstrate the problem in a real-world setting, the researchers created a new dataset called LIMIT. The task is deceptively simple. Queries are straightforward ("Who likes Apples?"), and documents contain clear answers. However, the dataset is specifically constructed to stress-test a model's ability to handle a large number of overlapping relevance combinations, where one document can be relevant to many different queries.
The results were stark. State-of-the-art embedding models from Google, Snowflake, and others severely struggle on the LIMIT dataset, with some achieving less than 20% recall (the proportion of correct documents the model finds) on the full task. In a surprising twist, BM25, a decades-old lexical search algorithm, performs exceptionally well on the same task.
Crucially, when the researchers fine-tuned a model on a training version of LIMIT, its performance barely improved. This indicates the problem is not a "domain shift" (where a model fails because it hasn't seen similar data before) but a core architectural limitation. The models fundamentally lack the capacity to solve the task.

What this means for enterprise AI
The core takeaway is that as AI applications require more complex reasoning, relying solely on single-vector embeddings may lead to a performance plateau. The models may simply be unable to retrieve the correct set of documents, no matter how well-trained they are. Here’s how leaders and developers can respond.
1. Spot the warning signs in your own applications
According to the researchers, the problem becomes apparent when the definition of “relevance” becomes combinatorial. "An early warning sign in your own applications is when queries that logically require multiple documents to be fully answered (e.g., using terms like 'compare,' 'and,' or 'both') start to fail," they explained. "If your system consistently retrieves only one of the relevant documents instead of the required set, that's a clear indication that you are hitting the geometric limitations we've identified." You can also infer these limits from your data itself. For example, if your "documents" are pull requests and your "queries" are diffs, it's very likely that for any two pull requests, there's a diff for which they are both relevant, a scenario where single-vector embeddings will struggle.
2. Build a more resilient system: The hybrid approach
For developers in the trenches, the most practical guidance is to build hybrid search architectures. The authors recommend combining the strengths of different methods:
Use dense embeddings for their powerful semantic understanding and their ability to find documents that are conceptually related, even without keyword overlap.
Rely on sparse methods like BM25 for their precision and combinatorial robustness, ensuring that all specified constraints in a query are met.
"Combining them provides a much more resilient and reliable system," the authors said in their comments, while noting that for simpler applications where there is typically a single best document, a pure embedding model may be perfectly adequate.
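One common way to realize this hybrid recommendation is to run both retrievers and merge their ranked lists. The sketch below, with invented documents and stand-in dense vectors, pairs a minimal BM25 scorer with reciprocal rank fusion (RRF), a standard list-merging technique; it is one possible design, not the authors' prescribed implementation:

```python
import math
import numpy as np

docs = ["alice likes apples and bananas",
        "bob likes soccer and grapes",
        "carol likes apples and soccer"]
tokenized = [d.split() for d in docs]

# Sparse side: a minimal BM25 scorer (k1 and b at their common defaults).
def bm25_scores(query_terms, corpus, k1=1.5, b=0.75):
    N = len(corpus)
    avgdl = sum(len(d) for d in corpus) / N
    scores = []
    for doc in corpus:
        s = 0.0
        for term in query_terms:
            df = sum(term in d for d in corpus)   # document frequency
            if df == 0:
                continue
            idf = math.log((N - df + 0.5) / (df + 0.5) + 1)
            tf = doc.count(term)
            s += idf * tf * (k1 + 1) / (tf + k1 * (1 - b + b * len(doc) / avgdl))
        scores.append(s)
    return scores

# Dense side: made-up 2-d vectors standing in for a real embedding model.
doc_vecs = np.array([[0.9, 0.1], [0.1, 0.9], [0.6, 0.6]])
query_vec = np.array([1.0, 0.2])
dense = (doc_vecs @ query_vec) / (
    np.linalg.norm(doc_vecs, axis=1) * np.linalg.norm(query_vec))

sparse = bm25_scores(["apples"], tokenized)

# Fusion: reciprocal rank fusion rewards documents ranked highly by either side.
def rrf(*rankings, k=60):
    fused = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking):
            fused[doc_id] = fused.get(doc_id, 0.0) + 1.0 / (k + rank + 1)
    return sorted(fused, key=fused.get, reverse=True)

sparse_rank = sorted(range(len(docs)), key=lambda i: -sparse[i])
dense_rank = sorted(range(len(docs)), key=lambda i: -dense[i])
hybrid = rrf(sparse_rank, dense_rank)
print(hybrid)  # document indices ordered by the fused score
```

RRF needs no score normalization across the two systems, which is why it is a popular first choice for hybrid setups; weighted score sums are an alternative when the scores are calibrated.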
3. Rethink your evaluation strategy
A compelling point from the research is that academic leaderboards can be misleading, as they test only a tiny fraction of possible queries. "Relying solely on leaderboard performance can lead enterprises to adopt tools that look great on benchmarks but are brittle in practice," the authors warned.
Their advice is to move beyond standard benchmarks and create internal evaluations that mirror the combinatorial nature of real-world queries. Instead of just testing for single-document relevance, they suggest teams "proactively design test cases that require retrieving specific sets of documents." For example, a team could synthetically generate queries that require retrieving specific pairs or triplets of documents from their corpus, providing a much more accurate picture of a model's true capabilities.
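A combinatorial internal eval of the kind the authors describe can be prototyped in a few lines. The facts and document ids below are made up; the key ingredients are synthetic queries that each require a specific pair of documents, and a set-level recall metric:

```python
from itertools import combinations

# Hypothetical corpus: document id -> the fact it contains.
doc_facts = {0: "apples", 1: "soccer", 2: "grapes", 3: "bananas"}

# One synthetic query per pair of facts; answering it requires both documents.
eval_set = [
    (f"Who likes {a}, and who likes {b}?", {i, j})
    for (i, a), (j, b) in combinations(doc_facts.items(), 2)
]

# Set-level recall: credit only for how much of the required set comes back.
def set_recall(retrieved, relevant):
    return len(set(retrieved) & relevant) / len(relevant)

# A retriever that always returns just one of the two required documents
# caps out at 0.5 recall on every pair query: the failure mode to watch for.
scores = [set_recall([min(rel)], rel) for _, rel in eval_set]
print(len(eval_set), sum(scores) / len(scores))
```

Extending the generator to triplets, or sampling pairs from a production corpus, gives an evaluation that probes exactly the combinatorial behavior leaderboard benchmarks miss.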
The future of retrieval
The paper suggests enterprises should explore more expressive architectures such as cross-encoders (models that jointly process query and documents), multi-vector models (models that learn multiple different embeddings to capture more nuance), and revisiting sparse models such as BM25.
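The multi-vector idea can be illustrated with ColBERT-style "late interaction" scoring: each query token and document token keeps its own vector, and the score sums, for every query vector, its best match (MaxSim) among the document's vectors. The vectors below are invented for illustration:

```python
import numpy as np

def maxsim_score(query_vecs, doc_vecs):
    # Normalize rows so dot products are cosine similarities.
    q = query_vecs / np.linalg.norm(query_vecs, axis=1, keepdims=True)
    d = doc_vecs / np.linalg.norm(doc_vecs, axis=1, keepdims=True)
    sim = q @ d.T                    # (n_query_tokens, n_doc_tokens)
    return sim.max(axis=1).sum()     # best document match per query token

query = np.array([[1.0, 0.0], [0.0, 1.0]])   # two query-token vectors
doc_a = np.array([[0.9, 0.1], [0.1, 0.9]])   # covers both query aspects
doc_b = np.array([[0.9, 0.1], [0.8, 0.2]])   # covers only one aspect
print(maxsim_score(query, doc_a) > maxsim_score(query, doc_b))
```

Because each query aspect is matched independently, a multi-vector model can reward documents that jointly satisfy several constraints, which is precisely where a single pooled vector runs out of geometric room.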
While these techniques can help solve the challenges presented in the LIMIT dataset, the researchers believe this isn't necessarily the end of the single-vector paradigm. "One of the subtle but important conclusions of our work is that single vectors are, in theory, incredibly powerful – far more than current models demonstrate," the researchers told VentureBeat. "The geometric capacity is there, but empirically, we haven't yet been able to train models that fully exploit it. Ultimately, our work is a call to action for the research community. It highlights that we have been overly biased by popular benchmarks like MTEB and underscores the urgent need for more rigorous evaluation and continued innovation in the core principles of embedding-based retrieval."
