SQL query logs hold the context AI agents need to stop hallucinating joins

When Miro’s data team pointed AI agents directly at its Snowflake environment, the agents got the wrong answer more than 65% of the time. The problem wasn’t the model — it was context. With more than 10,000 tables and no semantic layer to guide routing, the agents had no way to know which data assets matched which business questions.

DataHub is releasing a context intelligence layer Thursday that mines existing SQL query history to build a semantic index — and exposes it to agents via MCP, LangChain, Google’s Agent Development Kit and CrewAI. The company calls it Context Intelligence, and it’s built on the same query-log infrastructure DataHub has used for lineage tracking in production deployments worldwide.

The company was founded by the team that built DataHub as an open source project at LinkedIn, where co-founder and CTO Shirshanka Das led data infrastructure for nearly 11 years. The open source project now has more than 15,000 contributors and 3,000 production deployments worldwide.

"For the first time, enterprises can turn years of analyst query history into a living, retrievable knowledge base where agents stop hallucinating joins because they have access to the joins that have worked before, validated by the people who ran them," Shirshanka Das, co-founder and CTO of DataHub, told VentureBeat in an exclusive interview.

Why query history beats raw schema for agent routing

DataHub began as a metadata management project at LinkedIn, built to solve two problems simultaneously: making data easy to find and use across the organization while ensuring it was only used for the right reasons. Das open-sourced the project in early 2020 after nearly six years of internal development.

The primary use case in the years since has been lineage — understanding how data flows from operational systems through streaming infrastructure into warehouses and out to business tools. Regulatory compliance audits, operational triage and new engineer onboarding all depend on that lineage graph. Postgres is the most-connected source in the DataHub deployment base globally, followed by MySQL, Oracle and the major cloud warehouses including Snowflake and Google BigQuery. The platform supports more than 100 connected metadata sources.

That deployed base matters for what DataHub is releasing. The query log extraction and SQL parsing capabilities powering Context Intelligence were developed across years of production deployment, not built for this release. The same infrastructure now serves agents querying a semantic index at runtime.

"The consumption layer has changed from humans to agents," Das said.

Context Intelligence mines validated query history, not raw logs

Context Intelligence is a new capability layer built on top of DataHub's existing open source metadata foundation. The open source platform has spent years extracting and parsing query logs from connected warehouses for lineage tracking. That same infrastructure is what Context Intelligence draws on to build the semantic index. The capability is new. The underlying plumbing is not.

Filtering for signal. Warehouse query logs contain too much noise to use directly. DataHub's engine filters for what Das describes as the "golden queries," meaning high-quality analyst queries and scheduled pipelines that represent proven business logic.
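What makes a query "golden" is not spelled out here. As a rough illustration only, the Python sketch below assumes two hypothetical signals: a query either runs on a schedule or has been run successfully by at least two distinct users. The `QueryLogEntry` fields, the `golden_queries` function and the `min_users` threshold are invented for the example and are not DataHub's schema or API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class QueryLogEntry:
    """Hypothetical shape of one warehouse query-log record."""
    sql: str
    user: str
    succeeded: bool
    scheduled: bool  # True if issued by a scheduled pipeline


def normalize(sql: str) -> str:
    """Crude normalization: lowercase and collapse whitespace so repeated
    runs of the same query group together. (This also lowercases string
    literals, which a real system would need to avoid.)"""
    return " ".join(sql.lower().split())


def golden_queries(log: list[QueryLogEntry], min_users: int = 2) -> list[str]:
    """Keep successful queries that are scheduled or shared by several users."""
    users_by_query: dict[str, set[str]] = {}
    scheduled: set[str] = set()
    for entry in log:
        if not entry.succeeded:
            continue  # failed queries are treated as noise
        key = normalize(entry.sql)
        users_by_query.setdefault(key, set()).add(entry.user)
        if entry.scheduled:
            scheduled.add(key)
    return sorted(
        key
        for key, users in users_by_query.items()
        if key in scheduled or len(users) >= min_users
    )


example_log = [
    QueryLogEntry("SELECT * FROM tmp_x", "ann", True, False),    # one-off exploration
    QueryLogEntry("select  a from orders", "ann", True, False),  # same query, two users
    QueryLogEntry("SELECT a FROM orders", "bob", True, False),
    QueryLogEntry("SELECT b FROM daily", "etl", True, True),     # scheduled pipeline
    QueryLogEntry("SELECT c FROM broken", "ann", False, True),   # failed run
]
golden = golden_queries(example_log)
```

In this toy log, the one-off exploratory query and the failed run are dropped, while the shared analyst query and the scheduled pipeline survive. Real filters would likely weigh many more signals, such as recency or downstream usage.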

Inverting SQL into semantic definitions. The engine extracts patterns from those queries and translates them into structured text definitions DataHub calls semantic anchors. Those anchors form the retrieval basis agents draw on before generating SQL. "You can almost think of it as inverting text to SQL," Das said.
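To make the "inverting" idea concrete, here is a minimal Python sketch that reads a validated query and emits a plain-text description of the join it encodes. The regular expressions, the `semantic_anchor` function and the output format are assumptions made for illustration. A production system would use a full SQL parser, and the actual anchor format DataHub generates is not shown here.

```python
import re

# Simplified patterns: a single base table and simple equality joins,
# each with an optional table alias. Not a general SQL parser.
FROM_RE = re.compile(r"\bfrom\s+([\w.]+)", re.IGNORECASE)
JOIN_RE = re.compile(
    r"\bjoin\s+([\w.]+)(?:\s+(?:as\s+)?\w+)?\s+on\s+([\w.]+)\s*=\s*([\w.]+)",
    re.IGNORECASE,
)


def semantic_anchor(sql: str, business_question: str) -> str:
    """Turn a proven query into a short text definition an agent can retrieve."""
    base = FROM_RE.search(sql)
    if base is None:
        raise ValueError("no FROM clause found")
    joins = [
        f"JOIN {table} ON {left} = {right}"
        for table, left, right in JOIN_RE.findall(sql)
    ]
    return f"{business_question}: {' '.join([base.group(1)] + joins)}"


example_sql = (
    "SELECT customers.region, SUM(orders.amount) "
    "FROM orders JOIN customers ON orders.customer_id = customers.id "
    "GROUP BY customers.region"
)
anchor = semantic_anchor(example_sql, "Revenue by customer region")
```

The resulting string pairs a business question with the join path that analysts have already used to answer it, which is the kind of text an agent could search before attempting to write SQL of its own.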

Human validation on top. Context Hub lets domain experts review AI-proposed context, resolve conflicting definitions and simulate the impact of changes before publishing. DataHub surfaces cases where different teams calculate the same metric differently and raises them for human resolution.
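Spotting the same metric calculated in different ways can be sketched in a few lines of Python. The example below assumes metric definitions arrive as hypothetical `(metric, team, expression)` tuples and compares expressions after trivial normalization. The `find_conflicts` name and the data are invented, and how Context Hub actually detects conflicts is not described here.

```python
from collections import defaultdict


def find_conflicts(definitions):
    """Group definitions by metric and return the metrics that have more than
    one distinct expression, with the teams behind each variant."""
    variants = defaultdict(lambda: defaultdict(list))
    for metric, team, expression in definitions:
        # Normalize case and whitespace so cosmetic differences are ignored.
        normalized = " ".join(expression.lower().split())
        variants[metric][normalized].append(team)
    return {
        metric: {expr: sorted(teams) for expr, teams in by_expr.items()}
        for metric, by_expr in variants.items()
        if len(by_expr) > 1
    }


example_definitions = [
    ("active_users", "growth", "COUNT(DISTINCT user_id)"),
    ("active_users", "finance", "count(distinct  account_id)"),
    ("revenue", "finance", "SUM(amount)"),
    ("revenue", "sales", "sum(amount)"),  # same logic, different casing
]
conflicts = find_conflicts(example_definitions)
```

Here only `active_users` is flagged, because the two `revenue` expressions differ in casing alone. Anything flagged would then go to a domain expert for resolution rather than being settled automatically.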

How Miro got AI agents working across 10,000 Snowflake tables

Miro, the digital collaboration platform, was already using DataHub for lineage tracking and impact analysis when it began testing analytics agents against its Snowflake environment. Ronald Angel, product manager for the data platform at Miro, told VentureBeat that the scale of the data estate became the problem immediately. Sending natural language queries directly to the Snowflake MCP produced incorrect answers more than 65% of the time. Exposing more than 10,000 tables directly to agents caused too much confusion for reliable routing.

Miro addressed the problem by organizing data into well-defined data products that constrain what agents can see rather than exposing raw schema. The production architecture runs from user requests submitted via Claude Chat or Claude Cowork through a context layer where DataHub's MCP maps natural language to the appropriate data assets, then hands off to Snowflake's MCP for SQL generation.

Angel said the context layer pulls in metadata, entity relationships, query history and business intent for each Snowflake table, specifically what business question each entity is designed to answer. Those semantic signals allow the agent to identify the correct database entities before writing SQL rather than guessing from schema alone.
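The routing step can be pictured with a deliberately simple Python sketch: score each data asset by how many words its stated business intent shares with the question, and pass only the best matches on to SQL generation. The catalog entries, table names and `route` function are hypothetical. Miro's production system relies on DataHub's MCP server and richer signals, not keyword overlap.

```python
import re


def tokens(text: str) -> set[str]:
    """Lowercased alphanumeric words in a piece of text."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def route(question: str, catalog: dict[str, str], top_k: int = 1) -> list[str]:
    """Rank assets by word overlap between the question and each asset's
    business-intent description. Assets with no overlap are never returned."""
    question_words = tokens(question)
    scored = [
        (len(question_words & tokens(intent)), name)
        for name, intent in catalog.items()
    ]
    # Highest score first; ties broken alphabetically for stable output.
    ranked = sorted((s for s in scored if s[0] > 0), key=lambda s: (-s[0], s[1]))
    return [name for _, name in ranked[:top_k]]


# Hypothetical data products, each described by the question it answers.
example_catalog = {
    "analytics.board_activity": "how many boards were created or edited per day",
    "finance.subscriptions": "monthly recurring revenue and subscription plan changes",
    "support.tickets": "customer support tickets and resolution time",
}
chosen = route("What was monthly recurring revenue last quarter?", example_catalog)
```

The point of the design is the narrowing itself: the SQL-generating step sees one or two relevant entities instead of thousands of tables, so there is less schema to guess from.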

Pinecone, Oracle, Redis, Microsoft: how DataHub fits the context stack

Data vendors including Pinecone, Oracle and Redis all have contextual memory capabilities. On the platform side, Microsoft has built out Fabric IQ as a semantic layer for context.

DataHub’s argument isn’t feature parity. The company is positioning the context layer as platform-neutral — provisioning context into existing endpoints like Snowflake semantic views and Microsoft Fabric IQ rather than replacing them.

"A lot of times people want to be platform neutral when it comes to their context layer," Das said.

Kevin Petrie, an analyst at BARC, told VentureBeat that DataHub's ability to integrate diverse metadata for both structured and unstructured objects, including documents and images, differentiates it in the market.

"Many other vendors are more focused on structured tables, which provide trusted facts but often lack the rich context of text objects," he said.

Michael Ni, VP and principal analyst at Constellation Research, told VentureBeat that what stands out about DataHub’s context layer is its support for the shift from passive cataloging to continuously refreshed semantic intelligence. Ni described the competition for context as the next major platform war, arguing that whoever controls context at runtime controls the decision layer for data, agents and workflows.

"Buyers need to be careful, since many vendors only support a portion of the full context capabilities required for AI and agentic solutions," Ni said. "Buyers should be clear on their context management requirements, as vector memory isn't business meaning, business meaning isn't governance, and governance isn't execution."
