Database technology evolves to combine machine learning and data storage

When Bob van Luijt, the CEO of SeMI Technologies, looks at the history of databases, he highlights a few distinct waves. First, there was the world of SQL, where all the data fit neatly into rectangular tables. Then came the NoSQL revolution that brought the flexibility of the document model, where each entry didn’t need to have the same fields. Now, his company is bringing Weaviate to the market as part of a wave of AI-centric databases that merge the power of machine learning with data storage.

The new model offers not just the potential for tapping the power of AI algorithms, but also a more flexible search engine that isn’t locked into searching for exact matches. While traditional databases require the names to be spelled correctly or the exact confirmation code to locate a record, Weaviate can find entries that are the most similar.

What does it mean to be similar? That’s still a wide open question for many users. Much of the art goes into defining how to calculate just how close or far apart two pieces of data might be. Finding the closest records in the database begins with finding a metric or a way to specify just what it means to be nearby in some multidimensional space defined by an AI.
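
To make the idea of a metric concrete, here is a minimal Python sketch of two common choices, cosine similarity and Euclidean distance. The three-dimensional vectors and their names are illustrative assumptions; embeddings produced by a real model usually have hundreds of dimensions, and the article does not say which metric Weaviate applies.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between a and b; 1.0 means same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def euclidean_distance(a, b):
    """Straight-line distance between a and b; 0.0 means identical."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

# Hypothetical toy embeddings, invented for this example.
cat = [0.9, 0.1, 0.0]
kitten = [0.8, 0.2, 0.1]
truck = [0.0, 0.2, 0.9]

# Under both metrics, "kitten" lands closer to "cat" than "truck" does.
print(cosine_similarity(cat, kitten) > cosine_similarity(cat, truck))
print(euclidean_distance(cat, kitten) < euclidean_distance(cat, truck))
```

The two metrics can rank neighbors differently, since cosine similarity ignores vector length while Euclidean distance does not. That is part of why choosing a metric is a design decision.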

While SeMI Technologies is the company raising the funds, much of the branding focuses on Weaviate, its open source database. Companies can download the code or purchase Weaviate as a managed service.

Many Weaviate users rely on pre-built models for text in English and other well-known languages. SeMI built one model from the entire collection of Wikipedia articles so that people could experiment. A number of other pre-built models are available, including deepset’s Haystack for semantic search or Jina.ai’s document-based search.

What is a vector database?

The core Weaviate engine, though, can work with any arbitrary collection of values, which is why some call these systems “vector databases.”
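
As a rough illustration of what searching such a collection of values involves, the sketch below stores a few records as lists of numbers and returns the ones closest to a query. It is a brute-force scan written for clarity, not a description of how Weaviate is implemented; the record ids, vectors, and the `nearest` function are hypothetical, and production systems generally rely on approximate indexes rather than comparing against every record.

```python
import heapq
import math

def euclidean_distance(a, b):
    """Straight-line distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def nearest(records, query, k=2):
    """Return the ids of the k records closest to query (full scan)."""
    return heapq.nsmallest(
        k, records, key=lambda rid: euclidean_distance(records[rid], query)
    )

# Hypothetical records: an id mapped to a toy three-dimensional vector.
records = {
    "doc-a": [0.9, 0.1, 0.0],
    "doc-b": [0.8, 0.2, 0.1],
    "doc-c": [0.0, 0.2, 0.9],
}

query = [1.0, 0.0, 0.0]
print(nearest(records, query, k=2))
```

Nothing here depends on the vectors coming from text. Any data that a model can turn into a list of numbers could be stored and searched the same way.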

“The majority of use cases are still in text,” said van Luijt. “But you notice that more and more people start to get it and say, ‘Oh, I can do this with text. Let me also throw some images at it.’”

According to van Luijt, after experimenting with images and audio, some users are importing other data like DNA sequences or geological surveys. Searching through the genome is a natural match for the technology because some genealogical research depends upon inexact matches. Researchers can track the movement of groups and populations through time and location, which opens ways to study human history through DNA data.

Other options are just emerging as users imagine new similarity metrics. One preliminary experiment breaks the earth’s surface into small squares and grades each square’s susceptibility to flooding. The users running it hope to create new models that will better price insurance risk and guide investment in the face of global warming.

Better searching of large datasets

Van Luijt says that Weaviate and SeMI’s new search engine offer faster matching and greater efficiency for large datasets than traditional databases extended with AI algorithms. Some users of traditional databases run a basic search and then export the potential answers to a machine learning model that grades them and chooses the best one.

“If you do that over a thousand documents, you're [going to] be fine. You don't notice anything,” said van Luijt. “But you cannot search the entire database and do tasks like question answering in a few milliseconds, over the entire dataset.”
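
The export-and-grade workflow described above can be sketched in a few lines of Python. Everything here is an assumption made for illustration: the documents are invented, `basic_search` stands in for an exact-match database query, and `grade` is a simple word-overlap score standing in for a real machine learning model.

```python
def basic_search(documents, keyword):
    """Stage 1: cheap exact keyword filter, as a traditional query might do."""
    return [doc for doc in documents if keyword in doc.lower()]

def grade(question, doc):
    """Stage 2 stand-in for a model: share of question words found in doc."""
    q_words = set(question.lower().split())
    d_words = set(doc.lower().split())
    return len(q_words & d_words) / len(q_words)

def answer(documents, keyword, question):
    """Filter first, then grade only the surviving candidates."""
    candidates = basic_search(documents, keyword)
    return max(candidates, key=lambda doc: grade(question, doc), default=None)

# Hypothetical documents, invented for this example.
documents = [
    "Weaviate is an open source vector database",
    "A vector has a magnitude and a direction",
    "SQL tables store data in rows and columns",
]

print(answer(documents, "vector", "which database is open source"))
```

The grading step runs once per candidate, so its cost grows with the number of documents that survive the first stage. That is manageable for a thousand documents and becomes the bottleneck van Luijt describes when the candidate set is the whole database.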

This kind of wide-open opportunity is driving experimentation and investment. Last week, SeMI closed a $16 million funding round that was co-led by New Enterprise Associates (NEA) and Cortical Ventures. In August 2020, Zetta Venture Partners led a $1.6 million seed round with ING Ventures.

“We've been closely watching ML and AI advancements and waiting for the right team and product to reinvent how we work with data,” said Tony Florence, managing general partner, Technology at NEA, in the announcement. “The Weaviate Vector Database enables users to interact with unstructured data as vectors, across text, audio, and images, which unlocks incredibly powerful use cases.”

Assessing the competitive landscape

The competition varies, as there are several other vector databases that offer similar features. Milvus, for instance, emerged from the LF AI & Data Foundation's incubator program, and it also supports searching vector data for similar results. Pinecone.io is tightly integrated with Apache Kafka and offers similarity search for streaming data. Vespa is focused specifically on text applications and using similarity to drive recommendations.

Cloud companies are also bundling the option into their data storage products. Google, for instance, offers the Vertex AI Matching Engine, which helps power their AutoML product.

But traditional database companies are also beefing up their connections with AI algorithms. Oracle, for example, offers a collection of AI algorithms and boasts of “the speed of in-database learning.” IBM has rebranded its classic Db2 as “the AI database” and boasts of using machine learning to boost query performance and “confidence-based querying.”

All want to find a way to support the demands of the computation-heavy artificial intelligence algorithms as they plow through larger and more complex sets of data.

“It's really AI-first infrastructure that we have here,” explained van Luijt. “For the first time, this bridge is being built between all that stuff that's being done in data science and people seeing the promise and need for their companies. We're making that bridge.”
