Pinecone CEO on bringing vector similarity search to dev teams

The traditional way for a database to answer a query is with a list of rows that fit the criteria. If there's any sorting, it's done by one field at a time. Vector similarity search looks for matches by comparing the likeness of objects, as captured by machine learning models. Pinecone.io brings "vector similarity" to the average developer by offering a turnkey service.

Vector similarity search is particularly useful with real-world data because that data is often unstructured and contains similar yet not identical items. It doesn't require an exact match because the so-called closest value is often good enough. Companies use it for things like semantic search, image search, and recommender systems.

Success often depends upon the quality of the algorithm used to turn the raw data into a succinct vector embedding that effectively captures the likeness of objects in a dataset. This process must be tuned to the problem at hand and the nature of the data. An image search application, for instance, could use a simple model that turns each image into a vector filled with numbers representing the average color in each part of the image. Deep learning models that do something much more elaborate are now easy to obtain, in some cases from the deep learning frameworks themselves.
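As a rough illustration of the average-color idea, here is a minimal Python sketch. The function name `average_color_embedding`, the grid size, and the toy image are assumptions made for this example; it is not a description of any model Pinecone or its customers use, and it assumes the image has at least as many rows and columns as the grid.

```python
from typing import List, Tuple

Pixel = Tuple[int, int, int]  # (red, green, blue), each 0-255


def average_color_embedding(image: List[List[Pixel]], grid: int = 2) -> List[float]:
    """Split the image into grid x grid cells and return the mean RGB of each cell."""
    height, width = len(image), len(image[0])
    vector: List[float] = []
    for gy in range(grid):
        for gx in range(grid):
            # Pixel bounds of this cell.
            y0, y1 = gy * height // grid, (gy + 1) * height // grid
            x0, x1 = gx * width // grid, (gx + 1) * width // grid
            cell = [image[y][x] for y in range(y0, y1) for x in range(x0, x1)]
            # One number per color channel: the cell's average intensity.
            for channel in range(3):
                vector.append(sum(p[channel] for p in cell) / len(cell))
    return vector


# A 4x4 toy image: left half pure red, right half pure blue.
RED, BLUE = (255, 0, 0), (0, 0, 255)
toy_image = [[RED, RED, BLUE, BLUE] for _ in range(4)]
embedding = average_color_embedding(toy_image, grid=2)
print(len(embedding))  # 2 x 2 cells x 3 channels = 12 numbers
```

Two images with similar color layouts end up with nearby vectors, which is all a similarity search needs. A learned embedding would capture far more than color, but it is consumed in the same way.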

We sat down with Edo Liberty, the CEO and one of the founders of Pinecone, and Greg Kogan, the VP of marketing, to talk about how they're turning this mathematical approach into a Pinecone vector database that a development team can deploy with just a few clicks.

VentureBeat: Pinecone specializes in finding vector similarities. There have always been ways to chain together lots of WHERE clauses in SQL to search through multiple columns. Why isn't that good enough? What motivated Pinecone to build out the vector distance functions and find the best?

Edo Liberty: Vectors are by no means new things. They have been a staple of large-scale machine learning and a part of machine learning-driven services for at least a decade now in larger companies. It's been kind of "table stakes" for the bigger companies for at least a decade now. My first startup was based on technologies like this. Then, we used it at Yahoo. Then, we built another database that deployed it.

It's a big part of image recognition algorithms and recommendation engines, but it really didn't hit the mainstream until machine learning did. With pretrained models, AI scientists started generating these embeddings, vector representations of complex objects, pretty much for everything. So the barrier became a lot lower and vectors became a lot more common. People suddenly started having these vectors and suddenly, it's like they are asking "OK, what now?"

Greg Kogan: The reason WHERE clauses fall short is that they are only as useful as the number of facets that you have. You can string together WHERE clauses, but it won't produce a ranked answer. Even for something as common as semantic search, once you can get a vector embedding of your text document, you can measure the similarity between documents much better than if you're stringing together words and just looking for keywords within the document. Another thing we're hearing about is search over other unstructured data types like images or audio files, where there was no semantic search before. But now, they can convert unstructured data into vector embeddings. Now you can do vector similarity search on those items and do things like find similar images or find similar products. If you do it on user behavior data or event logs, you can find similar events, similar shoppers, and so on.
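To make the contrast with a WHERE clause concrete, the sketch below ranks items by cosine similarity to a query vector instead of filtering them in or out. It is a brute-force illustration only: the product names and three-dimensional vectors are invented, and the helper `top_k` is hypothetical rather than part of any Pinecone API. Production systems typically rely on approximate nearest-neighbor indexes rather than scoring every item.

```python
import math
from typing import Dict, List, Tuple


def cosine_similarity(a: List[float], b: List[float]) -> float:
    """Cosine of the angle between two vectors: 1.0 means same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k(query: List[float], items: Dict[str, List[float]], k: int) -> List[Tuple[str, float]]:
    """Score every item against the query and return the k best, highest first."""
    scored = [(item_id, cosine_similarity(query, vec)) for item_id, vec in items.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Made-up embeddings; real ones usually have hundreds of dimensions.
products = {
    "red-sneaker": [0.9, 0.1, 0.0],
    "crimson-trainer": [0.8, 0.2, 0.1],
    "blue-sandal": [0.1, 0.9, 0.2],
}
query = [1.0, 0.0, 0.0]
ranked = top_k(query, products, k=2)
for item_id, score in ranked:
    print(f"{item_id}: {score:.3f}")
```

Every item gets a score, so the result is an ordered list rather than a yes-or-no match, which is the ranked answer that chained WHERE clauses cannot provide.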

'Once it's a vector, it's all the same to us'

VentureBeat: What kind of preprocessing do you need to do to get to the point where you've got the vector? I can imagine what it might be for text, but what about other domains like images or audio?

Kogan: Once it's a vector, it's all the same to us. We can perform the same mathematical operations on it. From the user's point of view, they would need to find an embedding model that works with their type of data. So for images, there are many computer vision models available off the shelf. And if you're a larger company with your own data science team, you're most likely developing your own models that will transform images into vector embeddings. It’s the same thing for audio. There's wav2vec for audio, for instance.

For text and images, you can find loads of off-the-shelf models. For audio and streaming data, they're hard to find, so it does take some data science work. So the companies that have the most pressing need for this are those more advanced companies that have their own data science teams. They've done all the data science work and they see that there's a lot more they can do with those vectors.

VentureBeat: Are any of the models more attractive, or does it really involve a lot of domain-specific kind of work?

Kogan: The off-the-shelf models are good enough for a lot of use cases. If you're using basic semantic search over documents, you can find some off-the-shelf models, like sentence embeddings and things like that. They are fine. If your whole business depends on some proprietary model, you may have to do it on your own. Like if you're a real estate startup or financial services startup and your whole secret sauce is being able to model something like financial risk or the price of a house, you're going to invest in developing your own models. You could take some off-the-shelf model and retrain it on your own data to eke out some better performance from it.

Large banks of questions generate better results

VentureBeat: Are there examples of companies that have done something that really surprised you, that built a model that turned out to be much better than you thought it would even end up?

Liberty: If you have a very large bank of questions and good answers to those questions, a common and reasonable approach is to look for what is the most similar question and just return the best answer that you have for this other question, right? It sounds very simplistic, but it actually does a really good job, especially if you have a large bank of questions and answers. The larger the collection, the better the results.
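The approach Liberty describes can be sketched in a few lines of Python. This is a toy under stated assumptions: `embed` uses simple word counts as a stand-in for a learned sentence embedding, and the FAQ entries are invented for the example.

```python
import math
from collections import Counter
from typing import Dict, Tuple


def embed(text: str) -> Counter:
    """Toy stand-in for a sentence embedding: lowercase word counts."""
    return Counter(text.lower().replace("?", "").split())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[word] * b[word] for word in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def answer(question: str, bank: Dict[str, str]) -> Tuple[str, str]:
    """Return the most similar stored question and the answer filed under it."""
    query_vec = embed(question)
    best = max(bank, key=lambda known: cosine(query_vec, embed(known)))
    return best, bank[best]


# Invented question bank for illustration.
faq = {
    "How do I reset my password?": "Use the 'Forgot password' link on the sign-in page.",
    "How do I cancel my subscription?": "Go to Billing and choose Cancel.",
    "Where can I download my invoice?": "Invoices are listed under Billing.",
}
matched, reply = answer("how can I reset a forgotten password", faq)
print(matched)
print(reply)
```

The new question is never seen verbatim in the bank, yet it lands on the password entry because that stored question is the closest one. With more stored questions, a close neighbor is more likely to exist, which is why larger banks tend to give better results.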

Kogan: We didn't even realize it could be applicable for bot detection and duplicate-image detection. So if you're a consumer company that allows uploading of images, you may have a bot problem where a user uploads some bad images. But once that image is banned, they try to upload a slightly tweaked version of that image. Simply looking up a hash of that image is not going to find you a match. But if you look for similarity, like closely similar images, you can suspend that account immediately or at least flag it for review.

We've also heard this for financial services organizations, where they get way more applications than they can manually review. So they want to flag applications that resemble previously flagged fraudulent applications.
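The difference between a hash lookup and a similarity lookup can be shown with a small Python sketch. The pixel values, the threshold of 5.0, and the helper names are assumptions chosen for the example. In practice the vectors would come from an embedding model rather than raw pixels, and the threshold would need tuning against real data.

```python
import hashlib
import math
from typing import List


def exact_hash(pixels: List[int]) -> str:
    """Cryptographic hash of the raw bytes: any change yields a different digest."""
    return hashlib.sha256(bytes(pixels)).hexdigest()


def euclidean_distance(a: List[int], b: List[int]) -> float:
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def is_near_duplicate(candidate: List[int], banned: List[List[int]], threshold: float) -> bool:
    """True if the candidate lies within `threshold` of any banned vector."""
    return any(euclidean_distance(candidate, b) <= threshold for b in banned)


# Toy 8-value "images" (0-255 intensities), invented for illustration.
banned_image = [12, 200, 34, 90, 255, 0, 17, 143]
tweaked_image = [12, 201, 34, 90, 254, 0, 17, 143]  # two values nudged by 1
unrelated_image = [240, 10, 180, 5, 30, 220, 99, 60]

print(exact_hash(banned_image) == exact_hash(tweaked_image))             # False
print(is_near_duplicate(tweaked_image, [banned_image], threshold=5.0))   # True
print(is_near_duplicate(unrelated_image, [banned_image], threshold=5.0)) # False
```

The same pattern would apply to flagging applications that resemble known fraudulent ones: store vectors for the flagged cases, then check whether each new item falls close to any of them.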

VentureBeat: Is your technology proprietary? Did you build this on some kind of open source code? Or is it some mixture?

Kogan: At the core of Pinecone is a vector search library that is a proprietary index. A vector index. We find that people don't care so much about exactly which index it is or whether it's proprietary or open source. They just want to add this capability to their application. How can I do that quickly and how can I scale it up? Does it have all the features we need? Does it maintain its speed and accuracy at scale? And who manages the infrastructure?

Liberty: We do want to contribute to the open source community. And we're thinking about our open core strategy. It's not unlikely that we will support open source indexes publicly soon. What Greg said is accurate. I'm just saying that we are big fans of the open source community and we would love to be able to contribute to it as well.

VentureBeat: Now it seems that if you're a developer, you don't necessarily integrate it with any of the databases per se. You just kind of side-load the data into Pinecone. When you query, it returns some kind of key and you go back to the traditional database to figure out what that key means.

Kogan: Exactly right. Yes, you're running it alongside your warehouse or data lake. Or you might be storing the main data anywhere. Soon we'll actually be able to store more than just the key in Pinecone. We're not trying to be your source of truth for your user database or your warehouse. We just want to eliminate the round trips. Once you find your ranked results or similar items, then we'll have a bit more there. If all you want is the S3 location of that item or the user ID, you've got it in your results.
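The side-by-side arrangement described in this exchange might look roughly like the Python sketch below. Everything in it is a stand-in: an in-memory SQLite table plays the system of record, a plain dictionary plays the vector index, and the ids, titles, and S3 paths are invented. It does not use or imitate Pinecone's actual client API.

```python
import math
import sqlite3
from typing import Dict, List

# System of record: an ordinary relational table (in-memory SQLite here).
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE items (id TEXT PRIMARY KEY, title TEXT, s3_path TEXT)")
db.executemany(
    "INSERT INTO items VALUES (?, ?, ?)",
    [
        ("a1", "Trail running shoe", "s3://example-bucket/a1.jpg"),
        ("b2", "Road running shoe", "s3://example-bucket/b2.jpg"),
        ("c3", "Leather office chair", "s3://example-bucket/c3.jpg"),
    ],
)

# Hypothetical stand-in for the vector index: it holds only ids and vectors.
vector_index: Dict[str, List[float]] = {
    "a1": [0.9, 0.1],
    "b2": [0.8, 0.3],
    "c3": [0.1, 0.9],
}


def nearest_ids(query: List[float], k: int) -> List[str]:
    """Return the ids of the k vectors closest to the query (Euclidean distance)."""
    return sorted(vector_index, key=lambda i: math.dist(query, vector_index[i]))[:k]


# Step 1: the similarity search returns keys only.
ids = nearest_ids([1.0, 0.0], k=2)

# Step 2: a second round trip resolves those keys in the database.
placeholders = ",".join("?" for _ in ids)
rows = db.execute(
    f"SELECT id, title FROM items WHERE id IN ({placeholders})", ids
).fetchall()
titles = dict(rows)
for item_id in ids:  # iterate over ids to keep the similarity ranking
    print(item_id, titles[item_id])
```

Step 2 is the round trip Kogan says Pinecone wants to eliminate: if a small payload such as the S3 path were returned alongside each key, the second query would be unnecessary for many lookups.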

More flexibility on pricing

VentureBeat: On pricing, it looks like you just load everything into RAM. Your prices are determined by how many vectors you have in the dataset.

Kogan: We used to have it that way. We recently started letting some users have a little bit more control over things like the number of shards and replicas. Especially if they want to increase their throughput. Some companies come to us with insanely high throughput demands and latency demands. When they sign up and they create an index, they can choose to have more shards and more replicas for higher availability and throughput. In that case, you still have the same amount of data, but because it's being replicated, you're going to pay more because you're looking for data on more machines.

VentureBeat: How do you handle the jobs where companies are willing to wait a little bit and don't care about a cold start?

Kogan: For some companies, the memory-based pricing doesn't make sense. So we're happy to work with companies to find another model.

Liberty: What you're asking about is a lot more fine-grained control over costs and performance. We do work with larger customers and larger teams. We just sat down with a very large company today. The workload is 50 billion vectors. Usually, we have a very tight response time. Let's say 20, 30, 40, 50 milliseconds is typical 99% of the time. But they say, "This is an analytical workload and we are happy to have a full second of latency or even two seconds." That means they can pay less. We are very happy to work with customers and find trade-offs, but it's not something that's open in the API today. If you sign in on the website and use the product, you won't have those options available to you yet.

Kogan: We simplified the self-serve pricing on the website to make it easier for people to just jump in and play around with it. But once you have 50 billion vectors and crazy performance or scale requirements, come talk to us. We can make it work.

Our initial bet was that more and more companies would use vector data as machine learning models become more prevalent and the data scientists become more productive. They realize that you can do a lot more with your data once it's in a vector format. You can collect less of it and still succeed. There are privacy and consumer protection implications as well.

It's becoming less and less extreme of a bet. We are seeing that the early adopters, the most advanced companies, have already done this. They're using vector similarity search and using recommendation systems for their search results. Facebook uses them for their feed ranking. The vision is that more companies will leverage vector data for recommendation and for many use cases still to be discovered.

Liberty: The leaders already have it. It's already happening. It’s more than just a trend.