Before the web and cloud revolutions, enterprise data was like the diner in a classic Saturday Night Live sketch. The bit was set in a Chicago Loop diner, a place where you could get whatever you wanted, as long as it was a cheeseburger.
That was something like the database market before the web happened: the answer to every database question was “a relational database,” and it tended to come from Oracle, Microsoft or IBM.
New varieties took hold as built-for-purpose NoSQL databases emerged. Today, new contenders are still arriving, usually with a special emphasis on cloud architecture, while earlier entrants evolve their offerings, often folding in familiar pieces of relational technology.
A world where data mostly took the form of entries in corporate ledgers changed into one with data of all kinds, ranging from online users’ activity traces to event logs from machine operations, and more.
Among purpose-built databases that have breached DB-Engines’ top 10 rankings to join Oracle, Microsoft and IBM in popularity are MongoDB (#5), which began life as a document-oriented database; Redis (#6), originally an in-memory key-value store; and Elasticsearch (#7), a search engine that has taken on many database stylings.
Many more databases continue to bubble under the top 10. This story looks at three of them.
Pinecone Systems – machine learning spawns vector database
Pinecone Systems Inc. in a sense comes out of AWS, one of the hotbeds of artificial intelligence (AI) and large-scale machine learning (ML). Pinecone’s CEO and founder, Edo Liberty, formerly directed Amazon AI Labs.
ML work at Amazon and elsewhere introduced a new data type, vector data, to an already vibrant data mix. Pinecone is among a small group, including Milvus, Zilliz and others, bringing vector data platforms to a global market for ML expected to grow from $17 billion in 2021 to $90 billion by 2026.
Vector embeddings are produced by deep learning neural network models that convert raw data, such as text or images, into dense numeric vectors that can serve in various applications. Cloud houses like AWS saw a need to store, search and manage these embeddings as part of their operations.
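A minimal, pure-Python sketch of the core operation a vector database performs: storing embeddings and finding the one closest to a query by cosine similarity. The toy 3-dimensional vectors and document ids here are illustrative; production systems index much higher-dimensional embeddings with approximate nearest-neighbor structures.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: 1.0 means identical direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

def nearest(query, index):
    """Return the id of the stored embedding most similar to the query."""
    return max(index, key=lambda doc_id: cosine_similarity(query, index[doc_id]))

# Toy 3-dimensional "embeddings"; real models emit hundreds of dimensions.
index = {
    "doc_a": [0.9, 0.1, 0.0],
    "doc_b": [0.0, 0.8, 0.6],
    "doc_c": [0.7, 0.7, 0.1],
}

print(nearest([1.0, 0.2, 0.0], index))  # closest in direction to doc_a
```

A dedicated vector database wraps this lookup in persistence, sharding and approximate indexes so it stays fast at billions of vectors.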
Such work is challenging for resource-constrained organizations outside the ranks of the big cloud giants. This provided the impetus for Pinecone’s distributed team located in New York, San Francisco and Tel Aviv.
Beginning in 2019, the company worked to create a vector database that was purpose-built to handle the vector embeddings produced by machine learning. Pinecone cites users including Clubhouse, Expel, CourseHero and others. Among use cases are IT threat detection, document deduplication, personalized article recommendations and semantic search. The company recently released keyword-aware semantic search capabilities built on its vector database.
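One common way to combine keywords with semantic search, sketched here as a general pattern rather than Pinecone's actual algorithm: filter candidates on a required keyword, then rank the survivors by vector similarity. All document ids, keywords and vectors below are made up.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b)))

def hybrid_search(query_vec, required_keyword, docs, top_k=2):
    """Filter documents by keyword, then rank survivors by vector similarity."""
    candidates = [d for d in docs if required_keyword in d["keywords"]]
    candidates.sort(key=lambda d: cosine(query_vec, d["vector"]), reverse=True)
    return [d["id"] for d in candidates[:top_k]]

docs = [
    {"id": "kb-1", "keywords": {"billing"}, "vector": [0.9, 0.1]},
    {"id": "kb-2", "keywords": {"billing", "refund"}, "vector": [0.2, 0.9]},
    {"id": "kb-3", "keywords": {"login"}, "vector": [0.9, 0.2]},
]

print(hybrid_search([1.0, 0.0], "billing", docs))  # kb-1 outranks kb-2; kb-3 is filtered out
```

The appeal of doing this inside the database is that filtering and ranking happen in one pass instead of shuttling candidates between a keyword engine and a vector store.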
Greg Kogan, vice president of marketing for Pinecone, said the company’s goal is to help utilize AI model vector output in production. “We see a lack of tooling and lack of infrastructure existing now for people doing machine learning engineering work. These are people trying to implement AI and ML into their applications,” he told VentureBeat.
The possible applications are vast, and handling vectors appears to be a hurdle many will need better tooling to jump.
InfluxData — time-series data marches on
InfluxData grew out of a 2013 Y Combinator-backed effort to build monitors and collect real-time metrics from SaaS applications. Early on, it pivoted to produce a time-series database purpose-built to handle a rush of unstructured data that cloud and web companies were encountering.
Today, InfluxDB competes with Kx KDB, TimescaleDB and others in a global time-series database software market projected to reach $575 million by 2028, according to Verified Market Research. InfluxData’s customer list includes the likes of Adobe, which applied the software for SharePoint microservices observability; eBay, which pursued anomaly detection; and Cisco, for devops monitoring of SaaS ecommerce apps.
Like other rising NoSQL database makers, InfluxData eventually saw the need to engineer an entirely new storage engine to underlie its flagship offering. At its Influx Days 2022 event earlier in November, the company announced a controlled beta based on a new engine.
Known as InfluxDB IOx, it brings columnar data processing to InfluxDB to run faster queries against multidimensional data sources. Importantly, the company now offers support for SQL, the relational mainstay, with a special eye toward the analytics jobs increasingly applied to time-series data.
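To illustrate the kind of SQL analytics increasingly run against time-series data, here is a toy example using Python's built-in sqlite3. This is a row-oriented stand-in, not IOx's columnar engine, and the measurement schema and values are invented for the example.

```python
import sqlite3

# An in-memory table standing in for a time-series measurement:
# one row per (timestamp, host, cpu usage) sample.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE cpu (ts INTEGER, host TEXT, usage REAL)")
conn.executemany(
    "INSERT INTO cpu VALUES (?, ?, ?)",
    [
        (0, "web-1", 40.0), (60, "web-1", 60.0),
        (0, "web-2", 10.0), (60, "web-2", 30.0),
    ],
)

# A typical analytics query over time-series samples: average usage per host.
rows = conn.execute(
    "SELECT host, AVG(usage) FROM cpu GROUP BY host ORDER BY host"
).fetchall()
print(rows)  # [('web-1', 50.0), ('web-2', 20.0)]
```

A columnar engine speeds up exactly this shape of query, since an aggregate over one column can scan that column's values contiguously instead of touching every field of every row.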
The move to columnar helps InfluxData customers better deal with the explosion of metrics they now monitor, according to Paul Dix, CTO and cofounder of InfluxData. Customer requests led to the SQL support as well.
Now, as it works on what some might describe as “an engine swap,” InfluxData must meet new challenges, Dix admits.
“From a technical perspective, developing the new database core, and then getting everything converted over, is the challenge. From a product perspective, it means bringing new features that this core enables to our users in a way that’s easy to understand and easy to use,” he told VentureBeat. “Over time, what people are going to realize is that having a columnar-style database is likely to provide the best performance and functionality,” he said.
FaunaDB — A database grows at Fauna
Among trends in databases today, none outstrips the cloud in impact. In 2021, revenue for managed cloud database services, or database platform-as-a-service (dbPaaS), rose to $39.2 billion, by Gartner estimates. As analyst Merv Adrian pointed out, dbPaaS now represents more than 49% of all database management systems revenue. Big cloud players lead here.
Still, it’s no surprise that startups are building large-scale distributed databases meant to exploit cloud architecture. Among those is Fauna, maker of a “document-relational” database that matches NoSQL scalability with relational-style transactional consistency. Among startups also targeting distributed cloud databases are Cockroach Labs, ScyllaDB and others.
Fauna arose from the work of developers at Twitter, which needed a globe-spanning scalable database. The work was influenced by the Calvin scalable deterministic database project now stewarded by the University of Maryland, which sought to maintain data consistency without hardware dependencies. Fauna emerged from stealth in 2016, and has since found use by Lexmark, Santander, Fabriq and others.
This month, the company added Intelligent Routing to its list of database capabilities. Intelligent Routing efficiently distributes requests and queries to databases scaling across geographies and cloud providers.
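Latency-aware routing of this general sort can be sketched as picking the reachable replica with the lowest measured latency. The region names and latency figures below are hypothetical, not Fauna's actual topology or algorithm.

```python
# Hypothetical latencies (ms) from a client to each region's replica.
REGION_LATENCY_MS = {
    "us-east": 12,
    "eu-west": 85,
    "ap-south": 190,
}

def route_request(latencies):
    """Pick the reachable region with the lowest measured latency.

    A latency of None marks a region that failed its health probe.
    """
    reachable = {region: ms for region, ms in latencies.items() if ms is not None}
    if not reachable:
        raise RuntimeError("no region reachable")
    return min(reachable, key=reachable.get)

print(route_request(REGION_LATENCY_MS))  # us-east
```

The hard part a database must add on top of this sketch is keeping writes consistent once different clients are routed to different regions.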
The move to distributed databases that are able to maintain data consistently for apps available around the world is important, said Eric Berg, CEO of Fauna.
“When you look back to relational databases of the 70s and 80s, you see that people loved their strong consistency. We all know that once that hit the internet, it didn’t scale physically,” he said.
As an example, he points to Fauna customers that have seen their own customer bases expand globally, and now have a need to run replication and guarantee latency across several cloud regions. Without new tooling, it’s left to the user to connect these regional “dots” – that is, unless that functionality is built into the database, according to Berg.