Yugabyte CTO outlines a PostgreSQL path to distributed cloud

Like others, Yugabyte is a database company that's building a high-performance distributed database for supporting large, geographically distributed cloud workloads. Yugabyte did not quite start from scratch, however. At the core of its code is PostgreSQL, an open source database with a history that spans several decades. But PostgreSQL was originally built to run on just one computer, so Yugabyte’s teams have rebuilt the guts to scale.

VentureBeat sat down with CTO and cofounder Karthik Ranganathan to understand what the company borrowed and what its team built to create the tool. Ranganathan, who was closely involved in the first wave of modern NoSQL activity as an engineering lead at Facebook, tells the tale.

This interview has been edited for brevity and clarity.

VentureBeat: In effect, you're creating a big replicated and sharded version of PostgreSQL. Why Postgres?

Karthik Ranganathan: We see that Postgres is actually the fastest-growing database. It's happening for many reasons, but I'll just focus on three. The No. 1 reason is that it's completely open. It's very transparent about the features roadmap. The second reason is that it is a truly open source database that any of the modern cloud companies can pick up and run without having to worry about paying for Oracle, Db2, or SQL Server. And the No. 3 reason is that it is the most feature-rich open source database. It's got features that can actually match those of other databases, like Oracle, Db2, or SQL Server.

VentureBeat: So how has Yugabyte set about changing it?

Ranganathan: Modernizing an application is actually easy. That's one of the reasons why we picked Postgres. We are completely open source as well. We reuse the upper half of Postgres completely so we are Postgres-compatible almost to a fault. Like in the sense that if you have an application running on Postgres, it just runs. But you need to figure out how to make it run well in a distributed substrate. So our message that we're trying to get across is that if people are picking Postgres to run an application in the cloud, we have done the work to get Postgres to run in the cloud. If you expect to grow the application in the cloud, you have high availability needs or replication needs built into the data model, those are things we can take care of exceptionally well.

VentureBeat: I remember, say 20 to 30 years ago, Postgres and MySQL were the two leaders. But MySQL really jumped out and became the foundation for the LAMP stack, which proliferated. Then it seems like in recent years, Postgres jumped into the limelight and began generating so much more interest and so much more excitement. Why do you think that is?

Ranganathan: First, 30 years ago, open source [databases were not] the norm. If you told people, "Hey, here's an open source database," they're going to say, "Okay? What does that mean? What is it? What does it really mean? And why should I be excited?" And so on. I remember because at Facebook I was a part of the team that built an open source database called Cassandra, and we had no idea what would happen. We thought "Okay, here's this thing that we're putting out in the open source, and let's see what happens." And this is in 2007.

Back in that day, it was important to use a restrictive license -- like GPL -- to encourage people to contribute and not just take stuff from the open source and never give back. So that's the reason why a lot of projects ended up with GPL-like licenses.

Now, MySQL did a really good job in adhering to these workloads that came in the web back then. They were tier two workloads initially. These were not super critical, but over time they became very critical, and the MySQL community aligned really well and that gave them their speed.

But over time, as you know, open source has become a staple. And most infrastructure pieces are starting to become open source. The more open the better, right? And [fewer] restrictions means anybody can control the roadmap, anybody can contribute to it. If there's a big company wanting a fix and no one has time to do it, they can invest in building a team around it. All of this becomes much easier with a very transparent and open community.

Postgres is really having a day in the sun because of that, but it's also because Postgres has an incredibly strong set of features. When you compare it with the likes of Oracle and SQL Server and Db2 -- triggers and stored procedures and partial indexes -- it's just got a lot of complex features built in. That made it viable for people moving off these existing databases that are mostly on-prem. If you want to run an application in the cloud, you have to find an equivalent database that can support that application. And it just happened to be Postgres. If you connect MySQL's rise to the rise of the LAMP stack, you can connect PostgreSQL's rise to the rise of the cloud movement.

VentureBeat: You mentioned that at the top level, the highest level, you're completely Postgres-compatible. Does that mean a storage engine underneath is what you've replaced?

Ranganathan: It's more than that actually. We have replaced the storage engine, among other things, but we have made the database completely replicated and highly available. So there's really no single point of failure.

You can abstract out the upper half of Postgres itself into things that receive the query, perform security checks and verifications, and compute the way you execute a query. And then, you know, go ahead and do the execution. We've retained all of that.

What we’ve changed is not just the storage engine. It's also the replication engine. Your data might be sitting on one node or a bunch of other nodes, right? So this node needs to not only understand that the data is in a different storage engine. It also needs to know about the location of the different pieces of data. The second bit is now that your data is replicated, if you fail you're going to want some other node to take over instantaneously. So you need to know how to fail over to the right node to pick it up. It's almost a dynamic membership problem. And the third bit is around the system catalog. We have the place where the set of tables you created is stored. That's just stored as a bunch of files in Postgres. We really needed to make that replicated and highly available as well.

And finally, we tackled the problem [uncovered] when you create a table on machine No. 1 and No. 2 should recognize it instantly. You can't have this lag where the table says it's not there or you're triggering an ALTER TABLE fail. We have to do all of this type of stuff when we replace the bottom layer.
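Editor's sketch: the routing and failover problems Ranganathan describes can be illustrated with a toy model in Python. This is not YugabyteDB's implementation. The names `Tablet`, `Cluster`, `route`, and `node_failed` are hypothetical, the hash-based placement is an assumption, and a real system would elect the new leader through a consensus protocol rather than simply promoting the first surviving replica.

```python
import hashlib
from dataclasses import dataclass


@dataclass
class Tablet:
    """One shard of a table, copied onto several nodes (hypothetical model)."""
    replicas: list   # names of the nodes that hold a copy
    leader: str      # node currently serving reads and writes

    def fail_over(self, dead_node):
        # Forget the dead node; if it was the leader, promote a survivor.
        self.replicas = [n for n in self.replicas if n != dead_node]
        if self.leader == dead_node:
            if not self.replicas:
                raise RuntimeError("no surviving replica for this tablet")
            self.leader = self.replicas[0]


class Cluster:
    """Maps keys to tablets and tablets to nodes (toy placement scheme)."""

    def __init__(self, nodes, num_tablets=4, replication_factor=3):
        self.tablets = []
        for i in range(num_tablets):
            # Spread replicas round-robin across the nodes.
            replicas = [nodes[(i + j) % len(nodes)]
                        for j in range(replication_factor)]
            self.tablets.append(Tablet(replicas=replicas, leader=replicas[0]))

    def tablet_for(self, key):
        # Hash the key so every node computes the same location for it.
        digest = hashlib.sha256(key.encode()).digest()
        return self.tablets[int.from_bytes(digest[:4], "big") % len(self.tablets)]

    def route(self, key):
        """Return the node a query for this key should be sent to."""
        return self.tablet_for(key).leader

    def node_failed(self, node):
        """Update every tablet that had a replica on the failed node."""
        for tablet in self.tablets:
            if node in tablet.replicas:
                tablet.fail_over(node)


if __name__ == "__main__":
    cluster = Cluster(["node-1", "node-2", "node-3"])
    before = cluster.route("user:42")
    cluster.node_failed(before)
    print(before, "->", cluster.route("user:42"))
```

The point of the sketch is that the query layer has to answer two questions the single-node Postgres executor never had to ask: which node holds this row, and which node should be asked now that the previous one is gone.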

VentureBeat: When I look through a lot of your literature, you push YugabyteDB as a SQL database. But you also have a NoSQL API. How does that work? Is NoSQL just a layer that's translated into SQL below? Or are they independent?

Ranganathan: It's side by side. That's another core piece of IP for us. Half of our team has database blood from Oracle, and another bunch of the core team is from Facebook, where we actually built the first few NoSQL databases, including Cassandra. I think our "Aha!" moment, after building both sides, is that it is possible to build a storage engine where the data format is uniform. The way you access data can be independent of the query format.

Our aim is to make it simple to build cloud-native applications. Naturally, we don't want to take a side. We don't want to say, "Look, we're only SQL. All of you NoSQL [folks] are doing it wrong. You need to move over to SQL." That message never works.

We said that doing both is a real advantage. There are some things that NoSQL does that are really good. So we said, in order to build the perfect database, we have to perfectly hybridize the two sides. Picking a SQL API and putting all the NoSQLisms inside is going to take a very long time. It's going to be like this for many years.

Let me give you a simple example. A SQL client driver -- a JDBC driver -- is only aware of a single node. If you say "Connect to this node," that's all it does. A NoSQL client is a smart client, where after you connect to one node, it'll discover all the other nodes. It'll discover nodes that you add or remove. It'll discover the locations of these various nodes to say, "Look, this is in US West. That's in US Central. This is in US East. I can read only from US West." You can do all sorts of really powerful things with the NoSQL client.

Now it's just difficult to hybridize these two because you need driver-level changes on the SQL side, which is a core DB feature. It's difficult for a company to do this while catching up. So we said we're going to follow an alternative approach, where we give multiple APIs on top of the database. We'll build an extensible query layer that's more exhaustive than the Postgres query layer. Of course, what we have is the Postgres one, but we also support an Apache Cassandra-compatible API. It's a completely different API, but data is stored in the same storage. The replication mechanisms are the same, but the access patterns are optimized for NoSQL.
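Editor's sketch: the difference between a single-node driver and the "smart client" described above can be shown with a small Python model. Everything here is hypothetical: `FakeCluster` stands in for a real cluster's membership service, and `SmartClient` is not the API of any actual Cassandra, YCQL, or JDBC driver.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Node:
    host: str
    region: str


class FakeCluster:
    """Stand-in for a cluster that can report its own membership (hypothetical)."""

    def __init__(self, nodes):
        self._nodes = {n.host: n for n in nodes}

    def add(self, node):
        self._nodes[node.host] = node

    def remove(self, host):
        self._nodes.pop(host, None)

    def topology(self, contact_host):
        # Only a live node can answer a topology request.
        if contact_host not in self._nodes:
            raise ConnectionError(f"{contact_host} is unreachable")
        return list(self._nodes.values())


class SmartClient:
    """Connects to one seed node, then discovers the rest of the cluster."""

    def __init__(self, cluster, seed_host):
        self._cluster = cluster
        self._seed = seed_host
        self.nodes = []
        self.refresh()

    def refresh(self):
        # Ask the seed first, then any node learned earlier, for the member list.
        for host in [self._seed] + [n.host for n in self.nodes]:
            try:
                self.nodes = self._cluster.topology(host)
                return
            except ConnectionError:
                continue
        raise ConnectionError("no known node is reachable")

    def pick_node(self, region=None):
        """Choose a node, optionally restricted to one region."""
        candidates = [n for n in self.nodes
                      if region is None or n.region == region]
        if not candidates:
            raise LookupError(f"no node available in region {region!r}")
        return candidates[0].host


if __name__ == "__main__":
    cluster = FakeCluster([Node("a", "us-west"),
                           Node("b", "us-central"),
                           Node("c", "us-east")])
    client = SmartClient(cluster, seed_host="a")
    print(client.pick_node(region="us-west"))
```

A single-node driver would hold only the host it was given. The smart client keeps a refreshed membership list, so it can survive the loss of its seed node and pin reads to a chosen region.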

VentureBeat: Does that mean I could do a SQL query, select on a certain table, and it would find the right columns and do that and then I could turn around and on the same table I could just do a Cassandra-like query?

Ranganathan: Not on the same table. You could have a SQL table sitting right next to a NoSQL table and you could have both of them transactionally consistent. All of your replication, encryption at rest -- all of that is taken care of for you. But not on the same table.

Our aim is to cater to microservices that either need tremendous scale and distribution or great scale but also a tremendous amount of relational integrity. We can go both ways. But the reality is that your apps are going to look purely one or the other. Either SQL or NoSQL.

VentureBeat: You talked about transactional consistency. How do you maintain that across the two different styles of tables? One side gets a Cassandra-style Yugabyte Cloud Query Language (YCQL) and the other gets SQL?

Ranganathan: Tables can be either multi-row transactional or single-row. You can opt in to do multi-row or multi-table transactions on the NoSQL side. We're adding to that world -- you can have indexes, and those are net new things that we bring to that world. But on the SQL side, all tables are transactional by default to the highest degree. You really can't opt out of transactions with SQL.

These two tables are silos that have respective APIs. But you can use these respective APIs. You can use the Postgres foreign data wrappers to connect them. You can do interesting things. For example, you can declare an external table on the Postgres side to say "Look, that's an external table that you can access." You can do things like that. But other than that, you cannot cross-access the data because we want to build best of breed -- not the lowest common denominator -- on both sides.

VentureBeat: There are a number of extensions to PostgreSQL, like the geographic information or GIS tools. Can you work with them?

Ranganathan: They do. At least on the query layer, all the extensions work. Those that hit the storage layer of Postgres will not because we replace the storage engine. So geographic information works, but we are still building GiST indexes. You can make your queries, but the queries won't be efficient today because we don't have GiST index support. That's more of a lower half thing, right? We have to organize data according to the GIS tasks but once we do that, it's going to work beautifully. But the upper half already just works.
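Editor's sketch: why a spatial query can "work" without an index yet be inefficient can be seen in a few lines of Python. Note the substitution: this uses a simple uniform grid, not a GiST tree, and the names `scan_query` and `GridIndex` are hypothetical. Both functions return the same rows; the indexed version only looks at points stored in cells that overlap the search box.

```python
import math
from collections import defaultdict


def in_box(point, box):
    """True if (x, y) lies inside the box (min_x, min_y, max_x, max_y)."""
    min_x, min_y, max_x, max_y = box
    return min_x <= point[0] <= max_x and min_y <= point[1] <= max_y


def scan_query(points, box):
    """No index: every stored point is examined for every query."""
    return [p for p in points if in_box(p, box)]


class GridIndex:
    """Toy spatial index: points are bucketed by the grid cell they fall in."""

    def __init__(self, points, cell=1.0):
        self.cell = cell
        self.cells = defaultdict(list)
        for p in points:
            self.cells[self._cell_of(p[0], p[1])].append(p)

    def _cell_of(self, x, y):
        return (math.floor(x / self.cell), math.floor(y / self.cell))

    def query(self, box):
        """Visit only the cells the box overlaps, then filter exactly."""
        min_x, min_y, max_x, max_y = box
        cx0, cy0 = self._cell_of(min_x, min_y)
        cx1, cy1 = self._cell_of(max_x, max_y)
        hits = []
        for cx in range(cx0, cx1 + 1):
            for cy in range(cy0, cy1 + 1):
                for p in self.cells.get((cx, cy), []):
                    if in_box(p, box):
                        hits.append(p)
        return hits


if __name__ == "__main__":
    points = [(0.5, 0.5), (2.5, 2.5), (-1.2, 3.3), (2.1, 2.9)]
    box = (2.0, 2.0, 3.0, 3.0)
    print(sorted(scan_query(points, box)))
    print(sorted(GridIndex(points).query(box)))
```

This mirrors the split Ranganathan describes: the query semantics live in the upper half and already give correct answers, while the data organization that makes them fast belongs to the storage layer.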

VentureBeat: Do you find that people are using one side of the APIs much more than the other?

Ranganathan: Postgres is on fire. It's not even close. The YCQL-side [NoSQL side] is big, but the sheer amount of usage, the number of apps, and the number of people using it on the Postgres side are just incredible. It's just staggering.
