Was it only a few years ago that a terabyte was a huge dataset? Now that every random gadget from the internet of things is “phoning home” a few hundred bytes at a time and every website wants to track everything we do, it seems terabytes just aren’t the right unit any more. Log files are getting larger, and the best way to improve performance is to study these endless records of every event.
Rockset is one company tackling this problem. It is devoted to bringing real-time analytics to the stack so that companies can exploit all of the information in event streams as they happen. The company’s service is built on top of RocksDB, an open source, key-value database designed for low latency ingestion. Rockset has tuned it to handle the unending flow of bits that must be watched and understood to ensure that modern, interaction-heavy websites are performing correctly.
VentureBeat sat down with Venkat Venkataramani, CEO of Rockset, to talk about the technical challenges faced in building this solution. His outlook on data was largely forged in engineering leadership roles at Facebook, where a large number of data management innovations occurred. In conversation, we pressed particularly on the database that lies at the heart of the Rockset stack.
VentureBeat: When I look over your webpage, I don’t really see the word “database” very often. There are words like “querying” and other verbs that you normally associate with databases. Does Rockset think of itself as a database?
Venkat Venkataramani: Yes, we are a database built for real-time analytics in the cloud. In the 1980s, when databases came into being, there was only one kind of database. It was a relational database and it was only used for transaction processing.
After a while, about 20 years later, companies had enough data that they wanted more powerful analytics to run their businesses better. So data warehouses and data lakes were born. Now fast-forward 20 years from there. Every year, every enterprise is generating more data than what Google had to index in 2000. Every enterprise is now sitting on so much data, and they need real-time insights to build better products. Their end users are demanding interactive real-time analytics. They need business operations to iterate in real time. And that is what I would consider our focus. We call ourselves a real-time analytics database or a real-time indexing database, essentially a database built from scratch to power real-time analytics in the cloud.
VentureBeat: What’s different between the traditional transactional processing and your version?
Venkataramani: Transaction processing systems are usually fast, but they don’t [excel at] complex analytical queries. They do simple operations. They just create a bunch of records. I can update the records. I can make it my system of record for my business. They are fast, but they’re not really built for compute scaling, right? They’re built for reliability. You know: Don’t lose my data. This is my one source of truth and my one system of record. It offers point-in-time recovery and transactional consistency.
But if every write needs transactional consistency, you can’t run a single-node transactional database faster than about 100 writes per second. And we’re talking about data torrents of millions of events per second. They’re not even in the ballpark.
So then you go to warehouses. They give you scalability, but they’re too slow. It’s too slow for data to come into the system. It’s like living in the past. They’re often hours behind or even days behind.
The warehouses and lakes give you scale, but they don’t give you speed like you might expect from a system of record. Real-time databases are the ones that demand both. The data never stops coming, and it’s going to be coming in torrents. It’s gonna be coming in millions of events per second. That is the objective here. That is the end goal. This is what the market is demanding. Speed, scale, and simplicity.
VentureBeat: So you’re able to add indexing to the mixture but at the cost of avoiding some transaction processing. Is making a choice in the trade-off the solution, at least for some users?
Venkataramani: Correct. We are saying we’ll give you the same speed as an old database, but give up transactions because you’re doing real-time writes anyway. You don’t need transactions, and that allows us to scale. The combination of the converged index along with the distributed SQL engine is what allows Rockset to be fast, scalable, and quite simple to operate.
The other thing about real-time analytics is the speed of the queries is also very important. It’s important in terms of data latency, like how quickly data gets into the system for query processing. But more than that, the query processing also has to be fast. Let’s say you’re able to build a system where you can accumulate data in real time, but every time you ask a question, it takes 40 minutes for it to come back. There’s no point. My data ingestion is fast but my queries are slow. I am still not able to get visibility into that in real time, so it doesn’t matter. This is why indexing is almost like a means to an end. The end is very fast query performance and very short data latency. So fast queries on fresh data is the real goal for real-time analytics. If you have only fast queries on stale data, that is not real-time analytics.
VentureBeat: When you look around the world of log-file processing and real-time solutions, you often find Elasticsearch. And at the core is Lucene, a text search engine just like Google. I’ve always thought that Elastic was kind of overkill for log data. How much do you end up imitating Lucene and other text-search algorithms?
Venkataramani: I think the technology you see in Lucene is pretty amazing for when it was created and how far it has come. But it wasn’t really built for these kinds of real-time analytics. So the biggest difference between Elastic and Rockset comes from the fact that we support full-featured SQL including JOINs, GROUP BY, ORDER BY, window functions, and everything you might expect from a SQL database. Rockset can do this. Elasticsearch cannot.
When you can’t JOIN datasets at query time, there is a tremendous amount of operational complexity that is thrown at the operator. That is why people don’t use Elasticsearch for business analytics as much and use it predominantly for log analytics. One big property of log analytics is you don’t need JOINs. You have a bunch of logs and you need to search through those logs; there are no JOINs.
VentureBeat: The problem gets more complicated when you want to do more, right?
Venkataramani: Exactly. For business data, everything is a JOIN with this, or a JOIN with that. If you cannot JOIN datasets at query time, then you are forced to de-normalize data at ingestion time, which is operationally difficult to deal with. Data consistency is hard to achieve. And it also incurs a lot of storage and compute overhead. So Lucene and Elasticsearch have a few things in common with Rockset, such as the idea to use indexes for efficient data retrieval. But we built our real-time indexing software from scratch in the cloud, using new algorithms. The implementation is entirely in C++.
We use converged indexes, which deliver both what you might get from a database index and also what you might get from an inverted search index in the same data structure. Lucene gives you half of what a converged index would give you. A data warehouse or columnar database will give you the other half. Converged indexes are a very efficient way to build both.
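To make the "two halves" idea concrete, here is a minimal Python sketch. It is a toy illustration under stated assumptions, not Rockset's implementation: the class name `ToyConvergedIndex` and its methods are hypothetical. Every document is written once into both an inverted index (the search-engine half) and a per-field column map (the columnar half), so a selective lookup and a scan-style aggregate can each use the structure that suits it.

```python
from collections import defaultdict


class ToyConvergedIndex:
    """Hypothetical toy: one write path feeding two index layouts."""

    def __init__(self):
        # Inverted index: (field, value) -> ids of documents with that value.
        self.inverted = defaultdict(set)
        # Columnar layout: field -> {doc_id: value}.
        self.columns = defaultdict(dict)

    def add(self, doc_id, doc):
        # Each field of each document is indexed both ways at ingest time.
        for field, value in doc.items():
            self.inverted[(field, value)].add(doc_id)
            self.columns[field][doc_id] = value

    def lookup(self, field, value):
        # Search-style query: which documents have field == value?
        return sorted(self.inverted.get((field, value), set()))

    def column_sum(self, field):
        # Analytics-style query: aggregate one field across all documents.
        return sum(self.columns[field].values())


index = ToyConvergedIndex()
index.add(1, {"user": "alice", "page": "/home", "ms": 120})
index.add(2, {"user": "bob", "page": "/cart", "ms": 340})
index.add(3, {"user": "alice", "page": "/cart", "ms": 95})

alice_docs = index.lookup("user", "alice")  # [1, 3]
total_ms = index.column_sum("ms")           # 555
```

The sketch stores every value twice, which mirrors the trade-off described in the interview: more storage spent on indexes in exchange for less compute per query.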
VentureBeat: Does this converged index span multiple columns? Is that the secret?
Venkataramani: Converged index is a general purpose index that has all the advantages of both search indexes and columnar indexes. Basic columnar formats are data warehouses. They work really well for batch analytics. But the minute you come into real-time applications, you have to be spinning compute and storage 24/7. When that happens, you need a compute-optimized system, not a storage-optimized system. Rockset is compute-optimized. We will be able to give you 100 times better query performance because we’re indexing. We build a whole bunch of indexes on your data and, byte-for-byte, the same data set will consume more storage in Rockset, but you get extreme compute efficiency.
VentureBeat: I noticed that you say things like connect to your traditional databases as well as event backbones like Kafka streams. Does that mean that you might even separate the data storage from the indexing?
Venkataramani: Yes, that is our approach. For real-time analytics, there will be some data sources like Kafka or Kinesis where the data doesn’t necessarily live elsewhere. It’s coming in large volumes. But for real-time analytics you need to join these event streams with some system of record.
Some of your clickstream data could be coming from Kafka and then turn into a fast SQL table in Rockset. But it has user IDs, product IDs, and other information that has to be joined with your device data, product data, user data, and other things that need to come from your system of record.
That is why Rockset also has built-in real-time data connectors with transactional systems such as Amazon DynamoDB, MongoDB, MySQL, and PostgreSQL. You can continue to make your changes to your system of record, and those changes will also be reflected in Rockset in real time. So now you have real-time tables in Rockset, one coming from Kafka and one coming from your transactional system. You can now join and do analytics on it. That is the promise.
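The pattern described here, an event table joined at query time with a table mirrored from a system of record, can be sketched with ordinary SQL. The example below uses Python's built-in `sqlite3` module purely as a stand-in; it does not use Rockset, Kafka, or any connector, and the table names `clicks` and `users` and their columns are assumptions made for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# "clicks" stands in for a table fed by an event stream;
# "users" stands in for a table mirrored from a system of record.
conn.executescript("""
    CREATE TABLE clicks (user_id INTEGER, product_id INTEGER);
    CREATE TABLE users (user_id INTEGER PRIMARY KEY, country TEXT);
""")
conn.executemany("INSERT INTO clicks VALUES (?, ?)", [(1, 10), (2, 11), (1, 12)])
conn.executemany("INSERT INTO users VALUES (?, ?)", [(1, "US"), (2, "DE")])


def clicks_by_country(connection):
    # Join events with user records at query time, then aggregate.
    return connection.execute("""
        SELECT u.country, COUNT(*) AS clicks
        FROM clicks AS c
        JOIN users AS u ON u.user_id = c.user_id
        GROUP BY u.country
        ORDER BY u.country
    """).fetchall()


before = clicks_by_country(conn)  # [('DE', 1), ('US', 2)]

# A change in the system of record shows up in the next query
# without rewriting any of the stored click events.
conn.execute("UPDATE users SET country = 'CA' WHERE user_id = 1")
after = clicks_by_country(conn)   # [('CA', 2), ('DE', 1)]
```

Because the join happens at query time, the click events never need to be denormalized with user attributes at ingestion, which is the operational burden the interview attributes to systems without JOIN support.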
VentureBeat: That’s the technologist’s answer. How does this help the non-tech staff?
Venkataramani: A lot of people say, “I don’t really need real time because my team looks at these reports once a week and my marketing team doesn’t at all.” The reason why you don’t need this now is because your current systems and processes are not expecting real-time insights. The minute you go real time is when nobody needs to look at these reports once a week anymore. If any anomalies happen, you will get paged immediately. You don’t have to wait for a weekly meeting. Once people go real time, they never go back.
The real value prop of such real-time analytics is accelerating your business growth. Your business is not running in weekly or monthly batches. Your business is actually innovating and responding all of the time. There are windows of opportunity that are available to fix something or take advantage of an opportunity and you need to respond to it in real time.
When you’re talking tech and databases, this is often lost. But the value of real-time analytics is so immense that people are just turning around and embracing it.