Database trends: The rise of the time-series database

The problem: Your mobile app just went viral, and you've got a boatload of new users flooding your servers with a bazillion packets of data. How can you store this firehose of sensor data? Is there a way to deliver some value with statistical analysis? Can you do this all on a budget with a well-tuned database that won't drive the price of supporting the widget through the roof?

The time-series database (TSDB) is designed to handle these endless streams, and it's one of the most notable current trends in database technology. It gives developers a tool for tracking the bits flowing from highly interactive websites and devices connected to the internet. It pairs that storage with algorithms tuned for fast statistical queries, which makes it popular for tackling problems like online ad selection and smart device support.

The TSDB has grown in popularity in recent years, and last year it was the fastest-growing type of database in the enterprise, largely because of the growing number of use cases for it. After all, time-series data is a sequence of data points collected over time, giving you the ability to track changes over that period -- and that's what you need to do if you're running sophisticated transactions like advertising, ecommerce, supply chain management, and more.

What are some other major use cases for a TSDB?

Appliance makers are adding internet connections to add a bit of zip to their product lines, and now these devices are all phoning home to report data so that customers can manage them from their phones anywhere and anytime.
Mobility is becoming an extension of the cloud. The rent-by-minute scooters and ride-sharing platforms track users before, during, and after the ride. All of these data points can be studied to improve performance and plan deployments for future demands.
Many documents are slowly turning from a single block of data into a stream of changes. Word processors that used to store only the current version of a document are now recording every keystroke and mouse click that produced it. This makes editing simpler, with infinite levels of "undo" available.
Houses are becoming more digital, and many items that were once little more than a switch (e.g., a thermostat, lamp, or television) are now recording events every second or even more often.

What makes a TSDB shine?

First, datasets are large and getting larger. Log files are measured in petabytes now, and they're growing. Devices from the so-called internet of things (IoT) are proliferating, and they're often designed to rely on a central service for analysis and presentation. Sense.com, for instance, collects information on electrical consumption in houses millions of times per second. When these bits are reported, Sense.com's central database must store enough data to be useful but not so much that it overwhelms the storage.

Second, time-series datasets often have fewer relationships between data entries in different tables, so they rarely need the transaction-based locking that is used to avoid inconsistencies. Most of the data packets contain a timestamp, several sensor readings, and not much more.

This simple structure allows special indices to speed up queries such as counting the events in a day, week, or other time period. Good time-series indices can offer quick answers to statistical questions about ranges of data.
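
A time-based index of this kind can be sketched in a few lines of Python. This is only an illustration, not the design of any particular product: the class name DailyCountIndex and its methods are hypothetical, and it assumes timestamps are Unix seconds. The idea is to keep one counter per day so that a range count touches a handful of buckets instead of every raw event.

```python
from collections import Counter

SECONDS_PER_DAY = 86_400


class DailyCountIndex:
    """Hypothetical index: one event count per day, keyed by day number."""

    def __init__(self):
        self._counts = Counter()  # day number -> events seen on that day

    def add(self, timestamp):
        # Assumes Unix seconds; integer division picks the day bucket.
        self._counts[int(timestamp) // SECONDS_PER_DAY] += 1

    def count_between(self, first_day, last_day):
        # Sum the per-day buckets, inclusive on both ends.
        # Days with no events count as zero.
        return sum(self._counts[day] for day in range(first_day, last_day + 1))
```

Counting a week of events then means adding seven stored numbers, no matter how many raw events arrived during that week.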

The databases can also take over much of the maintenance because many of the chores are regular and easy to automate. The databases can automatically dispose of old data while delivering only fresh statistics. While standard databases are designed to store data forever, time-series databases can be configured to give data elements a specific time to live. Others will use a round-robin algorithm that stores a fixed number of data points and overwrites the oldest as new ones arrive.
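
Both retention policies can be sketched with Python's standard deque. This is a minimal illustration rather than how any specific database implements retention; the names TTLSeries and make_round_robin are hypothetical, and the time-to-live version assumes points arrive in timestamp order.

```python
from collections import deque


class TTLSeries:
    """Hypothetical series that drops points older than ttl_seconds."""

    def __init__(self, ttl_seconds):
        self.ttl_seconds = ttl_seconds
        self.points = deque()  # (timestamp, value) pairs, oldest first

    def append(self, timestamp, value):
        self.points.append((timestamp, value))
        cutoff = timestamp - self.ttl_seconds
        # Assumes timestamps arrive in non-decreasing order, so expired
        # points are always at the left end of the deque.
        while self.points and self.points[0][0] < cutoff:
            self.points.popleft()


def make_round_robin(slots):
    """Fixed-size buffer: once full, each new point evicts the oldest."""
    return deque(maxlen=slots)
```

With a time to live of 10 seconds, appending points at times 0, 5, 12, and 20 leaves only the points from times 12 and 20. A round-robin buffer with three slots that receives the values 0 through 4 ends up holding 2, 3, and 4.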

The databases also deploy specialized compression functions that store time-series data in less space. If sensor readings don't change from millisecond to millisecond, there's no reason to store another copy of the same value. Timescale.com, for instance, boasts of storage savings of 94% to 97% thanks to compression algorithms tuned to the regular data patterns.
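
The simplest version of this idea, keeping a reading only when it differs from the previous one, might look like the sketch below. It is not Timescale's algorithm, which the article does not describe; the function names are hypothetical, and the approach assumes that a missing point means "same value as before."

```python
def compress_unchanged(points):
    """Keep a (timestamp, value) point only when the value changes."""
    compressed = []
    for timestamp, value in points:
        if not compressed or compressed[-1][1] != value:
            compressed.append((timestamp, value))
    return compressed


def value_at(compressed, timestamp):
    """Return the latest stored value at or before timestamp, else None."""
    result = None
    for stored_time, value in compressed:
        if stored_time > timestamp:
            break
        result = value
    return result
```

Six readings of 21.5, 21.5, 21.5, 22.0, 22.0, and 21.5 shrink to three stored points, and value_at can still answer what the sensor read at any moment in between. The trade-off is that the compressed form can no longer show when a sensor stopped reporting.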

Who benefits the most?

Tracking how people, machines, and organizations behave over time is the key to customization. Time-series databases that optimize the collection and analysis of time-series data open up the opportunity to provide business models that adjust and avoid one-size-fits-all standardization. Algorithms that place advertising, for instance, can look at recent behavior. Intelligent devices like thermostats can search through events and understand what people want at different times of the day.

How are legacy players approaching it?

All major databases have long had fields that store dates and times. All of the traditional queries for searching or tabulating the data still work with these entries. Oracle databases, for example, have been popular on Wall Street for storing regular price quotes. They aren't optimized like the new databases, but that doesn't mean that they can't answer the questions with a bit more computational power. Sometimes it's cheaper to buy bigger machines than switch to a new database.

Some applications may collect a variety of data values, and some may be best suited to the stability of a traditional database. Banking applications, for instance, are filled with ledger transactions that are just time-series tables of the total deposits. Still, bank developers can be some of the most conservative, and they may prefer a legacy database with a long history over a new tool with better efficiencies.

Sometimes the traditional companies are rolling out newer models that compete. Oracle, for instance, is also tuning its NoSQL database to search and analyze the time-series data streams from sensors and other real-time sources. The API will maintain a running collection of fresh data points and enforce time-to-live control over the data to avoid overloading the storage.

The newer data analysis engines often include tools specifically built for time-series data. For example, Microsoft's Data Mining tool for its SQL Server has a collection of functions that can look at historical data and predict future trends.

The cloud companies are also adding data storage services for this market. AWS, for example, launched its Timestream service, a tool optimized for IoT data. It will also integrate with the rest of the AWS stack through standard pathways like the Lambda functions, as well as customized ones for machine learning options like SageMaker.

Which new startups are emerging?

New companies see an opportunity in adding the right amount of indexing and post-processing to make queries fast and effective.

InfluxDB began as an open source project and is now available as either a standalone installation or an elastic serverless option from the InfluxDB Cloud. The company's Flux query language simplifies tasks like computing the moving averages of the data stream. The language is functional and designed to be easily composable so queries can be built up from other queries.
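
To show what a moving average over a stream involves, here is a plain Python sketch. It is not Flux and does not use any InfluxDB API; the function name moving_average is chosen for illustration. It is written as a generator so that it can be chained with other generators, loosely mirroring the composable style described above.

```python
from collections import deque


def moving_average(values, window):
    """Yield the mean of the last `window` values once the window is full."""
    if window <= 0:
        raise ValueError("window must be positive")
    recent = deque(maxlen=window)
    total = 0.0
    for value in values:
        if len(recent) == window:
            total -= recent[0]  # this value is about to be evicted
        recent.append(value)
        total += value
        if len(recent) == window:
            yield total / window
```

For the readings 1, 2, 3, 4, 5 and a window of 3, the generator yields 2.0, 3.0, and 4.0. Keeping a running total means each new reading costs a constant amount of work regardless of the window size.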

TimescaleDB is built as an extension to PostgreSQL, so it suits tasks that need both traditional relational tables and time-series data. The company's benchmarks boast of speeding up data ingestion by a factor of 20. Queries that search the data or identify significant values like maxima can be thousands of times faster.

Prometheus stores all data with a timestamp automatically and provides a set of standard queries for analyzing changes in the data. Its query language, PromQL, is a functional language built for selecting and aggregating time-series data. This makes it simple for developers to set up alerts that could be triggered by data anomalies.
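
The kind of check that could sit behind an anomaly alert can be sketched in plain Python. This is not PromQL and not how Prometheus evaluates alerting rules; the function name find_anomalies, the window of five samples, and the threshold of three standard deviations are all assumptions made for illustration.

```python
from statistics import mean, pstdev


def find_anomalies(values, window=5, threshold=3.0):
    """Return indices whose value is far from the mean of the prior window."""
    anomalies = []
    for i in range(window, len(values)):
        history = values[i - window:i]
        center = mean(history)
        spread = pstdev(history)
        if spread == 0:
            # A flat history has no spread; treat any change as anomalous.
            if values[i] != center:
                anomalies.append(i)
        elif abs(values[i] - center) > threshold * spread:
            anomalies.append(i)
    return anomalies
```

Given the readings 10, 11, 10, 9, 10, 10, 50, 10, the function flags only index 6, the spike to 50. An alerting system would typically run a rule like this on a schedule and notify someone when the result is not empty.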

Redis created a special module for ingesting the rapid data flows into the database. The indexing routines build a set of average statistics about the data's evolution. To save memory, it can also downsample or aggregate the elements.

Kdb+, a database that's the foundation of the Kx platform, maintains a connection with relational databases that makes it simpler to work with some of the relational schema that dominate some applications. The streaming analytics built by the database offer both traditional statistics and also some machine learning algorithms.

What's next?

Open source projects and startups have many of the same goals as other tech projects. They all want to find ways to handle bigger data streams with more complicated analytics that are run in more efficient silos -- bigger, faster, smarter, and cheaper.

Beyond that, groups are starting to think about the long-term custodial responsibilities that the endless streams might require. The Whisper open source database, for instance, is designed to gracefully turn high-resolution data that might be compiled from a rapid stream into a lower-resolution, historical summary that can be stored and studied more efficiently over time. The goal is to save space while still providing useful summaries. The database is, in essence, deliberately saving summaries and disposing of the information that was originally entrusted to it.
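
The general technique, often called downsampling or rollup, can be sketched as follows. This is not Whisper's file format or code; the function name downsample is hypothetical, timestamps are assumed to be integer seconds, and averaging is only one of several possible ways to summarize a bucket.

```python
from collections import defaultdict


def downsample(points, bucket_seconds):
    """Average (timestamp, value) points into fixed-width time buckets."""
    sums = defaultdict(float)
    counts = defaultdict(int)
    for timestamp, value in points:
        # Round the timestamp down to the start of its bucket.
        bucket_start = (timestamp // bucket_seconds) * bucket_seconds
        sums[bucket_start] += value
        counts[bucket_start] += 1
    return [(start, sums[start] / counts[start]) for start in sorted(sums)]
```

Five readings taken at seconds 0, 30, 60, 90, and 125 collapse into three one-minute averages. The original readings cannot be recovered from the summary, which is exactly the trade the paragraph above describes.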

The companies are debating the language used by developers to write queries. QuestDB is revisiting and extending SQL by adding features for grouping and analyzing data by time. It believes that SQL is a language that will live on, in part because so many DBAs know it.
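
Grouping by time is already possible in standard SQL, which helps explain the appeal of extending it rather than replacing it. The sketch below uses Python's built-in sqlite3 module and ordinary SQL, not QuestDB or its extensions; the table name readings, its columns, and the function name hourly_averages are assumptions for illustration, and timestamps are assumed to be Unix seconds.

```python
import sqlite3


def hourly_averages(rows):
    """Group (unix_seconds, value) rows into hourly buckets with plain SQL."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE readings (ts INTEGER, value REAL)")
    conn.executemany("INSERT INTO readings VALUES (?, ?)", rows)
    result = conn.execute(
        """
        SELECT (ts / 3600) * 3600 AS hour_start, AVG(value)
        FROM readings
        GROUP BY hour_start
        ORDER BY hour_start
        """
    ).fetchall()
    conn.close()
    return result
```

The integer arithmetic in the SELECT clause does the bucketing by hand. SQL extensions aimed at time-series work generally exist to make that step shorter and easier for the database to optimize.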

Other companies are building specialized languages that are closer to functional programming languages. For example, InfluxDB's Flux language encourages developers to compose their solutions out of multiple smaller, reusable functions.

The companies will also be pushing to extend the presentation layer. Many of the databases are already loosely coupled with graphical dashboards like Grafana. These connections will grow deeper, and in many cases the tools will effectively merge with the time-series database. Matomo, for instance, is presented as a product for tracking visitors to websites rather than as a database.

Is there anything a TSDB can't do?

In a sense, all databases are time-series databases because they maintain a log of the transactions that build up the table. The real question is which applications need to track how data changes over time. Many traditional databases were concerned only with the current state. They tracked, for instance, how many empty seats were left on the airplane. That's the most important detail for selling tickets.

But sometimes there are hidden opportunities in even these applications. For instance, tracking when the tickets are sold can help pricing strategies in the future because airlines can know whether demand is running ahead of or behind historical norms. In this sense, even traditional applications that don't seem to need to track changes over time might be improved. The time-series databases might just be an opportunity.

This article is part of a series on enterprise database technology trends.