When Socrates reportedly said the “unexamined life is not worth living,” the Greek philosopher didn’t imagine the modern internet with its seemingly unlimited ability to absorb data. Every mouse click, page view, and event seems destined to end up in a log file somewhere. The sheer volume makes juggling all of this information a challenge, which is where a log management database really shines.
Collecting information is one thing; analyzing it is much harder. But many business models depend on finding patterns and making sense of the clickstream to gain an edge and justify their margins. The log database must gather the data and compute important statistics. Modern systems are usually tightly coupled with presentation software that distills the data into a visual infographic.
What is a log management database?
Log management databases are special cases of time-series databases. The information arrives in a steady stream of ordered events, and the log files record them. While log analysis is most often associated with web events like page views or mouse clicks, there's no reason the databases must be limited to that domain. Any sequence of events can be analyzed, including events from assembly lines, industrial plants, and other manufacturing systems.
For instance, a set of log files may follow an assembly line, tracking each item as it reaches various stages in the pipeline. A record may be as simple as noting when a stage finished, or it could include extra data about the customization that happened at that stage, like the paint color or the size. If the line is running smoothly, many of the events will be routine and forgettable. But if something goes wrong, the logs can help diagnose which stage failed. If products need to be thrown away or examined for faults, the logs can narrow that work.
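The idea can be sketched with a few hypothetical assembly-line events; the field names and stages here are invented for illustration, not any particular product's schema.

```python
from datetime import datetime, timezone

# Each log record is an ordered event: a timestamp plus arbitrary fields.
events = [
    {"ts": datetime(2022, 7, 19, 9, 0, tzinfo=timezone.utc),
     "item": "A-100", "stage": "paint", "color": "red", "status": "ok"},
    {"ts": datetime(2022, 7, 19, 9, 5, tzinfo=timezone.utc),
     "item": "A-100", "stage": "assembly", "status": "ok"},
    {"ts": datetime(2022, 7, 19, 9, 7, tzinfo=timezone.utc),
     "item": "A-101", "stage": "paint", "color": "blue", "status": "fault"},
]

# When something goes wrong, the log narrows the search to the failing stage.
faults = [(e["item"], e["stage"]) for e in events if e["status"] == "fault"]
print(faults)  # [('A-101', 'paint')]
```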
Specialized log processing tools began appearing decades ago, and many were focused on simply creating reports that aggregate data to offer a statistical overview. They counted events per day, week, or month and then generated statistics about averages, maxima, and minima. The newer tools offer the ability to quickly search and report on individual fields, like the IP address or account name. They can pinpoint particular words or phrases in fields and search for numerical values.
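Both generations of tooling can be sketched in a few lines: the older style counts events per period and reports simple aggregates, while the newer style searches individual fields. The records below are made up for the example.

```python
from collections import Counter

# Hypothetical web-access records; real log lines would be parsed first.
records = [
    {"day": "2022-07-19", "ip": "10.0.0.1", "bytes": 512},
    {"day": "2022-07-19", "ip": "10.0.0.2", "bytes": 2048},
    {"day": "2022-07-20", "ip": "10.0.0.1", "bytes": 128},
]

# Classic report: events per day plus simple aggregates (max and min).
per_day = Counter(r["day"] for r in records)
sizes = [r["bytes"] for r in records]
print(per_day["2022-07-19"], max(sizes), min(sizes))  # 2 2048 128

# Newer tools add fast field-level search, e.g. all events from one IP.
hits = [r for r in records if r["ip"] == "10.0.0.1"]
print(len(hits))  # 2
```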
What are the challenges of building a log database?
Log data is often described as "high cardinality," which means its fields can hold many different values; the timestamp, for instance, almost never repeats a value. Log databases use algorithms that build indices for locating particular values and optimize those indices for a wide variety of values.
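A toy illustration of why cardinality matters: an inverted index maps each distinct field value to the events containing it, so a field like a timestamp, where nearly every value is unique, produces almost one index entry per event. The events below are invented for the sketch.

```python
from collections import defaultdict

events = [
    {"id": 0, "status": "200", "ts": "2022-07-19T09:00:00.001Z"},
    {"id": 1, "status": "200", "ts": "2022-07-19T09:00:00.002Z"},
    {"id": 2, "status": "404", "ts": "2022-07-19T09:00:00.003Z"},
]

def inverted_index(events, field):
    # Map each distinct value of `field` to the ids of matching events.
    index = defaultdict(list)
    for e in events:
        index[e[field]].append(e["id"])
    return index

# Low cardinality: few keys, longer posting lists -> compact index.
print(len(inverted_index(events, "status")))  # 2
# High cardinality: nearly one key per event -> the index balloons.
print(len(inverted_index(events, "ts")))      # 3
```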
Good log databases can manage archives to keep some data while eliminating other data. They can also enforce a retention policy designed by the compliance offices to answer all legal questions and then destroy data to save money when it’s no longer needed. Some log analysis systems may retain statistical summaries or aggregated metrics for older data.
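A minimal sketch of that retention pattern, with an invented seven-day policy: a statistical summary of expiring events is kept while the raw records are destroyed.

```python
from collections import Counter
from datetime import date, timedelta

today = date(2022, 7, 28)
retention = timedelta(days=7)  # hypothetical policy set by the compliance office

raw = [
    {"day": date(2022, 7, 15), "event": "login"},
    {"day": date(2022, 7, 15), "event": "click"},
    {"day": date(2022, 7, 27), "event": "login"},
]

# Keep an aggregated summary of the expiring data, then drop the raw events.
expired = [e for e in raw if today - e["day"] > retention]
summary = Counter((e["day"].isoformat(), e["event"]) for e in expired)
raw = [e for e in raw if today - e["day"] <= retention]

print(len(raw), summary[("2022-07-15", "login")])  # 1 1
```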
How are legacy databases approaching the market?
The traditional database companies have generally not focused on delivering tools for log storage because traditional relational databases have not been a good match for high cardinality data that's written far more often than it's searched. The cost of building the index that's the core offering of a relational database is often not worth it for large collections of logs, as there just aren't enough future JOINs to justify it. Time-series and log databases tend to avoid regular relational databases for storing raw information, but they can use them to store some of the statistical summaries generated along the way.
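That division of labor can be sketched with Python's standard-library sqlite3 module: the raw events stay out of the relational store, while daily aggregates fit it well. The table and column names are invented for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE daily_summary (
    day TEXT, event TEXT, events INTEGER, PRIMARY KEY (day, event))""")

# Only aggregates distilled from the raw log stream land in the relational store.
conn.execute("INSERT INTO daily_summary VALUES ('2022-07-19', 'page_view', 1204)")
conn.execute("INSERT INTO daily_summary VALUES ('2022-07-19', 'click', 863)")
conn.commit()

total, = conn.execute(
    "SELECT SUM(events) FROM daily_summary WHERE day = '2022-07-19'").fetchone()
print(total)  # 2067
```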
IBM’s QRadar, for instance, is a product designed to help identify suspicious behavior in the log files. The database inside is focused on searching for statistical anomalies. The User Behavior Analytics (UBA) creates behavior models and watches for departures.
Oracle is offering a service called Oracle Cloud Infrastructure Logging Analytics that can absorb log files from multiple cloud sources, index them, and apply some machine learning algorithms. It will find issues ranging from poor performance to security breaches. When the log files are analyzed, the data can also be classified according to compliance rules and stored for the future if necessary.
Microsoft's Azure Monitor will also collect log files and telemetry from throughout the Azure cloud, and the company offers a wide range of analytics on top of it. One example is an SQL API tuned to the needs of database administrators watching the log files of Microsoft's SQL Server.
Who are the upstart companies?
Several log databases are built upon Lucene, a popular open source project for building full-text search engines. While it was originally built to search for particular words or phrases in large blocks of text, it can also break up values into different fields, allowing it to work much like a database.
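The field-splitting idea can be sketched without Lucene itself: a raw log line is broken into named fields, each of which can then be indexed and queried on its own. The line format and field names below are a made-up example, not Lucene's actual API.

```python
import re

line = '10.0.0.1 - alice [19/Jul/2022:09:00:01 +0000] "GET /home HTTP/1.1" 200'

# Break the free-form line into named fields, as a Lucene-style analyzer might,
# so each field can be searched independently like a database column.
pattern = re.compile(
    r'(?P<ip>\S+) - (?P<user>\S+) \[(?P<ts>[^\]]+)\] "(?P<request>[^"]+)" (?P<status>\d+)')
fields = pattern.match(line).groupdict()

print(fields["user"], fields["status"])  # alice 200
```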
Elastic is one company offering a tool that runs multiple instances of Lucene on separate nodes so it can scale automatically as the load increases. The company bundles it with two other open source projects, Logstash and Kibana, to create what it calls the "ELK stack." Logstash ingests the data from raw log files into the Elastic database, while Kibana visualizes the results.
Amazon's log analytics feature is also built upon the open source Elasticsearch, Kibana, and Logstash tools and specializes in deploying and supporting them on AWS cloud machines. AWS and Elastic recently parted ways, so differences may appear in future versions.
Loggly and LogDNA are two other tools built on top of Lucene. They integrate with most log file formats and track usage over time to identify performance issues and potential security flaws.
Not all companies rely on Lucene, in part because the tool includes many features for full-text searching that are less important for log processing, and these features add overhead. Sumo Logic, another performance-tracking company, ingests logs into its own engine and offers its own version of SQL for querying the database.
Splunk built its own database to store log information. Customers generally don't work with the database directly; instead, they interact with applications designed to automate monitoring tasks, like looking for overburdened servers or unusual access patterns that might indicate a breach. Underneath, Splunk's database curates the indexes and slowly archives them as time passes.
EraDB offers another database with a different core but the same API as Elastic. It promises faster ingestion and analysis because its engine was purpose-built for high cardinality log files without any of the overhead that might be useful for text searching.
Is there anything a log database can’t do?
Log databases are ideal for endless streams of events filled with different values. But not all data sources are filled with high cardinality fields. Sources with frequently repeating values may save storage with a more traditional tabular structure.
The log systems built upon text search engines like Lucene may also offer extra features that are not necessary for many applications. In a hypothetical assembly line, for instance, there’s little need to search for arbitrary strings or words. Supporting the ability for arbitrary text search requires more elaborate indexes that take time to compute and disk space to store.
This article is part of a series on enterprise database technology trends.
VentureBeat's mission is to be a digital town square for technical decision-makers to gain knowledge about transformative enterprise technology and transact. Learn more about membership.