Understanding when to use a database, data lake, or data warehouse

The "data" part of the terms "data lake," "data warehouse," and "database" is easy enough to understand. Data are everywhere, and the bits need to be kept somewhere. But should they be stored in a data warehouse, a data lake, or an old-fashioned database? It all depends on how that data is going to be used.

It's difficult to define the names precisely because they are tossed around colloquially by developers as they figure out the best way to store the data and answer questions about it. All three forms share the goal of being able to squirrel away bits so that the right questions are answered quickly.

Still, the terms have evolved and taken on relatively standard meanings.

What is a database?

The database now means both the software that stores and manages the information as well as the information stored within the database. Developers use the word database with some precision to mean a collection of data, because the software needs to know that orders are kept on one machine and the addresses on another.

Users rarely know where the values are kept and may just call the entire system the database. And that's fine -- most software development is about hiding that level of detail. Among databases, the relational database has become a workhorse for much corporate computing. The classic format arranges the data in columns and rows that form tables, and the tables are simplified by splitting the data into as many tables and sub-tables as needed. Good relational databases add indexes to make searching the tables faster. They can employ SQL and use sophisticated planning to simplify repeated elements and produce concise reports as quickly as possible.

Lately, non-relational types of databases have gained traction. These so-called NoSQL databases don't store the data in relational tables. They are often chosen when developers want the flexibility to add new fields or elements for some entries but not others.

But there are use cases where the database is not enough.

What is a data warehouse?

The data warehouse is a collection of databases, although some may use less structured formats for raw log files. The idea of a data warehouse evolved as a consequence of businesses establishing long-term storage of the information that accumulates each day, and to meet the need to report on and analyze that data.

Building a data warehouse is more than just choosing a database and a structure for the tables, as it requires creating retention policies. Data warehouses often include sophisticated analytics to generate statistics to study changes over time. Data warehouses are often tightly integrated with graphics routines that produce dashboards and infographics to quickly show changes in the data.

Generally, the term data warehouse has come to describe a relatively sophisticated and unified system that often imposes some order upon the information before storing it.

What is a data lake?

A data lake takes a different approach to building out long-term storage from a data warehouse. In modern data processing, a data lake stores more raw data for future modeling and analysis, while a data warehouse typically applies a relational schema to the information before it's stored. The data lake may not even use databases to store the information because the extra processing required isn't worth it. The data is stored in flat files or logs.

Lakes are better choices for storing large amounts of records in case someone wants access to a few or many of them in the future. Regulatory compliance is a common use case.

Some use both metaphors for the same system. The incoming raw data is stored in the data lake and, after some analysis and aggregation, the information often finds a home in the data warehouse.

What are some examples?

Databases, warehouses and lakes take many forms because businesses have many different needs for historical record keeping. The choices a business makes for keeping these records affects the architecture and structure. Here are several hypothetical examples:

A Drop-Shipping Company. They sell gadgets online and outsource fulfillment to others. They use a basic database to track orders and often discard records not long after the orders have been delivered. Their products change frequently and so they feel they have no need for historical data.
A Doctor's Office. The medical industry has elaborate regulations to protect patient privacy. They use a special service to store patient records that can offer long-term retrieval for queries that may come years later. The service acts like a lake because the doctor and the patients are not involved in any research that might involve comparing and contrasting outcomes from treatment. The service can just store and retrieve, not analyze.
A Manufacturing Company. The company has a dominant position in a stable industry that requires them to make smart decisions about long-term trends in sales and pricing. They need to compare sales by region over time to make commitments for opening and refurbishing plants and physical warehouses. Managing this supply chain is much easier with a sophisticated data warehouse able to run complex queries.
A Network Security Group. The routers and switches collect plenty of raw data about the packets traveling across the network in case someone wants to analyze any anomalies. These raw values are stored in a big data lake for several weeks until they're no longer needed. If no unusual events occur, the data is disposed of without being analyzed.
A Drug Research Company. The company gathers raw data about drug trials and also compiles aggregated reports for regulation. The company wants to retain the data, perhaps indefinitely, to aid future researchers and satisfy any questions from regulators. It uses a data lake to collect the initial raw information and a warehouse to store aggregated reports.

What the legacy companies are doing in this space

There are two major themes. Some of the companies that make traditional databases are adding features to support analysis and turning the completed product into a data warehouse. At the same time, they're building out extensive cloud storage with similar features to support companies that want to outsource their long-term storage to a cloud.

Microsoft's Azure has been migrating data warehouse work to live under the umbrella called "Synapse Analytics." It integrates Microsoft's cloud storage with the different routines that can include some of the artificial intelligence. The tool is designed to scale to handle petabytes of data using technologies like Apache Spark developed to transform, analyze, and query big data sets. Microsoft also highlights the fact that billing is separate for the storage and computation so users can save money when they can turn off the instances devoted to analytics.

Microsoft also bundles some of the same storage and analytical options under the heading of a data lake. It includes both SQL-based options and the more general object storage, and its marketing materials said it's intended for "data of any size, shape, and speed."

Oracle also offers an Autonomous Data Warehouse for cloud and on-premises that integrates its Autonomous Database with a number of tools with enhanced analytical routines. The service hides all of the work for patching, scaling, and securing the data. It also offers some of the functionality of a data lake, including the classic Big Data tools like Apache Spark, under the "Big Data" product name.

Users of IBM's Db2 can also choose IBM's cloud services to build a data warehouse. Its tool, which is also available as a Docker container for on-premises hosting, bundles together machine learning, statistical, and parallel processing analytical routines with some migration tools for integrating data sources.

What the upstarts are doing in this space

Many of the data warehouses and data lake are built on premises by in-house development teams that use a company's existing databases to create custom infrastructure for answering bigger and more complex queries. They stitch together data sources and add applications that will answer the most important questions. In general, the warehouse or lake is designed to build a strong historical record for long-term analysis.

The cloud companies are offering two different solutions. First, they want to help store the data. Amazon, for instance, offers a wide range of storage solutions at different prices where speed can be traded for savings. Some of the tiers are priced below $1 per terabyte per month for storage alone, but there can be additional charges for retrieval. Some of the slower tiers, called Glacier, can also use a basic subset of SQL to find certain data elements, a useful feature that turns the long-term storage into a database of sorts. Amazon also offers a wide range of analytical tools like the RedShift cloud data warehouse, which works with all of its storage options.

Second, the cloud companies are also integrating their analytics tools with the storage to turn their racks into data warehouses or data lakes. Google's BigQuery database, for instance, is also integrated with some of Google's machine learning tools to make it possible to explore the use of AI with the data that's already stored on its disks.

Some of the upstarts are offering some services and not others. Backblaze, for instances, will store data at rock-bottom prices that may be 60%, 70% or lower than the major clouds. Its API is designed to work just like Amazon’s S3 to make switching easier.

Others are designed to work with any data source. Teradata and Snowflake, for instance, are two companies offering sophisticated tools for adding analysis to data. They emphasize a multi-cloud strategy so users can build their warehouse out of many storage options.

What a database can't do

Is there anything a database can't do that a data warehouse or data lake can?

The terms are not crisp and consistent, but generally databases are more limited in size. Data warehouses and data lakes refer to collections of databases that might be in one, unified product, but often can be a collection built from different merchants. The metaphors are flexible enough to support many different approaches.

This article is part of a series on enterprise database technology trends.