What is a data lake? Definition, benefits, architecture and best practices

What is a data lake?

A data lake is defined as a centralized and scalable storage repository that holds large volumes of raw big data from multiple sources and systems in its native format.

To picture this, think of an actual lake: the water is raw data that flows in from multiple capture sources and can flow out again to serve a range of internal and customer-facing purposes. A data warehouse is much narrower, more like a household tank that stores cleaned water (structured data) for the use of one particular house (function) and nothing else.

Data lakes can be implemented using tools built in-house or third-party vendor software and services. According to MarketsandMarkets, the global data lake software and services market is expected to grow from $7.9 billion in 2019 to $20.1 billion in 2024. A number of vendors are expected to drive this growth, including Databricks, AWS, Dremio, Qubole and MongoDB. Many organizations have even started offering so-called lakehouse products, which combine the benefits of both data lakes and warehouses in a single product.

Data lakes work on the concept of load first and use later, which means the data stored in the repository doesn’t necessarily have to be used immediately for a specific purpose. It can be dumped as-is and used all together (or in parts) at a later stage as business needs arise. This flexibility, combined with the vast variety and amount of data stored, makes data lakes ideal for data experimentation as well as machine learning and advanced analytics applications within an enterprise.
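The load-first, use-later idea can be illustrated with a minimal Python sketch. Everything here is hypothetical: `raw_store` is an in-memory stand-in for lake storage, and the record shapes are invented for illustration. Nothing is parsed or validated on the way in; structure is applied only when a purpose (here, listing purchases) appears.

```python
import json

# Hypothetical in-memory stand-in for a data lake's raw storage.
raw_store: list[str] = []

def load(record: str) -> None:
    """Load first: keep the record exactly as it arrived, with no parsing."""
    raw_store.append(record)

def read_purchases() -> list[dict]:
    """Use later: interpret the raw records only once a purpose exists."""
    purchases = []
    for line in raw_store:
        try:
            event = json.loads(line)
        except json.JSONDecodeError:
            continue  # not JSON; it stays in the store for some other future use
        if isinstance(event, dict) and event.get("type") == "purchase":
            purchases.append({"user": event["user"], "amount": float(event["amount"])})
    return purchases

load('{"type": "purchase", "user": "u1", "amount": "19.99"}')
load('{"type": "click", "user": "u2", "page": "/home"}')
load("free-text support note, no structure at all")

print(read_purchases())
```

All three records remain in `raw_store` regardless of whether the current reader can make sense of them, which is what leaves room for later, unplanned uses.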

Key benefits of having a data lake

Unlike data warehouses, which only store processed structured data (organized in rows and columns) for predefined business intelligence and reporting applications, data lakes can store data of any kind without requiring a predefined structure. This could be structured data, semi-structured data, or even unstructured data such as images (.jpg) and videos (.mp4).

The benefits of a data lake for enterprises include the following:

Architecture of a data lake: Storage and analysis process

Data lakes use a flat architecture, and can have many layers depending on technical and business requirements. No two data lakes are built exactly alike. However, there are some key zones through which the general data flows: the ingestion zone, landing zone, processing zone, refined data zone and consumption zone.

1. Data ingestion

This component, as the name suggests, connects a data lake to external relational and nonrelational sources (such as social media platforms and wearable devices) and loads raw structured, semi-structured and unstructured data into the platform. Ingestion is performed in batches or in real time, though different types of data may require different ingestion technologies.

Currently, all major cloud storage providers offer solutions for low-latency data ingestion. Relevant services include Amazon S3, AWS Glue, Amazon Kinesis, Amazon Athena, Google Dataflow, Google BigQuery, Azure Data Factory, Azure Databricks and Azure Functions.

2. Data landing

Once ingestion completes, all the data is stored as-is with metadata tags and unique identifiers in the landing zone. According to Gartner, this is usually the largest zone in a data lake today (in terms of volume) and serves as an always-available repository of detailed source data, which can be used and reused for analytic and operational use cases as the need arises. The presence of raw source data also makes this zone an initial playground for data scientists and analysts, who experiment to define the purpose of the data.
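As a rough sketch of what storing data as-is with metadata tags and unique identifiers could look like, the Python below writes a payload unchanged next to a JSON metadata sidecar. The `land` function, the `.raw` and `.meta.json` naming, and the metadata fields are all assumptions made for illustration, not a convention of any particular product.

```python
import hashlib
import json
import tempfile
import uuid
from datetime import datetime, timezone
from pathlib import Path

def land(landing_dir: Path, payload: bytes, source: str, tags: list[str]) -> Path:
    """Write the payload unchanged into the landing zone with a metadata sidecar."""
    object_id = uuid.uuid4().hex                  # unique identifier
    data_path = landing_dir / f"{object_id}.raw"
    data_path.write_bytes(payload)                # stored as-is, no transformation
    metadata = {
        "id": object_id,
        "source": source,
        "tags": tags,
        "size_bytes": len(payload),
        "sha256": hashlib.sha256(payload).hexdigest(),
        "landed_at": datetime.now(timezone.utc).isoformat(),
    }
    sidecar = landing_dir / f"{object_id}.meta.json"
    sidecar.write_text(json.dumps(metadata, indent=2))
    return data_path

# A temporary directory stands in for the landing zone.
with tempfile.TemporaryDirectory() as tmp:
    raw = b"device=watch-7,heart_rate=72\n"
    path = land(Path(tmp), raw, source="wearables", tags=["health", "streaming"])
    stored = path.read_bytes()
    meta = json.loads(path.with_name(path.stem + ".meta.json").read_text())

print(meta["id"], meta["tags"], meta["size_bytes"])
```

The checksum is an optional extra: it lets a later stage confirm that the landed bytes still match what arrived.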

3. Data processing

Once the purpose of the data is known, copies of it move from the landing zone to the processing stage, where refinement, optimization, aggregation and quality standardization take place by imposing schemas. This zone makes the data analysis-worthy for various business use cases and reporting needs.

Notably, data copies are moved into this stage to ensure that the original arrival state of the data is preserved in the landing zone for future use. For instance, if new business questions or use cases arise, the source data could be explored and repurposed in different ways, without the bias of previous optimizations.
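The copy-then-process idea can be sketched in a few lines of Python. The records, field names and the target schema (`id`, `temp_c`) are invented for illustration; the point is only that processing works on copies and leaves the landing records untouched.

```python
import copy

# Hypothetical raw records as they sit in the landing zone.
landing_zone = [
    {"id": "a1", "temp": "21.5", "unit": "C", "note": "sensor rebooted"},
    {"id": "a2", "temp": "70.7", "unit": "F"},
    {"id": "a3", "temp": "n/a", "unit": "C"},
]

def process(records: list[dict]) -> list[dict]:
    """Return schema-conforming rows built from copies of the raw records."""
    processed = []
    for record in copy.deepcopy(records):      # work on copies only
        try:
            temp = float(record["temp"])
        except ValueError:
            continue                           # fails the schema; still kept in landing
        if record["unit"] == "F":
            temp = (temp - 32) * 5 / 9         # standardize on Celsius
        processed.append({"id": record["id"], "temp_c": round(temp, 1)})
    return processed

processing_zone = process(landing_zone)
print(processing_zone)
print(len(landing_zone))  # the raw records are all still there
```

Because record `a3` and the free-text `note` field survive in the landing zone, a later use case (say, studying sensor reboots) can still draw on them even though this particular schema dropped them.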

4. Refined data zone

When the data is processed, it moves into the refined data zone, where data scientists and analysts set up their own data science and staging zones to serve as sandboxes for specific analytic projects. Here, they control the processing of the data to repurpose raw data into structures and quality states that could enable analysis or feature engineering.

5. Consumption zone

The consumption zone is the last stage of general data flow within a data lake architecture. In this layer, the results and business insights from analytic projects are made available to the targeted users, whether technical decision-makers or business analysts, through analytic consumption tools and SQL and non-SQL query capabilities.
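To make the SQL side of consumption concrete, here is a small Python sketch that uses the standard-library `sqlite3` module as a stand-in for whatever query engine a real lake exposes. The `daily_sales` table and its rows are made up for illustration.

```python
import sqlite3

# An in-memory SQLite database stands in for the lake's query layer.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE daily_sales (day TEXT, region TEXT, revenue REAL)")
conn.executemany(
    "INSERT INTO daily_sales VALUES (?, ?, ?)",
    [
        ("2022-01-01", "EU", 1200.0),
        ("2022-01-01", "US", 1800.0),
        ("2022-01-02", "EU", 900.0),
    ],
)

# The kind of question a business analyst might ask of refined data.
rows = conn.execute(
    "SELECT region, SUM(revenue) FROM daily_sales GROUP BY region ORDER BY region"
).fetchall()
conn.close()

for region, revenue in rows:
    print(region, revenue)
```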

Data lake challenges

Over the years, cloud data lake and warehousing architectures have helped enterprises scale their data management efforts while lowering costs. However, the current set-up has some challenges, such as:

Data lake security: 6 best practices for enterprises in 2022

1. Identify data goals

In order to prevent your data lake from becoming a data swamp, it is recommended to identify your organization’s data goals — the business outcomes — and appoint an internal or external data curator who could assess new sources/datasets and govern what goes into the data lake based on those goals. Clarity on what type of data has to be collected can help an organization dodge the problem of data redundancy, which often skews analytics.

2. Document incoming data

All incoming data should be documented as it is ingested into the lake. The documentation usually takes the form of technical metadata and business metadata, although new forms of documentation are also emerging. Without proper documentation, a data lake deteriorates into a data swamp that is difficult to use, govern, optimize and trust, and in which users cannot find the data they need.
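One way to picture this documentation is a small catalog that records technical and business metadata side by side. The Python sketch below is a toy: the `CatalogEntry` fields, the `register` and `search` helpers, and the example datasets are all hypothetical, and a production lake would normally rely on a dedicated catalog service instead.

```python
from dataclasses import dataclass, field

@dataclass
class CatalogEntry:
    dataset: str
    # Technical metadata
    file_format: str
    columns: dict[str, str]
    # Business metadata
    owner: str
    description: str
    tags: list[str] = field(default_factory=list)

catalog: dict[str, CatalogEntry] = {}

def register(entry: CatalogEntry) -> None:
    """Document a dataset at ingestion time; refuse duplicate names."""
    if entry.dataset in catalog:
        raise ValueError(f"{entry.dataset} is already documented")
    catalog[entry.dataset] = entry

def search(term: str) -> list[str]:
    """Find datasets whose description or tags mention the term."""
    term = term.lower()
    return sorted(
        name
        for name, entry in catalog.items()
        if term in entry.description.lower()
        or term in (tag.lower() for tag in entry.tags)
    )

register(CatalogEntry(
    dataset="crm_accounts",
    file_format="csv",
    columns={"account_id": "string", "signup_date": "date"},
    owner="sales-ops",
    description="Customer accounts exported nightly, used for churn analysis",
    tags=["customers", "churn"],
))
register(CatalogEntry(
    dataset="web_clickstream",
    file_format="json",
    columns={"user_id": "string", "url": "string", "ts": "timestamp"},
    owner="marketing",
    description="Raw page-view events from the website",
    tags=["behavior"],
))

print(search("churn"))
```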

3. Maintain quick ingestion time

The ingestion process should run as quickly as possible. Eliminating prior data improvements and transformations increases ingestion speed, as does adopting new data integration methods for pipelining and orchestration. This helps make the data available as soon as possible after data is created or updated, so that some forms of reporting and analytics can operate on it.

4. Process data in moderation

The main goal of a data lake is to provide detailed source data for data exploration, discovery and analytics. If an enterprise processes the ingested data with heavy aggregation, standardization and transformation, then many of the details captured with the original data will get lost, defeating the whole purpose of the data lake. So, an enterprise should make sure to apply data quality remediations in moderation while processing.

5. Focus on subzones

Individual data zones in the lake can be organized by creating internal subzones. For instance, a landing zone can have two or more subzones depending on the data source (batch or streaming). Similarly, the data science zone under the refined datasets layer can include subzones for analytics sandboxes, data laboratories, test datasets, learning data and training, while the staging zone for data warehousing may have subzones that map to data structures or subject areas in the target data warehouse (e.g., dimensions, metrics and rows for reporting tables).
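A zone and subzone layout of this kind often ends up as a path convention in object storage. The Python sketch below shows one possible convention; the `LAYOUT` mapping, the `lake` prefix and the subzone names are illustrative assumptions loosely based on the examples above, not a standard.

```python
from pathlib import PurePosixPath

# Hypothetical zone -> subzone layout.
LAYOUT = {
    "landing": ["batch", "streaming"],
    "refined/data_science": ["sandboxes", "laboratories", "test_datasets", "training"],
    "refined/staging": ["dimensions", "metrics"],
}

def zone_path(zone: str, subzone: str, dataset: str) -> PurePosixPath:
    """Build the storage path for a dataset, rejecting undeclared subzones."""
    if subzone not in LAYOUT.get(zone, []):
        raise ValueError(f"unknown subzone {subzone!r} in zone {zone!r}")
    return PurePosixPath("lake") / zone / subzone / dataset

print(zone_path("landing", "streaming", "wearables"))
print(zone_path("refined/data_science", "sandboxes", "churn_model"))
```

Rejecting undeclared subzones keeps ad hoc folders from accumulating, which is one small guard against the lake drifting toward a swamp.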

6. Prioritize data security

Security has to be maintained across all zones of the data lake, from landing to consumption. To ensure this, connect with your vendors and see what they are doing in these four areas: user authentication, user authorization, data-in-motion encryption and data-at-rest encryption. With these elements in place, an enterprise can keep its data lake actively and securely managed and reduce the risk of external or internal leaks (due to misconfigured permissions and other factors).
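Of the four areas, user authorization is the easiest to illustrate in a few lines. The Python sketch below assumes a simple deny-by-default table of roles per zone; the role names, zone names and the `is_allowed` helper are hypothetical, and in practice these rules would live in the vendor's access-control system and not in application code.

```python
# Hypothetical per-zone permissions: zone -> role -> allowed actions.
ZONE_PERMISSIONS = {
    "landing": {"data_engineer": {"read", "write"}, "data_scientist": {"read"}},
    "processing": {"data_engineer": {"read", "write"}},
    "refined": {"data_scientist": {"read", "write"}},
    "consumption": {"business_analyst": {"read"}, "data_scientist": {"read"}},
}

def is_allowed(role: str, zone: str, action: str) -> bool:
    """Deny by default: only explicitly granted (zone, role, action) triples pass."""
    return action in ZONE_PERMISSIONS.get(zone, {}).get(role, set())

print(is_allowed("data_scientist", "landing", "read"))     # granted explicitly
print(is_allowed("business_analyst", "landing", "read"))   # never granted
print(is_allowed("data_engineer", "archive", "write"))     # unknown zone
```

Defaulting to denial means a forgotten or misspelled entry blocks access instead of exposing data, which addresses the misconfigured-permissions risk mentioned above.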