What is a data lake?
A data lake is defined as a centralized and scalable storage repository that holds large volumes of raw big data from multiple sources and systems in its native format.
To picture a data lake, think of an actual lake: the water is raw data that flows in from multiple capture sources and can then flow out to serve a range of internal and customer-facing purposes. A data warehouse is much narrower, more like a household tank that stores cleaned water (structured data) for the use of one particular house (function) and nothing else.
Data lakes can be implemented using tools built in-house or third-party vendor software and services. According to MarketsandMarkets, the global data lake software and services market is expected to grow from $7.9 billion in 2019 to $20.1 billion in 2024. A number of vendors are expected to drive this growth, including Databricks, AWS, Dremio, Qubole and MongoDB. Many organizations have also started offering so-called lakehouse products, which combine the benefits of both data lakes and warehouses in a single product.
Data lakes work on the concept of load first and use later, which means the data stored in the repository doesn’t necessarily have to be used immediately for a specific purpose. It can be dumped as-is and used all together (or in parts) at a later stage as business needs arise. This flexibility, combined with the vast variety and amount of data stored, makes data lakes ideal for data experimentation as well as machine learning and advanced analytics applications within an enterprise.
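The "load first, use later" idea can be illustrated with a minimal Python sketch. The sample events, the `read_with_schema` helper and the field names are hypothetical and only illustrate the principle: raw records are stored untouched, and a schema is applied only when a use case reads them.

```python
import json

# Raw events are stored exactly as they arrived (hypothetical sample data).
RAW_EVENTS = [
    '{"user": "a1", "action": "click", "ts": "2022-03-01T10:00:00"}',
    '{"user": "b2", "action": "purchase", "amount": "19.99"}',
    '{"device": "sensor-7", "temp_c": 21.5}',
]


def read_with_schema(raw_lines, fields):
    """Apply a schema at read time.

    `fields` maps a field name to a function that casts its value.
    Records that lack any requested field are skipped, not altered.
    """
    rows = []
    for line in raw_lines:
        record = json.loads(line)
        if all(name in record for name in fields):
            rows.append({name: cast(record[name]) for name, cast in fields.items()})
    return rows


# A later use case decides it needs purchases with a numeric amount.
purchases = read_with_schema(RAW_EVENTS, {"user": str, "amount": float})
print(purchases)
```

Because the raw lines are never modified, a different use case can later read the same events with a different set of fields.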
Key benefits of having a data lake
Unlike data warehouses, which only store processed structured data (organized in rows and columns) for some predefined business intelligence/reporting applications, data lakes bring the potential to store everything with no limits. This could be structured data, semi-structured data, or even unstructured data such as images (.jpg) and videos (.mp4).
The benefits of a data lake for enterprises include the following:
- Expanded data types for storage: Since data lakes can store all data types, including those critical to advanced forms of analytics, organizations can use them to identify opportunities and insights that help improve operational efficiency, revenue, cost efficiency and risk management.
- Revenue growth from expanded data analytics: According to an Aberdeen survey, organizations that implemented a data lake outperformed competitors by 9% in organic revenue growth. These companies were able to perform new types of analytics on previously unusable and siloed data — log files, data from click-streams, social media and IoT devices — now centrally stored in the data lake.
- Unified data from silos: Data lakes can centralize information from disparate departmental silos, mainframes, and legacy systems, thereby offloading their individual capacity and preventing data duplication while increasing data usability when connected with the larger data structure. This helps formulate a 360-degree customer view for enterprises, which in turn helps improve customer targeting and marketing campaign orchestration. Unified data is also less expensive to store than siloed data.
- Omnichannel data orchestration: An organization can implement a data lake to ingest data from multiple sources, including IoT equipment sensors in factories and warehouses. These sources can be internal, customer-facing or both, and the lake unifies the data they produce. Customer-facing data enables marketing, sales and account management teams to orchestrate omnichannel campaigns using the most up-to-date and unified information available for each customer, whereas internal data is used for holistic employee and finance management strategies.
Architecture of a data lake: Storage and analysis process
Data lakes use a flat architecture, and can have many layers depending on technical and business requirements. No two data lakes are built exactly alike. However, there are some key zones through which the general data flows: the ingestion zone, landing zone, processing zone, refined data zone and consumption zone.
1. Data ingestion
This component, as the name suggests, connects a data lake to external relational and nonrelational sources — such as social media platforms and wearable devices — and loads raw structured, semi-structured and unstructured data into the platform. Ingestion is performed in batches or in real time, but it must be noted that a user may need different technologies to ingest different types of data.
Currently, all major cloud providers offer services that support low-latency data ingestion into a data lake, along with related storage and query capabilities. They include Amazon S3, AWS Glue, Amazon Kinesis, Amazon Athena, Google Cloud Dataflow, Google BigQuery, Azure Data Factory, Azure Databricks and Azure Functions.
2. Data landing
Once the ingestion completes, all the data is stored as-is with metadata tags and unique identifiers in the landing zone. As per Gartner, this is usually the largest zone in a data lake today (in terms of volume) and serves as an always-available repository of detailed source data, which can be used/reused for analytic and operational use cases as and when the need arises. The presence of raw source data also makes this zone an initial playground for data scientists and analysts, who experiment to define the purpose of the data.
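As a rough illustration of storing data as-is with metadata tags and unique identifiers, the following Python sketch writes each payload unchanged and adds a sidecar metadata file. The `land_object` function, the file naming scheme and the metadata fields are assumptions made for this example, not a standard layout.

```python
import hashlib
import json
import tempfile
import uuid
from datetime import datetime, timezone
from pathlib import Path


def land_object(landing_dir, payload, source, tags):
    """Write the payload bytes unchanged, plus a sidecar metadata file."""
    object_id = str(uuid.uuid4())  # unique identifier for this object
    landing_dir = Path(landing_dir)
    landing_dir.mkdir(parents=True, exist_ok=True)
    (landing_dir / f"{object_id}.raw").write_bytes(payload)
    metadata = {
        "id": object_id,
        "source": source,
        "tags": tags,
        "sha256": hashlib.sha256(payload).hexdigest(),
        "size_bytes": len(payload),
        "landed_at": datetime.now(timezone.utc).isoformat(),
    }
    (landing_dir / f"{object_id}.meta.json").write_text(json.dumps(metadata, indent=2))
    return metadata


# Use a temporary directory to stand in for the landing zone.
with tempfile.TemporaryDirectory() as tmp:
    payload = b'{"device": "sensor-7", "temp_c": 21.5}'
    meta = land_object(tmp, payload, source="iot", tags=["raw", "telemetry"])
    stored = (Path(tmp) / f"{meta['id']}.raw").read_bytes()
    print(stored == payload, meta["tags"])
```

The checksum in the sidecar file makes it possible to confirm later that the raw object has not changed since it landed.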
3. Data processing
When the purpose(s) of the data is known, copies of it move from the landing zone to the processing stage, where refinement, optimization, aggregation and quality standardization take place by imposing schemas. This zone makes the data analysis-worthy for various business use cases and reporting needs.
Notably, data copies are moved into this stage to ensure that the original arrival state of the data is preserved in the landing zone for future use. For instance, if new business questions or use cases arise, the source data could be explored and repurposed in different ways, without the bias of previous optimizations.
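The copy-then-process pattern can be sketched in a few lines of Python. The sample records and the `process` function are hypothetical; the point is only that standardization runs on copies, so the landing-zone records keep their original arrival state.

```python
import copy

# Landing-zone records, kept exactly as they arrived (hypothetical sample data).
landing_zone = [
    {"customer": " Alice ", "amount": "10.50", "currency": "usd"},
    {"customer": "BOB", "amount": "3", "currency": "USD"},
]


def process(records):
    """Return standardized copies; the input records are never modified."""
    processed = []
    for record in copy.deepcopy(records):
        record["customer"] = record["customer"].strip().title()
        record["amount"] = float(record["amount"])  # impose a numeric type
        record["currency"] = record["currency"].upper()
        processed.append(record)
    return processed


processing_zone = process(landing_zone)
print(processing_zone[0])
print(landing_zone[0])  # still in its original arrival state
```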
4. Refined data zone
When the data is processed, it moves into the refined data zone, where data scientists and analysts set up their own data science and staging zones to serve as sandboxes for specific analytic projects. Here, they control the processing of the data to repurpose raw data into structures and quality states that could enable analysis or feature engineering.
5. Consumption zone
The consumption zone is the last stage of the general data flow within a data lake architecture. In this layer, the results and business insights from analytic projects are made available to the targeted users, whether technical decision-makers or business analysts, through analytic consumption tools and SQL and non-SQL query capabilities.
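To make SQL-based consumption concrete, here is a small sketch that uses Python's built-in `sqlite3` module as a stand-in for whatever query engine a real data lake exposes. The `orders` table and its rows are invented for illustration.

```python
import sqlite3

# Refined rows (hypothetical) loaded into an in-memory SQLite table,
# which stands in for the lake's real query engine.
refined_orders = [("north", 120.0), ("south", 80.0), ("north", 30.0)]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (region TEXT, amount REAL)")
conn.executemany("INSERT INTO orders VALUES (?, ?)", refined_orders)

# A business analyst's question: what is the revenue per region?
revenue_by_region = conn.execute(
    "SELECT region, SUM(amount) FROM orders GROUP BY region ORDER BY region"
).fetchall()
conn.close()

print(revenue_by_region)
```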
Data lake challenges
Over the years, cloud data lake and warehousing architectures have helped enterprises scale their data management efforts while lowering costs. However, the current set-up has some challenges, such as:
- Lack of consistency with warehouses: Companies may often find it difficult to keep their data lake and data warehouse architecture consistent. It is not just a costly affair; teams also need to employ continuous data engineering tactics to ETL/ELT data between the two systems. Each step can introduce failures and unwanted bugs, affecting the overall data quality.
- Vendor lock-in: Shifting large volumes of data into a centralized enterprise data warehouse (EDW) is challenging for companies, not only because of the time and resources required to execute such a task, but also because this architecture creates a closed loop, causing vendor lock-in.
- Data governance: While the data in a data lake tends to be in a variety of file-based formats, data in a warehouse is mostly in database format. This difference adds complexity to data governance and lineage management across the two storage types.
- Data copies and associated costs: Keeping data in both data lakes and data warehouses creates multiple copies of the data, each with associated costs. Moreover, commercial warehouse data in proprietary formats increases the cost of migrating data.
A data lakehouse addresses these typical limitations of data lake and data warehouse architectures by combining the best elements of both to deliver significant value for organizations.
Data lake security: 6 best practices for enterprises in 2022
1. Identify data goals
In order to prevent your data lake from becoming a data swamp, it is recommended to identify your organization’s data goals — the business outcomes — and appoint an internal or external data curator who could assess new sources/datasets and govern what goes into the data lake based on those goals. Clarity on what type of data has to be collected can help an organization dodge the problem of data redundancy, which often skews analytics.
2. Document incoming data
All incoming data should be documented as it is ingested into the lake. The documentation usually takes the form of technical metadata and business metadata, although new forms of documentation are also emerging. Without proper documentation, a data lake deteriorates into a data swamp that is difficult to use, govern, optimize and trust, and users fail to discover the data they need.
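One way to picture documentation at ingestion time is a small catalog that records technical and business metadata for each dataset and refuses entries without an owner or description. The `CatalogEntry` and `Catalog` classes below are hypothetical and purely illustrative; real deployments would typically rely on a dedicated catalog service.

```python
from dataclasses import dataclass, field


@dataclass
class CatalogEntry:
    path: str                 # where the dataset sits in the lake
    file_format: str          # technical metadata
    columns: list             # technical metadata
    owner: str                # business metadata
    description: str          # business metadata
    keywords: list = field(default_factory=list)


class Catalog:
    """In-memory catalog that refuses undocumented datasets."""

    def __init__(self):
        self._entries = {}

    def register(self, entry):
        if not entry.owner.strip() or not entry.description.strip():
            raise ValueError(f"{entry.path}: owner and description are required")
        self._entries[entry.path] = entry

    def search(self, term):
        """Return the paths whose description or keywords match the term."""
        term = term.lower()
        return sorted(
            entry.path
            for entry in self._entries.values()
            if term in entry.description.lower()
            or term in (keyword.lower() for keyword in entry.keywords)
        )


catalog = Catalog()
catalog.register(CatalogEntry(
    path="landing/streaming/clicks",
    file_format="json",
    columns=["user", "url", "ts"],
    owner="web-analytics",
    description="Raw click-stream events from the public website",
    keywords=["clickstream", "web"],
))
catalog.register(CatalogEntry(
    path="landing/batch/crm_export",
    file_format="csv",
    columns=["customer_id", "segment"],
    owner="sales-ops",
    description="Nightly CRM export of customer segments",
    keywords=["crm", "customer"],
))
print(catalog.search("customer"))
```

Rejecting undocumented entries at registration time is a deliberate choice in this sketch: it keeps discovery reliable by making documentation a precondition for entering the lake.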
3. Maintain quick ingestion time
The ingestion process should run as quickly as possible. Skipping data improvements and transformations before loading increases ingestion speed, as does adopting newer data integration methods for pipelining and orchestration. This makes data available as soon as possible after it is created or updated, so that some forms of reporting and analytics can operate on it.
4. Process data in moderation
The main goal of a data lake is to provide detailed source data for data exploration, discovery and analytics. If an enterprise processes the ingested data with heavy aggregation, standardization and transformation, then many of the details captured with the original data will get lost, defeating the whole purpose of the data lake. So, an enterprise should make sure to apply data quality remediations in moderation while processing.
5. Focus on subzones
Individual data zones in the lake can be organized by creating internal subzones. For instance, a landing zone can have two or more subzones depending on the data source (batch/streaming). Similarly, the data science zone under the refined datasets layer can include subzones for analytics sandboxes, data laboratories, test datasets, learning data and training, while the staging zone for data warehousing may have subzones that map to data structures or subject areas in the target data warehouse (e.g., dimensions, metrics and rows for reporting tables).
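A subzone layout like this is often expressed as a path convention. The sketch below is one hypothetical way to encode it in Python; the zone names, the `lake` root and the `subzone_path` helper are assumptions for illustration, not a prescribed structure.

```python
from pathlib import PurePosixPath

# Hypothetical zone-to-subzone layout, loosely following the examples above.
ZONES = {
    "landing": ["batch", "streaming"],
    "refined/data_science": ["sandboxes", "laboratories", "test_datasets", "training"],
    "refined/staging": ["dimensions", "metrics"],
}


def subzone_path(zone, subzone, dataset):
    """Build the storage path for a dataset, rejecting unknown zones or subzones."""
    if zone not in ZONES:
        raise KeyError(f"unknown zone: {zone}")
    if subzone not in ZONES[zone]:
        raise ValueError(f"unknown subzone {subzone!r} in zone {zone!r}")
    return str(PurePosixPath("lake") / zone / subzone / dataset)


print(subzone_path("landing", "streaming", "clicks"))
print(subzone_path("refined/data_science", "sandboxes", "churn_model"))
```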
6. Prioritize data security
Security has to be maintained across all zones of the data lake, starting from landing to consumption. To ensure this, connect with your vendors and see what they are doing in these four areas: user authentication, user authorization, data-in-motion encryption and data-at-rest encryption. With these elements, an enterprise can keep its data lake actively and securely managed, without the risk of external or internal leaks (due to misconfigured permissions and other factors).
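Of the four areas, user authorization is the easiest to sketch in code. The example below shows a deny-by-default permission check per zone; the roles, zone names and `is_allowed` helper are hypothetical, and a production data lake would normally rely on the access controls of its platform or vendor rather than on application code like this.

```python
# Hypothetical role-to-zone permissions; anything not listed is denied.
ZONE_PERMISSIONS = {
    "data_engineer": {"landing": {"read", "write"}, "processing": {"read", "write"}},
    "data_scientist": {"landing": {"read"}, "refined": {"read", "write"}},
    "business_analyst": {"consumption": {"read"}},
}


def is_allowed(role, zone, action):
    """Deny by default: allow only actions explicitly granted to the role."""
    return action in ZONE_PERMISSIONS.get(role, {}).get(zone, set())


print(is_allowed("data_scientist", "landing", "read"))
print(is_allowed("business_analyst", "landing", "read"))
```

Denying by default means that a missing or misconfigured entry results in no access rather than accidental exposure.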