The amount of data being gathered and generated by enterprises and organizations is constantly increasing. Data Lakes evolved as a means to handle and store this ever-growing deluge of data. In enterprises and large organizations, data can easily become fragmented between various departments and teams.
A Data Lake is a storage repository that can centralize and store vast amounts of raw data in its native format. The data can be structured, semi-structured, or unstructured. The structure and requirements of the data are not defined until the data is needed, at read time (a principle commonly known as schema-on-read).
This means that Data Lakes create a future-proof environment for raw data, unconstrained and unfiltered by traditional, strict database rules and relations at write time. The ingested raw data is always there and can be re-interpreted and analyzed as needed.
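To make schema-on-read concrete, here is a minimal Python sketch. The file path, field names, and types are hypothetical; the point is that interpretation happens when the data is read, not when it is written:

```python
import json
import pandas as pd

# Raw events were landed exactly as they arrived, e.g. one JSON object
# per line (JSON Lines). No schema was enforced at write time.
with open("raw/clickstream/2023/07/01/events.jsonl") as f:
    records = [json.loads(line) for line in f]

# Interpretation happens here, at read time: pick fields, coerce types,
# and tolerate missing or malformed values.
df = pd.DataFrame.from_records(records)
df["timestamp"] = pd.to_datetime(df["timestamp"], errors="coerce")
df["user_id"] = df["user_id"].astype("string")

# Another team can re-read the same raw files tomorrow with a
# completely different schema, without any re-ingestion.
```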
Some experts think of Data Lakes as a replacement for Data Warehouses, while others see them as a staging area for filtering and feeding data into existing Data Warehouse solutions, or as a place to store data backups from Data Warehouses and databases.
It’s important to note that Data Lake architecture varies widely from application to application, and architectural decisions are always subject to technical and business requirements. The Data Lake architecture presented in this article is meant to demonstrate a common-case prototype; it is far from covering the multitude of applications of modern Data Lakes.
Data processing in Data Lakes can be loosely organized in the following conceptual model:
The Ingestion Layer is tasked with ingesting raw data into the Data Lake. Modification of raw data is prohibited. Raw data can be ingested in batches or in real time, and it is organized in a logical folder structure. The Ingestion Layer can accommodate data from many different external sources, and one of its advantages is that it can quickly ingest almost any type of data from almost any system, as the sketch below illustrates.
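As an illustration, here is a minimal batch-ingestion sketch in Python using boto3 against Amazon S3. The bucket name, folder layout, and payloads are all hypothetical, and any other object store or file system would work the same way. The essentials are that the payload is stored unmodified and that the key encodes a logical folder structure:

```python
import datetime as dt
import boto3  # assumes AWS credentials are already configured

s3 = boto3.client("s3")

def ingest_raw(payload: bytes, source: str,
               bucket: str = "example-data-lake") -> str:
    """Land a raw payload unmodified, under a date-partitioned key."""
    now = dt.datetime.utcnow()
    key = (
        f"raw/{source}/year={now:%Y}/month={now:%m}/day={now:%d}/"
        f"{now:%H%M%S%f}.json"
    )
    s3.put_object(Bucket=bucket, Key=key, Body=payload)
    return key

# Example batch: land three sensor readings exactly as they arrived.
for reading in [b'{"t": 21.4}', b'{"t": 21.6}', b'{"t": 21.5}']:
    ingest_raw(reading, source="iot-sensors")
```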
The Distillation Layer converts the data stored by the Ingestion Layer to structured data for further analysis. In this layer, raw data is interpreted and transformed into structured data sets and subsequently stored as files or tables. The data is cleansed, denormalized, and derived at this stage, and then becomes uniform in terms of encoding, format, and data type.
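A distillation step might look like the following PySpark sketch. The paths, column names, and cleansing rules are assumptions for illustration; the pattern is raw data in, cleansed and uniformly typed columnar data out:

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("distillation").getOrCreate()

# Read the raw JSON files landed by the ingestion layer (paths are illustrative).
raw = spark.read.json("s3a://example-data-lake/raw/iot-sensors/")

# Cleanse and normalize: drop incomplete rows, unify types, derive columns.
distilled = (
    raw.dropna(subset=["device_id", "t"])
       .withColumn("temperature_c", F.col("t").cast("double"))
       .withColumn("reading_date", F.to_date(F.col("timestamp")))
       .drop("t")
)

# Persist as columnar Parquet files for the processing layer.
(distilled.write.mode("append")
          .partitionBy("reading_date")
          .parquet("s3a://example-data-lake/distilled/iot_readings/"))
```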
The Processing Layer runs user queries and advanced analytical tools on structured data. Processes can be run in real-time, as a batch, or interactively. Business logic is applied in this layer and data is consumed by analytical applications. This layer is also known as trusted, gold, or production-ready.
The Insights Layer is the output interface, or query interface, of the Data Lake. It uses SQL or NoSQL queries to request data and output it in reports or dashboards.
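For example, a report-style query over the structured data could look like the sketch below, which uses DuckDB as a stand-in for whatever SQL engine a given Data Lake exposes. The table path and column names carry over from the hypothetical distillation example above:

```python
import duckdb  # an embedded engine; any SQL engine over Parquet works similarly

con = duckdb.connect()

# The Insights Layer issues SQL against the structured (distilled) data.
report = con.execute("""
    SELECT reading_date,
           AVG(temperature_c) AS avg_temp_c,
           COUNT(*)           AS readings
    FROM read_parquet('distilled/iot_readings/**/*.parquet',
                      hive_partitioning = true)
    GROUP BY reading_date
    ORDER BY reading_date
""").fetchdf()

print(report)  # hand this frame to a report or dashboard
```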
The Unified Operations Layer performs system monitoring and manages the system using workflow management, auditing, and proficiency management.
In some Data Lake implementations, a Sandbox Layer is included as well. As the name suggests, this layer is a place for data exploration by data scientists and advanced analysts. The sandbox layer is also referred to as the Exploration Layer or Data Science Layer.
Data Lakes rely on big data storage and take advantage of its high reliability, scalability, and uptime. The main requirement for Data Lake storage is the ability to store vast amounts of data at a low cost.
Using cloud storage has the advantage of scalability at a comparatively low cost. On-premise Data Lake implementations can also be used, especially if the required big data hardware infrastructure is already in place.
Modern Data Lake architecture separates the physical storage layer from the computing layer, making them independently scalable to meet individual needs. Data Lakes traditionally relied on the Hadoop Distributed File System (HDFS) with Apache ORC or Parquet columnar file formats, but there is a general migration toward cloud-native storage such as Amazon S3 and Azure Data Lake Storage.
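Columnar formats are a large part of what makes decoupled storage and compute economical: the engine reads only the columns a query needs. Here is a small pyarrow sketch, with paths following the earlier hypothetical layout:

```python
import pyarrow.dataset as ds

# Open the partitioned Parquet data set written by the distillation layer.
dataset = ds.dataset(
    "distilled/iot_readings/",
    format="parquet",
    partitioning="hive",  # directories like reading_date=2023-07-01/
)

# Only two columns are actually read from storage; everything else in
# the files is skipped, which keeps scans cheap at scale.
table = dataset.to_table(columns=["reading_date", "temperature_c"])
print(table.num_rows)
```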
IBM offers the helpful “5 V’s of Big Data” to demonstrate the most important dimensions of stored data: Volume, Velocity, Variety, Veracity, and Value.
Security should be implemented in all layers of the Data Lake, with the traditional intent of restricting access to data: only authorized users and services are permitted. Data Lake security is accomplished by employing methods such as authentication, authorization, data encryption, and access auditing.
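As one example, on an S3-backed Data Lake two of these methods (encryption at rest and blocking public access) can be switched on with boto3, as sketched below. The bucket name is hypothetical, and a real deployment would layer IAM policies, auditing, and network controls on top:

```python
import boto3  # assumes credentials with administrative permissions

s3 = boto3.client("s3")
bucket = "example-data-lake"  # hypothetical bucket name

# Encrypt all newly stored objects at rest by default.
s3.put_bucket_encryption(
    Bucket=bucket,
    ServerSideEncryptionConfiguration={
        "Rules": [
            {"ApplyServerSideEncryptionByDefault": {"SSEAlgorithm": "aws:kms"}}
        ]
    },
)

# Deny all public access; grants then flow only through IAM policies.
s3.put_public_access_block(
    Bucket=bucket,
    PublicAccessBlockConfiguration={
        "BlockPublicAcls": True,
        "IgnorePublicAcls": True,
        "BlockPublicPolicy": True,
        "RestrictPublicBuckets": True,
    },
)
```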
One of the challenges in Data Lake security is handling sensitive or confidential personal data and adhering to legal requirements regarding the way this data can be collected, stored, and used. In global enterprises, this is even more challenging due to the necessity of complying with regulatory frameworks in different countries and regions, such as HIPAA in the US, the GDPR in the EU, or the global PCI DSS security standard.
The data analysis paradigm in Data Lakes is often described as a top-down approach in comparison to traditional database systems: raw data is ingested and stored first, then analyzed, and its structure is defined only at the end, when the data is read.
This approach saves a lot of the upfront work that usually goes into creating the data structure, thus allowing fast ingestion and storage of data. Moving the structuring of data to the last step is helpful in situations where the structure itself is hard to define and subject to change or different interpretations.
Data Lake management deals with the challenges of monitoring and logging the transformations of data as it moves through different layers of the Data Lake. All actions performed on the data are logged, as well as all user actions that led up to them.
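A lightweight way to obtain such an audit trail is to wrap each transformation step in a logging decorator. The sketch below is a hypothetical convention, not a standard tool; production deployments typically rely on dedicated lineage and workflow systems instead:

```python
import functools
import json
import logging
import time

audit = logging.getLogger("datalake.audit")
logging.basicConfig(level=logging.INFO)

def audited(layer: str):
    """Log every transformation: which layer, which step, how long it took."""
    def wrap(fn):
        @functools.wraps(fn)
        def inner(*args, **kwargs):
            start = time.time()
            result = fn(*args, **kwargs)
            audit.info(json.dumps({
                "layer": layer,
                "step": fn.__name__,
                "duration_s": round(time.time() - start, 3),
            }))
            return result
        return inner
    return wrap

@audited("distillation")
def cleanse(batch):
    # Placeholder transformation: drop records without a device_id.
    return [r for r in batch if r.get("device_id")]

cleanse([{"device_id": "a1", "t": 21.4}, {"t": 21.5}])
```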
Metadata is data describing data. Ingestion of raw data without applying detailed metadata should not be allowed: a Data Lake can quickly turn into a Data Swamp when you are unable to locate data. On the other hand, being too strict with metadata can mean that no data gets ingested at all, and you end up with a data desert.
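There are many ways to enforce this. One minimal convention, sketched below, is to refuse to land an object unless a metadata record is written alongside it. The fields and bucket name are hypothetical, and real deployments typically use a proper catalog (for example, AWS Glue or Apache Atlas) rather than sidecar files:

```python
import datetime as dt
import json
import boto3  # assumes AWS credentials are already configured

s3 = boto3.client("s3")

def land_with_metadata(payload: bytes, key: str, source: str,
                       owner: str, bucket: str = "example-data-lake") -> None:
    """Refuse to ingest data unless its metadata record is written too."""
    if not (source and owner):
        raise ValueError("metadata is mandatory; refusing to create a swamp")
    meta = {
        "key": key,
        "source": source,
        "owner": owner,
        "ingested_at": dt.datetime.utcnow().isoformat(),
    }
    s3.put_object(Bucket=bucket, Key=key, Body=payload)
    s3.put_object(Bucket=bucket, Key=key + ".meta.json",
                  Body=json.dumps(meta).encode())
```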
A Data Lake team is essentially a Data Science (DS) team. Depending on the size of the company and the volume of big data, DS teams are custom-built for specific business tasks, but their roles and responsibilities in a Data Lake architecture tend to follow a similar pattern.
Bear in mind that many of the required skills intersect, so an individual could combine multiple roles in a functional team.
Most businesses go through the following stages of development when building and integrating Data Lakes within their existing business architecture:
[Figure: Stages of Data Lake implementation]
For the sake of brevity, the list of stages is limited to the bare essentials.
Software and cloud vendors have developed several software stacks for Data Lake implementation. A few of the more popular ones include AWS Lake Formation, Azure Data Lake Storage with Azure Synapse Analytics, Google Cloud Storage with BigQuery, and Databricks.
A Data Lake is a secure, robust, and centralized storage platform that lets you ingest, store, and process structured and unstructured data. Raw data assets are kept intact, while data exploration, analytics, machine learning, reporting, and visualization are performed on the data and tweaked as needed. This means raw data can be reused and repurposed at a later date, without much hassle.
Although many proponents and vendors may make bold promises, Data Lake architecture will never remove the need for traditional databases, nor replace them; it is simply not envisioned or designed to do that. Most daily business operations will continue to rely on traditional database systems, because repetitive and strictly defined tasks, such as sales, invoicing, inventory, and banking transactions, are a perfect fit for them. Instead, Data Lakes work in conjunction with traditional databases to generate more value from the data an organization already has, surfacing new insights and discovering new information in existing data.
Early implementations of Data Lakes were plagued by the fact that the architecture was designed by data scientists for data scientists. Setting up all the different components and tools required highly qualified data engineers, and mining and analyzing data from Data Lakes faced the same challenge, as it was mostly code-based and required specialized talent. Of course, this was not an issue for the huge tech companies that dominate the big data space, thanks to their large pools of skilled software engineers and data scientists. However, newer solutions, such as integrated, turnkey Data Lake platforms and GUI-based interfaces in place of code-based control, could make it much easier for companies to implement and use Data Lakes.
In the future, Data Lake architecture and logic could be used and integrated with large document management systems, various digital archives, public records, health care records, scientific research datasets, and so on.