Data Lake vs Data Warehouse: What’s the Difference Between Them?
Although we have already outlined the differences between the data lake concept and the far more widespread data warehouse several times, here once more are the main points of differentiation between the two principles:
- Data types: Data lakes process all kinds of data types, whether structured or unstructured, whether images, sound or tables. In contrast, data warehouses are limited to structured data.
- Data pipelines: Data warehouses use the ETL principle (Extract-Transform-Load), which fits data from the source systems to a given data model before it is fed into the warehouse. Data lakes, on the other hand, use the ELT principle (Extract-Load-Transform): the data is loaded into the lake directly in its original, raw form.
- Information content: Following directly from the previous point, the information content of a data warehouse necessarily shrinks, since every transformation discards some data. In the lake, on the other hand, all data, and thus the entire information content, is preserved. This is particularly relevant for data scientists working with machine learning or deep learning, where as many attributes as possible should be available.
- Self-service: Self-service analysis is much easier in the DWH, since the data is already structured and prepared for processing. The data lake, on the other hand, offers self-service access to the raw data itself, but each access generally requires more expertise.
- Flexibility: Regardless of whether it is data sources, attributes or applications – a data lake is designed to handle new data with great flexibility. A data warehouse, on the other hand, has to go through the various preparation stages before new data can be integrated.
- Maintenance requirement: While the initial setup effort of a data warehouse can be off-putting, it requires comparatively little subsequent effort for documentation and data quality. The data lake, on the other hand, requires a great deal of data governance and data management so that it does not degenerate into a data swamp.
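The ETL/ELT distinction from the list above can be made concrete with a small Python sketch. The records, field names and `transform` schema here are invented for illustration; they do not correspond to any specific product.

```python
# Minimal sketch of ETL vs ELT, using in-memory dicts as stand-ins
# for a source system, a data warehouse and a data lake.
# All names here are illustrative, not a specific product API.

source_record = {"id": 7, "name": "  Alice ", "notes": "free text", "age": "42"}

def transform(record):
    """Fit the record to the warehouse schema: only known columns, typed."""
    return {"id": record["id"], "name": record["name"].strip(), "age": int(record["age"])}

# ETL (warehouse): transform BEFORE loading -> the 'notes' attribute is lost.
warehouse = [transform(source_record)]

# ELT (lake): load the raw record first, transform later on demand.
lake_raw_zone = [dict(source_record)]          # full information content preserved
lake_curated = [transform(lake_raw_zone[0])]   # derived view, raw data still intact

print("notes" in warehouse[0])       # False - dropped by ETL
print("notes" in lake_raw_zone[0])   # True  - still available for later use
```

The point is exactly the "information content" item above: once the warehouse transformation has run, the dropped attribute is gone, while the lake keeps it for any future use.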
Use cases as examples for the use of a data lake
We have now shown in some detail what the concept of a data lake looks like and what advantages and disadvantages it brings with it.
The most important question, however, is: Why should an organization establish a data lake at all? Here we briefly present several use cases that contribute directly to the idea of the data lake.
ANALYTICS, DATA SCIENCE, MACHINE LEARNING
The origin, and probably still the main area of application, of a data lake is its use for data analysis, data science and, above all, machine learning.
The idea is to have access to as much data as possible, in as raw a form as possible, so that it can then be exploited to maximum effect.
Combining structured and unstructured data in a machine learning model is a good example. In a classic data warehouse, some of that data would simply not be available, so additional systems would have to be used.
These systems might run in a completely different environment, require different access rights and different staff for maintenance. The clear advantage of a data lake is that it brings all of these aspects together in one place.
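What "combining structured and unstructured data in one model" means in practice can be sketched as feature construction. The vocabulary, field names and record below are made-up assumptions for the example, not a real dataset.

```python
# Illustrative sketch: building one feature vector from structured fields
# plus an unstructured text attribute, the way a machine learning model
# fed from a data lake could. Vocabulary and fields are invented.

VOCAB = ["delay", "refund", "broken"]  # tiny fixed vocabulary (assumption)

def featurize(record):
    structured = [float(record["age"]), float(record["orders"])]
    words = record["ticket_text"].lower().split()
    bag_of_words = [float(words.count(term)) for term in VOCAB]
    return structured + bag_of_words   # one vector from both data types

row = {"age": 31, "orders": 4, "ticket_text": "Refund please, package broken"}
print(featurize(row))  # [31.0, 4.0, 0.0, 1.0, 1.0]
```

In a warehouse-only setup, the free-text `ticket_text` attribute would typically never have been loaded, so this combined vector could not be built from one system.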
API MANAGEMENT AS THE BASIS FOR DATA AVAILABILITY
Companies are often so busy replicating data into a data lake that little thought is given to making it available again. Direct database access or system-dependent APIs are usually the solutions of choice.
If, on the other hand, you think a step further, it makes perfect sense to provide a comprehensive API for the stored data.
From this perspective, data analysis is just one of many uses of the data in the lake. Others include extended data pipelines to other systems, the extraction of data from a legacy system, or the provision of data to channels such as a website.
If all of this is handled through an API, systems and databases can be exchanged quickly and easily without affecting the other parties in the data pipeline.
INTERNET OF THINGS AND DATA STREAMING
Another example where data lakes shine is the Internet of Things (IoT). Data streaming requires the ability to continuously store new data and, if necessary, analyze it directly.
In a data warehouse based on batch processing, this is hardly possible because the data pipelines are simply not designed for it. In a data lake, on the other hand, you “only” have to add a database suited to high-frequency writes and stream processing capabilities – and you are ready for the Internet of Things.
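The difference to batch processing is that values are handled one by one as they arrive. A minimal stand-in for such stream processing, with simulated sensor values, might look like this:

```python
# Minimal stand-in for stream processing on IoT data: consume readings
# one at a time and maintain a sliding-window average, instead of
# waiting for a nightly batch. The sensor values are simulated.
from collections import deque

def rolling_average(stream, window=3):
    buf = deque(maxlen=window)        # keep only the last `window` values
    for value in stream:
        buf.append(value)
        yield sum(buf) / len(buf)     # result available immediately

readings = [10.0, 12.0, 14.0, 20.0]   # simulated sensor stream
averages = list(rolling_average(readings))
print(averages[:3])  # [10.0, 11.0, 12.0]
```

A real setup would consume from a streaming service (e.g. Kafka) rather than a list, but the shape — continuous input, immediate derived output — is the same.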
UNSTRUCTURED DATA IS BECOMING MORE AND MORE IMPORTANT
Another classic use case, and a direct characteristic of big data, is unstructured data. We have arrived in an age in which it is no longer just about structured, recorded information; the range of possible “sensors” for data acquisition in particular holds enormous potential.
This is why there are more and more efforts to capture, store and make available unstructured data. Whether image or sound, text or video – there is enough unstructured data to make a close look at the data lake worthwhile for this purpose as well.
Examples of a data lake architecture
We now know what a data lake is, and we have explained why a data lake makes sense – the how remains. Data lake infrastructure is a very broad, general field: there is no single solution; rather, the architecture adapts the idea to the company.
Nevertheless, we would like to show examples of a data lake infrastructure or architecture and, in the case of AWS and Azure, also provide specific services.
TEMPLATE OF THE COMPONENTS OF A DATA LAKE
A data lake usually has several layers. Right at the front, i.e. at the origin, are the source systems. A source system can be an ERP, a CRM, a simple text file, an existing database or a stream.
The data ingestion process is used to integrate these source systems into the data lake. The data is transferred to the data lake using ELT, possibly also ETL. Simply put, data pipelines – scripts or tools – are used to transfer the data from the source to a database or file system.
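The ingestion step described here — moving data from a source unchanged into the lake's storage — can be sketched with a simple file-based pipeline. The `raw/<dataset>/<date>/` partitioning scheme and all paths are assumptions for the example.

```python
# Sketch of a simple ingestion pipeline: copy a source file unchanged
# (ELT) into a date-partitioned raw zone of a file-based lake.
import shutil
import tempfile
from datetime import date
from pathlib import Path

def ingest(source_file, lake_root, ingestion_date=None):
    """Copy one source file, untransformed, into raw/<dataset>/<date>/."""
    day = ingestion_date or date.today()
    target_dir = Path(lake_root) / "raw" / source_file.stem / day.isoformat()
    target_dir.mkdir(parents=True, exist_ok=True)
    target = target_dir / source_file.name
    shutil.copy2(source_file, target)   # raw copy, no transformation
    return target

# Demo against temporary directories standing in for source and lake.
with tempfile.TemporaryDirectory() as tmp:
    src = Path(tmp) / "orders.csv"
    src.write_text("id,amount\n1,9.99\n")
    landed = ingest(src, Path(tmp) / "lake", date(2023, 5, 1))
    print(landed.name)  # orders.csv
```

A production pipeline would add scheduling, monitoring and schema registration, but the core of ELT ingestion really is this small: land the data first, transform later.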
This is already the next layer: data storage. The storage of data is obviously the heart of the data lake and therefore also the area with the greatest variance.
Usually, a distinction is made between structured and unstructured data, at least in the storage layer, and questions about cloud or on-premise or streaming architectures are often also discussed.
Basically, this layer in data lakes always consists of raw data acquisition (“Raw Zone” or “Landing Zone”) and further distillation steps (for example, aggregations or a data warehouse).
Once the data has been recorded, it is time to process it. This is usually the processing layer: the application of transformations, consolidations or analyses such as the training of machine learning models.
All of this happens on the data lake, but since the results are fed back directly into the lake, they can also be seen as part of it. Seamless integration is essential here.
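The feedback loop described here — results of the processing layer flowing back into the lake — can be illustrated with a tiny aggregation. The zone names (`raw/sales`, `curated`) and figures are invented for the sketch.

```python
# Sketch of the processing layer feeding results back into the lake:
# read records from the raw zone, aggregate them, and store the result
# in a curated zone next to the raw data. Zone names are assumptions.

lake = {
    "raw/sales": [
        {"region": "north", "amount": 120.0},
        {"region": "north", "amount": 80.0},
        {"region": "south", "amount": 50.0},
    ],
    "curated": {},
}

def aggregate_sales(records):
    totals = {}
    for r in records:
        totals[r["region"]] = totals.get(r["region"], 0.0) + r["amount"]
    return totals

# The result is written back into the lake, not shipped off to a silo.
lake["curated"]["sales_by_region"] = aggregate_sales(lake["raw/sales"])
print(lake["curated"]["sales_by_region"])  # {'north': 200.0, 'south': 50.0}
```

Because the curated result lives next to the raw data, downstream consumers and later models can pick either level of distillation from the same place.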
Ultimately, it is important to deliver the results, or simply the recorded data. This delivery layer (also: serving layer or deployment layer) makes data, machine learning models and the like available to other applications, systems or channels, so that the added value generated can actually be used.
EXAMPLE OF A DATA LAKE ON AWS
The cloud components of Amazon Web Services (AWS) start with AWS Glue, Amazon’s ETL service. This allows you to cover the data ingestion process but also other aspects of the data governance environment.
AWS offers a range of solutions for data storage: for example, Amazon Redshift for SQL, DynamoDB for NoSQL, or the all-rounder Amazon S3 for unstructured data.
To process the data, AWS provides services such as Amazon EC2 or AWS Lambda, following the microservices idea, and machine learning and the like can be carried out with components such as Amazon SageMaker.
The deployment layer is handled on AWS via services such as Amazon QuickSight for visualization or Amazon API Gateway for offering APIs. At the centre of containerization are Amazon Elastic Container Service (ECS) and Amazon Elastic Kubernetes Service (EKS).
EXAMPLE OF A DATA LAKE ON AZURE
On Microsoft Azure, the components with which you can put together a general data lake are relatively well integrated. In the beginning, there is usually the Azure Data Factory to handle the data ingest.
Data Factory can connect to many interfaces by means of connectors and thus continuously feed in the data.
Azure offers many options for the storage components: a classic Azure SQL Database, Azure Blob Storage for unstructured data, or Azure Cosmos DB for high-frequency data such as streaming.
In the processing area, Azure Analysis Services or Azure Machine Learning are available. Azure Databricks, which covers both data engineering and modelling on Apache Spark, is also an option.
For the deployment, you can then use, for example, Microsoft's Power BI for visualization, Azure Kubernetes Service (AKS) for containerization, and Azure API Management for the creation and management of APIs.
As you can see, Azure offers an extensive library of services to fully represent a data lake and to integrate it into other processes.
EXAMPLE OF AN ON-PREMISE DATA LAKE
While the components of cloud-based data lakes are relatively clear, there are, of course, no limits for on-premise implementations. As a simple example, an ETL service such as Talend or a streaming service such as Kafka can be used for data ingestion.
This is followed by a range of databases such as Oracle or Microsoft SQL DBs, MongoDB for NoSQL or, traditionally, the Hadoop cluster for unstructured data to cover the storage components.
The processing layer is almost infinitely variable: simple Python scripts running on a virtual machine, data mining tools such as KNIME, or a Spark infrastructure for larger amounts of data – all just examples in a very large landscape.
The data and analyses can then be made available, for example, via visualization software such as Tableau, Docker containerization or – and this is where it gets exciting – self-implemented REST APIs.
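A self-implemented REST API over lake data can be sketched with nothing but the Python standard library. The endpoint layout and the `DATASETS` payload below are invented; a production version would add authentication, pagination and error handling.

```python
# Hedged sketch of a self-implemented REST API over lake data, using
# only the Python standard library. Endpoint and payload are invented.
import json
import threading
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer

DATASETS = {"sensors": [{"id": 1, "temp": 21.5}]}  # stand-in for the lake

class LakeHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        name = self.path.strip("/").split("/")[-1]
        body = json.dumps(DATASETS.get(name, [])).encode()
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.end_headers()
        self.wfile.write(body)
    def log_message(self, *args):      # keep the demo quiet
        pass

server = HTTPServer(("127.0.0.1", 0), LakeHandler)  # port 0: pick a free port
threading.Thread(target=server.serve_forever, daemon=True).start()

url = f"http://127.0.0.1:{server.server_port}/data/sensors"
with urllib.request.urlopen(url) as resp:
    payload = json.loads(resp.read())
server.shutdown()
print(payload)  # [{'id': 1, 'temp': 21.5}]
```

In practice one would reach for a framework such as Flask or FastAPI, but the point stands: on-premise, the delivery layer is whatever you build.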
As you can see, in on-premise data lake architectures, due to the incredible freedom, some things have to be planned much more extensively than in the cloud environment, where there are almost always fixed components for the various layers.
This results, on the one hand, in great breadth and individual configuration of the in-house data lake and, on the other hand, in very high demands on technology, IT and DevOps.