What Is A Data Lake? Definition, Benefits, Dangers, Role In A Data-Driven Company
The data lake is a newer concept for collecting, storing and processing data. In this article, we define the data lake, discuss its advantages and disadvantages, go into specific use cases, and show what the infrastructure or architecture of a data lake can look like.
Simply put: what is a data lake?
A data lake is a combination of different technologies used to jointly manage a wide range of data types. Put simply, it is about avoiding the time-consuming preparation required by classic database systems on the one hand, and also simply storing unstructured and unprocessed data on the other.
The easiest way to explain the concept of a data lake is to say that it behaves like a hard drive on a computer. All kinds of data can be stored and managed on the hard drive.
Even data that was previously neither known nor processed can be stored there. You can easily store images and videos, but also unprocessed structured data such as CSV files, for later use.
Classic databases, in this analogy, correspond to structured files, for example an Excel file. The structure is predefined, the content is recorded in a structured manner, and analyses and operations can easily be carried out on the data, while a data lake may require further steps before processing is possible.
Interestingly, due to its wide range of technologies, a data lake can, in turn, contain entire databases and systems. So it would be quite natural for prepared Excel files to appear on our hard drive.
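The hard-drive analogy can be sketched in a few lines of Python. The local `lake/` directory below is only a stand-in for the real storage layer (an object store like S3 or a distributed file system like HDFS), and all file names and contents are made up for illustration:

```python
import json
from pathlib import Path

# A local directory standing in for the lake's storage layer.
lake = Path("lake")
(lake / "raw").mkdir(parents=True, exist_ok=True)

# Structured raw data: a CSV dropped in as-is, no schema required up front.
(lake / "raw" / "transactions.csv").write_text(
    "order_id,amount\n1001,19.99\n1002,5.49\n"
)

# Unstructured data: arbitrary bytes (e.g. an image) live side by side.
(lake / "raw" / "product_photo.jpg").write_bytes(b"\xff\xd8\xff\xe0fake-jpeg")

# Semi-structured data: JSON event logs.
(lake / "raw" / "clickstream.json").write_text(
    json.dumps([{"user": "u1", "page": "/home"}])
)

print(sorted(p.name for p in (lake / "raw").iterdir()))
```

The point is that all three files land in the same store without any of them having to fit a predefined schema first.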
Data lakes have their origins as a concept in a Forbes article, in which the CTO of the ETL tool Pentaho coined the term to contrast his view of data management with classic data marts.
Since then, the term has held up, but the original definition of simply dumping all data into a distributed file system like Hadoop has been refined several times.
To summarize it as a short definition: A data lake aims to capture structured and unstructured data equally in all processing stages.
Benefits of a data lake
The appeal of the data lake idea can hardly be denied. But what are the concrete advantages of such an architecture, especially in comparison to older database systems such as a data warehouse?
ACQUISITION OF UNSTRUCTURED DATA
One of the basic ideas, and still one of the advantages, is the collection of unstructured data in the data lake. The driver for this is mainly the amounts and types of data typical of big data, which are to be analyzed further using data science.
The fact that data such as images, videos and text is moving into the focus of companies can be explained on the one hand by the explosion in data volumes, and on the other by the fact that more and more companies are doing data mining with artificial intelligence and trying to generate added value from their data.
RECORDING OF THE VARIOUS PROCESSING STAGES (“DISTILLATION”)
Another important aspect of the advantages of data lakes is the possibility of documenting various steps in data preparation and processing as separate data sets.
Whether raw data, the consolidation of different data sets, aggregations, or fully processed data for dashboards or machine learning: every intermediate step can be saved in the data lake and thus also logged.
This has several advantages. On the one hand, it makes it easy to trace how the various data sets were ultimately created; on the other hand, each intermediate step can be reused by other users. If, for example, one team combines product and transaction data, that combined data set can be reused directly for other purposes.
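A minimal sketch of this "distillation", using only the Python standard library; the zone names (`raw`, `refined`, `curated`) and the data are made-up examples of the processing stages described above:

```python
import json
from pathlib import Path

lake = Path("lake")
for zone in ("raw", "refined", "curated"):
    (lake / zone).mkdir(parents=True, exist_ok=True)

# Stage 1 - raw: data lands exactly as extracted from the source system.
raw = [{"order_id": "1001", "amount": "19.99", "country": "de"},
       {"order_id": "1002", "amount": "5.49",  "country": "DE"}]
(lake / "raw" / "orders.json").write_text(json.dumps(raw))

# Stage 2 - refined: types fixed and values normalized; saved as its own
# data set so other teams can reuse this intermediate result.
refined = [{"order_id": int(r["order_id"]),
            "amount": float(r["amount"]),
            "country": r["country"].upper()} for r in raw]
(lake / "refined" / "orders.json").write_text(json.dumps(refined))

# Stage 3 - curated: aggregated for a dashboard or an ML feature.
revenue = {"DE": round(sum(r["amount"] for r in refined
                           if r["country"] == "DE"), 2)}
(lake / "curated" / "revenue_by_country.json").write_text(json.dumps(revenue))

print(revenue)  # {'DE': 25.48}
```

Because each stage is persisted separately, the lineage from raw extract to dashboard number stays traceable.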
FAST STORAGE, FAST ACCESS
Since data lakes generally follow the ELT model, i.e. they do not require a data model when the data is stored, collecting data is much easier and faster. Month-long discussions about which data model would be appropriate dissolve quickly when raw data is simply stored.
Conversely, the same applies to access. Once raw data is available in the data lake, it can also be delivered quickly and easily and does not have to be laboriously extracted from the source systems afterwards.
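The ELT idea can be illustrated with an in-memory SQLite table: the raw extract is loaded as-is, and the transformation happens only at read time, once consumers know what they need. The pipe-delimited payload format is a made-up example:

```python
import sqlite3

# ELT sketch: load first, model later. The raw extract goes into a single
# staging table with no upfront data-model discussion.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE raw_orders (payload TEXT)")
con.executemany("INSERT INTO raw_orders VALUES (?)",
                [("1001|19.99",), ("1002|5.49",)])

# The "T" happens at query time, shaped by the consumer's needs.
rows = con.execute("""
    SELECT CAST(substr(payload, 1, instr(payload, '|') - 1) AS INTEGER),
           CAST(substr(payload, instr(payload, '|') + 1) AS REAL)
    FROM raw_orders
""").fetchall()
print(rows)  # [(1001, 19.99), (1002, 5.49)]
```

Nothing about the final schema had to be decided before the data was stored, which is exactly what makes ingestion fast.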
THE VALUE IS STILL UNDEFINED – SO THERE IS A HIGH CHANCE OF RECYCLING
Directly linked to the advantage of the ELT process, the potential value of the data is also significantly higher, since raw data retains more granularity and information content.
The reason is simple: when you prepare data, you can only reduce its granularity; at most, data is left out. Raw data is therefore generally of higher value than aggregated or processed data.
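A tiny example of why aggregation is lossy (the event data is invented for illustration):

```python
# Raw events: which user viewed which page, in order.
raw_events = [("u1", "/home"), ("u1", "/cart"), ("u2", "/home")]

# Aggregation is lossy: from the per-user count alone you can no longer
# tell which pages were visited, or in what order.
views_per_user = {}
for user, _page in raw_events:
    views_per_user[user] = views_per_user.get(user, 0) + 1

print(views_per_user)  # {'u1': 2, 'u2': 1}
```

Keeping `raw_events` in the lake preserves the option of answering questions that the aggregated `views_per_user` can no longer answer.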
This advantage has always been one of the reasons for storing raw data in a data lake. "Who knows what else I may need this data for" – following this idea has only become possible since raw data is also stored, instead of only the transformed records being transferred to the data warehouse.
RELIEF OF SOURCE SYSTEMS
Another main argument is the reduced load on the data source systems. By fully replicating the data into the data lake, the source system is only loaded once, during the extraction, and not every time the data is analyzed.
This is a fundamental advantage, especially with core systems: because overloading a core system usually has serious consequences.
From a technical point of view, data lakes are also better equipped for a high volume of data requests and can scale accordingly, unlike conventional software tools that are not designed for a large number of raw-data consumers.
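A sketch of this one-time replication, using two in-memory SQLite databases as stand-ins for the core system and the lake (table and column names are invented):

```python
import sqlite3

# Hypothetical core system that must not be hit by every analysis.
source = sqlite3.connect(":memory:")
source.execute("CREATE TABLE orders (id INTEGER, amount REAL)")
source.executemany("INSERT INTO orders VALUES (?, ?)",
                   [(1, 19.99), (2, 5.49), (3, 100.0)])

# One full extraction into the lake replica: the source is loaded once.
replica = sqlite3.connect(":memory:")
replica.execute("CREATE TABLE orders (id INTEGER, amount REAL)")
replica.executemany("INSERT INTO orders VALUES (?, ?)",
                    source.execute("SELECT id, amount FROM orders"))
source.close()  # every later analysis runs against the replica only

total = replica.execute("SELECT SUM(amount) FROM orders").fetchone()[0]
big = replica.execute(
    "SELECT COUNT(*) FROM orders WHERE amount > 50").fetchone()[0]
print(round(total, 2), big)
```

After the single extraction, any number of analytical queries can run against the replica without touching the core system again.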
ELIMINATION OF DATA SILOS
A final advantage that we would like to list here is the gradual elimination of data silos. By storing the raw data together in a central data lake, you bypass the boundaries drawn by the responsibilities of the individual source systems.
This democratizes the available data and creates a common understanding of what data exists and how it can be used to generate added value.
The dangers of a data lake
In addition to these numerous advantages, there are, of course, also dangers when using a data lake.
THE DATA LAKE BECOMES A DATA SWAMP
The usability of a data lake stands or falls with the corresponding data management, known as data governance. It can only be maintained and used if it is clearly documented which data is in the data lake.
If you do not follow this strict management, you will soon lose the overview of which data is used, in which state, in which analyses and channels. Ownership can also become unclear, or, in the worst case, whether the content is compatible with the GDPR.
If you do not practice proper data governance, the lake degenerates into a swamp full of data that no one can see through – metaphorically called a data swamp.
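What such documentation has to record can be sketched as a minimal, hypothetical catalog. Real deployments would use a dedicated metadata or governance tool, but the recorded facts are similar: where the data lives, who owns it, its schema, and whether it contains personal data that falls under the GDPR:

```python
from datetime import date

# Minimal, hypothetical data catalog: one entry per data set in the lake.
catalog = {}

def register(path, owner, schema, contains_personal_data):
    """Record the governance facts for a data set at registration time."""
    catalog[path] = {
        "owner": owner,
        "schema": schema,
        "contains_personal_data": contains_personal_data,
        "registered": date.today().isoformat(),
    }

register("lake/raw/clickstream.json", "web-team",
         {"user": "string", "page": "string"},
         contains_personal_data=True)
register("lake/curated/revenue_by_country.json", "finance-team",
         {"country": "string", "revenue": "float"},
         contains_personal_data=False)

# Governance questions become simple lookups instead of guesswork:
gdpr_relevant = [path for path, meta in catalog.items()
                 if meta["contains_personal_data"]]
print(gdpr_relevant)  # ['lake/raw/clickstream.json']
```

If registering an entry like this is mandatory before data lands in the lake, the swamp scenario becomes much harder to reach.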
ANTI-DEMOCRATIZATION OF DATA ANALYTICS
By replacing data silos, access to the data should actually be simplified and its availability widened. The downside, however, is that with classic data warehouses and similar systems, the path to self-service access is usually much shorter than with a data lake consisting of several highly specialized technologies.
Consequently, it can be argued that because the data in a data lake is not prepared, it is also more difficult to access, especially for non-specialists.
Since the path to becoming a "Citizen Data Scientist" is much longer than the path to becoming a "Citizen Data Analyst", careful attention must be paid to how the data lake's contents can be made available so that the broadest possible audience can access them.
MANAGEMENT OF MANY DIFFERENT TECHNOLOGIES
Another point that follows logically from the infrastructural makeup of data lakes is the necessary management of the various technologies. Above all, combining different systems and tools into a framework in which the technologies work hand in hand is a major challenge.
The goal must be to keep the range of tools as small as possible while still meeting all requirements. This calls for highly specialized enterprise architects and data engineers who plan the data lake for the long term, establish standards, and prevent it from slipping into tool-centricity instead of remaining data-centric.
HIGH COSTS
This point follows directly from the previous one. A data lake usually has several system components, which can vary in complexity. For example, if a data lake uses one tool for capturing unstructured data and another for structured data, personnel must be available to manage both solutions.
In addition to these personnel costs, there are the usual costs for licenses and hardware. The architecture component – which systems interact with each other and how – and the governance component should not be underestimated as cost factors either.
The role of the data lake in a data-driven company
We hope this simple explanation and definition of the idea, concept, benefits, and architecture of data lakes was clear. The question that remains is: what role does a data lake play in a data-driven company?
Simply put – regardless of whether the infrastructure is called a data lake or a data hub, or a data platform – the consolidation and provision of data is absolutely central in a data-driven company.
If the goal is for all processes and operations to be supported by data, then the data must also be recorded, accessible and available. This only works if the storage components, as well as the data governance, are fully under control.
In summary, a data lake – or a modification of it – is one of the most important tools for building a data-driven company. You have to be able to access data not only in analytics and machine learning but in all processes.
Since such consolidation and centralization is a mammoth project, our recommendation is to set up a data lake early and to maintain and manage it iteratively, instead of losing sight of day-to-day operations in endless planning.