"Data lake" is a relatively new IT term for a new category of data store. But just what is a data lake?
According to IBM, “a data lake is a storage repository that holds an enormous amount of raw or refined data in native format until it is accessed.”
That makes sense. I think the most important aspect of this definition is that data is stored in its “native format.” The data is not manipulated or transformed in any meaningful way; it is simply stored and cataloged for future use.
Any type of data can be stored in a data lake: structured, semi-structured, and unstructured. For example, organizations can use a data lake for customer information captured from multiple sources for future analysis and aggregation. This can consist of typical structured data (numbers, characters, dates, and times), as well as complex documents, text, multimedia, and more. In general, the data is ingested without transformation and data scientists can run analytical models against the data; business analysts can augment business intelligence activities with the data; and it can even be used as a long-term data archive.
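To make "ingested without transformation" concrete, here is a minimal sketch in Python. It assumes the lake is nothing more than a directory tree organized by source system; the function name `ingest_raw`, the source name "crm", and the sample payload are all hypothetical and stand in for whatever your ingestion tooling actually provides.

```python
import hashlib
import tempfile
from pathlib import Path


def ingest_raw(lake_root: Path, source: str, filename: str, payload: bytes) -> Path:
    """Store the payload byte-for-byte under <lake_root>/<source>/<filename>."""
    target_dir = lake_root / source
    target_dir.mkdir(parents=True, exist_ok=True)
    target = target_dir / filename
    target.write_bytes(payload)  # no parsing, no conversion, no cleansing
    return target


# Hypothetical example: a temporary directory plays the role of the lake.
lake_root = Path(tempfile.mkdtemp())
payload = b'{"customer": 42, "comment": "call me"}\n'
stored = ingest_raw(lake_root, "crm", "notes.json", payload)

# The stored copy is identical to what arrived: the "native format" is preserved.
unchanged = (
    hashlib.sha256(stored.read_bytes()).hexdigest()
    == hashlib.sha256(payload).hexdigest()
)
print(stored.relative_to(lake_root), unchanged)
```

The point of the sketch is what it leaves out: nothing inspects, validates, or reshapes the data on the way in. Any interpretation is deferred until someone reads it.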
Organizations are under intense pressure these days to capture any data that could be relevant to their business. And the number of sources and the amount of data continue to rise steadily. So, the desire to grab the data when it is available is high, but there is usually not enough time to organize and fully understand that data at the moment of capture.
But a data lake should not be treated as a dumping ground for data. It is important to have a means of understanding and managing the data that is stored in the data lake. Without a mechanism for defining, populating, accessing, and managing the data in your data lakes, you will find them to be less than useful.
Populating a data lake requires knowledge of and proper tools for data integration. Because the data lake contains multiple types of data from multiple sources, it must include support for a wide array of different platforms, data types and structures, interfaces, and processing capabilities.
You will also need some form of metadata management for a data lake environment to remain useful and healthy. At a minimum, a data lake requires information about each type of data stored there, along with guidance on where the data originated (that is, its provenance), the data elements it contains, the meaning of each, and how to read them. Of course, the metadata can be minimal to begin with and then fleshed out as your data scientists and analytics teams explore the data.
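As a rough illustration of what such metadata might look like, the following Python sketch models a single catalog record. The class `CatalogEntry`, its field names, and the sample data set are assumptions made up for this example; they do not correspond to any particular metadata product.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class CatalogEntry:
    """Hypothetical metadata record for one data set in a data lake."""
    name: str
    data_format: str   # the type of data stored, e.g. "json", "csv", "jpeg"
    provenance: str    # where the data originated
    read_hint: str = ""  # how to read the data
    elements: Dict[str, str] = field(default_factory=dict)  # element -> meaning

    def describe(self, element: str, meaning: str) -> None:
        """Record the meaning of one data element as it becomes understood."""
        self.elements[element] = meaning

    def undocumented(self, observed: List[str]) -> List[str]:
        """Return the observed elements that have no documented meaning yet."""
        return [e for e in observed if e not in self.elements]


# Start minimal: only format and provenance are known at capture time.
entry = CatalogEntry(
    name="web_clickstream",
    data_format="json",
    provenance="web server logs",
    read_hint="one JSON document per line",
)

# Flesh the entry out later, as analysts explore the data.
entry.describe("ts", "event time in UTC")
missing = entry.undocumented(["ts", "uid", "url"])
print(missing)
```

The `undocumented` helper reflects the incremental approach described above: the catalog starts sparse, and the gaps are visible rather than hidden.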
Some pundits have surmised that data lakes will spell the death of data marts and data warehouses. But if you think about it, this cannot be the case. A data warehouse, as defined by Bill Inmon, the father of the data warehouse, is “a subject-oriented, integrated, time-variant and non-volatile collection of data in support of management’s decision-making process.”
In contrast with a data lake, where data is captured and stored with no transformation or aggregation, a data warehouse contains data transformed from multiple sources and is designed for business users. A data lake cannot serve the same purpose unless the data is modified from its “native format” … and then it stops being a data lake by definition.
There are, certainly, many other differences. A data warehouse contains structured data, whereas a data lake can contain structured, unstructured, and semi-structured data. Data in the data lake comes from multiple sources and will have varying schemata. As such, the data lake requires schema-on-read capability, along with a platform, such as Hadoop, that supports it. With schema-on-read, structure is applied when the data is queried rather than when it is stored. Because data from multiple, disparate sources is all stored in its native format, a data lake cannot enforce schema-on-write the way a data warehouse does.
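A small Python sketch may help show what schema-on-read means in practice. The records, field names, and the helper `read_with_schema` are hypothetical; a real platform would do this at far larger scale, but the idea is the same: the raw data is stored untouched, and each reader supplies the schema it needs.

```python
import json

# Raw records sit in the lake exactly as they arrived. Note that the two
# documents do not share the same set of fields, and every value is a string.
raw_records = [
    '{"id": "1", "amount": "19.99", "channel": "web"}',
    '{"id": "2", "amount": "5", "note": "gift"}',
]


def read_with_schema(lines, schema):
    """Apply a schema at read time: select fields and cast them.

    Fields that a record does not contain come back as None.
    """
    rows = []
    for line in lines:
        doc = json.loads(line)
        rows.append(
            {name: cast(doc[name]) if name in doc else None
             for name, cast in schema.items()}
        )
    return rows


# Two readers, two schemas, one unchanged copy of the data.
sales_rows = read_with_schema(raw_records, {"id": int, "amount": float})
channel_rows = read_with_schema(raw_records, {"id": int, "channel": str})
print(sales_rows)
print(channel_rows)
```

Under schema-on-write, by contrast, the second record would have to be reshaped or rejected before it was ever stored, because it does not fit a single predefined layout.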
Of course, Hadoop is not the only technology that can be used for data lakes. Some organizations with a more cloud-focused mentality are using solutions from cloud providers like Amazon Web Services (AWS) and others.
The type of storage that can be used also separates data warehouses from data lakes. With a data warehouse, performance is important, and you do not want data that business professionals will query to sit on slower, less costly storage devices. Conversely, storing a data lake on such devices makes a lot of sense!
So, understand the differences between data lakes and data warehouses; use them both accordingly; and do not confuse the two.