For a while now, vendors have been advocating that organizations moving their data to the cloud put it in a data lake.
The Data Lake
The idea is that you put your data into the data lake. Then, at some later point in time, an end-user analyst can come along and use the data for analytical processing.
Data lakes solve the problem of data disposition. But actually using data from the data lake is another matter entirely.
There are several major problems with dumping data into a data lake. The first is that so much data accumulates there that no one can find what they are looking for. The second is that there is no consistent definition of the data; without consistency, it is difficult to relate one data element to another. The third is that the different types of data found in the data lake are not compatible with one another.
These are merely the problems at the tip of the data lake iceberg.
To understand how to solve this problem, let's consider the data that is being put into the data lake.
Data Going into the Data Lake
Typically, there are four kinds of data being stuffed into the data lake: structured, transaction-based data; textual data; analog data; and IoT data.
The different kinds of data do not arrive in equal volumes. For example, there is usually a much higher volume of textual data than of structured, transaction-based data. In some shops there is 10 to 20 times as much textual data as structured data. The ratio of course depends on the shop and its business, but in general the volume of textual data significantly outweighs the structured data.
However the data arrives, it accumulates, and there ends up being a lot of data in the data lake.
And with this accumulation comes a problem. Soon there is so much data, especially text-based data, that no one can find anything in the data lake. People know the data is there somewhere; they just can't find it. And if they can't find it, they can't do anything with it.
A Manual Search
Trying to conduct a manual search for data in the data lake is a waste of time. In addition, in a manual search there is always the chance that the document is actually there but has been missed. And when you are searching a large body of items and happen to miss one, the rest of the search is wasted effort.
So, looking for documents manually in the data lake is a very frustrating game, especially as the number of documents continues to grow. Over time the problem only gets worse.
Like a Public Library
The dilemma that organizations using data lakes find themselves in is very similar to the challenge of managing a large public library.
In a large public library, when you want to find a book, how do you do it? You don't just go to the stacks and start searching; you would be in the library a long time before you ever found what you were looking for. And indeed, if you happened to skip past what you were looking for, you might never find it at all.
So, how do you go about finding a book in a large public library?
The Library’s Card Catalog
To find the book in the public library, you first go to the card catalog, where you can quickly and easily find the book you are looking for. Once you have searched the card catalog, you know where to go in the stacks to find your book.
Then, once you have found your book, you turn to the index in the back of the book to find the detailed, specific information you are looking for.
You can do this search quickly, efficiently, and economically in a public library. It is done every day.
A Card Catalog for the Data Lake
You need to be able to do the same thing with your data lake. What you need is a card catalog for the data lake. You have a huge number of documents stashed away there, and you need to be able to search at the document level and at the detailed item level in order to find what you are looking for.
Technology for the Card Catalog
Fortunately, there is technology that produces both levels of information for you. Textual ETL produces a two-level index scheme that becomes the card catalog and the detailed document index for your data lake.
The component of Textual ETL that produces the catalog is called ECL (extract/classify/load). The component that produces the lower-level document indexing is simply called ETL (extract/transform/load).
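As a rough illustration of what such a two-level scheme might look like, here is a minimal sketch using SQLite. The table and column names are assumptions made for illustration; they are not the actual structures Textual ETL produces.

import sqlite3

# A minimal sketch of a two-level index in a relational store.
# Table and column names are illustrative, not the Textual ETL schema.
conn = sqlite3.connect("data_lake_index.db")

# Level 1: the "card catalog" -- one row per document in the data lake.
conn.execute("""
    CREATE TABLE IF NOT EXISTS doc_catalog (
        doc_id      INTEGER PRIMARY KEY,
        location    TEXT NOT NULL,   -- where the document lives in the lake
        title       TEXT,
        doc_type    TEXT,            -- e.g. contract, email, report
        created_at  TEXT
    )
""")

# Level 2: the "back-of-the-book index" -- one row per term occurrence,
# pointing back to its document and its position inside that document.
conn.execute("""
    CREATE TABLE IF NOT EXISTS doc_index (
        doc_id  INTEGER REFERENCES doc_catalog(doc_id),
        term    TEXT NOT NULL,
        offset  INTEGER              -- character position of the term
    )
""")
conn.commit()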
How Textual ETL Works
Textual ETL works in a simple manner. Raw textual documents are read and transformed into a database. The database has two levels: a high level for identifying documents, and a low level for identifying the detail within those documents.
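Continuing the sketch above, loading a raw document into both levels might look something like this. The classify() helper here is hypothetical, standing in for Textual ETL's far richer classification logic.

import re

def classify(text):
    # Hypothetical classifier -- a stand-in for real classification logic.
    return "contract" if "agreement" in text.lower() else "general"

def ingest(conn, doc_id, location, title, text):
    """Load one raw document into both levels of the index."""
    # Level 1: record the document itself in the card catalog.
    conn.execute(
        "INSERT INTO doc_catalog (doc_id, location, title, doc_type) "
        "VALUES (?, ?, ?, ?)",
        (doc_id, location, title, classify(text)),
    )
    # Level 2: record each word and where it occurs in the document.
    for match in re.finditer(r"\w+", text.lower()):
        conn.execute(
            "INSERT INTO doc_index (doc_id, term, offset) VALUES (?, ?, ?)",
            (doc_id, match.group(), match.start()),
        )
    conn.commit()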
The card catalog and the index allow you to find many details that reside in your documents with ease and efficiency.
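To close the loop, a two-step search over the sketched database, catalog first and then the detail index, might look like this. Again, the names and logic are illustrative assumptions, not the product's actual interface.

def find(conn, doc_type, term):
    """Two-step search: card catalog first, then the detail index."""
    # Step 1: use the catalog to narrow the lake to candidate documents.
    docs = conn.execute(
        "SELECT doc_id, location FROM doc_catalog WHERE doc_type = ?",
        (doc_type,),
    ).fetchall()
    # Step 2: use the detail index to locate the term inside each candidate.
    hits = []
    for doc_id, location in docs:
        for (offset,) in conn.execute(
            "SELECT offset FROM doc_index WHERE doc_id = ? AND term = ?",
            (doc_id, term.lower()),
        ):
            hits.append((location, offset))
    return hits

# Example: find every occurrence of "indemnification" in contracts.
ingest(conn, 1, "s3://lake/contracts/001.txt", "Master Services Agreement",
       "This agreement covers indemnification and liability.")
print(find(conn, "contract", "indemnification"))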
An Asset
With your card catalog built, you can now find your way through the data lake. The time and money you save are significant, and the savings only grow as the data lake itself grows. The data lake turns into an asset, not a liability.
Now, you can slay the document beast that is sitting in your data lake.