Winter will soon be in full swing. Here in New York, we had a bomb cyclone late last year followed by an arctic blast that brought in crazy cold temperatures. It seems as though winter takes up permanent residence in NYC this time of year.
But when I look outside across the snow-covered city, I start thinking about data. Why? Well, I do work for a data governance company (Collibra), so data is often on my mind. But lately, many of the discussions around the water cooler have focused on the topic of data lakes.
I know what you’re thinking: What on earth do data lakes and bomb cyclones have in common? Stay with me. I’ll connect the dots, I promise. Businesses today are facing one of two scenarios when it comes to data lakes: 1) they are building a brand-new lake from scratch or 2) they are trying, sometimes desperately, to clean up the data lake (swamp?) that already exists.
If you’re fortunate enough to be building a new lake, then you’re one of the lucky ones in my mind. See, you have a blank slate. You can embrace all the right data governance policies and practices right from the start, ensuring that data in your lake is identifiable, has context and meaning, and that you know its authoritative sources. Using a data catalog, you can ingest data that is easily understood and easily trusted. Your data lake will resemble the quiet calm of falling snow and will remain unsullied by poorly described data. You’ll provide the data catalog to your business users, and they will easily find the data sets they need, understand their lineage, meaning, and use, and trust that the data they are using is right. And that increases the odds that your business users will actually use the lake that you’re building. Sounds great, right?
But for those of you facing scenario two, you’re dealing with the aftermath of the blizzard. You know what I’m talking about. What started out as a flurry of pristine white flakes falling gently to the ground quickly turned into the dirty, slushy muck that sprays you every time you step outside. Eventually, the muck goes away, but it takes a great deal of effort to clean up the mess it leaves behind. Far too often, I talk with customers who have a data lake filled with muck. They failed to put in place governance policies and practices before they dumped data into the lake, and now they are left mopping up the mess.
But there’s good news for those of you facing the dirty data lake scenario as well. You, too, can benefit from a governed data catalog. See, a data catalog will help you understand, and document, what you have in your lake. It will create an inventory of data, including attributes that indicate its quality, its definition, its lineage, and its recommended use. A catalog will also help you crowdsource information about the data from the people who are using it. So, if Bob in marketing used a data set and found it to be incomplete, he can note that point in the catalog. And if Sue in accounting found a particular data set to be fit for purpose and highly trustworthy, she can document that as well. And by linking your data catalog to your business glossary, you can make it really easy for the business users searching the lake for data to know exactly what the data means and whether or not it’s the right data for the job at hand.
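For the more technically minded, here is a rough sketch in Python of what a catalog entry with crowdsourced notes might look like. Everything in it is an assumption made for illustration: the class names (CatalogEntry, Annotation), the trust_ratio helper, and the data set names are hypothetical, and this is not how Collibra or any other catalog product actually models its data.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class Annotation:
    """A crowdsourced note left by someone who used the data set (hypothetical model)."""
    author: str
    note: str
    trusted: bool  # did the author find the data fit for purpose?


@dataclass
class CatalogEntry:
    """One inventoried data set and the attributes the post describes (hypothetical model)."""
    name: str
    definition: str        # what the data means, ideally tied to a glossary term
    lineage: List[str]     # upstream sources, in order
    recommended_use: str
    annotations: List[Annotation] = field(default_factory=list)

    def annotate(self, author: str, note: str, trusted: bool) -> None:
        """Record a user's observation about this data set."""
        self.annotations.append(Annotation(author, note, trusted))

    def trust_ratio(self) -> Optional[float]:
        """Share of annotations marking the data as trusted; None if nobody has commented."""
        if not self.annotations:
            return None
        return sum(a.trusted for a in self.annotations) / len(self.annotations)


# Illustrative data sets; the names and descriptions are made up.
catalog: Dict[str, CatalogEntry] = {
    "campaign_responses": CatalogEntry(
        name="campaign_responses",
        definition="Customer responses to email campaigns",
        lineage=["crm_export", "email_platform_feed"],
        recommended_use="Campaign performance reporting",
    ),
    "ledger_daily": CatalogEntry(
        name="ledger_daily",
        definition="Daily general ledger balances",
        lineage=["erp_system"],
        recommended_use="Month-end reconciliation",
    ),
}

# Bob and Sue document what they found, as in the example above.
catalog["campaign_responses"].annotate("Bob (marketing)", "Incomplete data set", trusted=False)
catalog["ledger_daily"].annotate("Sue (accounting)", "Fit for purpose and trustworthy", trusted=True)

for entry in catalog.values():
    print(entry.name, entry.trust_ratio())
```

The point of the sketch is simply that a note like Bob's or Sue's lives right next to the definition and lineage, so the next person searching the lake sees all of it in one place.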
What are you doing to control the blizzard of data facing your organization?