I have worked on a wide variety of data catalog projects lately, and I’d like to share some of my thoughts from the various implementations that I’ve done.
What is a Data Catalog?
After discussions with a trusted colleague, I have begun to rethink my definition of what a Data Catalog is. My colleague challenged my definition: he believes that a data catalog should simply be defined as "a card catalog for data." I tended to look at the COTS (Commercial Off-the-Shelf) offerings and insisted that a data catalog should possess many of the enhanced functionalities that those products provide.
Perhaps these products should be thought of as Data Management Platforms; or they can still be considered data catalogs, but with more robust functionality.
Some products offer automation capability, especially in the ingestion of data from both known and new data sources. This is extremely important in keeping up with the data flood. But is it necessary and sufficient for something to be called a data catalog? Some products are more manual in their metadata ingestion. Are they still data catalogs?
I was also discounting various open-source technologies, simply because they don't offer the robust suite of ML, automation, and add-on functionality that the main COTS tools provide. But if the definition of a data catalog is simply a "card catalog for data," or a data inventory, then some of them would qualify. I confess to being drawn in by the "bright and shiny things" that fully functioning products provide. It's funny: I am critical of others who do this, who get distracted by some awesome feature whether or not they really need it.
However, in working with various clients, sometimes all they want is a very simple card catalog. There may be various security constraints in place that prohibit some of the advanced features from being used. Therefore, I believe it is important to distinguish what functional categories are important to the organization.
Some products—perhaps most products—are geared especially around a specific type of function or use case. Many times, the catalog product representatives are quick to stress this. For example:
- If the main problem you want to solve is data governance, then Collibra might be a good choice. Collibra solves most use cases with a variety of governance workflows, even when you wouldn’t think the problem is governance-related. If you want open source, Truedat is a possibility.
- If the main challenge you have is data preparation and you want to make it easier for your data analysts, Boomi (formerly known as Unifi) is an option.
- If the main thing is creating an environment for your data scientists, and they already use Watson Studio, then IBM is a fabulous choice. It creates a “data fabric” environment for data scientists.
- If you want to understand the data in your data lake, and you have enormous volumes of data coming in at a rapid pace that you can't keep up with using typical data governance, then use Hitachi Vantara Lumada.
- When you want an all-purpose data management platform, Informatica is great.
- When you want a nice machine learning-augmented data catalog, you don’t have a lot of resources, and you want to stand it up immediately, Alation fits the bill.
- If you don't have a lot of investment dollars but do have several developers, you have a lot of security concerns, and all you want is a simple card catalog without other features such as reference data management or a business glossary, open-source products like CKAN may work fine.
- If your analysts are frequently looking for data, the data sets are large, the field names don't usually make sense (as with IoT and system-generated data), and there's no data dictionary, you want a product that profiles the entire data set and not just a sample. Several products can do this.
You get the idea.
The interesting thing I've found is that when I help my clients find a product, I often look at the products through my own data analyst background. It is like looking through tinted glasses: if you wear rose-tinted glasses, everything looks rosy. Or like the hammer/nail problem: if you are a hammer, everything is a nail. If you have solved problems with data profiling in the past, and data profiling has been critical to your work over the years, then it feels like a critical feature. For example, I used to engineer data warehouses. We had to look at a data profile and tell what was going on with the data before we wrote the ETL. Data profiling helped stop the "Load, Code & Explode" problem. However, what I've found is that some clients need this level of understanding, but not all do. I must delve into what their needs really are in order to ferret out their actual requirements, and I have to understand the limits of my own bias. Sometimes my bias is very helpful, if it matches a real concern of the client. But sometimes it can get in the way.
And similarly, if you are a data scientist and you are looking for specific data to use for a statistical modeling problem, you want to know what is in a data field, even if you don’t have anything that explains it. Data profiling can help greatly. But if you are an organization with extremely sensitive data, you may not want to expose it to data profiling. In this case, data profiling may be of little or no help. Therefore, you can see that certain use cases dictate the value of certain features and functions.
What I have found most often is the wrong product with the wrong emphasis is chosen. Then the organization is faced with the Cinderella Stepsister’s Foot Problem: Trying to jam Cinderella’s stepsister’s foot into the glass slipper. It doesn’t fit. Ouch! Then the organization is faced with either investing a lot of money into forcing it to fit or having to admit failure and buy something else. Or, sadly, a third alternative: Scrap the whole thing. I’ve seen all these things take place. It is indeed a sad situation, especially with the great promise that data catalogs bring.
A Word about Open-Source Technology
Open-source technology, when it comes to data catalogs, offers some functionality but generally no support such as a helpdesk or training; products also vary in the degree of documentation. Most open-source data catalog products offer the basics: a card catalog for data. They may not offer:
- Machine learning and inferred tagging assignment (important when you have lots of data coming in and you need fast, efficient ingest of metadata)
- Types of business metadata such as Policies, Rules, Business Glossary, Acronym Dictionary and Reference Data management and governance
- Data Governance assistance such as workflows
It should be noted that open-source data catalog products typically provide some of these things but not all of them; however, you can retrofit them yourself if you want further functionality. For example, some provide a business glossary but no policies, and you can approximate a policy by using the business glossary function. I find this difficult because there are things that pertain to policies that do not pertain to business terms.
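To make the baseline concrete: a bare-bones "card catalog for data" is little more than a searchable inventory of metadata records, with no profiling, workflows, or governance attached. Here is a minimal sketch in Python; all names and fields are hypothetical, not drawn from any particular product:

```python
from dataclasses import dataclass, field


@dataclass
class CatalogEntry:
    """One 'card' in the catalog: descriptive metadata only, no actual data."""
    name: str
    description: str
    owner: str
    tags: list = field(default_factory=list)


class CardCatalog:
    """A bare-bones data inventory: register entries, then search them."""

    def __init__(self):
        self.entries = []

    def register(self, entry: CatalogEntry):
        self.entries.append(entry)

    def search(self, term: str):
        # Simple keyword match against name, description, and tags.
        term = term.lower()
        return [
            e for e in self.entries
            if term in e.name.lower()
            or term in e.description.lower()
            or any(term in t.lower() for t in e.tags)
        ]


catalog = CardCatalog()
catalog.register(CatalogEntry(
    "sales_2021", "Monthly sales figures by region", "Finance", ["sales", "revenue"]))
catalog.register(CatalogEntry(
    "sensor_feed", "Raw IoT telemetry from the plant floor", "Operations", ["iot"]))

print([e.name for e in catalog.search("sales")])  # → ['sales_2021']
```

Everything beyond this inventory-and-search core (machine learning tagging, glossaries, governance workflows) is the "enhanced functionality" that separates a simple card catalog from a data management platform.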
I have seen an organization choose a very robust, fully functional data catalog with lots of advanced features. They could not store sample data in the data catalog, nor the results of data profiling, due to the sensitive nature of their data. They had to perform workarounds even to get the metadata ingested. Prior to acquiring this product, they had created a sandbox proof-of-concept prototype using CKAN, which is essentially a portal product. They were happy with the results, which led them to purchase the COTS product. In retrospect, perhaps they should have stayed with the simple CKAN solution, given all the extra work the COTS product required. They were not really taking advantage of all the extra features the COTS product provided.
If you invest in open-source, it is important to consider the Community: how large is it? How quickly are questions generally answered? The Community is, in essence, your support. If you are not investing money in COTS, you are investing time researching answers to issues. You must also have the developer resources to tweak the product and enhance it to make it fit your use cases.
Hybrid Product
There's an interesting hybrid product available: Truedat, by Bluetab. Bluetab was bought by IBM, and they offer training, support, and consulting while remaining at heart an open-source solution. They feature data governance as their main emphasis, and offer a business glossary, profiling, data lineage, data sharing agreements, and data governance dashboards, to name a few. They have a different type of pricing, mostly around custom setup. They do not allow the community to add functionality to the tool like some open-source vendors; they do all the enhancements themselves.
Infrastructure Considerations
Another area of importance is infrastructure. When you are looking at data catalog tools, it is critical that you identify the infrastructure each runs on and how it connects with other systems to scan metadata. Be sure that the version the vendor is presenting to you is one that will run in your current environment, and understand the future roadmap for that infrastructure. This is especially true concerning on-premises vs. cloud. Some products only run in the cloud; others are hosted by the vendor. This is especially important when it comes to isolated private cloud environments, such as those you might see in classified settings or in secure industries like healthcare or finance.
The Importance of a Market Study
Data catalogs are very different; each offers different features and functionalities. I have seen some organizations pick the first one that they hear of, or one that someone on their staff has used in another organization with favorable results. As mentioned above, you need to decide what functionalities are important to you. Some nuances to look for:
- As mentioned above, data profiling can be a very helpful feature. However, the tools do this very differently. Many only profile a small sample of the data upon metadata ingest. This can give a distorted view of the underlying data: how do you know that the sample is representative of the whole data set? What if the sample has no nulls in an important field, but the rest of the data is 50% null in that same field? You should verify how the tool performs profiling. If profiling the entire data set is important to you, ask the vendor about it. Some may say, "You can tune it to profile the entire data set." I did this once with a tool, and it brought the entire system to its knees. The bottom line is that it wasn't designed to do this.
- Some do not provide customizable workflows, or they constrain how customization is done. Some tools require using a separate tool to write workflows in a workflow programming language. Others have built-in workflow designers with drag-and-drop widgets. Some require any customization to be done by the vendor.
- Some do not offer business metadata objects that are managed in the tool with data governance, such as policies, reference data, or business glossaries. Some provide only one business glossary, which does not enable you to partition glossaries by division. Most organizations have multiple glossaries based on their structure, such as Finance, Marketing, Product Management, Human Resources, etc.
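The sampling distortion described in the first bullet above is easy to demonstrate. In this hypothetical sketch, the rows a tool happens to sample contain no nulls, while the full column is nearly half null:

```python
def null_rate(values):
    """Fraction of values in a column that are missing (None)."""
    return sum(v is None for v in values) / len(values)


# Hypothetical column: the first 100 rows are fully populated,
# but half of the remaining 900 rows are null.
column = list(range(100)) + [None if i % 2 == 0 else i for i in range(900)]

sample_rate = null_rate(column[:100])  # profile only the leading sample
full_rate = null_rate(column)          # profile the entire column

print(f"sample: {sample_rate:.0%}, full: {full_rate:.0%}")
# prints: sample: 0%, full: 45%
```

A profile built from the sample would report a clean, fully populated field; only profiling the whole column reveals the 45% null rate, which is exactly the kind of surprise that causes trouble downstream.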
Data Governance
The tools also vary concerning how they perform data governance. I have seen organizations get bogged down with the wrong sort of data governance implemented. Data governance is supposed to help an organization ensure that its data can be depended upon and is trustworthy, which is a noble goal; however, if you put too much structure around data governance, it can hamper an organization’s ability to be nimble and respond to business challenges quickly.
The data catalog can help, and there are various approaches that can be used when setting up the product. I created a lightweight approach to data governance called Governance Lite™, which was mainly designed for glossary governance. It is reactive rather than proactive: it allows business terms to be proposed along with any definition (or even without one), and you can vet the terms and definitions later. There is some benefit to having candidate terms in the glossary, as long as their candidate status is known.

Can this approach be used with data sets? Some products, notably Lumada, use it when metadata is ingested. The tool has a very powerful machine learning facility that automatically assigns tags to columns with a confidence factor and allows this data to be shown before it is vetted; someone can then accept the suggestion later. The concept is similar to reactive governance: allow the data to be seen in the catalog, along with the inferred tags and confidence factors, before it is approved. Another tool that does this is Alation. You can ingest data and it becomes searchable. Alation has a "certified" attribute which can be set by a data steward, and it also allows a "caution" warning to be displayed for a data set with an explanation. This allows all data to be searchable regardless of its certified status.
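The reactive pattern can be sketched in a few lines. This is a minimal illustration of the idea, not any particular product's implementation; the function names, statuses, and confidence value are all hypothetical:

```python
# Reactive ("Governance Lite"-style) tagging sketch: an inferred tag is
# visible and searchable immediately, carrying its confidence factor and
# a "candidate" status; a data steward vets it at some later time.

def ingest(column_name, inferred_tag, confidence):
    """Record an ML-suggested tag as a searchable candidate entry."""
    return {
        "column": column_name,
        "tag": inferred_tag,
        "confidence": confidence,
        "status": "candidate",  # visible before any human review
    }


def vet(entry, accept):
    """A data steward later accepts or rejects the suggestion."""
    entry["status"] = "certified" if accept else "rejected"
    return entry


entry = ingest("fld_37", "customer_email", confidence=0.87)
print(entry["status"])  # → candidate (already searchable)
vet(entry, accept=True)
print(entry["status"])  # → certified
```

The key design choice is that governance happens after visibility rather than before it: nothing is hidden while awaiting approval, and the status field tells users how much to trust what they find.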
Note that this approach is not always desirable, but it may be appropriate in the case of large amounts of data coming in at a rapid pace. I’m doing a presentation at Enterprise Data World this year on this subject, if you’d like more information.
Summary
It should be noted that there is a wide variety of data catalogs available today; one size does not fit all. Even the category of "Data Catalog" has a broad definition. "Card catalog for data" could mean something as simple as a data inventory. Some products, like CKAN, provide this because they are mainly portals: CKAN allows an enterprise to organize web pages and data, keeps an inventory of what it manages, and allows searches. But it does not provide many of the data management functions that other products provide.
When your organization determines that it needs a data catalog, what is the heart of this need? Do you need to track data as it moves through the organization? Do you need to perform data governance and data quality, and certify data? Do you have so much data coming in that you must have some way to quickly inventory it and allow it to be searched, without a lot of governance?
The most important takeaway from this article is that you must take the time to figure out exactly what your real needs are. This takes time, but it will save you far more time in the end than trying to retrofit a product that does not do what you need it to do. Market studies are your friend, especially in such a broad area as data catalogs.
© 2022 The MITRE Corporation. ALL RIGHTS RESERVED
Approved for public release. Distribution unlimited PR_21-00122-6