Data catalogs hold great promise. They can give the businessperson, analyst, architect, systems engineer, and data scientist alike the ability to look for data and understand what it contains. However, they are not magic. To be effective, they require proper curation and a good understanding of data and metadata, privacy and security, and the technical ecosystem. This column illustrates a few of the common pitfalls of data catalog implementation and how you can avoid them.
Data Catalogs Require Curation
Most data catalogs use machine learning (ML) to make intelligent connections between data sets, glossary terms, and categories. These connections require the product to learn from humans, and the catalog will get smarter over time. But at the very beginning, you have to allocate resources to help the tool make these connections, and the usual resource required to train the system, the time of technical and business data stewards, is at a premium. This must be accounted for when you are planning your implementation. How do you get around this one? The symptom is often masking a larger problem: the organization’s staff doesn’t see the value of the data catalog. It is very important that you make the case for the value it brings before you purchase it: time saved searching for data, time saved through auto-curation as the tool matures, and finding the right data the first time.
Reluctance to Share Data
Some departments and stewards are reluctant to share their data, and this reluctance extends to metadata. Other organizations have security constraints that do not allow the catalog to “scan” underlying data at all. If this is a problem, discuss it with vendors ahead of time. Some vendors provide role-based access control over individual metadata objects, such as the ability to obfuscate profiling results or sample data, sometimes down to the attribute level. For example, only certain users can see the sample data in the social security number field; everyone else can see that the column exists, but not any sample data or profiling results for it.
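To make the idea concrete, here is a minimal, vendor-neutral sketch in Python of what attribute-level masking looks like in behavior. The `CatalogColumn` structure and the `pii_reader` role are my own assumptions; commercial catalogs implement this with their own policy engines, but the effect they expose to users is similar.

```python
# A minimal, vendor-neutral sketch of attribute-level masking of catalog
# metadata. The CatalogColumn structure and role names are hypothetical.
from dataclasses import dataclass, field

@dataclass
class CatalogColumn:
    name: str
    data_type: str
    sample_values: list = field(default_factory=list)
    profile: dict = field(default_factory=dict)
    sensitive: bool = False  # e.g., flagged by a PII classifier

def view_column(col: CatalogColumn, user_roles: set) -> dict:
    """Everyone sees that the column exists; only privileged roles
    see its sample values and profiling results."""
    visible = {"name": col.name, "data_type": col.data_type}
    if not col.sensitive or "pii_reader" in user_roles:
        visible["sample_values"] = col.sample_values
        visible["profile"] = col.profile
    return visible

ssn = CatalogColumn("ssn", "string", ["123-45-6789"], {"null_pct": 0.1},
                    sensitive=True)
print(view_column(ssn, {"analyst"}))     # name and type only
print(view_column(ssn, {"pii_reader"}))  # full metadata, including samples
```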
My personal preference is to scan ALL the data, so that you have a reference to everything, and then tightly control access to only those who have the privilege to see it. This gives the maximum value to the organization: it answers the global question of “What data do we have?” and helps administrators manage the data properly. Also, the more data there is in the data catalog, the better the inferences will be, and both curation and search will benefit.
Know Your Use Cases Before You Choose
I always like to say that all data catalogs are not alike; one size does NOT fit all. They are very different, especially in this emerging market, which includes “Active Metadata” and Data Fabric/DataOps enablers. That is an article for another time. Suffice it to say that you must have defined your use cases well. This also includes the infrastructure: where the catalog will live and which systems it has to interact with.
Many catalogs started out as specialty data management tools, then broadened their scope by adding more functionality. Were acquisitions involved? How seamless is the integration between the original tool and the acquired one? How do these add-on functions work?
You need to know what your main, high-priority use cases are and how the vendor will support them. Ask for a targeted demo of these functionalities so you can see how the product would solve your most pressing issues.
Know Your Environment Before You Choose
An important consideration is the cloud. Many of the top catalog products are migrating to the cloud. But what if you have a private cloud, or need the product on premises? Make sure the vendor clearly articulates which version you will get. Ask what differences there are between the version that works in your environment and the one the vendor shows in a demo. It would not be advantageous for your staff to go to training and learn about all these wonderful features, only to find out that they are not available in the version they have.
Data Profiling
You should explore the role that data profiling will have in your overall technical metadata analysis process. I discussed this in a previous article back in February. It can be extremely helpful in locating the right data set for your AI model or analytic needs. Data profiling provides overall descriptive statistics about a data set, such as:
- The minimum value
- The maximum value
- The percent of nulls and number of nulls
- The percent of distinct values and number of distinct values
- The data type
- The most frequently occurring pattern
Profiling should then allow you to drill into value frequencies and pattern distributions. The sketch below shows how these statistics might be computed.
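As an illustration only, here is a minimal profiling sketch in Python using pandas. The sample column and the pattern-normalization rules (digits become 9, letters become A or a) are my own assumptions; real catalog products compute these statistics natively and at scale.

```python
# A minimal column-profiling sketch with pandas (illustrative only).
import pandas as pd

def profile_column(series: pd.Series) -> dict:
    """Compute the descriptive statistics listed above for one column."""
    non_null = series.dropna()
    # Normalize values into patterns: digits -> 9, letters -> A/a,
    # e.g. "555-1212" becomes "999-9999".
    patterns = (non_null.astype(str)
                .str.replace(r"\d", "9", regex=True)
                .str.replace(r"[A-Z]", "A", regex=True)
                .str.replace(r"[a-z]", "a", regex=True))
    return {
        "min": non_null.min() if not non_null.empty else None,
        "max": non_null.max() if not non_null.empty else None,
        "null_count": int(series.isna().sum()),
        "null_pct": round(100 * series.isna().mean(), 2),
        "distinct_count": int(series.nunique()),
        "distinct_pct": round(100 * series.nunique() / max(len(series), 1), 2),
        "data_type": str(series.dtype),
        "top_pattern": patterns.mode().iat[0] if not patterns.empty else None,
    }

df = pd.DataFrame({"phone": ["555-1212", "555-3434", None, "555-1212"]})
print(profile_column(df["phone"]))
```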
Here’s the rub: Does the product produce a data profile of the entire data set? Many do not; they profile only a sample. In my estimation, a profile on a sample can be very deceptive. You want to know what is going on with the data at a glance. How do I know that the percent of nulls is accurate for the whole data set if it is based on only a sample? This information can make or break your decision to use a specific data set, as the short illustration below shows.
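Consider this small synthetic example (the data and the naive head-of-table sampling strategy are assumptions for illustration):

```python
# Synthetic illustration: nulls clustered at the end of a table are
# invisible to a profile computed on only the first 1,000 rows.
import pandas as pd

full = pd.Series([1.0] * 9_000 + [None] * 1_000)  # true null rate: 10%
sample = full.head(1_000)                          # naive sample: first rows only

print(f"full null %:   {100 * full.isna().mean():.1f}")    # 10.0
print(f"sample null %: {100 * sample.isna().mean():.1f}")  # 0.0
```

A profile computed on the sample would report zero nulls, while one in ten values in the full data set is actually null.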
Conclusion
Bottom line: know your use cases and choose a tool wisely! Many people tell me their catalog efforts have been stymied by some of the concerns highlighted in this article. You can avoid this pain by investing in the proper analysis up front and choosing the right tool for you. After you have acquired the tool, your implementation efforts should be anchored in the benefit statements around what your organization will achieve by having a data catalog:
- The hours saved by data scientists and businesspeople in finding the right data
- The hours saved by data architects and data administrators in creating data inventories
- The ability to produce reports quickly and accurately by knowing which data sets contain the needed data and how to access them quickly
- The ability to locate where data is not being protected properly by seeing data lineage
- The ability to see the transformations that were used on the data to arrive at the value you see
These are just a few. Once you have articulated the benefits properly, sell them throughout the organization so data users can see what’s in it for them. When they see the product’s value, that will help soften the resistance to the elbow grease needed in the beginning to adequately set it up and train it to be effective.