Many of us know what a hoarder is. There have been shows about individuals who fill their dwellings with what the average person would view as junk. Whether they like to collect it, simply cannot bear to part with it, or have some other reason, they surround themselves with stuff that seemingly has no value and cannot be used. In fact, it takes up space that could otherwise be used for more comfortable living, and it can be a fire hazard.
Don’t Hoard Junk Metadata
What does all this have to do with metadata and data catalogs? Well, a data catalog is a collection of metadata that is intended to be useful to the enterprise. Now, there are many different kinds of metadata, but the type I want to focus on is technical metadata. Technical metadata is metadata that comes from the structures and operations of computerized environments. It includes database structures, data profiles, ETL metadata, inferred foreign keys, report structures, APIs, and so on. Increasingly, this technical metadata is being collected and integrated automatically at vast scale in data catalogs.
At first, this might seem like a great and beneficial achievement. All that technical metadata in one place should be tremendously useful for use cases like data discovery, understanding the provenance of data, and finding the best source of data. And it is undeniable that the development of the capabilities to collect and integrate all this metadata has been a significant technical achievement.
However, there is an assumption here: we are assuming that all metadata is equally valuable. But is this really so? Hoarders must think that all their stuff is valuable, otherwise they would not hoard it. Might we be making a similar mistake by thinking certain metadata is valuable just because it is easy to harvest?
How Metadata Can Be Junk
This problem first came home to me when a client explained that he was afraid to allow business users access to a data catalog because they might type “CUST” into the search bar and get back tens of thousands of results from a variety of technical components and services. He rightly feared that the users would be horrified and give up, unable even to comprehend the types of technical objects the metadata had been harvested from.
So, here we have a paradox. The more technical metadata that data catalogs contain, the more accurately and completely they hold a picture of the enterprise’s data assets – but at the same time, the more unusable they are by business users, who are meant to be the principal beneficiaries of data catalogs. It seems that we have created Junk Metadata – metadata that cannot usefully be consumed by business users.
“Junk” is a Business Viewpoint
Is this a fair conclusion? Going back to our hoarder parallel, many hoarders will argue that any of their possessions might become useful in the future – an argument that cannot be refuted because nobody can predict the future. Perhaps the same is true of Junk Metadata, and in the future, AI or ML may be used to derive business insights from it.
We can clarify this by defining Junk Metadata and its properties. Junk Metadata is:
Metadata that cannot be understood in business terms by business users.
That is, an item of Junk Metadata either:
- Has no business-understandable content; or
- Is not related to enough other metadata objects with business-understandable content for the user to infer a business understanding of the item.
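To make the definition concrete, here is a minimal Python sketch of how a catalog might flag such items. Everything in it is hypothetical: the `MetadataItem` class, the `business_description` field, and the `min_related` threshold are illustrative assumptions, not features of any particular data catalog product. It also assumes one reading of the two conditions, namely that an item is flagged only when it has no business content of its own and too few related items that do.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class MetadataItem:
    """Hypothetical catalog entry; field names are illustrative only."""
    name: str
    business_description: str = ""  # empty means no business-understandable content
    related: list[MetadataItem] = field(default_factory=list)


def has_business_content(item: MetadataItem) -> bool:
    """True if the item carries any non-blank business description."""
    return bool(item.business_description.strip())


def is_junk(item: MetadataItem, min_related: int = 1) -> bool:
    """Flag an item as Junk Metadata under the assumed reading of the definition.

    min_related is an assumed threshold for "sufficient" related items.
    """
    if has_business_content(item):
        return False
    # Count related items a business user could actually understand.
    understandable = sum(1 for r in item.related if has_business_content(r))
    return understandable < min_related


# Illustrative data, not taken from a real catalog.
customer_table = MetadataItem("CUST_MSTR", "Master list of customers")
linked_column = MetadataItem("CUST_X7", related=[customer_table])
orphan_table = MetadataItem("TMP_0042")
```

In this sketch, `linked_column` is not flagged because it is related to one understandable item, while `orphan_table` is flagged because it has neither a description nor any understandable relations. What counts as "sufficient" is a judgment call, which is why the threshold is a parameter rather than a fixed value.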
A major point here is that it is the business user’s viewpoint that is being considered. What we are calling Junk Metadata may be very useful for IT users. However, data catalogs have promised to be enterprise-wide and to democratize data for all users in the enterprise. Otherwise, they would be just another IT technical tool, like a DBA workbench.
Is Junk Metadata real? I think it is, to some extent. Any metadata in a data catalog must be understandable in business terms to even be considered by business users. Even then, it may have no business use. But I certainly do not want to imply that all technical metadata is Junk Metadata – just that some of it is. And, like a hoarder’s hoard, we cannot dismiss Junk Metadata completely, as there may be a way to extract business value from it in the future. However, it is up to us as Data Governance professionals to find the right balance so that our data catalogs always remain useful to our business users.