What is a Data Glossary and how is it Different from a Data Dictionary?
This is a question I get asked a lot. IT people are generally happy that they understand what a data dictionary is and in my experience some business people also understand what one is (and on the rare occasion may even want to refer to one). But there is often a lack of clarity over what a data glossary is.
The increasing focus on data governance and slowly maturing levels of data governance mean that the term data glossary is being increasingly heard. But there is a great deal of confusion as the terms data dictionary and data glossary are often used interchangeably. To add to the confusion, a data glossary is often called a business glossary, but for clarity, I will use only the term data glossary from this point onward.
The term data dictionary has been in mainstream data management speak for much longer than data glossary, so let’s start by looking at that first. According to the DAMA Dictionary of Data Management, a data dictionary is:
“A place where business and/or technical terms and definitions are stored. Typically, data dictionaries are designed to store a limited set of meta-data concentrating on the names and definitions relating to the physical data and related objects.”
Experienced Data Analysts and Project Managers understand that building a data dictionary during a project should be a key part of the requirements development effort. Indeed, my first experience with a data dictionary was when I was a Project Manager for a data warehouse implementation, long before I had even heard of data governance!
You should take the time to identify and define all of the data used in your project, and a data dictionary should be created for every system that is built or implemented in your organization. Sadly, that does not always happen, and even when data dictionaries are created they are often forgotten. I have often come across instances where one was created as a project deliverable but not maintained or, even worse, lost or mislaid over time.
Data dictionaries should include a business definition of every term, which means business stakeholders should be involved in creating them. However, because the people most likely to refer to a data dictionary are the IT and MI teams, data dictionaries are often created without business input. This is a pity because, as I said above, developing them as part of the requirements gathering process is an excellent way to clarify the business requirements and ensure that your new system meets them.
The first difference between the data dictionary and the data glossary is that whilst the data dictionary is seen very much as an IT-owned document, data glossaries should be created and maintained by the business.
Data glossaries are the place to document business terms along with their definitions. At this stage, I’m sure you’re wondering how that makes a data glossary different from a data dictionary, and I’m going to reinforce that thought: although data glossaries should be created and maintained by the business, a good way to start one is to use an existing data dictionary. If you are lucky enough to have an existing (and up-to-date) data dictionary for your data warehouse, that would be an excellent place to start.
What makes a data glossary different is that although it can and often does contain details of the systems that data is held on (including tables and columns), the main focus of the content in the data glossary is information designed to improve business understanding and use of data. To that end, while you may have multiple data dictionaries, you should have only one data glossary for your organization.
A data glossary is a key deliverable in a data governance initiative, and because of that, alongside the terms and definitions, you should also be capturing the data owner and data steward for each term. As your organization becomes more mature you may also wish to consider including things like the data quality rules (i.e. what makes it good enough to use). I have even come across some organizations that include a field in their data glossary that flags if there are any data quality issues that any potential users of that data would need to be wary of.
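To make the fields described above concrete, here is a minimal sketch of what a single glossary entry might hold. The class name, field names, and the sample values are all invented for illustration; a real glossary tool will have its own structure.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class GlossaryEntry:
    """One business term in a data glossary (hypothetical field names)."""
    term: str
    definition: str
    data_owner: str
    data_steward: str
    # Optional extras that a more mature organization may choose to add.
    quality_rules: List[str] = field(default_factory=list)
    known_quality_issues: Optional[str] = None

    def has_quality_warning(self) -> bool:
        # True when potential users should be wary of a known issue.
        return bool(self.known_quality_issues)


# Made-up example values, not taken from any real glossary.
entry = GlossaryEntry(
    term="Active Customer",
    definition="A customer with at least one open account.",
    data_owner="Head of Retail",
    data_steward="Retail Data Steward",
    quality_rules=["Customer ID must not be blank"],
    known_quality_issues="Some legacy records have no account open date",
)
```

Keeping the quality rules and the issue flag optional reflects the point that these are things to add as maturity grows, not a requirement on day one.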
Some people will tell you a data glossary should be used to create a ‘common’ set of definitions. Now, I agree that would be sensible in a utopian data world. However, the vast majority of organizations are not yet mature enough in data governance terms to dive straight into this. Instead, I encourage my clients to use the development of a data glossary to identify where there are a number of differing definitions for the same term and, conversely, where a number of different terms have the same definition. Only then are you in a position to analyze these occurrences and agree to move to standard definitions. This may, of course, involve a high degree of negotiation!
Be aware that forcing everyone to move to standard definitions is not always the right answer. If your investigations conclude that, although the terms share a name, there are valid business requirements for the different definitions, I would recommend renaming the terms to make it clear that they are not the same thing. This prevents or solves one of the biggest causes of data quality issues that I have come across, which is a lack of understanding of what the data means. This can cause issues in two ways:
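If the terms and definitions you collect are held in a simple list, the two situations described above can be surfaced with a few lines of code. This is only a sketch: the function name, the tuple layout, and the sample rows are assumptions, and it compares definitions by exact text after basic tidying, so it will miss definitions that mean the same thing but are worded differently. Those still need a person to review them.

```python
from collections import defaultdict


def find_conflicts(entries):
    """entries: iterable of (term, definition, source) tuples (assumed layout)."""
    definitions_by_term = defaultdict(set)
    terms_by_definition = defaultdict(set)
    for term, definition, _source in entries:
        # Normalize case and whitespace so trivial differences are ignored.
        tidy_term = term.strip().lower()
        tidy_definition = " ".join(definition.lower().split())
        definitions_by_term[tidy_term].add(tidy_definition)
        terms_by_definition[tidy_definition].add(tidy_term)

    # Same term, more than one definition.
    conflicting = {t: sorted(d) for t, d in definitions_by_term.items() if len(d) > 1}
    # Same definition, more than one term.
    duplicated = {d: sorted(t) for d, t in terms_by_definition.items() if len(t) > 1}
    return conflicting, duplicated


# Made-up rows for illustration only.
collected = [
    ("Customer", "A person who holds an open account.", "Sales"),
    ("Customer", "Anyone who has requested a quote.", "Marketing"),
    ("Client", "A person who holds an open account.", "Finance"),
]
conflicting_terms, duplicate_definitions = find_conflicts(collected)
```

Here `conflicting_terms` would flag "customer" as having two definitions, and `duplicate_definitions` would show that "client" and "customer" share one. The output is a list of conversations to have, not a decision about which definition wins.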
- The data producers do not understand what a field should be used for and enter something similar, but slightly different.
- Data consumers can often believe that data in one field represents something that it does not.
To sum up, data dictionaries are more technical in nature and tend to be system specific. A data dictionary defines data elements, their meanings, and their allowable values. A data glossary is enterprise-wide and should be created to improve the business's understanding of the data it produces and uses. A data dictionary should be a project deliverable for all system-related projects and a data glossary is a key part of a successful Data Governance framework.
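For contrast with the business-facing glossary, a data dictionary entry is tied to one system and one physical field. The sketch below shows one possible shape for such an entry; the class, its fields, and the sample system, table, and codes are hypothetical.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class DictionaryElement:
    """One column in one system's data dictionary (hypothetical fields)."""
    system: str
    table: str
    column: str
    meaning: str
    allowed_values: Tuple[str, ...] = ()  # empty means no fixed list

    def accepts(self, value: str) -> bool:
        # With no fixed list, any value passes this simple check.
        return not self.allowed_values or value in self.allowed_values


# Made-up example: a status code column in an imaginary CRM system.
status = DictionaryElement(
    system="CRM",
    table="customer",
    column="status_cd",
    meaning="Current lifecycle status of the customer record.",
    allowed_values=("A", "C", "P"),
)
```

An organization would hold many of these, one set per system, while the single glossary sits above them and explains what the business means by each term.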
Finally, if you are currently developing or are about to start to build a data glossary, the tips in this blog published on my website will help you devise a successful approach.