We know the phrase, “Beauty is in the eye of the beholder.”1 In this article, I will apply it to the topic of data quality. I will do so by comparing two butterflies, each representing a common use of data quality management: first, and most commonly, in situ for existing systems; and second, for data migration.
Why butterflies? Because they are beautiful, delicate yet resilient creatures, great in number and variety. The trick is determining how beautiful you want your data to be, which is a subjective assessment.
Both use cases apply quality management techniques to the enterprise asset that is data. They differ, though, in how those techniques are applied and scoped, which I’ll attempt to explain. This is based on my years of applicable experience in both domains and my observations and reading of content shared by many well-educated peers. None of this is new; I simply have not seen this comparison made before.
In the end, I hope you’ll either agree with my perspectives, and that’s great, or you’ll challenge my perspective and we can learn more from each other.
Before we get into the data quality comparison, we must first define the terms In Situ and Data Migration as used in this article; I’ll refer to them as Butterfly 1 and Butterfly 2 respectively.
Butterfly 1: In Situ Data Quality Management
First, let’s start with a definition of In Situ to set the context.
Definition of in situ: in the natural or original position or place.1
For this article, I’m referring to the application of data quality management to existing in-use applications. That is, you have your existing applications used for day to day purposes for the capture and curation of data aligned to business processes and outcomes.
When I use the term “applications,” think Customer Relationship Management, Asset Management, Enterprise Resource Management, or any other applications used to run a business. These are existing applications embedded in your organisation that are used day to day by a variety of staff carrying out their duties creating and curating data.
The purpose of data quality management for in situ applications is generally to improve the data for a multitude of uses, whether that be improved customer engagement, more efficient streamlined processes, more effective preventative asset maintenance, better assessment of risks and issues, and more. The more beautiful your data is here, the more it can be trusted and relied upon wherever it is used.
Butterfly 2: Data Quality Management for Successful Data Migrations
Data migration involves the transmutation of data as it is moved from a legacy source system to its more modern target replacement system.
Definition of Transmutation: change into another nature, substance, form, or condition.12
Data Migration is a complicated activity consisting of the following common activities. There may be more or fewer activities you perform in your data migrations, but this, I believe, provides enough context. Note that I’m deliberately limiting it to the migration of tabular data for simplicity.
- Identifying the source system(s) of record from which to retrieve the data to be migrated. These are often called legacy systems, as most will not remain active after the data migration activity and are instead archived as read-only for limited, restricted use.
- Identifying the conceptual model describing the information contained within the application, used to link source sets of data with target sets of data, usually informing high-level source to target table mappings.
- Developing the mapping and associated logic for the transmutation of the data from one structure to another.
- Performing gap analysis between source and target data stores so that any structural variance (e.g. field types or sizes, acceptable value ranges) can be identified.
- Developing the Extract, Transform, and Load (ETL) components to transmute and transfer data from the source application to the target application.
- Performing trial runs of loading data to the target to verify the mapping and transformation rules, proving the mechanical process, the integrity of the data for use by SMEs, and the compatibility of data across systems.
- Getting consensus along the journey, and agreement closer to the end of the journey that all is satisfactory.
- Finally loading the data into a production environment, performing the review for signoff, and undertaking a review for any further rounds of data migration.
For this article, I’m referring to the application of data quality management to the practice of migrating data from one application to be retired to its successor application that is replacing the original.
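The gap analysis step in the list above lends itself to a small illustration. The following is a minimal sketch only, assuming each table can be described as a dictionary of field specifications; `FieldSpec`, `gap_analysis`, and the example schemas are hypothetical names and made-up data, not part of any particular migration tool.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass(frozen=True)
class FieldSpec:
    """Hypothetical description of one column in a source or target table."""
    data_type: str                    # e.g. "text", "integer", "date"
    max_length: Optional[int] = None  # None when a length does not apply


def gap_analysis(source: Dict[str, FieldSpec],
                 target: Dict[str, FieldSpec],
                 mapping: Dict[str, str]) -> List[str]:
    """List structural variances for each mapped source -> target field pair."""
    gaps = []
    for src_name, tgt_name in mapping.items():
        src, tgt = source.get(src_name), target.get(tgt_name)
        if src is None or tgt is None:
            gaps.append(f"{src_name} -> {tgt_name}: field missing from source or target")
            continue
        if src.data_type != tgt.data_type:
            gaps.append(f"{src_name} -> {tgt_name}: type {src.data_type} vs {tgt.data_type}")
        if (src.max_length is not None and tgt.max_length is not None
                and src.max_length > tgt.max_length):
            gaps.append(f"{src_name} -> {tgt_name}: length {src.max_length} exceeds {tgt.max_length}")
    return gaps


# Made-up example schemas and mapping, for illustration only.
source_fields = {"cust_name": FieldSpec("text", 100), "dob": FieldSpec("text", 10)}
target_fields = {"full_name": FieldSpec("text", 60), "date_of_birth": FieldSpec("date")}
field_mapping = {"cust_name": "full_name", "dob": "date_of_birth"}

for gap in gap_analysis(source_fields, target_fields, field_mapping):
    print(gap)
```

With this example data, the sketch reports a length gap for the name field and a type gap for the date of birth field. Each reported gap would then need a mapping rule or a management decision before the trial loads.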
What is Data Quality Management?
Much has been written in a variety of media about Data Quality Management, and in this article I’ll relay some of what’s out there, which you may already know.
The DMBOK has a dedicated chapter on Data Quality Management, the 10th Data Management Function in the framework.3 There are approximately 20 valuable pages on the topic within the book. The following quoted paragraph, for me, is the appropriate all-encompassing statement that answers the question:
“Institutionalizing processes for data quality oversight, management, and improvement hinges on identifying the business needs for quality data and determining the best ways to measure, monitor, control, and report on the quality of data.” 3
Later in this article, we’ll explore each of these terms and how they are applied in similar, yet different, ways to data quality management for in situ applications and for data migration. The end goal is common: to make the quality fit enough for how the data is intended to be used. The first thing we need to understand further is the topic of Critical Data Elements (CDEs). Identifying and agreeing on CDEs is crucial to prioritising the work to be done.
“Critical Data Elements (CDEs) are defined as ‘the data that is critical to an organization’s success’ in a specific business area (line of business, shared service, or group function), or ‘the data required to get the job done.’” (13)
For in situ applications, there’s a good amount of work in identifying the CDEs, which are determined by each data element’s involvement, singly or in combination, in supporting business processes and goals. Where an application is used by more than one business unit, a cross-functional committee is often involved in identifying and agreeing on the CDEs. CDEs are only ever a subset of the fields of an in situ application, as determined by the importance of the data within.
For Data Migration activities, every source system field included in the mapping documentation can be considered a Critical Data Element, because without agreement on how each field is handled you will very likely run into unplanned issues when transferring the data to the mapped fields of the target system.
As you read the following, I want you to keep coming back to the above understanding of Critical Data Elements and relate that to the applications you use. Think about the key data fields in the application you think are important, and then think about all the other fields that you may not consider important; it’s that second set of fields that additionally become important during data migration.
What are the Dimensions of Data Quality Management?
There are many publications on the various dimensions of data quality. For example, Collibra has published 6 different dimensions;(5) DMBoK v1 has 11 different data quality dimensions(6). I will name them from the DMBoK, but for brevity, I’ll leave you to research the meaning of each:
Accuracy, Completeness, Consistency, Currency, Precision, Privacy, Reasonableness, Referential Integrity, Timeliness, Uniqueness, and Validity.
Understanding each is fundamental for whichever set of data quality dimensions you choose to measure for each CDE.
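To make the idea of measuring a dimension for a CDE concrete, here is a minimal sketch for three of the dimensions named above, applied to a single email field. The working definitions used (share of populated rows, share of distinct values, share of values matching a pattern) are my simplifying assumptions for illustration, not the DMBoK definitions, and the records and the deliberately loose email pattern are made up.

```python
import re

# Made-up customer records; "email" is treated as the CDE being measured.
records = [
    {"customer_id": 1, "email": "ann@example.com"},
    {"customer_id": 2, "email": ""},
    {"customer_id": 3, "email": "bob@example"},
    {"customer_id": 4, "email": "ann@example.com"},
]

# Deliberately simple pattern: something@something.something
EMAIL_PATTERN = re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")


def populated(rows, field):
    """Values of the field that are neither missing nor empty."""
    return [r[field] for r in rows if r.get(field) not in (None, "")]


def completeness(rows, field):
    """Share of rows in which the field is populated."""
    return len(populated(rows, field)) / len(rows) if rows else 1.0


def uniqueness(rows, field):
    """Share of populated values that are distinct."""
    values = populated(rows, field)
    return len(set(values)) / len(values) if values else 1.0


def validity(rows, field, pattern):
    """Share of populated values that match the expected pattern."""
    values = populated(rows, field)
    return sum(1 for v in values if pattern.match(v)) / len(values) if values else 1.0


print(f"completeness: {completeness(records, 'email'):.2f}")
print(f"uniqueness:   {uniqueness(records, 'email'):.2f}")
print(f"validity:     {validity(records, 'email', EMAIL_PATTERN):.2f}")
```

Each function returns a ratio between 0 and 1, which keeps different measures comparable across CDEs. Which of these are worth measuring for a given CDE remains a business decision.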
Let’s revisit the terms in the DMBoK paragraph defining Data Quality Management and how they are likely applied to in situ application use as compared to use in data migration activities.
Along with identifying the CDEs and which data quality dimensions should be measured for each, you should capture a justification of why, related to the perceived business benefits. These benefits, and the choice of measurements to use, require business decisions before they are enacted, because of the cost and value tradeoff against other business investments.
The “why” is fundamental to justifying the effort involved in designing the measures and in the activities that follow from them. The number of measures you design and define has a direct, cumulative effect on the effort and cost to be invested. Note that these are not set in stone; they can be changed based on experience working with the measures. You’ll likely start with a set of measures, and over time retire some and add new ones based on how improvements affect the data. Also note that each CDE can have different measures from those applied to related CDEs.
For in situ use, this is a more difficult challenge because you have more options and more decisions to make about which measures are important for each CDE.
For data migration use, the decisions are far simpler, as you’re trying to improve the data just enough to enable data migration into the target system and data compatibility for use in that target system.
Measures are either Quantitative or Qualitative in nature and influence how you monitor them and manage the underlying data over time.
Quantitative data are measures of values or counts and are expressed as numbers. Quantitative data are data about numeric variables (e.g. how many; how much; or how often).9
Qualitative data are measures of ‘types’ and may be represented by a name, symbol, or a number code. Qualitative data are data about categorical variables (e.g. what type).9
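The distinction between the two kinds of measure can be sketched as follows. This is an illustrative example only, assuming a small made-up asset table and a hypothetical set of agreed status values: the count of missing install years is a quantitative measure ("how many"), while the profile of status values is a qualitative one ("what type").

```python
from collections import Counter

# Made-up asset records, for illustration only.
assets = [
    {"asset_id": "A1", "status": "Active", "install_year": 2011},
    {"asset_id": "A2", "status": "active", "install_year": None},
    {"asset_id": "A3", "status": "Retired", "install_year": 1998},
    {"asset_id": "A4", "status": "TBC", "install_year": None},
]

# Quantitative measure: how many records lack an install year.
missing_install_year = sum(1 for a in assets if a["install_year"] is None)

# Qualitative measure: which status categories occur, and how often.
status_profile = Counter(a["status"] for a in assets)

# Hypothetical agreed list of valid categories.
ALLOWED_STATUSES = {"Active", "Retired"}
unexpected_statuses = sorted(s for s in status_profile if s not in ALLOWED_STATUSES)

print("missing install_year:", missing_install_year)
print("status profile:", dict(status_profile))
print("unexpected statuses:", unexpected_statuses)
```

The quantitative result can be tracked as a single number over time, whereas the qualitative result usually prompts a conversation about which categories are legitimate.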
Reporting of data quality measures aims to show the change in the CDE data over time as a result of any treatments applied. Reporting will feed into the triage process for any data quality issue identified from the measures. Reporting provides observability.
What a report contains, how it is structured, and how it is interpreted and used to effect change all depend on its audience.
Data stewards associated with in situ applications, as well as data migration team members, are interested in the raw data, along with analysis of what should be done with the records included in the report, as this helps inform the decisions on what remedial actions to take.
Management is interested in knowing quality, remedial effort, impact and thus priorities, and where fixes will be applied. They will want a formal written document that summarises the analysis provided to the data stewards, rather than a simple report.
The reporting effort and the lifetime of the reports depend on use. In data migration activities, they are limited to the duration of the data migration. For in situ use, the duration is value-dependent: a report will exist for as long as there is still value to be gained from it.
While the reporting effort is getting established, it’s important to think about monitoring the measures via the reports. You will want to automate the reporting effort as much as possible because of the volume and variety of reports to be run and results distributed to multiple people over time.
Monitoring the measures over time for trend analysis gives management a basis on which to make decisions. Monitoring is the feedback loop that shows whether any treatment applied to the data is having the desired effect.
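As a minimal sketch of that feedback loop, assume one measure for one CDE is recorded at each reporting run as a score between 0 and 1. The snapshot values below are made up, and `trend` is a hypothetical helper; the point is only to show the run-to-run change that a monitoring report would surface.

```python
from datetime import date

# Made-up monthly snapshots of one measure (e.g. completeness of one CDE).
snapshots = [
    (date(2024, 1, 1), 0.82),
    (date(2024, 2, 1), 0.85),
    (date(2024, 3, 1), 0.84),
    (date(2024, 4, 1), 0.91),
]


def trend(history):
    """Return (run date, change since the previous run) for each run after the first."""
    ordered = sorted(history)
    return [
        (later_day, round(later_score - earlier_score, 4))
        for (_, earlier_score), (later_day, later_score) in zip(ordered, ordered[1:])
    ]


changes = trend(snapshots)

# Runs where the measure got worse are candidates for triage.
regressions = [day for day, delta in changes if delta < 0]

for day, delta in changes:
    print(day.isoformat(), f"{delta:+.2f}")
print("regressions:", [day.isoformat() for day in regressions])
```

A single dip is not necessarily a problem, so in practice the threshold for raising an issue would be a management decision rather than a fixed rule in the code.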
IV. Management and Oversight
Management provides the structures of staff who receive the reports, who monitor the changes to CDE’s, and who influence or make decisions on the treatment(s) to be applied to improve the quality of data. Management can also make the necessary decisions to not implement a recommended change.
Definition of Oversight: responsibility for a job or activity and for making sure it is being done correctly14
Management can include the creation of committees responsible for stewarding improvements to the way the various parts of the organisation apply treatments to data asset issues, or introduce new, or modify existing, controls. Management should develop a triage regime for the identification of data quality issues, through to the application of controls and measures and ongoing monitoring and oversight. Ultimately, Management provides the staff with management of changes over time for the betterment of the organisation in part or as a whole.
For in situ applications, Management can review and alter staff KPIs, influencing behaviour in the creation and curation of data. Management can review policies and processes and modify these controls. Management will determine the triage priority of any data quality issues, as they may be actioned amongst business-as-usual tasks. Though there may be time pressure to correct a few critical and high priority issues at any one time, overall there is generally little urgency to address all the issues.
For Data Migration, a triage process must be defined; this is a crucial management control. It’s required because the volume of data quality issues will be far greater, and they will be identified in a relatively short period. For each data quality issue identified, management needs to provide decisions and guidance quickly, within days, on whether and how it should be remediated. A committee approach is less necessary because the goal is different: getting the data into the new system. The other in situ data quality management requirements will reappear in the new target application post-data migration.
Unlike in situ data quality management, fixing data quality issues during migration has four key options:
- Fix in Source – make data modifications in the source system of record such that it aligns with the mapping documentation.
- Fix in Flight – include additional transmutation rules into the mapping documentation and ETL activities that account for the corrective action to be performed against the data.
- Fix in Target – correct the data in the target system, the new system of record. This is usually left for staff to apply the corrections post-data migration. This is usually an option of last resort.
- Ignore – Management may decide the data should not be mapped and migrated thereby causing the data quality item to no longer be considered.
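The four options above can be modelled as a simple triage register. This is a sketch under stated assumptions: the `Fix` enum, the `Issue` record, and the two helper functions are hypothetical names of my own, and the issues listed are made up. It shows how recorded management decisions could be turned into work queues, and how issues still awaiting a decision could be surfaced.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Dict, List, Optional


class Fix(Enum):
    IN_SOURCE = "Fix in Source"
    IN_FLIGHT = "Fix in Flight"
    IN_TARGET = "Fix in Target"
    IGNORE = "Ignore"


@dataclass
class Issue:
    field_name: str
    description: str
    decision: Optional[Fix] = None  # None until management has decided


def awaiting_decision(issues: List[Issue]) -> List[Issue]:
    """Issues that still need a management decision."""
    return [i for i in issues if i.decision is None]


def work_queues(issues: List[Issue]) -> Dict[Fix, List[Issue]]:
    """Group decided issues by the chosen fix option."""
    queues: Dict[Fix, List[Issue]] = {fix: [] for fix in Fix}
    for issue in issues:
        if issue.decision is not None:
            queues[issue.decision].append(issue)
    return queues


# Made-up issues, for illustration only.
issue_register = [
    Issue("postcode", "blank for some customers", Fix.IN_SOURCE),
    Issue("status", "legacy codes need translating", Fix.IN_FLIGHT),
    Issue("fax_number", "field no longer used", Fix.IGNORE),
    Issue("dob", "free-text dates in mixed formats"),
]

for fix, queued in work_queues(issue_register).items():
    print(fix.value, "->", [i.field_name for i in queued])
print("awaiting decision:", [i.field_name for i in awaiting_decision(issue_register)])
```

Keeping undecided issues visible matters here because, as noted above, decisions are needed within days during a migration.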
Through all the above, ultimately everyone involved is looking for multiple levels of improvements to the capture, curation, and utilisation of data.
Definition of Improvement: the process of making something better or of getting better11
As a generalisation, no staff member of the organisation deliberately seeks to create poor quality data. Some people will, however, game systems for their benefit, particularly when it comes to satisfying or maximising KPIs and remuneration; the result is often poor data quality at the point of capture.
It is a balance of technical and non-technical changes that collectively contribute to the improvement of the data asset of the organisation contained within one or more applications. Some significant improvements can be done without a lot of financial investment in the change of controls, simply by changing behaviour. Sometimes small improvements are made with perceived significant financial investment. Improvement is subjective and influenced by the alignment with strategic goals.
Ultimately, the improvements sought depend on the goals of the data quality activities, whether for in situ applications or for data migration. These goals determine the investment required and the subjective level of quality that must be reached before a data quality issue can be considered resolved.
this article, the reader should be aware that many books, including the DMBoK, and online sources cover each of these topics in far greater detail. I thank those authors who have previously, and continue to, contribute their knowledge to the ever-growing community involved in data
- Merriam Webster Dictionary: https://www.merriam-webster.com/dictionary/in%20situ
- DMBoK v1 ISBN: 978-1-9355040-2-3, Page 291
- DMBoK v1 ISBN: 978-1-9355040-2-3, Page 296-297