Thoughts on Data Quality

Data quality is an elusive goal. The more we try to keep accurate records in database systems, the more it seems that at least some of that data is just plain wrong. If you work with your
organization’s data, you know how important it is for that data to be accurate… and you know how often data inaccuracies conspire to make your job more difficult.

With that in mind, I’d like to ruminate on some of the causes of data quality problems.

There are many barriers and inhibitors to data quality. Speed-to-market issues tend to hinder data quality improvement in today’s world of web-enabled, rapidly deployed systems. Instead of
operating within a well-defined data architecture that mandates data modeling and logical database design, we fall prey to the opportunistic techniques of the day.
The rapid-fire nature of project planning and application development is one inhibitor, but it is not the only one. Information overload (see SIDEBAR) and rapidly expanding data storage exacerbates
the problem. As we add more and more data without necessarily adding more people and processes and policies to ensure the accuracy of the data, data quality will continue to degrade.

 On Information Overload

Information overload is one of those terms that people just throw out there, and it is usually accepted. But I just cannot bring myself to embrace the concept that there is too much
information available. I would rather be overwhelmed with an avalanche of details than to have little to no data available – at least when it comes to “things” that I
really care about. Can you imagine ever going back to the pre-Web, pre-Google days of actually having to go to a library or pay exorbitant fees to use something like Lexis/Nexus to find
relevant information?

The propensity to copy data all over the place is another big problem. Instead of using database management systems to share data among multiple applications, we copy data instead… to files
and spreadsheets and emails and other database systems and on and on and on.

Think about it, you’ve probably done it yourself. You’ve got that customer database up on the mainframe. Maybe it exists in DB2 or perhaps even IMS. Then a new application gets funded
for development, and it needs customer data. But the users don’t want to access the customer data where it resides; instead they want their own copy. Maybe even on the same DBMS. Then a new
project comes up, and we create a copy in Oracle on Unix… and those SQL Server folks? Oh, they need a copy, too. And it is downloaded to workstations in spreadsheets and another copy is made for
the data warehouse and, oh yeah, don’t forget to copy it out to those data marts, too.

With copies of the (supposedly) same data swimming all over the place, who can guarantee its accuracy? When it changes in one system, does it get changed
everywhere? What about that user who has his spreadsheet up in one window and an application up in another window: who can ensure that s/he does not wreak havoc on the application data because the
information in the spreadsheet is outdated?

Think about maintaining duplicate data implementations like having multiple watches that do not tell the same time. Will they ever be exactly the same? Synchronizing the data can be a Herculean

Automation can help. ETL tools that automatically populate data changes from one database out to others can help to keep multiple copies of the same data synchronized. But the results will only be
as good as the automation you put in place. It is doubtful that you will automate the modification of every copy of the data throughout your organization. A much better approach would be to stop
the wild copying of data and instead share data using a DBMS. After all that is what a DBMS is designed for, you know?

Assuring data quality can be a struggle even in a centralized and homogeneous environment. Data stewards are needed to implement programs for data quality improvement. Such programs need to include
both technology and business tasks. Without executive-level business folks intimately involved in the data quality effort, it will surely fail. Implementing data quality programs is costly and
without upper level management buy-in, human nature will kick in; and we’ll start copying data hither and yon all over again.

OK, so let’s try to summarize what is required to improve data quality? At a high level, there are six components: accuracy, integrity, security, accessibility, non-redundancy, and

Accuracy involves data modeling and implementing physical databases that correlate to those models. Integrity has a twofold meaning: placing rules on the data into the DBMS that define its allowable values and assuring the backup and recoverability of the database in the event of hardware or software failures. Security involves the protection of the data from unauthorized access – as well as being able to track who did what to which data whenAccessibility means that the data is available in an optimal
manner for those who are authorized users. Non-redundancy means that data is not stored inconsistently in multiple places. And flexibility refers to the ability to modify database structures as changes are required.

Both database administration and data administration need to be involved. Data administration sets the policies and rules for data quality and database administration helps to implement them using
existing database technology. DBA tools for database change management, backup and recovery, and performance assurance can help to
achieve these goals from a physical database perspective. From a logical perspective, there are tools to automate and facilitate data modeling, metadata management, and data profiling. Of course,
tools aren’t the only answer, but they sometimes can make an impossible task at least possible.

And then there are the needed tools that do not exist. Wouldn’t it be nice to have a data movement auditing tool that told you whenever data moved anywhere within the organization? At least
that way we’d know what we were dealing with. But such a tool does not exist (could it even exist given our current complex infrastructures?). So we struggle along, doing the best we can
do… and that means data quality problems will be a sad fact of everyday life.

If you are interested in reading more about this issue, here are some good reference books to consider:

The bottom line here is that data quality is, and likely will continue to be, a pervasive problem today within modern organizations. But we can all fight the good fight to clean it up, can’t


submit to reddit

About Craig Mullins

Craig S. Mullins is a data management strategist and principal consultant for Mullins Consulting, Inc. He has three decades of experience in the field of database management, including working with DB2 for z/OS since Version 1. Craig is also an IBM Information Champion and is the author of two books: DB2 Developer’s Guide and Database Administration:The Complete Guide to Practices and Procedures. You can contact Craig via his website.