Author: Michael H. Brackett
Publisher: Addison-Wesley (ISBN 0201713063)
The second sentence of Michael Brackett’s Data Resource Quality: Turning Bad Habits into Good Practices appeared to get the book off to a bad start. In writing of most companies’
desperate data situation, he writes, “What did organizations do to get into their current disparate data situation.” Only upon further reading does it become apparent that, no, he
really does mean “disparate”. “Disparate means fundamentally distinct or different in kind; entirely dissimilar. Disparate data are data that are essentially not alike . . . They
are unequal and cannot be readily integrated. They are low-quality, defective, discordant, ambiguous, heterogenous data.”
This is as opposed to the kind of data most organizations should be using – comparate data. “Comparate data are data that are alike in kind, quality, and character, and are
without defect . . . A comparate data resource is a data resource that is composed of comparate data that adequately support the current and future business information demand.”
Mr. Brackett’s book is an excellent guide for companies beginning the task of turning their disparate data into comparate data. As a guide for improving data quality, it is less detailed
than Larry English’s Improving Data Warehouse and Business Information Quality, but it is more accessible, better organized, and a better place to start.
The book is organized around ten specific bad habits that organizations often fall into, along with the ten good practices that they should be using instead. It begins with an overview and ends
with a charge to action, but the body of it consists of one chapter for each bad habit and corresponding good practice. Each of these chapters describes the bad habit, its impact on the
organization, the corresponding good practice, and its benefits. Each chapter also ends with a summary that includes a checklist.
The writing style is clear, and the points are highlighted throughout with boxed aphorisms, such as:
[ A proper data structure provides the relevant detail for the intended audience. ]
The ten habits/practices are actually of two kinds: The first group of five are architectural – the habits that pertain to the way data are actually managed; the second group of five are
non-architectural, concerned with the environment within which data are to be managed.
The architectural habits and practices are:
- Data names – This is about the common practice of naming data informally and inconsistently. Mr. Brackett argues strongly for a standardized, consistent taxonomy of names.
- Definitions – In many organizations, definitions are vague, non-existent, unavailable, short, meaningless, outdated, incorrect, or unrelated to the context of the term’s use in the organization. Instead, every term used in the data should be given a comprehensive, meaningful, thorough, and correct definition.
- Structure – Mr. Brackett’s characterization of bad data structure includes a structure’s overload of detail, focus on the wrong audience, inadequate business orientation, and poor use of techniques. While your reviewer disagrees with the particulars of some of his suggestions, Mr. Brackett is correct in his argument for employing correct data structure components, proper detail for the audience, and a formal design approach.
- Integrity Rules – Organizations are often lax in their enforcement of data integrity rules. This includes ignoring a high data error rate, incomplete specification of the rules, delayed error identification, excessive use of default data values, non-specific definition of data domains and optionality, undefined derivations, and uncontrolled data deletion. What is required instead is a change in orientation from simply finding errors to preventing them in the first place. This argues for a formal system of addressing data integrity, including use of a formal notation, and best practices for rule enforcement.
- Documentation – The documentation of current data is often incomplete, not current, not understandable, and redundant. And this is when such documentation is available, which it often is not. Indeed, the existence of any documentation may simply be unknown. By contrast, robust data documentation is complete, current, understandable, non-redundant, readily available – and known to exist.
Your reviewer was pleased to see that Mr. Brackett is also unhappy about the current vogue term, “metadata”, which he says “has been misused and abused to the point that the real
definition is often unclear . . . It is ironic that a concept as important as understanding and using the data resource as documentation, cannot be spelled or defined consistently within the
discipline.” Instead, he simply calls the required documentation about the organization’s data “data resource data”. More important than what you call it, however, is the fact
that it must be complete, current, understandable, non-redundant, readily available, and, most importantly, known to exist.
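To make the Integrity Rules practice above a little more concrete, here is a minimal Python sketch of what "eliminating errors in the first place" might look like. It is your reviewer's own illustration, not taken from the book, and it does not use Mr. Brackett's formal notation. The element names, the status domain, and the helpers `IntegrityRule`, `violations`, and `accept` are all hypothetical. Each rule states a data element's optionality and domain explicitly, and a record is checked against the rules before it is stored rather than audited afterwards.

```python
from dataclasses import dataclass

# Hypothetical domain for a hypothetical "order_status" data element.
ORDER_STATUS_DOMAIN = frozenset({"open", "shipped", "cancelled"})


@dataclass(frozen=True)
class IntegrityRule:
    """One explicitly stated rule for a single named data element."""
    element: str                      # name of the data element the rule governs
    required: bool                    # optionality: must a value be present?
    domain: frozenset = frozenset()   # allowed values; empty means "any value"


def violations(record: dict, rules: list) -> list:
    """Return one message per rule the record breaks (empty list if none)."""
    problems = []
    for rule in rules:
        value = record.get(rule.element)
        if value is None:
            if rule.required:
                problems.append(f"{rule.element}: required value is missing")
            continue
        if rule.domain and value not in rule.domain:
            problems.append(f"{rule.element}: {value!r} is outside the defined domain")
    return problems


def accept(record: dict, rules: list, store: list) -> list:
    """Store the record only if it breaks no rule; return any violations."""
    problems = violations(record, rules)
    if not problems:
        store.append(record)
    return problems


# Illustrative rule set; these elements are assumptions made for the example.
RULES = [
    IntegrityRule("order_id", required=True),
    IntegrityRule("order_status", required=True, domain=ORDER_STATUS_DOMAIN),
    IntegrityRule("ship_note", required=False),
]
```

A caller would pass each incoming record through `accept(record, RULES, store)` and send any returned messages back to the person or system that supplied the data. Note that a missing required value is rejected rather than filled in with a default, in keeping with the book's warning about excessive use of default data values.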
The non-architectural, managerial aspects of data quality are:
- Orientation – The first assignment in data quality is for the organization to focus “on a long-term objective to meet the current and future business information demand rather than a short term objective to meet current needs.” Organizations tend to be overly physical in their approach to data architecture, and too oriented towards processes and operations. Databases are often developed independently, and without sufficient business client involvement. Instead, data should be oriented toward business subjects, with business clients actively involved in their development.
  Mr. Brackett uses this chapter to introduce his “five-tier” organization for data: strategic, tactical, operational, analytical, and predictive. This supports his conviction that, while there are different views of data, there should be a single underlying architecture.
- Availability – Availability is about the physical access to the data. Many organizations have data which are not readily accessible, not adequately protected against unauthorized use, and not backed up sufficiently to guarantee immediate recovery from a mishap. In addition, many organizations have not yet caught on that individual privacy and confidentiality are of paramount importance when dealing with data about people. Related to this is the fact that data may often be used inappropriately. Instead, data must be made available to all who need them, and protected against access and use by those who do not. They must be backed up appropriately, in a way which makes them readily recoverable. A person’s and an organization’s right to privacy must be honored. And finally, there must be a constant review of the appropriate and ethical use of the data.
- Responsibility – Data cannot be managed if responsibility for doing so is not properly identified. As a corporate resource, they must be controlled centrally, with data stewards being identified to take responsibility for each individual data element. Management procedures must be firmly in place.
- Vision – Restricted data vision is “the situation where the scope of the data resource is limited, the development direction is unreasonable, or the planning horizon is unrealistic.” An organization’s vision of data may be restricted to automated data, current data, tabular data, or business critical data only. All of these kinds of data are important, but not to the exclusion of other, equally important kinds of data that are non-critical, non-tabular, historic, or not automated.
- Recognition – “The tenth way to achieve data resource quality is through appropriate recognition of data as a critical resource of the organization and the current state of that data resource.” Companies may direct a data quality initiative at the wrong audience (too high or too low in the organization). They may focus inappropriately on financial justification, or may be looking for the proverbial “silver bullet”. They do not realize that simply automating data, or imposing arbitrary standards, will not in itself improve data quality. To achieve proper recognition of the problems involved, be sure to approach the audience that has the greatest vested interest in data quality. Get these people involved.
The last chapter summarizes the ten habits and practices, and organizes them in terms of “value chains”, providing a sequence for addressing them. Mr. Brackett correctly points out that
you cannot address these all at once: begin with the worst habits and go from there.
The book is an excellent introduction to the issues of data quality, and is sufficiently well-written to use for propaganda in your own campaign to improve data quality. If you hand out copies of
it, there is a chance that they will be read, and even if they are only skimmed, there is still value to be obtained.