Managing Reference Data in Enterprise Databases

Author: Malcolm Chisholm
Publisher: Morgan Kaufmann (ISBN 1558606971)

Do you truly understand your lookup tables; those entities which define and clarify nearly every entity model? Are they consistent across systems, or are they fed from the same source across your
enterprise’s many databases? Are the meanings clear to all who depend on this reference data? Is the same attention afforded here as transactional or analysis data feeds enjoy?

These topics, and many more, are quite competently discussed in “Managing Reference Data in Enterprise Databases”, which has the byline “Binding Corporate Data to the Wider World”.
This book is published by Morgan Kaufmann, authored by Malcolm Chisholm (a Ph.D. holder with over 20 years in IT), and has ISBN 1-55860-697-1. It is supported by two web sites, most prominently
www.refdataportal.com.

This book is designed to be read linearly, with some reference potential. The tone is conversational, and has many examples and diagrams, using IDEF1X for database diagrams.

Structurally, the book has 19 chapters, six lookup table appendixes, and a very useful matrix/questionnaire for assessing your enterprise’s reference data state. Each chapter is rather short,
and that brings us to the concern I have with the book: the conclusion section is disproportionately large and appears far too often. It seems that true book sections with lead-ins and conclusions
would have been more readable, and I hope that happens in the next edition.

The book opens with definitions to be used throughout the book, on reference data, common data, and external data. It then briefly explains the IDEF1X, as it is used in the book.

We then move to a cluster of chapters describing characteristics and behaviors of reference data. Chapter 3 studies the various reference data types you might encounter, including type codes,
status codes, constant values, global data, classification schemes, and buckets (with a strong focus on the latter). You learn reasons for encoding lists at all. Very good examples are used to
clarify throughout. It ends with a table showing describing what is not reference data and the use of flags.

Next, characteristics of reference data are identified. These include rate of change, risk, and source of updating. From there, the process of assigning identifiers (including abbreviations and
various forms of numbers) is covered. Many practical suggestions are made here.

Chapter 6 describes relationships between reference data and other reference data, particularly hierarchies. Next, issues around redundant data are found and described. In particular, syncing and
merging of data is covered (useful for company acquisition situations).

The next chapter concludes the characteristics and behaviors section. It deals with the life cycle of reference data. It starts with a great study on the impact of referential integrity rules on
reference data. It then describes the change/retirement succession path of codes and descriptions, particularly covering the use of effective dates. A great step-by-step example of a country code
table being refined is provided.

The next six chapters refer to administration of reference data. First, determining usage pattern reporting is described, particularly in foreign key situations. Next, mapping of reference data
across systems is examined, including the need for mapping, a great lesson in semantic analysis, and visual differences. As an aside, this chapter was an eye-opener for me, as I realized here that
reference data with the same identifier and description might be quite different!

In the next chapter, metadata generation is mentioned briefly. Chapter 12’s audience is programmers and their relationship to reference data. Hard coded reference data, inputting reference
data, and updates from transactions are all analyzed.

The final two chapters on administration of reference data cover specific normalization issues and populating of tables. Topics include: benefits to 3NF, merging codes and descriptions into one,
physical implementation, test versus production data, null handling, and unknown values.

The final five chapters focus on the enterprise view of reference data. First, external data handling is described, specifically various ISO tables (including country codes) and credit ratings. We
next learn about multilingual issues. The alternatives of multiple columns for each language versus a language ID in the key are analyzed carefully. Chapter 17 lets the enterprise understand why
reference data is so different from transactional data. The next chapter discusses querying for reference data.

The final chapter covers management aspects of reference data. A good business case for data stewards and their duties and areas of influence are provided. The various distribution methods of
reference data through an enterprise are identified and explained. Finally, the many areas senior management can guide reference data management are described.

This book is a very comprehensive guide to reference data. It is highly recommended!

Share

submit to reddit

About Cristof Falk

Cristof Falk has been involved, for the past 11 years, with the full life cycle of line-of-business applications, particularly finance and human resource. Programming and leveraging databases in many ways have been the means, with a greater focus over time as an analyst, to allow companies to better leverage their data asset in a user-friendly manner.

Top