We’ve been having lots of problems with one of our databases recently, and these problems serve to illustrate some fundamental truths about how to design and build databases (and how not to!).
The database in question is the database for a third-party vendor product that helps us keep track of equipment in our manufacturing plants that is connected to our network (that is, each piece of equipment has an IP address). As a DBA, I usually don’t have much to do with vendor databases. We keep them backed up, of course, so that they can be recovered if need be, and occasionally the vendor will send us update scripts to run. Other than that, we mostly don’t think about them.
But I had to think about this database – a lot – after receiving a frantic call from the application analyst responsible for supporting the product. It seems that a user had mistakenly assigned the same IP address to two different tools, and the product wouldn’t allow him to correct the mistake! I loaded a backup of the database to a test server, so the application analyst could write a SQL script to correct the problem. Second phone call: the update script updated hundreds of records, most of which were not supposed to be changed. Could I check to see why this happened?
What I discovered dismayed – but didn’t surprise – me. The database contained no data integrity constraints whatsoever (except for primary keys built on system-assigned identifiers). The lack of data integrity constraints (including referential integrity constraints between tables) is what allowed two tools to be assigned the same IP address. What I also discovered is that several of the code (minor entity) tables contained duplicate name values (and, in some cases, duplicate IP address values), which is what caused the analyst’s update script to change many more records than intended.
After laboriously figuring out how to write an update script that updated only the records that needed to be changed, I fired off an email to the vendor support rep. After describing the problems I had found in their database, I explained that databases of this sort invite an infinite (and steadily descending) cycle of data fixes, each of which results in some other problem that needs to be fixed, and so on and on. In my tenure as an application support analyst, I had to work on several such databases (which may explain my subsequent interest in data management!). I worked in one shop where the answer to every production problem was to “fix” the data in the database so that the application would stop blowing up. This fix, in turn, caused additional problems for the application; the data would be further “massaged,” resulting in more fixes, etc. This cycle of futility came to an abrupt end when one of the “fixes” deleted most of the records in the database – and there wasn’t a backup!
Most database issues come down to one fundamental question: What is a database? Far too many application developers think of a database as nothing more than a repository for application-specific data (what they call a “persistence engine”). To them, a database is a sort of “Fibber McGee’s closet”1 into which data gets stuffed and then, sometime later, retrieved. We hope that the data, by that time, is correct, timely, business-relevant, and easily consumable, but it probably won’t be. Databases built according to this definition of “database” characteristically have few, if any, data integrity constraints because developers prefer to code all such “intelligence” in the application.2
I advocate a more complete definition of “database.” A database, in my view, is not just a repository for business data; it is a repository for the business rules that constrain the data! This is why having a complete and accurate logical data model is so critical to database development – the logical data model documents the business data requirements (and the business rules) that need to be implemented, in some fashion, in the database. As I explained in my recent article,3 it is not necessary (and usually not desirable) to implement the logical data model as though it were the actual database design. Nevertheless, the database, once implemented, must encapsulate the business data requirements and business rules that are documented in the logical model.
Having a database that encapsulates – and enforces – the business rules surrounding business data helps ensure that the data in the database will always be complete and correct, and that it will always represent, at any given moment in time, a valid view of the business. Data that can be trusted to be correct, accurate and timely, that is easily consumable, and that is business-relevant (and not just application-specific) is data that can be reused, over and over again, in support of multiple applications, multiple business processes, and multiple business initiatives. This, in turn, greatly multiplies the value of the data (and the database) to the business.4
As I always say, data management is the art of turning data into business value. Data that is locked in application-specific databases (especially databases that don’t bother to enforce the correctness or business-relevance of their data) is data that is not going to be of much value to the business!
NOTE: I’d like to make this a dialogue, so please feel free to email questions, comments and concerns to me. Thanks for reading!
References:
- I probably need to explain this reference. “Fibber McGee and Molly” was an old radio comedy program in which McGee (a chronic liar) was always running to the hall closet for some tool or something he needed. Molly (his wife) would shout, “Don’t open that door, McGee!,” and all the contents of the closet would empty themselves onto McGee’s head. This is what flashes through my mind when I hear developers refer to databases as “persistence engines.”
- Larry Burns, In Defense of the Intelligent Database published August 1, 2009, in TDAN, The Data Administration Newsletter.
- Larry Burns, What is “Normal”? published August 1, 2010, in TDAN, The Data Administration Newsletter.
- I’ve written a great deal, in other forums, about the importance of increasing the ROI of data. My basic point is that when you add up the cost of designing and creating databases and database servers (and maintaining them over time), and compare this cost with the value returned by application-specific data, most companies are probably losing money on their databases!