Agile Data Design – May 2013

I was recently approached by a couple of our developers who want to create a database. They presented me with a proposed design, which looked something like this:

These weren’t the actual names, of course. What is being called “Entity” here could be a Part, a Customer, a Supplier, an Order, etc. You get the idea, though – we’re storing attributes of some entity without actually defining what those attributes are. This is an example of what I call a “data dump.” That is, the only function of these entities is to store a quantity of undefined (or minimally defined) data. Or, to use a developer term, a “persistence store.”

Those of you who have read my book1 know that I use the analogy of a hall closet to describe data stores of this sort. You know what I mean: the closet, kitchen drawer or garage workbench where everything ends up because we don’t know what to do with it. The problem, of course, is that once you’ve put something in one of these places, you no longer know you have it, can’t find it when you want it, and often end up buying another one. I actually have two belt sanders in my garage because I couldn’t find one when I needed it, and it mysteriously resurfaced after I’d gone out and bought another one.

We in data management often decry the high cost of bad (i.e., inaccurate) data. However, there are also significant business costs associated with data that you need and don’t know you have, or that you need and don’t know how to find. Businesses end up paying for the same data multiple times, and then paying the cost of maintaining (and reconciling) all those multiple copies of the truth.

Here’s my position on this: a database (at least, a back-office or business-facing database) needs to do more than just store (or persist) data. It also needs to perform the following necessary business functions:

  • It needs to show the business what data it has available to it, in a form that is business-relevant, easy to understand, and easy to work with.
  • It needs to represent the meaning and purpose of this data in business terms.
  • It needs to enforce the business rules and definitions that constrain the data.
  • It needs to enable quick and easy access to the data for authorized users (that is, it must support performance, ease of use, and security requirements).

Given these criteria, it’s easy to see some of the problems around this sort of “data dump” database design: it hides all the data, so you don’t know what data you have; you can’t enforce any relation-based rules or constraints (e.g., a reference to a code table of allowed values); you can’t enforce unique constraints; and you can’t create indexes on anything to improve performance. In a proper back-office database design, you would at the very least need to call out the specific attributes that:

  • Are necessary to define the essential characteristics of the entity
  • Are necessary to define the business rules and constraints associated with the entity (this is particularly important when a surrogate primary key is used; there must be a unique constraint defined on the natural, or alternate, key)
  • Are necessary to support foreign key (FK) relationships with other entities
  • Are necessary to support indexes that allow quick and easy access to data

Note that I’ve made a couple of references here to “back office” databases. These are databases that exist for the purpose of supporting essential business-facing activities such as market analysis, process improvement, customer relationship management, supply chain management, quality improvement and regulatory compliance. Data that supports these essential business functions must be carefully designed, and implemented in a way that enables it to be used and repurposed (by business users) with ease and confidence.

NoSQL databases (databases that merely store data for application-specific use, and don’t attempt to constrain or characterize the data in any way) are not suitable for this sort of back-office use. However, they can be very useful as a high-performing front end to application-specific data collection and data analysis activities. Most online web sites, for example, use NoSQL or XML databases to collect customer order data, which is then sent to back-office databases via web services or XML updategrams. NoSQL databases are also useful in other types of data storage scenarios, for example:

  • In “Big Data” scenarios, where you need to analyze data for which you either don’t know its properties or characteristics, or are trying to identify properties and characteristics that may be obscured by an existing structure.
  • To support data transfer using an Enterprise Service Bus (ESB), where data cannot be rejected if it doesn’t meet validation criteria.
  • As an Operational Data Store (ODS), which stores data in support of transactional applications and serves as a staging area for data warehouses or data marts. One of our divisions has used the “data vault” architecture of Dan Linstedt (which is basically a relational implementation of NoSQL) to implement its ODS.

Having said all this, let me make one additional point: There may be cases where, even in a back-office database, you need to capture miscellaneous attributes about an entity that don’t define any essential characteristics of the entity, and are not needed to support any business rules, constraints, relationships, or indexes. Can you use the “Entity Attribute” approach to store these miscellaneous attributes?

I would say yes, with a caveat or two. Take a look at this slightly modified design:

Notice that I’ve added three properties to the design of Descriptive Attribute: the first is a Required Indicator, which is set to 1 to indicate that a non-null value is always required for this entity (Note: Attribute Value Text is defined as nullable, since NULL is a potentially valid value for an entity attribute).

The second property is a validation rule for the attribute, in the form of a regular expression, or regex. For example, if you have a license plate number of the form ABC1234, the regex validation rule for this attribute would be as follows:

[A-Z]{3}[0-9]{4}

This says the value must contain exactly three capital letters, followed by exactly four numbers. Some DBMSs provide built-in support for pattern matching of data values using regular expressions; in other DBMSs, extended stored procedures written in application programming languages (such as CLR integration in Microsoft SQL Server) can be used.2,3

The third property is a Unit of Measure indicator, which maps the attribute value to a unit of measure such as MPH (miles per hour), KG (kilograms), LB (pounds), INT (integer number), DEC (decimal number), etc. For decimal numbers, you could also add a property denoting the number of decimal positions.

By defining the properties of Descriptive Attribute carefully (and enforcing them in the database using, say, an Insert/Update trigger on the Entity Attribute table), you can create an implementation of this “Entity Attribute” design that allows for easy storage of miscellaneous entity data but still permits you to properly define and constrain this data. There are a couple of advantages to doing this: you don’t have to refactor the data model and database schema for every miscellaneous attribute that’s added, and you don’t end up with hugely “vertical” entities containing possibly hundreds of attributes.

You will, however, want to create a view in the database to allow business users to see a composite view of this data, showing all of the data attributes that are available for a given entity.

NOTE: I’d like to make this a dialogue, so please feel free to email questions, comments and concerns to me at Larry_Burns@comcast.net. Thanks for reading!

End Notes:

  1. Burns, Larry. Building the Agile Database  (Technics Publications, 2011).
  2. The Code Project has done some excellent work implementing regular expression matching in Microsoft SQL Server using CLR integration. Documentation and a free download of the code are available here: http://www.codeproject.com/Articles/42764/Regular-Expressions-in-MS-SQL-Server-2005-2008.
  3. In MS SQL Server 2008 R2 and higher, built-in regex pattern matching functionality is available through Master Data Services. Information is available on MSDN at the following link: http://msdn.microsoft.com/en-us/library/ee633712(SQL.105).aspx.

References:

  1. I should also point out that this sort of “Entity Attribute” data model pattern is not new. You can find an example of it in David Hay’s book Data Model Patterns (1996, Dorset House Publishing). Check out the “Specification” data model pattern on pages 104-105. You can see this pattern repeated with “Observation” (pp. 158-161) and “Summarization” (pp. 162-165).
  2. Len Silverston also makes use of this pattern in his Data Model Resource Book (Volume 1. 2001, John Wiley and Sons). You can see it referenced in his “Product Feature” pattern (pp. 76-77).

Share

submit to reddit

About Larry Burns

Larry Burns has worked in IT for more than 25 years as a database administrator, application developer, consultant and teacher. He holds a B.S. in Mathematics from the University of Washington and a Masters degree in Software Engineering from Seattle University.  He currently works for a Fortune 500 company as a database consultant on numerous application development projects, and teaches a series of data management classes for application developers.  He was a contribut0r to DAMA International’s Data Management Body of Knowledge (DAMA-DMBOK), and is a former instructor and advisor in the certificate program for Data Resource Management at the University of Washington in Seattle.  You can contact him at Larry_Burns@comcast.net.

Top