Agile Data Design – May 2012

I’m about to go into a week-long requirements workshop for a major application development effort. As I think about doing the data modeling for this project, I find myself thinking again about what a data model is, and is supposed to be.

Coincidentally, I’ve just started reading Steve Hoberman’s reissue of William Kent’s classic work Data and Reality (Technics Publications, 2012), and am enjoying Kent’s thoughtful ponderings and Steve’s incisive commentary on the nature and purpose of data modeling.

As a data practitioner who works on Agile software development projects, I have often been confronted by developers who feel that data modeling is a pointless activity. After all, since requirements are constantly evolving, there cannot be any such thing as a “correct” data model (that is, a model that completely embodies all the data requirements for a project). Therefore, shouldn’t we simply allow the data properties of application object classes to evolve along with the functional requirements, and regard the database as nothing more than a persistence engine for application-specific object data?

This leads us into several issues that I have addressed in my book, and in previous TDAN articles:

  • Organizations have data and information needs that lie outside the functionality of specific applications (e.g., customer relationship management, supply chain management, quality improvement, regulatory reporting).
  • Application-specific databases prevent data from being reused in ways that can benefit organizations and increase the ROI of both data and applications.
  • Data is often persisted in application databases in ways that mask (or violate) the business’s understanding of data and business data rules. My book provides an example of a time reporting application written by developers who didn’t understand that a timesheet record had to be a unique instance of a project, a task, an employee, and a week (with disastrous consequences!).1
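
The timesheet rule in the last point can be sketched as a composite uniqueness constraint declared in the database itself. This is only an illustration, not the schema from the book: the table and column names (`timesheet`, `project_id`, `task_id`, `employee_id`, `week_start`, `hours`) and the `record_time` helper are assumptions made for the example, and SQLite is used simply because it ships with Python.

```python
import sqlite3


def create_timesheet_table(conn):
    """Create a hypothetical timesheet table whose key encodes the business rule."""
    conn.execute(
        """
        CREATE TABLE timesheet (
            project_id  INTEGER NOT NULL,
            task_id     INTEGER NOT NULL,
            employee_id INTEGER NOT NULL,
            week_start  TEXT    NOT NULL,  -- ISO date of the first day of the week
            hours       REAL    NOT NULL,
            -- Business data rule: one record per project, task, employee and week.
            UNIQUE (project_id, task_id, employee_id, week_start)
        )
        """
    )


def record_time(conn, project_id, task_id, employee_id, week_start, hours):
    """Insert one timesheet record; return False if the key already exists."""
    try:
        with conn:  # commits on success, rolls back on error
            conn.execute(
                "INSERT INTO timesheet VALUES (?, ?, ?, ?, ?)",
                (project_id, task_id, employee_id, week_start, hours),
            )
        return True
    except sqlite3.IntegrityError:
        # The database rejected a duplicate (project, task, employee, week).
        return False


if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")
    create_timesheet_table(conn)
    print(record_time(conn, 1, 10, 100, "2012-05-07", 8.0))  # first entry for this key
    print(record_time(conn, 1, 10, 100, "2012-05-07", 4.0))  # same key again
    print(record_time(conn, 1, 11, 100, "2012-05-07", 4.0))  # different task
```

Declaring the rule once in the data structure means every application that writes to the table is held to it, rather than each application having to rediscover and re-implement it.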

More to the point, the developer view of data modeling embodies at least two erroneous (and contradictory) assumptions: that a data model is either (a) a complete, correct, unchanging and totally understood representation of an organization’s data, or (b) merely the design for the application database. In my opinion, neither of these views is correct! Most data practitioners understand that an organization’s data requirements are never completely understood, that there is no single “correct” way of modeling them, and that data models are much more than a representation of how data will be stored in an application database.

What then, is the purpose and value of a data model? Here are a few principles from my book that I’ve found useful in explaining the value of data modeling to developers:

  • Data models capture essential information about project requirements (specifically, business data requirements and business data rules) that is not captured in process or functional requirements. For example, the developers of the time reporting application created use cases or user stories around requirements for entering and updating timesheet data. But because they didn’t capture the business data requirements, their application didn’t work!
  • Data models are useful for communicating what is known (and assumed) about data characteristics, relationships, and constraints. A data model isn’t so much a model of an organization’s data; it’s a model of what we know or assume to be true about it.
  • Data models serve as a springboard for analysis and collaboration among project stakeholders. In Agile projects I’ve worked on, I’ve been able to use the data model (as it evolved throughout the project) to drive discussions between business owners, architects, developers and subject matter experts about the best way to turn enterprise data into actionable (and profitable) information. This is, after all, the purpose of an application!
  • Data models provide the basis for creating quality assurance (QA) specifications and test cases. After all, any given application process takes data as input, and produces either data or information as output. The data model provides QA leads and testers with valuable information about how to test data inputs and outputs.
  • Data models provide the architectural basis for creating databases (or other data structures) that can be reused across multiple applications and business processes. This enables enterprise data to be maintained in a form that can be called on at any time to serve any purpose that profits the enterprise (which is the ultimate goal of data and information management).

More fundamentally, I believe that a data model represents not information per se, but rather the potential for information. This is why I have a certain amount of disagreement with William Kent’s use of maps as an analogy for data modeling. Steve Hoberman summarizes Kent’s view as follows:

A map simplifies a complex geographic landscape in the same way that a data model simplifies a complex information landscape.2

I have a couple of problems with this analogy: First, the purpose of a map is not solely, or even primarily, to create a representation of reality. When I use a map, it’s for the purpose of helping me decide how best to get from Point A to Point B. Therefore, a data model, in my view, doesn’t exist primarily for the purpose of representing data in the same way that a map represents a geographic location (although it certainly does this, at least to some extent). The value of a data model is that it shows you what information is possible, given what is known or assumed about the data, in much the same way that looking at a map can show you what destinations are possible given your current location.

My second problem with the map analogy is that a map has to be complete and correct (within the boundaries of its simplified representation) in order to be useful. If I want to drive from Seattle to Portland, a map that stops at Tacoma is not going to be useful to me. Moreover, any change in the reality represented by the map (for example, the changing of a street name) can render the map valueless. A data model, by contrast, does not have to be 100% complete and correct in order to be of value. Again, a data model is not a complete, correct and immutable representation of reality; it is an expression of what is known or assumed about data and the information that can be produced from it, and this knowledge can be communicated, shared, challenged and updated throughout the development process. In this way, the data model drives value throughout the development effort, even as it is being developed.

To put this another way, the primary value of a data model is to create a shared understanding, among all project or process stakeholders, about what is currently known or assumed about the data, and about the business data requirements and business rules that constrain the data. This helps ensure that development of business and application processes is done in a way that enhances, not diminishes, the value of data to the organization.

In this respect, the purpose of a data model is not to be “correct”, any more than the purpose of a philosophy is to be “correct” (although the observations and analysis that drive the development of each should be as thorough and accurate as possible). The primary value of both lies in what they contribute to our understanding of some problem domain, and our ability to articulate solutions to these problems. Ideally, this understanding should not be merely abstract knowledge, but should enable and impel us to make responsible decisions and to take appropriate action.

NOTE: I’d like to make this a dialogue, so please feel free to email questions, comments and concerns to me. Thanks for reading!

References:

  1. Burns, Larry. Building the Agile Database (Technics Publications, 2011), pp. 112-114.
  2. Kent, William. Data and Reality, 3rd Edition (Technics Publications, 2012), p. 37.

About Larry Burns

Larry Burns has worked in IT for more than 25 years as a database administrator, application developer, consultant and teacher. He holds a B.S. in Mathematics from the University of Washington and a Master’s degree in Software Engineering from Seattle University. He currently works for a Fortune 500 company as a database consultant on numerous application development projects, and teaches a series of data management classes for application developers. He was a contributor to DAMA International’s Data Management Body of Knowledge (DAMA-DMBOK), and is a former instructor and advisor in the certificate program for Data Resource Management at the University of Washington in Seattle. You can contact him at Larry_Burns@comcast.net.
