In one of my previous articles, Agile: The Good News, The Bad News, I introduced my approach to doing database development in an iterative (“Agile”) context. I emphasized that, although Agile is a proven application development methodology with many positive and valuable benefits, it is also an extremely fast-paced and highly collaborative methodology, and it requires a different set of processes (and tools) than traditional approaches, as well as a different mind-set. In this article, I’d like to lay the groundwork for a different approach to database development, one that better supports an iterative development process.
It’s been said before, but it needs to be repeated that the biggest problem we face in data management is confusion between the logical (business owner/user) view of data and the physical(application) view. This confusion is at the heart of what Agile practitioners such as Scott Ambler refer to as the “object-relational impedance mismatch.” When we create a logical data model, we are modeling the business’ view of the data, and trying to make sure we understand the business data domain and the business data requirements. We are not designing the database here! In the physical database design (a step that too often is skipped over in application development), we map the logical (business) data requirements to a choice of implementation (database? XML? flat file?), architecture (server-based? Web-based?), and technology (relational? object?), in a way that meets both business and application requirements. We cannot assume that every business data need must be satisfied with a database, that all databases must be server-based, or that all databases must be relational.
We also need to weigh considerations of security, data integrity, performance, data reusability, and maintainability in our database design. These are what I refer to as the “PRISM”Principles:
Reusability (application independence)
Integrity (of business meaning and value)
Security (of data and information)
Maintainability (of databases and applications)
Generally speaking, in physical database design we want to create a highly cohesive, loosely coupled database schema that is reusable (that is, the same database can be used to support multipleapplications and multiple business uses), and we want to avoid creating a database that is tightly coupled to a specific application (or an application that is tightly coupled to a database schema).This can lead to the following problems:
- The database schema can’t be changed without breaking the application.
- The application becomes platform-bound, and thus difficult to scale or redeploy (as, for example, to a service-oriented architecture).
- It becomes difficult to extend the database design to support new application functionality.
- The business fails to realize the economic value of data reuse.
In application development projects that I’ve been involved in over the years, I’ve seen how the confusion between the logical and the physical can impede application development projects, waste time and money, lead to bad application (and database) design, and cause lots of friction between developers and DBAs. Here are some examples:
In another project, we painstakingly designed and built a relational database for an application, only to discover later that the application didn’t need a database at all!
In some projects, mostly involving outside contractors, either we see no logical model at all (i.e., no attempt to look at the data from a business standpoint), or we see a “logical model” that is nothing more than an application-specific database schema. In one project, we were given a “logical data model” that looked correct from a business standpoint, but it bore no resemblance to anything in the actual database. I suspect they were only humoring us.
In a couple of projects I described in my article, Too Simple, Too Soon, the failure to create a logical model resulted in databases in which data for critical attributes became lost or corrupted. Ironically, these attempts to shield application developers from the complexity of database design actually resulted in more work for the developers, and in slower performance for the application!
So, having recognized this problem, how do we solve it? I would like to suggest a model, or paradigm, for the interaction of applications and data, based on models such as the TCP/IP stack (in which higher-level communication protocols are built on top of lower-level ones). I call this model the Data Services Stack:
We start at the bottom, in the Requirements level. We capture the business data requirements, as part of the normal requirements-gathering process (which our data analysts participate in, along with the application analysts and business analysts). The high-level business entities and relationships are documented in a conceptual data model. As requirements are further developed and refined, the logical data model is developed, documenting the attributes (characteristics) of each entity.
We then develop the physical design of the database, based on (but not necessarily identical to) the logical model. For some databases (especially, those that support transaction-based applications and that contain data whose reuse is critical to the business), this will result in a database schema that is more normalized (and thus, closer in appearance to the logical model). For other databases (e.g., reporting, analytic, content-based and rule-based), the physical design may not resemble the logical model. From the physical design, the base schema of the database (the tables, indexes and constraints) are created.
(Parenthetic comment: one reason for the confusion between the logical model and the physical database design may be that, for relational databases, they are often developed using the same modeling tool, which then generates the DDL to create the database schema. While these tools can save the DBA considerable time and work, care must be taken not to get too locked-in to a too normalized viewof the world.)
Now, just for fun, let’s go to the top of the model. Most applications consist of a user interface (screen-based, form-based, or web-based), on top of a business layer (which contains the application logic). In the business layer, application objects (instances of object classes) interact with the user, and with each other, via methods (things that the objects know how to do). These objects have properties (characteristics), which are the data that the application needs to work with. Therefore, most applications have an application data layer, which usually (but not always) interacts with a database to get the data needed for an object, or to update the database when objects are created (or deleted), or when object properties are changed.
The so-called “object-relational impedance mismatch” occurs when the application data layer (which is object-based) tries to interact with the database (which is usually relational, and often normalized). Data in the database (which may be in several tables) must be joined together before passing it to an application object (which can slow down performance). Data in the application object must be “shredded” (parsed) before it can be updated in the database. How can we resolve this dilemma?
One common approach is to denormalize the database to fit the application class model. This, however, results in an application-specific database, and several of the problems described above: an application tightly coupled to the database schema, a database that can’t be changed without breaking the application, corrupted and non-reusable data, and (sometimes) performance issues fromlocating too much of the data in the same place.
Another approach is to accept (and in many cases mandate) the impedance mismatch. This, however, impedes application development (resulting in lost time, work and money), introduces errors and performance problems into the application, and often results in hostility between the development group and the data group.
The solution I propose is the introduction of what I call the Virtual Data Layer (VDL), which sits between the database schema (base tables) and the application data layer. The VDL consists of a set of views (both base views, which map to single tables, and joined views, which map to multiple tables), stored procedures,functions, triggers, and work tables (or materialized views), which serve to mask the complexity of the underlying database schema from the application, and present data in a form that can be easily consumed (and updated) by the application. As I described in my article, The Key to Database Agility,this approach has several advantages:
- It eliminates the object-relational impedance mismatch (by creating virtual data objects that map to application object classes).
- It reduces coupling between the application and the database.
- It simplifies application code (fewer complex joins to code).
- It increases application performance (since the database code is optimized for quick execution, and runs on the database server, not the application server).
- It reduces network traffic and bandwidth (since less data has to travel back and forth between the application and the database).
Most importantly, it increases “agility” by making refactoring easier. Objects in the Virtual Data Layer can be created and refined incrementally, and they are much easier to change“on the fly” than the database schema (see, for example, Chapter 12 of Scott Ambler’s book Agile Database Techniques for an eye-opening look at what’s involved in refactoring a database schema). Although the current Agile approach to database development involves continual refactoring of the database schema (as in Scott’s book), this is actually the least Agile approach!
In future articles, I’ll talk more about the Virtual Data Layer, how it works, and the benefits it provides to both DBAs and application developers. I’ll describe the process of moving from the logical data model to a well-designed physical database, and I’ll describe how (and when) denormalization can be safely done.
NOTE: I’d like to make this a dialogue, so please feel free to email questions, comments and concerns to me. Thanks for reading!