Introduction To Database Refactoring

Published in TDAN.com July 2006

Material for this article was modified from Refactoring Databases: Evolutionary Database Design by Scott Ambler and Pramod Sadalage (Addison Wesley 2006). www.ambysoft.com/books/refactoringDatabases.html


Abstract:

A database refactoring is a small change to a database schema which improves its design without changing, at a practical level, the semantics of the database. In other words, it is a simple
database transformation which neither adds nor breaks anything. The process of database refactoring defines how to safely evolve a database schema in small steps. Database refactoring enables data
professionals to work in an evolutionary manner, just as modern application developers do. It also provides a coherent strategy for organizations to dig their way out of the legacy database hole.


1. What is Database Refactoring?

In the seminal text Refactoring, Martin Fowler [1] describes the programming technique called refactoring, which is a disciplined way to restructure code in small steps. Refactoring
enables you to evolve your code slowly over time, to take an evolutionary (iterative and incremental) approach to programming. A critical aspect of a refactoring is that it retains the behavioral
semantics of your code. You do not add functionality when you are refactoring, nor do you take it away. A refactoring merely improves the design of your code – nothing more and nothing less.

A database refactoring [2, 3] is a simple change to a database schema that improves its design while retaining both its behavioral and informational semantics – in other words, you cannot add new
functionality or break existing functionality, you cannot add new data, and you cannot change the meaning of existing data. A database schema includes both structural aspects, such as table and
view definitions, and functional aspects, such as stored procedures and triggers. I use the terms code refactoring to refer to traditional refactoring as described by Martin Fowler and
database refactoring to refer to the refactoring of database schemas. The process of database refactoring is the act of making these simple changes to your database schema.


2. Why Database Refactoring?

There are two fundamental reasons why you want to adopt database refactoring:

  1. To repair existing legacy databases. Database refactoring enables you to safely evolve your database design in small steps, making it an important technique for improving the
    legacy assets within your organization. This is clearly much less risky than a “big bang” approach where you rewrite all of your applications and rework your database schema and release them
    all into production at once. Furthermore, it is much better than the “let’s try not to allow things to get any worse” strategy currently employed by the vast majority of data management groups
    which I’ve run into, a strategy which has no hope of success because all it takes is one development team to go around the data management group and do an imperfect database design.
  2. To support evolutionary software development. Modern software development processes, including the Rational Unified Process (RUP), Extreme Programming (XP), Agile Unified Process
    (AUP), Scrum, and Dynamic System Development Method (DSDM), are all evolutionary in nature. Craig Larman [4] summarizes the research evidence, as well as the overwhelming support among the
    thought leaders within the IT community, in support of evolutionary approaches. Unfortunately, most data-oriented techniques are serial in nature, relying on specialists performing relatively
    narrow tasks, such as logical data modeling or physical data modeling. Therein lies the rub – the two groups need to work together, but both want to do so in different manners. I believe that
    data professionals need to adopt evolutionary techniques, such as database refactoring, which enable them to be relevant to modern development teams. Luckily these techniques exist [3], and they
    work quite well, it is now up to data professionals to choose to adopt them.


3. Implementing a Database Refactoring

Sometimes a project team finds itself in a relatively simple, “single-application database” situation, and if so they should consider themselves lucky. With this simple architecture database
refactoring is fairly simple – you merely change your database schema and update your application to use the new version of the schema. What is more typical is to have many external programs
interacting with your database, some of which are beyond the scope of your control. In this situation you cannot assume that all the external programs will be deployed at once, and must therefore
support a transition period (also referred to as a deprecation period) during which both the old schema and the new schema are supported in parallel. For the rest of this article I will assume that
you’re in this situation.

To put database refactoring into context, let’s step through a quick example. You have been working on a banking application for a few weeks and have noticed something strange about the Customer
table depicted in Figure 1[1] – one of the column names isn’t easy to understand. You decide to apply the Rename
Column
refactoring to the FName column to rename it to FirstName.

alt
Figure 1. The initial database schema for Customer.

Agilists typically work together in pairs; one person should have application programming skills and the other data skills, and ideally both people have both sets of skills. The pair begins by
determining whether the database schema needs to be refactored. Perhaps the programmer is mistaken about the need to evolve the schema, and how best to go about the refactoring. The refactoring is
first developed and tested within the developer’s sandbox. When it is finished, the changes are promoted into the project-integration environment, and the system is rebuilt, tested, and fixed as
needed.

To apply the Rename Column refactoring in the development sandbox, the pair first runs all the tests to see that they pass. Next, they write a test because they are taking a Test-Driven
Design (TDD) approach [5, 6, 7]. A likely test is to access a value in the FirstName column. After running the tests and seeing them fail, they implement the actual refactoring. To do this
they introduce the FirstName column and the SynchronizeFirstName trigger as you see in Figure 2.

alt
Figure 2. The database schema during the transition period.

The trigger is required to keep the values in the columns synchronized – each external program accessing the Customer table will at most work with one but not both columns. At first, all
production applications will work with FName, but over time they will be reworked to access FirstName instead. There are other options to do this, such as views or synchronization
after the fact, but I find that triggers work best.

The FirstName column must be populated with values from the FName column. You then need to run both columns in parallel during a “transition period” of sufficient length to give
the development teams time to update and redeploy all of their applications. This transition period could be several years in length, depending on the ability of your project teams to get new
releases into production. In this case we’ve decided that the transition period will run to November 14, 2007.

The pair reruns the tests and sees that they now pass. They then refactor the existing tests to work with the FirstName column rather than the FName column. Once the database
refactoring is completed in their development work environment, the pair promotes their work into the team’s integration sandbox where they rebuild and rerun the tests, fixing any problems which
they find. To update the database schema, the pair runs the appropriate change and migration scripts in the appropriate order.

This promotion strategy continues into your pre-production integration testing environment and then eventually into production. Depending on your need, you could implement and then deploy the
refactoring within a single day, although more realistically it would be several months until the next major release of your application that you would deploy the refactoring along with any other
updates that you’ve made.

After the transition period, you remove the original column plus the trigger(s), resulting in the final database schema of Figure 3. You remove these things only after sufficient testing to ensure
that it is safe to do so. At this point, your refactoring is complete.

alt
Figure 3. The final database schema for Customer.

There is a little more to successfully implementing a database refactoring than what I’ve described. You need a way to coordinate the refactoring efforts of all the development teams within your
organization, clearly something that may prove quite difficult. You also need to get good at deploying refactorings in production, once again coordinating the efforts of several teams. In
Refactoring Databases [3], my co-author Pramod Sadalage and I discuss several strategies for doing each of these things.


4. Why Not Just Get it Right to Begin With?

I am often told by existing data professionals that the real solution is to model everything up front, and then you would not need to refactor your database schema. Although that is an interesting
vision, and I have seen it work in a few situations, experience from the past three decades has shown that this approach does not seem to be working well in practice for the overall IT community.
The traditional approach to data modeling does not reflect the evolutionary approach of modern methods such as the RUP and XP, nor does it reflect the fact that business customers are demanding new
features and changes to existing functionality at an accelerating rate. The old ways simply aren’t sufficient any more, if they ever were [8].

I suggest that you take an Agile Model-Driven Development (AMDD) approach [9, 10], in which you do some high-level modeling to identify the overall “landscape” of your system, and then model
storm the details on a just-in-time (JIT) basis. Take advantage of the benefits of modeling without suffering from the costs of over-modeling, over-documentation, and the resulting bureaucracy of
trying to keep too many artifacts up-to-date and synchronized with one another. Your application code and your database schema evolve as your understanding of the problem domain evolves, and you
maintain quality through refactoring both.


5. In Conclusion

Database refactoring is a database implementation technique, just like code refactoring is an application implementation technique. You refactor your database schema to ease additions to it. You
often find that you have to add a new feature to a database, such as a new column or stored procedure, but the existing design is not the best one possible to easily support that new feature. You
start by refactoring your database schema to make it easier to add the feature, and after the refactoring has been successfully applied, you then add the feature. The advantage of this approach is
that you are slowly, but constantly, improving the quality of your database design. This process not only makes your database easier to understand and use, it also makes it easier to evolve over
time; in other words, you improve your overall development productivity.

My experience is that data professionals can benefit from adopting modern evolutionary techniques similar to those of developers, and that database refactoring is one of several important skills
that data professionals require. Evolutionary development has arguably become the norm within the IT community, and agile software development approaches extend evolutionary methods to become more
effective. My advice to data professionals is to take evolutionary and agile concepts and techniques seriously: they’re real, they work, and they’re here to stay.


6. References and Recommended Reading

  1. Fowler, M. (1999). Refactoring: Improving the Design of Existing Code. Menlo Park, California: Addison Wesley Longman, Inc.
  2. Ambler, S.W. (2003). Agile Database Techniques: Effective Strategies for the Agile Software Developer. New York: John Wiley & Sons. www.ambysoft.com/books/agileDatabaseTechniques.html
  3. Ambler, S.W. and Sadalage, P.J. (2006). Refactoring Databases: Evolutionary Database Design. Boston: Addison Wesley. www.ambysoft.com/books/refactoringDatabases.html
  4. Larman, C. (2004). Agile and Iterative Development: A Manager’s Guide. Boston: Addison-Wesley.
  5. Astels D. (2003). Test Driven Development: A Practical Guide. Upper Saddle River, NJ: Prentice Hall.
  6. Beck, K. (2003). Test Driven Development: By Example. Boston, MA: Addison Wesley.
  7. Ambler, S.W. (2004). Introduction to Test Driven Development (TDD). www.agiledata.org/essays/tdd.html
  8. Ambler, S.W. (2004). The Agile Data Home Page. www.agiledata.org.
  9. Ambler, S.W. (2002). Agile Modeling: Best Practices for the Unified Process and Extreme Programming. New York: John Wiley & Sons.
    www.ambysoft.com/books/agileModeling.html
  10. Ambler, S.W. Agile Model Driven Development (AMDD). www.agilemodeling.com/essays/amdd.htm

Share

submit to reddit

About Scott Ambler

Scott W. Ambler is the Practice Leader – Agile Development within IBM’s Methods Group.  His personal home page is www.ambysoft.com/scottAmbler.html and he maintains several content-rich sites, including www.agiledata.org, www.agilemodeling.com, and www.enterpriseunifiedprocess.com.  He is the author of several books, including Refactoring Databases and the award winning Agile Database Techniques, and is a senior contributing editor with Dr. Dobb’s Journal.

Top