Data Modeling in Agile Projects
This is the third in a series of columns about applying Agile techniques to data projects. The column’s goal is to share insights gained from experiences in the field through case studies. Painful experiences sometimes lead to powerful lessons learned and many lessons are hard won. This series will describe both how to avoid the pain and how to achieve success with winning strategies.
Please visit — Tips and Tricks from the Trenches Column 1 and Column 2 of the series.
In the last two installments, we discussed some ideas for data modelers who are working on an Agile project. This article will introduce three strategies to make data modeling work within an Agile framework, and will elaborate on the first.
—– —– —–
Minimizing the Impact of Change
The last two articles discussed the fact that change is inherent in the Agile process. Requirements are not completely known up front, but are discovered iteratively throughout the project. Changing a data model is usually difficult because of the dependencies involved: change one part of a data model and often other structural elements will need to change, and the code that references the model will have to change as well.
I have worked on several Agile projects using some different variations of Agile. In the last column, I described a painful experience with a long and arduous project fraught with many interdependencies. The data model had been changed so many times that it no longer resembled a data model in any semblance of “normalcy.” We want to be able to prevent this and discover a way to facilitate change. What strategies can we use to help mitigate changes to the model, minimize the effects of change, or even anticipate changes in advance and therefore be prepared for them?
There are three basic strategies that can be coupled with an Agile approach to assist data modelers in making more Agile data models:
- Use Generic Modeling
- Use Data Model Patterns
- Use a Business Rules Approach
These techniques are not new, and have been discussed in various forms in TDAN and other venues over the years. This column will look at these techniques through “Agile-colored glasses” to understand how the data modeler on an Agile project might incorporate one or more of these strategies to obtain a more Agile data model. This column will specifically focus on the first strategy, Generic Modeling. Subsequent articles in this series will address the other two.
Generic Modeling
Generic modeling is accomplished by using the most general tables that are practical for the problem. Software vendors use this technique in various forms to permit the user to customize the product. For example, Kalido is a tool that builds an application from a business model that the user supplies. Kalido provides the environment and tools for the user to create a business model, and then the development environment generates an application and a data warehouse for decision support. Kalido accomplishes this by using a generic data model at its core to store the customer’s business model. This enables Kalido to generate its application based on virtually any business model that the customer can dream up. This is a very ingenious approach.
The most basic generic data model is a thing with a relationship to itself. See Figure 1.
This model is the ultimate in flexibility; it can model anything! But it is not very useful because it is not expressive. The relationship is shown as many-to-many, and by definition it involves more than one thing (the thing has an association with either the same thing or another thing), and/or it has attributes. Therefore, the next step is to create an intersection entity; the association becomes the intersection entity. So now we have Figure 2, which shows that an association links two things together (and it can link a thing to itself). Every row in the Association table will have a Thing1 value and a Thing2 value. No connection to Dr. Seuss!
Suppose we are modeling a college, and things important to the business include professors and courses, among other things. If the focus of your application is modeling the specific relationship between Teachers and Courses, you would start out with the model shown in Figure 3:
However, this is an Agile project, and we must also store information about many other things, not just Teachers and Courses. Therefore, if you consider the Thing table, it can store the following data about Courses and Teachers:
Thing ID | Thing Name |
T1 | Intro to Geography |
T2 | Data Structures |
T3 | Dr. Smith |
T4 | Dr. Jones |
T5 | Intro to Comp Sci |
Table 1. Thing Data.
The Association table from Figure 2 might look like Table 2:
AssociationID | Thing1 | Thing2 |
A1 | T1 | T3 |
A2 | T2 | T4 |
A3 | T5 | T2 |
Table 2. Association Data.
There are some quick observations you would make about this model. First, you would realize that you could use the Thing model to hold many different, diverse things. You can store courses and professors, of course, just like in Figure 3, but you can also put plumbing parts and office furniture and building locations in the table as well. You quickly see that, from an Agile point of view, the model has merit. When you don’t know what you are going to be modeling, you can start with very little structure.
The next observation you would have is concerning the Association table. The “bare bones” table only has the keys, representing the fact that these things are somehow tied together. It becomes apparent that you need an attribute to describe the Association type. See the next example, in Table 3 below:
AssociationID | Thing1 | Thing2 | AssociationName |
A1 | T1 | T3 | Teaching |
A2 | T2 | T4 | Teaching |
A3 | T5 | T2 | Prerequisite |
Table 3. Association Descriptions.
This is much more helpful; it tells us that T1 and T3 are related in a “Teaching” way. What it does not do is tell you which is the teacher thing and which is the thing being taught. It also doesn’t enable you to query the model by only one of the attributes to get all the Associations that any individual Thing would be participating in. You would have to structure the query by stating “where Thing1=value or Thing2=value”. For example: “Select AssociationName where Thing1=T2 or Thing2=T2”. This query would tell you that T2 (Data Structures) was involved in both a Teaching Association and a Prerequisite Association.
Adding Context and Keeping Flexibility
This model doesn’t supply much context to the data. It does not describe the data very well. For example, it doesn’t tell us what type of Thing is being referenced. It lumps courses and teachers all together.
There are several ways that this can be done. One way is to create another table with the relationship types, embedding the meaning and order of the Things as they relate to one another. In Table 3, we have two Teaching associations, and one Course Prerequisite association. We can create an Association Type which has intelligence built into the data; it says there’s a Course (Thing1) and it is taught by what is stored in Thing2. It has another row that says there’s a Course serving as a Prerequisite (Thing1) for another Course (Thing2). Table 4 has two rows in the Association Type table contents:
AssocTypeID | AssocTypeName |
AT1 | Course-TaughtBy |
AT2 | PrerequisiteForCourse |
Table 4. Association Type Table.
The model diagram now looks like Figure 4, adding the new Association Type table as a one-to-many relationship:
Let’s go back to the original Association table we saw in Table 2. We can now link them together by adding the foreign key of the Association Type table from Table 4 onto the Association Table, shown in Table 5 below:
AssociationID | Thing1 | Thing2 | AssocTypeID |
A1 | T1 | T3 | AT1 |
A2 | T2 | T4 | AT1 |
A3 | T5 | T2 | AT2 |
Table 5. Association Table Joined with Association Type.
Variations on a Theme
You can craft flexibility into the model by allowing the Association data to be expressed in another way. Let’s look at the Association table differently. We would like to explain the relationship type by explaining what the Things are. One way to do this is to label Thing1 as CourseID and Thing2 as TeacherID in the Association table. As soon as you do this, you restrict the model to only store a certain type of Association. This does not facilitate Agile development, and is really a variant of the more descriptive model shown in Figure 4.
In the examples above, we embedded the Thing Type within the Association Type; we had “Course (taught by) Prof” as an Association Type. Another way to do this is to embed Association Type in the structure of the Association table in a different way, and have two columns, one to explain the meaning of each connected thing to each other. You would end up with Table 6.
AssociationID | Thing1 | Thing2 | Thing1Meaning | Thing2Meaning |
A1 | T1 | T3 | Course | TaughtbyProf |
A2 | T2 | T4 | Course | TaughtbyProf |
A3 | T5 | T2 | Prerequisite Course | Course |
Table 6. Adding Association Type to Association Table.
This table is of course not normalized; notice that the meaning columns in A1 and A2 are the same. When you normalize the data and split it off into another table, you would end up with the same table as shown in Table 4 above.
A more flexible way to describe the type of Association is by how it relates to each Thing individually. This is another way to add context to your model but at the same time still promote flexibility. This technique is capturing the Roles of the various things, as they relate to each other. We will discuss Roles more in the next article in the series.
Consider this table, shown in Table 7, calling it Thing Association Type:
ThingAssocTypeID | ThingAssocTypeName |
TAT1 | Course |
TAT2 | Teacher |
TAT3 | Prerequisite |
Table 7. Thing Association Type Table.
The New Association table has five columns, shown in Table 8:
NewAssocID | Thing1 | Thing2 | Thing1TAT | Thing2TAT |
NR1 | T1 | T3 | TAT1 | TAT 2 |
NR2 | T2 | T4 | TAT1 | TAT 2 |
NR3 | T5 | T2 | TAT 3 | TAT 1 |
Table 8. New Association Table.
Both of these last two examples encapsulated Thing description information in the Association data. Another way to do it is to include Thing Type as either an attribute of the Thing table, or as a separate Thing Type table; this depends upon whether the Thing can serve as multiple types, or play multiple roles. In the examples above, a Course can serve as a Prerequisite for another Course. But a Course that’s a prerequisite is also a Course. In this case, the Prerequisite can be seen as a Role that a Course plays when it is in a relationship with another Course.
In the simple model we have been working with, we can embed Thing Type in the Thing table as an attribute. See Table 9 below:
ThingID | ThingName | ThingType |
T1 | Intro to Geography | Course |
T2 | Data Structures | Course |
T3 | Dr. Smith | Professor |
T4 | Dr. Jones | Professor |
T5 | Intro to Comp Sci | Course |
Table 9. Thing Type in the Thing Table.
This model loses flexibility if a Professor wants to take a course as a student! A person would be able to play multiple roles with respect to courses. Then perhaps he is not only one ThingType. There would need to be a separate Thing Type table. This is usually the preferable way to model when you are supporting Agile projects. The Thing can play multiple roles; People can play multiple roles with respect to Courses.
Practical Application
Generic modeling is extremely flexible and can support disparate problem domains all together in one model. Several of my colleagues who have supported a number of different Agile projects use this modeling technique all the time. I’ve used a flavor of it for one specific project I had.
Reference Data
One Agile project I supported used two sets of these types of tables for reference data: one set for the category of the reference data (the type of code), and then another set for the specific code and its value. Reference data could also have relationships with other code types, so generic structures allowed for this. It was difficult for new members of the team to understand at first because it was so abstract, but it did have good utilitarian value. Reference data is a good use case for this approach because it is by nature a generic problem. Reference data is plentiful; it is always used in the same way (as a lookup/validation) and there is much variety. It lends itself well to generic modeling.
Advantages and Disadvantages
This modeling approach is not without issues. The model doesn’t give you any clue as to how to query; you must glean intelligence by querying the model first, or examining detailed metadata. The structure, because it is so generic, does not tell you what the problem domain is about. The model is completely data-driven. The data provides the intelligence and understanding.
This is both a boon and a bane. It means that maintenance will require less modification to data structures over time, as requirements and complexity are added. However, it means that the model itself is difficult to understand because it is so abstract. It means that developers must study the abstraction and understand it, and so must users who want to obtain information directly from the database. It usually mandates a level of abstraction in between users and the database because it is so hard to understand. This is not a problem if there’s a robust application that separates the user from the data.
Expressive vs. Generic
This discussion demonstrates that there are two basic types of data models: specific data models that contain lots of information about the problem domain in their structure, and generic models that contain little if any inherent information about the problem domain. The more generic the approach, the less expressive the model, and the less information can be gleaned from just the model itself. In other words, when a generic model is used, you must look at the data to understand the model.
One of the reasons to create a model is to validate how the business works, whether or not a system is created. A model that is expressive provides specific knowledge of the problem domain, and therefore is better suited for capturing and validating business requirements than a generic model. Recall that in Figure 3 the model’s structures were named to reflect the things important to the business (“Teacher” and “Course”). In Figure 2, the generic names do not reflect business entities at all; it is not apparent to the stakeholder that the model expresses their specific business problem. Consequently, it is very difficult to validate whether your model has captured all the important intricacies of the business using generic structures. It is possible to do both: one specific model to validate requirements, and one generic model for database generation. It is very difficult to present a generic model to a business stakeholder. First, it would be very painful to explain how the model works to business people, and more to the point, validating the model with the business is pointless, since the model can hold anything, not just the unique requirements of the business.
Generic models can capture many diverse things in the same data structure and their relationships to each other. This article makes the case that generic modeling can be used for Agile projects as they evolve, especially in the project launch, when little is known about the problem domain. As the project begins to take shape, you have a choice: You can continue to use the generic model, or you can migrate to a more expressive one.
Make sure you catch the next article in this series: Data Model Patterns for Agile Projects. It will focus on bringing the level of abstraction down a notch, adding a little more intelligence into the data structures but still remaining flexible. You might want to read ahead and get prepared. There is a wonderful book entitled The Nimble Elephant: Agile Delivery of Data Models using a Pattern-Based Approach by John Giles.[1] This book is mandatory reading for every data modeler working on an Agile project. We will dive into some of his suggestions and further explore them in the next column. We shall return to the subject of generic modeling later when we discuss using the Agile approach for data warehousing.
References
[1] John Giles. The Nimble Elephant: Agile Delivery of Data Models using a Pattern-Based Approach. Westfield, NJ: Technics Publications, 2012.
©2017 The MITRE Corporation. All Rights Reserved. – The author’s affiliation with The MITRE Corporation is provided for identification purposes only, and is not intended to convey or imply MITRE’s concurrence with, or support for, the positions, opinions or viewpoints expressed by the author.