Written by Graeme Simsion – Simsion & Associates / University of Melbourne[1]
By subtitling this article “clearing the confusion”, I have probably provoked two reactions: (1) “More likely adding to the confusion,” says the cynic, not unreasonably. Why do we need another set of definitions for the stages of data modeling when we have so many to choose from already? What makes Simsion think that his version will be any more likely to win general acceptance than anyone else’s? (2) “What confusion?” is the second response. “I thought this was settled years ago,” said a correspondent in an email I received shortly after joining an online discussion on the subject[2].
I’ll start with the second question, and establish that there is an issue that needs tackling, not just as a terminological nicety, but because the confusion has a real and damaging effect on practice. Then, I’ll review some of the background – theory and history. One of the frustrations of the data modeling field is the regularity with which the same problems re-surface, and our propensity to tackle them from first principles (or no principles) rather than building on an understanding of established work.
We can then review the purpose of dividing data modeling into stages, and formulate some goals and principles. With those in mind, I suggest some stage boundaries, recognizing that this proposition is unlikely to change the world. But, returning to reaction (1) above, I believe you can change your world for the better, and offer some suggestions as to how to go about it.
Is there a problem?
As part of my research[3] with the University of Melbourne, I asked 55 attendees of advanced data modeling seminars in the US (25 attendees) and UK (30 attendees) to list the stages in database design, including data modeling. After consolidating synonyms and generally tidying up, there were still seven variants on the broad theme of Business Requirements – Conceptual Data Modeling – Logical Data Modeling – Physical Data Modeling, plus a few “outliers” (Table 1). (More detailed results, including activity names, are at www.simsion.com/research4.htm.)
On the surface, these variations might appear reasonably easy to resolve, but when we dig a little deeper we find that one person’s logical data modeling (for example) is not the same as another’s. When asked to allocate 26 activities (e.g. identifying attributes, specifying primary keys, specifying column names, specifying the datatype for each column, de-normalization for performance, specifying how subtype hierarchies will be implemented) to the relevant stage, respondents provided an interesting variety of answers. Stop for a moment. In which stage(s) do you think each of these six activities takes place? Who is responsible? Are you confident that your colleagues would agree?
Figure 1 illustrates the variation in responses by showing the 26 activities (in an “average sequence”[4] along the horizontal axis) and which of the five most commonly named stages they were allocated to.
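For the record, the “average sequence” calculation described in footnote 4 is simple enough to express directly. Here is a minimal SQL sketch, assuming a hypothetical table response_allocation holding one row per respondent per activity, with the stage number that respondent assigned:

```sql
-- Hypothetical table: one row per (respondent, activity), recording the
-- stage (1, 2, 3, ...) to which that respondent allocated the activity.
SELECT   activity,
         AVG(CAST(stage_number AS DECIMAL(4,2))) AS mean_stage_number
FROM     response_allocation
GROUP BY activity
ORDER BY mean_stage_number;  -- yields the "average sequence" of Figure 1
```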
I have since used this little test on many occasions, and even amongst people in the same job role within the same organization, it frequently shows up real disagreements or misunderstandings. Does it matter? If it didn’t, there would hardly be any point in having named stages in the first place. There are some fairly obvious problems with disagreeing on stages and what should fall within them: in particular, managing expectations about project deliverables (“I thought the conceptual model would have all the attributes defined”) and responsibilities (“I’m responsible for normalization – so how can you denormalize without asking me?”). These are problems within organizations. In the broader community, discourse is impeded because people are not using the same language. As data modelers, we should be familiar with this kind of problem.
A bit of history
It’s worth looking at where the terms came from. In 1975, the ANSI/SPARC study group on database management systems published the 3-schema architecture, recognizing three levels in a database specification: the conceptual schema, the internal schema, and the external schema. In (approximate) relational terms, these correspond respectively to base tables, physical storage and access mechanisms, and views. The important thing is that the architecture was a way of describing a finished product – a database – not a process for designing one. There was, in fact, a suggested process for getting to this structure: individual users define their own views, which are then integrated into a common conceptual schema, from which each user’s view is presented back as an external schema. And of course the DBA makes it all work efficiently by designing an appropriate internal schema.
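To make the correspondence concrete, here is a minimal SQL sketch of the three levels (the table, index, and view names are hypothetical, and index syntax in particular varies between DBMSs):

```sql
-- Conceptual schema: a base table - the integrated, shared description of the data
CREATE TABLE customer (
    customer_id   INTEGER     NOT NULL PRIMARY KEY,
    family_name   VARCHAR(50) NOT NULL,
    date_of_birth DATE
);

-- Internal schema: physical storage and access mechanisms
CREATE INDEX ix_customer_name ON customer (family_name);

-- External schema: a view presenting just what one group of users needs
CREATE VIEW customer_name_list AS
    SELECT customer_id, family_name
    FROM   customer;
```

Note that a change at the internal level (dropping or adding the index) leaves both the base table and the view untouched – precisely the separation the architecture was designed to describe.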
Then, in 1976, Peter Chen published his paper on the Entity-Relationship Model,[5] which provided a common specification language for three different types of database: relational, network, and entity-set. The E-R formalism was a step forward from the older Data Structure Diagrams (“Bachman diagrams” – which pre-dated the Relational Model) in that it reflected real-world terminology a little more closely, primarily through the use of a common symbol for relationships, whether one-to-many, many-to-many, binary or higher degree. A year later, this common front-end was being described by its author as a “pure representation of reality”.
The need to support the three different database architectures has passed – but the idea of a two-stage process, beginning with a model (and modeling language) that more closely reflects real-world concepts, followed by a mechanical translation to a conceptual schema, has survived. In the academic literature, the first of these two models is referred to as a conceptual model, and much research has been devoted to finding better formalisms[6]. It is important to note that in the academic world, the conceptual model is detailed enough to allow translation into a conceptual schema; for example, it includes attributes (or their equivalent). We see this underlying approach and terminology in some textbooks and practitioner approaches that have their roots in academe – for example the Object-Role Modeling[7] (ORM) approach.
An important part of current academic approaches is the use of a different language for conceptual and implementation models, with the latter usually assumed to be (pure) relational. For a long time, this separation of languages was not reflected (or only very dimly reflected) in data modeling practice. The popular “crow’s foot” notation, at least in its basic form, is not much more than a diagrammatic version of relational structures. Chris Date has called the (Chen) E-R Model a “thin layer on top of the Relational Model”. If this is true (and I think it is), then the crow’s foot notation is an even thinner one. But over the years, we have seen widely-adopted extensions to both the crow’s foot notation (e.g. subtypes) and to the facilities offered by DBMSs – not necessarily in sync with one another. UML, with a bevy of additional constructs, has emerged as an alternative to the crow’s foot notation. Object-Relational DBMSs offer a range of new facilities, some prompted by developments in the SQL standard, some by the desire to differentiate the product in the marketplace. Slowly, we have moved towards two different (but related) languages, without necessarily paying much attention to whether a model in one can be mechanically translated into the other.
Meanwhile, and not for the only time, practitioner authorities appear to have taken a term from academe and applied their own definition. For most practitioners, “conceptual” modeling is high-level modeling – entities, relationships and no attributes – as illustrated by the brief dominance of the yellow line (peaking above activity no. 3) in Figure 1. Of the 32 respondents to my survey who nominated a conceptual modeling stage, only five assigned identification of attributes to that stage. Communication between practice and academe is challenging enough without the problems introduced by ambiguity of this kind.
With conceptual modeling in practice largely taking the role of preliminary sketching or scope definition, the issue becomes one of drawing the boundary between logical and physical modeling – the light blue and purple lines in Figure 1 (if you’re looking at a black and white printout, the logical line peaks above activity no. 9 and the physical line rises and dominates after activity no. 17). The term physical data modeling in the diagram includes physical database design[8]: the analysis of responses showed that the two stages covered similar activities (and seldom appeared together in the one response). But if the stage was called physical database design, it was much more likely to be seen as the responsibility of the DBA rather than the data modeler. Terminology implies responsibilities.
Why do we need stages?
What are we trying to achieve by breaking data modeling into stages? The academic answer, as described above, is relatively simple: we want to separate the description of the real world from the DBMS-specific structures that we use to implement it. I should add that I think this view is misguided and that the real-world description should not be in the form of a data model, but we can put that argument aside for the moment, because there is still a valid question of choosing a language that is fit for purpose. At least the academics have a clear principle to work from.
Here are some arguments from a practice perspective:
1. In managing any substantial project, it makes sense to specify interim deliverables to support management and a logical work sequence. Obviously the content of these deliverables has to be clear, so that we can establish dependencies and determine when they have been completed. That said, many different break-ups of the task can achieve this goal.
2. Different tasks require different skills. In this area, the key difference is business skills and logical data structuring skills versus database (and particularly DBMS-specific) skills. Clear hand-offs and associated responsibilities require clear deliverables.
3. Different models may be appropriate for different audiences. Specifically, business stakeholders and process modelers / programmers may need to review only some aspects of a database design, and may benefit from the model being developed and presented in a different language from that required by the database implementer.
4. We may want to separate certain types of decisions – classically performance decisions – to promote focus, clarity of thinking, and auditability. “This decision was made only for performance reasons – here’s the ideal model that we could use if we acquired more powerful technology”. (A small sketch of this separation follows the list.)
5. We may want to develop a DBMS-independent model – and here I’m referring not only to the “invisible” aspects such as index structure, but to the visible structures – perhaps explicit support for subtypes or for complex attributes / columns – that may differ from one DBMS to another. Though this goal is often cited, I wonder how important it really is: we generally know the target DBMS in advance.
6. We may want to automate part or all of the process – in particular a whole stage. Tools typically provide an automatic default translation from “logical” to “physical” (“conceptual model” to “conceptual schema” in academic terms) – so our “logical” model has to be complete in some quite specific ways.
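To illustrate point 4, here is a minimal, hypothetical SQL sketch of separating a performance decision from the “ideal” model, so that the decision remains visible and auditable:

```sql
-- "Ideal" (pre-performance) model: order totals are derivable by
-- summing the order lines, so they are not stored.
CREATE TABLE customer_order (
    order_id    INTEGER NOT NULL PRIMARY KEY,
    order_date  DATE    NOT NULL
);

CREATE TABLE order_line (
    order_id    INTEGER       NOT NULL REFERENCES customer_order,
    line_number INTEGER       NOT NULL,
    line_amount DECIMAL(10,2) NOT NULL,
    PRIMARY KEY (order_id, line_number)
);

-- Performance decision, documented as such: a redundant total, added
-- only because summing order lines proved too slow. If more powerful
-- technology arrives, this column can be dropped without revisiting
-- the underlying model.
ALTER TABLE customer_order ADD total_amount DECIMAL(10,2);
```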
Closely associated with these issues is the question of what to maintain. Do we maintain only a description of the implemented database or do we try to keep certain interim deliverables (e.g. a logical model) up to date? In my experience keeping earlier deliverables up to date is a “wish list” item that is seldom realized in practice, and of debatable value (especially if the data modeler is not committed to sticking around). When the time comes for database renewal, these documents are cast aside in favor of (at either extreme) a full revisit or simple replication of the existing structures.
Candidate Stages
In keeping with the preceding arguments, I suggest the following candidate deliverables and associated stages. For the moment, I’ll avoid using the terms conceptual data model, logical data model, and physical data model. I see four logical checkpoints:
- A high-level model, analogous to an architect’s sketch plan. This is to support early exploration of options, including scope. Here, we are not looking for rigor as much as key concepts and an idea of what part of the business is covered. The final model, probably supported by alternatives that were considered, should answer the question “what will the model cover?” and establish a framework for more detailed analysis / design. Entities and relationships, perhaps with some attributes to illustrate, should do the job.
- A business-friendly model. There is an established tradition of using at least a different form of model from that supported by the implementation platform when dealing with the business. Typically the model is in diagrammatic form, and includes constructs that will have to be translated, with or without some decisions by the modeler, into the language of the target DBMS. There are two basic assumptions in having a separate model for the business to contribute to and review: first, that they will in fact find it easier to understand than a model in implementation form, and second, that the translations will be of minimal interest to the business (i.e. their sign-off is not going to be rendered invalid by later changes). Both assumptions may need to be questioned. If the business people will need to become familiar with the ultimate database structure, perhaps through use of a query language, it may be sensible to work in that form from the start – which doesn’t preclude drawing diagrams! Some well-established translations, such as the choice of subtype implementation, do have business impact – in this case on, for example, resilience to business change (a sketch of the options follows this list). Alternatively, we may see this model as being modeler-friendly: as helping the modeler to explore different ideas (such as level of generalization) and to defer a range of detailed decisions that do not require consultation with the business. The question is what to include and what to leave out. The purist view that the model should be totally independent of implementation language is probably unrealistic, and certainly not observed in practice. One interesting choice is to take the latest version of the SQL standard, with appropriate diagramming support, as the starting point for what can be represented in this stage. Other options include UML and the ORM language, mentioned earlier.
- A pre-performance conceptual schema. The development of an initial conceptual schema design that excludes performance considerations is prompted by an established (and reasonable) division of expertise. Creating a pre-performance conceptual schema requires knowledge of the business and of visible data structures (the options for which are generically similar amongst relational DBMSs); performance tuning requires knowledge of internal schema facilities, and is traditionally the responsibility of someone with expert knowledge of a particular DBMS and its environment. The deliverable from this stage should be a complete conceptual schema specification – including table and column names. In terms of work sequence, the separation of performance design may not always be a good idea: if performance is paramount, it may need to be designed in from the beginning, rather than added on.
- A conceptual schema. Whatever approach we take, an implemented database is the final goal, and needs to be properly documented. We’re on solid ground here: the nature of this deliverable is determined by the constructs supported by the DBMS.
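As an aside on the subtype translations mentioned under the business-friendly model, here is a hedged sketch (entity and column names are hypothetical) of two common implementation options for a Party supertype with Person and Organization subtypes:

```sql
-- Option 1: single table for the whole hierarchy; a discriminator column
-- distinguishes the subtypes, and subtype-specific columns are nullable.
CREATE TABLE party (
    party_id    INTEGER NOT NULL PRIMARY KEY,
    party_type  CHAR(1) NOT NULL,     -- 'P' = person, 'O' = organization
    family_name VARCHAR(50),          -- person only
    legal_name  VARCHAR(100)          -- organization only
);

-- Option 2: one table per subtype; common attributes are either repeated
-- in each table or held in a separate supertype table joined by party_id.
CREATE TABLE person (
    party_id    INTEGER     NOT NULL PRIMARY KEY,
    family_name VARCHAR(50) NOT NULL
);

CREATE TABLE organization (
    party_id    INTEGER      NOT NULL PRIMARY KEY,
    legal_name  VARCHAR(100) NOT NULL
);
```

Option 1 accommodates new subtypes without structural change, at the cost of weaker constraints; Option 2 enforces subtype-specific constraints but requires DDL changes as the business changes – exactly the kind of business impact that a sign-off on the business-friendly model can miss.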
The above is probably not too far from what you do or what your documentation tool supports. But the names remain a problem. Table 2 provides some examples of names (by no means definitive or exhaustive) that I’ve seen for the above models – including the ones that Graham Witt and I use in Data Modeling Essentials.
It should be clear that we are going to have some serious communication problems – and some misunderstandings about deliverables and responsibilities – if we don’t pin down our terms. This should not be news to data modelers!
What to do
As Michael Gorman observed in a posting to the dm-discuss discussion group on this issue, a standard (and, I would add, not just an industry standard, but an industry, research, and teaching standard) would be a good idea. In the meantime, I suggest that you can take the smaller, but important step of clearing the confusion in your own organization – or making sure that there isn’t any.
Basically, you need to ensure that you have a common and workable set of definitions, and a clear idea of who is responsible for what. I’ve found that one of the simplest and most effective ways of tackling this is to start with the concrete and then move to the abstract. First, work with other stakeholders (data modelers and DBAs in particular) to agree the broad stages – and names – perhaps using the descriptions in the previous section as a starting point. Decide who is responsible for each stage. If the responsibility is joint, consider splitting the stage. That’s the (relatively) easy part. Then, together, make a list of activities in database development – from ascertaining requirements to post-implementation tuning and everything you can think of between and beyond. Focus on including any disputed territory – perhaps de-normalization, column names, etc. Then allocate them to stages – collectively. If you haven’t done this before, expect plenty of discussion. When it’s done, look for underlying principles (“this stage doesn’t include decisions made only for performance”) and then review the allocations in that light. The ideal result is to have some principles (and perhaps a few exceptions) so that activities that have not been considered can be readily placed in an agreed stage. And of course, document the result.
There’s nothing dramatic or terribly clever in any of this, and it sounds a lot like data modeling, but in my experience it’s not often done. It’s not always a pleasant process, bringing underlying disputes to the surface, but it provides a chance to solve the problem once instead of every time you build a database.
As you’ve doubtless guessed, I have some strong views on where the lines should be drawn, and even stronger ones about who should do what. But let’s establish a basis for talking first.

The third edition of Graeme’s book Data Modeling Essentials (written with Graham Witt) was published by Morgan Kaufmann in November 2004. Graeme is a regular speaker in the US, UK and Australia – see www.simsion.com for upcoming presentations.
[1] graeme@simsion.com.au. My thanks to Corine Jansonius for her helpful comments on the draft of this article.
[2] On the dm-discuss discussion group – details at www.wilshireconferences.com
[3] Relax, this was just a temporary (but valuable) diversion from practice in the “real world”.
[4] Calculated as the mean of the stage number to which they were allocated (e.g. if half of the respondents allocated an activity to their first stage, and half to their second stage, that activity would have a mean stage number of 1.5 – and hence be sequenced later than an activity with a mean stage number of, say, 1.4).
[5] I use upper case (Model) to refer to modeling formalisms rather than individual models of a particular business or application.
[6] Which, confusingly for us, are referred to in the literature as models.
[7] See www.orm.net
[8] This term includes the internal schema – i.e. physical placement and access mechanisms.