What’s the fuss about?
“I confess that I thought you were a lunatic when I first heard about your conjecture many years ago” wrote modeling authority Alec Sharp in a recent email to me. He was not alone. In 1996 I wrote an article for Database Programming and Design which included the “conjecture”, and it attracted record correspondence, most (but not all) of it taking a position firmly opposed to mine.
This is the statement that caused all the fuss: data modeling is design. Hold it! Before you start addressing letters to the editor, we do have a small problem of definition. What do we mean by data modeling? What do we mean by design?
My research (more of this later) has confirmed something most of us have found from experience: data modelers are not in agreement as to what constitutes ‘data modeling’. For the purposes of this article, I’m going to use it to embrace the main stream of activities needed to produce a conceptual schema definition (base tables, if you like) to meet a set of (stated or unstated) business requirements. For the sake of simplicity, let’s leave out decisions made purely for performance reasons. Please don’t bother writing in on this one: if you have an alternative definition, read on, and see whether what I have to say makes sense anyway.
The word design (and its complement, or opposite, analysis) is so badly abused in information systems writing that I’d prefer not to use it, but I lack a good alternative. A standard book on methodologies characterizes the difference between analysis and design as one of description vs prescription, and this certainly captures the flavor I want, as well as being in line with the way we use the words in everyday life. But in information systems, we more often use “analysis” for the early, business-focused stages of the life-cycle, and “design” for the later, more technical phases, though by no means consistently. This may be a reflection of the days in which the job of the IS professional was to automate existing processes and data structures, and in which the major design challenges were technical.
But today, you will have no trouble finding examples in database textbooks in which almost entirely mechanical transformations are described as “design”, and quite creative processes are described as “analysis”. The problem here is that we don’t forget the everyday use of these words: if something is called “analysis” we are going to be inclined to think of it as being descriptive rather than prescriptive.
Here’s the way I characterize the difference: which of the following two statements better describes data modeling?
The first (“analysis”) alternative suggests that the “answer is out there”; the second (“design”) alternative suggests that the data modeler creates structures to meet the business need. If, in looking at these choices, you’re inclined to answer “both”, think hard. When you deliver that final data model (be it a conceptual schema or some earlier deliverable which you define as being the end of your work) is it documentation of data requirements or is it something you’ve come up with to meet them? There is a real difference.
Does it matter?
It matters a lot.
The way we characterize data modeling – as analysis or design – affects how we do it, how we explain it, and how we teach it, in profound ways. For the academics, it impacts the direction and design of research.
Let’s look at just four of the issues.
One right answer?
Broadly, analysts seek an accurate depiction of some aspect of reality. Designers are involved in a creative endeavor, and their deliverable is a personal or group response to requirements. We don’t expect different designers to produce the same result. On the contrary, we often welcome diversity and applaud the designer’s creativity.
In the context of data modeling, how should we respond to two modelers who have produced different models? Do we seek to reconcile them, to eliminate the differences, or do we recognize the diversity and try to choose the better model or select the best (design) features of each?
How do we respond to a “radical” model, full of concepts that have never before entered the business people’s heads? Is such a model to be considered as a legitimate, albeit innovative, response to the business need? Or does it fail for not representing the business correctly? As a third option, do we accept the model, but stay with the analysis paradigm by deciding that the modeler has “discovered” the “deep” reality of the business?
When we develop a model do we stop when we have a workable model – or does our approach allow for alternatives to be generated and compared? Do we have criteria for comparing different workable models?
How do we evaluate packages? Do we create a model that represents business requirements, then look for the vendor who offers the best match? Or do we take the view that any model is an individual solution to requirements, and acknowledge that different vendors may have different, and perhaps better, solutions?
Who are we like?
It is odd that many people who see data modeling as an entirely descriptive activity call themselves “data architects”. Architects are designers. Architecture is quintessentially design, to the extent that Bryan Lawson, author of one of the most influential general books on design, “How Designers Think”, takes it as his paradigm. If we don’t do design, we shouldn’t choose titles which will confuse others as to what our role entails. Data draftspersons perhaps?
From my position, of course, I am very happy with the metaphor, and have found it immensely useful not only in communicating what we do, but in prompting me to think about our approach. Twelve months working closely with an architect on my own building project gave me plenty to consider – and a remarkable amount which I could apply to my own thinking about data modeling.
From other disciplines, such as architecture, we know something about the challenges of teaching design. To put it very simplistically, people learn design by doing it, and by looking at others’ designs. And they learn it pretty slowly at first.
Anyone who has taught introductory data modeling (or had to manage people who have just returned from an introductory course) will confirm that newbie modelers can know all of the rules, but be incapable of even getting started on a real model. Sometimes when giving guest lectures to postgraduate students who have studied data modeling and can discuss the relative merits of ORM, Chen E-R and the Semantically Extended Inverted Binary Class Model, I ask them to do a simple example – but not so simple that I give them the entity names as nouns and describe all the relationships (that’s drafting, not architecture). They struggle, and are often hugely frustrated because they “know the rules – just never used them”.
This scenario is absolutely typical of design disciplines.
Yet virtually every text on data modeling (my own, predictably, excluded!) implicitly supports the analysis paradigm by providing a single “correct” answer to every situation. Occasionally you will find an acknowledgement that there may be more than one way to model a situation, but invariably the example is one of “is it an entity or a relationship or an attribute?” and the idea is not pursued further.
Valuing the Contribution and the Contributor
In design, we don’t talk about “the right description”; we talk about good, very good, and great solutions. And we recognize that there are corresponding levels of expertise, all the way from incompetent, through competent but uninspired, to genius. It is much harder to see that gradation of expertise in the analysis paradigm: sure, it can be challenging to get information from business stakeholders, but we don’t have the picture of the modeler sitting at the desk or in the park with head in hands trying to find a solution having already understood the requirement.
Of course, if the data model wasn’t a critical component of an information systems design, then it would be self-indulgent to worry about good vs great; we would take a sufficing approach. My position is that for most information systems, the data model is the single most important component of the design. There’s another contentious statement, but one which I won’t pursue here!
A Theoretical Argument
Here are three theoretical reasons for considering data modeling to be a design discipline. I think they’re compelling, but they’re not beyond criticism.
The Business Requirement is not set in Stone
Businesses are human creations. So are their processes and most of their rules, which in turn reflect resources and constraints. When technology changes, processes and rules can change to exploit it. So “business requirements” are not set in stone, except perhaps at the very highest level. Rather, they reflect a negotiated outcome, starting with what the clients think they want (usually framed within the constraints of their own assumptions about what is possible) and what is actually possible.
To use the architecture analogy – a good architect does not simply accept the client’s detailed brief: he or she challenges, proposes, argues. “Have you considered this?”; “Are you sure you need a formal dining area?” “Why not a climate-controlled room instead of a cellar for the wine?”
Returning to data modeling, we can see the modeler taking the architect role and actively proposing approaches to data organization which might (for example) offer greater flexibility or simplicity than those which the business has created for its use in the past.
The counter-argument: this early stage of design is not the data modeler’s job. Perhaps, but it seems that some input from a data expert as to what is possible makes good sense.
A second counter-argument: data structures are intrinsically stable in the face of business change. This view has been around for about 20 years and has become a piece of data management folklore. It’s by no means obviously true, except as a statement that data structures tend to persist because they’re so expensive to change, and I’ve never seen any real arguments or evidence that support it. The only academic study of which I’m aware refuted it. (Take it out of your data sales kit.)
Classification is Subjective
Data modeling is, in essence, about classification; we classify real world entities, relationships and attributes (or other objects, depending on our approach) into entity, relationship and attribute classes. There is a raft of academic literature on the subject of classification: the modern position is that classification is subjective. Taxonomy in biology has been described as “part science, part art”.
Everyone who constructs a subtype hierarchy recognizes that different levels of generalization are possible – and a relational DBMS may encourage us to pick only one. More subtly we may simply group things in different ways; a colleague calls it “putting a grid on the world” and grids come in many shapes.
It’s not just entities: attributes in particular can represent different classifications; and every meaningful variation in definition is a variation in classification – a different model.
The counter-argument: while different classifications are theoretically possible, good data modelers will tend to agree on a “common sense” or “natural” classification, or will differ in ways which are relatively simple and easily resolved.
Rules can go in Different Places
When designing an information system, we have at least four possible places to represent a business rule: in data structure, in processing logic, in data values, or external to the automated part of the system. In some cases, the most ‘natural’ place to represent a rule is obvious; some rules are much more easily implemented as (say) processing logic than data structure. But the ‘natural’ place isn’t always the best: sometimes considerations of ease of change, for example, may outweigh simplicity of design.
In any event, we have a choice: for example, we could hold the business rule “Salespeople can only report to a Sales Unit” in:
- Data structure – a relationship between the entity types Salesperson and Sales Unit
- Process logic – the assignment process for Salesperson references Sales Unit only.
- Data values – a look up table showing what types of Employee can be assigned to what types of Organization Unit
- A rule held external to the system.
There are numerous variations on the above options, most of which imply different data structures.
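As a rough illustration of the first three options, here is a minimal Python sketch. The rule and the names Salesperson, Sales Unit, Employee and Organization Unit come from the list above; everything else (the function names, the "Accountant" and "Finance Unit" rows, the use of in-memory classes rather than database tables) is an assumption made purely for the example.

```python
from dataclasses import dataclass
from typing import Optional


# Option 1: data structure. The rule lives in the shape of the model:
# a Salesperson has a relationship to a SalesUnit and to nothing else.
# (Python does not enforce the annotation at runtime; in a database
# this would typically be a foreign key to a Sales Unit table.)
@dataclass(frozen=True)
class SalesUnit:
    name: str


@dataclass
class Salesperson:
    name: str
    reports_to: SalesUnit


# Options 2 and 3 share a more generalized structure, which on its own
# would allow any Employee to report to any Organization Unit.
@dataclass(frozen=True)
class OrganizationUnit:
    name: str
    unit_type: str


@dataclass
class Employee:
    name: str
    employee_type: str
    reports_to: Optional[OrganizationUnit] = None


# Option 2: process logic. The rule is hard-coded in the assignment process.
def assign_hard_coded(employee: Employee, unit: OrganizationUnit) -> None:
    if employee.employee_type == "Salesperson" and unit.unit_type != "Sales Unit":
        raise ValueError("Salespeople can only report to a Sales Unit")
    employee.reports_to = unit


# Option 3: data values. The rule is a row in a look-up table that the
# business could change without touching structure or code.
# The second row is hypothetical, included only to show the table's shape.
ALLOWED_ASSIGNMENTS = {
    ("Salesperson", "Sales Unit"),
    ("Accountant", "Finance Unit"),
}


def assign_table_driven(employee, unit, allowed=ALLOWED_ASSIGNMENTS) -> None:
    if (employee.employee_type, unit.unit_type) not in allowed:
        raise ValueError(
            f"{employee.employee_type} cannot report to a {unit.unit_type}"
        )
    employee.reports_to = unit
```

Each option accepts and rejects the same assignments today, but the data structures differ, and so does the cost of changing the rule tomorrow.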
The counter-argument: business requirements will resolve these issues. Indeed, the different models proposed above do have different business implications, but so do different architectural designs for a house – and the reality is that we just don’t see all of the designs and decisions side by side. The designer proposes, the client accepts or rejects the implications without being aware of every possible alternative.
So what really happens in practice?
Early in this article I referred to some research that I am currently undertaking at the University of Melbourne. My interest was in testing my position, which was based largely on my own reflection and observations of colleagues. Over the last two and a half years, I’ve surveyed around 400 delegates at data modeling and data management seminars, and am in the process of writing the results up as a PhD thesis.
Over the next few months, I plan to publish summaries of the results in a practitioner-friendly form on my website: www.simsion.com.au/research.htm. Here is a very high level overview (without academic precision and qualification) of some of the early results.
- Practitioners asked directly to describe data modeling (an open question: “what is data modeling”) generally aligned with the “analysis” position. What is also interesting is that only around 16% of responses did not (at least implicitly) embody some position on the analysis / design issue.
- Practitioners given an explicit choice of positions (those labeled (a) and (b) early in this article) favored the design position (49% to 37%, with 14% choosing “both”).
- Practitioners given a questionnaire based on Bryan Lawson’s differentiators of analysis and design characteristics chose the “design” option in 19 of the 20 cases (the exception came down
My hypothesis is that the above results arise from a disjunction between what we are told and what we experience in practice. When asked the big question directly, we give the textbook answer; when prompted and probed, we have to look to our experience.
Besides submitting practitioners to questionnaires, I’ve asked them to do some models, based on real business cases and descriptions. My interest has been in the level of variation amongst models.
- I gave a “simple” business problem – less than half a page of description – to data modelers. 101 of them submitted their models for analysis. To say that there was a great deal of variation is an understatement: the models averaged 8.5 entities, but across the 101 models, a total of 302 different entity names were used.
- Participants were given a model presented as a single entity with attributes, but with possibilities for attribute generalization (“First Quarter Material Budget Amount” could be generalized to “Quarter Material Budget Amount” with the addition of a “Quarter Number”, for example). When they were finished, they paired off and compared models. Approximately half observed that their models were “structurally different in important ways” from their partner’s.
- Participants were asked to develop models for two different scenarios. Participants who used more generalized structures in one scenario were more likely to use more generalized structures in the other – an expression, it would seem, of personal style, a characteristic of design disciplines.
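For readers who have not met attribute generalization, the following Python sketch shows the kind of choice the participants in the second experiment faced. Only “First Quarter Material Budget Amount”, “Quarter Material Budget Amount” and “Quarter Number” come from the exercise as described above; the other three quarter attributes, the amounts and the function name are assumptions for illustration, not the actual experimental material.

```python
# The "ungeneralized" structure: one attribute per quarter.
# The amounts are made-up illustrative values.
wide_row = {
    "first_quarter_material_budget_amount": 100,
    "second_quarter_material_budget_amount": 120,
    "third_quarter_material_budget_amount": 90,
    "fourth_quarter_material_budget_amount": 110,
}

QUARTER_NAMES = ["first", "second", "third", "fourth"]


def generalize_quarters(row):
    """Return one record per quarter, replacing the four named attributes
    with a single generalized amount plus a Quarter Number."""
    return [
        {
            "quarter_number": number,
            "quarter_material_budget_amount": row[
                f"{name}_quarter_material_budget_amount"
            ],
        }
        for number, name in enumerate(QUARTER_NAMES, start=1)
    ]


generalized_rows = generalize_quarters(wide_row)
```

Both structures hold the same facts. The generalized form copes with a change from quarterly to monthly budgets without a structural change, while the original form makes the four-quarter rule visible in the structure itself. Neither is simply “the right description”.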
I can hear the questions already, and can only respond “space doesn’t permit”. But I have designed and administered these surveys and experiments with the rigor expected of academic research, and I believe I have covered most of the bases.
For the moment, I trust they’ll at least promote some reflection and discussion.
Alec Sharp followed his statement about my presumed lunacy with these words: “but a little self-examination and looking around convinced me you were right.” I don’t expect to convince everyone, but I would like to think that this article promotes a bit of that self-examination and looking around.
The third edition of Graeme’s book Data Modeling Essentials (written with Graham Witt) was released by Morgan Kaufmann in November 2004.
 That’s schema, not model – I’m using the term in the strict ANSI/SPARC sense.
 Olle, Hagelstein, MacDonald, Rolland, Sol, Van Assche, and Verrijn-Stuart: Information Systems Methodologies: A Framework for Understanding, Addison Wesley, 1991.
 There is a discussion in Simsion and Witt, Data Modeling Essentials, Ed 3, pp 8-10.
 Marche, S., 1993, Measuring the Stability of Data Models, European Journal of Information Systems, 2(1), pp 37-47.