Data Modeling News from the Ivory Tower

Published in TDAN.com January 2007


Introduction – Back to school

Just over four years ago, after some twenty years as a data modelling practitioner and manager, I decided to spend some time at the University of Melbourne, looking at data modelling issues from a
research perspective. The change was not as abrupt as it may sound; I have tried to keep in touch with research in the field, have taught university courses, supervised a couple of research
students and even published a few articles in academic journals and conferences. Nor did I entirely drop my practitioner work, continuing to run seminars and “master classes” in Europe and North
America, and working with Graham Witt on the third edition of our practitioner-oriented book.

Nevertheless, there was a definite culture shock. As Batra and Marakas (1995) note: “there are indeed wide differences between the academic and practitioner focus in conceptual data modelling”. Yair
Wand and Ron Weber, two eminent researchers in the information systems field, have published a research agenda for conceptual modelling (Wand and Weber 2002); it is fair to say that the topics it nominates
do not come up very often in data modelling discussion groups, or in the practitioner literature.

I chose to look at an issue that I believe has important practical implications: whether data modelling is better characterised as description[1] (reflecting the real world) or design. I defined and
discussed this topic in an earlier article[2] and will not re-visit the definition and implications here. For the moment, I simply note that its investigation involved asking a number of sub-questions that are interesting in their own right. For example, will different data modellers, faced with the same modelling problem, produce the same or similar models? Do data modellers exhibit personal styles that affect the logical structure (as distinct from the presentation) of their models? Or, more basically, what activities do data modellers believe fall within the scope of data modelling?

The research turned out to be a bigger undertaking than I had anticipated, involving three different surveys, seventeen interviews, three experiments, and over 450 practitioner participants on three
continents. I discovered many interesting things, some of which have had a significant impact on my teaching and practice. The results were “summarised” in a 500-page PhD thesis (dissertation),
densely populated with diagrams, graphs, statistics and references, and intended primarily to convince an academic audience that I knew what I was doing[3]. Part of the deal with the practitioners who contributed to this
research was to publish the results in a practitioner-friendly form. This article is a part of meeting that commitment, and is also intended to provide something of a window into data modelling
research in general.

The sections that follow match the chapters of my thesis, and I’ve tried to capture the key results and observations from each chapter. The “compression factor” is very high, so a great deal –
particularly details of method and analysis – has been left out. I would be happy to send full chapters, or the entire thesis, to anyone interested in more detail. You should, however, take note of
an email I received from author David Hay: “Reading your thesis writing style, however, makes me a lot less enthusiastic about taking on such a project myself. I really prefer writing for regular
folks…”


Literature Review

An important part of a thesis is a review of existing publications on the topic. The data modelling literature dates from the 1970s, and it seems to me that the progress in thinking in that period
has not matched that in other areas of information systems. Practitioners settled on relational database architecture and crow’s-foot modelling conventions some time ago, at least until the advent
of UML, and today’s practitioner and teaching literature is not too dissimilar to that of the mid 1980s. Meantime, the academics have devoted much of their time to proposing and comparing
alternative formalisms – and even coined the term YAMA (“Yet Another Modelling Approach”).

Positions on the description-design issue are generally unclear. Where clear statements are made, they reflect contrasting views. The following are typical of the description or “reality mapping”
school:

[a data model is] “a precise and unambiguous representation of organizational information requirements.” (Kim and March 1995).
“The objective … is an accurate representation of reality” (Teorey, Lightstone et al. 2006).

The next two represent the (distinctly less popular) design school:

“The process of conceptual data modelling is sometimes seen as an analytical, descriptive task, but it is better viewed as a design activity where the data modeller is an active participant in the
modelling process and adds value to the quality of the model.” (Chaiyasut and Shanks 1994).
Data modelling is a creative activity in which the modeller “often needs to dream up several different possibilities” (de Carteret and Vidgen 1995 p334).

Neither school of thought offers much in the way of theoretical argument or evidence to support its position.

In seeking a deeper understanding of positions, I turned to Bryan Lawson’s book How Designers Think, which includes a list of characteristics of design problems, processes and products. For
example “design problems cannot be comprehensively stated”; “the [design] process is endless”; “there are [sic] an inexhaustible number of different solutions”. Comparing data modelling, as
described in the academic and practitioner literature, with these characteristics was an interesting exercise. In virtually every case, I found arguments – or at least positions – on both sides.

The empirical (observational, experimental) literature on data modelling was of limited usefulness. I conducted an extensive search of published papers, eventually listing fifty-nine empirical
papers, involving 3210 participants (subjects) in total. Unfortunately this superficially impressive body of work suffers from some severe limitations:

  • Of the 3210 participants, only 147 had one year or more of practical experience. Sixty-five of these were accounted for by one mailed-out survey. Most of the participants were thus students;
    you can form your own judgment as to how representative their behaviour and models would be of those of experienced practitioners.
  • Where practitioners did participate, they were generally few in number and often recruited from a single institution or company, thus making generalization of results risky.
  • The focus was heavily on comparing data modelling formalisms (languages). Thirty-three of the studies compared formalisms or alternative constructs for the same real world concept.
  • The modelling problems were almost invariably very simple, reflecting the capability of the student participants.
  • Most modelling problems were reverse-engineered from a “gold standard” solution prepared by the researcher. The “one right answer” assumption was built into the experimental design.

Hitchman (1999) puts it bluntly: “Assumptions made by researchers result in findings that are divorced from current data modelling practice, cannot be generalized and are misleading.”


Research design and participation

I approached the description-design question using a range of methods – interviews and surveys to assess practitioner perceptions of data modelling, and modelling exercises to assess diversity in
models produced from a common set of requirements. The sections that follow will provide a taste of these.

My interest was in the perceptions and performance of experienced practitioners rather than students. I therefore invited participants in seminars and classes to complete surveys and perform tasks,
and to hand these in if they were comfortable about doing so. In all, 459 attendees at twelve classes and seminars in the UK, Scandinavia, US and Australia handed in questionnaires and / or models.
Ninety-three percent[4] of these
claimed more than a year of data modelling experience. Forty-nine percent were either data modellers or data administrators. Twenty-nine percent included the word architect in their job
title. The most commonly cited method of learning was on-the-job experience, followed by industry courses and books, with tertiary education a distant fourth behind mentorship.


What the thought leaders think

Early in my research project, I interviewed seventeen thought leaders, firstly to see whether the description-design issue really was controversial, secondly to hear the arguments for the two
positions, and thirdly to determine what evidence would be persuasive in settling the issue.

The names will be familiar to most data modellers: Peter Aiken, Richard Barker, Michael Brackett, Harry Ellis, Larry English, Terry Halpin, David Hay, Steve Hoberman, Karen Lopez, Dawn Michels,
Terry Moriarty, Ronald Ross, Robert Seiner, Alec Sharp, Len Silverston, Eskil Swende, and John Zachman. Their contribution was of fundamental importance, and provided practical context and a
reality check absent from much research. If I were to take one lesson from this project to the research community, it would be to seek out and learn from the practitioner thought leaders in their
area of research.

A few quotations will give a flavour of the difference of opinion. It’s important to emphasise that the interviews generally delved quite deeply to ensure that differences were real and not just a
matter of semantics:

On the overall description-design question:

“Data modelling is certainly a descriptive activity, it’s not a design activity.”
“I believe rabidly and intensely that it’s a design process.”

On the negotiability of the requirements:

“Data modelling is all about helping a business come up with a better way of doing business”
“Data modellers should not resolve business problems”

On the role of creativity in the process:

“Managed properly, it is a highly creative activity – or should be”
“There’s nothing creative about it”

On the diversity of solutions:

“If we are both experts we should come up with the same solution.”
“Given the same set of business rules, two very very good modellers will come up with completely different models.”

The table below summarises my assessment of the interviewees’ overall positions.


Position                                                  Number of Interviewees

Strongly supports description                             5
Somewhat supports description                             1
Supports neither position more strongly than the other    3
Position depends on modelling language                    1
Somewhat supports design                                  3
Strongly supports design                                  4

It was clear from the interviews that the “one right answer” issue was central, and I therefore put some effort into investigating diversity in models produced by different modellers in response
to a common set of requirements.


Scope and Stages

You don’t need to read much of the data modelling literature – or talk to too many modellers – to realise that not everyone means the same thing by data modelling, or by
conceptual, logical and physical. Without an understanding of what practitioners mean by data modelling, it was not going to be possible to frame questions about
perceptions of data modelling or to interpret the answers.

To gain an idea of perceived scope and stages of data modelling, I asked participants to list all of the stages of database design, then to nominate which of these were “data modelling”. I then
asked them to allocate a pre-set list of activities (e.g. normalisation, specifying column names, specifying indexes) to the appropriate stage. The results were discussed in an earlier newsletter
article[5]. Briefly, eighty
percent of respondents saw data modelling as embracing at least all of the activities involved in specifying a conceptual schema (roughly “what the programmer sees via views”) from a set of
business requirements, prior to compromises for performance reasons. Many saw data modelling as embracing other activities, and there was no consensus on the individual stages or their boundaries.

Despite the overall lack of clarity, it was apparent that the practitioner definition of conceptual data modelling differed from the academic definition. Practitioners see conceptual modelling as a
preliminary “sketch plan” stage, whereas academics see it as producing a detailed conceptual schema.


Espoused positions: How practitioners describe data modelling

This stage of the research looked at how practitioners characterize data modelling explicitly in terms of the description-design distinction.

I asked two questions:

1. An open question:

What is data modelling?

2. A closed (forced choice) question:

Which better describes data modelling?
(a) Describing the data requirements of an organization or part of an organization or
(b) Designing data structures to meet the requirements of an organization or part of an organization.

Responses to the open question were analyzed in terms of the position that they supported or embodied.
Responses to the closed question were just tabulated. The two bar charts below show the results.

Responses to the open question overwhelmingly embodied the “description” position (e.g. “The unambiguous expression of business rules in pictorial form”), whereas, when confronted with a more
explicit choice, more respondents chose the “design” position. The results would suggest that our “elevator pitches” on data modelling may convey positions different from those that come from a
little more reflection.


Characteristics of Data Modelling

The next stage of the research sought to look more deeply at practitioner perceptions of data modelling (description or design). It is well accepted that espoused theory may not reflect practice –
people don’t always walk the talk. A questionnaire was used to ask whether data modelling was perceived as having the characteristics of a design discipline. Bryan Lawson’s list of
characteristics of design (mentioned earlier) provided the reference point. It was relatively easy to include the survey in seminars, as it took only a few minutes to complete, so I was able to
analyse 266 responses from seven locations.

And the answer was…

Overwhelmingly, practitioners saw data modelling as having the characteristics of a design discipline. This was true overall, and for data modelling problems, processes and products individually.
More experienced modellers were more likely to give high “design scores” than their less experienced colleagues.

Because the questionnaire was a new instrument, I benchmarked it with two other professions – accountants (in the context of preparing financial reports) and architects (in the context of designing
a building). As expected, data modellers scored significantly higher than accountants – but they also outscored architects, a result of higher scores on the “problem” dimension. It seems that
data modellers perceive their problems or requirements as more ambiguous and negotiable than do architects.


Diversity in Conceptual Modelling

This research component was the first of two that looked at whether data modellers faced with a common set of requirements would come up with different models – a classic characteristic of design
disciplines. In this case the problem was a real one, presented (unedited) on videotape by the two business stakeholders. It was simple by “real world” standards, and the transcripts amounted to
less than a page, with plenty of overlap in what the two stakeholders said. An accompanying questionnaire asked participants whether they had sufficient time, sufficient information etc (and,
broadly speaking, most said that they did).

Seven measures of diversity were used: perceived difference in models (from asking each participant to compare their model with another’s), number of entities, variety of entity names, number of
entity names corresponding to nouns in the problem description, choice of construct (e.g. entity or attribute) for certain common concepts, level of generalisation for certain common concepts, and
differences amongst selected models as evaluated by independent data modelling experts.

An assistant and I judged about two thirds of the models as workable insofar as a database based on them could support an application that met the requirements.

Two results highlight the diversity in the models:

  1. The ninety-three models used a total of 291 different entity names, after consolidation of obvious synonyms (abbreviations, plurals, etc).
  2. The nineteen independent experts (all but one with over fifteen years of specialist modelling experience) were given ten of the solutions. Five of the models were rated by a majority of the
    experts as of equal or better quality than typical models encountered in practice (suggesting that they would be acceptable to some people at least). There was disagreement as to which was the best
    model, with four different models receiving at least three nominations as the best.

As one of the experts commented: “What a fascinating exercise. Great illustration of the range of models that can come from the same scenario.” And he only saw ten of them!
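The consolidation of obvious synonyms mentioned in the first result can be pictured as a simple normalisation step applied before counting. What follows is only a minimal sketch, not the procedure used in the thesis: the abbreviation table, the naive plural rule and the sample entity names are all hypothetical and chosen purely for illustration.

```python
import re

# Hypothetical abbreviation table; the actual consolidation rules are not given in the article.
ABBREVIATIONS = {"cust": "customer", "acct": "account", "org": "organisation"}


def normalise(name: str) -> str:
    """Reduce an entity name to a canonical form so obvious synonyms match."""
    words = re.split(r"[\s_\-]+", name.strip().lower())
    canonical = []
    for word in words:
        word = ABBREVIATIONS.get(word, word)
        # Naive plural stripping: "orders" -> "order", but "address" is left alone.
        if word.endswith("s") and not word.endswith("ss") and len(word) > 3:
            word = word[:-1]
        canonical.append(word)
    return " ".join(canonical)


def distinct_entity_names(models: list[list[str]]) -> set[str]:
    """Collect the distinct canonical entity names used across all models."""
    return {normalise(name) for model in models for name in model}


# Three invented models of the same (imaginary) ordering problem.
models = [
    ["Customer", "Orders", "Address"],
    ["Cust", "Order", "Order Line"],
    ["Party", "Party Role", "order_lines"],
]
names = distinct_entity_names(models)
print(len(names), sorted(names))
```

Even after this kind of consolidation, names such as Customer and Party remain distinct, which is the sort of difference that the count of entity names is intended to capture.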


Diversity in Logical Modelling

The Diversity in Conceptual Modelling results described above are vulnerable to the claim that diversity arose from ambiguities in requirements – although I tried hard to reduce this
possibility. In the next task, participants were provided with a tighter specification – an entity with twenty-two attributes (the example was adapted from practice, and appears in Data
Modeling Essentials, Ed 3, p174) and asked to produce a final model for implementation. There was room to generalize and to separate attributes into multiple tables, and the eighty-three
participants took full advantage of the possibilities – in different ways. Almost all of the models were workable. Thirty-nine of the models were examined in detail and all were structurally
different – code written against them to achieve the same purpose would have looked different, beyond simple naming differences.
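To make “structurally different” concrete, here is a minimal sketch using Python’s built-in sqlite3 module. The tables and data are invented for illustration and are not the twenty-two-attribute example from the exercise. Model A keeps one column per kind of phone number; Model B generalises them into a separate table. Both hold the same facts, yet the code needed to answer the same question is different in more than naming.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

conn.executescript("""
    -- Model A (hypothetical): low generalisation, one column per phone type.
    CREATE TABLE person_a (
        person_id  INTEGER PRIMARY KEY,
        name       TEXT NOT NULL,
        home_phone TEXT,
        work_phone TEXT
    );
    INSERT INTO person_a VALUES (1, 'Pat', '555-0100', '555-0199');

    -- Model B (hypothetical): phone numbers generalised into their own table.
    CREATE TABLE person_b (
        person_id INTEGER PRIMARY KEY,
        name      TEXT NOT NULL
    );
    CREATE TABLE phone_b (
        person_id  INTEGER NOT NULL REFERENCES person_b(person_id),
        phone_type TEXT NOT NULL,
        number     TEXT NOT NULL,
        PRIMARY KEY (person_id, phone_type)
    );
    INSERT INTO person_b VALUES (1, 'Pat');
    INSERT INTO phone_b VALUES (1, 'home', '555-0100'), (1, 'work', '555-0199');
""")


def phones_model_a(person_id: int) -> list[str]:
    """All phone numbers for a person under Model A: read fixed columns."""
    row = conn.execute(
        "SELECT home_phone, work_phone FROM person_a WHERE person_id = ?",
        (person_id,),
    ).fetchone()
    if row is None:
        return []
    return sorted(number for number in row if number is not None)


def phones_model_b(person_id: int) -> list[str]:
    """All phone numbers for a person under Model B: read child rows."""
    rows = conn.execute(
        "SELECT number FROM phone_b WHERE person_id = ?", (person_id,)
    ).fetchall()
    return sorted(number for (number,) in rows)


# Same answer, different code.
print(phones_model_a(1) == phones_model_b(1))
```

Adding a third kind of phone number would mean a schema change and a code change under Model A, but only a new row under Model B, which is why the choice of generalisation level is a design decision rather than a matter of presentation.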


Style in Data Modelling

The final component of the research looked at personal style in data modelling – as measured by choice of level of generalization. The question was: do some modellers prefer higher (or lower)
levels of generalisation than others, independent of the modelling problem?

The experiment used three modelling tasks, with each modeller tackling two. One involved choice in level of generalisation of attributes while the other two involved choice in level of
generalisation of entities.

The results showed a significant correlation between modellers’ decisions on the two entity generalisation tasks, but no significant correlation between their entity generalisation and attribute generalisation decisions.
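For readers who would like to see what such a comparison looks like in practice, the sketch below computes a Pearson correlation coefficient for two sets of paired scores. It is an illustration only: the article does not say which statistic the thesis used, the five-point “generalisation level” scale is an assumption, the scores are invented, and no significance test is included.

```python
from math import sqrt


def pearson(xs: list[float], ys: list[float]) -> float:
    """Pearson correlation coefficient for two equal-length lists of scores."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


# Invented scores (1 = low generalisation, 5 = high); not data from the study.
# Group 1: modellers who tackled both entity generalisation tasks.
entity_task_1 = [1, 2, 2, 3, 4, 4, 5]
entity_task_2 = [1, 1, 2, 3, 3, 5, 5]

# Group 2: modellers who tackled one entity task and the attribute task.
entity_task = [1, 2, 2, 3, 4, 4, 5]
attribute_task = [3, 1, 4, 2, 5, 1, 3]

r_entity = pearson(entity_task_1, entity_task_2)
r_mixed = pearson(entity_task, attribute_task)
print(f"entity vs entity:    r = {r_entity:.2f}")
print(f"entity vs attribute: r = {r_mixed:.2f}")
```

In the invented data, modellers who generalise heavily on one entity task tend to do so on the other, while their attribute scores show no such pattern. That is the shape of the result reported above, though the numbers here prove nothing in themselves.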


Summary and Conclusions

The academic world is rightly cautious about the conclusions it draws from surveys, experiments and observation, and you will have to trust me that in 100,000 words I addressed the subject more
rigorously than I can here (or you can read the relevant chapters!). But…

The key conclusion from the research was that data modelling, as practiced, is better characterized as a design discipline than as a process of description or “reality mapping”. This is in
conflict with explicit and implicit characterizations in much of the academic and practitioner literature, and with the way that data modelling is frequently taught. I touch on some of the
implications in my earlier newsletter article, You’re making it up.


Reflections

At the conclusion of a doctoral thesis, the author is expected to reflect on the findings, and on what he or she learned in the process. Though I was aware of it before, the project brought home to
me the level of disconnection between academic research and industry practice, and the possibility that some relatively painless initiatives could bridge that gap. Frankly, I think the problem lies
more on the academic side, but we can do our part by encouraging and assisting those academics brave enough to work with and listen to the practitioner community. Ultimately, we need it: our
profession has relied more on gurus than evidence – helpful in the early days, but not enough if we are to progress.

Next project: does data architecture work?


References

  • Batra, D. and G. M. Marakas (1995). “Conceptual data modelling in theory and practice.” European Journal of Information Systems 4: 185-193.
  • Chaiyasut, P. and G. G. Shanks (1994). Conceptual data modelling process: A study of novice and expert data modellers. 1st International Conference on Object-Role Modelling, Magnetic
    Island, Australia, University of Queensland.
  • de Carteret, C. and R. Vidgen (1995). Data Modelling for Information Systems. London UK, Pitman Publishing.
  • Hitchman, S. (1999). “Ternary relationships – to three or not to three, is there a question?” European Journal of Information Systems 8(3): 224-231.
  • Kim, Y.-G. and S. T. March (1995). “Comparing data modeling formalisms.” Communications of the ACM 38(6): 103-115.
  • Lawson, B. (1997). How Designers Think: The Design Process Demystified. Oxford, Architectural Press.
  • Teorey, T. J., S. Lightstone and T. Nadeau (2006). Database Modeling and Design. 4th Edition, San Francisco, Morgan Kaufmann.
  • Wand, Y. and R. Weber (2002). “Research Commentary: Information systems and conceptual modeling – a research agenda.” Information Systems Research 13(4): 363-376.

[1] In earlier articles, I have used the word analysis rather than description; the change in words seems to reduce the ambiguity in the original framing.

[2] You’re making it up: Data modelling – Analysis or Design?, IRM Newsletter, Feb 2005.

[3] This is indeed the primary purpose of a PhD thesis – something to bear in mind if your interest is in the author’s findings rather than in their expertise.

[4] Participants who did not answer the questions were excluded from these figures.

[5] Clearing the Confusion: Conceptual, Logical, Physical, IRM Newsletter, April 2006.


About Graeme Simsion

Graeme Simsion was, for twenty-five years, a data management consultant, educator and CEO of a successful consultancy. He is a six-time keynote presenter at DAMA conferences (in the US, UK and Australia), author of two books on data modeling and recipient of the DAMA professional achievement award. He holds a doctorate in information systems and an MBA from the University of Melbourne. At the age of fifty, he decided to try something new, and enrolled in an undergraduate program in screenwriting. When he couldn’t get his movie made, he decided to rewrite it as a novel. The Rosie Project spent over a year on the New York Times bestseller list and was the ABIA Australian Book of the Year in 2014, with translation rights sold in forty languages. Sony Pictures have optioned the screenplay. The sequel, The Rosie Effect, was also a bestseller. His upcoming novel, The Best of Adam Sharp, features a database administrator as the romantic hero.
