Data Modeling for Quality

In this article, I describe a method of modelling data so that it meets business requirements. Central to this method is that modelling not only the required data, but also the subset of the real world that concerns the enterprise.

This distinction has long been a subject of discussion in the data modelling world: the first edition of (Kent, Data and Reality, 2015) was published in 1978.

[This article is being co-published in association with the IRM UK Data Governance Conference Europe – a Virtual Conference taking place November 15-17, 2021. The article can also be found at IRM UK Connects. TDAN.com is a media sponsor of this event.]

Why Not Model Just the Required Data?

There are good reasons why we shouldn’t model just the required data.

First, there is usually more than one way of representing a real-world subset in data. Decisions as to which data structures to use for a particular real-world subset depend on processing requirements, relative ease of query writing, and the target DBMS or other data platform. While developers need to know which data structures are being used, and may have input to the decision-making process, business stakeholders are usually only interested in what information about the real-world subset they can view and/or update rather than how it is represented internally.

Second, before investing time and effort in a model depicting the data structures to be used, it is important that the modeller(s) have correctly understood the relevant real-world concepts, the relationships between them, and their attributes (including the behaviour of those attributes). This can only be reliably achieved if business stakeholders are given a model to review that they can understand. A logical data model may contain primary keys as well as business identifiers and foreign keys as well as relationships, adding to the amount of metadata to be analysed by a business stakeholder reviewing the model.

An Example

For example, let’s model the data that allows the following questions to be answered:

What flight(s) travel from Airport A to Airport B on what days?
What airport does Flight X start at?
What airport does Flight X finish at?
What other airports (if any) does Flight X stop at?
On what days does Flight X operate?

Some flights have only an origin and a destination (such as QF11 from Sydney – Los Angeles), while other, multi-leg, flights have intermediate stops (such as QF1 from Sydney – London Heathrow via Singapore). These can be represented in a number of ways:

using a SQL:2003 (object-relational) table with an embedded sub-table as in Figure 1
in traditional SQL relational tables as in Figure 2
in a Flight Leg table in which each leg of a multi-leg Flight is represented separately, as in Figure 3.

Flight No	Origin	Stops	Destination
QF1	SYD	SIN	LHR
QF11	SYD		LAX

Figure 1: SQL:2003 Flight Data

Flight No	Origin	Destination	Flight No	Sequence	Stop
QF1	SYD	LHR	QF1	1	SIN
QF11	SYD	LAX

Figure 2: Traditional SQL Flight Data

Flight No	Origin	Sequence	Destination
QF1	SYD	1	SIN
QF1	SIN	2	LHR
QF11	SYD	1	LAX

Figure 3: SQL Flight Leg Data

Note that each of these representations allows the questions listed above to be answered.

In the remainder of this article I shall use the data structure depicted in Figure 3, as this allows for the simplest queries with which to answer those questions. Let’s model that data structure with the necessary columns. Figures 4 – 6 depict three alternative models:

*Figure 4: Separate Days of Operation Table*

*Figure 5: Repeated Day of Operation Columns*

Some enterprises mandate that all tables must have a surrogate primary key. In such an enterprise the data model in Figure 4 would need to be modified as in Figure 7.

*Figure 6: Boolean Day of Operation Columns*

*Figure 7: Alternative Model with Row ID primary Keys*

The data types used in these models are those available in SQL Server. If DB2 were the target DBMS,

the rowid data type could be used instead of integer for each ID column, and
columns with data type tinyint would have to use the smallint data type.

If Oracle were the target DBMS,

columns with data type integer or tinyint would have to use the number data type, and
columns with data type time would have to use the interval day to second data type.

Current versions of each of these DBMSs are SQL:2003-compliant so we could alternatively use user-defined data types (see Figure 9 later).

Which of These Models Should Business Stakeholders Review?

With so many candidate logical data models, one possibility is to provide business stakeholders with more than one of them to review. This is not a good idea as few business stakeholders are in a position to evaluate which logical data model would make for easy query writing and be appropriate for the target DBMS. Nor is it a good idea for developers (who are in a position to choose the optimum logical data model) to choose one then provide it to business stakeholders to review. If this were to happen, those business stakeholders would be presented with a more complex model than one depicting only business information requirements, as it would also contain implementation data such as primary keys (as distinct from business keys), foreign keys (as well as relationships) and additional tables (as in Figure 4).

Instead of a logical data model, the business should be provided with a business information model (as in Figure 8) which contains:

only data items of interest to the business, including only business identifiers (rather than primary and foreign keys),
a simpler structure (as it depicts Days of Operation as an array rather than a list of columns or a separate table),
business data types (see later) rather than DBMS data types
(where relevant) derived data items such as Flight Duration.

Since SQL:2003-compliant DBMSs support user-defined data types (including array data types), these can be used in a logical data model for implementation in such a DBMS (or in an object-oriented development environment), as shown in Figure 9.

Isn’t That Just a Conceptual Data Model?

Zachman’s Enterprise Architecture Framework makes a distinction between Conceptual, Logical and Physical perspectives of a design. Although the original targets of these three types of model were business stakeholders, designers and builders respectively, the evolution of system development environments has led to developers being concerned with logical rather than physical structures, while database administrators continue to deal with physical structures. As a result, the targets of Conceptual, Logical and Physical Models are now generally taken to be business stakeholders, developers and database administrators respectively.

However, the literature includes many models labelled as “conceptual data models” which vary widely in terms of content, being anything from mere lists of entities (without attributes or relationships) to (sometimes incomplete) logical data models, i.e. many have either insufficient information or too much information for effective review by business stakeholders. As a result, the term “conceptual data model” is now useless as a formal term, which is why I prefer the term “business information model” to signify models to be reviewed by business stakeholders.

Business Data Types

Data models for business review often include DBMS data types such as ‘integer’. Integer data types can be used for many different business purposes, such as,

counts, e.g. ‘Number of passengers’
ordinals, e.g. ‘Sequence Number’
days of the week, e.g. 1 represents Monday
identifiers, e.g. Airline ID.

These behave differently. Counts can be added or subtracted and can be used in greater than/less than (inequality) comparisons, whereas ordinals (and days of the week) can be subtracted and used in inequality comparisons, but not added. It does not make sense to add or subtract identifiers, nor use them in inequality comparisons (although they can be used in equality comparisons).

By contrast, a Business Data Type (or Attribute Class) specifies the semantics of an attribute, for example:

Airline Code: 2 alphanumeric characters
Port Code: 3 alphabetic characters
Flight No: 1 – 4 numeric characters
Sequence No: an ordinal
Departure Time, Arrival Time: times of day as hours and minutes, with the range [00:00 – 23:59]
Flight Duration: a duration measured in hours and minutes
Days of Operation: an array of 7 elements, each being a Day of Week
Day of Week: Monday, Tuesday, Wednesday, Thursday, Friday, Saturday or Sunday.

Since it specifies Business Data Types rather than DBMS data types, the business information model in Figure 8 provides more information to business stakeholders than any of the logical data models.

From a business perspective, each attribute is one of the following:

an identifier, used only to identify entity instances and not implying any property of those instances, e.g. Flight No
a category, holding one of a defined set of values, e.g. Travel Class (First, Business, Economy)
a quantifier: an attribute on which some arithmetic can be performed (e.g. addition, subtraction), and on which comparisons other than “=” and “¹” can be performed, e.g. Number of Passengers, Flight Date, Departure Time of Day
a text item, which can hold any string of characters that the user may choose to enter, e.g. Airline Name.

Each of these attribute classes has a number of subclasses, as described in (Witt, 2021).

Modern data modelling tools provide user-defined datatypes to document attribute classes, including array datatypes for multi-valued attributes as the Days of Operation attribute in Figure 9.

Developing a Logical Data Model From the Business Information Model

A logical data model can be developed from the business information model by way of the following steps:

clone the model, so as to preserve the business information model (see later)
remove any derived data items, in this case Flight Duration
rename all entities and attributes to conform to the naming standard in force
if implementing in other than XML, add primary keys and display foreign keys
convert business data types to DBMS (or XML) data types
if implementing in a non-SQL:2003 DBMS, create an additional table for each multi-valued attribute, in this case Days of Operation.

Other steps might be required to handle composite attributes, n:n (“many-to-many”) relationships, subtypes, history recording and data rules, but these are outside the scope of this article. Descriptions of these steps can be found in (Witt, 2021).

Code Sharing

A given flight segment is operated by a single airline, but the operating airline may have a code sharing agreement with other airlines, who use their own flight numbers. For example, LOT Polish Airlines operates a Warsaw-Frankfurt flight LO381, on which seats can also be sold as Lufthansa flight LH5715, Singapore Airlines flight SQ2381 or United Airlines flight UA6847. The legs of a multi-leg flight generally don’t have the same codeshare flights: for example, seats on Qantas QF1 can be sold as Emirates EK5003 between Sydney and Singapore but not between Singapore and London.

The business information model can be modified to cater for this as in Figure 10. After review of the modified business information model, the corresponding modification can then be applied to the logical data model (this is why we retain the business information model rather than convert it to a logical data model):

If the target DBMS is SQL:2003-compliant, the model in Figure 9 can be modified as in Figure 11.
If a traditional SQL logical data model has been produced, a CodeShareFlight table can be added to represent the CodeShare Flights attribute; for example, the model in Figure 4 would be modified as in Figure 12.

*Figure 10: Business Information Model with Codeshare Flights*

*Figure 11: SQL:2003 Logical Data Model with Codeshare Flights*

*Figure 12: Traditional Logical Data Model with CodeShare Flights*

Conclusions

A business information model should be developed and reviewed by business stakeholders before a logical data model is developed.
Business information models are well-defined in terms of what is included and excluded, by contrast with conceptual data models.
The representation of a business information model in a logical data model depends on the target DBMS (or other data platform), processing requirements and ease of query writing.
A logical data model can be developed from a business information model following a series of simple repeatable steps.
Any changes to the business information model can be replicated in the logical data model.

I shall present an overview of this method at the IRM UK Enterprise Data Conference on Tuesday 16 November 2021 at 11:10GMT. Further details and examples can be found in (Witt, 2021).

Bibliography

Kent, W. (2015). Data and Reality. Technics Publications.

Witt, G. C. (2021). Data Modeling for Quality: Delivering benefits through attention to detail. Technics Publications.

MenuMenu

Data Modeling for Quality

Why Not Model Just the Required Data?

An Example

Which of These Models Should Business Stakeholders Review?

Isn’t That Just a Conceptual Data Model?

Business Data Types

Developing a Logical Data Model From the Business Information Model

Code Sharing

Conclusions

Bibliography

Graham Witt

MenuMenu

Why Not Model Just the Required Data?

An Example

Which of These Models Should Business Stakeholders Review?

Isn’t That Just a Conceptual Data Model?

Business Data Types

Developing a Logical Data Model From the Business Information Model

Code Sharing

Conclusions

Bibliography

Share this post

Graham Witt