XML, data modeling, and database design are all ways to structure data. Each has its place. Unfortunately, our industry is somewhat confused as to what those places are. This article attempts to
sort that out.
“Accurate and germane sharing of information across jurisdictions is a critical issue for justice and public safety. Although there has been significant progress in the field of
information technology, the lack of standards for exchanging justice data has not only been a major obstacle to, but also the principal reason for, the high costs involved with justice
information exchange.”1
So says the introduction to an XML model published by the Justice Department to promote data sharing. In response to this the Justice Department has developed the “Global Justice XML Data
Model” as a “collaborative effort among local, state, tribal, and federal visionaries.”2
The problem is that, while XML is a perfectly good vehicle for describing data to be transmitted from one place to another, it is not so good for describing the semantics – the nature of
– the underlying data. It cannot replace data modeling and sound database design.
This article will present this issue from two different perspectives:
- The meaning and significance of different kinds of data models, and
- The difference between storing data and transmitting data.
DIFFERENT KINDS OF DATA MODELS
Three Ways to Look at Data
In 1975, the American National Standards Institute (ANSI) published its “3-schema approach” to viewing data.3 As they the Institute saw it, there were at least three different
perspectives on data. (Actually, in various drafts, they saw many more perspectives than that, but these have survived.)
External
Everyone in an organization looks at its data slightly differently. Each person holds in mind a “schema” or structure that is the representation of his/her surroundings from a
particular point of view. Each of these schemas describing the world as one person sees it is called by ANSI an external schema. While, ideally, these are similar for the
many people working in a company, as often as not they are in fact very different. Misunderstandings about what something means are at the heart of many systems (and management) problems in
companies today.
Because of their diversity, it is not practical to draw a picture of each person’s view of their data, but it is appropriate when surveying these points of view to assemble an ontology or
glossary of the language used throughout the organization. Where terms have different meanings or there are different terms for the same thing, these differences should be made explicit.
Conceptual
All of these participants in an organization are, after all, fundamentally looking at the same business, even if their perspectives are different. The objective of a good systems analyst is to
integrate the set of external schemas to produce a single conceptual schema which represents the fundamental structure of the organization’s data.
The data model that portrays a conceptual schema consists of boxes representing the things of significance to it, along with lines describing how they are related to each other. (This is often
referred to as an “entity/relationship” or “E/R” model.)
These conceptual schemas can be rendered at various levels of abstraction.
- An enterprise abstract conceptual data model is in terms of things that apply to the organization as a whole. This presents a general structure for addressing people
and organizations, geography, physical assets, activities, and corporate objectives. - A purpose-oriented concrete conceptual data model is in terms of a language that is relatively close to a particular area or function of the organization, such as
contracting, human relations, or manufacturing. It represents assertions that are true for anyone working in that area, given the policies and procedures of the overall organization, but these are
usually about things that are specific to that part of it. The project concrete model is then made up of: 1) sub-types of the enterprise entity classes, 2) “views” (analogous to SQL
views) of sets of entity classes and relationships, and 3) new entity classes.
Internal
ANSI originally described the internal schema as the way data are represented in a physical storage medium. This included the use of random-access storage, tracks and cylinders, and so forth. With
the advent of database management systems, however, this was broken into two schemas:
- A logical schema described data in terms of a particular data management technology. In the early days, this might have been a hierarchical database management system
(DBMS), or a networked one. In more recent times, this describes the tables and columns of a relational DBMS, the classes of an object-oriented program, or XML tags. - Database technology has advanced to the point that a true physical schema describing the physical architecture of a database is more important than ever. A set of
tables and columns in a logical database may be implemented in many pieces scattered all over the world.
Note that along with relational database design, XML is one of several options for implementing a conceptual data structure. XML scripts are examples of logical schemas. As with
relational database designs and object-oriented UML diagrams, these should be derived from a conceptual data model – not a replacement for it.
Note that in each of these technologies, trade-offs must be made in translating from the conceptual model. Relational database design does not recognize super-/sub-type structures, so adjustments
must be made. XML is fundamentally hierarchical, so the translation from a two-dimensional data model will involve de-normalization that will sacrifice some flexibility.
The Semantics of an Entity/Relationship Model
It is the conceptual entity/relationship model that describes the business whose issues are being addressed by any system implementation. This is where the semantics of an
organization, as identified by its external schemas, are resolved into a coherent structure. It is this coherent structure that should be the basis for any system design effort, regardless of the
technology.
A conceptual data model is fundamentally a set of assertions about the nature of the organization. It is composed of entity classes4,
attributes describing the data they represent, and lines representing the relationships between pairs of them.
Figure 1 shows a sample data model describing the structure of an Excel workbook. Included is a description of its authors and the companies it applies to, as well as its create and last update
date and version number. These would apply to any document, but this model is concerned in particular with Excel spreadsheets. Specifically, an EXCEL WORKBOOK (a kind of DOCUMENT) may be composed
of one or more WORKSHEETS, where each WORKSHEET may be composed of one or more CELLS. Each CELL is at the intersection of one ROW and one COLUMN. In addition, each CELL must be in a
particular STYLE.
Note that here, relationship names are assigned rigorously, so that a relationship asserts a significant fact about the organization: “Each EXCEL WORKBOOK may becomposed
ofone or more WORKSHEETS.” This approach is intended for presentation to non-technical business people so it sounds more like reasonable English, unlike the more commonly used
structure, “Each EXCEL WORKBOOK has zero, one or more WORKSHEETS.”
Figure 1: A Sample Data Model
Logical Models
Note that the conceptual model describes a real situation, completely independent of any technology that might be used to implement a system. (This particular example is a special case because the
subject of the model is a technology, but the point still holds. This model describes that subject, without concern for how this descriptive information might be manipulated by
technology.) Indeed, the same conceptual model may be used to implement a multitude of completely different kinds of systems: the entity classes can be rendered as either relational tables or
object-oriented classes. The relationships in a relational world become foreign keys; in an object-oriented world, they become behaviors by which each of the classes gains access to other classes.
And in XML, entity classes become tags, while relationships become implied in hierarchical tag structures. (If you are reading this section, contact your author at dch@essentialstrategies.com. The first five people to do so will receive a free autographed copy of his latest book, Data Model Patterns: A Metadata
Map.)
Specifically, XML
The data model in Figure 1 was actually inferred from an XML sample provided by Peter Aiken.5 An example of a document described by this model is shown in Figure 2. After an
introductory line, it has a tag defining a which is the overall DOCUMENT in the above data model. After lines describing the “namespaces” that define the tag, the workbook contains a
tag for . In the model above, these are attributes of the entity class DOCUMENT.
After the come some lines defining , with one block for each . In the model above, these are instances of the entity class STYLE.
After that comes a . These are shown in the data model by the fact that each EXCEL WORKBOOK may be composed of one or more WORKSHEETS. You will note by the ellipses (…) that this is not
the entire XML document. Space constraints prevent showing the rest of the script, in which are portrayed a set of rows identified by , , and . In the much more compact data model in Figure 1, each
WORKSHEET may be composed of one or more CELLS. Each CELL, in turn, must be in one and only one STYLE and must be located in both one COLUMN and one ROW.
Figure 2: A Sample XML Model
Moreover, of course, this is just a set of instances of the model. The “model” itself is written in a variant of XML called “XML Schema.” This is an XML script using
predefined tags that are for the purpose of defining other XML tags. A portion of the schema used to describe the document in Figure 2 is shown in Figure 3.
Figure 3: A Sample XML Schema
As you can see, there are significant differences between these two ways to represent data structure:
- The Data Model, as a representation of the business is much more accessible to human beings. This is a conceptual model.
- XML is clearly a programming language used for physically manipulating data and moving it from place to place. As such, it is a logical model. Even though XML is a
major improvement over traditional electronic data interchange (EDI) techniques and it is possible to understand each piece, it is virtually impossible for the uninitiated to see the overall
organization of things it describes.
Note that a relational database design is the technique of choice for storing the data before they are sent and after they are received. This is yet another kind of logical
model.
To compare these two kinds of logical models, note that the relational database design has the two-dimensional organization of the data model. As such, it is possible to retrieve data from various
directions. The XML script is fundamentally hierarchical. It represents the person who is the primary author and the secondary author, but only with great difficulty can it represent all the
spreadsheets written by an author.
While an XML script (and a relational database design) can be derived directly from a conceptual data model, it is extremely difficult to derive a conceptual data model from an XML script. Your
author can testify as to how hard this is by virtue of that fact that this indeed is what was done here.
Understand, however, that the model shown in Figure 1 was not directly derived from the XML. Liberties were taken and assumptions were made. The actual derivative from the XML above is shown in
Figure 4, below. Note that the hierarchical structure made it impossible to show any of the configurations above where entity classes had multiple parents. In the XML, “Author,”
“Last Author,” and “Company” are de-normalized as attributes of WORKBOOK. CELL could only be shown as being in a ROW, not as being the intersection of a ROW and COLUMN.
STYLE could only be shown as applying to the entire WORKBOOK, not as being different from cell to cell.
Figure 4: The E/R Model Actually Derived from the XML
Now, if the data model had come first, the XML could at least have been made more correct. STYLE could be a de-normalized attribute of CELL. Indeed, a TABLE could be shown as being composed of one
or more CELLS, with the “Row number,” “Column letter,” “Autofit width,” and “Width” as attributes. Again, this is de-normalized, but that is
acceptable when all you are doing is communicating between two (it is to be hoped) normalized databases.
Because the semantics of the data are not visible in an XML script, it is extremely dangerous to create one without having done a data model first.
Storing and Sharing Data
In addition to semantic considerations, there is another reason to be wary of over-enthusiastic XML projects.
A collection of interested federal employees and contractors6 have banded together to create something called the “Federal Enterprise Architecture” (FEA). It consists of a set of
“Reference Models” that go way beyond just data architecture:
The first three deal with the business of the federal government:
- Business Reference Model (BRM) – Function-driven framework for describing business operations of the federal government independent of the agencies that perform them.
-
Service Component Reference Model (SRM) – Business and performance-driven functional framework that classifies service components with respect to how they support business
and/or performance objectives. - Performance Reference Model (PRM) – Standardized framework to measure the performance of major IT investments and their contribution to program performance.
One is concerned with technology:
-
Technical Reference Model (TRM) – Component-driven, technical framework used to identify the standards, specifications, and technologies that support and enable the
delivery of service components and capabilities.
And one goes to the heart of data architecture:
- Data Reference Model (DRM) – A model describing, at an aggregate level, the data and information that support program and business line operations.
The FEA DRM
The Data Reference Model is significant in that it recognizes three different views on how to describe and use data:
-
Data Description, which includes presentation of the underlying structure of an enterprise’s data and through that, of the enterprise itself. This answers the
question, “What does the data mean?”In the case of structured data, this is the “organized description of data to convey semantic understanding, usually through an entity/relationship model.” Describing unstructured
data is more problematic, limited usually to identifying documents and indexing them in various ways. -
Data Sharing, which is about how to communicate data from one organizational entity to another. The language currently most popular for carrying this out is XML. This
answers the question, “How do I exchange data?” - Data Context, which describes the relationship between data and the business functions being carried out by each agency.
These are shown schematically in Figure 5.
From the point of view of the Federal Data Reference Model, then, it is clear that XML is about data sharing. While some people have tried to create XML “databases,” it seems clear that
this is not the technology to use for managing data, controlling access to it, and retrieving it. For data description, then, the technology of choice is currently the
relational database. Relational database technology has been proving itself for over twenty years
While it is true that a relational database can be used for sharing data across organizational boundaries, many organizations are not ready to make their underlying database available to outsiders.
A means of exporting data and sending it (in a controlled fashion) is clearly desirable, and XML fits this bill.
Figure 5: The Structure of the Data Reference Model
CONCLUSION
For years, the worlds of data modeling and database design have had an uneasy relationship. People whose world is logical database design often don’t understand the value of conceptual data
models, while data modelers often do not understand the constraints under which database designers work.
XML is a new kind of logical model technology, so this category of misunderstandings is relatively new in this field. It is the same set of misunderstandings, however, even if it has a new flavor.
Programming languages must reflect the semantics and mission of an organization if they are to be successful. They are not, however, adequate to represent these fully. If you want to understand the
business problems being addressed, create a data model. Then use that data model as the basis for defining whatever technology you want to use to build systems.
Endnotes:
- U.S. Department of Justice. “Building Exchange Content Using the Global Justice XML Data Model: A User Guide for Practitioners and Developers”. June, 2005. p. v. (Request at:
http://it.ojp.gov/index.jsp.) - Ibid. p. 2.
- American National Standards Institute (ANSI). 1975. “ANSI/X3/SPARC Study Group on Data Base Management Systems; Interim Report”. FDT(Bulletin of ACM SIGMOD) 7:2.
- Originally Dr. Chen described an “entity type” as being the definition of a class of objects called “entities.” Over the years we have become sloppy and confused
“entity types” with “entities.” The object-oriented community called us on that, though, pointing out that unlike data modelers, they understood the difference between
“objects” and “classes”. Properly chastised, your author has therefore lately been calling what is represented by a box in a data model an “entity class”—
to show that, yes, he does indeed understand the difference between a class of entities and the entities themselves. - Peter Aiken is the author of XML for Data Managers (Morgan Kaufmann, 2004), among many other books.
- Officially, the group is known as the “Data Architecture Sub-committee” of the “Architecture and infrastructure subcommittee” of the “CIO Council” under the
Office of Management and Budget.