Enterprise data modeling (EDM)1 got a bad reputation in the late 1980s, and for good reason: it took too long to model the entire enterprise, and the model had little or no effect on how
systems were built. It was perceived as a wasted effort, and the practice was abandoned by most companies. However, in the late 1990s it became clear that we had thrown the baby (sound data
administration principles) out with the bath water (tedious and ineffective approaches). As a result, new disciplines started to emerge, such as metadata administration, data governance, data
stewardship, data-quality improvement, master data management, enterprise information management, the information center of excellence, the integration competency center, and so on.
These new disciplines all strive to achieve the same goal as EDM: to create and maintain an inventory of data assets that are unique, accurate, reusable, and traceable. This article explains the
origin of data modeling and the significance of EDM; compares top-down, bottom-up, and middle-out EDM approaches; describes related data administration principles; and shows how data quality and
master data management (MDM) relate to EDM.
Origin of Data Modeling
In the early days of data processing, systems development was focused on automating manual business processes such as order processing or payroll. Thus, all the early modeling techniques in the
1970s concentrated on process diagrams (data flow diagrams, functional decomposition charts, structure charts) popularized by Gane and Sarson, Ed Yourdon, Tom DeMarco, Larry Constantine, and others. In
1976, just as process-oriented modeling techniques were becoming standard in system development methodologies, Dr. Peter Chen broke the pattern of process-oriented system development with his
invention of the entity/relationship (E/R) model, which soon became known as the logical data model.
E/R modeling was revolutionary in that—for the first time—data, not processes, were at the center of both business analysis and system design. The implications were enormous: data could
now become a reusable commodity, which meant that every unique data element was identified and inventoried once and only once. Business analysts and data modeling practitioners realized they could
finally create a business model of their organization that would logically portray a non-redundant, single version of the truth of their enterprise in terms of its data resources.
Companies created data administration (DA; also known as DRM, IRM, or EIM)2 departments to manage their business data as a corporate asset, just as they managed their financial assets,
fixed assets, real estate, or human resources.
At the same time, Dr. Edgar F. Codd realized that data could also be physically organized and stored independent of any process, so that many processes could share the same data. In his
seminal article “A Relational Model of Data for Large Shared Data Banks,” Codd presented the relational model. In 1985, a few years after DB2, one of the first commercial relational DBMSs, was released, Codd published his famous 12 rules of relational databases (actually 13 rules when counting Rule Zero). The power of today’s RDBMSs is still inextricably tied to the concepts of Peter Chen’s
original E/R model.
Significance of Enterprise Data Modeling
The greatest benefits of logical data modeling come from building the 360-degree view of a business (i.e., the EDM). The difficulty in building this view is that the current data chaos in
most organizations is so immense that it may take significant time and effort to rationalize the existing data into an integrated, nonredundant EDM.
Many people confuse data integration with data consolidation. Consolidating data simply means gathering data elements that identify or describe the same business object (e.g., customer data or
product data) from multiple source files or databases and storing them in one table or in a set of dependent tables. Integrating data goes much further. In addition to consolidating data,
integration enforces data uniqueness—the building blocks of the single version of the truth that enable you to reuse the same data without the need to duplicate it. Data integration requires
several actions during enterprise data modeling. An organization must:
- Examine the definition, semantic intent, and content of each logical entity to find potential duplicates of business objects that would otherwise not be discovered because the objects are known under different names in the systems.
- Ensure that each entity instance has only one unique business identifier, which, in turn, is never reassigned to a new entity instance even after the old instance has expired (see the sketch after this list).
- Use the six normalization rules to put “one fact in one place”; that is, one attribute (data element) in one (and only one) owning entity. This means that an attribute can be assigned to only one entity, either as an identifier or as a descriptive attribute of that (and no other) entity. This modeling activity ensures that each attribute is captured once and only once, and that it remains unique within the data universe of the organization. Hence, the single version of the truth (no synonyms, no homonyms).
- Capture the business actions (or business transactions) that connect the business objects in the real world. These business actions are shown as data relationships among the entities. It is paramount to capture them from a logical business perspective (not from a reporting pattern or data access perspective) because these relationships are the basis for all potential access patterns, known and unknown, now and in the future.
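To make the distinction between consolidation and integration concrete, here is a minimal Python sketch; the source records, the matching key (tax ID), and names such as assign_master_id are hypothetical illustrations, not a prescribed implementation.

```python
# Minimal sketch: consolidation vs. integration of customer data.
# All source names, fields, and matching logic are illustrative assumptions.

from itertools import count

# Consolidation: gather records describing the same business object
# (customer) from multiple source files into one collection.
source_a = [{"cust_no": "A-17", "tax_id": "111-22-3333", "name": "Acme Corp."}]
source_b = [{"client_id": "B-09", "tax_id": "111-22-3333", "name": "ACME Corporation"}]
consolidated = source_a + source_b   # the same object may still appear twice

# Integration: enforce uniqueness -- one entity instance, one business
# identifier that is never reassigned, even after the instance expires.
_next_id = count(1)
master_by_tax_id = {}                # matching key chosen for illustration only
retired_ids = set()                  # identifiers that may never be reused

def assign_master_id(record):
    """Return the single business identifier for this customer."""
    key = record["tax_id"]
    if key not in master_by_tax_id:
        new_id = f"CUST-{next(_next_id):06d}"
        assert new_id not in retired_ids, "identifiers are never reassigned"
        master_by_tax_id[key] = new_id
    return master_by_tax_id[key]

integrated = {}
for rec in consolidated:
    master_id = assign_master_id(rec)
    integrated.setdefault(master_id, {}).update(rec)

# Both source records now describe one instance keyed by 'CUST-000001'.
print(integrated)
```

Consolidation alone would leave two rows describing the same customer; integration collapses them onto one business identifier that is never reused.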
Top-Down, Bottom-Up, and Middle-Out Approaches to EDM
The top-down EDM technique facilitates fact-finding discussions with participants from various business areas who own, originate, or use the data. The participants identify the major business
objects, the data relationships among them, the unique business identifiers of those objects, the most significant and obvious attributes of those objects, and the main business rules for those
attributes. This data modeling technique produces a relatively valid EDM in a rather short time. However, the EDM will be incomplete because it is too time-consuming to identify and model all the
complex business rules during these types of modeling sessions. (Forcing business people to attend a never-ending number of modeling sessions is how EDM earned a bad reputation in the 1980s.)
Bottom-up EDM involves the painstaking task of normalizing existing process-oriented (denormalized) data structures into “best-guess” logical data models and merging these models into
an integrated, non-redundant EDM. Using the bottom-up data modeling technique produces a relatively complete EDM. However, an EDM based on “best-guess” logical data models cannot be
trusted until it is meticulously reviewed and validated by all the business people who own, originate, or use the data. In addition, many business rules cannot easily be extracted from existing
data structures and would therefore be missing in the EDM, rendering the EDM incomplete at best, and incorrect at worst.
The term “middle-out” logical data modeling describes the approach of building the EDM iteratively using both techniques, starting either top-down or bottom-up and then
continuously alternating between the two approaches. The advantage of this technique is that it can be used by data administrators or enterprise information architects to build an EDM for their
data governance or enterprise architecture programs as well as by BI and DW teams to reverse-engineer source files and databases for their integrated ETL process, thus contributing to the EDM.
Data Administration Principles
An EDM is not merely a simple high-level pictorial representation (E/R diagram) of an organization’s data resources. Its ultimate value comes from applying stringent data administration
principles during the logical data modeling process. For example, most people recognize that creating business metadata, such as data definitions, is very important, but even more important is
applying a formalized structure to the data naming process. Using “favorite” names or blindly copying informal names from source systems is not allowed. Without formal data names, data
elements cannot readily be recognized, or they may be misidentified and therefore misused. An unrecognized data element may be unknowingly duplicated as a synonym, or a data element name may be
reused as a homonym for another data element that is described much more accurately by that name.
There are numerous data-naming conventions, the most popular being the “prime words/class words/qualifiers” convention. It prescribes that every data element name must contain one prime word and one or more qualifiers, and must end in one class word (qualifiers can apply to both prime and class words). Class words are predetermined and documented on a published list (e.g., date, text, name,
code, number, identifier, amount, count, etc.). Furthermore, every data element must be fully qualified (to avoid homonyms and to avoid limitations on naming future data elements), and it
must be fully spelled out.
An example of a standardized business name for a data element is “checking account monthly average balance.” The main component (prime word) is “account,” which is further
qualified by the word “checking” to indicate the type of account. The class word indicating the type of data value contained in this data element is “balance,” which is
further qualified by the words “monthly” and “average” to indicate the type of balance. In his book Data Resource Quality, Michael Brackett describes other
data-naming taxonomy components, such as data site, data subject, data occurrence role, data characteristic, and data version.
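As a small illustration of how such a naming convention might be enforced, here is a minimal Python sketch; the prime-word and class-word lists are assumptions and would in practice come from the organization’s published standards.

```python
# Minimal sketch of a data-name check for the prime word / class word /
# qualifier convention. The word lists below are illustrative assumptions.

CLASS_WORDS = {"date", "text", "name", "code", "number", "identifier", "amount", "count", "balance"}
PRIME_WORDS = {"account", "customer", "product", "order"}

def check_data_name(name: str) -> list[str]:
    """Return a list of rule violations for a fully spelled-out data name."""
    words = name.lower().split()
    problems = []
    if not words or words[-1] not in CLASS_WORDS:
        problems.append("name must end in an approved class word")
    if not any(w in PRIME_WORDS for w in words):
        problems.append("name must contain an approved prime word")
    if any(len(w) <= 2 or not w.isalpha() for w in words):
        problems.append("every word must be fully spelled out (no abbreviations)")
    return problems

# The standardized name passes; an informally abbreviated name does not.
print(check_data_name("checking account monthly average balance"))  # []
print(check_data_name("chk acct bal"))
```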
Another DA principle is the formalized process for creating business-data definitions. The definition should be short, precise, and meaningful (a short paragraph). It must thoroughly describe the data element (i.e., what it is) and, optionally, it may contain an example. A data definition should never contain information about the source or use of the data element. Data definitions should be reviewed regularly to ensure that they remain current and correct, and that they are understood and agreed upon by all business people.
One of the logical data modeling rules is “one fact in one place.” This rule refers to the DA principle that every data element must be unique (it must have one and only one semantic
meaning), and, therefore, every data element (attribute) belongs to one and only one entity3. In other words, detail overload is not allowed. You cannot have a general “type
code” attribute where the values A, B, and C correspond to “type of customer” (A = corporate, B = partnership, C = individual) and the values D, E, and F correspond to “type of product” (D = mortgage loan, E = consumer loan, F = construction loan).
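A minimal sketch of what resolving such an overloaded attribute might look like follows; the attribute names and code values simply mirror the hypothetical example above.

```python
# Minimal sketch: splitting an overloaded "type code" into two attributes,
# each with one and only one semantic meaning. Values are illustrative.

CUSTOMER_TYPE = {"A": "corporate", "B": "partnership", "C": "individual"}
PRODUCT_TYPE = {"D": "mortgage loan", "E": "consumer loan", "F": "construction loan"}

def split_type_code(overloaded_code: str) -> dict:
    """Map a legacy overloaded code to exactly one well-defined attribute."""
    if overloaded_code in CUSTOMER_TYPE:
        return {"customer_type_code": overloaded_code}
    if overloaded_code in PRODUCT_TYPE:
        return {"product_type_code": overloaded_code}
    raise ValueError(f"value {overloaded_code!r} is outside both domains")

print(split_type_code("A"))  # {'customer_type_code': 'A'}
print(split_type_code("E"))  # {'product_type_code': 'E'}
```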
A related DA principle states that all data elements must have a specified business data domain, which refers to data values that are allowed with respect to the data name (specifically the class
word), data definition, and its business rules. Data domains can be expressed as a list of values, a range of values, a set of characters, or pattern masks.
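Here is a minimal Python sketch of how those four kinds of domains might be declared and enforced; the attribute names and domain definitions are illustrative assumptions.

```python
# Minimal sketch: expressing business data domains as a list of values,
# a range of values, a set of characters, or a pattern mask.
import re
import string

DOMAINS = {
    "account status code":     {"OPEN", "CLOSED", "FROZEN"},        # list of values
    "monthly average balance": (0.0, 10_000_000.0),                 # range of values
    "branch code":             set(string.ascii_uppercase),         # set of characters
    "account identifier":      re.compile(r"^\d{3}-\d{7}$"),        # pattern mask
}

def value_in_domain(attribute: str, value) -> bool:
    domain = DOMAINS[attribute]
    if isinstance(domain, tuple):                 # range of values
        low, high = domain
        return low <= value <= high
    if isinstance(domain, re.Pattern):            # pattern mask
        return bool(domain.match(str(value)))
    if all(len(str(c)) == 1 for c in domain):     # set of characters
        return all(ch in domain for ch in str(value))
    return value in domain                        # list of values

print(value_in_domain("account status code", "OPEN"))        # True
print(value_in_domain("account identifier", "123-4567890"))  # True
```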
Data normalization is another adopted DA principle. It is fundamental to all other DA principles because, by following the formal normalization rules, you can ensure that all unique descriptive
attributes of an entity are fully dependent on the unique business identifier of that entity. This is the premise for modeling the single version of the truth or 360-degree view of one’s
business.
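As an illustration of that premise, the following minimal sketch (with hypothetical account records and column names) tests whether each descriptive attribute is fully determined by the business identifier.

```python
# Minimal sketch: verify that each descriptive attribute is fully dependent
# on the business identifier (one identifier value -> one attribute value).
from collections import defaultdict

rows = [  # hypothetical account records pulled from two source systems
    {"account_id": "123-0000001", "account type code": "CHK", "open date": "2021-04-01"},
    {"account_id": "123-0000001", "account type code": "SAV", "open date": "2021-04-01"},
    {"account_id": "123-0000002", "account type code": "CHK", "open date": "2022-07-15"},
]

def dependency_violations(rows, identifier, attributes):
    """Return attributes whose values are not fully determined by the identifier."""
    seen = defaultdict(lambda: defaultdict(set))
    for row in rows:
        for attr in attributes:
            seen[attr][row[identifier]].add(row[attr])
    return {attr: {key: vals for key, vals in by_key.items() if len(vals) > 1}
            for attr, by_key in seen.items()
            if any(len(vals) > 1 for vals in by_key.values())}

# 'account type code' is not fully dependent on the identifier here:
print(dependency_violations(rows, "account_id", ["account type code", "open date"]))
```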
Data-Quality Improvement with EDM
Four categories of data-quality rules are applied during logical data modeling:
Rules about business objects (entities): Entities are subject to three data-quality rules: uniqueness, cardinality, and optionality. Uniqueness specifies that every entity has a
unique business identifier. Cardinality refers to the degree of a relationship; that is, the number of times one entity can be related to another. The “correct” cardinality in every
situation depends completely on the definition of your entities and the business rules governing those entities. Optionality is a type of cardinality, but instead of specifying the maximum number
of times two entities can be related, it identifies the minimum number of times they can be related. There are only two options: either two entities must be related at least once (mandatory
relationship) or they don’t have to be related (optional relationship).
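To show how these three entity rules translate into executable checks, here is a minimal sketch; the customer and account records, and the assumed business rule that every account belongs to exactly one customer, are illustrative only.

```python
# Minimal sketch: uniqueness, cardinality, and optionality checks for
# hypothetical Customer and Account entities (1 customer : 0..n accounts,
# every account must relate to exactly one customer).
from collections import Counter

customers = [{"customer_id": "C1"}, {"customer_id": "C2"}, {"customer_id": "C2"}]
accounts = [{"account_id": "A1", "customer_id": "C1"},
            {"account_id": "A2", "customer_id": "C9"}]

# Uniqueness: every entity instance has a unique business identifier.
dupes = [cid for cid, n in Counter(c["customer_id"] for c in customers).items() if n > 1]

# Optionality (mandatory relationship): every account must reference a customer.
known = {c["customer_id"] for c in customers}
orphans = [a["account_id"] for a in accounts if a["customer_id"] not in known]

# Cardinality: the observed degree of the relationship (accounts per customer).
accounts_per_customer = Counter(a["customer_id"] for a in accounts)

print(dupes)                  # ['C2']  -> uniqueness violation
print(orphans)                # ['A2']  -> mandatory relationship violated
print(accounts_per_customer)  # observed degree of the relationship
```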
Rules about data elements (attributes): Attributes are subject to two data-quality rules: data inheritance and data domains. Data inheritance rules apply to supertype and subtype
entities. Data domain rules refer to a set of allowable values, as discussed earlier.
Rules about the dependencies between entities or attributes: This category includes seven data dependency rules: three for entity relationships and four for attributes.
The three entity relationship dependency rules are:
- The existence of a data relationship depends on the state (condition) of another entity that participates in the relationship.
- The existence of one data relationship mandates that another data relationship also exist.
- The existence of one data relationship prohibits the existence of another data relationship.
The four attribute dependency rules are:
- The value of one business attribute depends on the state (condition) of the entity in which the attribute exists.
- The correct value of one attribute depends on the values of two or more other attributes.
- The allowable value of one attribute is constrained by the value of one or more other attributes in the same entity or in a different but related entity.
- The existence of one attribute value prohibits the existence of another attribute value in the same entity or in a different but related entity.
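A minimal sketch of how such attribute dependency rules might be validated against records follows; the account attributes, status values, and the two rules themselves are hypothetical examples.

```python
# Minimal sketch: two hypothetical attribute dependency rules checked
# against account records.

def check_dependencies(account: dict) -> list[str]:
    violations = []
    # Rule: the value of 'close date' depends on the state of the entity --
    # a CLOSED account must have a close date, an OPEN account must not.
    if account["status code"] == "CLOSED" and account.get("close date") is None:
        violations.append("closed account is missing its close date")
    if account["status code"] == "OPEN" and account.get("close date") is not None:
        violations.append("open account must not carry a close date")
    # Rule: the allowable value of 'overdraft limit amount' is constrained
    # by the value of 'account type code' in the same entity.
    if account["account type code"] != "CHK" and account.get("overdraft limit amount", 0) > 0:
        violations.append("only checking accounts may have an overdraft limit")
    return violations

acct = {"status code": "CLOSED", "close date": None,
        "account type code": "SAV", "overdraft limit amount": 500}
print(check_dependencies(acct))  # reports both violations
```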
Rules about data validity: These rules govern the quality of data values (data domains):
- The data-completeness rule comes in four flavors: entity completeness, relationship completeness, attribute completeness, and domain completeness.
- The data-correctness rule requires that all data values for an attribute must be correct and representative of the attribute’s name, definition, domain, and business rules.
- The data-accuracy rule states that all data values for an attribute must be accurate in terms of the attribute’s dependency rules and its state in the real world.
- The data-precision rule specifies that all data values for an attribute must be as precise as required by the attribute’s business requirements, business rules, intended meaning, intended usage, and precision in the real world.
- Five data-uniqueness rules address entities and attributes.
- Two rules enforce data consistency.
Because the activities for creating an EDM include validating the data-quality rules applied during logical data modeling against the existing operational data in the source files and databases, EDM contributes directly to improving data quality.
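For instance, two of the validity rules above (attribute completeness, and data correctness against a declared domain) might be validated against operational data roughly as in the following sketch; the required-attribute list and the domain are assumptions.

```python
# Minimal sketch: attribute completeness (no missing required values) and
# data correctness (every value conforms to its declared domain).

REQUIRED = ["customer identifier", "customer name", "customer type code"]
DOMAIN = {"customer type code": {"A", "B", "C"}}   # allowable values (assumed)

def validity_findings(record: dict) -> list[str]:
    findings = []
    for attr in REQUIRED:
        if record.get(attr) in (None, ""):
            findings.append(f"attribute completeness: {attr!r} is missing")
    for attr, domain in DOMAIN.items():
        value = record.get(attr)
        if value is not None and value not in domain:
            findings.append(f"data correctness: {value!r} violates the domain of {attr!r}")
    return findings

print(validity_findings({"customer identifier": "C1",
                         "customer name": "",
                         "customer type code": "X"}))
```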
Master Data Management Using EDM
The combination of MDM and master metadata management is currently hyped as a new technique to ensure that an organization’s data about its core business objects4 is unique,
consistent, reliable, and traceable. Many definitions exist for MDM:
- “A set of disciplines, applications, and technologies for harmonizing and managing the system of record and system of entry for the data and metadata associated with the key business entities of an organization.” (Colin White, Claudia Imhoff)
- “MDM comprises the business applications, methods, and tools that implement the policies, procedures, and infrastructure to support the capture, integration, and subsequent shared use of accurate, timely, consistent, and complete master data.” (David Loshin)
- “MDM is a set of corporate disciplines that ensures that corporate reference data, such as charts of accounts and customer information, is presented in a clear and consistent manner across the enterprise.” (Al Moreno and Greg Mancuso)
- “The organization, management, and distribution of corporately adjudicated, high-quality information with widespread use in the organization for business gain.” (William McKnight)
MDM is certainly much more than EDM because of its technical implementation with various products and tools. However, it does have much in common with EDM. The most striking commonality is summed
up by the title of Philip Russom’s report published by TDWI in October 2006: Master Data Management: Consensus-Driven Definitions for Cross-Application Consistency. As described in
previous sections, one of the DA principles applied during EDM is creating precise, consensus-driven definitions, along with consensus-driven names, domains, and business rules.
Almost all MDM articles mention that MDM needs data governance. Some definitions of data governance include:
- “The execution and enforcement of authority over the management of data assets and the performance of data functions.” (Robert Seiner)
- “The process by which you manage the quality, consistency, usability, security, and availability of your organization’s data.” (Jane Griffin)
- “A process and structure for formally managing information as a resource; ensures the appropriate people representing business processes, data, and technology are involved in the decisions that affect them; includes an escalation and decision path for identifying and resolving issues, implementing changes, and communicating resulting action.” (Danette McGilvray)
These definitions, especially the last one, also describe EDM.
Conclusion
EDM is a mature and proven technique that spans logical data modeling, business metadata management, data governance, data stewardship, data-quality improvement, and MDM. The use of Chen’s
E/R modeling technique, Codd’s six normalization rules, and the extensive DA principles of the 1980s still make EDM the most effective technique for achieving the single version of the truth
or the 360-degree view of a business, at least logically. Data management groups, such as DG, DQ, MDA, EIM, ICOE, and ICC5, as well as project teams of physical implementation
initiatives such as MDM, CDI, PIM, CRM, EDW, and ODS6, would benefit greatly from leveraging the science of EDM.
1I will use the abbreviation EDM for both enterprise data modeling and the enterprise data model.
2Data resource management, information resource management, enterprise information management.
3This rule does not apply to foreign keys because they are not part of logical data modeling. Foreign keys are physical manifestations of logical data relationships.
4Core business objects are the kernel entities of an EDM.
5Data governance, data quality, metadata administration, enterprise information management, information/integration center of excellence, integration competency center.
6Master data management, customer data integration, product information management, customer relationship management, enterprise data warehouse, operational data store.
REFERENCES:
Adelman, Sid, Larissa T. Moss, and Majid Abai. Data Strategy. Addison-Wesley, 2005.
Brackett, Michael H. Data Resource Quality. Addison-Wesley, 2000.
Codd, Dr. Edgar F. “A Relational Model of Data for Large Shared Data Banks.” Communications of the ACM, Volume 13, Number 6, 1970.
Duncan-Hepp, Karolyn, and David L. Wells. “Rule Based Data Cleansing.” The Journal of Data Warehousing, 1999.
Moss, Larissa, and Steve Hoberman. The Importance of Data Modeling as a Foundation for Business Insight. White paper, Teradata, 2005.