A Taxonomy of Data Management Disciplines

When talking about data we often confront many ambiguous terms:
  • data
  • information
  • systems
  • architecture
All of these terms are used in a variety of ways. As we proceed in this essay, we shall try to clarify their definitions. For instance, data tends to be raw and granular, whereas information tends to be derived and more “summary.”
 
Data management (known in the 1970s and 1980s as “data administration”) is really a network of interrelated disciplines, skill sets, goals, and values. The Data Management Association (DAMA) and its various chapters have been the anchor organization in advancing the maturity of all these disciplines.
 
As enterprise business applications have become larger and more complex, the disciplines to build and support them (particularly the systems development life cycle) have necessarily matured as well. Data management has matured to a level of professionalism with a body of knowledge (see the DAMA website) and certification.

In the 1960s, database design was done intuitively, but it is now done in the context of a more rigorous systems development life cycle (see below), with robust data modeling tools and standards of design.

Data management is generally confined to large, complex enterprises. A small company (even a sole proprietorship doing invoicing and billing on a laptop) probably does not need data management because the human running it knows where all the data is. However, regular backup and recovery methods are still necessary for preserving the data from catastrophic loss.
 
In the brain of the sole proprietor, each fact is encased in context: additional information that provides meaning for that fact. And the facts are networked to give each other context.


Figure 1

As successful organizations grow, the knowledge once held in the brain of the proprietor becomes fragmented among the various employees whose focus is on their own business function. If information is power, some proprietors are willing to tolerate a situation where they don’t know every fact – they learn to trust their employees. Other proprietors remain controlling, and want to micromanage everything.
 
This fragmentation of data and information is the point at which databases become necessary for capturing that data (previously information) in tabular structures with rules of definition and format, and then making it available to other employees who need it at other times.
 
In reality, information “degrades” into data as it is taken out of the brain and recorded into tabular structure because the context is stripped away.

Figure 2

But those tabular structures (a.k.a. computer files and databases) are the best we have. So it is in the large (or growing) organizations where we need to manage that data to ensure its quality, reliability, and definitional integrity, to guard against inappropriate changes in facts (the real world changes all the time, and our data must keep up), and to ensure that such changes don’t have adverse downstream consequences.

But what you do with data and how it is to be managed depends upon some basic characteristics, which allow the taxonomy discussed here.

Basic Categories of Data

A good way to approach this is to first categorize the kinds of data. One useful distinction is between raw data and derived data.

Figure 3

Raw data (shown on the left in Figure 3) includes recorded observations in their most granular or elemental state. This is often the “data of record.” This is often what needs most to be governed.
 
It can be observations about natural things and events in reality (tangible and intangible) or it can be normative declarations about reality (such as a VP of sales inventing a name for a sales district).
 
To the right (in Figure 3) are derived data – aggregates, ratios, and such. These tend toward what I will call “information.” They cannot be built without reliable raw data.
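
The dependence of derived data on raw data can be shown with a minimal Python sketch. The sales records, district names, and function names below are hypothetical and chosen only for illustration; they do not come from Figure 3.

```python
from collections import defaultdict

# Hypothetical raw data: one record per observed sale (the "data of record").
raw_sales = [
    {"district": "North", "amount": 120.0},
    {"district": "North", "amount": 80.0},
    {"district": "South", "amount": 50.0},
]

def derive_totals(records):
    """Aggregate granular sales into a total per district (derived data)."""
    totals = defaultdict(float)
    for record in records:
        totals[record["district"]] += record["amount"]
    return dict(totals)

def derive_shares(totals):
    """Compute each district's share of all sales (a ratio built on the aggregate)."""
    grand_total = sum(totals.values())
    return {district: amount / grand_total for district, amount in totals.items()}

totals = derive_totals(raw_sales)
shares = derive_shares(totals)
```

Any error in a single raw record flows straight into both the aggregate and the ratio, which is why the derived side cannot be better than the raw side.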
 
Data warehousing is a way of organizing and conditioning derived data to facilitate exploration, reporting, graphical expression, and ultimately understanding of business behavior and conditions. A data warehouse contains a copy of production data, organized for these goals. It tends to feed reporting or business intelligence (BI) tools, which may be used by executives or their subordinate knowledge workers.

Data warehouses tend to give visibility to high-level trends and support “macro decisions” (with widespread consequence) while the raw data (on the left in Figure 3) is used more to support “micro-decisions” dealing with individual customers, patients, events, etc.

Parallel to this construct are two major categories of data disciplines.

Figure 4

Data management (in Figure 4) generally addresses raw data (observations and declarations) while information management (on the right in Figure 4) focuses upon derived information, which is more conducive to management’s understanding of reality and informed strategic decision making.

 
Sub-Disciplines of Data Management

Data administration emerged (in the 1980s) with a primary focus upon data architecture (an abstraction) and data modeling (which was often viewed as a necessary prerequisite to good database design). People talked about the “enterprise data model,” and some large companies tried to build one. (They later discovered that there generally is no single enterprise logical data architecture, due to severe fragmentation of data usage from multiple uncoordinated applications.)
 
In the actual management of granular data (or “data of record”), there are three primary sub-disciplines. All of these tend (in a religious sense) to view data as a valuable corporate asset (even if the executives have not learned that yet).
  • Data quality. The assessment of the quality of data and the efforts to improve that quality. Quality characteristics include presence, validity, accuracy, precision, reliability, and such.
  • Data governance. A recently emerging field that seeks to preserve data quality and to build procedures and an organizational framework that prevent ill-advised changes in critical data from causing major problems downstream.
  • Data stewardship. A way of distributing and assigning responsibility for “data of record” to individuals in organizational situations with the most knowledge about the data and its business usage. Stewardship also addresses the correct movement of data between applications and organizations.
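
Two of the quality characteristics named above, presence and validity, lend themselves to simple automated checks. What follows is a minimal Python sketch under stated assumptions: the records, the list of valid status values, and the age range are all hypothetical. Accuracy is deliberately left out, because it can only be judged by comparing the data with the real world.

```python
# Hypothetical domain (list of valid values) for a status code.
VALID_STATUS = {"active", "inactive"}

# Hypothetical records; None marks a missing value.
records = [
    {"id": 1, "status": "active", "age": 34},
    {"id": 2, "status": "unknown", "age": 51},
    {"id": 3, "status": None, "age": -4},
]

def assess(records):
    """Return a list of (record id, field, problem) tuples."""
    problems = []
    for rec in records:
        status = rec.get("status")
        if status is None:
            problems.append((rec["id"], "status", "missing"))      # presence
        elif status not in VALID_STATUS:
            problems.append((rec["id"], "status", "invalid"))      # validity
        age = rec.get("age")
        if age is None:
            problems.append((rec["id"], "age", "missing"))         # presence
        elif not 0 <= age <= 130:
            problems.append((rec["id"], "age", "out of range"))    # validity
    return problems

problems = assess(records)
```

A report like this supports assessment only. Deciding who corrects each problem, and how, is the business of governance and stewardship.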

Figure 5

These three areas overlap considerably. For example, data governance practitioners are always interested in the quality of the data.

While data can be tested in a centralized business function, data improvement is really a distributed effort. Data is created and captured in a wide variety of business processes; and while well-designed screens, forms, and reports are helpful, the quality of these processes is not solely an information technology function.
 
But neither is data governance solely an information technology function. Again, it is business-oriented, and technological nerds may not be the best choices for data governance or data stewardship.

What Data Management is Not

In explaining data management, we probably ought to also point out what data management is not. First, it is not specific to any kind of technology (computer hardware, operating system, DBMS,1 etc.). In fact, true data management (as I will say over and over) is more “business-oriented”2 than it is hardware-oriented.


Figure 6

So while all these infrastructure disciplines are important,3 they are far less business-oriented (except for their support of speed and efficiency) than are the data management interests.

     
Information Management

I don’t think the data administration thinkers of the 1970s envisioned this kind of taxonomy. But “decision-support systems” came to be called “data warehouses” as it became clear that heavy queries sweeping entire tables of production databases degraded the performance of the online transaction processing that relied on those same databases.
 
The solution was to copy the granular production data to a separate database (often on a separate hardware platform) where queries could be run without impacting the essential transactions that keep the business going.
 
These became data warehouses. A variety of ways to organize the data (physically and logically) emerged (such as dimensional databases and star schemas) to enhance the speed and ease of posing ad hoc queries against the data.
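
To give a feel for the star schema idea, here is a minimal sketch using Python's standard sqlite3 module. The table names, columns, and values are hypothetical, and the "star" has only one dimension table; a real one would usually have several (date, product, customer, and so on) around the central fact table.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Dimension table: descriptive attributes used to slice the facts.
CREATE TABLE dim_district (district_id INTEGER PRIMARY KEY, district_name TEXT, region TEXT);
-- Fact table: one row per sale, pointing at its dimension row.
CREATE TABLE fact_sales (sale_id INTEGER PRIMARY KEY,
                         district_id INTEGER REFERENCES dim_district(district_id),
                         amount REAL);
INSERT INTO dim_district VALUES (1, 'North', 'East'), (2, 'South', 'East'), (3, 'Coast', 'West');
INSERT INTO fact_sales VALUES (1, 1, 100.0), (2, 2, 50.0), (3, 3, 70.0), (4, 3, 30.0);
""")

# An ad hoc query: total sales by region, joining facts to the dimension.
by_region = conn.execute("""
    SELECT d.region, SUM(f.amount)
    FROM fact_sales AS f
    JOIN dim_district AS d ON d.district_id = f.district_id
    GROUP BY d.region
    ORDER BY d.region
""").fetchall()
```

The point of the layout is that most ad hoc questions take the same shape: join the fact table to a dimension, group by one of the dimension's attributes, and aggregate.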

Moving the data from its granular production environment to the information environment required methods (and programs and tools at times) of extract, transformation, and load (ETL) to copy the data.
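
The three ETL steps can be sketched in a few lines of Python with the standard sqlite3 module. This is only an illustration under assumptions: the table names, columns, and the particular transformation (cents to currency units, dates to months) are hypothetical, and two in-memory databases stand in for the separate production and warehouse platforms.

```python
import sqlite3

def extract(conn):
    """Extract: read granular rows from the (hypothetical) production table."""
    return conn.execute(
        "SELECT order_id, order_date, amount_cents FROM orders").fetchall()

def transform(rows):
    """Transform: convert cents to currency units and keep only the month."""
    return [(order_id, order_date[:7], amount_cents / 100.0)
            for order_id, order_date, amount_cents in rows]

def load(conn, rows):
    """Load: write the conditioned rows into the warehouse table."""
    conn.executemany("INSERT INTO order_facts VALUES (?, ?, ?)", rows)
    conn.commit()

# Production side: granular transaction data.
production = sqlite3.connect(":memory:")
production.execute(
    "CREATE TABLE orders (order_id INTEGER, order_date TEXT, amount_cents INTEGER)")
production.executemany("INSERT INTO orders VALUES (?, ?, ?)",
                       [(1, "2024-01-15", 1250), (2, "2024-02-03", 800)])

# Warehouse side: a copy organized for querying.
warehouse = sqlite3.connect(":memory:")
warehouse.execute(
    "CREATE TABLE order_facts (order_id INTEGER, order_month TEXT, amount REAL)")

load(warehouse, transform(extract(production)))
loaded = warehouse.execute("SELECT * FROM order_facts ORDER BY order_id").fetchall()
```

Keeping the three steps as separate functions makes it possible to test the transformation rules without touching either database.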

Data warehouses also benefit from integrating into them data from sources outside the enterprise. These data flows require particular care because of frequent differences in logical architecture and perhaps a lack of reliability of the consistency and quality of the input flow.


Figure 7

Business intelligence is the downstream activity from data warehousing (upper right in Figure 7) and includes traditional reporting and data visualization (graphic techniques).

Data Architecture and Modeling

All of these data disciplines require a rich understanding of the business meaning of data. This is primarily found in the understanding of logical data architecture (not to be confused with infrastructure4 “architecture”). The best expressions of that data architecture are found in logical data models, physical data models, and entity-relationship (ER) diagrams (ERD).


Figure 8: A very simple data model for a small college.

Data architecture and data modeling constitute an essential skill set that supports all of these data management functions.

Figure 9

Data modeling and data architecture are also essential elements of the systems development life cycle through which high-quality business application software is built.


Figure 10

A logical data model (symbolized by a small ER diagram icon in Figure 10) should be built in the requirements phase, and inform everyone’s understanding of the business problem. The logical data model may then be used to create a physical database model that is used by the DBA5 to build a test version of the database underneath the application, and ultimately the production database.
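
The step from logical model to physical model to test database can be sketched in Python. Everything here is an assumption made for illustration: the two entities, the logical type names, and the mapping to SQLite column definitions are hypothetical, and a real modeling tool would also handle relationships, constraints, and naming standards.

```python
import sqlite3

# Hypothetical logical model: entity -> attributes with logical types.
LOGICAL_MODEL = {
    "student": {"student_id": "identifier", "name": "text"},
    "course": {"course_id": "identifier", "title": "text"},
}

# Assumed mapping from logical types to SQLite column definitions.
TYPE_MAP = {"identifier": "INTEGER PRIMARY KEY", "text": "TEXT NOT NULL"}

def to_ddl(model):
    """Translate each logical entity into a CREATE TABLE statement."""
    statements = []
    for entity, attributes in model.items():
        columns = ", ".join(
            f"{name} {TYPE_MAP[kind]}" for name, kind in attributes.items())
        statements.append(f"CREATE TABLE {entity} ({columns})")
    return statements

ddl = to_ddl(LOGICAL_MODEL)

# Build a throwaway test database from the generated statements.
test_db = sqlite3.connect(":memory:")
for statement in ddl:
    test_db.execute(statement)
tables = [row[0] for row in test_db.execute(
    "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name")]
```

The logical model stays free of any DBMS detail. Only the type mapping would change if the physical target were Oracle or DB2 rather than SQLite.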
 
Other Related Data Disciplines

The knowledge about the enterprise and its data, its logical business architecture, etc. must be captured in a way that is preserved and easily conveyed to other people around the enterprise who need it.


Figure 11

This is frequently done through data documentation that is often stored in a metadata repository. Such commercial off-the-shelf repositories are quite expensive, and often difficult to implement. Repository projects often fail, or are abandoned, before they achieve full usage. A useful fall-back position is to create a library (or folder) of simpler documents (text, spreadsheets, data models) that can be easily found and viewed by anyone seeking such information.6

Other Important Data Management Functions

Data quality is actually a discipline and function that has application in many parts of the enterprise, especially where data is flowing from one database or source to another. Hence, the flow of granular data into a data warehouse (which requires ETL as previously mentioned) also may require data quality surveillance in ensuring the consistency of the data being transferred. This applies to data acquired from within the enterprise (with hopefully rigorous standards of quality and definition) and data acquired from outside the enterprise. I would call this “data acquisition” (symbolized by the maroon box connecting data management and data exploitation in Figure 12).


Figure 12

In addition, bringing data from diverse sources together in the data warehouse in a way that they make sense requires thoughtful analysis of the meaning, architecture, and behavior of each source, and the semantic data integration (SDI) of those data into the target data warehouse database.
 
Merely getting the data on the common physical platform is not enough. It must be semantically integrated in a way that it makes business sense. Codes from one source must have the same domain (list of valid values) and meaning as codes from another source. Numeric fields must behave in a similar way, be of the same units of measure, and conform to the same business meaning.
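
A minimal Python sketch of what this means in practice follows. The two sources, their status codes, and the choice of kilograms as the warehouse unit of measure are hypothetical assumptions, not a description of any particular tool.

```python
# Hypothetical source-specific code lists mapped onto one shared domain.
STATUS_MAP = {
    "source_a": {"A": "active", "I": "inactive"},
    "source_b": {"1": "active", "0": "inactive"},
}

# Conversion factors to the assumed warehouse unit of measure (kilograms).
TO_KILOGRAMS = {"kg": 1.0, "lb": 0.45359237}

def integrate(source, record):
    """Translate one source record into the target warehouse conventions."""
    code = record["status"]
    if code not in STATUS_MAP[source]:
        # Reject values outside the source's known domain instead of guessing.
        raise ValueError(f"code {code!r} is outside the domain of {source}")
    return {
        "status": STATUS_MAP[source][code],
        "weight_kg": round(record["weight"] * TO_KILOGRAMS[record["unit"]], 2),
    }

row_a = integrate("source_a", {"status": "A", "weight": 60.0, "unit": "kg"})
row_b = integrate("source_b", {"status": "1", "weight": 150.0, "unit": "lb"})
```

After integration, both rows use the same status domain and the same unit, so they can be compared and aggregated. The mapping tables themselves are business knowledge and need the same stewardship as the data.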

This overview of the data management disciplines and areas of interest should help in navigating the large amount of literature in this area.

In the Context of the Full I.T. Component Model

Many readers of my material will remember the basic Information Technology Component Model (see Figure 13) that I have used to show how data relates to the business (above it) and the application software (below it).


Figure 13

This model is very useful in explaining many things. For example, it shows the substantial distance from the hardware infrastructure to the business. Additionally, the skill sets appropriate to each layer (see Figure 14) are quite different, appealing to different personality temperaments.7


Figure 14

Those different temperaments are why a person (“geek”?, “nerd”?) who is not particularly social and loves to be left alone and work on hardware is not the best person to be involved with data management, data governance, and data quality. All these “higher” disciplines (on the chart) are business-oriented, not hardware-oriented.

Some businesses try to recruit “developers” who have a full range of skill sets covering all these levels. That is often unrealistic, but it is often done by CIOs who are infrastructure-oriented and don’t understand the “soft” skills of gathering business requirements and meeting the semantic needs of business data.

The taxonomy proposed in this essay does not address the taxonomy of other levels (such as applications, operating systems, and infrastructure). It is at the data level (see Figure 15).

Figure 15

All these disciplines should be aware of each other and mutually supportive.

End Notes:

  1.  DBMS = database management system
  2. When I say “business-oriented,” I am talking about whatever the primary purpose of the enterprise is. If it is healthcare, it is oriented towards the patient, the diagnostic and treatment activities as well as support of generalized hospital processes. If it is education, it is student-oriented and learning-oriented.
  3. Data architects and database administrators are often grouped in the same organizational unit or cost center. They do need to talk to each other a lot.
  4. Infrastructure, meaning the platforms, networks, operating systems, and servers underneath the business software.
  5. DBAs (database administrators) specialize in Oracle, DB2, etc., and worry about table partitioning, table spaces, indexes, and other features to improve speed and performance.
  6. Generally, the users of metadata are project designers, systems analysts, developers, and programmers who are often not familiar with the metadata resources, or not skilled in navigating them. If they have experience with legacy applications, they often make unwarranted assumptions about the to-be logical design, biased by their previous experience.
  7. The Myers-Briggs model of temperament or personality is very useful in understanding why systems programmers may not be the best people to do business analysis. This author is an INTP. I recommend learning this model as a part of your basic tool set in dealing with other business and I.T. personalities.

Resources:

The Data Management Association: DAMA is a non-profit professional organization with over 30 chapters around the United States and more overseas.

International Association of Information and Data Quality: A spinoff from DAMA in 2004, IAIDQ is the primary non-academic organization for practitioners of data quality. It holds annual conferences around the U.S.

The Data Warehousing Institute (TDWI): TDWI is a for-profit organization putting on conferences and on-site training in the various sub-disciplines of data warehousing. It also has a network of non-profit chapters in major cities.

Data Governance Institute: The primary emerging authority on the data governance discipline.
