Data management (known in the 1970s and 1980s as “data administration”) is really a network of interrelated disciplines, skill sets, goals, and values. The Data Management Association (DAMA) and its various chapters have been the anchor organization in advancing the maturity of all these disciplines.
As enterprise business applications have become larger and more complex, the disciplines to build and support them (particularly the systems development life cycle) have also necessarily matured. Data management has matured to a level of professionalism with a body of knowledge (see the DAMA website) and certification.
In the 1960s, database design was done intuitively; it is now done within a more rigorous systems development life cycle (see below), with robust data modeling tools and design standards.
Data management is generally confined to large, complex enterprises. A small company (even a sole proprietorship doing invoicing and billing on a laptop) probably does not need data management because the human running it knows where all the data is. However, regular backup and recovery methods are still necessary for preserving the data from catastrophic loss.
In the brain of the sole proprietor, each fact is encased in context: additional information providing meaning for that fact. And the facts are networked to give each other context.
This fragmentation of data and information is the point at which databases become necessary: they capture that data (previously information) in tabular structures with rules of definition and format, and make it available to other employees who need it at other times.
In reality, information “degrades” into data as it is taken out of the brain and recorded into tabular structure because the context is stripped away.
But what you do with data and how it is to be managed depends upon some basic characteristics, which allow the taxonomy discussed here.
Basic Categories of Data
Raw data (shown on the left in Figure 3) includes recorded observations in their most granular or elemental state. This is often the “data of record,” and it is often what most needs to be governed.
It can be observations about natural things and events in reality (tangible and intangible) or it can be normative declarations about reality (such as a VP of sales inventing a name for a sales district).
To the right (in Figure 3) are derived data – aggregates, ratios, and such. These tend toward what I will call “information.” They cannot be built without reliable raw data.
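The relationship between raw and derived data can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `raw_sales` records, the field names, and both helper functions are hypothetical and do not come from any particular system.

```python
from collections import defaultdict

# Hypothetical raw "data of record": one row per individual sale.
raw_sales = [
    {"district": "North", "amount": 120.0},
    {"district": "North", "amount": 80.0},
    {"district": "South", "amount": 50.0},
]

def derive_totals(rows):
    """Aggregate granular rows into a total per district."""
    totals = defaultdict(float)
    for row in rows:
        totals[row["district"]] += row["amount"]
    return dict(totals)

def derive_shares(totals):
    """Turn the totals into ratios of the grand total."""
    grand_total = sum(totals.values())
    return {district: value / grand_total for district, value in totals.items()}

totals = derive_totals(raw_sales)   # aggregates
shares = derive_shares(totals)      # ratios
```

Both derived values are only as trustworthy as the raw rows: a wrong amount or a misspelled district in `raw_sales` flows straight into the totals and the shares.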
Data warehousing is a way of organizing and conditioning derived data to facilitate exploration, reporting, graphical expression, and ultimately understanding of business behavior and conditions. A data warehouse contains a copy of production data, organized for these goals. It tends to feed reporting or business intelligence (BI) tools, which may be used by executives or their subordinate knowledge workers.
Data warehouses tend to give visibility to high-level trends and support “macro decisions” (with widespread consequence) while the raw data (on the left in Figure 3) is used more to support “micro-decisions” dealing with individual customers, patients, events, etc.
Parallel to this construct are two major categories of data disciplines.
Sub-Disciplines of Data Management
In the actual management of granular data (or “data of record”), there are three primary sub-disciplines. All of these tend (in a religious sense) to view data as a valuable corporate asset (even if the executives have not learned that yet).
- Data quality. The assessment of the quality of data and the efforts to improve that quality. Quality characteristics include presence, validity, accuracy, precision, reliability, and such.
- Data governance. A recently emerging field that seeks to preserve data quality and to build the procedures and organizational framework that prevent ill-advised changes in critical data from causing major problems downstream.
- Data stewardship. A way of distributing and assigning responsibility for “data of record” to individuals in organizational situations with the most knowledge about the data and its business usage. Stewardship also addresses the correct movement of data between applications and organizations.
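Two of the quality characteristics listed above, presence and validity, lend themselves to simple automated tests. The following is a minimal sketch, assuming a hypothetical customer record with a `customer_id` field and a `status` code whose domain is `VALID_STATUS`; none of these names come from a real system.

```python
# Hypothetical domain (list of valid values) for a status code.
VALID_STATUS = {"A", "I"}

def check_record(record):
    """Return a list of data quality problems found in one record."""
    problems = []
    # Presence: the field must exist and be non-empty.
    if not record.get("customer_id"):
        problems.append("customer_id missing")
    # Validity: the value must belong to the declared domain.
    if record.get("status") not in VALID_STATUS:
        problems.append("status not in domain")
    return problems

records = [
    {"customer_id": "C001", "status": "A"},
    {"customer_id": "", "status": "X"},
]
report = [check_record(r) for r in records]
```

Accuracy is harder to test this way, because it requires comparing the recorded value with the real-world fact it describes. That is one reason quality improvement depends on the people closest to the business process.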
While data can be tested in a centralized business function, data improvement is really a distributed effort. Data is created and captured in a wide variety of business processes; and while well-designed screens, forms, and reports are helpful, the quality of these processes is not solely an information technology function.
But neither is data governance solely an information technology function. Again, it is business-oriented, and technological nerds may not be the best choices for data governance or data stewardship.
What Data Management is Not
The solution was to copy the granular production data to a separate database (often on a separate hardware platform) where queries could be run without impacting the essential transactions that keep the business going.
These became data warehouses. A variety of ways to organize the data (physically and logically) emerged (such as dimensional databases and star schemas) to enhance the speed and ease of posing ad hoc queries against the data.
Moving the data from its granular production environment to the information environment required methods (and programs and tools at times) of extract, transformation, and load (ETL) to copy the data.
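The three ETL steps can be sketched with Python's standard `sqlite3` module. This is an illustration only: the `orders` and `sales_fact` tables, their columns, and the `run_etl` function are assumptions made for the example, and real ETL tools do far more (scheduling, error handling, incremental loads).

```python
import sqlite3

def run_etl(production, warehouse):
    """Copy granular orders from production into a warehouse table."""
    # Extract: read granular rows from the production database.
    rows = production.execute(
        "SELECT order_date, amount_cents FROM orders"
    ).fetchall()
    # Transform: keep only the year and month, and convert cents to dollars.
    transformed = [(order_date[:7], cents / 100.0) for order_date, cents in rows]
    # Load: write the conditioned rows into the warehouse.
    warehouse.execute(
        "CREATE TABLE IF NOT EXISTS sales_fact (month TEXT, amount REAL)"
    )
    warehouse.executemany("INSERT INTO sales_fact VALUES (?, ?)", transformed)
    warehouse.commit()

# Hypothetical production database with three orders.
production = sqlite3.connect(":memory:")
production.execute("CREATE TABLE orders (order_date TEXT, amount_cents INTEGER)")
production.executemany(
    "INSERT INTO orders VALUES (?, ?)",
    [("2020-01-15", 1250), ("2020-01-20", 750), ("2020-02-01", 500)],
)

# Separate warehouse database, so queries never touch production.
warehouse = sqlite3.connect(":memory:")
run_etl(production, warehouse)
monthly = warehouse.execute(
    "SELECT month, SUM(amount) FROM sales_fact GROUP BY month ORDER BY month"
).fetchall()
```

The ad hoc query at the end runs against the warehouse copy only, which is the point of the separation described above.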
Data warehouses also benefit from integrating into them data from sources outside the enterprise. These data flows require particular care because of frequent differences in logical architecture and perhaps a lack of reliability of the consistency and quality of the input flow.
Data Architecture and Modeling
Figure 8: A very simple data model for a small college.
A logical data model (symbolized by a small ER diagram icon in Figure 10) should be built in the requirements phase, and should inform everyone’s understanding of the business problem. The logical data model may then be used to create a physical database model that is used by the DBA to build a test version of the database underneath the application, and ultimately the production database.
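As a rough sketch of the step from logical to physical model, the following Python builds a test database with the standard `sqlite3` module. The three entities (student, course, enrollment) are assumptions chosen to suit a small college; they are not taken from the figures, and a real physical model would add indexes, storage options, and other DBMS-specific features.

```python
import sqlite3

# Hypothetical physical model derived from a logical model with three
# entities: student, course, and the enrollment that relates them.
DDL = """
CREATE TABLE student (
    student_id INTEGER PRIMARY KEY,
    name       TEXT NOT NULL
);
CREATE TABLE course (
    course_id INTEGER PRIMARY KEY,
    title     TEXT NOT NULL
);
CREATE TABLE enrollment (
    student_id INTEGER NOT NULL REFERENCES student(student_id),
    course_id  INTEGER NOT NULL REFERENCES course(course_id),
    PRIMARY KEY (student_id, course_id)
);
"""

def build_test_database():
    """Create an in-memory test version of the database."""
    conn = sqlite3.connect(":memory:")
    # Ask SQLite to enforce the relationships declared in the model.
    conn.execute("PRAGMA foreign_keys = ON")
    conn.executescript(DDL)
    return conn

conn = build_test_database()
conn.execute("INSERT INTO student VALUES (1, 'Ada')")
conn.execute("INSERT INTO course VALUES (10, 'Databases')")
conn.execute("INSERT INTO enrollment VALUES (1, 10)")
rows = conn.execute(
    "SELECT s.name, c.title FROM enrollment e "
    "JOIN student s ON s.student_id = e.student_id "
    "JOIN course c ON c.course_id = e.course_id"
).fetchall()
```

Each relationship in the logical model becomes a foreign key, and the many-to-many relationship between students and courses becomes its own table.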
Other Related Data Disciplines
Other Important Data Management Functions
In addition, bringing data from diverse sources together in the data warehouse in a way that makes sense requires thoughtful analysis of the meaning, architecture, and behavior of each source, and the semantic data integration (SDI) of those data into the target data warehouse database.
Merely getting the data onto a common physical platform is not enough. It must be semantically integrated in a way that makes business sense. Codes from one source must have the same domain (list of valid values) and meaning as codes from another source. Numeric fields must behave in a similar way, be of the same units of measure, and conform to the same business meaning.
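A minimal sketch of what this means in practice, assuming two hypothetical sources that encode the same attribute with different codes and record weight in different units. The source names, the mapping tables, and the `integrate` function are all invented for illustration.

```python
# Hypothetical code mappings from two sources onto one target domain.
GENDER_MAP = {
    "source_a": {"M": "male", "F": "female"},
    "source_b": {"1": "male", "2": "female"},
}

# Conversion factors onto the target unit of measure (kilograms).
TO_KG = {"kg": 1.0, "lb": 0.45359237}

def integrate(source, record):
    """Translate one source record into the target's codes and units."""
    code_map = GENDER_MAP[source]
    if record["gender"] not in code_map:
        # An unmapped code is a semantic problem, not something to load silently.
        raise ValueError(f"unmapped code {record['gender']!r} from {source}")
    return {
        "gender": code_map[record["gender"]],
        "weight_kg": record["weight"] * TO_KG[record["unit"]],
    }

a = integrate("source_a", {"gender": "F", "weight": 60.0, "unit": "kg"})
b = integrate("source_b", {"gender": "2", "weight": 100.0, "unit": "lb"})
```

After integration, both records use the same domain for the code and the same unit for the numeric field, so they can be compared and aggregated safely.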
This overview of the data management disciplines and areas of interest should help in navigating the large amount of literature in this area.
In the Context of the Full I.T. Component Model
Some businesses try to recruit “developers” who have a full range of skill sets covering all these levels. That is often unrealistic, but it is often done by CIOs who are infrastructure-oriented and don’t understand the “soft” skills of gathering business requirements and meeting the semantic needs of business data.
The taxonomy proposed in this essay does not address the taxonomy of other levels (such as applications, operating systems, and infrastructure). It is at the data level (See Figure 15).
- DBMS = database management system
- When I say “business-oriented,” I am talking about whatever the primary purpose of the enterprise is. If it is healthcare, it is oriented towards the patient, the diagnostic and treatment activities as well as support of generalized hospital processes. If it is education, it is student-oriented and learning-oriented.
- Data architects and database administrators are often grouped in the same organizational unit or cost center. They do need to talk to each other a lot.
- Infrastructure, meaning the platforms, networks, operating systems, and servers underneath the business software.
- DBAs (database administrators) specialize in Oracle, DB2, etc., and worry about table partitioning, table spaces, indexes, and other features to improve speed and performance.
- The users of metadata are often project designers, systems analysts, developers, and programmers who are not familiar with the metadata resources or not skilled in navigating them. If they have experience with legacy applications, they often make unwarranted assumptions about the to-be logical design, biased by their previous experience.
- The Myers-Briggs model of temperament or personality is very useful in understanding why systems programmers may not be the best people to do business analysis. This author is an INTP. I recommend learning this model as a part of your basic tool set in dealing with other business and I.T. personalities.
The Data Management Association: DAMA is a non-profit professional organization with over 30 chapters around the United States and more overseas.
International Association of Information and Data Quality: A spinoff from DAMA in 2004, IAIDQ is the primary non-academic organization for practitioners of data quality. It holds annual conferences around the U.S.
The Data Warehousing Institute (TDWI): TDWI is a for-profit organization putting on conferences and on-site training in the various sub-disciplines of data warehousing. It also has a network of non-profit chapters in major cities.
Data Governance Institute: The primary emerging authority on the data governance discipline.