Data versus Metadata

A common definition of metadata is “data about data,” but it is a fairly useless one because it disregards the complexities of metadata. Others have put forward more profound definitions (e.g., “Data + Metadata = Information”) or all-inclusive definitions, such as “all physical data and knowledge from inside and outside an organization, including information about the physical data, technical and business processes, rules and constraints of the data, and structures of the data used by a corporation” (David Marco). It is probably safe to assume that the reader who is still with me at this point has a good understanding of the significance of metadata. It should be noted that this article is strongly biased toward ER modeling, though one should be able to translate the ideas presented here into other methodologies such as UML or ORM.
Meta is a relative notion, referring to a concept at a higher abstraction level that defines a lower level concept. This also implies that meta is a recursive concept, in that a meta level can be described by even higher meta (meta meta) levels, limited in depth only by practical relevance. As a consequence, one person’s metadata may be another person’s data. For example, in the context of data administration, this so-called “metadata” that describes enterprise data is actually the core-business “data” for data administration, managed in a database, just like enterprise data. This metadata database is generally called a repository (or data dictionary); but from a physical perspective, a repository is just another database.
The data in a repository is like any other business data, in that it should be stored in a reliable database and, in many cases, acts as primary data that controls and/or influences automated processes, such as software development tools and report generators. Metadata is like ordinary data in that it can be modeled in a data model, which is often called a meta model to differentiate the two.
Crossing the Divide

Some data items (attributes) have sharply defined domains (e.g., Gender (male/female) and OrderType (buy/sell) indicators). The actual representation of such attributes can vary, and no one will question the wisdom of documenting the allowed values of these attributes as metadata in a repository (for example, F=female, M=male, B=buy, S=sell). To consider attributes like these as concept types in their own right and to implement them as separate database tables is probably regarded as a silly exercise for data modeling zealots without any practical experience in software development. However, as the volatility or the number of allowed values increases, it becomes more viable to model such attributes as separate entity types, stored in a separate table. When the list of values becomes larger but still remains static (e.g., the 12 months of the year or the 50 states of the USA), the distinction between data (stored in an active database) and metadata (documented in a repository) becomes less clear. As this progression continues (e.g., a list of ISO country codes or area codes), it makes sense to store these codes in an operational database. This gray area between simple enumerated metadata and operational data is often a topic of debate amongst data (base) administrators and software developers when it comes to deciding how to physically implement validity checks: as a lookup table, as metadata that feeds a software generation tool, or hard-coded in software. Furthermore, maintaining multiple copies of this (meta)data in both the repository and the operational database(s) creates a synchronization problem.
Another way to view this problem is to ask whether such attributes are considered to be just Domains or Entities in their own right. The implementation of a logical Entity can be a simple lookup table, but the implementation of a Domain is not so straightforward because of proprietary DBMS mechanisms (e.g., Check Constraints, Triggers and User Defined Types) that reduce portability. It would be desirable to have a uniform way to specify reference data values at the logical level, independent of implementation issues.
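To make the trade-off concrete, here is a minimal sketch in Python with SQLite, assuming the article's OrderType example (B=buy, S=sell); the table and column names are invented for illustration. It contrasts a Domain-style CHECK constraint with an Entity-style lookup table:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces FKs only when asked

# Option 1: Domain-style. The allowed values live in DBMS-specific DDL
# (a CHECK constraint), which hurts portability and requires an ALTER
# to change.
conn.execute("""
    CREATE TABLE trade_checked (
        trade_id   INTEGER PRIMARY KEY,
        order_type TEXT NOT NULL CHECK (order_type IN ('B', 'S'))
    )""")

# Option 2: Entity-style. The allowed values are ordinary rows in a
# lookup table, referenced by a foreign key.
conn.execute("""
    CREATE TABLE order_type (
        code        TEXT PRIMARY KEY,
        description TEXT NOT NULL
    )""")
conn.executemany("INSERT INTO order_type VALUES (?, ?)",
                 [("B", "buy"), ("S", "sell")])
conn.execute("""
    CREATE TABLE trade_fk (
        trade_id   INTEGER PRIMARY KEY,
        order_type TEXT NOT NULL REFERENCES order_type(code)
    )""")

# Both reject an invalid code, but only the lookup table lets the value
# list be maintained as data rather than as schema.
results = {}
for table in ("trade_checked", "trade_fk"):
    try:
        conn.execute(f"INSERT INTO {table} VALUES (1, 'X')")
        results[table] = "accepted"
    except sqlite3.IntegrityError:
        results[table] = "rejected"
print(results)  # → {'trade_checked': 'rejected', 'trade_fk': 'rejected'}
```

Both options enforce the same rule; they differ in where the rule lives and how portably it can be maintained.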
The Data-Metadata Continuum

Traditional repositories maintain only “pure” metadata. These tools do not attempt to bridge the data-metadata gap, but do enable enumeration of a limited number of allowed values for each domain or attribute. At my company, we have been using a custom-built repository and data modeling tool for 20+ years to support data management. Over the years, the tool has evolved to meet changing requirements. To manage the data-metadata continuum, we’ve developed metamodel extensions that cross the data-metadata divide, in order to enable a single point of control of reference data, as illustrated in the following diagram:
The diagram shows a simplified representation of our meta model and the gray area between data and metadata. Some object types may need some clarification:
- UoD (Universe of Discourse) represents an independent environment with its own interpretation of the real world, applied to data modeling.
Examples are organizations, standardization bodies (e.g., ISO), and software packages. When enterprises interact with each other, there are often (semantic) mismatches that must be resolved to allow information exchange. The UoD concept acknowledges this diversity and is the foundation for managing information interchange across heterogeneous environments.
- Domain represents an information item, including its semantics, as defined within a UoD. Only domains within the native UoD (internal company data) are fully modeled in entity-relationship structures. We do not attempt to fully model external UoDs that we cannot influence; we only document the domains that are relevant for establishing interfaces with the outside world.
- Values are the enumeration of actual symbols (codes) used to represent information. An attribute may, depending on its role and context, use only a subset of values of its underlying Domain.
- Data Mapping represents a set of rules that define:
– the association between values and their meaning (the semantics)
– transformation rules (i.e., how data values must be translated in order to transfer information from one system, or UoD, to another)
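A Data Mapping of this kind can be sketched as a simple structure that carries both parts of the definition; this is a hypothetical illustration using the article's gender codes, and the external "SystemB" codes are invented for the example:

```python
# Each entry ties a value to its meaning (the semantics) and to its
# counterpart in another UoD (the transformation rule).
GENDER_MAPPING = {
    # internal code: (meaning, equivalent code in external UoD "SystemB")
    "F": ("female", "2"),
    "M": ("male", "1"),
}

def translate(value, mapping):
    """Apply the transformation rule; fail loudly on unmapped values."""
    try:
        return mapping[value][1]
    except KeyError:
        raise ValueError(f"no mapping defined for value {value!r}")

print(translate("F", GENDER_MAPPING))  # → 2
```

Failing loudly on unmapped values matters in practice: silently passing an untranslated code through an interface is exactly the kind of semantic mismatch between UoDs that the mapping exists to prevent.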
Using the structure shown in the meta model, both small and large enumerations can be managed uniformly in the same repository. Values and Data Mappings can be exported to external data stores, such as flat files or Excel sheets. These can then be loaded into the operational reference databases and further used to drive software development tools to facilitate enterprise information integration. Conversely, reference data can be imported back into the repository along the same path. When a reference data file is too large, importing becomes impractical; in that case, a direct hyperlink to the source data can serve viewing purposes.
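The export/import round trip described above can be sketched with a flat-file (CSV) exchange; the country codes and column names here are illustrative, not taken from any specific repository:

```python
import csv
import io

# Enumerated Values as they might sit in the repository.
values = [("US", "United States"), ("NL", "Netherlands"), ("DE", "Germany")]

# Export to a flat file (an in-memory buffer stands in for the file).
buf = io.StringIO()
writer = csv.writer(buf)
writer.writerow(["code", "description"])  # header row
writer.writerows(values)

# Import back into the repository along the same path.
buf.seek(0)
reader = csv.reader(buf)
header = next(reader)
reimported = [tuple(row) for row in reader]

assert reimported == values  # the round trip loses nothing
```

The point of the round trip is that a single source (the repository) can feed operational reference databases and code generators, while changes made on the operational side can still flow back.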
Ownership

Who will manage this reference (meta)data depends on who is accountable for the actual content.
- Common reference data (used across multiple business lines) is best managed centrally by Data Administration, in the repository, and exported/published accordingly.
- Reference data that is business-line specific is best managed by the business, either in the reference database (and imported into the repository), or (by proxy) in the repository itself and then exported to the reference database.
- Reference data that is owned by an external (standards) organization should not be managed internally; rather, it should be maintained and distributed via a subscription from a data vendor. For convenience, it may be periodically imported into the repository, or viewed via a hyperlink.
What’s Next?

Having a single point of control for the gray area between data and metadata provides a basis for a single point of truth and gives the data administrator (more) control over the actual values of reference data. With this extra control comes extra responsibility of synchronizing with real-world data. For example, removing a value from a list of allowed values may be a simple administrative action, but could have a large impact on the referential integrity of operational databases (i.e., removing a primary key occurrence may be restricted by the DBMS or else may result in orphans in dependent tables). I will elaborate on possible solutions for such issues in a future article.
Conclusion

The mechanism described here provides a means of managing information in a repository that goes beyond traditional metadata and extends seamlessly into the realm of real-world data. Taking control of this gray area between metadata and data simplifies the choice of where and how to store this information regarding data values: in the repository, in the operational database, or, as often happens, both. There are obvious advantages in the area of data quality when the data-metadata continuum can be spanned by a single repository tool that enables single-source maintenance.