Introduction
Subtypes can be found in many data models. They occur when an entity (the supertype) consists of other entities (subtypes) that exist at lower levels of abstraction and have their own particular
attributes. For instance, in a model of a medical database, Person could be a supertype, and Male Person and Female Person could be subtypes, as
shown in Figure 1. One particular attribute in the supertype is used as a category discriminator. The value in this column in the physical database indicates which subtype each record in
the supertype is associated with. A special reference data table houses these values, and contains one record for each value that can be used as a category discriminator. In the physical
implementation of Figure 1, the table Gender will contain two records, one with a Gender Code of “M” (Male Person) and the other with a Gender Code of
“F” (Female Person). Each record in the Person table is associated with a record either in the Male Person table or the Female Person
table depending on the value of Gender Code in that particular Person record.
All this is well supported by data modeling methodologies and tools. However, in my personal experience subtypes are not seen all that often in data models. This is somewhat surprising because many
sets of business rules operate on only a specific subset of records in a table. This can easily be seen in SQL statements that update different columns in a table, but which have identical, and
often rather complex, WHERE clauses. Similarly, sets of reports are often found which deal with only a specific subset of records in a table. The structure and behavior of applications would,
therefore, imply that there should be a lot more subtypes in a database design than are apparent. In reality, the subtypes are there, but are hidden, and because they are hidden, they are not
properly administered. Understanding hidden subtypes is necessary if the enterprise’s information assets are to be properly managed on a number of different levels.
Why Do Hidden Subtypes Exist?
The reason that so few real or “formal” subtypes are found in data models appears to be partly due to the complexity they introduce into programming. For instance, if a new
Person record has to be inserted in the design shown in Figure 1, then either a new Male Person, or a new Female Person record should
simultaneously be inserted in the appropriate subtype table. However, a system operator does not always know the gender of a patient right away, making it impossible to do the two inserts. The
alternative of inserting a record in the supertype table and waiting until the gender is known before inserting a record in the appropriate subtype table is difficult to program. Given these
complexities, programmers greatly prefer having only one table that incorporates the supertype and the subtypes.
A second problem is that one supertype can have different sets of subtypes, which is difficult to produce a data model for, even more difficult to program against, and incredibly difficult to use
for writing reports. For instance, the Person entity shown in Figure 1 may break down into a number of independent sets of subtypes in the context of a medical database as follows:
- Subtypes: Insured Person and Non-Insured Person
- Subtypes: US Citizen and Alien
- Subtypes: Resident of State and Non-Resident of State
- Subtypes that cover the person’s religious affiliation
Each of these subtypes may have its own unique attributes, and may have some business rules that apply uniquely to it. If a data modeler rigorously implements these subtypes, the
Person entity will have over 10 subtype entities. Any programmer presented with such a design would not be very happy, and any user trying to query a database built on such a
design would probably not know where to begin.
A third issue that often emerges is that a subtype can exist with only one category. That is, the supertype exists with only one subtype that has its own unique attributes. Perhaps the
Person entity could have a Juvenile subtype, with some special attributes, but no corresponding Non-Juvenile subtype. In such a case, the category
discriminator can be a simple indicator. There is no need for a separate reference data table to hold what would only be a single record for the category discriminator. Cases of single subtypes are
quite common, and do not fit well with the theory behind formal subtypes.
All of these factors contribute to reducing the inclusion of formal subtypes in data models. Instead, a single entity is implemented that includes the supertype and all subtypes. In reality
subtypes still exist, but they are hidden within entities that have no associated subtype entities.
The Role of Reference Data and Indicators
When hidden subtypes are implemented instead of formal subtypes we typically see a database table surrounded by a cluster of reference data tables. The term reference data is used for data
that represents things that the enterprise does not manage (e.g. Country), or data that is used only to categorize other data found in the database. Reference data tables are sometimes called
“lookup tables” and “domain values”, and usually consist of a single code column and a single description column.
Figure 2 shows how reference data tables are used to identify hidden subtypes in the Person record. This design resembles the “fact table” and “reporting
dimensions” that are commonly found in dimensional data modeling approaches used for building certain kinds of data warehouses and marts. Although the reference data tables in Figure 2 can be
used as reporting dimensions, they represent much more than this. They identify the subsets of records in the Person table that represent hidden subtypes. Specific attributes
belong to each of these hidden subtypes, and unique sets of business rules apply to each hidden subtype.
One big advantage of using reference data tables to identify hidden subtypes is that an additional subtype category can very easily be accommodated in a physically implemented database simply by
adding a new record to a reference data table. This makes it much easier for databases to change with business needs. By contrast, formal subtypes would require introduction of a new table. Of
course completely new hidden subtypes (as opposed to an additional category for an existing hidden subtype) will require a new reference data table.
In addition to reference data, attributes that are indicators can also identify hidden subtypes. In Figure 2, Juvenile Status Indicator serves this purpose. Indicators are attributes that
have two values like “True” and “False”, or “Yes” and “No”, to indicate the presence or absence of something. In terms of hidden subtypes, indicators
can only identify a subtype that contains one category, like Juvenile in Figure 2. As noted above, there is no Non-Juvenile subtype.
One problem with indicators can occur if the hidden subtype needs an additional category added to it. When this happens an indicator no longer works. There is now a need to distinguish between two
different subtypes, not simply a need to identify a single subtype. Unfortunately, indicators tend to be corrupted when this happens. They are usually populated with different codes, but no
corresponding parent reference data table is introduced. This is a problem that exists in many databases. One way to eliminate it is to avoid using indicators to identify hidden subtypes, and to
use reference data tables instead, even if they contain only one record.
It should be noted that both reference data and indicators play roles other than identifying hidden subtypes in data models. Just because they are present does not mean that hidden subtypes are
present.
Loss of Subtype Metadata
Hidden subtypes make database implementation easier in general terms than when formal subtypes are used. However, they also create a lot of problems that are typically not mitigated by data
modelers, DBA’s, or other IT professionals. These problems originate from the fact that a subtype is something that needs to be tracked. If a subtype is not present in the metamodel of a
database design, then no information can be tracked for it. IT professionals then have to rely on personal knowledge, or information dispersed in documentary form.
If a subtype exists, then its existence should be recorded via some basic metadata. Each subtype should have a name, a description, and should be related to other subtypes of the same category.
Each subtype should also be directly related to the record in the reference data table (or indicator) that serves as a category discriminator. Unfortunately, this last requirement crosses the
boundary between metadata and physical data in a database, making it unpalatable for many data professionals.
A major problem with hidden subtypes is that it is impossible to tell from a data model if a particular attribute belongs exclusively to one hidden subtype, or if the attribute applies to the
supertype, and thus to every record in the implemented database. Some data modelers try to use naming conventions, or groupings of attributes, or even color schemes to show which attributes belong
together. However, data modelers do not always understand why they are doing this. The reality is that a hidden subtype exists to which a subset of attributes uniquely applies. Since there is no
formal metadata for hidden subtypes, these groups of attributes cannot be related to the subtypes to which they belong.
This problem also makes a complete nonsense of the NULL / NOT NULL options that exist for physically implemented columns in databases. Everyone involved in database design eventually comes across a
database column that in some circumstances must contain a value, but in other circumstances must never contain a value. The attribute Person Responsible for Juvenile in the
Person entity in Figure 2 is like this. If a person is a juvenile then a person who is responsible for them must be identified, and Person Responsible for Juvenile must be
NOT NULL. However, if a person is not a juvenile, they are responsible for themselves and Person Responsible for Juvenile must be NULL. Forcing a data modeler to choose between NULL and
NOT NULL for an attribute misses this point entirely.
Poorly Identified Hidden Subtypes
The situation in Figure 2 is actually somewhat ideal. Sometimes, hidden subtypes can be so well hidden that it is nearly impossible to detect them. This happens when neither a single record in a
reference data table, nor a single indicator column, identifies a hidden subtype.
An example might be found in a customer database where customers who have not placed an order in the past 90 days and still owe an outstanding balance are subject to special business rules and
special reporting. Such customers could belong to the hidden subtype Customers At Risk of Default. The Customer entity could have one attribute Days Since Last
Order Placed, and another attribute Balance Outstanding that identify this hidden subtype. Each time this hidden subtype needs to be processed the value in Days Since Last Order
Placed has to be compared to 90, and the value in Balance Outstanding has to be compared to 0. This may have to occur in many business rules, reports, and queries. The risk is that
programmers, and perhaps users will make mistakes when repeatedly implementing the logic to identify the members of the hidden subtype Customers At Risk of Default. Furthermore,
what happens if the definition of the hidden subtype changes to customers who have not placed an order in the past 60 days and still owe an outstanding balance?
Hidden subtypes are poorly identified if complex logic alone is used to identify them. It is quite difficult to be aware of their existence. In these circumstances an indicator column should be
added to identify the hidden subtype. Thus, an attribute Customer At Risk of Default Indicator could be added to the Customer table. It should be set once, by one business
rule, and then used to identify members of the hidden subtype.
Use of indicators in this way not only makes it easier to identify hidden subtypes, but it reduces the risk that comes from repeatedly implementing complex logic to identify hidden subtypes. In
particular, it is very useful for keeping business rules atomic. A business rule that has to contain complex logic to identify a member of a hidden subtype, and which also contains a calculation
for update purposes is really two business rules and is not atomic. Explicit identification of hidden subtypes through indicators or reference data helps to keep business rules atomic.
Conclusion
Hidden subtype are frequently found in database designs, but are not supported by data modeling tools and techniques. Furthermore, few data professionals are aware of them or the implications that
hidden subtypes have. An increased awareness of the existence of hidden subtypes could have a very positive impact on the quality of both the designs of databases and the applications built on
these databases.