Over the last few years I’ve been more and more involved in the production of Canonical Data Models as a key artefact for stitching together enterprise data processing environments.
In reality, like most corporate data modellers, I’ve been doing these sorts of models since the year dot, mostly under the guise of a Business Information Model, Logical Data Model, or some such name.
However, one principle that seems to cause an awful lot of disagreement when discussing a Canonical Data Model with analysts and developers involved with its production is:

“Nothing is optional.”

Or, to elaborate: every data-item in a data model will eventually be mandatory at some point during its lifecycle.
Normally I’d regard this as a Business Analysis principle in that it’s one of the key criteria for checking that the analysis fully covers the domain being analysed and the system specifications contain a full set of required functions.
However, in the highly distributed data environments that exist in many organisations nowadays, it’s rare that all of the business analysis is completed as a single coherent piece of work or that one group is necessarily aware of the detailed analysis that another group has performed in the past.
Consequently, this principle gets forgotten or simply ignored unless it is picked up as part of the Enterprise Data Architecture.
Like all the best principles the principle of optionality is really a summation of a lot of secondary conditions and constraints that must be true in order to fully meet the business requirements.
So, this time around I thought I’d explain the rationale for this principle in more detail.
What Does “Optional” Really Mean?

So what does it mean when we say there is no such thing as optional in a data model?
I think this assertion is best explained by way of an example, so let’s consider the following fragment of a data model describing a Customer Account:
The first point to note is that a data model usually describes a set of Entities, Attributes and Relationships that the business is interested in, along with a set of constraints that define or restrict the values that can be assigned to the various things of interest.
One of those constraints is “minimum cardinality” (i.e., whether the characteristic is optional [minimum cardinality = zero] or mandatory [minimum cardinality > 0]).
However, what is regularly overlooked is that a data model nearly always describes minimum cardinality in terms of what must be defined when an instance of that Entity is created and what must continue to be true throughout the life of the Entity.
In the Customer Account model fragment, when a new Customer Account is created (such as a new prospect registered by the Sales Department), then the model states that we must know the Customer’s Name and Registered Address but do not need to know the Billing Address or the Credit Limit of the Customer because they are specified as optional (i.e., potentially unknown at the point that the Sales Department registers a new Customer Account).
But just because a Billing Address isn’t required to create a Customer does not mean that it isn’t mandatory later on!
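As a purely illustrative sketch (the class and attribute names are my own, not part of any published model), the creation-time view of this fragment could be expressed as a Python dataclass, where the two “optional” attributes simply have no value yet:

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical sketch of the Customer Account fragment described above:
# Customer Name and Registered Address are mandatory at creation
# (minimum cardinality 1), while Billing Address and Credit Limit have
# minimum cardinality 0 at that point, i.e. potentially unknown.
@dataclass
class CustomerAccount:
    customer_name: str                     # mandatory at creation
    registered_address: str                # mandatory at creation
    billing_address: Optional[str] = None  # may be unknown at creation
    credit_limit: Optional[float] = None   # set later (e.g. by Finance)
```

A Sales user can register `CustomerAccount("Acme Ltd", "1 High Street")` without supplying the last two attributes, which is exactly what the model fragment permits at creation time.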
Following on from initially creating a new Customer Account, sometime later the customer might decide to place an order with the company. At this point, we need to know where the goods need to be sent, where the invoice needs to be sent and, if we were a particularly prudent company, whether the business will even accept the order because the Customer Order will exceed the Credit Limit we’ve set for them.
That is, the Billing Address and Credit Limit attributes have become mandatory in order to carry out the ordering process.

So the question is: where does this information come from?
Billing Address could be entered as part of the Order (though the customer might not like doing that every time they order something) or there could be a rule stipulating that the Registered Address is used if the Billing Address is undefined.
But, irrespective of how it’s done, knowing that we must have this information indicates that there must be some process, separate to the Create Customer activity, that allows this information to be defined or derived and recorded against the Customer Account.
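A minimal sketch of such a derivation rule, assuming the fall-back-to-Registered-Address option mentioned above (the function name is hypothetical):

```python
from typing import Optional

def effective_billing_address(billing_address: Optional[str],
                              registered_address: str) -> str:
    """Hypothetical derivation rule: if no Billing Address has been
    recorded against the Customer Account, fall back to the mandatory
    Registered Address rather than failing the ordering process."""
    return billing_address if billing_address is not None else registered_address
```

Whether this lives in the ordering application, a data-maintenance function, or a database trigger is precisely the implementation question discussed later in this article.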
In addition, we have discovered a business rule, which also needs to be recorded somewhere, that states something like: the Billing Address of a Customer Account is assumed to be the Registered Address unless a separate Billing Address has been defined.

Or, in more structured terms: IF Billing Address IS UNDEFINED THEN USE Registered Address.
Then we need to consider the value of Credit Limit.
This would certainly need to be set to some value in order to decide whether we accept the new order from the customer, but the customer would not be providing this information, and nor would the sales team. Instead, it would probably be defined by the Finance department based on some credit risk assessment, and until then the assumed Credit Limit would probably be zero. So, in the Credit Limit case, the “unknown” data-item actually defaults to a fixed value and is not really unknown.
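The distinction matters in code: under the assumption just described, an unset Credit Limit does not mean “anything goes”, it means “no credit”. A hedged sketch (hypothetical function, not from any real system):

```python
from typing import Optional

def can_accept_order(order_total: float,
                     credit_limit: Optional[float]) -> bool:
    """Hypothetical credit check. An unset Credit Limit is not truly
    unknown: it defaults to zero, so no credit is extended until
    Finance has performed its credit risk assessment."""
    effective_limit = credit_limit if credit_limit is not None else 0.0
    return order_total <= effective_limit
```

Note how the defaulting rule (“unknown means zero”) is a business decision that has to be written down somewhere; if each consuming system guesses differently, order acceptance becomes inconsistent.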
Note: There are many other things we might also consider here such as whether the Credit Limit would reflect the available credit for a Customer or whether this would be calculated separately from the Account Balance. The analysis activity could easily be recursive!
In fact, in pretty much every case where a data-item is defined as initially optional, we will identify some subsequent business activity where the data-item is required, a business function that allows someone to set the required values and probably a set of rules for working out what the default value should be if it isn’t set.
In other words, every data-item in a data model will eventually be mandatory at some point during its lifecycle.
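One illustrative way to make this lifecycle view explicit (the activity names and attribute sets here are assumptions based on the example above, not a prescribed technique) is to record, per business activity, which attributes are mandatory at that point:

```python
# Illustrative only: mandatory attributes per business activity, making
# explicit that attributes which are "optional" at creation become
# mandatory later in the entity's lifecycle.
MANDATORY_BY_ACTIVITY = {
    "create_customer": {"customer_name", "registered_address"},
    "place_order": {"customer_name", "registered_address",
                    "billing_address", "credit_limit"},
}

def missing_for_activity(activity: str, account: dict) -> set:
    """Return the attributes still undefined for the given activity."""
    required = MANDATORY_BY_ACTIVITY[activity]
    return {attr for attr in required if account.get(attr) is None}
```

A newly registered prospect passes the `create_customer` check but fails `place_order` until the Billing Address and Credit Limit have been set, which is exactly the behaviour the prose above describes.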
Why Is This Important?

In the good old days of centralised processing and single-purpose applications, knowing exactly what was optional, and the circumstances in which it became mandatory, was considered nice to know but not essential (probably why this depth of analysis is no longer done). If anything didn’t work properly, then we knew the problem was localised to that application and it was a relatively easy task to identify the problem and fix it.
However, as mentioned in the preamble, in highly distributed data environments, it’s rare that all of the business analysis is completed as a single coherent piece of work or that one group is necessarily aware of the detailed analysis that another group has performed in similar areas in the past.
In addition, businesses that partition their activities in some way, such as having a separate operating business unit for defined geographic areas or target demographic markets, might also have multiple business applications that duplicate sets of business activities (see previous article on “Domain Decomposition” for a discussion of this).
Also, in these sorts of complex environments, a lot of data flows between systems, with data from multiple business sub-systems flowing into a single system, or a single business system feeding multiple other systems. Often these “data feeds” are snapshots of the data held in the source system, and it is left to each receiver of the data to figure out what is or is not valid about the data they receive.
Finally, there is also the widespread growth of central data warehouses for managing master data and numerous data marts designed for business intelligence and operational monitoring which consume data, transform it, analyse it and produce new data from the results.
Essentially a data model, or some parts of it, might be reused and implemented in many different areas by many different development teams.
As well as now having multiple points of implementation, we also have multiple methods of implementation, where these rules may or may not be enforced.
In many cases, the optionality rules will be embedded inside the consuming application code where the decisions of what to do when unknown data is encountered are made when the data is used. In other cases, they will be implemented in the functions that maintain the data; and in other cases, they might even be implemented as triggers against a database table.
In other words, they could be just about anywhere or nowhere with an associated risk of inconsistent process behaviour and poor decision making.
To summarise, the “Nothing is optional” principle touches on some key business principles which are:
- A data model normally only describes the initial state of a business entity and that state may change over time as more information is gathered.
- There must be a reason for having the data (i.e., some external person or activity must have a need for it that cannot be met unless the data is defined).
- The data must come from somewhere (i.e., some external person or process must be responsible for providing the information so that it can be used later).
- There are conditional rules, not captured in the normal “boxes & lines” model, which define constraints that need to be enforced depending on the use of the data.
- “Unknown” can have different meanings and can imply an actual value which should be used if an actual value isn’t supplied.
If we accept the importance of discovering these rules, then we need to decide where to document them so that they are available to all interested parties.
In the last few years, Canonical Data Models (a.k.a. Business Information Models, as discussed in a previous article) have become more and more common, acting as a Platform Independent Model (PIM) from which other Platform Specific Models, such as database schemas, software class models and (most commonly) Service message interface specifications in XML Schema, are generated.
By capturing the rules for optionality in this Canonical Data Model, we have a central point of documentation where we can explain what the rules are and how to apply them. Then, if using Model Driven Generation techniques to produce the derived Platform Specific Models, we can also ensure that the rule is applied consistently in all the different areas where the relevant data-items are used.
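As a simplified sketch of that generation idea (the model structure and generator below are my own assumptions, not a real MDA toolchain), a single canonical definition of the attributes and their minimum cardinalities can drive the XML Schema fragment, so the optionality decisions are made once and applied wherever the schema is consumed:

```python
# Hypothetical canonical model: (attribute, XSD type, minimum cardinality).
# minOccurs=1 means mandatory at this interface; minOccurs=0 means optional.
CANONICAL_MODEL = {
    "CustomerAccount": [
        ("CustomerName", "xs:string", 1),
        ("RegisteredAddress", "xs:string", 1),
        ("BillingAddress", "xs:string", 0),
        ("CreditLimit", "xs:decimal", 0),
    ],
}

def generate_xsd(entity: str) -> str:
    """Generate a trivial XML Schema complexType from the canonical model,
    carrying the minimum-cardinality rules through to the interface."""
    lines = [f'<xs:complexType name="{entity}"><xs:sequence>']
    for name, xsd_type, min_occurs in CANONICAL_MODEL[entity]:
        lines.append(f'  <xs:element name="{name}" type="{xsd_type}" '
                     f'minOccurs="{min_occurs}"/>')
    lines.append('</xs:sequence></xs:complexType>')
    return "\n".join(lines)
```

Because every generated artefact reads the same cardinality values, a change of business rule (say, making Billing Address mandatory at a given interface) is made in one place in the canonical model rather than rediscovered separately by each development team.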
We can also ensure, as the Canonical Data Model is part of the Data Architecture, that we have full coverage of all the functional business requirements.