At the end of my column on domain decomposition, I briefly mentioned the concept of the “coherence boundary” defined as:
In linguistics “coherence” is the establishment of a consistent and shared understanding within a subject area (the universe of discourse or domain) so that all parties communicating within that domain have a single view that is both exhaustive and consistent.
Establishing coherence requires the following activities:
- Establishing a language grammar that defines:
  - A set of canonical concepts forming a dictionary of words or phrases (the tokens), where the meaning of each token has been defined within the context in which it is used and each canonical concept is represented by a single token.
  - The rules for constructing sentences, paragraphs and documents (the statements) from the underlying tokens and, just as importantly, deconstructing a statement back into its constituent tokens.
  - Where the language grammar has localised variations, there will also be a set of rules (such as synonyms) that can be used to convert tokens or statements from one local dialect[1] of the language grammar into another. For example, the English language has regional variations such as Australian English, American English and British English, and within British English there is Thames Estuary English, Cockney English, Yorkshire English and so on.
- Establishing the canonical set of actual statements of fact[2] (or facts) that have already been made and communicated between any two parties. These provide context for any subsequent facts and must themselves conform to the established grammar.
Following on from this, each point where coherence needs to be applied is a coherence boundary, defined as:
Any group of two or more people that have any sort of conversation with each other always implicitly establish a coherence boundary prior to any conversation taking place, usually by selecting a common language or dialect. In more formal environments, such as a court of law, this selection might go as far as selecting which specific words will be used and which sentence structures are acceptable. Often this involves the unconscious translation of a statement from one local dialect to another local dialect.
The inverse of coherency is incoherency: the situation where some statements that are made conflict with other statements and it is not possible to decide which are valid and which are invalid just by examining them.
Also, the importance of establishing a coherence boundary increases as the number of parties to the conversation increases because of the likelihood that there will be some sort of misunderstanding over the meaning of a statement. Any misunderstanding might cause subsequent statements to be inaccurate or incorrect which eventually leads to an incoherent state where all statements are untrustworthy and everything is doubtful.
This process of establishing coherency in communication has direct applicability within enterprise data architecture where many people, systems or organisations share data and information, manipulate or store it and potentially pass it on to other people, systems and organisations.
It also touches on many significant threads of data management, distribution and processing, including data normalisation, master data management, service-orientation and data quality, all of which significantly impact the ability of the enterprise to carry out its activities.
The boundaries where information is exchanged are where most of the risk of process failure occurs. Leaving them ungoverned results in incoherence, which in turn leads to poor data quality and data errors with a cascading impact on downstream business activities and decision making.
Consequently, it is important in an enterprise data architecture to address data coherency and adopt a set of principles to minimise this risk. So this time around, I’m going to discuss the importance of data coherence in a distributed data environment and the architectural principles that need to be considered in order to minimise the impact of potential data incoherency.
Identifying Coherence Boundaries
With the “Outside World”
The ultimate coherence boundary is the one between the enterprise and “the outside world” (i.e., the suppliers, customers, regulatory bodies, financial services and everyone else that the business needs to deal with during the course of its business activities).
All organisations, no matter how big or small, have this boundary where information passes from the controlled environment inside the organisation into an uncontrolled “outside world” where the organisation has little control over how the information is used. It’s rarely stated as a business requirement as “it goes without saying”[3] that all the information distributed outside of the organisation must be consistent with all of the information held within the organisation.
For example, consider the following data-flow diagram showing how sales data containing details of products purchased might flow around a particular organisation and be passed to a customer:
The customer receives the sales invoice details from two different sources: the sales invoice itself and a periodic sales account statement that summarises the outstanding invoices that need to be paid. The sales account manager[4] also receives a sales report analysing the sales activity, which they might use as part of the conversation they have with the customer when they visit them (this still does happen with some organisations).
And this is just the tip of the iceberg: there might also be purchase orders received from the customer, possibly a product catalogue published by the marketing team that the customer may refer to, delivery notes detailing what they have received and so on.
However, no matter how many different points of interaction there might be with the customer, it would not be unreasonable for the customer to expect the information they receive regarding their purchases to be consistent. That is, they should expect products to be consistently described, a consistent product ID to be used for each product, and the invoice details on the statement of account to be the same as the details on the invoice itself.
Between “Functional Areas”
In a single-site, centralised business, achieving the necessary coherency with the outside world is relatively easy, especially where all the business systems are supported by a single bought-in application such as SAP or Oracle E-Business, because these integrated solutions are designed to present a single consistent view of the business information to the outside world.
However, what if the enterprise divides its business operations into business units supporting regional operations, or vertically partitions business activities, with each business unit having autonomy over some aspects of its activities? Each region may then have its own bespoke business applications to meet its specific needs (e.g., separate sales [CRM] and finance applications). In addition, to meet management reporting requirements there might also be a group reporting function responsible for providing consolidated information for the entire group (e.g., business intelligence and corporate accounts).
The overall business activities and external data flows don’t change, but internally we might have something like this:
Unfortunately, although this highly distributed organisation might make a great deal of business sense, it means that we have the classic data denormalisation problem: some facts (such as the customer details, invoice values, products sold, etc.) may now exist in more than one place (i.e., in both regional operations and group reporting) but, due to internal constraints such as frequency of update, the two repositories may not be consistent with each other.
As a consequence, for example, the customer might now receive information from the sales account manager that conflicts with the information they were given by the finance operation via the sales account statement.
In addition to the external interfaces, we also now have internal interfaces (e.g., the data flow from invoicing to sales ledger or the data flow between regional operation and group reporting), where the potential for data inconsistency exists because the two parties may misunderstand what is being exchanged or make erroneous statements.
In the worst case, such as a service-oriented or event-driven environment, every single interaction between any two functions might be considered a potential coherence boundary that, if not properly controlled, might eventually lead to data inconsistency being propagated across the organisation.
As well as the information that flows across boundaries between different applications or the outside world, there is also the internal consistency of facts within a collection of facts, such as where one fact could be derived from one or more other facts.
For example, consider the following simple model of a customer, sales invoices and some of their basic details:
Although not explicitly defined, the following derivation rules also apply:
- Sales-Invoice-Item::Net-Value = Price * Quantity
- Sales-Invoice-Item::Tax-Value = Net-Value * Tax-Rate
- Sales-Invoice-Item::Gross-Value = Net-Value + Tax-Value
- Sales-Invoice::Net-Total = Sum ( Sales-Invoice-Item::Net-Value )
- Sales-Invoice::Tax-Total = Sum ( Sales-Invoice-Item::Tax-Value )
- Sales-Invoice::Gross-Total = Sum ( Sales-Invoice-Item::Gross-Value )
- Sales-Invoice::Gross-Total = Sales-Invoice::Tax-Total + Sales-Invoice::Net-Total
- …and so on
For the Sales-Invoice::Gross-Total value, there are two distinct ways in which it could be derived, so both derivations must hold for the fact to be internally coherent with the other facts.
Hence, to achieve internal coherency all of these “derivation rules” need to be captured in our grammar and applied to all facts wherever they may be recorded.
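The derivation rules listed above can be sketched as an executable check. This is only an illustration: the dictionary layout and the function name invoice_violations are assumptions made for the example, and it assumes values are recorded exactly (no rounding of tax amounts), which a real invoicing system would need to define as part of the grammar.

```python
from decimal import Decimal as D


def invoice_violations(invoice):
    """Return the derivation rules that the recorded values break (empty list = coherent)."""
    broken = []
    items = invoice["items"]
    for n, item in enumerate(items, start=1):
        if item["net_value"] != item["price"] * item["quantity"]:
            broken.append(f"item {n}: Net-Value")
        if item["tax_value"] != item["net_value"] * item["tax_rate"]:
            broken.append(f"item {n}: Tax-Value")
        if item["gross_value"] != item["net_value"] + item["tax_value"]:
            broken.append(f"item {n}: Gross-Value")

    def total(field):
        # Sum one recorded field across all invoice items.
        return sum((item[field] for item in items), D("0"))

    if invoice["net_total"] != total("net_value"):
        broken.append("Net-Total")
    if invoice["tax_total"] != total("tax_value"):
        broken.append("Tax-Total")
    # Gross-Total has two derivations and both must hold.
    if invoice["gross_total"] != total("gross_value"):
        broken.append("Gross-Total (sum of items)")
    if invoice["gross_total"] != invoice["net_total"] + invoice["tax_total"]:
        broken.append("Gross-Total (net + tax)")
    return broken


# Hypothetical sample data, invented for this sketch.
invoice = {
    "items": [
        {"price": D("10.00"), "quantity": 3, "tax_rate": D("0.20"),
         "net_value": D("30.00"), "tax_value": D("6.00"), "gross_value": D("36.00")},
        {"price": D("5.50"), "quantity": 2, "tax_rate": D("0.20"),
         "net_value": D("11.00"), "tax_value": D("2.20"), "gross_value": D("13.20")},
    ],
    "net_total": D("41.00"),
    "tax_total": D("8.20"),
    "gross_total": D("49.20"),
}
```

Calling invoice_violations(invoice) on the sample returns an empty list. Changing only the recorded gross total breaks both Gross-Total rules at once, which is the point of checking every derivation rather than just one.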
Coherence Over Time
In the above model fragment, there is also a question about whether the Sales-Invoice::Billing-Address needs to be consistent with the Sales-Account::Invoice-Address. This should be consistent at the time that the Sales-Invoice was created but might not necessarily be the same as the current Sales-Account::Invoice-Address (i.e., Sales-Account::Invoice-Address may have changed since the Sales-Invoice was raised).
If the sales invoice were recreated after the change occurred, and always reflected the information held against the Sales-Account at the time of recreation, then we have incoherence because the information now given to the customer no longer agrees with the information originally given to them. The customer might well challenge the accuracy of the information.
This is an example of “coherence over time”: because the Sales-Invoice::Billing-Address and Sales-Account::Invoice-Address should initially be the same but may differ later on, we need to ensure that the original facts are captured and recorded accurately and correctly reported when subsequently examined.
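One common way to preserve the original fact is to copy the address onto the invoice when it is raised rather than looking it up from the account each time. The following is a minimal sketch under that assumption; the class names, the raise_invoice helper and the sample values are hypothetical.

```python
from dataclasses import dataclass
from datetime import date


@dataclass
class SalesAccount:
    invoice_address: str  # current address, may change over time


@dataclass(frozen=True)
class SalesInvoice:
    number: str
    raised_on: date
    billing_address: str  # snapshot taken when the invoice was raised


def raise_invoice(account, number, raised_on):
    # Copy the address value; do not keep a reference back to the account.
    return SalesInvoice(number, raised_on, account.invoice_address)


account = SalesAccount("1 Old Street")
invoice = raise_invoice(account, "INV-001", date(2010, 1, 15))
account.invoice_address = "2 New Road"  # the customer moves after invoicing
```

After the account is updated, the invoice still reports the address that was valid when it was raised, so a reprint agrees with what the customer originally received.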
Translating between Local Dialects
Finally, we also have to consider incoherence arising from translations between local dialects, which in a distributed data processing environment would be the various platform-specific languages that may be in use across the enterprise. Nowadays there are frequently at least three separate data definition languages involved in even the simplest data processing activity, which are:
- The language used to define the storage structures in a database, usually an SQL variant. For example, for the declaration of an INVOICE_STATUS column we might have:
  INVOICE_STATUS VARCHAR2(10 BYTE) CHECK
  ( INVOICE_STATUS IN ('Open', 'Cancelled', 'Paid', 'Queried') )
- The programming language (e.g., Java or C#) used to define in-memory working storage where data is temporarily held whilst it is being processed by a specific application, system or component. For example (in C#):
  enum InvoiceStatus
  { Open = 0
  , Paid = 1
  , Cancelled = 2
  , Queried = 3
  }
- The data exchange language, such as XML Schema, used to define the structure of the messages or files passed between applications, systems and components. For example:
  <xs:simpleType name="InvoiceStatus">
    <xs:restriction base="xs:string">
      <xs:enumeration value="Open"/>
      <xs:enumeration value="Cancelled"/>
      <xs:enumeration value="Paid"/>
      <xs:enumeration value="Queried"/>
    </xs:restriction>
  </xs:simpleType>
Although not the case in the above examples, these purpose-specific grammars may superficially look the same while having their own local definitions of what a legal value might be (e.g., both C# and XML have something that roughly equates to a “string”, but the allowed characters might not be the same).
“Special characters” and “reserved words” are a particular consideration when performing translations because most languages have them, but there isn’t a universally recognised set that is consistently translated by all languages. For example, I’ve seen more than one case of “&amp;” being printed as part of a company name because the text was translated into XML format and then not correctly translated back into the original text (e.g., “Miley Watts &amp; Associates Ltd.” instead of “Miley Watts & Associates Ltd.”).
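The ampersand case can be reproduced with the Python standard library. This is a small sketch, not a description of how any particular system in the article was built; the variable names are illustrative.

```python
from xml.sax.saxutils import escape, unescape

company = "Miley Watts & Associates Ltd."

# Translating into XML character data replaces the reserved "&" character.
as_xml_text = escape(company)

# Translating back restores the original fact. Skipping this step is what
# leaves the escaped form visible on printed output.
round_tripped = unescape(as_xml_text)
```

Here as_xml_text holds the escaped form and round_tripped equals the original company name; printing as_xml_text directly is the failure described above.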
Consequently, when translating between these platform-specific languages, there is a possibility that the resulting fact (the data value) may not be the same as the original fact when translated back into the original language.
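A round-trip check is one way to detect this kind of loss before it reaches production. The sketch below mirrors the invoice status example: numeric codes as in the C# enumeration, and text values as in the SQL and XML definitions. The mapping table and function names are assumptions made for illustration.

```python
from enum import Enum


class InvoiceStatus(Enum):
    # Numeric codes follow the in-memory (C#) definition.
    OPEN = 0
    PAID = 1
    CANCELLED = 2
    QUERIED = 3


# Text values follow the database and XML Schema definitions.
STATUS_TEXT = {
    InvoiceStatus.OPEN: "Open",
    InvoiceStatus.PAID: "Paid",
    InvoiceStatus.CANCELLED: "Cancelled",
    InvoiceStatus.QUERIED: "Queried",
}
TEXT_STATUS = {text: status for status, text in STATUS_TEXT.items()}


def to_text(status):
    return STATUS_TEXT[status]


def from_text(text):
    # Raises KeyError for any value outside the shared grammar (e.g. "paid").
    return TEXT_STATUS[text]


def round_trips_cleanly():
    """True if every value survives translation in both directions."""
    there_and_back = all(from_text(to_text(s)) is s for s in InvoiceStatus)
    back_and_there = all(to_text(from_text(t)) == t for t in TEXT_STATUS)
    fits_column = all(len(t.encode("utf-8")) <= 10 for t in TEXT_STATUS)  # VARCHAR2(10 BYTE)
    return there_and_back and back_and_there and fits_column
```

The lookup is deliberately strict: a value such as "paid" is rejected rather than silently coerced, because a lenient translation is exactly where an altered fact can slip through.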
The Principles of Data Coherence
As mentioned in the introduction, addressing data coherency touches on many of the current threads in data architecture and data management. In all of the scenarios described, the underlying causes of incoherency are:
- Not specifying a complete set of grammar rules, which subsequently allows statements to be misinterpreted or invalid statements to be made.
- Transforming data from one language grammar to another language grammar.
- Recording the data in multiple locations.
As a consequence, it is important that we adopt a set of architectural principles that either remove the possibility of incoherency or, if that isn’t feasible due to other business requirements, minimise the number of points where incoherency can occur to a manageable level.
Surprisingly, although there are many, many ways that data coherency could be implemented, the architectural principles boil down to the following handful of core statements.
Establish a Canonical Business Information Model
Natural language may contain zillions of words to identify all possible concepts with multiple words to separate finely nuanced variations of each concept. As a result, there are an almost infinite number of distinct statements (certainly a number beyond counting) that may be constructed from the available words and construction rules.
Fortunately, it is a much easier proposition to establish a language grammar when dealing purely with data because there are generally only a few thousand concepts that need to be described and a few hundred ways in which they can be associated with each other.
The common language grammar is the Business Information Model (discussed in May-2009) which identifies all the data classes, attributes and associations (the tokens) that the enterprise is interested in. It provides a common agreed set of definitions for each of those tokens and defines the allowed ways that the tokens can be combined together to form valid statements of fact.
As well as the structural rules of how composite statements are constructed from discrete values, the Business Information Model would also define all validation rules that need to be applied in order to be internally coherent.
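To make the idea of tokens and valid statements concrete, here is a deliberately tiny sketch of a model held as data, with a check that a recorded fact only uses tokens the model defines. The MODEL structure and the unknown_tokens function are hypothetical; a real Business Information Model would also carry definitions, associations and validation rules.

```python
# Data classes and their attributes, taken from the sales invoice example.
MODEL = {
    "Sales-Invoice": {"Billing-Address", "Net-Total", "Tax-Total", "Gross-Total"},
    "Sales-Invoice-Item": {"Price", "Quantity", "Tax-Rate",
                           "Net-Value", "Tax-Value", "Gross-Value"},
    "Sales-Account": {"Invoice-Address"},
}


def unknown_tokens(data_class, fact):
    """Return the tokens in a fact that the model does not define, sorted."""
    if data_class not in MODEL:
        return [data_class]
    return sorted(set(fact) - MODEL[data_class])
```

For example, unknown_tokens("Sales-Invoice", {"Total": 1}) reports "Total" as undefined, whereas a fact using only Net-Total and Gross-Total passes with an empty list.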
Public Interfaces in Canonical Form
The problem immediately following on from establishing a canonical Business Information Model would be ensuring that all facts (the actual recorded data) conform to that grammar.
For new data processing applications (i.e., those that have not already been deployed) the simplest way to meet this requirement is to ensure that all derived artefacts (i.e., interface specifications, database schemas and so on) are created by direct generation from the Business Information Model.
This approach is the thrust of Model Driven Generation[5] in its original form (as implemented by many data modelling tools for the last 20+ years) where we apply specific transformation patterns to algorithmically generate specifications that are reproducible and traceable back to the canonical source.
For “legacy” applications (i.e., those that are already deployed and operational), we will, of course, have a potential gap between the grammar rules as described in the Business Information Model and the rules described by the platform-specific data model for the legacy application. Effectively we have a local dialect that needs to be manually translated (mapped) to the public grammar and any mismatches resolved.
In both cases, we can then create a tightly bound contract for what the data looks like but not how it can be used. Essentially we arrive at a coherent set of data models that interact like this:
By enforcing direct derivation we ensure that any facts that pass a public boundary conform to the common grammar and will be understandable to all users of the data.
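Direct derivation can be illustrated by generating two platform-specific definitions from one canonical list of values. This is a simplified sketch rather than how any modelling tool actually works: the function names are hypothetical, and it assumes the values contain no quotes or markup characters that would need escaping.

```python
# Single canonical definition of the allowed invoice statuses.
CANONICAL_STATUSES = ["Open", "Cancelled", "Paid", "Queried"]


def sql_check(column, width, values):
    """Generate an Oracle-style column definition with a CHECK constraint."""
    quoted = ", ".join(f"'{v}'" for v in values)
    return f"{column} VARCHAR2({width} BYTE) CHECK ({column} IN ({quoted}))"


def xsd_enumeration(type_name, values):
    """Generate an XML Schema simple type restricted to the same values."""
    lines = [f'<xs:simpleType name="{type_name}">',
             '  <xs:restriction base="xs:string">']
    lines += [f'    <xs:enumeration value="{v}"/>' for v in values]
    lines += ['  </xs:restriction>', '</xs:simpleType>']
    return "\n".join(lines)


ddl = sql_check("INVOICE_STATUS", 10, CANONICAL_STATUSES)
xsd = xsd_enumeration("InvoiceStatus", CANONICAL_STATUSES)
```

Because both artefacts are produced from the same list, adding or renaming a status in one place changes the database constraint and the message schema together.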
Single Point of Update
The “single point of update” is a central principle of data management: there should only be a single point at which a particular change needs to be applied, and all other copies (if they exist) should be transparently synchronised from the point of update.
This is closely related to the practice of data normalisation, which is a basic concept for anyone that has worked extensively with data and so shouldn’t need much explanation. The main difference is that the data normalisation principle is extended to cover the case where non-overlapping subsets of the data are stored in horizontally partitioned repositories (e.g., different repositories for different geographical locations).
Where data is horizontally partitioned into discrete and exhaustive sub-sets, the rules for partitioning must be explicitly defined and the data processing environment must be aware of which location is the single place where a particular update must be recorded.
For example, if the sales orders for different sales regions are managed by separate sales order processing applications and stored in separate data repositories, then the rules for deciding which region a sales order belongs to need to be established, and for any given sales order there must be one and only one repository that the sales order can be stored in.
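The “one and only one repository” rule can be expressed as a routing function that refuses to guess. In this sketch the repository names, the use of country codes as the partitioning key and the REGION_RULES table are all invented for illustration.

```python
# Hypothetical partitioning rules: repository name -> country codes it owns.
REGION_RULES = {
    "emea-orders": {"GB", "FR", "DE"},
    "americas-orders": {"US", "CA", "BR"},
    "apac-orders": {"AU", "JP"},
}


def repository_for(country_code):
    """Return the single repository that owns orders for this country."""
    matches = [repo for repo, countries in REGION_RULES.items()
               if country_code in countries]
    if len(matches) != 1:
        # Zero matches: the rules are not exhaustive.
        # Several matches: the partitions overlap.
        raise ValueError(f"{country_code!r} maps to {len(matches)} repositories")
    return matches[0]
```

Raising an error for an unmapped or doubly mapped key surfaces a gap in the partitioning rules immediately, instead of letting an order be recorded in the wrong place or in two places.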
Single Source of the Truth
The “single source of the truth” is the principle that at any point in time there is only one version of the truth that is valid and can be trusted (i.e., the truth [whatever we assert to be correct] may change over time, but in order to be consistent we only allow one version of something to be true at any point in time).
This is most evident on documents such as product prices (only valid on a certain date), account statements (only containing information prior to a given date) and anything else that has a date on it. In all cases, the information can only be assumed to be correct on the date that it is issued and for any date range explicitly stated on the document.
Where a fact is time dependent, then we explicitly timestamp the data with the point in time that it was valid for and, subsequently, whenever the data is retrieved and passed on to another party we must always report the related timestamp.
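A minimal sketch of timestamped facts is a history of (valid-from, value) pairs with an “as of” lookup that always returns the timestamp alongside the value. The price history and the function name price_as_of are illustrative assumptions.

```python
from datetime import date
from decimal import Decimal

# Hypothetical price history for one product: (valid_from, price).
PRICE_HISTORY = [
    (date(2010, 1, 1), Decimal("9.99")),
    (date(2010, 7, 1), Decimal("10.49")),
]


def price_as_of(history, when):
    """Return (valid_from, price) for the entry in force on the given date."""
    candidates = [entry for entry in history if entry[0] <= when]
    if not candidates:
        raise LookupError(f"no price recorded on or before {when}")
    # The most recent entry not later than the requested date is the truth then.
    return max(candidates, key=lambda entry: entry[0])
```

Returning the valid-from date together with the price means whoever receives the value also receives the point in time for which it was true.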
A consequence of a “single source of the truth” is master data management where globally used information such as reference codes and metadata is managed centrally and then either distributed out into local data repositories acting as replicated data caches or remotely accessed on demand.
Another aspect of a “single source of the truth” is the database of record: the master copy of a fact that is always assumed to be correct, with any replicated (or denormalised) copies that differ from the master copy regarded as incorrect.
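Treating the master copy as authoritative makes reconciliation a one-directional comparison. The sketch below assumes both copies can be read as simple key-to-value mappings; the function name and sample data are hypothetical.

```python
_MISSING = object()  # sentinel so that a stored None is not mistaken for "absent"


def incorrect_keys(master, replica):
    """Keys where the replica disagrees with, lacks, or adds to the database of record."""
    differing = {k for k, v in master.items() if replica.get(k, _MISSING) != v}
    extra = set(replica) - set(master)
    return sorted(differing | extra)


# Hypothetical reference codes held centrally and in a local cache.
master = {"GB": "United Kingdom", "FR": "France", "DE": "Germany"}
replica = {"GB": "Great Britain", "FR": "France", "XX": "Unknown"}
```

With this data the replica is wrong for DE (missing), GB (different) and XX (not in the master), and in every case the replica, never the master, is the copy to correct.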
Conclusion
Unfortunately, time and space constraints prevent me from digging too far into the principles of data coherency, but hopefully I’ve covered the subject in sufficient detail to explain its significance and the issues that a coherent data processing environment needs to address.
1. For dialects, it’s important to ignore the accent of the language and concentrate on the tokens being used and the rules for structuring them into statements. Often it is the transcribed or written form that identifies the commonality of apparently different dialects rather than the way it is pronounced.
2. In the semantic analysis branch of linguistics, it is usually better to ignore works of fiction or fantasy, such as novels, because these are notoriously inaccurate in the statements that are made and in the rules that a writer may use to construct statements.
3. “It goes without saying” is a term that I really dislike because it implies that someone decided not to mention a significant business requirement on the assumption that it should already be widely known and accepted. Such assumptions can become extremely costly mistakes if the assumption proves to be incorrect, and it’s much safer to assume that nothing goes without saying and all requirements should be stated.
4. Within IT we frequently forget that people as well as systems are part of the boundary between the Enterprise and the Outside World, and the information they communicate to other people also has to be consistent with any other sources. Not allowing for this is why some companies have a field day litigating against other companies.
5. Model driven generation is frequently referred to as model driven architecture but really isn’t an “architecture” in its own right because it only covers one aspect of creating a data processing environment and ignores many other significant architectural aspects such as business continuity. Really it’s just an approach to generating artefacts from a model using design patterns and generation rules.