The phrase “data architecture” often carries different connotations across an organization, depending on where a person’s job role sits. For instance, most of my earlier career roles were within IT, though for the last decade or so I have been working primarily with business line staff. When I present at conferences, seminars, or DAMA chapters, I ask the audience to raise their hand if their job role reports to IT, and then to raise their hand if their job reports to a line of business. The purpose of this question is to ensure that the examples I choose will resonate with the differing backgrounds of the participants.
In this column, we’re going to explore the topic of data architecture and how it is tightly integrated with and dependent on sound data management practices, emphasizing the business perspective. Historically, data stewards, and the business data experts with whom they coordinate, have been expected to come up with high-level data requirements. The data architects in IT – data modelers and database designers – then take over and do the rest. There is no denying that technical skills, experience, and expertise in crafting well-organized, high-performing data stores are essential. Today, though, we’re going to walk through some important data management processes and practices pertaining to data architecture, so that business staff can deepen their understanding and gradually take on an expanded role. (Your mission, if you choose to accept it….)
The benefits to your organization of increased understanding, input, and engagement in determining build, buy, or hosted solutions will prove to be quite significant over time.
A quick story – a large financial firm hired a new analytics leader, who worked directly with IT to select the future toolset for self-service analytics. The selection was presented to line-of-business executives, who had not been engaged in the selection process. Well, as you may expect in a financial organization, some business data experts had thought deeply about what data sets needed to be provided through the toolset, and the features and capabilities they needed for reporting and predictive modeling. When they heard the presentation and asked informed questions, they concluded that this was the wrong solution for the organization, as the intended data scope was too narrow and the features didn’t meet the requirements. Luckily, the purchase had not been finalized, so it was back to the drawing board for the analytics leader to carry out an inclusive, requirements-driven selection process in collaboration with the business lines.
The responsibility goes both ways in this story. The analytics leader had, to his chagrin, failed to engage the business lines to find out what they wanted. On the business side, they knew the selection process was underway, yet no one proactively provided their data or feature requirements to IT and the new hire.
The business has the responsibility to know what data is needed, and to give thought to how it should be acquired, ingested, created, modified, and distributed by the platform or software. There is a big gap in most organizations between the vision documented in the data strategy (if one exists) and the transition path from what currently exists to what is desired. Overall, this gap reflects the failure of business lines and IT to work together.
Why should you care? Because we can, with reasonable accuracy, categorize the processes and practices of data management into these ‘Big Three’ inter-related buckets:
- Data Architecture – what data stores the organization builds or buys to store and distribute data
- Data Governance – how the organization manages and controls data
- Data Quality – how the organization improves the condition and characteristics of its data.
Leaving technology aside for now – ‘How’ the data will be housed and delivered and what features are offered – as befits our business perspective, let’s think about data architecture primarily as the ‘What’. This includes the data components and content of the collection of databases and repositories the organization builds or buys. Here the word ‘component’ means an element of a larger whole – for example, a component may be a system of record, a master data hub, an enterprise data warehouse, a data mart, a data lake, etc.
The complete collection of the data architecture components that currently exist in your organization can be referred to as the ‘As-Is’ (or ‘implemented’) data architecture. Although individual application systems and their corresponding data stores are typically reasonably well documented, virtually no organization has complete documentation, or even a diagram, of the overall As-Is data architecture, especially when interfaces are included. If you don’t know what you have, how can you know what path you should take to achieve what you want?
At a data warehousing conference, a session presenter showed a diagram of several critical operational data stores in his organization, with each incoming or outgoing interface depicted as a line. There were hundreds of lines all over the page, an intricate tangled spiderweb of overlapping multiple connections, which he referred to as ‘our data on drugs.’ He then described how his organization was re-architecting and consolidating these data stores to eliminate excessive redundant data, along with about 75% of these ad hoc point-to-point interfaces. They were embarking on the path to the ‘To-Be’ (or target) data architecture, the desired future state of the data – how it should be best organized, how to reduce redundancy and streamline the number of sources, and how to meet shared data needs as well as provide specialized information to business lines as needed.
Information technology cannot develop an optimal To-Be data architecture without substantial and sustained input from the business lines. Another quick story – an Enterprise Architecture group at a large organization developed a well-structured To-Be data architecture which exemplified best practices and was internally consistent and complete. When they finally presented it to the business line executives, they were roundly criticized with comments such as ‘this isn’t how we intend to use the data,’ ‘we can’t abandon this custom repository for a vendor product,’ and ‘you’ve proposed the wrong data stores.’ In short, it was rejected for rational and sound reasons – another example of the fact that IT needs the business to create the optimal data architecture.
The diagram below shows some of the key data management practices that are, by their nature (AKA, ‘essentially’) business-centric. All of them have a close bi-directional relationship to data architecture and the likelihood of success in crafting the To-Be architecture and creating a practical transition plan to get there. Data stewards and business data experts benefit from understanding and applying these practices to become active partners with IT in designing the future, and to improve the data they create and manage.
What do we mean here by the ‘What’? The philosopher Aristotle wrote extensively about essential and accidental properties of persons, places, things, concepts, and events (summarized as ‘entities’). For example, a table can be made of wood or metal, have three legs or four (accidental properties)[1] but it is, regardless, a table, with the corresponding defined function (essential property). Translating this principle to the data architecture realm, essential properties are those that identify and define a person, place, thing, concept, or event, and accidental properties are those characteristics that describe aspects of an entity but do not affect its essential nature – e.g., a person may have green eyes or blonde hair, may be 5’8” tall, may have siblings, etc.
So the business needs to know, and gain consensus about, what the essential properties of its data are. The first step is defining shared concepts – what we call Business Terms – key concepts that must be precisely defined and agreed upon for the organization to communicate accurately and share data effectively across business lines.
An example of an important shared concept for many organizations is the term ‘Client’ – basically, ‘a person or organization that receives services from a professional person or organization in return for payment.’ This is a simple concept. However – who is a client, when they become a client, and when they cease to be a client – may differ in the same organization, depending upon the purpose of the business unit creating or using the data. For Marketing, a client may be anyone who has expressed interest in the organization’s services, even if they have not signed up for services, or have used services once but not for some years. For Sales, a client often begins as a ‘prospect,’ and when a certain event takes place, such as signing a contract, they become a client. For Operations, the same person may not be considered a client until services have actually been initiated. If this organization wants to centralize core data about its clients (identification, name, address, contact information, client type, etc.) and to organize the data in a manner meaningful to the business lines, you can see that discussions and agreements about the ‘What’ are needed.
It’s also important to know what information the organization wants to capture and store about the client data. For example, it may be useful to know who created the initial record; when it was created; who modified it; what is the recognized official source for the latest information; who owns the data store(s); who maintains the data stores, etc. Information about data is known as Metadata, and the lines of business need to decide what information is necessary.
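To make this concrete, here is a minimal sketch, in Python, of the kind of metadata record a business line might decide to capture alongside each client record. The field names are hypothetical – the point is that the business, not IT, determines which of these descriptors are necessary.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Optional

# A minimal sketch of metadata a business line might choose to capture
# about each client record. All field names are hypothetical; the business
# decides which of these descriptors are actually necessary.
@dataclass
class ClientRecordMetadata:
    created_by: str                        # who created the initial record
    created_on: datetime                   # when it was created
    last_modified_by: Optional[str]        # who last modified it
    last_modified_on: Optional[datetime]   # when it was last modified
    official_source: str                   # recognized source of record for the latest information
    data_store_owner: str                  # who owns the data store
    data_store_maintainer: str             # who maintains the data store
```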
Data Governance is the operationalized function of collective decision-making about shared data across business lines. It represents the organization’s ability to apply and encourage staff resources to build, nurture, sustain, and control the data in a collaborative fashion for the benefit of all.
Effective governance is critical for many data architecture activities and decisions. For example, an organization may want to achieve more accurate identification of its clients by implementing a client data master hub to serve all business lines who produce or consume core client data. In the current environment, there may be multiple existing systems in which a client record is first created. There may also be multiple systems in which a client record can be changed, and these systems may require different mandatory data about a client. In our example above, Operations may require that a new client have a service status of ‘active’ before a record is created in their tracking system. This requirement can cause the client data to be out of sync with Sales, so that aggregate reporting is negatively affected, and some business processes may fail to be applied. A basic starting question for governance in this situation might be: ‘Do we agree on what a client is, and can we accurately count our clients at any point in time?’
Governance is the means by which these decisions must be made. What data should be included in the master data store? E.g., should problem resolution contacts with customer service be included, or handled elsewhere? Have the business terms been defined and agreed upon? E.g., have we decided what the different types and statuses of clients shall be? Have the metadata properties been defined and agreed upon?
In evolving towards the To-Be data architecture, Data Standards are important – for how data is represented (naming, acronyms, abbreviations, lengths, type codes, etc.); for security (e.g., what kind of access controls should be applied, who can view a record, who can change a record, will the scheme be role-based?); and for regulatory requirements (e.g., does the organization need an audit trail, does it need to keep original records as received, and for how long?).
The determination of standards is typically left mostly to IT, but the organization benefits if the business is engaged. For example, in the client master data store of our example, let’s say there are five different existing data stores where a client can first be created, and four of them have a data element called ‘Client Status.’ In one system, the acceptable values may refer to whether a client is ‘Active,’ ‘Pending,’ or ‘Inactive.’ In another system, the values may refer to payments, such as ‘On Time,’ ’30 Days Late,’ or ‘In Collections.’ In a third system, the values may represent still another categorization scheme. So, the questions are: which of these belong in the master data hub, what are the acceptable values, and if two of them make the cut, what name should we select to clarify the meaning of the second one? IT can assist with this decision, but only the business can make the final decision. They need you, you need them, QED.
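As an illustration of this kind of standardization decision, here is a hypothetical Python sketch of how two conflicting ‘Client Status’ code sets might be renamed and mapped for the master data hub. The system names, element names, and value mappings are assumptions for illustration only.

```python
# System 1 (call it "CRM"): engagement status, kept as the hub's 'Client Status'.
ENGAGEMENT_STATUS = {"A": "Active", "P": "Pending", "I": "Inactive"}

# System 2 (call it "BILLING"): payment-related codes, renamed to clarify meaning.
PAYMENT_STATUS = {"OT": "On Time", "L30": "30 Days Late", "COL": "In Collections"}

def to_hub_record(source_system: str, raw_status: str) -> dict:
    """Translate a source status code into the hub's standardized data elements."""
    if source_system == "CRM":
        return {"client_status": ENGAGEMENT_STATUS.get(raw_status, "Unknown")}
    if source_system == "BILLING":
        return {"payment_status": PAYMENT_STATUS.get(raw_status, "Unknown")}
    raise ValueError(f"No mapping agreed for source system {source_system!r}")
```

The renaming itself is trivial; the decision about which element keeps the name ‘Client Status’ is the business’s to make.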
As we noted in the ‘Big Three’ diagram, data architecture is a huge factor in achieving improved data quality. I’ll restate our industry’s standard observation here – 50% of data quality defects and anomalies are caused by bad design – primarily, legacy data stores developed by separate project teams without consideration of impacts to the To-Be data architecture. The example above, of three different meanings and value sets for the same term, also applies here. Another example, flipping that around: a large data warehouse in one organization had multiple names for the same data element – the listing exchange’s assigned letter abbreviation for a company’s stock (MSFT = Microsoft, ORCL = Oracle, etc.) – ‘Issue Symbol ID,’ ‘Ticker Symbol,’ ‘Stock Symbol,’ ‘Stock Ticker.’ Which data element are you going to use for aggregate reporting? What if the lists differ?
The To-Be data architecture should reflect the organization’s Data Quality Strategy (if it exists – and if it doesn’t exist, what are you waiting for?). A primary purpose of a planned evolution of the data architecture is to reduce or eliminate redundant data and standardize names, datatypes, lengths, and values over time, creating a well-organized collection of components enabling all users to easily search for, find, and use meaningful data. In the example of the client master data store, the organization backed into the importance of focusing on data quality through correcting design flaws in, and among, the client-creating systems. If a data quality strategy were in place, the sequence plan would undoubtedly reflect something like ‘In Year 1, we will concentrate on client data quality.’ Who needs to decide where improved quality is most important, and where efforts should be concentrated? The business lines.
Sound data store design principles and practices, with business input, can correct many data quality issues. However, the other 50% of quality defects and anomalies are caused by allowing erroneous data to enter data stores in the first place. Since you don’t know what you don’t know, it is wise to interrogate important data stores to discover whether there are problems. Data Profiling is the application of out-of-the-box features of a data quality tool, supplemented with custom queries as needed, to find out whether there is missing data, anomalous values, incorrect ranges, incomplete addresses, or any of many other potential problems. In the example we’ve been using, the master data store, the organization would want to profile data from the selected source systems to determine whether there are errors in the client data prior to selecting the sources for integration and consolidation in the master data hub.
For example, in the data source with the ‘Client Status’ values of ‘Active,’ ‘Pending,’ and ‘Inactive,’ which might be represented as ‘A’ ‘P’ and ‘I,’ the profiler may find an unexpected value, like ‘R.’ The ‘R’ could be a data entry error, a corrupted record, or a new valid value that has been added but not yet documented. Only the business line will know if the latter option is true. Other defects, such as a missing first name, or an incomplete street address, will also be discovered. The business line needs to review and weigh in on the profiling report and determine which results are valid, and if they are of sufficiently high impact to need immediate correction. Profiling will also often surface the need for quality rules to be applied to the data as it is entered or ingested. When a data store is being consolidated or undergoing redesign, profiling is a recommended first step to improve the design.
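The sketch below (Python, with the pandas library assumed) shows the kinds of checks a profiling exercise performs – frequency counts, unexpected code values such as the undocumented ‘R,’ and missing fields. Column names and the expected value set are hypothetical; a commercial data quality tool would provide similar checks out of the box.

```python
import pandas as pd

# Values the business has documented as acceptable for 'Client Status'.
EXPECTED_STATUS_VALUES = {"A", "P", "I"}

def profile_clients(df: pd.DataFrame) -> dict:
    """Produce a simple profiling summary for one source of client data."""
    status_counts = df["client_status"].value_counts(dropna=False)
    unexpected = set(status_counts.index) - EXPECTED_STATUS_VALUES
    return {
        "row_count": len(df),
        "status_frequencies": status_counts.to_dict(),
        "unexpected_status_values": unexpected,               # e.g., an undocumented 'R'
        "missing_first_name": int(df["first_name"].isna().sum()),
        "missing_street_address": int(df["street_address"].isna().sum()),
    }
```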
Data Quality Assessment is the business-driven evaluation of whether the condition of the data is acceptable – what number or type of errors interfere with deriving informed business decisions, how good is ‘good enough,’ and what quality rules need to be applied to the data to improve its accuracy and completeness. In the case of the client master data store, the business lines supplying or consuming the data need to determine what rules should be applied to the data ingested from the source system(s). Should the master data hub accept a customer record from Source 1 with a 2-character name, a missing zip code, etc., or should those records be routed to a data steward for analysis?
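Here is a minimal sketch, again in Python with hypothetical field names and thresholds, of business-defined quality rules applied at ingest: records that fail are routed to a data steward queue rather than loaded into the hub.

```python
def validate_client(record: dict) -> list[str]:
    """Apply business-defined quality rules; return a list of issues found."""
    issues = []
    if len(record.get("last_name", "")) < 3:   # e.g., should a 2-character name be accepted?
        issues.append("name too short")
    if not record.get("zip_code"):
        issues.append("missing zip code")
    return issues

def ingest(records: list[dict]) -> tuple[list[dict], list[dict]]:
    """Split incoming records into accepted records and a steward review queue."""
    accepted, steward_queue = [], []
    for rec in records:
        (steward_queue if validate_client(rec) else accepted).append(rec)
    return accepted, steward_queue
```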
In the same master data scenario, rules for precedence in the case of conflicting client information from more than one source system are an important element of Data Lifecycle Management, based on the business-determined designation of which source is authoritative with respect to what data. For example, if the ‘A’ ‘P’ ‘I’ values of ‘Client Status’ are the selected categorization scheme, and there are two client data stores with those values, the business line needs to specify to IT which source will be accepted into the master data hub if there is a conflict – e.g., one system has ‘A’ and the other ‘I.’ In a similar manner, to provide input to IT for Data Integration, only the business can determine what data needs to be integrated, how it should be represented, and what the valid data relationships are, with the assistance of a data modeler. Modeling for integration is a vital nexus where business lines and IT must achieve agreement; your engagement and decisions will affect how your data is planned, structured (architected), and organized for years to come. It is highly recommended that data stewards learn how to interpret and validate a data model; organizations can provide training or materials for self-study, for instance, the book “Data Modeling Made Simple” by Steve Hoberman.
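A precedence rule like the one described can be expressed very simply once the business has made the decision. The sketch below assumes a hypothetical ordering of sources; only the business can decide which source is actually authoritative.

```python
# Business-determined precedence: the first listed source wins a conflict.
SOURCE_PRECEDENCE = ["OPERATIONS", "SALES"]

def resolve_client_status(candidates: dict[str, str]) -> str:
    """candidates maps source system -> status value, e.g. {'SALES': 'A', 'OPERATIONS': 'I'}."""
    for source in SOURCE_PRECEDENCE:
        if source in candidates:
            return candidates[source]
    raise ValueError("No recognized source supplied a Client Status value")
```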
Finally, back to square one – Data Requirements, the detailed level of architecting data. For any data store, only the business lines can determine what data is required to satisfy business needs. The business role in determining and specifying data requirements is undeniably critical. For example, with today’s increased emphasis on data privacy, it could be decided that the data element ‘Opt Out’ with the values ‘Y’ for Yes and ‘N’ for No, originally captured in the organization’s eCommerce system, should be included within the client master data store, because every business line may need to know about a client’s preference.
For every functional requirement involving the system acting on data, there is a corresponding data set. Let’s say that a functional requirement calls for the system to derive an average client acquisition count, which would allow the user to learn how many new clients the organization has acquired. The business needs to know what calculations are required and validate that the ‘Client Create Date’ and a ‘Client Status’ value of ‘A’ are sufficient to determine the count of new clients within a designated time period, and then average the count by time period, such as a year, a quarter, or a month. If business line staff are actively engaged in thinking through and developing data requirements, the design and completeness of data stores will significantly improve.
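As a worked illustration of that requirement, the sketch below derives an average client acquisition count by year, counting clients whose ‘Client Create Date’ falls in each year and whose ‘Client Status’ is ‘A,’ then averaging across years. The field names and the yearly grain are assumptions the business would need to validate.

```python
from datetime import date

def average_new_clients_per_year(clients: list[dict], years: list[int]) -> float:
    """Count new active clients created in each year, then average across the years."""
    counts = []
    for year in years:
        count = sum(
            1 for c in clients
            if c["client_status"] == "A"
            and date.fromisoformat(c["client_create_date"]).year == year
        )
        counts.append(count)
    return sum(counts) / len(counts) if counts else 0.0
```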
Remember that there is a demonstrated cost curve in application / data store development. The time and effort required to add features or data in the requirements phase is much less than discovering new requirements in the technical design or implementation phase. Therefore, if you want better, more accessible, high-quality data, work with IT hand in hand from the beginning. Enjoy your expanded role in data architecture, but be careful – you might morph into a “Data Person.”
[1] He further decomposed nine types of accidental properties, which are actually quite useful to explore in building a fully attributed data model: quantity, quality, relation, habitus, time, location, situation / position, action, and acted upon. Presented for your interest, and now we’ll move on.