During the course of this regulatory compliance effort, I hope to accomplish the following data-related tasks:
- Begin developing an enterprise-level conceptual data model
- Identify the most authoritative source(s) of data for each entity in the conceptual model
- Understand and document (in the data model) the relationships between these sources of data
- Assess the quality of the data from each data source (a rough profiling sketch follows this list)
- Determine the most cost-effective way of accessing the data needed from each data source
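To make the quality-assessment task above a little more concrete, here is a minimal profiling sketch in Python. The entity, the field names, and the 90-day staleness threshold are all hypothetical; the idea is simply that each candidate source for an entity gets scored on the same small set of metrics so the results can be compared.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional

@dataclass
class CustomerRecord:          # hypothetical "Customer" entity from the conceptual model
    customer_id: str
    tax_id: Optional[str]      # often missing or malformed in legacy sources
    last_updated: date

def profile_source(records: list, stale_after_days: int = 90) -> dict:
    """Score one candidate source on completeness and currency."""
    total = len(records)
    if total == 0:
        return {"row_count": 0, "completeness_tax_id": 0.0, "currency": 0.0}
    missing_tax_id = sum(1 for r in records if not r.tax_id)
    stale = sum(1 for r in records
                if (date.today() - r.last_updated).days > stale_after_days)
    return {
        "row_count": total,
        "completeness_tax_id": 1 - missing_tax_id / total,
        "currency": 1 - stale / total,
    }

# Two candidate sources for the same entity, scored identically
billing_extract = [CustomerRecord("C1", "91-001", date(2012, 5, 1)),
                   CustomerRecord("C2", None, date(2010, 1, 15))]
crm_extract = [CustomerRecord("C1", "91-001", date(2012, 6, 20)),
               CustomerRecord("C2", "91-002", date(2012, 6, 18))]

print("billing:", profile_source(billing_extract))
print("crm:    ", profile_source(crm_extract))
```

Scores like these feed directly into the "most authoritative source" decision for each entity in the conceptual model.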
One of the major decisions I’ll need to make over the course of data design for this project is whether to use each piece of data at its source (and how), or whether the data needs to be replicated and integrated into a common data store (such as a data warehouse or an operational data store). Generally speaking, I like to use data at its source whenever possible, since replicating data usually just increases the cost of provisioning it without adding much benefit. In some cases, though, creating a common integrated data store can enable us to provision higher-quality data in a more easily consumable and business-relevant form. This enables more successful data reuse throughout the business, increasing the value (and ROI) of our data.
Those of you who have read my book [1] know that I approach a lot of data management questions from an economic point of view. That is, which approach will produce the most business value at the least cost? Too many developers feel that all the data they want or need should be available within each application database, no matter how many times a particular set of data gets replicated. Likewise, too many data managers make database decisions based on philosophical inclinations, rather than assessing the costs and benefits of a given approach.
The reality is that there is no free data. This is a logical extension of the economic principle of TANSTAAFL (There Ain’t No Such Thing As A Free Lunch) that we all learned in Economics 101. Replicating data (and supporting multiple replicated copies, and resolving the resulting data discrepancies) costs money. So does designing, building, and supporting data warehouses, data marts, and operational data stores. Even data federation (using all data at its original source) can be expensive if your data suffers from quality, currency, or consistency issues; many companies have made very expensive (and very public) bad decisions based on incorrectly federated data. So any approach to data management needs to weigh the economic impact of numerous factors, including the quality, consistency, currency, accessibility, and general business usability of the data at each source. If you have data problems in your organization, the Data Fairy isn’t going to fix them.
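To show what this kind of economic comparison looks like, here is a back-of-the-envelope sketch. Every dollar figure in it is invented purely for illustration; the point is only that both replication and consolidation carry real, estimable costs that can be put side by side before a decision is made.

```python
# Illustrative only: every figure below is made up to show the shape of the
# comparison, not actual costs for any real organization.

copies = 6                       # application databases each holding their own copy
per_copy_support = 40_000        # yearly ETL, storage, and support per copy
discrepancy_resolution = 25_000  # yearly effort spent reconciling mismatched numbers

replication_cost = copies * per_copy_support + discrepancy_resolution

shared_store_build = 150_000     # one-time design/build, amortized over three years
shared_store_support = 60_000    # yearly support for the integrated store

consolidation_cost = shared_store_build / 3 + shared_store_support

print(f"Replicate into every application: ${replication_cost:,.0f} per year")
print(f"Consolidate and share:            ${consolidation_cost:,.0f} per year")
```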
I also must keep in mind that these data design decisions need to be made in the context of the data architecture roadmap we are developing. In particular, our data architecture needs to address the issue of consumption vs. consolidation, and provide guidelines for when data should be integrated into data repositories (consolidation) and when it should be accessed at its source (consumption). Which path you take depends on the degree to which your data, at its source, is of sufficiently high quality, easily accessible and shareable, sufficiently current, and usable across multiple areas of the business. For data that is overly fragmented (across multiple application databases), of poor quality, or not easily accessible, it probably makes sense to create an integrated data repository and/or a set of master or reference databases. A hybrid approach to the consumption vs. consolidation issue might be to create a set of master or reference databases, along with a few integrated data stores designed around specific areas of the business (e.g., Order Processing, Customer Relationship Management, Supply Chain Management), and then overlay these trusted data sources with some sort of data virtualization (data federation) capability that allows the data to be easily shared and reused across the business.
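One way to picture the hybrid approach is as a routing layer: consumers ask for an entity, and the architecture (not the consumer) decides whether that request is served from a consolidated master store or federated out to the source system. The sketch below is a toy stand-in for a real data virtualization product; the entity names, source systems, and routing decisions are all hypothetical.

```python
from typing import Callable, Dict

# Hypothetical readers: one consolidated master store, one source-system read
def read_from_customer_master(key: str) -> dict:
    # Customer data is fragmented and of uneven quality, so it was consolidated
    return {"customer_id": key, "served_by": "customer_master_db"}

def read_from_order_app(key: str) -> dict:
    # Order data is clean and current at its source, so it is consumed in place
    return {"order_id": key, "served_by": "order_processing_db"}

# The routing table is where the consumption-vs.-consolidation decision lives
ROUTES: Dict[str, Callable[[str], dict]] = {
    "customer": read_from_customer_master,
    "order": read_from_order_app,
}

def get_entity(entity: str, key: str) -> dict:
    """Single entry point for consumers; they never need to know the source."""
    reader = ROUTES.get(entity)
    if reader is None:
        raise ValueError(f"No trusted source registered for entity '{entity}'")
    return reader(key)

print(get_entity("customer", "C1"))
print(get_entity("order", "O42"))
```

The design point is that the per-entity decision is recorded in one place, so it can change later (say, when an order master is eventually built) without every consumer having to change with it.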
I’m probably going to end up advocating the hybrid approach in the data architecture roadmap I’m developing for our company. During the stakeholder interviews I’ve been conducting, I’ve gotten an earful about how fragmented our company’s data is, and how difficult it is for people in our business units to identify trusted sources of data and integrate data across disparate applications and platforms for reporting and analysis.
This means that as I work with the various data sources needed to satisfy our regulatory compliance effort, I’ll also be assessing the trustworthiness of each source of data. Those assessments will inform the design of the integrated data stores and master/reference databases that will become part of our data architecture roadmap, and of our long-term plans for implementing EIM (Enterprise Information Management) at our company.
NOTE: I’d like to make this a dialogue, so please feel free to email your questions, comments and concerns to me at Larry_Burns@comcast.net. Thanks for reading!
References:
1. Burns, Larry. Building the Agile Database (Technics Publications, 2011).