This article examines the connections and similarities between data architecture components. The most direct patterns are those in which one component is a prerequisite for, or a composite of, another. In other cases, the organization of one component has served as the model for organizing another, much as biological metaphor inspired genetic algorithms in computing. One such similarity is between the network data model within a database and a distributed network of relational or object databases.
Data Architecture Components and Techniques
Data architecture components provide the following functions for organizing enterprise data to meet business goals:
- Foundational Frameworks: Data governance framework, metadata management framework and repository.
- Data Flow Architecture Components: Data extraction, data quality, data warehouse, data marts, semantic layers between components, data lineage services.
- Data Services Architecture Components: Data security and identity management services, data privacy services, data audit and compliance, interoperability services, master data management services, data profiling services, data analysis services, data mining services, data synchronization services, data rules and constraints services.
- Techniques Used: Distributed databases, replication, statistical and data mining techniques.
These components interact in various ways to satisfy the functioning enterprise’s information needs.
Interplay Between the Data Architecture Components
In this section, we take one component at a time and go through the components related to it, discussing each relationship in detail.
A. Data Governance
Data governance is the set of policies, procedures and guidelines for data management through its life cycle. Data governance is a prerequisite for data quality and master data management (MDM).
Without an established data governance framework to follow and data stewardship to enforce, it becomes difficult to run a sustainable data quality or MDM program.
Data governance may focus on any of the following areas, and is therefore a prerequisite for the corresponding initiatives:
- Data quality
- Data warehouse and business intelligence
- Compliance with regulations
- Security and privacy
Data governance gives these initiatives a reliable road map for achieving their goals.
B. Metadata Management
Metadata, Data Quality and Data Analytics: Metadata is gathered from documentation, tools, metadata repositories, program documentation, business rules, and interviews with subject matter experts, data stewards, data owners and users, so that all stakeholders are covered. Data is validated against this metadata.
Data quality involves a two-pass algorithm. During the first pass, a sample is taken that covers all the significant use cases and their variations, weighted in proportion to the actual volumes. Data profiling is done over this sample, and the metadata is collected to identify and locate possible errors. At this stage, the errors are studied in depth to validate the metadata itself. For example, if the metadata says that grade must be one of four values (A, B, C and D) and we find that all the data has only A as a grade, then the validity of the metadata itself becomes questionable and it should be reviewed. In this scenario, the following possibilities emerge:
- The metadata is correct, but the data is not available at the point of data entry, so the data entry clerk picks a default option. The remedy is to fix the business process so that the data is captured later, when it becomes available.
- The data is available, but the data entry clerk has not been trained in the purpose of this option, and it does not affect the clerk’s own work, so there is no incentive to pay attention to it. The remedy is training.
- The data is available, but is lost in transcription and handoffs between people. The greater the distance and time lag between the actual event and its recording, the less accurate the recorded data. In this case, it is better to record the data at the time of the event and to automate the recording instead of keeping it manual.
- The application is faulty and does not allow any other selection. In this case, the application must be fixed and the corrected version released to production.
Based on the investigation, the root cause is identified and remedial action is taken (such as a change to the business process or application, or training for the user). Metadata found to be invalid is removed or corrected, and metadata confirmed as correct is retained. The metadata is now complete, validated and fit to serve as the reference for data quality measurements.
During the second pass against this complete and validated metadata, all the data available is validated for errors. Erroneous records are grouped as per the errors and dealt with accordingly.
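The two passes described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the `metadata` dictionary, the field name `grade`, the sample records and the rule "only one of several allowed values is ever used" are all assumptions made for the example.

```python
from collections import Counter, defaultdict

# Hypothetical metadata: the allowed values for each field.
metadata = {"grade": {"A", "B", "C", "D"}}

def profile(records, field):
    """Count how often each value of a field occurs in a set of records."""
    return Counter(r.get(field) for r in records)

def suspicious_rules(sample, metadata):
    """Pass 1: flag fields whose sample uses at most one of several allowed values.

    A flagged field means the metadata (or the process feeding it) needs review.
    """
    flagged = []
    for field, allowed in metadata.items():
        seen = set(profile(sample, field)) & allowed
        if len(allowed) > 1 and len(seen) <= 1:
            flagged.append(field)
    return flagged

def validate(records, metadata):
    """Pass 2: check every record and group the erroneous ones by error."""
    errors = defaultdict(list)
    for r in records:
        for field, allowed in metadata.items():
            if r.get(field) not in allowed:
                errors[f"{field}: value not allowed"].append(r)
    return dict(errors)

# Illustrative data only.
sample = [{"id": 1, "grade": "A"}, {"id": 2, "grade": "A"}]
full = sample + [{"id": 3, "grade": "E"}, {"id": 4, "grade": None}]

print(suspicious_rules(sample, metadata))  # fields whose metadata should be reviewed
print(validate(full, metadata))            # erroneous records grouped by error
```

In practice the second pass would run only after the flagged metadata has been investigated and corrected, as the text describes.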
Metadata is also used for ongoing monitoring of data quality and to see the trend of data from various sources.
Metadata and Data Warehouse: Metadata is used to prepare the specification for the logical data map for building the data warehouse. This logical data map is the specification based on which extract-transform-load (ETL) programming is done. It is the glue that holds the data warehouse together.
Metadata is used to profile and qualify sources for extract in a data warehouse using data profiling and data quality methods.
Metadata is used for classifying users for security configurations, impact analysis of any changes to source schema, and to estimate the cost of change.
Metadata and other Data Governance Initiatives: Seamlessly integrated metadata management across all the data architecture components with appropriate semantic interconnect layers in between them is a prerequisite for any enterprise-wide initiatives like MDM, data quality or data security.
Metadata itself could be subject to analysis and mining to understand the changes and trends in their structures and their relationships.
C. Data Analytics
Data Analytics with Data Warehouse (DW) and Business Intelligence (BI): Business analytics is carried out on the cleansed, integrated and historical data in a data warehouse. So data warehousing is a prerequisite for data analytics both in terms of quality and quantity.
Data Analytics with Data Quality: Business intelligence normally involves running analytics on key business process (KBP) measures and key performance indicators (KPI). For data architects, this translates to analytics applied to data itself at the meta level. Analyzing the statistical distribution of data is known as data profiling. It is a precursor to data quality.
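As a small illustration of profiling a single column, the sketch below computes a few common distribution statistics. The function name `profile_column` and the sample values are hypothetical; real profiling tools report many more measures.

```python
from statistics import mean

def profile_column(values):
    """Summarize the distribution of one column; None stands for a missing value."""
    present = [v for v in values if v is not None]
    return {
        "count": len(values),
        "nulls": len(values) - len(present),
        "distinct": len(set(present)),
        "min": min(present) if present else None,
        "max": max(present) if present else None,
        "mean": mean(present) if present else None,
    }

# Illustrative data only.
amounts = [120, 80, None, 80, 400]
print(profile_column(amounts))
```

A profile like this is what gets compared against the metadata (allowed ranges, mandatory fields, expected cardinality) before data quality rules are applied.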
Data Analytics with MDM: Analytics such as data profiling can be run on master data gathered in a data warehouse from various sources, as a first step toward consolidating it and factoring MDM out.
Analytics is generally done on facts or measures of business processes to monitor key performance indicators. In some cases, master data also lends itself to business analytics, in which case master data from MDM assumes the role of a measure. For example, a high frequency of changes to a customer’s mobile phone subscription plan may indicate that the customer is about to switch to a competing service provider. Conversely, customers’ spending patterns are frequently used to classify them as high value or low value customers. Here, the measure (amount spent) has become a dimension (classification scheme), assuming the opposite role.
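The role reversal from measure to dimension can be sketched as below. The threshold of 1000, the customer names and the amounts are assumptions chosen only to make the example concrete.

```python
def value_segment(total_spent, threshold=1000):
    """Turn a measure (amount spent) into a dimension value (customer segment)."""
    return "high value" if total_spent >= threshold else "low value"

# Illustrative transactions: (customer, amount spent).
transactions = [("alice", 700), ("bob", 150), ("alice", 450), ("bob", 90)]

# Aggregate the measure per customer.
totals = {}
for customer, amount in transactions:
    totals[customer] = totals.get(customer, 0) + amount

# The derived segment can now be used to slice other facts, like any dimension.
segments = {customer: value_segment(total) for customer, total in totals.items()}
print(segments)
```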
Data Analytics with Metadata: Analytics is used on a mutually exclusive and collectively exhaustive group of samples of data to verify metadata quality. Once the metadata has been shored up by this process, the corrected metadata is used to validate the entire volume of data and identify the errors.
Metadata can also be subjected to analytics in metadata repository or warehouse. Here the distribution, discrepancies, exceptions and outliers in terms of adherence to standards, policies and procedures could be profiled. Also depending on these analytics studied over time, we can predict what would be the evolution of data structures in the future for given situations like a merger/acquisition, business expansion, recession, etc.
Data mining could also be used to analyze data, as well as metadata. This will yield three categories of information:
- Business intelligence (intelligence on performance measures)
- Data quality intelligence (intelligence on metadata)
- Metadata intelligence (intelligence on metadata’s behavior over time, and other dimensions)
In such analysis, we apply dimensional analysis recursively to our own IT business. This is akin to relational metadata being stored in a relational data dictionary, or to XML schemas, which describe the structure of XML documents yet are themselves XML documents. Metadata intelligence helps us study the behavior and performance of the design choices made in our data architecture. Depending on our key performance indicators and their trend over time, we can adopt the choices that return good value on our investment of time and resources.
D. Replication
Replication is generally seen as a technique for improving performance by introducing controlled redundancy. It forms part of physical database design, where performance considerations assume prominence once the logical design is complete. Apart from this obvious use, replication has various other applications, listed below:
Indexing for performance: Indexing is essentially replicating the key columns one more time along with associated physical address of the rows.
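A toy version of this idea is sketched below: the index holds a second copy of the key column's values together with the positions of the rows that contain them. List positions stand in for physical row addresses, and the table and column names are hypothetical.

```python
from collections import defaultdict

# Illustrative table: a list of rows, where the list position plays the
# role of the physical row address.
rows = [
    {"id": 10, "city": "Pune"},
    {"id": 11, "city": "Mysore"},
    {"id": 12, "city": "Pune"},
]

def build_index(rows, key):
    """Replicate the key column as a mapping from key value to row positions."""
    index = defaultdict(list)
    for position, row in enumerate(rows):
        index[row[key]].append(position)
    return index

city_index = build_index(rows, "city")

# A lookup goes through the index instead of scanning every row.
pune_rows = [rows[p] for p in city_index["Pune"]]
print(pune_rows)
```

The redundancy is controlled in the sense that every insert, update or delete on the table must also maintain the index.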
Replication in columnar databases: To achieve performance and for row materialization, columnar data models frequently have replicated columns.
Replication in relation to distributed databases: Replication is considered after weighing the benefit to queries against the cost of updates. If a particular site issues a large number of queries, that may call for replication to that site in a distributed database configuration.
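The trade-off between query benefit and update cost can be illustrated with a deliberately simple cost model. The linear model, the parameter names and the figures are assumptions for the example; the allocation models in the distributed database literature are considerably richer.

```python
def should_replicate(reads_per_day, updates_per_day,
                     remote_read_cost, local_read_cost,
                     update_propagation_cost):
    """Return True when the daily saving on reads exceeds the daily cost of
    propagating updates to the extra copy (hypothetical linear model)."""
    benefit = reads_per_day * (remote_read_cost - local_read_cost)
    cost = updates_per_day * update_propagation_cost
    return benefit > cost

# A read-heavy site: replication pays off.
print(should_replicate(10_000, 100, 5.0, 1.0, 20.0))

# An update-heavy site: keeping a replica in sync costs more than it saves.
print(should_replicate(100, 5_000, 5.0, 1.0, 20.0))
```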
Disconnected computing: Salespeople travelling to remote places, where connectivity is not assured, work on mobile devices in disconnected mode. Replication and synchronization provide a solution.
Systems of reference in distributed master and metadata repositories: Frequently for availability reasons, master and metadata repositories that are distributed follow the systems of reference, where the data gets replicated so that dependent systems do not go down if the network connection goes down or the central server crashes.
Fault-tolerance: In a typical database, log records are written on each transaction commit, in addition to the writes to the database files that happen at checkpoint intervals. Although the transaction changes are thus written both to the log and to the database (a form of replication of the transaction across the two files), the log is written primarily to satisfy the durability property of committed transactions during failure recovery.
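The interplay of commit-time logging, checkpoints and recovery can be sketched with a toy in-memory store. `TinyStore` and its attributes are hypothetical and greatly simplified: there is no undo, no concurrency and no real disk I/O.

```python
class TinyStore:
    """Toy key-value store with a commit log and periodic checkpoints."""

    def __init__(self):
        self.log = []          # stands in for the durable transaction log
        self.data_file = {}    # stands in for the database files on disk
        self.cache = {}        # in-memory state, lost on a crash
        self.checkpointed = 0  # number of log entries already in data_file

    def commit(self, key, value):
        # The change is appended to the log at commit time.
        self.log.append((key, value))
        self.cache[key] = value

    def checkpoint(self):
        # At checkpoint intervals the changes reach the database files.
        self.data_file.update(self.cache)
        self.checkpointed = len(self.log)

    def crash_and_recover(self):
        # Memory is lost; start from the files and redo the log tail.
        self.cache = dict(self.data_file)
        for key, value in self.log[self.checkpointed:]:
            self.cache[key] = value

store = TinyStore()
store.commit("a", 1)
store.checkpoint()
store.commit("b", 2)       # committed, but not yet in the database files
store.crash_and_recover()
print(store.cache)         # both committed changes are visible after recovery
```

The second commit survives the crash only because its copy in the log is replayed, which is the durability role the text assigns to the log.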
Change data capture in operational data stores and subsequent replicated data flow through the data warehouse: In a data warehouse architecture, the transaction processing systems feed the data warehouse with the current detail data. This is also a form of replication where the detail data finds its way through data quality, data staging areas, dimensional data warehouse and multidimensional OLAP cubes. Any analytics that can be done by streaming transaction data through the analysis query instead of these multiple stagings would be a step forward toward real time business intelligence.
E. Distributed Databases
Distributed Databases with MDM and Metadata: Distributed database architectures depend on distribution of data, autonomy, heterogeneity of data model formats and database products [3,4].
In both the MDM and metadata components, master data and metadata can themselves be distributed. In addition to normal OLTP data being distributed, distribution here applies to the master data and metadata repositories, whose styles and configurations may be centralized or distributed, federated or non-federated.
F. Data Warehouse
Data Warehouse and Data Quality: The data warehouse and data quality benefit from each other. Data quality can be achieved when data from various sources is integrated, because the sources can then be compared with one another and normalized to a common granularity, unit of measure and terminology. The central place for such integration could be a data warehouse. Without such a component, data quality achieved anywhere else would not be complete and comprehensive. Here the data warehouse acts as a prerequisite infrastructure component that enables data quality activities.
On the other hand, a data warehouse’s primary purpose is decision support.
Decision making depends on:
- Quality of data
- Quality of analysis done on data
- Quality of interpretations out of the analytic results
So the quality or severity of the decision is directly dependent on the quality of data that is used to arrive at the decision.
Data Warehouse and MDM: MDM can be factored out of a data warehouse, or a pre-existing MDM can facilitate a data warehouse initiative by reducing the ETL load on a data warehouse.
If a data warehouse exists first and we want to factor MDM out of it, then we:
- Profile master data
- Validate master metadata
- Validate master data
If MDM exists first and a data warehouse is built afterwards, all we need to plan in the logical data mapping are the facts from the OLTP transactions. The dimensions will be loaded from the MDM.
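When MDM already exists, the warehouse load reduces to mapping transactions onto dimension members that MDM supplies. The sketch below assumes a hypothetical customer dimension keyed by the source customer id and a hypothetical order feed; all names and values are illustrative.

```python
# Hypothetical master data served by MDM, keyed by the source customer id.
mdm_customers = {
    "C001": {"customer_key": 1, "name": "Asha"},
    "C002": {"customer_key": 2, "name": "Ravi"},
}

# Hypothetical OLTP transactions that will become facts.
oltp_orders = [
    {"order_id": 500, "customer_id": "C001", "amount": 250.0},
    {"order_id": 501, "customer_id": "C002", "amount": 75.5},
    {"order_id": 502, "customer_id": "C999", "amount": 10.0},
]

def build_fact_rows(orders, customer_dimension):
    """Resolve each order against the MDM dimension; unknown customers are
    set aside for investigation instead of being loaded."""
    facts, rejects = [], []
    for order in orders:
        member = customer_dimension.get(order["customer_id"])
        if member is None:
            rejects.append(order)
            continue
        facts.append({
            "order_id": order["order_id"],
            "customer_key": member["customer_key"],
            "amount": order["amount"],
        })
    return facts, rejects

facts, rejects = build_fact_rows(oltp_orders, mdm_customers)
print(facts)
print(rejects)
```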
Data Warehouse and Metadata Repository: A metadata repository is data warehousing applied to metadata. Here all the metadata from the various tools used in the different life cycle stages of data is brought to one place, conformed and aligned. A metadata repository plays two distinct roles. Firstly, it acts as a registry or lookup for all the information artifacts. Secondly, it acts as a translation layer between semantically different systems, such as an online transaction processing (OLTP) system and an object-oriented application layer, XML-based OAGIS data over the enterprise service bus, or an online analytical processing (OLAP) cube.
Metadata can be studied under two categories:
Design time metadata. Design time metadata is similar to master data. It has a long lifetime and lasting value. It is referenced many times, and has a one-to-many relationship with the run time metadata. This is the intentional metadata: we configure a system with it so that the system runs that way at run time.
Run time metadata. Run time metadata is similar to transaction data. It lives only for the duration of a particular instance of a run. It references the design time metadata, but represents real-world events captured as data. It is the actual data about a particular run of, for example, extract-transform-load (ETL) routines that happened in the real world on a certain date and time. Retained in a warehouse, it is also historical and can be used for analytics and to study trends.
Metadata can have a metadata warehouse of its own. A data warehouse in turn needs metadata as the glue that binds the entire data flow architecture that feeds the data warehouse and derives business intelligence out of it.
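The one-to-many relationship between the two categories can be sketched with two record types. The class names, field names, job name, dates and row counts below are hypothetical and serve only to show the shape of the relationship.

```python
from dataclasses import dataclass
from datetime import datetime

@dataclass(frozen=True)
class JobDefinition:
    """Design time metadata: long-lived, configured once, referenced often."""
    name: str
    source: str
    target: str

@dataclass(frozen=True)
class JobRun:
    """Run time metadata: one record per actual execution of a job."""
    job_name: str          # reference back to the design time definition
    started_at: datetime
    rows_loaded: int
    status: str

definitions = {
    "load_customers": JobDefinition("load_customers", "crm.customers", "dw.dim_customer"),
}

# Illustrative run history: many runs reference one definition.
runs = [
    JobRun("load_customers", datetime(2024, 1, 1, 2, 0), 1200, "ok"),
    JobRun("load_customers", datetime(2024, 1, 2, 2, 0), 1250, "ok"),
]

def runs_for(job_name):
    """All recorded runs of one job definition."""
    return [r for r in runs if r.job_name == job_name]

def average_rows(job_name):
    """A simple trend measure computed from the historical run metadata."""
    history = runs_for(job_name)
    return sum(r.rows_loaded for r in history) / len(history)

print(average_rows("load_customers"))
```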
G. Data Quality
Data Quality and MDM: Data quality applied to master data is MDM. Here we identify the various sources and locations where master data gets created. Then we decide on the system of record, which is where the master record is created. We also decide the systems of reference, where master data is referenced extensively. Replication could be used to facilitate a system of reference. Depending on these considerations, we choose a style of MDM implementation, such as federated with peer-to-peer updates, centralized life cycle activities, or transparent access through consolidated views that mask the distributed sources of master data.
Significance of these Interrelationships
The interrelationships indicate that none of these components can exist in isolation. At the same time, it is not practical to implement all the components and go live with them at once. Data architecture evolves with the business, its users and their requirements, along with the technologies and trends available to the enterprise. Awareness of these interconnections serves two purposes: first, understanding the dependencies between the components so that they function together effectively and deliver their promised results; and second, cross-applying and reusing their styles and patterns elsewhere in the data architecture.
We have elucidated some of the patterns, styles and behaviors observed in the organization, relationships and joint functioning of the various components of data architecture. This will help us reuse these techniques in other contexts where similar challenges make them worth considering.
The first two authors, Sundararajan and Anupama Nithyanand, are grateful to their mentor and third author, S V Subrahmanya, Vice President at E-Commerce Research Labs, for seeding and nurturing this idea, and Dr. T. S. Mohan, Principal Researcher, Dr. Ravindra Babu Tallamraju, Principal Researcher, and Dr. Sakthi Balan Muthiah, Manager-Research at E-Commerce labs at Education & Research, Infosys Technologies Limited, for their extensive reviews and expert guidance in articulating these ideas. The authors would like to thank all their colleagues and participants of authors’ training and knowledge sharing sessions at Infosys Technologies Limited that contributed positively to these ideas.
The authors would like to acknowledge and thank the authors and publishers of referenced papers and textbooks, which have been annotated at appropriate sections of this paper, for making available their invaluable work products which served as excellent reference to this paper. All trademarks and registered trademarks used in this paper are the properties of their respective owners / companies.
- Jack E. Olson, Data Quality: The Accuracy Dimension, Elsevier, Morgan Kaufmann, 2003.
- Allen Dreibelbis, et al., Enterprise Master Data Management – An SOA Approach to Managing Core Information, Dorling Kindersley (India) Pvt. Ltd., 2008.
- Stefano Ceri, Giuseppe Pelagatti, Distributed Databases – Principles & Systems, McGraw-Hill International Editions, 1984.
- M. Tamer Özsu and Patrick Valduriez, Principles of Distributed Database Systems, Pearson Education, 2001.
- Sundararajan PA, Anupama Nithyanand and Subrahmanya SV, “Dimensional Analysis of Data Architecture,” published earlier by TDAN.
- Sundararajan PA, Anupama Nithyanand and Subrahmanya SV, “Dimensions of Data Architecture,” published earlier by TDAN.
- The Zachman Framework for Enterprise Architecture, Zachman Institute for Framework Architecture (www.zifa.com, www.zachmaninternational.com).
- Ralph Kimball and Margy Ross, Data Warehouse Toolkit: The Complete Guide to Dimensional Modeling, 2nd Edition, John Wiley & Sons, 2002.
- Dan Sullivan, Document Warehousing and Text Mining: Techniques for Improving Business Operations, Marketing and Sales, John Wiley & Sons, Inc., 2001.
- David C. Hay, Data Model Patterns: A Metadata Map, Morgan Kaufmann, 2006.
- Barry Devlin, Data Warehouse: From Architecture to Implementation, 1996.
- W. H. Inmon, Building the Data Warehouse, Wiley, 2005.
- Kamran Parsaye and Mark Chignell, Intelligent Database Tools & Applications, John Wiley & Sons, 1993.
- Peter Cabena, Pablo Hadjinian, et al., Discovering Data Mining: From Concept to Implementation, Prentice Hall, 1997.