In this article, we focus on identifying the various dimensions of data architecture, their hierarchies and their levels.
Dimensional analysis can be defined as identifying the various facets of a problem and its solution by measuring them along the different dimensions that influence them. It is useful in studying the structural and behavioral aspects of problems and solutions, both in isolation and in matching them appropriately. The first step in dimensional analysis is identifying and extracting the different dimensions along which the enterprise problem can be classified. The next step is to understand the organization of each classification in terms of any hierarchical relationship it exhibits, and how many levels exist in that hierarchy. It is also interesting to note whether some of these dimensions co-exist or are mutually exclusive in a particular enterprise’s data architecture at any instant. A similar study can be made of the sub-dimensions, asking whether they are overlapping or disjoint.
Multiple Dimensions of Data Architecture
Dimensions are orthogonally independent ways of looking at a particular measure. A business looks at sales by region, period, product and promotion, which are four dimensions that determine the sales figure during analysis. Similarly, we analyze data architecture by the requirements, the technologies available, the constraints that decide the choices, the domain they relate to, and so on. In this context, dimensions are also called perspectives.
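To make the idea concrete, the sales example above can be sketched in code. This is a minimal illustration with made-up records; only the dimension names mirror the ones in the text.

```python
# A minimal sketch of slicing one measure (sales) along several dimensions.
# The records and amounts are illustrative, not data from the article.
from collections import defaultdict

sales = [
    {"region": "East", "period": "Q1", "product": "A", "promotion": "none", "amount": 100},
    {"region": "East", "period": "Q1", "product": "B", "promotion": "coupon", "amount": 150},
    {"region": "West", "period": "Q2", "product": "A", "promotion": "coupon", "amount": 200},
]

def total_by(records, dimension):
    """Aggregate the sales measure along a single dimension."""
    totals = defaultdict(int)
    for r in records:
        totals[r[dimension]] += r["amount"]
    return dict(totals)
```

The same records can be totaled by any of the four dimensions simply by changing the `dimension` argument, which is what makes the dimensions orthogonally independent views of the one measure.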
In Figure 1, three such perspectives (the administrator’s view, vertically from top to bottom; the user’s view, horizontally from left to right; and the requirements view, with functional requirements in front and non-functional ones behind) have been depicted as a cube. Our visual cognition allows us to visualize only three orthogonal dimensions at once, hence the limit of three and the depiction as a cube. But many other dimensions and sub-dimensions are also possible.
Other dimensions that could be added are data types, domains, classification of data, security levels depending on the importance of data, accessibility requirements of differently abled users, geographical distribution or location of users, and the life cycle time of the business process. Some dimensions, like domains and data types, could be combined where correlations exist, but they can also be kept and studied as independent of each other.
Figure 1: An Example of Three Possible Orthogonal Perspectives of Data Architecture of an Enterprise
A. Life Cycle Phases of the Industry
Like the life cycle stages of an enterprise, the life cycle stages of its industry also play an important role in the architecture. To start with, there is the birth of an industry. Next comes the nascent growth of the industry, where multiple entrants address various disparate, unorganized requirements. Gradually, some dominant players emerge and consolidate through acquisitions, mergers and takeovers. In a new industry, there are no domain experts and no established standards to follow; this stage is characterized by very little industry and government regulation. A well-matured industry has established players, a standard set of practices, and many domain experts who have shaped the organization of the industry. For example, metadata standards for the IT industry are not yet fully standardized, though many efforts are under way through the Common Warehouse Metamodel (CWM), the Meta Object Facility (MOF), and the Open Information Model (OIM) and Meta Data Interchange Specification (MDIS) of the Meta Data Coalition (MDC).
B. Life Cycle Stages of an Enterprise
The life cycle stages of an enterprise play an important role in the architecture. To start with, there is the birth of an enterprise. Next comes the survival of the enterprise amidst its competition, nurtured by its customers. The birth and survival of the enterprise can happen at any of the life cycle stages of the industry itself. Depending on that stage, the enterprise might be positioned as a leader at the birth of a nascent industry, or have to wrest that position from its competitors in a mature one.
C. Layering of Data Architecture
Data architecture layers can be of various types.
- Horizontal data flow layer from data source to data users.
- Vertical layered architectures that depict the services view of data.
- Abstraction levels for administrators.
- Abstraction levels for users.
Layering of data architecture is a best practice that can be applied from various perspectives. First, based on data flow from creators to consumers, layers can be classified as: online transaction processing (OLTP); the operational data store (ODS), where most of the consolidation, integration and data cleansing happens for operational reporting; data marts, where individual business processes are analyzed as dimensional models with conformed dimensions across marts; online analytical processing (OLAP) cubes, where multidimensional calculations are available for quick, ad hoc analytics on large data volumes; and finally the scorecards and dashboards that serve as the visualization for end users over the web. This is a horizontal flow architecture based on how data flows from its creation to its consumption: a cradle-to-grave data life cycle lineage trace.
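The horizontal flow above can be sketched as a tiny pipeline. This is an illustrative simplification with hypothetical rows, not a reference implementation: raw OLTP rows are cleansed and consolidated into an ODS-style set, then aggregated into a data-mart-style summary.

```python
# Hypothetical sketch of the horizontal data flow: OLTP -> ODS -> data mart.
def to_ods(oltp_rows):
    """ODS step: cleanse and consolidate (drop incomplete rows, normalize names)."""
    return [
        {"customer": r["customer"].strip().title(), "amount": r["amount"]}
        for r in oltp_rows
        if r.get("customer") and r.get("amount") is not None
    ]

def to_mart(ods_rows):
    """Data-mart step: aggregate per customer, as a dimensional summary would."""
    mart = {}
    for r in ods_rows:
        mart[r["customer"]] = mart.get(r["customer"], 0) + r["amount"]
    return mart
```

A dashboard layer would then only render the mart output; the point of the layering is that each downstream layer consumes an already-cleansed, already-shaped view rather than raw transactions.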
Second, a layering from data services down to data storage classifies the tiers. The top layer holds the business services (business object services, master data services, analytics services, and business intelligence services). Below this is a data-as-a-service layer, where data is available as individual entities (order master, vendor-part contract, etc.) without the semantics of business objects, master data, analytics or business intelligence. Below this are the application services that serve application-specific data, such as that from Oracle Financials or SAP. Below that is the data integration layer, where data is wrested directly out of data stores, databases and files using custom-written programs. At the bottom is the actual storage in its internal physical file representation. This is a vertical flow in which the same data is served at various levels of semantic abstraction, depending on the sophistication of the consumer of these data services.
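A rough sketch of this vertical layering, with hypothetical record formats and layer functions: each layer wraps the one below it and adds semantics.

```python
# Illustrative sketch of the vertical layering. The storage format, keys
# and field names are all assumptions made for this example.
STORAGE = {"ORD-1": "1001|OPEN|250"}  # raw internal physical representation

def integration_layer(key):
    """Data integration: wrest the raw record out of the store and split fields."""
    return STORAGE[key].split("|")

def application_layer(key):
    """Application services: typed, named fields, but no business semantics yet."""
    order_id, status, total = integration_layer(key)
    return {"order_id": int(order_id), "status": status, "total": int(total)}

def business_object_service(key):
    """Top tier: expose the record as a business object with derived semantics."""
    order = application_layer(key)
    order["is_open"] = order["status"] == "OPEN"
    return order
```

A sophisticated consumer calls `business_object_service`; a less demanding one can stop at any lower layer and accept the weaker semantics that layer offers.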
Layering could also be based on administrators of data, from governance to data architects to designers to data management implementation.
Finally, we can use layering based on users of data, ranging from strategic decision makers, who use data with high leverage, to tactical managers, who benefit from analytical intelligence, down to the operational users who use it for the routine transactions that propel the day-to-day functioning of the enterprise.
D. Stakeholders or People
The people dimension has two sub-dimensions: the data administrator’s and the data users’ perspectives. The data administrator’s perspective is similar to the Zachman Framework’s [2] various perspectives as horizontal row headers. Each level deals with a different level of abstraction, and there is a semantic interface between adjacent layers.
- Data stewardship (planner’s / custodian’s / trustee’s) view: The governance view is the outlook of the guidelines and directives that establish the control and organizational structure of data architectural design and of the operational data management activities that implement the architecture. Here the conceptual mental model (conception, evaluation, mental modeling, analysis, design, problem, and solution) takes shape, gains form, and is finally implemented. These roles could extend to data modelers and database administrators.
- Data users’ view: The data users’ view has sub-dimensions that overlap. First, there are strategic, tactical and operational users. Second, there are internal and external users of the enterprise. Third, there are creators and consumers of data, denoting the data flow architecture. Finally, there are owners (like an account manager for account master data) and users (a sales assistant using the account master data for transaction data entry).
E. Requirements Classification View
Requirements are broadly classified into functional and non-functional requirements. Functional requirements involve reusing an industry-accepted data model, complying with laws and regulations of government, industry and other regulatory bodies that are mostly specific to the domain or the enterprise, and enhancing the data model to suit the particular enterprise’s requirements. Major portions of these activities are generally done by the enterprise data architect with the technology available.
Non-functional requirements are enabled by technology vendors through innovative products and partner offerings, and the choices offered are considered and selected by the enterprise data architect.
Here the major portion of the work is done by vendors as product offerings. This might involve choosing between shared-everything and shared-nothing architectures, or between scaling up and scaling out.
F. Intent, Data and Knowledge
Intent is the metadata, rules and constraints as per the design of the system (intent to pay based on a salary certificate, appraisal of collateral, title check of collateral, intent to drive well based on passing the driver’s license exam and vision test). Note the correlation between rules and data.
Data is the actual events and facts recorded from real-world experience (check paid on 31st March 2009, value of property on 31st March 2009, title of property on 31st March 2009).
Knowledge is belief that depends on the historical pattern of the above data, observed over a period of time, with inferences arrived at with sufficient confidence based on data rather than on philosophical, intuitive, emotional or impulsive grounds (will pay, based on credit history; would have committed this crime, based on police records and criminal history; will drive safely, based on driving history).
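The three levels can be sketched as follows. The salary rule, the payment history, and the confidence threshold are all illustrative assumptions, not values from the text.

```python
# Intent is a design-time rule; data is recorded facts; knowledge is an
# inference over the history of those facts. All values are hypothetical.
INTENT_RULE = {"min_salary": 3000}  # intent: the rule designed into the system

payment_history = [  # data: events recorded from the real world
    {"date": "2009-03-31", "paid": True},
    {"date": "2009-04-30", "paid": True},
    {"date": "2009-05-31", "paid": False},
]

def intends_to_pay(salary, rule=INTENT_RULE):
    """Intent: willingness to pay judged by the salary-certificate rule."""
    return salary >= rule["min_salary"]

def will_pay(history, confidence=0.6):
    """Knowledge: belief derived from the historical pattern, with a confidence bar."""
    paid = sum(1 for h in history if h["paid"])
    return paid / len(history) >= confidence
```

The correlation between rules and data shows up here directly: the same applicant can pass the intent rule yet fail the knowledge check, or vice versa, because they are judged at different levels.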
G. Data Types View
Data types can be extended for specific domains, for example software requirements and software architecture, each with defined attributes (like IsFunctionalRequirements, IsAPerformanceRequirement, etc.). Such extensions could evolve into repositories or warehouses for requirements and architectures. A configurable matching engine, based on relevance and the significant dimensions and constraints selected, could rank the possible matching architectures. This is very similar to SQL matching the records stored in the database (in this case, the architecture solutions) against the selection criteria given in the query (in this case, the requirement specifications). Cost parameters for the operators, query execution plans, and indexing have to be developed for such custom data types. They are called Data Cartridges (Oracle), Data Extenders (DB2), and DataBlades (Informix).
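A toy version of such a matching engine might look as follows. The candidate architectures, attribute values and weights are hypothetical; only the attribute names follow the ones mentioned above.

```python
# Sketch of a configurable matching engine: candidate architectures are
# ranked by how many weighted requirement criteria they satisfy.
architectures = [
    {"name": "shared-nothing MPP", "IsAPerformanceRequirement": True,  "IsFunctionalRequirements": False},
    {"name": "single-node RDBMS",  "IsAPerformanceRequirement": False, "IsFunctionalRequirements": True},
]

def rank(candidates, criteria, weights):
    """Score each candidate by weighted matches against the selection criteria,
    then return candidate names, best match first."""
    scored = []
    for c in candidates:
        score = sum(weights[k] for k, v in criteria.items() if c.get(k) == v)
        scored.append((score, c["name"]))
    return [name for score, name in sorted(scored, reverse=True)]
```

The `criteria` dict plays the role of the SQL WHERE clause, and the stored candidates play the role of the table rows; a real engine would add operator cost parameters and indexing, as the text notes.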
Data types can be classified broadly into structured, unstructured and semi-structured or extensible structured.
Structured Data: Structured data consists of data that has been properly classified as belonging to descriptive and measurable properties of entities and their interrelationships. Each piece of data has associated metadata that defines its type (alphanumeric or numeric), its length, its pattern, and the valid values it may hold. Many data models capture such organization, and they are generally classified as navigational or value-based. In navigational models, the records of various entities are physically linked through their interrelationships; this makes query access very fast, but such models are not flexible for structural changes or ad hoc queries. The network, hierarchical, and object-oriented models come under this classification. The hierarchical model is a special case of the network model that branches only in the top-down direction; the network model additionally handles many-to-many relationships through links.
The object model is good for applications that are extremely complex and need rich semantics to handle that complexity. The relational model is good for enterprises where a lot of interaction happens with various systems, and where understanding the model is necessary for all stakeholders, both to draw various kinds of ad hoc reports and to enable enterprise-wide systems integration.
The semantic data model represents entities and relationships as triples. A social network is one example, where social network analysis can be done on these triples in the database. Graph data models are similar to semantic data models. Composite data models are also possible, where more than one type of model (say, network and relational) co-exists in a database. Some databases also support multiple storage engines catering to various types of data depending on their usage, for example a transaction engine optimized for updates alongside a columnar store optimized for queries.
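A minimal sketch of the triple representation, with a social-network-style pattern query over made-up data:

```python
# Subject-predicate-object triples, the core of semantic/graph data models.
# The people and relationships are invented for illustration.
triples = [
    ("alice", "follows", "bob"),
    ("bob", "follows", "carol"),
    ("alice", "follows", "carol"),
]

def query(store, subject=None, predicate=None, obj=None):
    """Return triples matching the given pattern; None acts as a wildcard."""
    return [
        t for t in store
        if (subject is None or t[0] == subject)
        and (predicate is None or t[1] == predicate)
        and (obj is None or t[2] == obj)
    ]
```

Leaving any position as a wildcard gives the graph-style traversals that social network analysis builds on, such as "whom does alice follow" or "who follows carol".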
Extensible/Semi-Structured Data: XML, text and natural language representations, web structure, web content, and web usage captured through clickstream data are all examples of semi-structured data, which falls somewhere between the two extremes of structuredness.
Unstructured Data: Unstructured data consists of multimedia data such as images, audio and video.
H. Security Classification View
Security casts its influence heavily on data architecture, since a security breach is ultimately about data being compromised. Depending on the classification of the information involved, various special considerations apply. A good example is master data management hubs: these are high-risk, vulnerable targets for data theft, since their data is collated from all over the enterprise, cleansed, and fed into the lifeline of the enterprise in real time. Gaining access to this data is therefore a very high-value proposition for those on the lookout for such opportunities, and government and regulatory bodies mandate that such risky data be separated out and put behind multiple security filters. The same may not be true for relatively low-value, ordinary transaction data. Similarly, the nature of the enterprise’s business matters: defense, national government, and space-related projects might follow a different framework that is driven not by monetary concerns but by mission-critical goals. Next in line come business enterprises that are multinational, national or regional. Some of them differ by industry and by the type of information they store, like customers’ credit card numbers or social security numbers. They are at risk not only for their own information but are liable for their customers’ information as well, as with personal information in credit card and banking applications, or a person’s medical history in the healthcare and insurance industries.
I. Domain View
Various domains such as retail, healthcare and insurance, biological life sciences, weather, and banking and capital markets exhibit unique characteristics. As data architects move up the career ladder, their expertise in domain-specific technology practices assumes larger importance. A principal architect, for example, would be expected to know not just the technical aspects but also the unique aspects that characterize the domain. While some domains may require unconventional data types, others might follow different rules in each country, and still others might require streaming analytics.
J. Product or Project
Data architecture for a product differs significantly from that for a program containing multiple projects, or for a single project. A product’s data architecture needs to be able to talk to various other components supported by multiple vendor products, and multiple releases, backward compatibility, support for earlier versions, and bug fixes have to be planned. A sub-dimension of product could be the development methodology: proprietary, open source, or community development. A project- or program-based enterprise data architecture might instead deal with problems such as mergers and acquisitions, regulatory compliance, and deploying real-time operational intelligence.
K. Volume and Nature of Data
The volume and nature of data have a significant impact on the type of data architecture considered. Small volumes of static data, like lookup or reference data, may be good candidates for keeping in a cache, while large volumes of static data might call for a data warehouse architecture. Streaming data, like network traffic or stock market ticker data, might need stream-processing techniques and the associated architectural components. Very large volumes, like the explosive data on the web, may not be amenable to data warehousing at all, simply because they change continuously and are so huge that they require analytics to be done in place, where and as they reside.
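Two of these responses to volume and nature can be sketched side by side: a small static lookup held in an in-process cache, and a running aggregate over a stream that never retains the full data. Names and values here are illustrative.

```python
# Small static reference data: cheap to hold entirely in memory.
COUNTRY_CACHE = {"IN": "India", "US": "United States"}

class StreamAverage:
    """Stream-processing sketch: maintain a running average over ticks
    without ever storing the stream itself."""
    def __init__(self):
        self.count = 0
        self.total = 0.0

    def observe(self, tick):
        self.count += 1
        self.total += tick

    def average(self):
        return self.total / self.count if self.count else 0.0
```

The lookup table is a one-time load; the stream aggregator keeps only constant state no matter how many ticks arrive, which is exactly why streaming data calls for different architectural components than warehoused data.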
L. Location of Data and Location of Processing
Data appliances come in varied architectures, ranging from data-centered computing to conventional shared-memory (random access memory) processing and stream processing. Based on where data gets referenced, many models exist, including in-memory databases and conventional disk-based processing. Various massively parallel processing database machines also exist that use the shared-nothing principle of parallelizing storage and processing to enhance performance.
M. Source of Data
Based on the source of data, we have web farming, which brings external data into the database for a 360-degree view of an enterprise from the point of view of suppliers, customers, competitors and regulators.
Enterprise data is also augmented by customer database marketing agencies and external address data providers.
N. Distribution & Replication of Data
Distribution of location, autonomy, and heterogeneity of data, its model and its representation all affect data architecture decisions. The replication techniques used vary among batch, synchronous and asynchronous types.
O. Nature of Data
- Master Data: These are high-value, reusable assets that cater to both operational and analytical requirements. Their lifetime is long, and so is their usefulness.
- Audit: This involves the audit data from monitoring privacy, security, reliability, compliance and safety-related activities.
- Transaction: This consists of the key business and business support transactions that keep track of routine activities. The lifetime of transaction data is quite small compared to lookup, reference or master data.
- Lookup/Reference: This consists of standard data available in the public domain like list of country names or the unique DUNS numbers for businesses.
- External Data: This includes address data (US Postal Service codes).
- Metadata: This is the intent, or the rules, that define what the business enterprise looks like. Later, the actual data is analyzed statistically to see how much of this intent actually holds in the field.
This data serves many purposes:
- As a help desk or directory for the enterprise’s information assets.
- As a semantic translation layer interconnecting components that have an impedance between them. As the dimensions grow, at the intersection of each pair of dimensions there is a semantic layer, a mapping necessary for the translation. This can be visualized in the case of:
- The logical data map (ETL) bridges the OLTP-source-to-DW/OLAP-target impedance.
- OLAP cubes bridge the OLTP-to-BI impedance.
- ORM mapping bridges the object-to-relational impedance.
- The semantic layer bridges data as a service (BoD) to application-specific data model views.
- The conceptual layer bridges application data views to the physical data layer.
- XML schema mapping bridges XML BoDs (OAGIS) to relational table schemas.
- Views and synonyms achieve location and fragmentation transparency over distributed databases, making them appear as a single whole.
- An accessibility interface translates bits into something the blind can touch, feel and read, like Braille.
- The same role is played by loudspeakers and telephones that translate voice into data or electrical signals, and electrical signals back into voice, to handle the human interface.
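One such bridge, a logical data map from an OLTP source schema to a warehouse target schema, can be sketched as a simple field mapping. The column names are assumed for illustration.

```python
# Sketch of a logical data map: translate an OLTP source row into the
# target warehouse schema. Source and target column names are hypothetical.
LOGICAL_DATA_MAP = {
    "cust_nm": "customer_name",   # source column -> target column
    "ord_amt": "order_amount",
}

def translate(source_row, mapping=LOGICAL_DATA_MAP):
    """Apply the semantic mapping; unmapped source fields are dropped."""
    return {target: source_row[src] for src, target in mapping.items() if src in source_row}
```

The mapping dict itself is metadata in exactly the sense discussed above: it is not the data, but the record of how one layer's vocabulary translates into another's.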
These bridges are a rich source of metadata (like the Rosetta Stone, which acted as a translation device between the Greek, Demotic and hieroglyphic scripts, since it carried the same message in all three: the earliest equivalent of a combined metadata repository and semantic layer, from a bygone era). Such universal metadata helps to trace from any domain in the cube of enterprise metadata to anywhere else in the metadata cube.
Conclusion
We have described data architecture using various terms, dimensions and sub-dimensions. This has helped us appreciate the richness, variety and complexity of data architecture problems and solutions, and this awareness gives a better idea of where we are headed and what the future challenges might be.
Acknowledgement
The first two authors, Sundararajan and Anupama Nithyanand, are grateful to their mentor and third author, S V Subrahmanya, Vice President at the E-Commerce Research Labs, for seeding and nurturing this idea, and to Dr. T. S. Mohan, Principal Researcher, Dr. Ravindra Babu Tallamraju, Principal Researcher, and Dr. Sakthi Balan Muthiah, Manager-Research at the E-Commerce Labs at Education & Research, Infosys Technologies Limited, for their extensive reviews and expert guidance in articulating these ideas. The authors would also like to thank all their colleagues and the participants of the authors’ training and knowledge-sharing sessions at Infosys Technologies Limited, who contributed positively to these ideas.
The authors would like to acknowledge and thank the authors and publishers of referenced papers and textbooks, which have been annotated at appropriate sections of this paper, for making available their invaluable work products that served as excellent reference to this paper. All trademarks and registered trademarks used in this paper are the properties of their respective owners / companies.
References
- Sundararajan PA, Anupama Nithyanand and Subrahmanya SV, “Dimensional Analysis of Data Architecture,” communicated earlier to TDAN.
- “The Zachman Framework for Enterprise Architecture,” Zachman Institute for Framework Architecture (www.zifa.com, www.zachmaninternational.com).
- Ralph Kimball and Margy Ross, Data Warehouse Toolkit: The Complete Guide to Dimensional Modeling, 2nd Edition, John Wiley & Sons, 2002.
- Dan Sullivan, Document Warehousing and Text Mining: Techniques for Improving Business Operations, Marketing and Sales, John Wiley & Sons, Inc., 2001.
- David C. Hay, Data Model Patterns: A Metadata Map, Morgan Kaufmann, 2006.
- Barry Devlin, Data Warehouse: From Architecture to Implementation, 1996.
- W. H. Inmon, Building the Data Warehouse, Wiley, 2005.
- Kamran Parsaye and Mark Chignell, Intelligent Database Tools & Applications, John Wiley & Sons, 1993.
- Peter Cabena, Pablo Hadjinian, et al., Discovering Data Mining: From Concept to Implementation, Prentice Hall, 1997.