Agile Data Design November 2013

Over the last couple of months, I’ve been working on a Big Data white paper for my company, explaining what Big Data is and how it can be used to drive business value. I’ve also been considering how Big Data technologies fit into our current data and BI domains. In particular, I’ve been pondering the following questions:

  • What is the difference between Big Data analytics and traditional BI?
  • Will Big Data analytics replace traditional BI?
  • Will real-time analytics make BI databases (e.g., data warehouses) obsolete?
  • Is semantic (business) metadata still important in the era of Big Data?
  • Is data governance still relevant (or even possible)?

For those not familiar with the term, “Big Data” is a shorthand term denoting a number of cutting-edge data management and data analytics capabilities, including:

  • The ability to process large volumes of data (typically in the petabyte range or higher; a petabyte is 1000 terabytes or 1 million gigabytes).
  • The ability to analyze data in real or near-real time; this includes the ability to analyze streaming data (data as it is being created), as opposed to running batch processes on stores of static data.
  • The ability to do predictive analytics; that is, the ability to use data to predict what will happen in the future, rather than explaining what has occurred in the past.
  • The ability to store and analyze non-structured data types, including graph (network) data, geospatial data, and free-form text data.
  • The ability to glean meaning and value from data that is not precisely defined, and whose values are not known.

It’s important to understand, though, that “Big Data” is not just about managing a set of technologies; it’s about changing the way organizations collect, manage and use data to create business value and manage stakeholder relationships. It’s also about managing ambiguity and uncertainty (what is often referred to as VUCA, a military term denoting volatility, uncertainty, complexity and ambiguity).1 For this reason, an effective and comprehensive approach to Big Data will include the following five aspects:

  • Architecture: Big Data technology should be deployed as part of an integrated analytics architecture that includes traditional BI. Optimally, companies should have an integrated analytics environment that seamlessly supports both structured and unstructured data, and enables both real-time and traditional analytics.
  • Technology: We need to understand the technology options available to perform large-scale real-time data analysis, including petabyte and exabyte databases, parallel processing, complex event processing, non-relational data stores, and analytics technologies such as Hadoop/MapReduce and Hive.
  • Business Process: It is also important to understand the business information processes that will be created or transformed by Big Data initiatives. What value is going to be delivered, and to which group(s) of stakeholders? What stakeholder behavior will be influenced or changed? What is the desired business outcome? Are organizational or business process changes required to support the initiative?
  • Data Governance: Data governance addresses the question of how both structured and unstructured data resources can be effectively managed in order to provide the maximum business benefit at the least cost. It also addresses the issue of managing, and reducing as much as possible, the uncertainty associated with imprecise, ambiguous and/or rapidly changing data values. This helps ensure that the results of data analysis are applied in the right way, to the right business problems.
  • Human Factors: Human factors are involved in both the creation and consumption of Big Data. On one hand, we need to consider the types of people and skills needed to make a Big Data project successful, and the organizational structure needed to support them. We also need to consider the human impact of acting on the results of Big Data analytics, in ways that engage, rather than alienate, a company’s stakeholders.

The difference between traditional BI and Big Data analytics can be summarized as follows:

  • Traditional BI attempts to extract more business value from data whose properties and characteristics are known. The data inputs are predetermined, and the data is cleansed (to make it conform to known business definitions) before being loaded to a data store for analysis. Analysis of data is done “after the fact,” analyzing events that have already occurred.
  • Big Data analysis attempts to determine the potential business value of data whose properties and characteristics are not fully known. The data inputs are not predetermined or pre-cleansed, and analysis of events is done in real or near-real time, enabling companies to respond to business events as they occur (a capability known as Complex Event Processing, or CEP).
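
To make the contrast with after-the-fact analysis concrete, here is a minimal sketch of event-at-a-time pattern detection, the core idea behind CEP. Everything in it is an assumption for illustration: the `Event` fields, the `BurstDetector` class, the "failed_login" event kind and the three-in-sixty-seconds rule are hypothetical and do not come from any particular CEP product.

```python
from collections import deque
from dataclasses import dataclass


@dataclass
class Event:
    timestamp: float   # seconds since an arbitrary start (hypothetical field)
    customer_id: str
    kind: str          # e.g. "failed_login" (hypothetical event kind)


class BurstDetector:
    """Fire when `threshold` events of one kind arrive within `window` seconds."""

    def __init__(self, kind: str, threshold: int, window: float):
        self.kind = kind
        self.threshold = threshold
        self.window = window
        self._recent = {}  # customer_id -> deque of recent timestamps

    def process(self, event: Event) -> bool:
        """Consume one event as it arrives; return True if the pattern fires."""
        if event.kind != self.kind:
            return False
        times = self._recent.setdefault(event.customer_id, deque())
        times.append(event.timestamp)
        # Discard timestamps that have slid out of the time window.
        while times and event.timestamp - times[0] > self.window:
            times.popleft()
        return len(times) >= self.threshold


detector = BurstDetector(kind="failed_login", threshold=3, window=60.0)
stream = [
    Event(0.0, "c1", "failed_login"),
    Event(10.0, "c2", "failed_login"),
    Event(20.0, "c1", "failed_login"),
    Event(30.0, "c1", "page_view"),
    Event(45.0, "c1", "failed_login"),
    Event(200.0, "c1", "failed_login"),
]
# Each event is evaluated the moment it is seen, not in a later batch run.
alerts = [(e.timestamp, e.customer_id) for e in stream if detector.process(e)]
print(alerts)
```

The detector keeps only a small window of recent state per customer, so it can respond while events are still arriving instead of waiting for the data to be loaded into a store.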

These two approaches to data analytics are not mutually exclusive; indeed, each can complement and enable the other. The structured master data that is the output from traditional BI activities can be used as input to Big Data analysis of unstructured data (to help filter out unneeded or incorrect data and increase the signal-to-noise ratio of what remains).
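
As a rough sketch of this filtering idea, the snippet below uses a small set of customer master data to keep only the free-form texts that mention a known customer. The master records, the alias lists and the `tag_mentions` helper are all hypothetical, and simple alias matching stands in for whatever entity-matching technique a real project would use.

```python
import re

# Hypothetical customer master data, such as an MDM process might maintain.
CUSTOMER_MASTER = {
    "C001": {"name": "Acme Corp", "aliases": ["acme", "acme corp"]},
    "C002": {"name": "Globex", "aliases": ["globex"]},
}


def tag_mentions(texts, master):
    """Return (customer_ids, text) pairs for texts that mention a known customer."""
    patterns = {
        cid: re.compile(
            r"\b(" + "|".join(re.escape(a) for a in rec["aliases"]) + r")\b",
            re.IGNORECASE,
        )
        for cid, rec in master.items()
    }
    matched = []
    for text in texts:
        ids = sorted(cid for cid, pat in patterns.items() if pat.search(text))
        if ids:  # texts that mention no known customer are treated as noise
            matched.append((ids, text))
    return matched


raw_posts = [
    "Loving the new deal from ACME Corp!",
    "What a rainy day in Seattle.",
    "Globex support was slow, Acme was faster.",
    "acmeology is not a word",
]
relevant = tag_mentions(raw_posts, CUSTOMER_MASTER)
for ids, text in relevant:
    print(ids, text)
```

Only the posts that can be tied to a master record survive, and each one is tagged with a governed customer ID, which is what gives the downstream analysis its business context.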

Similarly, the results of Big Data analysis can be put into structured form and used to create and maintain a company’s master data. Groupon, for example, uses Big Data analysis of its website activities to create and manage its customer master data. This enables Groupon to target its customers with specific products and services, using applications such as its new GrouponNow service.2 In another example, natural language processing of documents (such as obituaries, birth notices and city directories) is used to create a searchable archive of data that includes occupation, address, telephone number, spouse, children and surviving relatives.

Big Data analytics can lower the barrier to entry for traditional BI. Creating data warehouses, data marts and master data databases using traditional ETL (Extract, Transform and Load) processes can be time-consuming and expensive. Using Big Data technologies, the time and cost of creating master data can be greatly reduced.

Two additional points should be made: First, the use of semantic (business) metadata becomes even more crucial for Big Data than for traditional BI because of the need to manage the uncertainty around the data inputs and the analyses derived from them. In traditional BI, you can have a high degree of confidence in the accuracy of your data and analytics; with Big Data you need to manage complexity, ambiguity and uncertainty. In other words, you have to be able to estimate the degree of confidence that can be placed in the results of the analysis and the degree of risk involved in making use of these results.
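
One simple way to put a number on "degree of confidence" is an interval estimate. The sketch below uses the Wilson score interval for a proportion; this is my choice of illustration, not a technique taken from the sources cited in this column, and the counts (8 of 10 versus 800 of 1,000 matching records) are made up.

```python
import math


def wilson_interval(successes: int, trials: int, z: float = 1.96):
    """Wilson score interval for a proportion (z = 1.96 is roughly 95% confidence)."""
    if trials <= 0:
        raise ValueError("trials must be positive")
    p = successes / trials
    denom = 1 + z * z / trials
    center = (p + z * z / (2 * trials)) / denom
    half = z * math.sqrt(p * (1 - p) / trials + z * z / (4 * trials * trials)) / denom
    return center - half, center + half


# The same observed rate (80%) backed by very different amounts of evidence.
small = wilson_interval(8, 10)
large = wilson_interval(800, 1000)
print("10 records:   %.3f to %.3f" % small)
print("1000 records: %.3f to %.3f" % large)
```

Both samples show the same 80% rate, but the interval for the small sample is far wider. Reporting that width alongside the headline figure is one way to make the risk of acting on a result visible to the business.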

Second, data governance is also critical for Big Data analytics, but is approached differently. In traditional BI, data governance policies are used to control the accuracy (and business relevance) of data before it is stored for analysis. In a Big Data scenario, data governance policies are applied to the outputs (results) of the analysis, rather than the inputs. That is, data governance helps ensure that data with the highest possible degree of confidence is being produced, that its value and limitations are understood, and that it is applied in appropriate ways to appropriate business problems.

Traditional data governance tools and techniques are still applicable to Big Data projects, and indeed are essential to their success. According to a 2013 Forrester/IBM survey,3 65% of companies that have successful Big Data initiatives have mature data governance and master data management (MDM) processes in place. This figure rises above 80% for companies whose year-over-year revenue growth exceeds 15%. Express Scripts,4 for example, credits its Big Data success to a heavy multi-year investment in master data management and a centralized data governance process.

The Forrester/IBM survey also notes that, for the majority of business users, the most critical Big Data issue was having access to data that puts the results of Big Data analytics in a proper business context so that it can be used to make business decisions. For companies like Express Scripts, structured master data provides the business context in which Big Data analytics can be understood and acted upon.

It’s important to understand that data governance processes are not “one size fits all.” Not all data needs to be managed in the same way, or to the same degree. Data governance requirements are determined and implemented iteratively, as the data is better understood and as uses for the data are developed. David Corrigan of IBM suggests that data governance activities be directed to those areas that need them most, rather than trying to manage everything.

Finally, let me end this column with a list of success factors for Big Data initiatives:

  • Drive Big Data initiatives from the business, not IT
  • Feed the results of Big Data analysis to process improvement teams in the business
  • Focus on stakeholder relationships and positive behavioral change
  • Use data governance to direct the analysis and application of data
  • Enlist support from senior management
  • Set specific business goals, with time/budget/ROI targets
  • Create teams of business and subject-matter experts
  • Be willing to be data-driven

NOTE: I’d like to make this a dialogue, so please feel free to email me your questions, comments and concerns. Thanks for reading!


  1. King, Julia. “Managing Chaos (As Usual).” Computerworld, September 9, 2013.
  2. Boyd, E. B. Interview with LinkedIn’s Reid Hoffman on Groupon’s Big Advantage: Big Data. Fast Company, November 18, 2011.
  3. Corrigan, David and Michele Goetz. “Building Confidence in Big Data with Information Integration and Governance.” InformationWeek webinar, sponsored by IBM and Forrester, September 17, 2013.
  4. King, Julia. “Deep Thinkers.” Computerworld, July 15, 2013.



About Larry Burns

Larry Burns has worked in IT for more than 25 years as a database administrator, application developer, consultant and teacher. He holds a B.S. in Mathematics from the University of Washington and a master’s degree in Software Engineering from Seattle University. He currently works for a Fortune 500 company as a database consultant on numerous application development projects, and teaches a series of data management classes for application developers. He was a contributor to DAMA International’s Data Management Body of Knowledge (DAMA-DMBOK), and is a former instructor and advisor in the certificate program for Data Resource Management at the University of Washington in Seattle.