I’ve been thinking of Mr. Toad a lot since interest in BI and analytics has exploded at the company I work for. It seems like everybody and his dog is running out and buying BI tools and data appliances, and arranging for POCs with every BI vendor under heaven (and lord, aren’t there a lot of them!).
As our company’s BI architect, it’s my job to impose some degree of order and coherence on our BI/analytics infrastructure, and provide some guidance and direction to the business managers who are trying and buying these products, without quenching their enthusiasm for BI itself. After all, the emerging interest in things like BI and Big Data represents a golden opportunity for those of us in data management. For decades, we’ve been preaching the business value of data reuse, and advocating for data management practices that improve the quality, reusability and business value of our data. Now, at long last, our companies are climbing aboard the data reuse bandwagon. This is the acid test of how good (or bad) our data management practices have been!
The best definition of BI I have ever encountered comes from knowledge management expert Thomas Davenport (2007): “BI is a set of technologies and processes that use data to understand and analyze organizational performance.”
This definition tells us two very important things about BI:
- BI is not simply the use of a tool, or set of tools. There needs to be a defined process associated with the use of BI technology.
- The purpose of BI is to understand our current organizational processes and improve them; that is, BI needs to be done in the context of process improvement within an organization.

Doing BI well also draws on several related disciplines:
- Data management – the process of ensuring the quality, accessibility, reusability and business value of an organization’s data assets. This is usually an IT function.
- Data governance – the process of determining and allocating responsibility for the definition and content of data. This is usually a business function.
- Process improvement – the process of determining where and how incremental improvements in a company’s operating processes can be made.
- Stakeholder management – the various processes of managing a company’s relationships with its key stakeholders (e.g., customer relationship management, supply-chain management, human resource management, etc.).

With these disciplines in mind, a self-service BI process (and the tool that supports it) should include the following:
- Identification of one or more business processes that will be improved (or created) using the results of the data analyses (e.g., Customer Retention), along with accompanying business stakeholders and subject matter experts.
- A target metric, or set of target metrics, to be applied to the business process. For example, “Reduce customer churn for servers by 10%”.
- Identification of the source data needed to solve the problem. The BI tool should provide the ability to sift through large volumes of data quickly and easily, allow the business user to identify and extract the data that is needed, and document the metadata associated with this data (e.g., source, currency, business meaning, etc.) in a metadata repository.
- Cleansing and transformation of the data. The BI tool should provide the means for doing any necessary cleansing and transformation of the source data (especially if the data is coming from transactional databases), so that computations and aggregations can be done. Again, it should be possible to document these transformations in a metadata repository for future reference.
- Identification and correction of bad data. The tool should provide the ability to identify data that needs to be corrected, and there should be a defined process for correcting this data, both within the BI tool or repository, and at the source of the data.
- Identification and extraction of “golden” data. The tool should provide the ability to identify data that can be used across the organization as master data, and allow this data to be extracted out to a separate master data repository for general reuse.
- Data validation. The tool should support some process of validating the data (as well as the results of the data analyses). Can the results be compared with reports or analyses from other sources to provide at least some degree of reassurance? Do summarizations of the results pass a “smell test”? Also, the means by which the data was validated and the results of the validation process should be documentable as metadata, so that users of the analyses will know how much confidence they can place in them.
- Data security. The tool should support a process for defining access both to the source data and to the results of data analyses, so that only authorized persons (ideally, specified via Windows Active Directory groups or some similar mechanism) can view the data. The tool should support both row-based and column-based partitioning of the data.
- Data distribution. The tool should support the publication or access of the results of analyses to authorized subscribers, either through email distribution of results (push) or through some sort of portal (pull).
- Metadata management. As already mentioned, the tool should provide a means of maintaining metadata about both the source data and the results of data analysis. People looking at a report or a spreadsheet will want to know where the data came from, how current the data is, what transformations or manipulations of the data were done, what formulas were used in the calculations, and some direction as to what business purpose(s) the data can and cannot be used for. Metadata is crucial for managing both expectations and risk; that is, people should be made aware of both the business value of the analysis and the risk to the business (if any) of using it as the basis for business decisions.
- Data governance. During the process of data discovery and analysis, questions will undoubtedly arise about the business meaning and definition of certain data fields, and about whose responsibility it is for defining this data and managing its content. These are data governance questions. If a data governance structure and process already exists within the organization, then self-service BI should be made a part of that process. Otherwise, some sort of mechanism (perhaps using SharePoint or some similar portal) should be put into place to manage the questions and issues that arise, and assign responsibility for resolving them to the appropriate persons or groups.
- Data retention. The tool should support some mechanism for documenting data retention requirements for both the source data and the results of analyses, identifying data that is no longer current enough to be useful, and purging this data from the repository. For example, in the case of one financial institution, a 500 terabyte (TB) database was found to contain 1 TB of active data, and 499 TB of analytics output, much of which was outdated and of no use [3].
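To make the target-metric idea concrete, here is a minimal Python sketch of the churn example from the list above. The customer counts and the baseline churn rate are invented for illustration; only the “reduce churn by 10%” target comes from the text:

```python
# Hypothetical monthly figures for the churn example; all numbers are
# illustrative, not from any real system.
customers_start = 2000
customers_lost = 150
baseline_churn = 0.09  # assumed last-year churn rate

churn_rate = customers_lost / customers_start   # 0.075
target = baseline_churn * 0.9                   # a 10% reduction on baseline
met_target = churn_rate <= target
```

The point is not the arithmetic but the discipline: the metric, its baseline, and the target are all written down before the analysis starts, so success can be judged objectively.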
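The cleansing and transformation step can likewise be sketched in a few lines. The field names, date formats, and the `transform_log` list (a stand-in for a real metadata repository) are all assumptions, not features of any particular BI tool:

```python
from datetime import datetime

# Hypothetical raw extract from a transactional system.
raw_rows = [
    {"cust_id": " 1001", "signup": "2013/01/15", "region": "west "},
    {"cust_id": "1002 ", "signup": "15-01-2013", "region": "WEST"},
]

transform_log = []  # simple stand-in for a metadata repository


def normalize_date(text):
    """Try the date formats we know about; fail loudly on anything else."""
    for fmt in ("%Y/%m/%d", "%d-%m-%Y"):
        try:
            return datetime.strptime(text, fmt).date().isoformat()
        except ValueError:
            continue
    raise ValueError("unrecognized date format: " + text)


def clean_row(row):
    # Trim whitespace, normalize case, standardize dates, and record
    # what was done so the transformation is documented for reuse.
    cleaned = {
        "cust_id": row["cust_id"].strip(),
        "region": row["region"].strip().lower(),
        "signup": normalize_date(row["signup"]),
    }
    transform_log.append({"input": row, "output": cleaned})
    return cleaned


cleaned_rows = [clean_row(r) for r in raw_rows]
```

Note that every transformation is logged alongside the result; that log is exactly the kind of metadata the tool should let you capture for future reference.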
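The validation “smell test” described above often amounts to comparing a computed total against an independent report, within some tolerance. A minimal sketch, with the 2% tolerance chosen arbitrarily:

```python
def passes_smell_test(computed_total, reference_total, tolerance=0.02):
    """Flag an analysis whose total diverges from an independently
    produced reference figure by more than `tolerance` (relative)."""
    return abs(computed_total - reference_total) / reference_total <= tolerance
```

Both the reference figure used and the outcome of the check are worth recording as metadata, so consumers of the analysis know how much confidence to place in it.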
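Row-based and column-based partitioning can be illustrated with a toy access-control check. In practice the group membership would come from Active Directory or a similar mechanism, as the text suggests; the rules and field names below are purely hypothetical:

```python
# Hypothetical per-group rules: which rows a group may see (a predicate)
# and which columns (a projection).
row_rules = {"sales_west": lambda r: r["region"] == "west"}
col_rules = {"sales_west": ["cust_id", "region"]}  # revenue column withheld

data = [
    {"cust_id": "1001", "region": "west", "revenue": 5200},
    {"cust_id": "1002", "region": "east", "revenue": 3100},
]


def visible_rows(group, rows):
    # Apply the row predicate, then project only the permitted columns.
    allowed_cols = col_rules[group]
    predicate = row_rules[group]
    return [{k: r[k] for k in allowed_cols} for r in rows if predicate(r)]
```

A member of the hypothetical `sales_west` group would see only western customers, and never the revenue column.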
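Finally, a retention policy like the one implied by the 500 TB example can be sketched as a simple cutoff-based purge. The two-year window is an assumed policy, not one from the article:

```python
from datetime import date, timedelta

# Hypothetical catalog of stored analysis outputs.
results = [
    {"report": "churn_2010", "created": date(2010, 6, 1)},
    {"report": "churn_2013", "created": date(2013, 5, 1)},
]

retention = timedelta(days=730)  # assumed 2-year retention policy
today = date(2013, 6, 1)

# Keep only outputs still inside the retention window; everything else
# is a candidate for purging from the repository.
keep = [r for r in results if today - r["created"] <= retention]
```

Even this trivial version depends on each output carrying a creation date, which is one more argument for capturing metadata at the moment an analysis is produced.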