Industry-leading organizations recognize and manage data as a strategic asset. By ensuring high data quality, they can rely on data for critical decision making.
Business intelligence and analytics spending has been increasing dramatically for several years, incorporating traditional data warehouse platforms as well as data lakes composed of SQL and NoSQL technologies, dispersed across on-premises and cloud environments.
Major investments and effort are spent on extract, transform, and load (ETL) processes that move data from source systems into data warehouses and data marts. Incorrect decisions based on poor data can be disastrous, so how can we ensure that we are utilizing the proper data to begin with? To do so, we must be able to answer the following data quality questions:
- Is the data accurate?
- Is the data timely?
- Is the data complete?
- Is the data consistent?
- Is the data relevant to the decision?
- Is the data fit for use?
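Several of these dimensions can be expressed as automated checks that run against a data set. The following is a minimal sketch in Python, assuming a simple list of records and hypothetical field names (`amount`, `updated_at`); real checks would run against the actual warehouse or lake tables:

```python
from datetime import datetime, timedelta, timezone

# Hypothetical order records; the field names are illustrative only.
records = [
    {"order_id": 1, "amount": 120.0,
     "updated_at": datetime.now(timezone.utc)},
    {"order_id": 2, "amount": None,
     "updated_at": datetime.now(timezone.utc) - timedelta(days=40)},
]

def completeness(records, field):
    """Fraction of records with a non-null value for `field`."""
    return sum(r[field] is not None for r in records) / len(records)

def timeliness(records, field, max_age):
    """Fraction of records updated within the last `max_age`."""
    cutoff = datetime.now(timezone.utc) - max_age
    return sum(r[field] >= cutoff for r in records) / len(records)

print(completeness(records, "amount"))                        # 0.5
print(timeliness(records, "updated_at", timedelta(days=30)))  # 0.5
```

Accuracy and relevance, by contrast, generally require business rules or human judgment and cannot be reduced to a generic function; checks like these cover only the mechanically verifiable dimensions.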
This challenge has been complicated further by exponential data growth. Some studies show that up to 90 percent of the world’s data has been created in the past two years alone, and the trend is accelerating, making data quality assurance even more challenging. It is compounded further by the increasing complexity of the data ecosystem that every organization operates within. Most corporations have a variety of software applications and data stores scattered across multiple heterogeneous platforms, utilizing a spider web of point-to-point interfaces to move data back and forth. This includes ERP solutions and externally hosted SaaS solutions. The result is usually a high level of data redundancy and inconsistency. Some organizations have adopted more sophisticated architectures such as service-oriented architecture, but a high degree of complexity still remains.
When we examine most data environments, we find that the ETL processes usually incorporate at least some degree of data cleansing and transformation to render the data usable at the point of consumption. However, this can be quite risky if we do not truly understand the data and the changes that have occurred on its journey through the organization’s systems.
This is analogous to the problems that occurred on manufacturing production lines prior to the early 1980s: complex products were built from thousands of parts and sub-assemblies, then inspected for quality conformance after they rolled off the assembly line. Inspection does not improve the product; it simply identifies the defects that need to be addressed. Defective items were scrapped or reworked at significant cost, but the origin of the defects often went undetected, so the same problems recurred. To address this, the quality movement of the 1980s focused on many practices, a few of which are stated here because they are highly relevant to data in the context of this discussion:
- Validation of the inputs to every discrete process, preventing usage of defective components
- Traceability of components and sub-assemblies within finished goods to their point of origin
- Empowerment of front-line workers to address problems, even if it meant halting the entire production line
- Continuous improvement of all processes
Unlike physical products, it can be extremely difficult to detect and identify defects in data. However, we can utilize the approach and lessons learned from the quality discipline. In order to succeed, a collaborative culture must be established with a commitment to data quality, from senior executives to the front-line workers who create and modify data on a daily basis. Procedures must be put in place to ensure that data is accurately captured and recorded as it is created and modified through each business process. Workers must be empowered to correct any data that is wrong as part of their daily job function (with proper audit trails). If data originates outside the organization, it must be validated prior to use. Data governance and stewardship must be established so that responsibilities are clearly understood and agreed to by all parties.
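The "empowered correction with an audit trail" idea can be sketched as a small correction log attached to each record. This is a hedged, in-memory illustration with hypothetical class and field names; a production system would persist the trail and capture the change within the application of record:

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone

@dataclass
class AuditEntry:
    """One correction: what changed, by whom, and when."""
    field_name: str
    old_value: object
    new_value: object
    changed_by: str
    changed_at: datetime

@dataclass
class Record:
    data: dict
    audit_log: list = field(default_factory=list)

    def correct(self, field_name, new_value, changed_by):
        """Apply a front-line correction while preserving the audit trail."""
        self.audit_log.append(AuditEntry(
            field_name, self.data.get(field_name), new_value,
            changed_by, datetime.now(timezone.utc)))
        self.data[field_name] = new_value

# A worker spots and fixes a misspelled city; the old value is retained.
customer = Record({"name": "Acme Corp", "city": "Sprngfield"})
customer.correct("city", "Springfield", changed_by="jdoe")
print(customer.data["city"])    # Springfield
print(len(customer.audit_log))  # 1
```

The design choice worth noting is that the correction and its audit entry are written in a single operation, so no fix can occur without a traceable record of who made it.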
The primary challenge is to first understand and map the current data ecosystem but still retain the flexibility to easily adapt and update it as the business and the underlying data stores continue to evolve over time. The most effective means of doing so is through data models, which describe the data (and metadata), as well as process models to describe business processes that create, consume and change the data. This allows data to be understood in context and is the basis of identifying redundancy and inconsistency. All manifestations of each critical business data object must be identified and cataloged. Typically, the most critical business data objects are also master data, as they are utilized in most transactions (for example: customer, product, location, employee, etc.). Without context, it is extremely difficult to ensure that the proper data is being utilized for reporting and analytical purposes, and hence, informed decision making. In order to complete the understanding, the models must be supported by integrated business glossaries that are owned by the business stakeholders responsible for each area. It is imperative that the business team is able to utilize tools that allow them to collaborate not only among themselves, but also with technical staff that are assisting them.
Business analysts, data analysts, modelers, and architects build the required conceptual and logical models based on continual consultation with business stakeholders. Physical data models are used to describe the underlying system implementations, including data lineage. When combined with data flows, true enterprise data lineage can be understood and documented. This is the point at which we have established true traceability, which is vital for comprehension and knowledge.
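Enterprise lineage is naturally represented as a directed graph from source systems to consumption points. As a minimal sketch with hypothetical system names, the following walks upstream from a report to find every system that feeds it:

```python
# Maps each data store to the upstream stores that feed it (hypothetical names).
lineage = {
    "sales_dashboard": ["data_mart"],
    "data_mart": ["warehouse"],
    "warehouse": ["crm", "erp"],
    "crm": [],
    "erp": [],
}

def upstream_sources(node, graph):
    """Return every upstream system reachable from `node`."""
    seen = set()
    stack = [node]
    while stack:
        current = stack.pop()
        for parent in graph.get(current, []):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen

print(sorted(upstream_sources("sales_dashboard", lineage)))
# ['crm', 'data_mart', 'erp', 'warehouse']
```

A traversal like this is what lets an analyst answer "where did this number come from?" for any figure on a report, which is precisely the traceability the text describes.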
All of the models, metadata and glossaries must be integrated through a common repository to enable true collaboration and understanding. Approved artifacts need to be published in a medium that is easily consumed, typically through a web-based user interface. In addition, the models themselves become the means to analyze, design, evaluate, and implement changes going forward.
Due to the size and complexity of most environments, this must be done on a prioritized basis, starting with the most critical business data objects. Metrics are established to quantify relative importance as well as to evaluate progress. Breadth and depth are increased incrementally, as with any continuous improvement initiative. Establishing a data culture and improving data quality is not a one-time project. It is an ongoing discipline that, when executed correctly, delivers breakthrough results and competitive advantage.
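The prioritization-by-metrics step can be sketched as a weighted quality score per business data object, with effort directed at the lowest-scoring object first. The dimension scores and weights below are entirely made up for illustration; real values would come from measured checks and stakeholder-agreed importance:

```python
# Hypothetical per-dimension quality scores (0.0-1.0) for two master data objects.
scores = {
    "customer": {"completeness": 0.92, "consistency": 0.75, "timeliness": 0.88},
    "product":  {"completeness": 0.98, "consistency": 0.95, "timeliness": 0.70},
}

# Illustrative weights reflecting relative business importance; must sum to 1.
weights = {"completeness": 0.5, "consistency": 0.3, "timeliness": 0.2}

def quality_score(dims, weights):
    """Weighted average of the dimension scores for one data object."""
    return sum(weights[d] * v for d, v in dims.items())

# Lowest score first: the object most in need of improvement effort.
ranked = sorted(scores, key=lambda obj: quality_score(scores[obj], weights))
print(ranked[0])
```

Re-running the same scoring on each improvement cycle gives the progress measurement the text calls for: the scores themselves become the evidence that the continuous improvement effort is working.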