Data scientists claim that they spend 80% of their time cleansing data and the other 20% of their time complaining about cleansing data. There is clearly something wrong with this picture, but sadly, my own experience can confirm that this seems to be true in practice. In this blog, my goal is to explore why the need to perform data cleansing is such a common challenge. More importantly, I discuss better options that we should strive to adopt. For this exploration. I apply a lean technique called “The 5 Whys” where you continue to ask why until you reach the root cause of a problem.
Why So Much Data Cleansing?
Most organizations suffer from high levels of data technical debt, or in other words, poor quality data. Poor data quality is a serious problem for several reasons:
- Data-driven decision making is hampered because the data cannot be trusted
- Software development initiatives, in general, are slowed due to working with lower quality assets
- Machine learning (ML)/artificial intelligence (AI) require high-quality data to train models, suffering from “garbage in, garbage out” otherwise
- Data warehousing(DW)/business intelligence(BI) solutions invest (or more accurately waste) significant effort in transformation logic
In all of the above use cases, there is a need to get the data to the point where it is usable. As a result, we cleanse the data required that is required, believing that this is an effective strategy.
Why is Data Cleansing Thought to be Effective?
When your organization suffers from excessive data technical debt, every use case that requires clean data must perform data cleansing at the point of usage (assuming that your organization isn’t following the better strategies described later). On the surface, this seems to make sense for two reasons:
- The individual teams have control over how they work with the incoming data. The owners of the data provide a way to access the data, but they typically don’t provide ways to manipulate the source data (except perhaps through some sort of encapsulation layer, often implemented as services). In short, people are allowed to do anything they want with their copy of the data, but have limited ability to fix the source. So, cleansing the data at point of usage is the only option available to them.
- The work is contained to just the data that they need. People typically work with a subset of the source data they have available to them, and often a very small subset of the data. So, the data cleansing effort that they need to perform is much smaller than the effort required to fix all of the source data. From their point of view, this is much less work (and it is for them).
However, this isn’t effective because of the likelihood that multiple teams will do the same cleansing work. For example, if an address suffers from formatting inconsistencies, then every team that works with address data needs to implement, test, and maintain similar cleansing logic on that data. The poor-quality data motivates teams to increase your organization’s overall technical debt through the repetition of said cleansing logic. In short, technical debt begets more technical debt.
Why Don’t We Cleanse the Data Once?
Instead of cleansing the address information in every place that take it as input, wouldn’t it be easier to write the logic once to cleanse the data when it is output from this source? This is exactly what you do in data warehouse (DW) solutions. Data warehouses take data from multiple sources and make it available for many different reporting purposes. Mathematically, they are working with N data sources and providing it for M use cases. If there were to take the data cleansing strategy described above, they potentially have NxM cleaning challenges to get the data into the required formats. Of course, it wouldn’t be that bad because not every use case requires data from all N sources, but each use case requires data from one or more sources. So, there would still be repetitive work, for example cleansing logic to fix address data everywhere it’s used.
Instead, DW teams use an extract-transform-load (ETL) strategy where the data from a data source is cleansed once before it is placed in the DW. This is the transform portion of ETL. After that, everyone has access to high quality data which is then manipulated accordingly. This requires less work and injects less technical debt, the cleansing/transformation code, than the data cleansing strategy described earlier.
Outside of DW solutions organizations will also choose to encapsulate database access, forcing systems to access data via services or components rather than directly via SQL code (or similar). A common strategy is to put cleansing/transformation logic into the services that provide data to users of the data source.
Cleaning the data as it is read from the source is also problematic. The transformation logic required to cleanse the data once is still technical debt that requires effort to build, maintain, and operate over time. Yes, it’s less wasteful than doing this in multiple places at every point of usage but it’s still wasteful.
Why Don’t We Fix the Source Data?
The tools and techniques to address data technical debt in existing legacy data sources, including those running in production, have existed for years. The two critical techniques are:
- Database refactoring. A database refactoring is a simple change to design of a database that improves the quality of its design. Database re-factorings typically improve the table structure, the code quality (of stored procedures, triggers, and so on), or access methods to the data source. Examples of database re-factorings include renaming a column, splitting a table, adding a view, and many more. See the Catalog of Database Re-factorings for an extensive list.
- Data repair. A data repair is a simple change to the values of data within an existing data source. Examples of data repairs include introducing a common format, replacing formatting characters in strings, and fixing “bad values”.
Although these two techniques are quite straightforward, they can be difficult to implement due to business or organizational challenges. These challenges include the need to:
- Own the source data. In most organizations, the owner of a data source is the only one with the ability to authorize changes to it. The implication is that if my team runs into a data quality problem in an existing data source, then we need to work with the team that owns that data source in order to fix the problem. This can often be difficult to accomplish for a variety of reasons.
- Have the ability to fix it. You will need people with the skills to fix the data source(s) and the tools required to do so.
- Have the resources. Fixing existing data sources requires time and funding, even when it’s built into your normal development processes.
Fixing data quality problems at the source is the best strategy available to us when we find ourselves with existing data technical debt. However, the easiest technical debt to address is technical debt that we didn’t incur in the first place. This leads us to our fifth and final question.
Why Don’t We Stop Creating Poor Quality Data?
Poor-quality data is purposefully injected into data sources; it’s a simple as that.
This is done through applications/systems that don’t have sufficient controls in place to ensure data quality. Poor-quality data is also caused by an inconsistent understanding of, or agreement around, the data. For example, there is a Person table with a column called BDate. My team uses it to store the birth date of a person because that’s obviously what a column with that name must store. Unfortunately, the system for which this table was first created uses it to store the date on which the customer was last billed (obviously). Using a column for more than one purpose is a serious quality problem that requires more than a surface-level analysis to detect. It’s also done through manual manipulation of the data, perhaps someone with administrative authority directly edits the data. Or poor-quality data may be loaded into a data source from somewhere else. The point is that poor-quality data is injected into data sources via a variety of methods, these being the more common ones.
To fix the root cause of why a data source contains poor-quality data, you need to:
- Identify the quality problem(s) in the data source. Data sources typically have many quality problems.
- Identify what is injecting the poor-quality data. This could be an application, a data load from an external data source, the result of manual data input, or combinations thereof. It is important to recognize that it may not be a single source, and worse yet the poor-quality data is the result of a combination of things, each one individually being correct in its own context.
- Identify what should actually be happening. Identifying the problem is only a good start, you also need to identify a solution to the problem(s).
- Prioritize the work. Start by fixing the most critical problems first. Interestingly, you are likely to find that the 80/20 rule applies and many of your data quality problems have the same source(s).
- Fix the broken code, broken process, or both. Your goal is to permanently stop the injection of poor-quality data.
There are three fundamental strategies that we need to adopt so as to successfully stop generating poor-quality data:
- Developers must become skilled in data engineering fundamentals. Developers need basic skills and knowledge in fundamental data engineering techniques, particularly those of the Agile Database Techniques Stack.
- Application development teams need to be supported. Data is a shared organizational asset, or at least it should be. Your organization needs to adopt an agile data management strategy where your data professionals work in an evolutionary and collaborative manner, which is pretty much the exact opposite of a traditional approach. The data professionals need to bring real value to the table, be easy to work with, and need to transfer their skills and knowledge to the teams. The first step is to adopt an Agile Data Mindset. Interestingly, the Person BDate column example earlier was really a failure of data management. My team either didn’t know who to talk to about the proper use of that column, didn’t have access to documentation, the documentation didn’t exist, or had never been sufficiently coached in data skills to even know that this was an issue. Ensuring that teams are properly supported this way is the responsibility of your organization’s data management team.
- Application development teams need to be allowed to do so. The primary cause of technical debt, including data technical debt, in organizations is poor project management. The desire for software development projects to be “on time and on budget” often comes from the business, but these constraints are often accepted, and committed to, by project management. To be on time and on budget project teams will invariably cut functionality, cut quality, or both and thereby inject technical debt. We need to get off this treadmill, and instead recognize that a decision that is convenient in the short term at the project level is quite harmful to the long-term health of your organization. If you want your teams to produce applications/systems that result in high quality data, they must be given the time and money required to do so. The good news is that the investment in quality from the very beginning is often less expensive and faster than leaving it towards the end of a project (or leaving it out completely).
Building systems that produce high-quality data is better than fixing the source data after the fact, which is better than transforming data when it is extracted from a data source, which is better than cleansing the data at the point of usage, which is better than working with poor-quality data.