Having been around the block, we data professionals have seen many cases of data which could be construed as “imaginative.” One such case, where imagination can turn good data into the Wild West, is free-form entry of address information. If users of an enterprise system are allowed to enter US States as two- or three-letter abbreviations, or even full state names, one can witness the gamut of human imagination as to how to spell “Illinois” (or “Ill”, “Il”, “Illinios”, “Ill.”, “Il.”, “II”, etc…).
A typical user or report consumer may look at these imaginative entries and silently deem them acceptable. They understand the intent, so there’s no impact on the operational task that brought them to that one piece of data in the first place.
This imaginatively entered data is then processed by an ETL job into a Data Warehouse, or thrown into a schema-on-read Data Lake. A brilliant algorithm may be applied here to decipher the imaginative entries in our example above, but it fails to plumb the depths of human ingenuity that produced our unique data entries. Thus we have our bad data problem. If the problem is not caught (assume something less mundane than the representation of a State), it will pollute our Data Warehouse or Data Lake.
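As a minimal sketch (assuming a Python-based ETL step and a hypothetical lookup table of known variants, not anyone’s production code), that deciphering might look something like this; note how spellings the table never anticipated still slip through:

```python
from typing import Optional

# Hypothetical lookup of known variants -> canonical two-letter code.
# A real job would draw this from a maintained reference table.
KNOWN_VARIANTS = {
    "illinois": "IL",
    "ill": "IL",
    "il": "IL",
    "ill.": "IL",
    "il.": "IL",
}

def normalize_state(raw: str) -> Optional[str]:
    """Return the canonical code for a recognized variant, or None if unrecognized."""
    return KNOWN_VARIANTS.get(raw.strip().lower())

for entry in ["Illinois", "Ill.", "Illinios", "II"]:
    print(entry, "->", normalize_state(entry))
# "Illinios" and "II" come back as None -- human ingenuity the table never anticipated.
```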
The C-suite ideally makes its decisions based on data abstracted and aggregated by the Data Warehouse. Executives may not have any corporate process that would notify them of a potential issue with the data, so they will make their decision based on what is in front of them, without knowing that the evidence behind it is tainted.
Identifying the Source
Once a bad data problem is identified, its source should be tracked down. Surely all of us have seen ERP systems that give users wide latitude in how they interpret business data. In some cases, data may arrive from a third-party source with issues already in it. Alternatively, an automated process could be running that alters data and renders it into a bad state, like a fly in a food processor.
If a Data Quality issue is the result of user entry, we will see a wide variety of errors in the data, much like the State misnomers mentioned above. If a field shows a large number of unique values where only a small, fixed domain is expected, the cause is likely user entry. That entry should be tracked back to the interface that allowed the data to be entered incorrectly. Good systems make it hard to do the wrong thing, and ERP systems typically allow for a lot of interpretation (or customization).
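A quick way to gauge that domain of uniqueness is to profile the column. A rough sketch, assuming pandas and a hypothetical state column with toy values:

```python
import pandas as pd

# Toy data standing in for a column that should only ever hold two-letter codes.
df = pd.DataFrame(
    {"state": ["IL", "Il", "Ill", "Illinois", "Illinios", "Ill.", "IL", "il."]}
)

# A column expected to hold ~50 values but showing many variants of a single
# state points strongly at free-form user entry.
print(df["state"].nunique())       # 7 distinct spellings, all meaning Illinois
print(df["state"].value_counts())  # frequency of each variant
```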
Data Quality issues can also arise from third-party sources, such as an inherited list of customers (for example, a company has made an acquisition and is in the process of merging CRM data). Another case is when we purchase a service that sends a feed file on a semi-regular basis. Here we will likely see bad data entered consistently. The term “bad data” may be unfair, as the data may be good by the definition of its origin but not fit the definition used by the recipient. We may have CRM data from one system that uses three-letter State abbreviations versus our two-letter ones. If we are getting bad data from an automated feed – say, a service that cleanses address data – it may have a one-off issue with the spelling of a province or state in a foreign country. This “bad data” could also be the result of a regulatory change; for example, how smelter names and addresses are tracked for the purposes of conflict minerals reporting. That would be completely out of our control, but it impacts our decision quality if not dealt with.
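To illustrate the “good by its origin, bad for the recipient” case, a translation step between conventions might look like the rough Python sketch below; the three-letter abbreviations and the mapping itself are hypothetical:

```python
# Hypothetical mapping from the feed's three-letter convention to our two-letter one.
THREE_TO_TWO = {
    "ILL": "IL",
    "TEX": "TX",
    "CAL": "CA",
}

def translate_state(code: str):
    """Map an incoming abbreviation to our convention; None means 'needs review'."""
    return THREE_TO_TWO.get(code.strip().upper())

for code in ["Ill", "Tex", "Fla"]:
    mapped = translate_state(code)
    if mapped is None:
        print(f"unmapped value from feed: {code!r}")  # route to a data steward
    else:
        print(f"{code} -> {mapped}")
```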
The most terrifying source of bad data is an automated process running incorrectly. Say, for example, a user has set up a task to create purchase orders for stationery. Based on stationery type, it orders a certain quantity of items for each employee. In this example, imagine that staplers are misidentified as an item people need multiples of (like pens or paperclips). We may not notice the issue until one thousand red Swingline staplers are mistakenly delivered to our front desk, all because of an error in an automated process that went unnoticed.
Dedicating to a Solution
Once the source of the bad data is discovered, a solution to its cause can be brought to bear. This is where a Data Governance process can be followed and data stewards can be engaged to solve the existing problem, and then dedicate themselves to preventing future issues.
Solving the existing data quality issue can be done by manual user cleansing, or with the help of our friendly neighborhood IT team. Just don’t try to bribe the IT team with the output of our comical food-processor simile.
As for ongoing prevention, users can be trained, incoming data can be verified before use, and processes can be fixed. We won’t get into Data Governance here, but there’s a plethora of good information to be had about the tools and processes that can be used to establish greater confidence in the enterprise data asset (as a reader of this newsletter, you’re already at a great resource!).
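For the “verify before use” piece, one low-tech sketch (assuming Python and an abbreviated set of valid codes chosen for illustration) is a gate that quarantines anything outside the expected domain rather than loading it:

```python
# Abbreviated set of valid codes for illustration; a real check covers all of them.
VALID_STATES = {"IL", "IN", "WI", "IA", "MO"}

def validate_rows(rows):
    """Split incoming rows into (accepted, quarantined) based on the state code."""
    accepted, quarantined = [], []
    for row in rows:
        (accepted if row.get("state") in VALID_STATES else quarantined).append(row)
    return accepted, quarantined

good, bad = validate_rows([
    {"customer": "Acme", "state": "IL"},
    {"customer": "Initech", "state": "Illinios"},
])
print(len(good), "rows loaded,", len(bad), "sent back for steward review")
```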
Witness the Good Decision
The best way to see the success of our efforts to conquer bad data is the return on investment in the decision quality of our executives. If we saved our bad data, we can use it to generate the same trend or report as we do with our good data. This shows us, empirically, the difference in decision quality driven by our data, and we can then see what the wrong decision would have cost the company.
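As a toy illustration (with made-up figures, and pandas assumed as the tooling), the same revenue-by-state report run against the raw and the cleansed data shows how the misspellings understate Illinois:

```python
import pandas as pd

# Made-up figures purely for illustration.
raw = pd.DataFrame({
    "state":   ["IL", "Illinios", "Ill.", "WI"],
    "revenue": [100,   40,         25,     90],
})

# The cleansed copy folds the misspellings back into "IL".
cleansed = raw.assign(state=raw["state"].replace({"Illinios": "IL", "Ill.": "IL"}))

print(raw.groupby("state")["revenue"].sum())       # Illinois looks like 100
print(cleansed.groupby("state")["revenue"].sum())  # Illinois is actually 165
```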
Doing an exercise like this may not always be feasible, but if we can manage it once, it goes a long way to selling ongoing data cleansing processes to the business.
As data cleansing increases, the instances of bad data decrease, and decision quality will increase over time. If all our data regarding the lovely State of Illinois is tracked correctly, we can accurately make decisions based on our history there. Is our business growing? Do we need more sales staff in Illinois? Imagine if we made the wrong decision and left revenue on the table, inviting competitors to swoop in and take our market share, all because we didn’t deal with “Illinios”.