Data Professional Introspective: Tulip Fields and How to Tend Them

Since it’s spring, let’s imagine that we’re visiting the Netherlands and are gazing in delight at the stunning beauty of the tulips in bloom. Behold!

Tulips were originally developed to grow and bloom in thin, rocky soil, which is why they are popular as spring-flowering bulbs, as you don’t have to have a green thumb to get acceptable results. However, for the healthiest, largest, and longest-lasting blooms, you need to do more, starting with preparing the soil properly and alleviating any nutrient deficiencies. The soil should be loose, light, and well drained with compost or peat moss, to a depth of 8-9” to allow for 6” bulb placement and strong root growth. The addition of bone meal and blood meal provides a boost for root growth. Tulips grow best in alkaline soil, so you should test the soil, adding lime as needed. Once they begin to flower, you need to fertilize weekly with a high potassium mixture. For that eye-popping look of perfection, you need to plant according to a plan (e.g., beautifully organized rows below).

What does this have to do with data management? Here’s one way to unpack this metaphor:

Analytics (all varieties – descriptive, diagnostic, predictive, and prescriptive) represents the flowering of sound data management practices. If the ‘soil tilling’ – the condition of the data assets employed – is not properly performed, the results will be unreliable, such as failure to bloom or small blooms. If the ‘pH spectrum’ – knowledge about the data assets – is off, the results may be incorrect. If ‘fertilizer’ – monitoring the data assets through governance – is not applied, the results will deteriorate over time. If ‘field planning’ – integration and organization of data sets – is neglected, the results will not be consistent and accurate.

By ‘results’ we are referring to support for decisions that can critically affect the business strategy. An organization often makes big bets – resources, staff, time, attention – on what analytics reveals. For example, if a retailer decides, based on analytical reporting on past buying habits and predictions about future buying habits, that many families will be purchasing portable patio fire pits this year, the company will invest in increased inventory. If the bet is wrong, profits will drop.

Increasingly, business decisions in all industries are data driven. Therefore, developing and nurturing sharp, accurate, evidence-based (‘alternative facts’ are not acceptable), and creative analytics is essential for success.

Now let’s look at where the buck stops in the analytics factory – statisticians, modeling experts, and data scientists. The following chart [1] illustrates how they spend their time.

Note that 60% of their time goes to “cleansing and organizing” and 19% to “collecting data sets,” for a combined 79%, an astonishing percentage. Highly skilled individuals, eager to shine their creative light through computational reasoning, business acumen, and the artful rendering and interpretation of empirical evidence, are instead spending most of their workday as ‘data janitors.’ Naturally, as the survey revealed, these tasks are the least enjoyable parts of their job.

Since analytics is imperative for success, how can the organization help them to do what they were hired to do – to become more efficient and achieve more incisive results? The answer is ‘Manage your data effectively.’

In most organizations, as we know, the data layer is a tangled, poorly documented assortment of legacy data stores, countless point-to-point interfaces, and overloaded legacy data warehouse components. For example, a 20-year-old enterprise data warehouse may now serve as a repository of historical data, a master data management hub, an operational data store, and a data mart, all at once.

Organizations, for the most part, have failed to: capture, publish, and store metadata; implement broad data quality efforts; stand up effective data governance; capture data lineage; architect well-organized databases; design optimal shared data stores, etc. Therefore, analytics teams encounter many obstacles – finding data, determining data meaning, resolving disparate data representations and codes, and time-leveling data sets. Undocumented data and interfaces require a research effort (scope initially unknown) to track down business data experts, determine the best sources, resolve conflicting or ambiguous meanings, and determine how to represent the data for the models to operate. Then there are data quality issues – missing data, incorrect data, data out of range, undocumented values, etc. All this work must be accomplished just to have a good chance at achieving decisive results. The analytics team can be compared to the end person in a roller skate chain – whipsawed about, the helpless receiver of every movement further up the chain.
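To make the data quality issues above concrete, here is a minimal profiling sketch in Python. It is an illustration only: the records, the field names, the set of documented product codes, and the valid range are all assumptions invented for the example, not taken from any real system.

```python
from collections import Counter

# Hypothetical sample records; the field names are assumptions.
records = [
    {"store_id": "S1", "product_code": "PAV-01", "units_sold": 120},
    {"store_id": "S2", "product_code": None, "units_sold": 80},
    {"store_id": "S3", "product_code": "PAV-99", "units_sold": -5},
    {"store_id": "S4", "product_code": "PAV-02", "units_sold": 40},
]

# Assumed reference data: the documented codes and the plausible range.
KNOWN_CODES = {"PAV-01", "PAV-02"}
UNITS_RANGE = (0, 10_000)


def profile(rows):
    """Count basic data quality issues by type."""
    issues = Counter()
    for row in rows:
        code = row.get("product_code")
        units = row.get("units_sold")
        if code is None:
            issues["missing_product_code"] += 1
        elif code not in KNOWN_CODES:
            issues["undocumented_product_code"] += 1
        if units is None:
            issues["missing_units_sold"] += 1
        elif not UNITS_RANGE[0] <= units <= UNITS_RANGE[1]:
            issues["units_sold_out_of_range"] += 1
    return dict(issues)


report = profile(records)
print(report)
```

Even a check this small depends on someone having documented the valid codes and ranges, which is exactly the metadata most organizations lack.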

Let’s take a brief, partial example on the front end – determining and harmonizing the data set(s). Say that an executive of a large crushed stone and stone products company wants to know what types of paving stones and what volume should be delivered to retailers in a specific geographic location next year. The analyst will need a baseline of what paver types and how many of each were sold in all of the stores serving that location. The analyst has numerous data sources for product identification, because the company has grown by acquisition and there are many systems containing product information. The product types and subtypes are probably represented differently in the sources. It may be difficult to determine the ‘best’ source; often there is no ‘best’ source, and data must be reconciled and integrated on the fly for the model to execute, typically under the pressure of a tight deadline. This is a recipe for hero’s syndrome and eventual burnout.
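The reconciliation the analyst ends up doing on the fly might look something like the following Python sketch. The source system names, local product codes, and the crosswalk to shared paver types are hypothetical; in practice the hard part is building and agreeing on that crosswalk, not the code.

```python
from collections import defaultdict

# Hypothetical crosswalk from (source system, local code) to a shared paver type.
CROSSWALK = {
    ("system_a", "PVR-COB"): "cobblestone",
    ("system_a", "PVR-FLG"): "flagstone",
    ("system_b", "12"): "cobblestone",
    ("system_b", "17"): "flagstone",
}

# Hypothetical sales rows: (source system, local product code, units sold).
sales = [
    ("system_a", "PVR-COB", 300),
    ("system_b", "12", 150),
    ("system_b", "17", 90),
    ("system_b", "99", 20),  # a code nobody has documented
]


def baseline(rows):
    """Total units per shared paver type; set aside rows that cannot be mapped."""
    totals = defaultdict(int)
    unmapped = []
    for source, code, units in rows:
        paver_type = CROSSWALK.get((source, code))
        if paver_type is None:
            unmapped.append((source, code, units))
        else:
            totals[paver_type] += units
    return dict(totals), unmapped


totals, unmapped = baseline(sales)
print(totals)
print(unmapped)
```

The unmapped rows are kept rather than dropped, because each one represents a question for a business data expert that has to be answered before the baseline can be trusted.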

The following chart parses some examples of common data problems encountered by analytics staff against fundamental data management practices, represented by the Data Management Maturity (DMM)^SM Model framework.

Let’s look at row two: “I have client records from eight different business lines. Each has a client named John Doe. Are they the same person or are they different?” Look across the columns and notice how these practices can affect the analyst’s problem, for example:

The analyst needs to interact with eight different business lines, implying that data governance is insufficient or not operational
It’s implied that the client data sources aren’t well documented, illustrating the need for metadata that is available and accessible
The lack of consistency calls for a data quality strategy across the scope of client data
The inability to verify the identity of a client shows that data profiling was not systematically performed across the data stores
The business lines have not developed consistent matching rules for validating a client
The data in the sources has likely not been cleansed (corrected)
It’s implied that the data requirements for the sources were not documented, or that the documentation is not available
An authoritative data source has not been designated, e.g., use Source A for the most accurate client data, and an order of precedence has not been established
The interfaces have not been sufficiently documented, or the documentation is not available, and the same for the business rules for the interfaces
There is no definitive source-to-target data mapping or it is buried in ETL scripts and not easily available
Data standards are not being followed across the sources, as shown by the lack of clear definitions and source metadata
There is likely no master data repository, since the analyst is forced to access multiple systems.
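Two of the practices listed above, consistent matching rules and an order of precedence among sources, can be sketched in a few lines of Python. Everything here is an assumption made for illustration: the source names, the attributes used for matching (normalized name, birth date, postal code), and the precedence order. Real matching rules would be agreed through data governance and are usually far more nuanced.

```python
# Assumed order of precedence: earlier sources are considered more authoritative.
SOURCE_PRECEDENCE = ["source_a", "source_b", "source_c"]

# Hypothetical client records from three business lines.
client_records = [
    {"source": "source_b", "name": "John Doe", "birth_date": "1980-02-14",
     "postal_code": "20170", "email": "jdoe@example.com"},
    {"source": "source_a", "name": "JOHN DOE ", "birth_date": "1980-02-14",
     "postal_code": "20170", "email": None},
    {"source": "source_c", "name": "John Doe", "birth_date": "1975-09-30",
     "postal_code": "94105", "email": "john.doe@example.org"},
]


def match_key(record):
    """Assumed matching rule: normalized name plus birth date plus postal code."""
    name = " ".join(record["name"].lower().split())
    return (name, record["birth_date"], record["postal_code"])


def consolidate(records):
    """Group records by match key, then fill attributes by source precedence."""
    groups = {}
    for record in records:
        groups.setdefault(match_key(record), []).append(record)

    consolidated = []
    for key, members in groups.items():
        members.sort(key=lambda r: SOURCE_PRECEDENCE.index(r["source"]))
        merged = {"name": key[0], "birth_date": key[1], "postal_code": key[2]}
        # Take the first non-empty email in precedence order.
        merged["email"] = next((m["email"] for m in members if m["email"]), None)
        merged["sources"] = [m["source"] for m in members]
        consolidated.append(merged)
    return consolidated


clients = consolidate(client_records)
for client in clients:
    print(client)
```

Under these assumed rules, the three John Doe records resolve to two distinct clients. With the rules defined once and shared, the analyst no longer has to invent them under deadline pressure.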

This scenario shows the extent of the dependency between a well-organized, well-managed data layer and powerful analytical results. If the foundation – returning to our metaphor of the soil, the pH, the fertilizer, and the field planning – is shaky, the flowers will not bloom; that is, the results will not be useful. Even worse, conclusions can be based on muddy or ‘alternative’ facts that are difficult to resolve (e.g., excessive rounding, faulty assumptions), and the organization can end up making bets based on sand castles. Needless to say, big mistakes will engender the wrath of executives, so the stakes are high.

Organizations need to be serious about creating the conditions for optimum bloom, i.e., getting their data management house in order, if they hope to achieve the hard-hitting results they desire. So data scientists, rise up! Ask not what you can do for the organization (you already know) but what the organization can (and should) do for you.

References

[1] “Cleaning Big Data: Most Time-Consuming, Least Enjoyable Data Science Task, Survey Says,” Forbes, May 23, 2016

Melanie Mecca