Data Models and Data Profiling


As both a data modeler and a data quality analyst helping knowledge workers understand their data, I read with great interest the Why Data Models Cannot Work article in the February edition of TDAN.com and the subsequent Real Men Don’t Read Instructions  in the April edition. The
Why Data Models Cannot Work article highlighted the shortcomings of data models and promoted the examination of the data itself via techniques such as data
profiling to obtain knowledge management. The Real Men Don’t Read Instructions article argued that “data models can and should be used for
knowledge management. … Knowledge workers should verify and, if need be, augment models by other tools such as data profiling.”

Although data models do not tell us the whole truth, profiling the database does not provide the whole truth either. In fact, both may be misleading. However, used together, they can provide better
insight into the data.

This article discusses four situations that may cause the data profiling results, as well as the data model, to be misleading.





Misleading Results – Data Models and Data Profiling

  1. Life Cycle Stages of an Entity

    At the point in time that a record is created in the database to capture a new instance of an entity, certain information may not be known even though it exists in the real world. For example,
    in the prior articles, the “formal” Financial Instrument Description might not be known at the time of initial data entry. The business process rules might allow a Financial
    Instrument to be entered into the database without a Financial Instrument Description but require one when the Financial Instrument is actually issued.

    Therefore, even though in the real world a Financial Instrument might always have a Description:

    • If the data model were purely logical, that attribute would be marked as mandatory under the belief that a “valid” Financial Instrument would always have a Description.

    • If the data model were being used to create the actual database, that attribute would have to be marked as optional.

    • Data profiling results would show that some records contain Financial Description and others don’t.

    Note that the design of the data entry screens and the data entry procedures could affect the data profiling results. The data element could be made mandatory in the database and the data entry
    personnel instructed to enter “unknown” when the information is not available at the time of initial data entry. Conversely, the data element could have been made optional in the
    database. In the design of the data entry screens/update processes, there could be a validation rule to check that Financial Instrument Description must be populated (and not with default
    values) when an instance of a Financial Instrument reaches the issuance stage in its life cycle.

    Frequently, the designer decides to implement optionality rules in the design of the front-end systems rather in the database definition itself. Often, a column in a table will be defined as
    optional, regardless of its true optionality.

    Data profiling can offer valuable hints to these life cycle stage dependencies. For example, in a data model for the insurance industry that I was reviewing, there was a straightforward
    definition of Issuance Date, implying that it was a historical date. Data profiling showed, however, that some dates were in the future. Investigation determined that expected issuance dates
    were being entered. The initial data design did not distinguish between actual and expected dates; the names and definitions in the model did not capture that future date concept either. Using
    the two in combination – data model and data profiling – improved the overall understanding and allowed the reports and business metrics to be calculated and interpreted
    correctly.

  2. Multiple Sources that Populate the Database

    The data might be populated from more than one source. Information that exists in the real world may not be provided in one of the feeds due to contractual constraints or data
    sensitivity/privacy concerns.

    Therefore, the data model could show a data element as optional and data profiling results could show that some records contain it and others don’t, despite it always existing in the real
    world.

  3. Population Segmentation

    The optionality might be a function of other data elements. For example, in the insurance industry, the optionality of a data element might depend on the specific company, state and product
    line for which the Financial Instrument was being issued. Those rules are not easily captured in a data model and might be too voluminous and too dynamic to document via data profiling.

  4. Data Quality

    In the real world, the information might always exist and even be available at the time of data entry. However, the data entry person might enter it inaccurately using short cuts to minimize
    keystrokes or might not bother to enter it at all. That is the focus of data governance – the well known disconnect between the data gatherer and the knowledge worker.

    Also, it may be a matter of data entry training and the design of the user interface. There may be certain business rules as to the conditions (for example, the specific combinations of
    company, state and product line) under which a field should be populated. However, unless the rules are enforced within the system, the data entry person could become confused and enter a value
    in a field when it should not be populated or conversely not enter a value when it should be populated.

    Data profiling documents how a field is actually populated rather than how it was intended to be populated.


Conclusion

As illustrated above, propositions of real-world relationships based on the model and even based on the data population itself can be misleading. That is not to say that data models and data
profiling are not useful. Both are useful and provide complementary information. In fact, when the propositions mined from each appear to disagree, it indicates that there is an opportunity to
clarify a misunderstanding. That investigation can lead to a better understanding of the data as it exists in the real world and as it is held in the database.

Share

submit to reddit

About Michael Smilg, CDMP, MBA

Michael is a systems consultant at a leading insurance company and has worked in information technology for more than 30 years in multiple roles, including systems consultant, data administrator architect, database administrator, systems analyst, programmer and manager. His current areas of interest are information strategy, data quality, data governance, enterprise data management and metadata management. He can be reached at msmilg@sbcglobal.net.

Top