Data Lineage: The Next Generation

Most data-warehousing and data governance professionals are familiar with the concept of data lineage. Data warehouses, very often sources for regulatory and compliance reporting data, are very
often accountable for tracking whatever alterations upstream data is subjected to once it crosses over into the data warehouse (DW) environment. A data lineage analysis performed within an ETL
toolset can track these alterations by presenting a series of data output-input dependencies within such an environment as a graph of nodes and links.

Metadata repository products also typically provide data-lineage or impact-assessment functionality. Tracing data transformations through a homogeneous ETL environment is challenging enough. Using
source code scanners and similar tools to try to piece together an end-to-end picture of data lineage through a heterogeneous, multi-platform “operational” application is even more
labor-intensive and error-prone.

But operational applications are, of course, where the majority of business rules for transforming data values are implemented. And contemporary software developers, firmly grounded in
object-orientation, are largely unaware of the concept of data lineage. They may be familiar with the idea of dependencies as a factor in component-based software design1. But from an
object-oriented perspective, dependencies occur between software components, and only coincidentally between data elements.


Dependencies and Business Rules: An Example

A possible definition of “business application” might be the implementation of a set of business rules such as constraints and derivations. One example of a business rule, suggested by
Barbara von Halle, is “Disallow creation of a new order by any customer who already has ten open orders.” Easy enough to say. After due deliberation about what data this seemingly
innocent statement might be directly or indirectly dependent on, I developed the following list of possible base and intermediate variables:

CUSTOMER_OPEN_ORDERS_COUNT
ORDER_UNPAID_AMOUNT
ORDER_PAID_AMOUNT
ORDER_PAYMENT_AMOUNT
ORDER_TOTAL_AMOUNT
ORDER_LINE_EXTENDED_AMOUNT
ORDER_LINE_TAX_AMOUNT
ORDER_LINE_NET_AMOUNT
ORDER_LINE_UNITS_COUNT
UNIT_PRICE_AMOUNT 

No doubt there are other ways to specify this set of variables, but one thing common to any such set is that on the way to enabling the ten-open-orders constraint, they form a data lineage, or
dependency graph. To test out this hypothesis, I created a set of views in a DB2 UDB database, one view for each of these variables.2 


Dependency Tracking Using the Database Catalog

The reader may have come across a database view that at some point became inoperative. A view becomes inoperative when one or more objects (e.g., tables, views) upon which it is dependent are
dropped. You may not know exactly how a DBMS such as DB2 UDB keeps track of these dependencies. Not surprisingly, it’s done in the system catalog. In DB2 UDB, view SYSCAT.VIEWDEP shows
catalog metadata that defines dependencies between views. Each row includes the name and type for a base table/view and a table/view that is dependent on the base.

Executing this query

SELECT bname, btype, viewname, dtype
FROM SYSCAT.VIEWDEP
where viewschema=’PERL’ order by bname;

in the database containing my set of views returns the result set below. Base tables are indicated by a type T, views by V.

alt


Graphing Data Dependencies

This type of listing is fine for small listings, but graphs are often much easier for humans to visually assimilate. I was introduced to a free software product named Ant Explorer3 while supporting
the development of a Java-based application some time ago. Ant Explorer creates graph diagrams from “Ant scripts,” which are just XML files that happen to be compliant with Ant-specific
vocabulary and syntax. As long as its input complies with this format, what the input data “means” is of no consequence4. The graph below is a result of formatting the query result set
above into Ant-compliant XML. For this example the re-formatting was done manually, but a relatively simple script could be developed to automate this transformation. Voila – instant
data-lineage diagram.

alt


A Word About “Performance”

Any reader plagued by doubts about the practical applicability of an example like this – such as “sure, but what about designing applications for performance?” should bear the
following in mind.

Since the early days of business computing, when business application programs were optimized to fit a 4k memory limit, applications have been built to work as well as possible within the platform
constraints imposed at a given point in time. One after another, these constraints have faded into irrelevance, leaving the prior generations of applications behind.
In addition, application performance should be considered from not only the perspective of short-term performance such as throughput and response time, but also measured against long-term
performance goals including maintainability and adaptability.


Bottom Line Time

So what have we proven by this exercise in do-it-yourself data lineage? First, let’s designate this set of business rules, implemented as DB2 views, an example of a business application.
Granted it’s not overly ambitious, but it does exemplify the most critical aspects of such an application – the data and the business rules. Then let’s designate the DB2 catalog
query and Ant Explorer graph as a dependency/lineage analyzer. This combination has resulted in an application development environment in which all data dependencies begin and remain clearly,
precisely and explicitly defined.

Evolving toward such an explicit exposition of data dependencies in business application development is critically important because – after decades of fits and starts and attempted
improvements – the design, development and testing of business applications continues to be difficult, time-consuming, costly and error-prone. Applications continue to become more complex and
highly distributed (e.g., SOA), while at the same time chronically lagging behind the ever-accelerating changes in the business landscape. And in spite of our best efforts to make development teams
more “agile,” each application produced becomes yet another “legacy.”

The reasons for this state of affairs are twofold. First and foremost are chronically ambiguous statements of business requirements – a topic for another day. The second, even more deeply
embedded root cause is that the fundamental fabric of applications – narrative-text programming languages – continues to obscure, rather than expose, data dependencies. When data
dependencies are obscured, the impact of any change cannot be precisely determined, rendering the modification of an application an exercise in educated guesswork. That’s the way it was at
the beginning of business application development, and that’s the way it continues today.

It is not a given that this status quo must continue. On the contrary, business applications can be developed in a manner in which the deliberate exposition and accommodation of data dependencies
is the most important enabler in fulfilling business requirements. Such applications would exhibit an unprecedented level of transparency and adaptability. The simple views-based mini-application
described here proves that this type of development environment is entirely feasible. In fact, it is one of the many innovative ideas currently under development within IBM Corporation.

References:

  1. See the writings of Robert C. Martin, for example.
  2. With the exception of ORDER_PAYMENT_AMOUNT, ORDER_LINE_UNITS_COUNT and UNIT_PRICE_AMOUNT which are not derivable.
  3. Ant has nothing whatsoever to do with insects. According to ant.apache.org, Apache Ant is “a Java-based build tool”. (A Java application is not compiled or assembled, it is
    “built”.) See www.yfiles.com/en/products.html to download your very own copy!

  4. Anyone interested in seeing the actual XML underlying this diagram can drop me an email at datamodel@aol.com and I’ll send you a copy of the file.

Share

submit to reddit

About Bill Lewis

Bill is a Data Architect with IBM Global Business Solutions. His more than 25 years of information technology experience span the financial services, energy, health care, software and consulting industries. In addition to his current specializations in data management, metadata management and business intelligence, he has been a leading-edge practitioner and thought leader on topics ranging from software development tools to IT architecture. He has contributed to numerous online and print publications, and is the author of Data Warehousing and E-Commerce. He can be reached at lewisw@us.ibm.com.

Top