Data Integrity – Pay Now, Or Pay Later

Published in TDAN.com April 2006

A fundamental requirement for any database application is to ensure that data is inserted, updated or deleted in a controlled manner that protects and preserves the quality of each stored data
value according to the established business rules governing that data element.

Data integrity is unwittingly compromised when transactional databases are denormalized without appropriate database-side and/or application-side code to perform data integrity validation.

Contributing to this problem is a perception that use of database-level integrity constraints (NOT NULL, UNIQUE, PRIMARY KEY, FOREIGN KEY, and CHECK), as well as triggers and stored procedures are
prohibitively resource-intensive, and therefore should be abandoned in favor of equivalent application-side logic.

“Pay Now” is the recognition that ensuring data integrity requires the investment of data modeling and coding of data integrity validation logic. “Pay Later” is the cost incurred by
organizations relying on bad data because they unwittingly sacrificed data integrity through inadequate data integrity validation.

Let’s clear up some misconceptions about performance of server-side data integrity constraints, then highlight risks associated with poor data integrity, and draw a few conclusions.

PERFORMANCE CONSIDERATIONS

Because integrity constraints operate at the row level upon specific values of data in columns, system resources are consumed, resulting in some performance degradation. Since integrity constraints
are implemented as triggers in the DBMS internals, the performance impact is equivalent to explicit triggers. However, performance is not improved by coding data integrity logic in application
code, because the same row-level value checking must be done in either case. In fact, data integrity logic in application code may introduce additional performance overhead from the network traffic
to issue the data integrity SQL calls, as compared with server-side integrity constraints.

In general, the performance cost of integrity constraints is equivalent to a SQL statement that evaluates the business rule. In addition, integrity constraints often outperform triggers or
equivalent application code because they are optimized for performance by the DBMS vendor. Also, because the DBMS query optimizer knows about the constraints from the database dictionary (catalog),
invoking constraint checks occurs only when necessary, contrary to triggers or application code.

DBMS vendors are continually improving the performance of integrity constraints. Several methods include: 1) Deferring constraint checking to the end of a transaction, which allows complex
application processing, such as cascading updates, 2) Use of non-unique indexes to enforce uniqueness, thus preventing index drops and recreates associated with disabling and enabling of
constraints, 3) Use of “enforced” constraints, which behave like an enabled constraint in that it ensures that any new transactions conform to the constraint, but with the difference that it does
not check pre-existing data for validity (this avoids placing an exclusive lock on the table, allowing concurrent transaction activity). These enhancements to integrity constraints significantly
improve availability and simplify application development.

In addition to on-line response time, another area of concern for performance is the time to recover a database in a disaster scenario. Removing or disabling integrity constraints in the database
and implementing integrity checking in application code may result in a more rapid recovery in cases where the database is infrequently backed up and recovery involves applying many transaction
logs. Since each applied transaction must be validated through existing integrity constraints, the more transaction logs to apply, the longer the recovery. However, this recovery time delay can be
eliminated for databases with integrity constraints by performing more frequent on-line backups, thus minimizing the number of transaction logs to be applied at recovery.

The bottom line is this: integrity checking must be done somewhere in order to prevent bad data from being entered into the database. From an overall business perspective, the cost of bad data far
outweighs the cost of integrity checking. Compared to integrity constraints, performing integrity checks in the application code is slower, more difficult to code and maintain***, difficult to
verify existence and correctness of implementation, thus increasing risk of bad data.

***As an example of what must be done in the application code to replace only a portion of the functionality of a simple foreign key constraint, consider a DEPARTMENT table and an EMPLOYEE table in
a parent-child relationship (DEPARTMENT contains EMPLOYEEs / EMPLOYEEs are assigned to a DEPARTMENT), with no foreign key constraint. To eliminate a department and all its associated employees, the
following functionality would have to be hard coded in the application to ensure that all employees are properly deleted and no “orphans” are left behind: 1) Lock the row in the DEPARTMENT table
to prevent update, 2) Lock the EMPLOYEE table to prevent new rows from being inserted or existing rows from being updated that are related to the DEPARTMENT row, 3) Perform a full table scan of the
EMPLOYEE table to see if deleting this DEPARTMENT would cause an “orphan”, 5) Obtain row locks on all related EMPLOYEE rows, 6) Delete the DEPARTMENT row, 7) Delete all related EMPLOYEE rows, 6)
Release row locks. By contrast, this coding complexity is completely avoided through a simple declarative foreign key constraint with a CASCADE DELETE, as follows:

ALTER TABLE EMPLOYEE
ADD ( CONSTRAINT FK_EMPLOYEE01
FOREIGN KEY (Department_Identifier)
REFERENCES DEPARTMENT ) CASCADE DELETE ;

RISK CONSIDERATIONS

Inadequate integrity checking at the transaction level presents a major risk of storing incorrect data from which business decisions are made, affecting revenue, customer experience, regulatory
compliance, privacy, etc. Explicit integrity constraints mitigate risk because they intercept all transactions at the table level, regardless of the transaction source. Although there may be only
one application accessing a database in which data edit controls exist, it is common practice for developers to execute SQL scripts through the DBMS command line interface to directly update
tables. The probability of these scripts not protecting data integrity is very high, and without database level integrity constraints, data corruption is very likely to occur. In addition, the lack
of verifiable integrity checks may be an unacceptable risk from an audit perspective.

CONCLUSIONS

It is important for every organization to recognize that there is a cost associated with ensuring data integrity, and it must not be avoided. “Pay Now” is unquestionably better than “Pay Later”
because the consequences of improper business decisions based on bad data is avoided.

Data Management plays a key role in the following areas: 1) educating organizations about the importance of data integrity, and to advise on best practices for achieving it, 2) being involved in
new system implementations to validate data integrity, 3) leading the effort to perform data profiling on existing databases to identify data integrity deficiencies and implement corrective
solutions.

Finally, a high degree of data integrity in the transactional environment pays big dividends downstream in the corporate data warehouse by reducing the complexity, time, and cost of ETL.

MenuMenu

Data Integrity – Pay Now, Or Pay Later

Gary Flye

MenuMenu

Share this post

Gary Flye