Of all the issues surrounding the logical-physical divide, perhaps none is as critical (or as misunderstood) as the subject of normalization. I’ve alluded to this in previous articles, mentioning, for example, how implementing a fully normalized logical model in a physical database, and forcing application developers to code to the normalized schema, results in what Scott Ambler calls the “object-relational impedance mismatch.” This is why I advocate the use of a virtual data layer (VDL) that allows developers to code to database objects that more closely resemble the “real life” application objects.1
Implementing a fully normalized schema may lead to unnecessary complexity and poor performance in the database, and it may not be an appropriate database design for some applications (rules databases and document databases come immediately to mind). On the other hand, as we’ve previously seen, premature or improper denormalization of the database schema can result in corrupted data, poor performance, and a variety of application issues.2
So how do you decide what degree of normalization is appropriate in a database schema? First, it’s necessary to understand what normalization means in the logical realm and what it means in the physical realm. The logical data model, as I’ve said before, documents the business data requirements (the domain of the data and the business rules that constrain the data) of a business subject area (such as Accounting, Inventory, Order Processing, Personnel, etc.). It is a requirements document, not a design document. The logical data model is usually fully (or mostly) normalized; that is, the identified data elements, or attributes, are organized into groups of interrelated entities (objects with a specific business meaning) such that any given attribute:
- Exists only once in the model
- Is associated with the appropriate entity
- Describes some essential characteristic of that entity
- Is uniquely identified by the key of that entity
This enables the modeler to ensure that the model correctly identifies all the relevant business entities, the nature of the relationships between them, their defining characteristics (including key characteristics), and so on. Or, to put it another way, normalization of the logical model helps ensure that all the business data requirements have been correctly identified, understood, and documented. A normalized logical model is also extensible, meaning that it can be easily modified or enhanced to accommodate new or changing business requirements.
However, the purpose of normalization in physical database design is different from that of logical design. Physical database design consists of mapping the business requirements of the logical model to an appropriate choice of implementation, architecture, and technology. In other words, you first need to answer some fundamental questions:
- Is a database required (or something else, like an XML or EDI document)?
- If a database is required, what is the most appropriate choice of database architecture (relational, object, hierarchical, network, etc.)?
- Will the database be updated in an ad hoc fashion by external users via database transactions (OLTP), or will it be updated only via controlled batch processes?
- What sort of applications will the database need to support (web applications, client/server apps, BI scorecards, OLAP analysis, static reporting, ad hoc queries, etc.)?
- Where will the database reside (on a local database server, on an application or web server, or in the cloud)?
The database designer then needs to decide what parts of the logical model (i.e., what entities, attributes and relationships) need to be implemented, and map the requirements of the logical model to the appropriate artifacts in the chosen implementation technology. If, for example, the logical model (or part of it) is being implemented in a relational database that will be supporting OLTP (online transaction processing) applications in which data is updated in an ad hoc fashion by external users, then the physical implementation may well resemble (or mostly resemble) the logical model. Normalization of the physical schema in an OLTP relational database provides some advantages where data is being frequently updated – it improves performance during data updates and reduces the risk of data anomalies. However, the primary purpose of normalization (or denormalization) in physical database design is to create data structures (objects) that are highly cohesive, loosely coupled, and reusable. For example, you would want to have a single data structure (or set of structures) containing customer data that contains all data relevant to customers (highly cohesive), that changes only when business or application requirements directly related to customers change (loosely coupled), and can be reused to support any application or business process that involves customers.
The degree of normalization in the physical design (and database schema) depends on the nature and purpose of the database being created. OLTP databases will necessarily be more normalized than databases that support reporting and analysis (and, as we’ll see in my next article, there may be denormalized data structures even within a mostly normalized database schema). And, as I said, some types of specialized databases (such as rules databases and document databases) require specialized design. But even when a logical model is being implemented in a mostly normalized OLTP database design, you probably still won’t implement everything in the logical model as-is. Some minor entities in the logical model might be implemented as rules or constraints, rather than as code tables, or they might not be implemented at all. Single-node supertype-subtype relationships might be collapsed into a single database table. Large or compound natural (business) keys might be replaced by a single surrogate (system-generated) key in the database. And so on.
The point is this: Beware of blind assumptions that a given logical model must be implemented exactly as-is in a relational database. There is much analysis to be done, and many decisions to be made, in creating a physical database design that satisfies both the business requirements of the logical design, and the application requirements of the systems and users that will consume the data.
NOTE: I’d like to make this a dialogue, so please feel free to email questions, comments and concerns to me. Thanks for reading!