Abstract
Data modeling is without doubt one of the most important and challenging aspects of developing, maintaining, augmenting and integrating typical enterprise systems. More than 90% of the functionality of enterprise systems is centered around creating, manipulating and querying data. It therefore stands to reason that individuals managing enterprise projects should leverage data modeling to execute their projects successfully and deliver systems that are not only capable and cost-effective but also maintainable and extendable. A project manager is involved in a variety of tasks including estimation, planning, risk evaluation, resource management, monitoring & control, delivery management, etc. Virtually all of these activities are influenced by the evolution of the data model and may benefit from taking it as the primary reference. This series of articles by Amit Bhagwat goes through the links between data modeling and various aspects of project management. Having explained the importance of the data model in the estimation process, taken an overview of various estimation approaches, presented an illustrative example for them, and considered the importance of intermediate and derived data, this article addresses the effect of denormalization / normalization on estimation.
A Recap
In the first article[1] of this series, we established data operation to be the principal function of most enterprise systems and inferred that the data structure associated with a system should prove an effective starting point for estimating its development.
In the next two articles[2] [3] we took a simple example to illustrate the function-based estimation approach and its simplified derivative, the data-based approach, highlighting the importance of considering only the data owned by the system and of using the data-based approach only as a pre-estimate / quick check.
In the last article[4], we continued with the illustrative example and considered the effect of intermediate and derived data on estimation. The conclusions
were:
- Quantities that are important to business logic must be counted in the estimation process, whether or not they form a part of the final persistent data structure and whether or not they are fundamental.
- For estimation purposes, entities, attributes and relationships are considered in their logical sense.
- The process followed for data-based estimation assists in transaction discovery, leading to more complete & accurate function-based estimation and to potential system re-scoping in good time.
Agenda
Having considered the significance of data elements that may not appear in the final data structure, we now turn to the arrangement of data elements across entities, as it may or may not exist in the final data structure.
In this article we’ll continue with the example of the book-lending facility at a public library that has served us through the last three articles. In the course of this article, we’ll establish the significance of normalized data to understanding and estimating a system. It will also become evident that denormalization should be a carefully applied final step, performed on a data structure that has first been well normalized. Denormalization is not a short-cut approach to data design, and therefore not a way of shielding designers’ shortcomings in normalizing data.
To make the point vivid, we’ll consider some rather ugly instances of denormalization and work backwards to establish the importance of normalized data to the estimation process. We’ll also discuss how association classes in a normalized data structure are treated in the process of estimation.
Before we begin, it will be useful to have at hand, for ready reference, a view of the important data elements and the entities owned by our subsystem. These are provided in figs. 1 & 2.
Denormalization
Now imagine that someone decides to denormalize the relationship between Past Borrowing & Fine into one entity, which makes the data structure look as in fig. 3.
With this denormalization exercise, Current Borrowing and Past Borrowing no longer have identical attribute structures. Past Borrowing has additional optional attributes (thanks to the presence of zero or one Fine associated with each Past Borrowing) and an optional relationship with Total Fine. The data owned by our subsystem may therefore be represented as in fig. 4.
Now, applying the data-based approach to these entities owned by our subsystem, we have:
E = 3, R = 1 & A = 4 + 4 + 5 = 13
Therefore UFP = (1.42 x 13 x (1 + 1/3)) + (8.58 x 3) + (13.28 x 1)
= 24.61 + 25.74 + 13.28
= 63.63 ~ 64
This is ~ 82% of data-based UFP without denormalization.
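As a quick check on this arithmetic, here is a minimal Python sketch of the data-based UFP formula used throughout this series (the coefficients 1.42, 8.58 and 13.28 and the (1 + R/E) weighting are as established in the earlier parts); the function name is illustrative only.

```python
def data_based_ufp(entities: int, relationships: int, attributes: int) -> float:
    """Data-based Unadjusted Function Point estimate:
    UFP = 1.42 * A * (1 + R / E) + 8.58 * E + 13.28 * R
    """
    return (1.42 * attributes * (1 + relationships / entities)
            + 8.58 * entities
            + 13.28 * relationships)

# Denormalized structure of fig. 4: E = 3, R = 1, A = 4 + 4 + 5 = 13
print(round(data_based_ufp(3, 1, 13), 2))   # 63.63, i.e. ~64 UFP
```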
Denormalization, in general, tends to give a lower estimate of effort in the data-based approach. The situation becomes particularly ugly when a ‘many’ relationship is involved. This not only raises the level of inaccuracy in the estimate, but also makes the data structure extremely tiresome to deal with, certainly at design time and often at runtime too.
As an example, a further, rather outrageous denormalization of Total Fine into Past Borrowing may give the following (this is too ugly and illogical to draw, being too reminiscent of the wide-flat-file days):
E = 2, R = 0 & A = 11,
Thus, UFP = (1.42 x 11 x (1 + 0/2)) + (8.58 x 2) + (13.28 x 0)
= 15.62 + 17.16 + 0
= 32.78
This is ~ 42 % of data-based UFP without denormalization.
You have no doubt appreciated by now that denormalization can give disastrously low estimates, more so in the data-based approach. Given that relationships, and to a lesser extent entities, contribute the most to UFP, and that denormalization lowers the extrinsic entity and relationship counts, analysts can end up shaving off a substantial quantity of UFP simply by clubbing entities together and thus making the relations between them intrinsic.
Indeed, for the most extreme case of denormalization, where the entire data is ‘stuffed’ into a single entity, we end up with E = 1 & R = 0, so UFP becomes 1.42A + 8.58.
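Plugging these figures into the same formula reproduces the numbers quoted above and shows how the expression collapses once relationships disappear; the attribute count used for the single-entity case below is purely illustrative.

```python
# Total Fine folded into Past Borrowing: E = 2, R = 0, A = 11
ufp = 1.42 * 11 * (1 + 0 / 2) + 8.58 * 2 + 13.28 * 0
print(round(ufp, 2))                                     # 32.78

# Everything stuffed into a single entity: E = 1, R = 0
a = 11                                                   # illustrative attribute count
collapsed = 1.42 * a * (1 + 0 / 1) + 8.58 * 1 + 13.28 * 0
print(round(collapsed, 2), round(1.42 * a + 8.58, 2))    # 24.2 24.2 -- the two forms agree
```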
Of course, in denormalizing to this extent we are violating the fundamentals on which FPA is based. Remember, FPA is essentially function-based. We have studied how the function-based approach works, and we know that the data-based approach is its simplified approximation. FPA is based on inputs, outputs & logical business entities (nouns, and the adjective qualifiers that may be associated with ‘things’ in the business / user domain). A denormalized physical data model, while useful for certain performance considerations, is therefore a false start for FPA-based estimation. The situation is particularly bad when the data-access and data-manipulation pattern deviates widely from the underlying assumptions of FPA that allowed us to establish the various formulae in the second and third parts of this series. Additionally, because FPA-based estimates are produced early in a project, they do not demand a view of the data as it will be implemented for best system performance; they simply require an inventory of the data, its elements & logical relations.
As a rule of thumb, therefore: if during your analysis you can perceive a thing by itself, let it remain thus, rather than huddling it with or into another thing, and let it be associated with other things by relations which are explicit.
And that, by the way, serves to remind us of what denormalization really is. It is essentially a very selective and cautious un-normalizing of a well-normalized data structure, undertaken to attain very specific data-retrieval benefits (often, but not always, attained, owing to several ‘physical’ factors), after carefully considering the losses to, most notably, data manipulation. Denormalization is neither the default way of representing data nor an excuse for not doing normalization. Indeed, a well-normalized logical data structure must exist and should be maintained for every physical data structure defined to implement it. All changes to the data structure resulting from changed functional requirements of the system must first be applied to the underlying normalized data structure. A further exercise of normalization, as may be warranted by a significant functional change, must then be undertaken before the changes are reflected onto the operationally denormalized form.
In our example, you may be wondering about one thing. In the third part of this series, and indeed as depicted in fig. 2, our subsystem comprised three related entities: Borrowing, Fine & Total Fine. In this article, as we denormalized Fine into Past Borrowing, we noted that Present and Past Borrowing are no longer equivalent. We therefore once again had three entities; indeed, our attribute count went up by 30%. The reduction we obtained in the estimate came from treating Present and Past Borrowing as unrelated. You may well argue here that the two entities are structural siblings, functionally bound by a ‘becomes’ (and therefore mutually exclusive) relationship.
Indeed, I was making a point here, and I believe I pressed it home by going a step further, to the undoubtedly ridiculous length of denormalization that crunched the estimate down to 42% of its original size.
However, I do acknowledge that the relationship between the two Borrowing entities does make a difference, particularly in the OO paradigm. I would like you to ponder this over the next three months. When we consider it in the next article, we’ll also weigh the various arguments, for and against, mentioning the two Borrowing entities separately, rather than keeping the entity structure confined to their parent entity.
Normalization
There will be some among you who have a query on the other side of the normalization-denormalization line. Suppose we have a many-to-many relationship between two entities, which in the normalized form becomes a third entity subordinate to the first two (or an association class, if you speak OO languages); how then should we count the number of entities and relationships? The simple answer, which I have found practically effective, is this: if the third entity, which represents the many-to-many relationship between the first two, merely stores references to the two entities it is associated with & has no logic or information of its own, then it is useful to consider it non-existent for the purpose of estimation (i.e. the many-to-many relationship between the two other entities that it represents should be counted as a single relationship between those two entities). This is because, while implementing business logic, this third entity simply carries the one-to-many relationships coming into it from either side to manifest the many-to-many relationship, acting as a grid point and not itself an originator of transactions.
More often than not, however, this linkage entity also contains some useful information of its own and has business logic associated with it. In such a case (which represents the vast majority of situations), it & all relationships leading into it deserve a separate count, as the sketch below illustrates.
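The counting rule can be summarized in a small sketch; the function name and its boolean parameter are purely illustrative, capturing the rule of thumb described above rather than any published FPA procedure.

```python
def link_entity_counts(has_own_info_or_logic: bool) -> tuple[int, int]:
    """Extra (entities, relationships) contributed by an entity that
    resolves a many-to-many relationship between two other entities.

    - A pure link (references only): ignore it and count the underlying
      many-to-many as a single relationship between the two entities.
    - A link with its own information or business logic: count it as an
      entity plus the two relationships leading into it.
    """
    if has_own_info_or_logic:
        return 1, 2   # the link entity and both one-to-many relationships
    return 0, 1       # just one relationship between the original entities

print(link_entity_counts(False))  # (0, 1)
print(link_entity_counts(True))   # (1, 2)
```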
Conclusions
In this article we established that the level of denormalization / normalization has a significant effect on the extrinsic count of entities & relationships, and therefore on UFP.
Some of the important points noted here include:
- A denormalized data structure lowers the extrinsic entity and relationship counts and can therefore give a disastrously low estimate, particularly in the data-based approach.
- FPA-based estimation should work from the normalized logical data structure; denormalization is a selective, carefully considered final step taken for specific data-retrieval benefits, not a shortcut in data design.
- An entity that merely resolves a many-to-many relationship and carries no information or logic of its own may be treated as a single relationship between the entities it links; if it does carry information or logic, it and the relationships leading into it are counted separately.
What’s next
FPA came into being and attained popularity in the procedural programming paradigm. The emergence of object-oriented and object-relational data structures and their corresponding implementation platforms, with their focus on encapsulation and inheritance, naturally requires us to evolve our analysis approach from how we applied it in classical FPA. I’ll endeavor to highlight this paradigm shift and its impact on the estimation process in the next article.
[1] Amit Bhagwat – Data Modeling & Enterprise Project Management, Part 1: Estimation – TDAN (Issue 26)
[2] Amit Bhagwat – Data Modeling & Enterprise Project Management, Part 2: Estimation Example – The Function-based Approach – TDAN (Issue 27)
[3] Amit Bhagwat – Data Modeling & Enterprise Project Management, Part 3: Estimation Example – The Data-based Approach – TDAN (Issue 28)
[4] Amit Bhagwat – Data Modeling & Enterprise Project Management, Part 4: Estimation – Considering Derived & Intermediate Data – TDAN (Issue 30)
[5] The square brackets used for certain attributes within the data structures newly created in this article are intended to draw readers’ attention to the attributes that are repositioning themselves among the entities.