This is the first in a series of articles from Amit Bhagwat.
Data modeling is no doubt one of the most important and challenging aspects of developing, maintaining, augmenting and integrating typical enterprise systems. More than 90% of the functionality of enterprise systems is centered around creating, manipulating and querying data. It therefore stands to reason that individuals managing enterprise projects should leverage data modeling to execute their projects successfully and deliver not only capable and cost-effective but also maintainable and extendable systems. A project manager is involved in a variety of tasks including estimation, planning, risk evaluation, resource management, monitoring & control, delivery management, etc. Virtually all of these activities are influenced by the evolution of the data model and may benefit from taking it as the primary reference. This series of articles by Amit Bhagwat will go through the links between data modeling and various aspects of project management, beginning with the importance of the data model in the estimation process.
Have you ever wondered why COBOL has been such a smashing success with business systems? If you have, you will have reflected on what business systems generally do. A very tiny number of them may involve complex reactive algorithms, possibly requiring real-time performance and actions that need multiple levels of procedural factoring. However, an overwhelmingly large number of business systems (or most subsystems therein, at any rate) involve creating, manipulating and querying a simple set of data, possibly taking it through steps involving intermediate simple sets of data and potentially leading to a third simple set of data, more often than not storing a trail of all the data that may have been created on the way to the destination. The moral of the story is: a business system is, more than anything else, about eternally performing a modest set of operations on data.
Unfortunately, a majority of business systems projects are either outright failures (involving unacceptable overrun or lack of tangible progress, in either case leading to abandonment of the activity) or are ‘challenged’ (bogged down by overruns, near misses or ‘prudent’ functional compromises). Usually the complaint is that the project workers, and particularly the estimators, did not understand the system well the first time around, or that the business had changed by the time they did catch up with its earlier perception. Given that business systems largely involve creating data from data through data, the above reasoning translates into: the project workers / estimators were wrong or inadequate in their understanding of the underlying data manipulation, and therefore of the structure / model of the data (persistent or otherwise). It therefore stands to reason that what a business systems project should achieve, and consequently in what time and resource framework it should achieve this, is dictated by the underlying data model. In other words, realistic (this is the earthly equivalent of ‘accurate’ in wonderland) estimation of effort in business systems projects is a function of a fairly representative data model of the business.
There are almost as many approaches to project estimation as there are to the overall execution of project activity. These may vary with the technique of requirement elucidation and with a predictive or evolutionary approach to project execution. However, they are all classifiable based on when the estimate is made. We could estimate:
- When we have just understood the problem
- When we have worked out the solution
Of course, as is often the case, the former is used as a starter and the latter is applied later to correct it.
For example, if we are using the Use Case technique for requirement elucidation, following the first approach, we:
- Write the use cases
- Stack them up based on which level of implementation they are addressing (e.g. will the use case lead to an integration of nearly independent systems, or just a system, a subsystem group, or an individual subsystem?)
- Count the use cases belonging to each level & multiply them by the effort per use case associated with an estimation model.
- Apply various factors to this estimate to account for the difference between the use cases in hand and those in the estimation model. Examples of these factors include:
- Size quotient for the use cases (as our use cases may be doing slightly more or less work compared to that prescribed in the estimation model),
- Project inertia factor (this takes into account that if x people do work y in z days, then we are likely to need a number of people different from, and usually greater than, 10x to do 10y work in the same time),
- Domain complexity factor
- Non-functional complexity factor (as use cases by themselves focus on functional requirements);
- Add the efforts for all levels of use cases to get the end estimate.
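To make the first approach concrete, here is a minimal sketch in Python; the level names, per-use-case efforts and factor values are purely illustrative placeholders, not figures from any published estimation model:

```python
# Hypothetical effort (person-days) per use case at each implementation level.
EFFORT_PER_USE_CASE = {"system": 15.0, "subsystem_group": 8.0, "subsystem": 4.0}

def use_case_estimate(counts, size_q=1.0, inertia=1.0,
                      domain=1.0, non_functional=1.0):
    """Multiply use-case counts per level by the model's effort per use
    case, then apply the adjustment factors described in the text."""
    base = sum(EFFORT_PER_USE_CASE[level] * n for level, n in counts.items())
    return base * size_q * inertia * domain * non_functional

# Example: 2 system-level and 10 subsystem-level use cases,
# with a 10% project-inertia surcharge.
effort = use_case_estimate({"system": 2, "subsystem": 10}, inertia=1.1)
```

The real work, of course, lies in calibrating the per-level efforts and factors against past projects; the function merely fixes the arithmetic.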
Following the second approach, again using the Use Case technique, we would:
- Perform use case realization, i.e. develop the design to a varying degree of completeness, based on the requirements
- Find out, through the realization (analysis and design) process, all operations assigned to elements of the solution
- Obtain an estimate for each functional realization at the lowest level
- Sum the estimates for low-level realizations and apply factors to account for the overheads of integration, effort inertia and non-functional complexity
Clearly, approach 2 is far more accurate than approach 1. It does, however, take us through perhaps over 30% of our project schedule, and is more often than not impracticable. This is because, to the project sponsor, it says: ‘Let me squander an unspecified amount of resource that is an uncomfortably large portion of the overall resource requirement; I may then be able to come up with a fairly accurate estimate for the rest of the project (at which stage you would more likely than not faint and cancel the project)’. Thus, this approach is affordable in its complete majesty only for projects that are so essential as to have no no-go decision. In such cases though, the counterargument is: ‘why bother with such precision of estimation if the project is to go on regardless? It may perhaps suffice to give the sponsor an estimate within the right order of magnitude.’
The moral: effort estimates are best made with adequate knowledge not only of the solution domain but also of the solution itself. However, to be able to make estimates, and make them in a way that allows a business operating in a competitive environment to make a prudent go / slow-go / no-go decision, they should be made in good time, thus restricting the level of definition of the solution.
In the example discussed in the earlier section, we commented on the ‘royal projects’, which must go on, whatever the cost may be. However, for businesses in a competitive environment, this is rarely an option. Most commonly, they therefore follow one of two funding models:
The Bid Model
Here a problem is presented, typically soliciting proposals from suppliers. The supplier with the most cost-effective bid generally has the best chance of bagging the assignment. Typically, bids are based on a limited understanding of the problem and little work on the solution. Their success is therefore dictated by the familiarity of the solution provider with the domain, whereby the provider can in effect produce an illusion of having defined the solution. The approach is inherently risky, but most appealing upfront to the sponsors.
The Utilization Model
This is more often than not used for ‘internal projects’. The IT department typically has a fixed resource allocated, and it schedules its tasks so as to utilize, and at the same time attempt not to greatly exceed, its allocated resource. In terms of project estimates, there still are effort-related numbers attached to projects, and some projects do have pressing schedules needing occasional resource injection. However, projects generally get executed at a predefined ‘burn rate’ and often take an incremental development approach to justify the efforts.
In real life, businesses typically take a mixture of these approaches with a few ‘royal projects’ getting in occasionally.
Data Modeling for Estimation
To sum up what we have discussed so far about business systems projects:
- They are predominantly about developing / maintaining / integrating means of creating / manipulating / querying business data
- Economics plays a significant role in their execution
- They most commonly take a bid-based or utilization-based resourcing style
- They often run the risk of failing or getting challenged due to:
- Limited understanding of the problem
- A changing problem
- Perceived understanding of the problem lacking validation, due to lack of solution conception prior to estimation
- There is almost never sufficient time to completely conceive the solution before estimation, though there is a strong case for conceiving the most important aspects of the solution prior to the estimate
It is therefore fair to suggest that there is a strong case for conceiving the data model upfront, to give the conceived solution and associated effort estimate some validation and stability. In
other words, it is prudent to go a step beyond requirement elucidation and into analysis & design to get reliable estimates, and the candidates for this early realization should be the data
elements associated with the implementation.
Analysis and Design
At this point, we are treading along the border between the what and the how. I should therefore stress that I do not have a very specific solution here that is ‘ready to fit all feet’, given that I am not referring to a specific project environment / domain / dynamics. I will therefore shy away from going a great deal into the how, and will try to suggest the steps in the what.
Data modeling involves two essential steps. The first step, that of analysis, focuses on finding nouns or ‘things’ in the description of requirements. The relationships between these things, and the constraints involved in those relationships, are next worked out. It is not necessary to structure the things as entities, attributes, compositions, etc. at this stage. Here we take stock of the raw data elements associated with the system being worked on and the way they influence action on other data elements. It is important not to disregard intermediate and / or derived data elements, including those produced ad hoc or as transients while obtaining the final data, since these too are as much subjects of operations as the others (the elementary constituents of the final data structure). Typically, this process provides enough information to conduct data-structure-based estimation.
The second step, that of design, leads to the final data structure, typically comprising fewer intermediate / derived elements, but with a well-defined attribute-entity-composition hierarchy and the desired level of normalization.
Linking Data Model to Estimation
Two approaches to estimation emerge from classical Function Point technique:
- Function-based approach
- Data approach
The former is credited with being more accurate than the latter, but is also quite tedious. Both assume a level of understanding of the business on the part of the estimator.

The Function-based Approach
This approach views the business requirements as composed of transactions (here meaning data operations). The transactions may involve system inputs, outputs and / or entity manipulation within the system. It is recommended to classify them based on their level of complexity in terms of the number of data elements involved (there is empirically backed guiding data available, say as provided by Charles Symons[i], to assist in this classification). It is then useful to perform CRUD analysis (finding whether the transaction creates, reads, updates or deletes data) and list all transactions along with their complexity and CRUD type. Next, the approach uses an empirically established and refinable multiplier table. The total function point count is then obtained by multiplying the number of transactions of each complexity & CRUD type by the respective multipliers and summing the results.
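As an illustration of the mechanics only (the multiplier values below are invented for the sketch; real weights, such as those published by Symons, come from empirical calibration), the table-driven count might look like this:

```python
# Illustrative multiplier table keyed by (complexity, CRUD type);
# real weights would come from empirically calibrated data.
MULTIPLIERS = {
    ("low", "C"): 3, ("low", "R"): 2, ("low", "U"): 3, ("low", "D"): 2,
    ("medium", "C"): 5, ("medium", "R"): 4, ("medium", "U"): 5, ("medium", "D"): 3,
    ("high", "C"): 8, ("high", "R"): 6, ("high", "U"): 8, ("high", "D"): 5,
}

def function_point_count(transactions):
    """Sum the table multiplier for each (complexity, crud_type) transaction."""
    return sum(MULTIPLIERS[(complexity, crud)] for complexity, crud in transactions)

# Three transactions: a simple read, a medium-complexity create, a complex update.
fp = function_point_count([("low", "R"), ("medium", "C"), ("high", "U")])
```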
A somewhat lengthy but potentially more accurate variant of this method is one where each transaction is analyzed in terms of its input, output and entity reference data, thus treating each transaction individually rather than as one among its kind, based on the complexity bracket it may fit into.
The Data Approach
This is regarded as a simpler but less accurate approach. Also, while it is simpler on the estimation itself and gives a simplified view of the data dynamics within the system, the approach is rather demanding in that it requires a detailed view of the entities involved, in terms of their attributes and relationships.
We first work out the number of entities (say E), attributes (say A) and inter-entity relationships (say R). We next assume (and this is a bit courageous, or at the least leads to functional restructuring) that each entity will typically be involved in exactly four transactions (namely C, R, U, D). We then work out the average number of entities per transaction. Each relationship involves two entities (discard n-ary relations anyway; we are talking about entities), therefore on average each entity accessed will have R / (E/2) = 2R/E related entities that will get involved in a transaction (this also ignores that entity dependencies are as a rule not symmetric, and that entities lower down the dependency tree may not drag in their higher relatives, whereas those higher up will drag in all their lower relatives). The number of entities in a transaction will therefore be 1 + 2R/E. Likewise, assuming (why? because it generally works empirically) that a transaction will involve a principal entity and all entities related to it, in a way that it will affect all attributes of the principal entity and half of those related, the number of fields (elementary facts) per transaction works out to A/E + ((2R/E) * (A/E) / 2) = A/E + RA/E². The method then continues on its assumption spree to suggest that half the transactions are Create or Update and a quarter each are Read and Delete. Then, by applying empirically established weighting factors to the entities manipulated as well as to the input & output field(s), summing together the effect of all these on each type of transaction, and summing the results for the different types of transactions, the approach arrives at the final function point count.
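The averaging assumptions above are simple enough to capture directly. The sketch below reproduces only that arithmetic, stopping short of the weighting factors, whose values are empirical:

```python
def data_approach_profile(E, A, R):
    """Derive the averages assumed by the data approach from entity,
    attribute and relationship counts."""
    entities_per_txn = 1 + 2 * R / E          # principal entity plus 2R/E related ones
    fields_per_txn = A / E + R * A / E ** 2   # A/E + ((2R/E) * (A/E) / 2)
    txn_count = 4 * E                         # one C, R, U and D per entity
    crud_mix = {"create_or_update": txn_count / 2,   # half the transactions
                "read": txn_count / 4,               # a quarter each
                "delete": txn_count / 4}
    return entities_per_txn, fields_per_txn, crud_mix

# Example: 10 entities, 60 attributes, 12 relationships.
entities, fields, mix = data_approach_profile(10, 60, 12)
# entities per transaction = 1 + 24/10 = 3.4
# fields per transaction   = 6 + 7.2  = 13.2
```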
Converting Function Points to Estimate
Following either approach, the function point count arrived at is multiplied by factors to account for the technical challenge of the activity, giving the function point index. This is then multiplied by factors representing the level of team instability and incompetence (a harsh word; it basically refers to the training requirement, running-in period involved, etc.) to arrive at the final effort estimate.
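A minimal sketch of this two-stage multiplication, with the factor values and the function-point-to-effort conversion rate assumed purely for illustration:

```python
def effort_estimate(fp_count, technical_factors, team_factors, days_per_fp=1.0):
    """Raw FP count -> function point index (technical factors)
    -> effort (team factors), converted at an assumed days-per-FP rate."""
    fp_index = fp_count
    for factor in technical_factors:
        fp_index *= factor        # technical challenge of the activity
    effort = fp_index
    for factor in team_factors:
        effort *= factor          # team instability / training overhead
    return effort * days_per_fp

# 100 function points, 20% technical uplift, 10% team running-in uplift.
estimate = effort_estimate(100, [1.2], [1.1])
```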
Getting the best of both
As indicated earlier, the data approach involves a much higher level of assumption and therefore a higher level of deviation. It does, however, establish the influence of data elements (including any subordinate elements) and relationships on the overall effort. The function-based approach seems to count the algorithmic actions involved, which maps it closely to the requirement specification and gives it a higher chance of being realistic. Beneath, however, it hints that intermediate / derived data is typically created by transactions, and this should not be ignored in conducting data-based estimation. There is therefore a case for taking the analysis model as the starting point (where data typically exists as facts, irrespective of which of these are derived, which are going to persist and which are subordinate).
The approach thus could be as follows:
- From the detailed functional elucidation, establish facts and the relationships between them
- Convert algorithmic details / business logic into transient / derived facts
- Establish constraints for the relationships between / among facts (if you stop thinking of facts as entities in an ER sense, you can think of genuine n-ary relationships in the analysis model)
- Establish the directionality of dependence with respect to relationships
- Reading through the requirement elucidation, count the facts involved in each section, directly or through dependence
- Sum this count (call it the ‘cumulative fact point’ if you like) and use it as the basis for estimation (run this metric for a few projects, find out the actual effort required at the end of each project, and thus establish the multiplier that will convert the cumulative fact count into effort, for a particular type of project, with a particular type of technology and typical team dynamics)
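The counting step can be sketched as a simple reachability computation over the fact dependency map; the fact names below are hypothetical:

```python
def cumulative_fact_points(sections, depends_on):
    """For each requirement section, count the facts it touches directly
    plus all facts reachable through the dependency map, then sum the
    per-section counts into a single cumulative fact point figure."""
    def reachable(facts):
        seen, stack = set(), list(facts)
        while stack:
            fact = stack.pop()
            if fact not in seen:
                seen.add(fact)
                stack.extend(depends_on.get(fact, ()))
        return seen

    return sum(len(reachable(direct)) for direct in sections.values())

# Hypothetical fact model: invoice_total is derived from line_amount and
# tax; line_amount in turn from qty and price.
depends_on = {"invoice_total": ["line_amount", "tax"],
              "line_amount": ["qty", "price"]}
sections = {"billing": ["invoice_total"], "pricing": ["price", "qty"]}
cfp = cumulative_fact_points(sections, depends_on)  # 5 + 2 facts
```

The multiplier that turns this count into effort is, as the text says, something to calibrate against completed projects.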
The approach will also work for changing functional requirements, which will typically map onto changing facts (intermediate facts, in any case). Thus, not only will it provide impact analysis for suggested changes, but it will also bring forth any data inconsistencies that such changes may introduce.
Interpreting Estimation for Funding Style
Having thus found the overall effort involved in a project, the bid model and other complete-project scenarios are straightforward to work with. Under the utilization model, we may be directed to complete only such portion of the activity as is affordable. Here, once again, the approach is to mark dependency chains of facts in the data model that have no intermediate joins to other chains (i.e. they are strict dependency chains and not dependency webs) and that have reached an end (have a definite system output as the lowest-level dependent). Each of these represents system functionality upon which no other functionality depends. It is then possible to decide to truncate (leave for the next budgetary allocation) a number of these chains, whereby some of the proposed system features are dropped / postponed to meet the budget.
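A sketch of how such truncatable chains might be detected mechanically, assuming the fact model is held as an acyclic map from each fact to the facts derived from it (all names hypothetical):

```python
def truncatable_chains(derives):
    """Find strict dependency chains ending in a system output (a fact
    nothing further is derived from) whose members feed no other chain,
    so the whole chain can be postponed without breaking anything else."""
    fed_by = {}                       # invert: which facts each fact is derived from
    for src, outs in derives.items():
        for out in outs:
            fed_by.setdefault(out, []).append(src)
    outputs = [f for f in fed_by if not derives.get(f)]   # terminal facts
    chains = []
    for out in outputs:
        chain, cur = [out], out
        # walk upstream while the path remains a strict chain
        while len(fed_by.get(cur, [])) == 1:
            parent = fed_by[cur][0]
            if len(derives.get(parent, [])) != 1:
                break                 # parent also feeds another chain: a join point
            chain.append(parent)
            cur = parent
        chains.append(list(reversed(chain)))
    return chains

# monthly_report hangs off a strict chain; invoice and statement share
# a join point at customer, so only their final links are truncatable.
deps = {"raw_reading": ["daily_avg"], "daily_avg": ["monthly_report"],
        "customer": ["invoice", "statement"]}
chains = truncatable_chains(deps)
```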
Choice of Modeling Technique
I have used the potentially controversial word ‘fact’ above to mean a data element, elementary or otherwise. Using this word, I have minimized use (and avoided use with multiple meanings) of another controversial word, ‘entity’, which is perceived differently based on whether we are doing problem analysis or data design, and further, based on which design method we are using.
We have established that to perform estimation we need an analysis-level data model, i.e. a model of all data, elementary or composite, basic or derived, persistent or transient, that justifies the existence of the system being built. We are therefore interested in data elements, their dependencies & constraints, and not in their compositional hierarchy or persistence mechanism. I have personally found UML to be adequate for this purpose, and the project teams I have worked with have been able to read it. UML also has the advantage of greater universality (compared, say, with countless flavors of ER) and formality, as well as the ability to associate function, along with extensibility. In addition, we can take a UML data analysis model and go on to produce both the database and the operational design.
If, however, we were to emphasize better verbalization of business information, formal visualization of data constraints and equal treatment of facts, we are probably thinking of using ORM for analysis and switching to UML / ERM for design.
If you have been shying away from estimating your business systems project efforts based on the underlying data model, perhaps this work will induce you to try it out, maybe as a secondary estimator to begin with. On the other hand, some of you may be using FPA or similar techniques and possibly experiencing some of the problems addressed here. You may therefore want to go through the steps defined above under ‘Getting the best of both’. In the meanwhile, I think I would do well to illustrate the approaches discussed here with an example case, in the next part.
[i] Charles Symons, Software Sizing and Estimation: MkII Function Point Analysis, John Wiley & Sons.