- What is the unit of analysis?
- Who/what is the population of interest?
- What is the outcome?
- What is the time frame?
- How will we measure success?

The solution you build is determined by the way you frame the problem.

What is the Outcome?

Consider the notion of renewal in fundraising. Gifts to a non-profit are freely made on a date chosen by the donor. There is no termination of services if no gift is made. Nevertheless, it is useful for organizations to think about donors who make gifts on a regular, “renewing” basis and those who lapse. Stop reading for a moment and ruminate on what outcome you might define for this situation.

Ponder a different application where now you are interested in finding new customers who look like your best customers. How might you define your best customers? Options abound, right? You might define best customers based on total purchasing dollars, say more than $250 in lifetime spending. Or, maybe the total number of purchases is a better metric for your business, and three or more purchases define a good customer for you.

These examples illustrate an important point about selecting and defining an outcome. It is very useful to parameterize the outcome (e.g., more than 13 months, greater than $250, etc.). This way, you can try various options by simply changing a number. By experimenting with the parameter value, you can find the one that makes the most sense and works best for your application.
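As a minimal sketch of what parameterizing the outcome can look like, the Python below labels customers as "best" using thresholds that are defined in one place. The `Customer` record, its field names, and the function names are hypothetical and chosen only for illustration; the $250 and three-purchase cut-offs come from the examples above.

```python
from dataclasses import dataclass

# Outcome parameters, defined in ONE place so they are easy to change.
MIN_LIFETIME_SPEND = 250.00   # dollars; from the example above
MIN_PURCHASE_COUNT = 3        # purchases; from the example above


@dataclass
class Customer:
    # Hypothetical record layout, for illustration only.
    customer_id: str
    lifetime_spend: float
    purchase_count: int


def is_best_by_spend(c: Customer, min_spend: float = MIN_LIFETIME_SPEND) -> bool:
    """Yes/no outcome: lifetime spending strictly greater than the threshold."""
    return c.lifetime_spend > min_spend


def is_best_by_count(c: Customer, min_count: int = MIN_PURCHASE_COUNT) -> bool:
    """Yes/no outcome: at least `min_count` purchases."""
    return c.purchase_count >= min_count


customers = [
    Customer("A", 310.00, 2),
    Customer("B", 120.50, 5),
    Customer("C", 250.00, 3),
]

# Try several threshold values by changing only a number.
for threshold in (100.00, 250.00, 500.00):
    labels = {c.customer_id: is_best_by_spend(c, threshold) for c in customers}
    print(threshold, labels)
```

Note that the two definitions disagree on customers A and B, which is exactly the kind of difference worth discussing with your client before settling on an outcome.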

A first note of guidance about choosing an outcome – be flexible. As mentioned above, be willing to experiment with your definition. And, be willing to make changes even late in the data mining process. Selling price seems like the obvious outcome for the residential real estate application mentioned earlier. When I built a preliminary model, however, I found that the assessed value of the property was an extremely strong predictor – overwhelming almost all other factors. After discussion of this fact with my client, we decided it would be worthwhile to predict the percent difference between the selling price and the assessed value. The assessed value provides a “stake in the ground,” a benchmark, for the selling price, and it also has a strong intuitive appeal for real estate agents. Our predictive model could then focus directly on other, finer-grained attributes that cause the selling price to deviate from the assessed value benchmark.
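The reframed outcome can be written as a one-line calculation. This is only a sketch: the exact formula is not given above, so dividing by the assessed value is an assumption, and the function name and dollar figures are hypothetical.

```python
def percent_difference(selling_price: float, assessed_value: float) -> float:
    """Percent by which the selling price deviates from the assessed value.

    Assumption: the assessed value is the benchmark (the denominator).
    """
    if assessed_value <= 0:
        raise ValueError("assessed_value must be positive")
    return 100.0 * (selling_price - assessed_value) / assessed_value


# Hypothetical figures: a home assessed at $200,000 that sells for $215,000.
print(percent_difference(215_000, 200_000))  # 7.5
```

A positive value means the property sold above its assessed value, and a negative value means it sold below.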

A second note of guidance – create a yes/no type of outcome, at least initially. For instance, try to predict whether a customer will purchase more than $250 in goods or services, rather than trying to predict the dollar amount of total lifetime spending. Why? First, building a yes/no classification model is simpler than building a continuous prediction model. Second, you will gain tremendous insight about the harder problem by solving the simpler one first. Finally, business processes often embed cut-offs anyway (e.g., sending special promotions to the customers with more than $250 in total spending). Why not design the cut-off in from the start, during the framing process?

What is the Time Frame?

In defining a lapsed donor in the section above, we alluded to a time window of 13 months. We also mentioned the notion of lifetime[2] spending when we defined our best customers. These are just two illustrations of how time plays a role in framing the data mining problem.

In yes/no classification problems, the time aspect is less obvious, more implicit. For instance, you might build a model to classify your customers into those who will renew their service contract and those who won’t. No time component is called out explicitly. But, if you look closely, you will realize that this classification only has value for some time before a customer terminates or renews their contract[3]. Further, only information available before the outcome is known can be used to make the classification. It’s subtle, but do these time-based distinctions make sense to you?

For you to frame a data mining problem, various time elements must be considered and reviewed with your client. These include the following:

- The time horizon for prediction. That is, how far into the future do you wish to predict the outcome? As mentioned above, do you want to predict stock prices two minutes, two hours, or two days into the future?
- The time window of relevant behavior. Here, we want to consider how far to “look back.” Are the last 12 months of purchases sufficiently rich to predict the next 12 months? Or is it the last 6 months or 3 months that really matter?
- The time base of the population. In Part 1, we considered the importance of choosing the population of interest when framing the data mining problem. We can and should further refine our population by choosing a time base. That is, did long-gone customers from 10 years ago behave like current customers? Or, should you limit the time base to be customers who have joined in the last three years?
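The three time elements above can be sketched in code as parameters measured around a single cutoff date. This is an illustration under stated assumptions, not a prescribed method: the parameter values, the function name `build_example`, and the purchase dates are all hypothetical. The point it shows is that features use only behavior inside the look-back window, while the yes/no outcome uses only what happens afterward, within the prediction horizon.

```python
from datetime import date, timedelta

# Time-frame parameters (hypothetical values), kept in one place.
LOOKBACK_DAYS = 365                  # time window of relevant behavior
HORIZON_DAYS = 365                   # time horizon for prediction
TIME_BASE_START = date(2021, 1, 1)   # time base: customers who joined on/after this date


def build_example(join_date, purchase_dates, cutoff):
    """Return (features, label) for one customer, or None if outside the time base.

    Features use only purchases in the look-back window that ends at `cutoff`;
    the label uses only purchases in the horizon that follows `cutoff`.
    """
    if join_date < TIME_BASE_START or join_date > cutoff:
        return None
    window_start = cutoff - timedelta(days=LOOKBACK_DAYS)
    horizon_end = cutoff + timedelta(days=HORIZON_DAYS)
    past = [d for d in purchase_dates if window_start < d <= cutoff]
    future = [d for d in purchase_dates if cutoff < d <= horizon_end]
    features = {"purchases_in_lookback": len(past)}
    label = len(future) > 0   # yes/no: any purchase within the horizon
    return features, label


cutoff = date(2023, 1, 1)
purchases = [date(2021, 6, 1), date(2022, 3, 15), date(2022, 11, 2), date(2023, 4, 20)]

print(build_example(date(2021, 5, 1), purchases, cutoff))
print(build_example(date(2015, 5, 1), purchases, cutoff))  # joined before the time base
```

Changing `LOOKBACK_DAYS` from 365 to 180 or 90 is then a one-number experiment, in the same spirit as parameterizing the outcome.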

How Will You Measure Success?

Defining success is the first component of this framing task. The second component is figuring out how to measure it. Why is measurement challenging? I think it’s because the predictive model is always embedded[6] in a larger business process (e.g., renewal marketing), and it can be hard to distill out the impact of the model. Or, even more challenging, some models require a new business process to be developed because one did not even exist before.

For the former scenario, where the model is embedded in a larger process, you might frame a “traditional-versus-model” test and measurement strategy. That is, two groups of customers are selected to get the same treatment – the standard renewal marketing approach. The first group is selected “traditionally,” say based on tenure. The second group is selected based on the rankings from the model. At the end of the renewal period, you measure the renewal rates for both groups. Because both groups receive the same treatment, a difference in renewal rates points to the selection method rather than to the marketing itself.
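The measurement step of such a test can be sketched in a few lines. The renewal results below are made up purely for illustration, and the function name `renewal_rate` is hypothetical.

```python
def renewal_rate(outcomes):
    """Fraction of customers in a group who renewed (outcomes are True/False)."""
    if not outcomes:
        raise ValueError("group is empty")
    return sum(outcomes) / len(outcomes)


# Hypothetical results at the end of the renewal period: True means renewed.
traditional_group = [True, False, False, True, False, False, True, False, False, False]
model_group = [True, True, False, True, False, True, False, True, False, False]

traditional_rate = renewal_rate(traditional_group)   # 3 of 10 renewed
model_rate = renewal_rate(model_group)               # 5 of 10 renewed

print(f"traditional: {traditional_rate:.0%}")
print(f"model:       {model_rate:.0%}")
print(f"difference:  {model_rate - traditional_rate:+.0%}")
```

With groups this small, a difference could easily be noise. In practice you would want much larger groups and some check on whether the gap is bigger than chance alone would produce.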

If a new business process needs to be developed to measure impact and success, I strongly advise you to get help. Data scientists are not business process experts – at least I’m not (ask me how I know). Partner with an expert to define the process, to get buy-in from all the stakeholders, and then implement it when you are ready for roll-out.

Wrap-Up

In this article, the second of a two-part series, we discussed “framing” a data mining problem – what that means and what value a human data scientist brings to the framing process.

In particular, we considered the final three questions of the five from my own framing checklist:

- What is the outcome?
- What is the time frame?
- How will we measure success?

With these framing questions in hand, you should be able to have productive and insightful conversations with your client – before you begin any data mining/predictive analytics project.

If you have any additional thoughts or questions about framing a data mining problem, call me or send me an email. My contact information is below. I hope to hear from you.

References:

1. See “Framing the Data Mining Problem – Part 1”, by Tim Graettinger.
2. Where lifetime might mean the customer’s tenure to date, or the last three years, or some similar type of duration.
3. Once the outcome (renew or terminate) is known, you really don’t need a model prediction of what is likely to happen. You already know what DID happen.
4. Preferably, and make that a strong preference, design your scripts and data flow/modeling diagrams such that you only need to change parameter values in ONE PLACE.
5. And they’re not even particularly good at that. All that they are is convenient.
6. The cheese does not stand alone. Ever.