Wednesday, October 29, 2008

5 Steps to Build a Predictive Model

I keep getting inquiries about the right way to build predictive models, and this post is my attempt to answer them. But first, let me make one thing clear: there are no definite rules for building a model, nor is any one method simply good or better. It depends on the analyst or modeler.
Here is my way. It is just one among the many ways.

1. Modeling Methodology and Entity
When data arrives, the first thing an analyst needs to do is study it, and then understand the business problem the model is going to be used for. These two things lead to the correct definition of Modeling Methodology and Entity.
Modeling Methodology could mean choosing among Linear Regression, Logistic Regression and Poisson Regression. Entity defines the level at which the model is going to be built.

For banking, when assessing risk, the typical model built is a risk model. It is built with Logistic Regression on a Good/Bad Dependent Variable. The entity is usually the customer. But if you want to see the risk involved in each subsequent loan a person takes (and believe that risk changes with the number of loans a person holds), you could build at the Loan Level.
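
To make the customer-level risk model concrete, here is a minimal sketch of a Logistic Regression on a Good/Bad DV, fitted by plain gradient descent. The two input variables, the toy data and the function names are all made up for illustration; in practice you would use your statistical package's own logistic procedure.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def fit_logistic(rows, labels, lr=0.1, epochs=2000):
    """Fit weights (intercept first) by batch gradient descent on the log-loss."""
    weights = [0.0] * (len(rows[0]) + 1)
    for _ in range(epochs):
        grad = [0.0] * len(weights)
        for x, y in zip(rows, labels):
            p = sigmoid(weights[0] + sum(w * v for w, v in zip(weights[1:], x)))
            err = p - y
            grad[0] += err
            for j, v in enumerate(x):
                grad[j + 1] += err * v
        weights = [w - lr * g / len(rows) for w, g in zip(weights, grad)]
    return weights

def predict_bad_probability(weights, x):
    """Predicted probability that the entity (here: a customer) turns Bad."""
    return sigmoid(weights[0] + sum(w * v for w, v in zip(weights[1:], x)))

# Hypothetical customer-level rows: [utilisation ratio, missed payments last year]
customers = [[0.1, 0], [0.2, 0], [0.3, 1], [0.8, 2], [0.9, 3], [0.7, 2]]
is_bad = [0, 0, 0, 1, 1, 1]  # DV: 1 = Bad, 0 = Good

weights = fit_logistic(customers, is_bad)
low_risk = predict_bad_probability(weights, [0.15, 0])
high_risk = predict_bad_probability(weights, [0.85, 3])
```

For a Loan Level model the code stays the same; only the rows change, with one row per loan instead of one per customer.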

2. Time-line
Not all the data received is usable. The timeline needs to be finalized based on a good understanding of the data and the business.
Usually, for a bank's Credit Card portfolio, 42 months of data is taken. It includes a 24-month observation period and an 18-month performance period, i.e. all cards opened in the first 24 months are taken and their behavior over the subsequent 18 months is analyzed on a sliding window basis.
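
The 24 + 18 month split can be sketched as below. This is only an illustration: the function names are hypothetical, and I am assuming that each card's performance window starts in the month after it was opened, so that a card opened in month 24 finishes exactly at month 42.

```python
from datetime import date

OBSERVATION_MONTHS = 24
PERFORMANCE_MONTHS = 18

def add_months(d, months):
    """First day of the month that lies `months` months after d's month."""
    index = d.year * 12 + (d.month - 1) + months
    return date(index // 12, index % 12 + 1, 1)

def performance_window(open_date, data_start):
    """Return (start, end) of the card's own 18-month performance window,
    end exclusive, or None if the card was opened outside the first 24 months."""
    observation_end = add_months(data_start, OBSERVATION_MONTHS)
    if not (data_start <= open_date < observation_end):
        return None
    start = add_months(open_date, 1)  # assumption: starts the month after opening
    return start, add_months(start, PERFORMANCE_MONTHS)

data_start = date(2005, 1, 1)  # hypothetical first month of the 42-month extract
early_card = performance_window(date(2005, 1, 10), data_start)
last_card = performance_window(date(2006, 12, 15), data_start)
too_late = performance_window(date(2007, 2, 1), data_start)  # None
```

Because the window slides with the opening date, every card gets the same 18 months of performance regardless of when in the observation period it was opened.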

This is quite simple. But things are not always simple. If you need to build a Risk Model for the US Sub-Prime market now, this rule might not hold. The business scenario has changed a lot in the last 12 months, and a similar environment is unlikely to recur in the future. What to do here?

Taking a smaller window might help! Or you might have to be conservative while building the model and handle these issues at implementation time.

3. DV (Dependent Variable or Outcome Variable) Definition
DV definition has its own challenges. Usually, while deciding on the methodology, entity and timeline, there is already a definite idea of the DV. But there are areas where we need to go much deeper while defining it.

For a credit card, it's the risk over 18 months, but how do we define risk? It could be two cheque bounces or missed payments; it could be one; it could be crossing the credit limit and not paying thereafter, etc. Each of these has business logic behind it.
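
To show how much the choice matters, here is a small sketch that labels the same account under three of the definitions mentioned above. The field names, the four-month toy history and both function names are assumptions made for the example.

```python
def bad_by_missed_payments(months, threshold):
    """1 (Bad) if the account missed at least `threshold` payments in the window."""
    missed = sum(1 for m in months if m["missed_payment"])
    return int(missed >= threshold)

def bad_by_over_limit(months):
    """1 (Bad) if the balance crossed the credit limit and every later
    payment in the window was missed."""
    for i, m in enumerate(months):
        if m["balance"] > m["credit_limit"]:
            later = months[i + 1:]
            if later and all(x["missed_payment"] for x in later):
                return 1
    return 0

# Hypothetical performance-window history for one account (shortened to 4 months)
account = [
    {"balance": 400, "credit_limit": 1000, "missed_payment": False},
    {"balance": 700, "credit_limit": 1000, "missed_payment": True},
    {"balance": 650, "credit_limit": 1000, "missed_payment": False},
    {"balance": 300, "credit_limit": 1000, "missed_payment": False},
]

labels = {
    "one_missed_payment": bad_by_missed_payments(account, threshold=1),
    "two_missed_payments": bad_by_missed_payments(account, threshold=2),
    "over_limit_then_unpaid": bad_by_over_limit(account),
}
```

The same account is Bad under the strictest definition and Good under the other two, which is why the business logic behind the definition has to be settled before modeling starts.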

For building a Response Model for an Email Campaign, the DV is response, but how do you define response? Is it opening the email? Replying to it? Or taking some action based on it? It could be any of these.

4. Sampling
This is slightly different from the other steps. In most scenarios I would prefer to use all the data that is available and not sample down or up. But certain business issues, data issues or technical BI tool issues can force some sampling.

Oversampling could be needed when there are too few data points, and most of those records are good (or, equally, mostly bad). Say, for example, we have 1000 records and 990 are goods. A straightforward way to move forward is to increase the bads by taking them multiple times.
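
Taking the bads multiple times can be sketched as follows for the 1000-record example. The target share of 10% bads, the record layout and the function name are my own assumptions for the illustration.

```python
import random

def oversample_bads(records, target_bad_share=0.1, seed=42):
    """Repeat randomly chosen bads until they make up roughly the target share."""
    rng = random.Random(seed)
    goods = [r for r in records if r["bad"] == 0]
    bads = [r for r in records if r["bad"] == 1]
    # Number of bads needed so that bads / (goods + bads) is about the target.
    needed = round(target_bad_share * len(goods) / (1 - target_bad_share))
    extra = rng.choices(bads, k=max(0, needed - len(bads)))  # drawn with replacement
    sample = goods + bads + extra
    rng.shuffle(sample)
    return sample

# 1000 records, of which 990 are goods and 10 are bads
records = [{"id": i, "bad": 1 if i < 10 else 0} for i in range(1000)]
oversampled = oversample_bads(records)
bad_count = sum(r["bad"] for r in oversampled)
```

Keep in mind that the repeated bads add no new information, and the predicted bad rate from a model built on such a sample has to be adjusted back to the original proportion before it is used.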

The other case could be one where there are 5 million records. Is it necessary to take the whole data set?
I would say yes. But working with 5 million records brings its own issues. I doubt any tool apart from SAS can handle data this large. And even SAS will take a long time on any simple step, and an iterative process like modeling will just kill anybody's patience.
Solution: sample down. Either take a random X% or do stratified sampling by taking Y% of the goods and Y% of the bads.
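
Both ways of sampling down can be sketched in a few lines. The record layout, the function names and the 10% fraction are assumptions for the example; the stratified version takes the same Y% from the goods and from the bads, so the bad rate of the sample matches the full data.

```python
import random

def random_sample(records, fraction, seed=0):
    """Simple random sample of X% of all records, without replacement."""
    rng = random.Random(seed)
    return rng.sample(records, round(len(records) * fraction))

def stratified_sample(records, fraction, key="bad", seed=0):
    """Take the same Y% from each class so the good/bad mix is preserved."""
    rng = random.Random(seed)
    groups = {}
    for r in records:
        groups.setdefault(r[key], []).append(r)
    sample = []
    for members in groups.values():
        sample.extend(rng.sample(members, round(len(members) * fraction)))
    return sample

# Hypothetical portfolio: 10,000 records with a 5% bad rate
records = [{"id": i, "bad": 1 if i < 500 else 0} for i in range(10000)]

simple = random_sample(records, 0.10)
stratified = stratified_sample(records, 0.10)
stratified_bads = sum(r["bad"] for r in stratified)
```

With a plain random sample the number of bads varies from draw to draw; the stratified sample always keeps exactly 10% of each class.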

5. IDVs (Independent Variables)
IDV generation is again an art in itself. It is always good to take as much data as you can get and create as many derived variables as you can from it. This opens the window for more advanced statistical tests and helps in capturing performance from all angles.

Some of the data sources for IDVs for Financial Clients would be: internal operations data (like sales, call center usage, etc.), demographic data, personal information of customers, economic data, seasonal factors, etc.

The ways of creating derived variables could be:
a. Log of original variable [log(No. of bank card trades)]
b. Exponential of original variable [e^(No. of Bad Loans)]
c. Difference between two similar variables [Loans in last 5 years - Loans in last 1 year]
d. Interaction Variables using CART
e. Inverse Relations [1/No. of Loans]
f. Series Functions, e.g. for No. of Loans: 1/(Loans in 1 yr) + 1/(Loans in 2 yrs)^2 + ...
g. Binned Variables using Single Factor Analysis [Age (<18, 18-30, 30-65, 65+)]
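
Most of these derivations are one-liners. Here is a rough sketch for a single customer, covering everything except the CART interactions, which need a tree-building tool. The field names and the customer are invented, I am assuming all counts are positive (log and 1/x break at zero), and I have put the shared bin edges (30, 65) into the higher bin.

```python
import math

def age_bin(age):
    """Bins from the list above; a shared edge goes to the higher bin (assumption)."""
    if age < 18:
        return "<18"
    if age < 30:
        return "18-30"
    if age < 65:
        return "30-65"
    return "65+"

def derive(customer):
    by_year = customer["loans_by_year"]  # loans in the last 1, 2, 3, ... years
    return {
        "log_bank_card_trades": math.log(customer["bank_card_trades"]),
        "exp_bad_loans": math.exp(customer["bad_loans"]),
        "loans_1_to_5_years_ago": by_year[4] - by_year[0],
        "inverse_loans": 1.0 / customer["loans"],
        # 1/(loans in 1 yr) + 1/(loans in 2 yrs)^2 + ...
        "loan_series": sum(1.0 / n ** k for k, n in enumerate(by_year, start=1)),
        "age_bin": age_bin(customer["age"]),
    }

# Hypothetical customer record
customer = {
    "bank_card_trades": 4,
    "bad_loans": 1,
    "loans": 6,
    "loans_by_year": [2, 4, 5, 6, 6],
    "age": 42,
}
features = derive(customer)
```

In a real build you would run this over every row and then test each derived variable against the DV before deciding which ones to keep.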

More readings:
1. Modeling on Censored Data
2. Building Predictive Model
3. Time Series Data

I end my note here. It would be interesting to see what you make of it. There may be places where you agree and others where you disagree; I would like to know both. Please share your comments and feedback through a comment or mail.


Ken Kaufman said...

Great stuff - very academic and thorough. Thanks!

ruchi said...

Good and helpful information about data modeling.
I would like to know how SAS can be beneficial in Data Mining & Analytics.
Is it really helpful to learn SAS as I do not have a prior experience in it.
