Monday, June 23, 2008

MODELING ON CENSORED DATA

Introduction: The ABC Bank dataset currently available suffers from censoring. It contains only people who passed an earlier screening model. In addition, many people might have withdrawn from the loan offer on the basis of:

  • Loan Amount given to them
  • Interest Rate
  • Loan Term

This calls for a different approach to modeling, and I propose one here.

Axiom: The behavior of a customer depends on his / her personal characteristics and on the controllable factors, namely the loan amount, loan term, and rate of interest.

The underlying postulates are:

(i) the goodness / badness of a customer is first and foremost determined by his / her personal characteristics;

(ii) the controllable variables can push up or down the likelihood of goodness of a customer already determined by his / her personal characteristics;

(iii) among the controllable variables, the ‘loan amount’ is the one decided (by the customer) chronologically earliest, followed by the ‘loan term’; the ‘rate of interest’, decided by the lender, comes last.

Issue: The dataset on the controllable variables available right now is a ‘censored’ one, i.e., it does not span the entire ‘space’ in which future data can possibly lie. The usual approach of building a single model, with the personal characteristics and the controllable variables together, would entail ‘EXTRAPOLATION’ of the model into that part of the ‘space’ which the historical dataset has not encountered. Hence, a single model may not suffice.

The proposed approach: The proposed approach involves building multi-stage models that are not affected by the ‘censored’ nature of the available dataset on the controllable variables and hence can hopefully be applied to the ‘entire’ space of these variables.

The approach essentially builds a first-stage model that captures the log odds for ‘goodness’ of a customer on the basis of his / her personal characteristics. This first-stage model does not account for the effect of the controllable variables on the likelihood of ‘good / bad’ behavior of a customer; the ‘residual’ from this model contains that information.

The models in the subsequent stages, built on the residuals from earlier stages, would help update the log odds for ‘goodness’. The details are given below:

Stage I Model: Build a (logistic) model with only the personal characteristics of the customers as independent variables. Collect the residuals from this model.

How to get the residuals from this model?

The actual logit (log odds score) for a customer with DV = 1 is +∞ [log(1 / 0)] and for DV = 0 it is −∞ [log(0 / 1)]. But if the predicted probabilities (for DV = 1) are rounded off to 8 digits, the value ‘1’ can be approximated as 0.99999999 and the value ‘0’ as 0.00000001. Hence, the actual logit for a ‘1’ can be approximated as log(0.99999999 / 0.00000001) = 18.42068073, and the actual logit for a ‘0’ as −18.42068073.
From the model, get the predicted log odds L^ and obtain the residuals as:

Residual 1 = 18.42068073 − L^ if actual DV = 1

Residual 1 = −18.42068073 − L^ if actual DV = 0
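
The following is a minimal sketch of Stage I and its residuals, assuming a pandas DataFrame `df` with a 0/1 column 'good' (the DV) and personal-characteristic columns listed in PERSONAL. The column names and the use of scikit-learn are illustrative assumptions, not part of the original proposal.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression

PERSONAL = ["age", "income", "years_employed"]  # hypothetical IDVs

# log(0.99999999 / 0.00000001) ~= 18.42068073, as derived above
LOGIT_CAP = np.log(0.99999999 / 0.00000001)

def stage1_residuals(df: pd.DataFrame):
    # A large C approximates an unregularized (plain maximum likelihood) fit
    model = LogisticRegression(C=1e9, max_iter=1000)
    model.fit(df[PERSONAL], df["good"])
    l_hat = model.decision_function(df[PERSONAL])  # predicted log odds L^
    # Approximate "actual" logit: +18.42... for DV = 1, -18.42... for DV = 0
    actual = np.where(df["good"] == 1, LOGIT_CAP, -LOGIT_CAP)
    return model, actual - l_hat  # (fitted model, Residual 1)
```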

Stage II Model: Take Residual 1 from the Stage I model as the DV and the ‘loan amount’ as the single IDV, and build a linear model without an intercept term. This model may perhaps have to include a quadratic term as well. Specifically, the model could be

Residual 1 = β1 * Loan Amt + β2 * Loan Amt^2 + Residual 2
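
A sketch of Stage II, continuing with numpy and the names from the Stage I sketch above: a no-intercept least-squares fit of Residual 1 on the loan amount and its square. Omitting the constant column from the design matrix is what makes the fit intercept-free; the column name 'loan_amount' is an assumption.

```python
def fit_stage(residual: np.ndarray, x: np.ndarray):
    X = np.column_stack([x, x ** 2])  # linear and quadratic terms, no intercept
    coef, *_ = np.linalg.lstsq(X, residual, rcond=None)
    return coef, residual - X @ coef  # (coefficients, residual for next stage)

model1, residual1 = stage1_residuals(df)
beta, residual2 = fit_stage(residual1, df["loan_amount"].to_numpy())
```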

Stage III Model: Take Residual 2 as the DV and the ‘loan term’ as the IDV and build the model

Residual 2 = γ1 * Term + γ2 * Term^2 + Residual 3
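
Stage III follows the same pattern and, under the assumptions of the earlier sketches (including the hypothetical column name 'term'), can reuse the fit_stage helper:

```python
# Stage III sketch: regress Residual 2 on term and term^2, no intercept
gamma, residual3 = fit_stage(residual2, df["term"].to_numpy())
```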

Stage IV Model: Take Residual 3 as the DV and the ‘rate of interest’ as the IDV and build the model

Residual 3 = δ1 * Rate + δ2 * Rate^2 + Residual 4
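
Stage IV is the same again, still a sketch with an assumed column name 'rate':

```python
# Stage IV sketch: regress Residual 3 on rate and rate^2, no intercept
delta, residual4 = fit_stage(residual3, df["rate"].to_numpy())
```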

The Ultimate Score:

The ultimate (updated) log odds score for the customer is obtained as

L^ + β1 * Loan Amt + β2 * Loan Amt^2 + γ1 * Term + γ2 * Term^2 + δ1 * Rate + δ2 * Rate^2

Note: The idea above is that the log odds estimate obtained from the Stage I model falls short of the true log odds by a quantity equal to Residual 1. This shortfall is bridged by building the Stage II model. Still, there is a shortfall (of a quantity equal to Residual 2), which is captured by the Stage III model. And so on.

The ultimate log odds L may be converted to a probability in the usual manner, p = e^L / (1 + e^L), as in the sketch below.
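
A sketch of scoring a new customer with the chained models above. All names (model1, beta, gamma, delta, and the feature values) carry over from the earlier sketches and are assumptions for illustration.

```python
def ultimate_logit(personal_features: pd.DataFrame, loan_amt, term, rate):
    l_hat = model1.decision_function(personal_features)[0]  # Stage I score L^
    return (l_hat
            + beta[0] * loan_amt + beta[1] * loan_amt ** 2  # Stage II
            + gamma[0] * term + gamma[1] * term ** 2        # Stage III
            + delta[0] * rate + delta[1] * rate ** 2)       # Stage IV

def to_probability(logit: float) -> float:
    return 1.0 / (1.0 + np.exp(-logit))  # p = e^L / (1 + e^L)
```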

Remark: This is only an initial proposal, which might require modifications. It will hopefully lead to further fine-tuning towards the best possible model.

What do you think of this concept? I welcome suggestions and criticisms on this proposal. Please send your views by email to me at khanal[dot]Bhupendra[at]gmail[dot]com or through a comment on this blog entry.

2 comments:

Dirk N said...

I am not quite sure why you use this iterative approach, but I think you should check the literature first; search for Heckman's selection bias model ('79).

Dirk

Anonymous said...

Yes! I agree with Dirk. We have proven modeling approaches like Heckman's two-step method and the 'Tobit' model to deal with censored data.

You can also follow a conditional maximum likelihood (CML) approach when the two stages' objectives (dependent variables) are conditional and discrete choices. For example: approval rate / response rate.
