Introduction: The ABC Bank dataset at hand suffers from censoring. The data contains only applicants who were passed by an earlier screening model. In addition, many applicants may have withdrawn from the loan offer on the basis of:
- Loan Amount given to them
- Interest Rate
- Loan Term
The underlying postulates are:
(i) the goodness / badness of a customer is first and foremost determined by his / her personal characteristics;
(ii) the controllable variables can push up or down the likelihood of goodness of a customer, beyond what is already determined by his / her personal characteristics;
(iii) among the controllable variables, the customer decides the 'loan amount' first, then the 'loan term'; the 'rate of interest' is decided last, by the lender.
The approach builds a first-stage model that captures the log odds of 'goodness' of a customer from his / her personal characteristics alone. This first-stage model does not account for the effect of the controllable variables on the likelihood of 'good / bad' behavior of a customer; the 'residual' from this model contains that information.
The models in the subsequent stages, built on the residuals from earlier stages, would help update the log odds for ‘goodness’. The details are given below:
How to get the residuals from this model?
The actual logit (log odds score) for a customer with DV = 1 is ∞ [log(1 / 0)] and for DV = 0 it is –∞ [log(0 / 1)]. But, assuming the predicted probabilities (for DV = 1) are rounded off to 8 digits, the value '1' can be approximated as 0.99999999 and the value '0' as 0.00000001. Hence, the actual logit for a '1' can be approximated as log(0.99999999 / 0.00000001) = 18.42068073, and the actual logit for a '0' as –18.42068073.
From the Stage I model, get the predicted log odds L^ and obtain the residual (Residual 1) as:
18.42068073 – L^ if actual DV = 1
– 18.42068073 – L^ if actual DV = 0
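To make the residual computation concrete, here is a minimal sketch in Python with NumPy. The array names and the example probabilities are hypothetical; any fitted logistic model could supply the predicted probabilities.

```python
import numpy as np

# Cap corresponding to rounding probabilities to 8 digits, as described above:
# log(0.99999999 / 0.00000001) ≈ 18.42068073
CAP = np.log(0.99999999 / 0.00000001)

def stage1_residuals(y, p_hat):
    """Residual 1 = (approximated actual logit) - (predicted logit L^).

    y     : array of actual DV values (0 or 1)
    p_hat : Stage I predicted probabilities of DV = 1
    """
    l_hat = np.log(p_hat / (1.0 - p_hat))   # predicted log odds L^
    target = np.where(y == 1, CAP, -CAP)    # +18.42... for a '1', -18.42... for a '0'
    return target - l_hat

# illustrative example
y = np.array([1, 0])
p_hat = np.array([0.8, 0.3])
res1 = stage1_residuals(y, p_hat)
```

Note that the residual is large and positive for an actual 'good' whose predicted probability is low, and large and negative for an actual 'bad' predicted to be good, which is the information the later stages try to explain.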
Residual 1 = β1*Loan Amt + β2*Loan Amt^2 + Residual 2
Residual 2 = γ1*Term + γ2*Term^2 + Residual 3
Residual 3 = δ1*Rate + δ2*Rate^2 + Residual 4
The ultimate (updated) log odds score for the customer is obtained as
L^ + β1*Loan Amt + β2*Loan Amt^2 + γ1*Term + γ2*Term^2 + δ1*Rate + δ2*Rate^2
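The full staging can be sketched as below. This is an illustrative implementation under the assumption that each stage is an ordinary least-squares fit with no intercept, exactly as the stage equations are written; the data are randomly generated and all variable names are hypothetical.

```python
import numpy as np

def quad_stage(residual, x):
    """Fit residual ≈ c1*x + c2*x^2 by least squares (no intercept,
    matching the stage equations above); return the coefficients and
    the next-stage residual."""
    X = np.column_stack([x, x**2])
    coef, *_ = np.linalg.lstsq(X, residual, rcond=None)
    return coef, residual - X @ coef

# hypothetical data for 50 customers
rng = np.random.default_rng(0)
amt, term, rate = rng.uniform(1, 10, (3, 50))
res1 = rng.normal(size=50)                # stands in for Residual 1 from Stage I

(b1, b2), res2 = quad_stage(res1, amt)    # Stage II: loan amount
(g1, g2), res3 = quad_stage(res2, term)   # Stage III: loan term
(d1, d2), res4 = quad_stage(res3, rate)   # Stage IV: interest rate

# adjustment added to L^ to get the updated log odds score per customer
adj = b1*amt + b2*amt**2 + g1*term + g2*term**2 + d1*rate + d2*rate**2
```

Each stage is a projection, so the residual sum of squares can only shrink from one stage to the next; fitting the three quadratics sequentially on residuals is not the same as fitting them jointly, which is the ordering postulate (iii) is doing the work for.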
Note: The idea above is that the log odds estimate obtained from the Stage I model falls short of the true log odds by a quantity equal to Residual 1. This shortfall is bridged by building the Stage II model. Still, there remains a shortfall (equal to Residual 2), which is captured by the Stage III model. And so on.
What do you think of this concept? I welcome suggestions and criticisms on this proposal. Please drop your views by email to me at khanal[dot]Bhupendra[at]gmail[dot]com or through comment to this blog entry.
2 comments:
I am not quite sure why you use this iterative approach, but I think you should check the literature first, search for Heckman's selection bias model ('89).
Dirk
Yes! I agree with Dirk. We have proven modeling approaches like Heckman's two-step and 'Tobit' models to deal with censored data.
You can also follow the Conditional Maximum Likelihood (CML) approach when the two stage objectives (dependent variables) are conditional and discrete choices. For example: Approval Rate / Response Rate.