Variable reduction and addition using statistical methods
i. Find the score chi-square with respect to the dependent variable (DV) for all independent variables (IDVs) in the dev sample (using proc logistic with selection=stepwise, maxstep=1 and details).
ii. Find the correlation of all IDVs with DV in the dev sample (proc corr).
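iii. Find the information value w.r.t. DV for all IDVs in the dev sample.
Information value = sum over all bins i of (%Gi - %Bi) * WEi
where WEi = ln(#Gi / #Bi) - ln(#G total / #B total) is the weight of evidence of bin i, #Gi and #Bi are the counts of goods and bads in bin i, %Gi = #Gi / #G total and %Bi = #Bi / #B total.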
iv. Cluster the variables using proc varclus into, say, 50 clusters. Keep the 1-R^2 ratio and cluster number for all IDVs.
v. Make a table with all the above parameters for all IDVs and merge it with the table in 2.
vi. Take the best 50 variables from the table in v.
vii. Run CART on all IDVs (not only those in vi) on the dev sample and create interaction variables.
viii. The variable master list is the set of variables from vi and vii. Build the scoring model with this variable list.
ix. If the model overfits, find the condition index and variance inflation factor to decide on further removal of variables.
x. The modeler can now bring in some other variable from v in place of a variable in the master list, and keep iterating this way to arrive at the best possible model.
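As an illustration of the information value formula in step iii, here is a minimal Python sketch (the post itself works in SAS, so this is only a language-neutral restatement, not the author's code). It assumes the variable is already binned or categorical, that the DV is binary, and that a bin with zero goods or zero bads is simply skipped because its weight of evidence is undefined. The function name `information_value` and the `good` parameter are hypothetical.

```python
import math
from collections import Counter


def information_value(values, targets, good=0):
    """Return (iv, woe_by_bin) for one binned/categorical variable.

    values  : bin or level of the variable for each record
    targets : DV for each record; records equal to `good` count as goods,
              everything else counts as bads (assumption: binary DV)
    """
    goods, bads = Counter(), Counter()
    for v, t in zip(values, targets):
        if t == good:
            goods[v] += 1
        else:
            bads[v] += 1

    total_g = sum(goods.values())
    total_b = sum(bads.values())
    if total_g == 0 or total_b == 0:
        raise ValueError("need at least one good and one bad record")

    iv = 0.0
    woe = {}
    for level in set(goods) | set(bads):
        g, b = goods[level], bads[level]
        if g == 0 or b == 0:
            # Assumption: skip bins where ln(#Gi/#Bi) is undefined.
            continue
        # WEi = ln(#Gi/#Bi) - ln(#G total/#B total)
        woe[level] = math.log(g / b) - math.log(total_g / total_b)
        # (%Gi - %Bi) * WEi
        iv += (g / total_g - b / total_b) * woe[level]
    return iv, woe


# Toy data: bin "a" has 3 goods and 1 bad, bin "b" has 1 good and 3 bads.
values = ["a"] * 4 + ["b"] * 4
targets = [0, 0, 0, 1, 0, 1, 1, 1]
iv, woe = information_value(values, targets)
```

In practice the empty-bin case is often handled by merging sparse bins before computing the statistic rather than by skipping them; which is appropriate depends on the data.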
If the number of variables is huge, we need a standard process to reduce the variable list. Building the master variable list by running proc logistic at low significance, or by running proc logistic multiple times at high significance, may not always be the best way (this is what we generally do today). Also, doing these bivariate analyses leads to a better understanding of the variables.
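The post does not say how "best" is decided in step vi, so the following Python sketch shows just one possible rule, stated as an assumption: rank the variables by information value and keep at most a fixed number per varclus cluster, so that the shortlist is not dominated by near-duplicates. The function `pick_variables`, its parameters, and the dictionary keys `name`, `iv` and `cluster` are all hypothetical; other columns from the table in v (score chi-square, correlation, 1-R^2 ratio) could be used for ranking instead.

```python
def pick_variables(metrics, top_n=50, max_per_cluster=1):
    """Pick up to `top_n` variable names from a per-variable metrics table.

    metrics : list of dicts with keys 'name', 'iv', 'cluster'
              (hypothetical layout of the table built in step v)
    Rule (an assumption, not from the post): highest information value
    first, with at most `max_per_cluster` variables taken per cluster.
    """
    ranked = sorted(metrics, key=lambda row: row["iv"], reverse=True)
    taken_per_cluster = {}
    chosen = []
    for row in ranked:
        if len(chosen) >= top_n:
            break
        cluster = row["cluster"]
        if taken_per_cluster.get(cluster, 0) >= max_per_cluster:
            continue  # this cluster is already represented
        chosen.append(row["name"])
        taken_per_cluster[cluster] = taken_per_cluster.get(cluster, 0) + 1
    return chosen


# Toy table: x1 and x2 share a cluster, so only the stronger one is kept.
table = [
    {"name": "x1", "iv": 0.40, "cluster": 1},
    {"name": "x2", "iv": 0.30, "cluster": 1},
    {"name": "x3", "iv": 0.20, "cluster": 2},
]
shortlist = pick_variables(table, top_n=2)
```

Whatever rule is used, the shortlist is only a starting point: steps vii to x still add CART interaction variables and let the modeler swap variables in and out.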
Thursday, April 19, 2007
8 comments:
Interesting proposition for variable reduction. How would you treat categorical/character variables, though? (proc varclus and proc corr only support numeric variables.)
I would suggest taking all those variables separately and choosing the ones with significant information value.
Business logic needs to be checked, and a bivariate analysis can be the next step for further reduction.
Hope this answers your query.
How about converting the character variables into dummies and then running proc varclus? Would that make sense?
I think that could be a good alternative. I have never tried it that way.
Have you tried that way?
No, but I'm thinking of trying it now to see whether anything sensible comes from proc varclus. In essence, I'm trying to replicate what you are suggesting, because I think it is a good idea to augment information values with chi-square scores and 1-R^2 and cherry-pick the most predictive variables, taking business considerations into account. It would make more sense, however, to get these measures for all variables rather than just the numeric ones.
All my variables are nominal because I've fine-classed them, so I got the chi-square by running proc logistic with the class statement, and the information values are pretty straightforward. The 1-R^2 is a bit trickier, though, which is why I thought of converting the fine-classed variables into dummies.
That's nice. Please share your results here. It will be a great learning experience for all. I expect interesting findings there.