Thursday, October 30, 2008

Missing Value Approximation

My friend Deep has raised a really good technical issue in Analytics: he is exploring various ways of approximating missing values.
I have spent several years in Analytics, and most of the time I see people wasting their time exploring solutions that are never possible. The reason: The Great Data Issue.

Data issues come in several forms, and the most prominent one is missing information. People favor different ways of imputing missing values, and which one they choose depends on whether they lean towards Business or Statistics.

Most statisticians prefer to define some statistical guidelines and then follow them. For example: if less than 20% of the values of a variable are missing, take the semi-quartile range ((P75 - P25) / 2); if more than 20% are missing, the variable cannot be used.
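
A rule like this is easy to screen for. Below is a minimal sketch in plain Python, assuming missing values are stored as None. The function names and the handling of exactly 20% (treated here as not usable) are my own assumptions, since the rule of thumb does not say what happens at the boundary.

```python
from typing import Optional, Sequence


def missing_share(values: Sequence[Optional[float]]) -> float:
    """Return the fraction of entries that are missing (None)."""
    if not values:
        return 0.0
    return sum(v is None for v in values) / len(values)


def is_usable(values: Sequence[Optional[float]], max_missing: float = 0.20) -> bool:
    """Return True when the missing share is strictly below the cutoff.

    Assumption: a share of exactly `max_missing` counts as not usable.
    """
    return missing_share(values) < max_missing


if __name__ == "__main__":
    var1 = [3.0, None, 4.5, 5.0, 2.0, 6.0, 1.0, 4.0, 3.5, 2.5]
    print(missing_share(var1), is_usable(var1))
```

Making the cutoff a parameter means the same check also covers the 95% case that business users ask for.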

Business users, on the other hand, look at it from a different plane. They say: do whatever, but do something. Missing values cannot be deleted, so just replace them with the mean, even if 95% are missing. Surprised? Do not be. It is a common problem that I see.

There are a few other good methods for missing value imputation, which I prefer.
1. Means with Flags Method:
Replace all missing values with the mean of the remaining values, for up to 95% missing, and create a missing flag to throw into the model. This method, I am sure, will invite the wrath of statisticians, who will question its validity and significance.
But I feel it's OK. Everything starts by being pushed into the middle (replaced by the mean), and then if being missing means something (like fraudsters not providing phone numbers), the flag captures it. Fair enough!!
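
The means-with-flags idea can be sketched in a few lines of plain Python. This is only an illustration, assuming missing values are stored as None; the function name `impute_mean_with_flag` is hypothetical.

```python
from statistics import fmean
from typing import List, Optional, Sequence, Tuple


def impute_mean_with_flag(
    values: Sequence[Optional[float]],
) -> Tuple[List[float], List[int]]:
    """Fill missing entries with the mean of the observed ones.

    Returns the filled values plus a 0/1 flag per entry (1 = was missing),
    so a model can still learn from the fact that a value was absent.
    """
    observed = [v for v in values if v is not None]
    if not observed:
        raise ValueError("no observed values to average")
    mean = fmean(observed)
    filled = [mean if v is None else v for v in values]
    flags = [1 if v is None else 0 for v in values]
    return filled, flags


if __name__ == "__main__":
    filled, flags = impute_mean_with_flag([10.0, None, 30.0, None])
    print(filled)  # [10.0, 20.0, 30.0, 20.0]
    print(flags)   # [0, 1, 0, 1]
```

Both columns go into the model: the filled variable and its flag.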

2. SFA Imputation:
A bit technical, but business oriented. Plot the variable bins against the bad rate and then see where the missing values fall.
In this figure for Var1, the behavior of the missing values is comparable to that of values 4 to 5. Thus, replace missing values with 4.5.
Good, huh!!
Well, not always. What if the variable is discrete and can take only whole numbers? Simple solution: instead of 4.5, use 4 or 5, whichever is closer. Sounds good to me.
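
The same reading of the plot can be done numerically. Here is a rough sketch, assuming each record is a (value, bad) pair with bad equal to 0 or 1, missing values stored as None, and bins given as half-open ranges [low, high). The function names, the bin layout, and the choice of the bin midpoint as the fill value are assumptions for illustration.

```python
from typing import List, Optional, Sequence, Tuple

Record = Tuple[Optional[float], int]  # (variable value or None, bad flag 0/1)


def bad_rate(outcomes: Sequence[int]) -> float:
    """Share of bad (1) outcomes in a group."""
    return sum(outcomes) / len(outcomes)


def sfa_fill_value(
    records: Sequence[Record], bins: Sequence[Tuple[float, float]]
) -> float:
    """Return the midpoint of the bin whose bad rate is closest to the
    bad rate of the records with a missing value."""
    missing = [bad for value, bad in records if value is None]
    if not missing:
        raise ValueError("no missing records to compare against")
    target = bad_rate(missing)

    best_mid, best_gap = None, None
    for low, high in bins:
        in_bin = [bad for value, bad in records
                  if value is not None and low <= value < high]
        if not in_bin:
            continue  # an empty bin has no bad rate to compare
        gap = abs(bad_rate(in_bin) - target)
        if best_gap is None or gap < best_gap:
            best_mid, best_gap = (low + high) / 2, gap
    if best_mid is None:
        raise ValueError("no bin contains any observed value")
    return best_mid


if __name__ == "__main__":
    data: List[Record] = [
        (1.0, 0), (1.5, 0), (0.5, 0), (1.0, 1),   # bin [0, 2): bad rate 0.25
        (3.0, 0), (2.5, 1),                       # bin [2, 4): bad rate 0.50
        (4.2, 1), (4.8, 1), (4.5, 1), (4.1, 0),   # bin [4, 5): bad rate 0.75
        (None, 1), (None, 1), (None, 1), (None, 0),  # missing: bad rate 0.75
    ]
    print(sfa_fill_value(data, [(0, 2), (2, 4), (4, 5)]))  # 4.5
```

For a discrete variable, the same comparison could be run on single values instead of ranges, so that the fill value is always one the variable can actually take.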

What method do you commonly use? Share your comments and feedback. It will be interesting to explore various methods before we actually use one.
