How to deal with missing values in your data?
One of the biggest challenges during the data exploration and preparation stage of a data science analytics project is missing values!
How do you deal with them: ignore them, or somehow replace them?
The penalty of ignoring them is that, while some techniques such as decision trees can treat them as a separate value, other techniques (regressions) typically drop the whole row of data wherever a value is missing, which can result in a substantial loss of potentially useful information. But before you devise the best strategy for dealing with them, you need to try to understand what caused them, and that often involves talking to people who have a deeper understanding of the specific data and where it originates.
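To get a feel for how much information row-wise dropping can cost, here is a minimal sketch using only the Python standard library. The records and field names are made up for illustration, and `None` stands in for a missing value.

```python
# Hypothetical toy records; None marks a missing value.
rows = [
    {"age": 34, "income": 52000, "tenure": 5},
    {"age": None, "income": 61000, "tenure": 2},
    {"age": 45, "income": None, "tenure": 9},
    {"age": 29, "income": 48000, "tenure": None},
    {"age": 51, "income": 75000, "tenure": 12},
]


def complete_cases(records):
    """Keep only records with no missing field (row-wise deletion)."""
    return [r for r in records if all(v is not None for v in r.values())]


kept = complete_cases(rows)
missing_cells = sum(v is None for r in rows for v in r.values())
total_cells = sum(len(r) for r in rows)

print(f"missing cells: {missing_cells} of {total_cells}")
print(f"rows kept:     {len(kept)} of {len(rows)}")
```

In this toy data only a fifth of the cells are missing, yet more than half of the rows are lost once every incomplete row is dropped.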
Here is an example: my customer database has missing values for the variable “age”, and on closer inspection it became obvious that these customers are corporates, not individual customers with a date of birth, so I end up with a missing value. In this case, none of the usual imputation strategies for missing values are applicable! I call this type of missing value a “proxy missing value”. Such values are merely an invisible label for some larger, overriding reason behind how they came about.
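One way to make that invisible label visible is to tag each missing “age” with the reason behind it. This is only a sketch: the `customer_type` field and the label names are assumptions for illustration, not part of any particular database.

```python
# Hypothetical customer records; "customer_type" is an assumed field name.
customers = [
    {"id": 1, "customer_type": "individual", "age": 42},
    {"id": 2, "customer_type": "corporate", "age": None},
    {"id": 3, "customer_type": "individual", "age": None},
    {"id": 4, "customer_type": "corporate", "age": None},
]


def age_status(record):
    """Label why age is (or is not) present in a record."""
    if record["age"] is not None:
        return "observed"
    if record["customer_type"] == "corporate":
        return "proxy missing"  # a company has no date of birth
    return "not captured"  # an individual whose age was never recorded


labels = {c["id"]: age_status(c) for c in customers}
print(labels)
```

Only the record labelled "not captured" is a candidate for the usual imputation strategies; the "proxy missing" ones are not.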
You get a similar kind of missing value for behavioural or transactional information when you combine existing and new customers. New customers have no historical behaviour or transactions, so those values are missing. In this case, you solve the problem of missing values by analysing new and existing customers separately.
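As a rough sketch of that split, assuming a hypothetical `is_new` flag and a made-up `spend_12m` field for historical spend:

```python
from statistics import mean

# Hypothetical records; "is_new" and "spend_12m" are assumed field names.
customers = [
    {"id": 1, "is_new": False, "spend_12m": 1200.0},
    {"id": 2, "is_new": True, "spend_12m": None},
    {"id": 3, "is_new": False, "spend_12m": 800.0},
    {"id": 4, "is_new": True, "spend_12m": None},
]

# Analyse the two groups separately instead of imputing spend for new customers.
existing = [c for c in customers if not c["is_new"]]
new = [c for c in customers if c["is_new"]]

missing_in_existing = sum(c["spend_12m"] is None for c in existing)
existing_mean_spend = mean(c["spend_12m"] for c in existing)

print(f"existing: {len(existing)} customers, {missing_in_existing} missing spend")
print(f"new:      {len(new)} customers, analysed without spend history")
```

Once the groups are separated, the existing-customer data in this example has no missing spend left to handle.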
Another case I exclude from any “genuine” missing value imputation strategy: you have products “a”, “b” and “c”, and a customer asks for a product the company has never had, product “z”. This is an “out of range” problem: the record was left empty, and the outcome was a missing value.
I only use imputation strategies such as mean, median, predictive imputation methods, distribution-based methods, and others for values that exist in the population but were simply not captured in the data record for whatever reason. These are what I call “genuine” missing values. Where records have other types of missing values, ones that could not exist in the population for some logical and explainable reason, I exclude those records and analyse them separately.
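Putting this together, here is a minimal sketch of the workflow for the “age” example: set aside the records where age cannot exist, then fill the remaining gaps with the median. The field names and the `age_imputed` flag are assumptions for illustration, and median is just one of the strategies listed above.

```python
from statistics import median

# Hypothetical records; "customer_type" is an assumed field name.
records = [
    {"id": 1, "customer_type": "individual", "age": 30},
    {"id": 2, "customer_type": "individual", "age": None},
    {"id": 3, "customer_type": "corporate", "age": None},
    {"id": 4, "customer_type": "individual", "age": 50},
    {"id": 5, "customer_type": "individual", "age": 40},
]


def split_and_impute(rows):
    """Set aside rows where age cannot exist, then median-fill the rest."""
    applicable = [dict(r) for r in rows if r["customer_type"] == "individual"]
    set_aside = [dict(r) for r in rows if r["customer_type"] != "individual"]

    # The median is computed from observed ages of individuals only.
    observed = [r["age"] for r in applicable if r["age"] is not None]
    fill = median(observed)

    for r in applicable:
        r["age_imputed"] = r["age"] is None  # keep track of what was filled
        if r["age"] is None:
            r["age"] = fill
    return applicable, set_aside


individuals, others = split_and_impute(records)
print([(r["id"], r["age"], r["age_imputed"]) for r in individuals])
print([(r["id"], r["age"]) for r in others])
```

Keeping a flag for imputed values makes it possible to check later whether the filled records behave differently from the observed ones.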