How to Handle Missing Data
Updated: Nov 8, 2020
Missing data is one of the three most common data quality issues, the others being duplicated values and inconsistent values.
Missing values are the easiest to identify. They may appear in various forms, e.g. null values, blank spaces, or placeholders such as "unknown". Applying a filter to the data can make missing values easier to spot.
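As a rough sketch of that filtering step, assuming pandas is available, the snippet below builds a hypothetical toy table where missing values take several of the forms mentioned above, normalizes the placeholders to NaN, and then filters for the affected rows:

```python
import pandas as pd
import numpy as np

# Hypothetical toy data: missing values appear as real NaN, an empty
# string, and the placeholder "unknown"
df = pd.DataFrame({
    "city": ["LA", "unknown", "Chicago", ""],
    "age": [25, np.nan, 40, 31],
})

# Normalize the common placeholders to NaN so one check finds them all
df = df.replace(r"^\s*$", np.nan, regex=True)  # blanks / empty strings
df = df.replace("unknown", np.nan)

# "Filter" the data: keep only rows that contain any missing value
missing_rows = df[df.isna().any(axis=1)]
print(df.isna().sum())  # per-column missing counts
```

Normalizing first matters: a plain `df.isna()` would miss the `"unknown"` and blank-string rows entirely.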
Duplicated values occur when several rows of data appear to be identical; most likely they have been mistakenly recorded multiple times.
Inconsistent values usually occur when string values of the same attribute do not follow the same naming convention, e.g. both "LA" and "Los Angeles" are present in the "City" data field (you can learn more about how to address inconsistent data in this article).
This article will mainly focus on why data goes missing and how to address the issue.
Why is the data missing?
1. Missing completely at random (MCAR)
It may be the result of data simply not being recorded in the first place, so the reason for the missing data is unrelated to the attribute itself. We therefore cannot predict which subsets of the data are missing, and the missing values themselves are equally unpredictable.
2. Missing at random (MAR)
The attribute value of an instance is missing not because of the attribute itself but because it is affected by other properties of the instance. As a simple illustration, in a supermarket transaction dataset the liquor purchase attribute would be empty for all customers under age 18; the missingness is driven by another property, "age", rather than by the "liquor purchased amount" attribute itself. We should avoid addressing MAR by deleting entire rows of data, since that may introduce bias, e.g. all instances with age under 18 would be deleted.
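A quick diagnostic for MAR is to compare the missingness rate of one attribute across groups defined by another. Below is a minimal sketch on hypothetical transaction data, assuming pandas; the column names are illustrative only:

```python
import pandas as pd
import numpy as np

# Hypothetical transactions: liquor_amount is missing for under-18 customers
df = pd.DataFrame({
    "age": [16, 17, 25, 34, 45, 17],
    "liquor_amount": [np.nan, np.nan, 12.5, 30.0, 8.0, np.nan],
})

# Missingness rate of liquor_amount, split by whether age < 18
rate_by_group = df["liquor_amount"].isna().groupby(df["age"] < 18).mean()
print(rate_by_group)
```

Here the missingness rate is 100% for the under-18 group and 0% otherwise, so dropping rows with a missing `liquor_amount` would remove every under-18 customer, which is exactly the bias the text warns about.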
3. Missing not at random (MNAR)
This type of missing data is very hard to distinguish from MAR. Here, the attribute values are missing as a result of the attribute itself. For example, if a machine can only record temperatures up to 200 degrees, then all values above 200 degrees are missing. The data is missing because of the nature of the "temperature" attribute itself rather than being affected by other observed data, which distinguishes it from MAR. MNAR has already produced a biased dataset: in most cases, the extreme values are the ones not recorded.
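The bias MNAR introduces can be seen with a small simulation, assuming numpy; the sensor ceiling of 200 degrees mirrors the example above, and the distribution parameters are arbitrary:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sensor: true temperatures exist, but readings above the
# 200-degree ceiling are never recorded (MNAR)
true_temps = rng.normal(loc=190, scale=15, size=10_000)
recorded = np.where(true_temps <= 200, true_temps, np.nan)

# The mean of the recorded data underestimates the true mean, because
# the missingness depends on the value itself
print(np.nanmean(recorded), true_temps.mean())
```

No imputation based only on the recorded values can recover the lost tail, which is why MNAR is the hardest case to handle.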
How to handle missing data?
1. Ignore the tuple
The simplest solution is to remove the tuples that contain missing data. It is also the least effective solution, especially when the fraction of missing values varies considerably across attributes: most instances end up deleted and the remaining dataset is very small. Additionally, ignoring the missing values introduces bias when data are missing at random (MAR) or missing not at random (MNAR).
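A minimal sketch of this listwise deletion, assuming pandas, with hypothetical data where missingness is spread across attributes so that almost no complete tuples remain:

```python
import pandas as pd
import numpy as np

# Hypothetical dataset: different attributes are missing in different rows
df = pd.DataFrame({
    "a": [1, np.nan, 3, 4, np.nan],
    "b": [np.nan, 2, 3, np.nan, 5],
    "c": [1, 2, 3, 4, 5],
})

complete = df.dropna()  # keep only fully observed tuples
print(len(complete), "of", len(df), "rows survive")
```

Even though each column is mostly observed, only one of the five rows survives, which illustrates why this approach scales so badly when missingness is scattered.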
For example, in a case like this, deleting all tuples with missing data would leave hardly any instances.
2. Fill in missing data manually
Fill in the missing data by inspecting each missing value individually. This may be the most reliable way to ensure that every missing value is handled appropriately, but it is time- and effort-consuming. Most of the time, this approach is not feasible.
3. Fill in automatically
This can be achieved in one of the following ways:
1) using a global constant such as "missing" or "null"
2) using the mean or median of the attribute
3) using the mean or median of the attribute within the same class
4) using the most probable value, computed by an inference-based algorithm such as a decision tree or kNN.
But each approach has its downside. Using a global constant such as "missing" or "unknown" for numeric attributes may produce inconsistent data types. Using the mean or median may introduce new bias into the original dataset and change the data distribution as well as its relationships with other attributes. The last approach seems to be the most accurate; however, it comes with the trade-off of higher computation cost.
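The four automatic strategies can be sketched side by side, assuming pandas and scikit-learn are available; the data, class labels, and fill constant below are hypothetical:

```python
import pandas as pd
import numpy as np
from sklearn.impute import KNNImputer

# Hypothetical data with a class label and a numeric attribute x
df = pd.DataFrame({
    "class": ["A", "A", "B", "B", "B"],
    "x": [1.0, np.nan, 10.0, 12.0, np.nan],
    "y": [2.0, 2.1, 9.0, 11.0, 10.0],
})

# 1) global constant (note the data-type caveat for numeric columns)
const_fill = df["x"].fillna(-1)

# 2) overall mean of the attribute
mean_fill = df["x"].fillna(df["x"].mean())

# 3) mean of the attribute within the same class
class_fill = df.groupby("class")["x"].transform(lambda s: s.fillna(s.mean()))

# 4) most probable value via an inference-based algorithm (here kNN,
# using the other numeric attribute y to find neighbors)
knn_fill = KNNImputer(n_neighbors=2).fit_transform(df[["x", "y"]])
```

Comparing the per-class fill with the overall-mean fill on this toy data shows the distribution shift the text mentions: class A's missing value becomes 1.0 under strategy 3 but roughly 7.7 under strategy 2.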
4. More Advanced Approaches
1) Matrix decomposition: a mathematical process of decomposing the original matrix with missing values into a product of submatrices, then recreating the matrix with filled values by multiplying the submatrices. This article explains it in more detail.
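One common realization of this idea is iterated low-rank SVD imputation. The sketch below, assuming only numpy, is a rough illustration rather than a tuned library implementation; the function name, rank, and toy matrix are all made up for the example:

```python
import numpy as np

def svd_impute(X, rank=2, n_iter=100):
    """Fill missing entries of X by repeatedly reconstructing a
    low-rank SVD approximation and refreshing the missing cells."""
    X = X.astype(float).copy()
    mask = np.isnan(X)
    # Start from the column means
    col_means = np.nanmean(X, axis=0)
    X[mask] = np.take(col_means, np.where(mask)[1])
    for _ in range(n_iter):
        # Decompose, keep the top `rank` singular values, recombine
        U, s, Vt = np.linalg.svd(X, full_matrices=False)
        approx = (U[:, :rank] * s[:rank]) @ Vt[:rank, :]
        X[mask] = approx[mask]  # only the originally missing cells move
    return X

# Hypothetical rank-1 matrix with two entries removed
true = np.outer([1.0, 2.0, 3.0, 4.0], [1.0, 2.0, 3.0])
obs = true.copy()
obs[0, 2] = np.nan
obs[3, 0] = np.nan
filled = svd_impute(obs, rank=1)
```

Because the underlying matrix really is low rank here, the iteration pulls the filled cells well away from the crude column-mean start and toward the structure implied by the observed entries.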
2) Expectation maximization imputation: a machine learning approach that finds maximum-likelihood estimates for the missing values. Read the article "A Gentle Introduction to Expectation Maximization" if you would like to know more.
3) Multiple imputation: build several different models to predict the missing value, then determine the final value by considering the probability of each predicted value. This is fairly similar to the ensemble method in classification, where multiple algorithms and models are combined to produce an optimal result.
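A minimal sketch of multiple imputation, assuming scikit-learn: its `IterativeImputer` (still experimental, hence the extra import) can draw from a posterior, so running it with several seeds yields several plausible completions that are then pooled. The data below is synthetic, and the pooling here is a simple average; full multiple imputation also pools variances across the imputed datasets:

```python
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

rng = np.random.default_rng(42)

# Hypothetical correlated data with about 10% of entries removed at random
X_true = rng.normal(size=(200, 3))
X_true[:, 2] = X_true[:, 0] + 0.1 * rng.normal(size=200)  # col 2 tracks col 0
X = X_true.copy()
miss = rng.random(X.shape) < 0.1
X[miss] = np.nan

# Build several imputed datasets with different seeds, then pool them
imputations = [
    IterativeImputer(random_state=seed, sample_posterior=True).fit_transform(X)
    for seed in range(5)
]
X_filled = np.mean(imputations, axis=0)
```

The spread across the five imputations also gives a sense of how uncertain each filled value is, which single imputation cannot provide.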