One of the most common problems when working with data and predictive models is dealing with missing values. In this notebook, we handle the missing values in the training dataset and evaluate how good our imputation method is. The evaluation is done by simulating missing data on samples whose values are known, so that we can compare the imputed values to the actual data.
In this notebook we will:
1. Analyze the subset with the missing data
   - Is it any different from the rest of the data?
2. Create training and validation subsets
3. Use different methods to impute data
   - Simple imputation
   - Unsupervised Learning
   - Supervised Learning
4. Evaluate the results
[1]:
import pandas as pd
import numpy as np
from sklearn.experimental import enable_iterative_imputer
from sklearn.impute import IterativeImputer
from sklearn.neighbors import NearestNeighbors
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from plotly.offline import init_notebook_mode
init_notebook_mode(connected = True)
[2]:
df_train = pd.read_csv('../../data/train-validation/training-data.csv', index_col = 0)
# Remove Label
train_data = df_train.drop(columns = ['dano_na_plantacao']).iloc[:,1:]
[3]:
n_missing = train_data[train_data.Semanas_Utilizando.isna()].shape[0]
total = train_data.shape[0]
n_perc = n_missing/total * 100
print(f'Dataset size:\t {total}')
print(f'Total missing:\t {n_missing}')
print(f'Missing data:\t {n_perc:.3} %')
Dataset size: 56000
Total missing: 5690
Missing data: 10.2 %
Missing data¶
There are many approaches to handling missing data. In this case the missing feature is continuous, so the following options are considered:
- Deletion - This is not ideal, since new data arriving for prediction with this value missing would have to be discarded.
- Imputation
  - Simple Imputation - Mean, median and most frequent value.
  - Machine Learning - Both supervised and unsupervised learning can be used to impute missing data.
Before we start treating the missing values, let's look at the samples with missing data and compare them with the rest of the data. This will tell us whether the rows with missing data are distributed differently from the rest. If so, simple imputation is not recommended, since statistics computed on the complete rows would not represent the incomplete ones.
[4]:
print('\n\n\t\t\t\t\tSubset without missing Data')
display(train_data[~train_data.Semanas_Utilizando.isna()].describe())
print('\n\n\t\t\t\t\tSubset with missing Data')
display(train_data[train_data.Semanas_Utilizando.isna()].describe())
Subset without missing Data
| | Estimativa_de_Insetos | Tipo_de_Cultivo | Tipo_de_Solo | Categoria_Pesticida | Doses_Semana | Semanas_Utilizando | Semanas_Sem_Uso | Temporada |
|---|---|---|---|---|---|---|---|---|
| count | 50310.000000 | 50310.000000 | 50310.000000 | 50310.000000 | 50310.000000 | 50310.000000 | 50310.000000 | 50310.000000 |
| mean | 1399.919579 | 0.283542 | 0.457364 | 2.270443 | 25.862353 | 28.651719 | 9.512025 | 1.896263 |
| std | 851.705790 | 0.450722 | 0.498184 | 0.465177 | 15.584169 | 12.429610 | 9.903693 | 0.703317 |
| min | 150.000000 | 0.000000 | 0.000000 | 1.000000 | 0.000000 | 0.000000 | 0.000000 | 1.000000 |
| 25% | 731.000000 | 0.000000 | 0.000000 | 2.000000 | 15.000000 | 20.000000 | 0.000000 | 1.000000 |
| 50% | 1212.000000 | 0.000000 | 0.000000 | 2.000000 | 20.000000 | 28.000000 | 7.000000 | 2.000000 |
| 75% | 1898.000000 | 1.000000 | 1.000000 | 3.000000 | 40.000000 | 37.000000 | 16.000000 | 2.000000 |
| max | 4097.000000 | 1.000000 | 1.000000 | 3.000000 | 95.000000 | 67.000000 | 50.000000 | 3.000000 |
Subset with missing Data
| | Estimativa_de_Insetos | Tipo_de_Cultivo | Tipo_de_Solo | Categoria_Pesticida | Doses_Semana | Semanas_Utilizando | Semanas_Sem_Uso | Temporada |
|---|---|---|---|---|---|---|---|---|
| count | 5690.000000 | 5690.000000 | 5690.000000 | 5690.000000 | 5690.000000 | 0.0 | 5690.000000 | 5690.000000 |
| mean | 1386.976801 | 0.288576 | 0.447276 | 2.259930 | 25.831283 | NaN | 9.547803 | 1.900176 |
| std | 834.701574 | 0.453140 | 0.497256 | 0.463573 | 15.589252 | NaN | 9.858941 | 0.702468 |
| min | 150.000000 | 0.000000 | 0.000000 | 1.000000 | 0.000000 | NaN | 0.000000 | 1.000000 |
| 25% | 731.000000 | 0.000000 | 0.000000 | 2.000000 | 15.000000 | NaN | 0.000000 | 1.000000 |
| 50% | 1212.000000 | 0.000000 | 0.000000 | 2.000000 | 20.000000 | NaN | 7.000000 | 2.000000 |
| 75% | 1898.000000 | 1.000000 | 1.000000 | 3.000000 | 40.000000 | NaN | 16.000000 | 2.000000 |
| max | 4097.000000 | 1.000000 | 1.000000 | 3.000000 | 95.000000 | NaN | 50.000000 | 3.000000 |
There are no large discrepancies between the two subsets, which means we can try to use information from the complete rows to fill in the missing values. There are different imputation methods, and here we evaluate which one fits best.
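The comparison of the two describe() tables can also be expressed as a single number per column. The sketch below is one possible way to do it, not part of the original analysis: the helper name `compare_subsets` and the toy DataFrame are assumptions made for illustration. It reports how far the mean of each column in the incomplete rows sits from the mean in the complete rows, measured in standard deviations of the complete rows.

```python
import numpy as np
import pandas as pd

def compare_subsets(df, target_col):
    """Compare column means between rows with and without `target_col`.

    Hypothetical helper: returns both means and their difference
    expressed in standard deviations of the complete subset.
    """
    missing_mask = df[target_col].isna()
    others = df.drop(columns=[target_col])
    complete = others[~missing_mask]
    missing = others[missing_mask]
    summary = pd.DataFrame({
        "mean_complete": complete.mean(),
        "mean_missing": missing.mean(),
    })
    summary["std_diff"] = (
        summary["mean_missing"] - summary["mean_complete"]
    ) / complete.std()
    return summary

# Toy data, made up for the example
toy = pd.DataFrame({
    "a": [1.0, 2.0, 3.0, 4.0],
    "b": [10.0, 20.0, 30.0, 40.0],
    "target": [5.0, np.nan, 7.0, np.nan],
})
summary = compare_subsets(toy, "target")
print(summary)
```

Applied by hand to the two tables above, every column mean in the incomplete subset differs from the complete subset by less than about 0.03 standard deviations, which supports the "no large discrepancies" reading.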
First we drop all samples where Semanas_Utilizando is missing, which gives a new DataFrame with no missing data: df_full. We then hold out part of df_full to evaluate how good each method is. The evaluation sample is 10% of df_full (test_size=0.1), close to the 10.2% of rows with missing values in the original data.
Train-Validation Subsets¶
[5]:
# Drop missing data
df_full = train_data[~train_data.Semanas_Utilizando.isna()]
na_train, na_eval = train_test_split(df_full.copy(), test_size=0.1)
na_eval_y = na_eval.Semanas_Utilizando.values.copy()
na_eval_df = na_eval.drop(columns = ['Semanas_Utilizando'])
With an evaluation function it is possible to measure how well we are filling in missing values. Here we use the Mean Squared Error (MSE); the smaller the value, the better our method is.
[6]:
# Create MSE cost function
mse = lambda A,B : (np.square(A - B)).mean(axis=0)
mse_score = {'Simple Imputing': {}, 'Supervised Learning':{}, 'Unsupervised Learning':{}}
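As a quick check of what the cost function computes, here is a small worked example on three made-up values. The name `mse_example` and the arrays are assumptions for illustration only. It also shows the square root of the MSE (RMSE), which is in the same unit as the feature (weeks) and can be easier to interpret.

```python
import numpy as np

def mse_example(actual, imputed):
    """Mean of the squared differences between actual and imputed values."""
    actual = np.asarray(actual, dtype=float)
    imputed = np.asarray(imputed, dtype=float)
    return np.mean((actual - imputed) ** 2)

# Made-up values: the errors are -2, 0 and 3
actual = np.array([10.0, 20.0, 30.0])
imputed = np.array([12.0, 20.0, 27.0])

score = mse_example(actual, imputed)  # (4 + 0 + 9) / 3
rmse = np.sqrt(score)                 # back in the unit of the feature
print(score, rmse)
```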
Data Imputation¶
Simple Imputation¶
[7]:
ones = np.ones(na_eval_y.shape)
# Most Frequent Value
mode = na_train.Semanas_Utilizando.mode()[0] * ones
mode_mse = mse(na_eval_y, mode)
mse_score['Simple Imputing']['Mode'] = mode_mse
# Mean Value
mean = na_train.Semanas_Utilizando.mean() * ones
mean_mse = mse(na_eval_y, mean)
mse_score['Simple Imputing']['Mean'] = mean_mse
# Median Value
median = na_train.Semanas_Utilizando.median() * ones
median_mse = mse(na_eval_y, median)
mse_score['Simple Imputing']['Median'] = median_mse
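The three constants above are computed by hand with pandas. For reference, scikit-learn offers the same strategies through `SimpleImputer`; the sketch below uses it on a made-up single-column frame (the column name `weeks` and its values are assumptions), and is an alternative to the notebook's approach rather than what the cell above does.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Made-up column with one missing entry at row 3
toy = pd.DataFrame({"weeks": [10.0, 20.0, 20.0, np.nan, 50.0]})

filled = {}
for strategy in ("mean", "median", "most_frequent"):
    imputer = SimpleImputer(strategy=strategy)
    # fit_transform learns the statistic from the observed values
    # and writes it into every missing position
    filled[strategy] = imputer.fit_transform(toy)[3, 0]

print(filled)
```

Because every missing entry receives the same constant, these methods ignore the other features entirely, which is why they serve as a baseline here.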
Unsupervised learning¶
Here we can use Nearest Neighbors to find the n closest samples for each sample with missing data and use the mean of their Semanas_Utilizando values as the imputed value.
To fit the Nearest Neighbors model we need to remove the column we want to fill. This way we can query it (kneighbors() returns the closest samples) with the subset that lacks that column.
[8]:
na_y = na_train['Semanas_Utilizando']
na_x = na_train.drop(columns = ['Semanas_Utilizando'])
nbrs = NearestNeighbors(n_neighbors=10, algorithm='ball_tree').fit(na_x)
[9]:
_, neighbors = nbrs.kneighbors(na_eval_df)
display(neighbors.shape)
(5031, 10)
The kneighbors output is a matrix containing the index positions of the n closest neighbors for each sample, so the expected output shape is (n_samples, n_neighbors). Now we iterate over each sample and take the mean of the missing column over its n neighbors.
[10]:
# Column Index
col_idx = na_train.columns.get_loc('Semanas_Utilizando')
knn_input = []
# For each sample we have n_neighbors
# Loop all samples
for sample in range(neighbors.shape[0]):
    # Look up the neighbors in the data set we fitted on, which still has the desired column
    # desired column index -> Semanas_Utilizando -> col_idx
    # loop over the n_neighbors closest rows and take the mean value
    sample_mean = np.mean([na_train.iloc[i, col_idx]
                           for i in neighbors[sample, :]])
    knn_input.append(sample_mean)
# Evaluate score
knn_mse = mse(na_eval_y, knn_input)
mse_score['Unsupervised Learning']['Nearest Neighbors'] = knn_mse
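The per-sample loop can also be written as a single array lookup, since the neighbor matrix holds row positions in the data the model was fitted on. The sketch below runs on synthetic data (all array names and sizes are assumptions) and checks that the vectorized form matches the loop. It also compares against `KNeighborsRegressor` with uniform weights, which predicts the same neighbor mean.

```python
import numpy as np
from sklearn.neighbors import KNeighborsRegressor, NearestNeighbors

# Synthetic data standing in for the notebook's frames
rng = np.random.default_rng(0)
X_train = rng.normal(size=(50, 3))
y_train = rng.normal(size=50)
X_eval = rng.normal(size=(5, 3))

nbrs = NearestNeighbors(n_neighbors=4).fit(X_train)
_, idx = nbrs.kneighbors(X_eval)  # shape (n_samples, n_neighbors)

# Loop version: mean target of the neighbors of each sample
knn_loop = np.array([np.mean([y_train[i] for i in row]) for row in idx])

# Vectorized version: index the targets with the whole matrix at once
knn_vectorized = y_train[idx].mean(axis=1)

# Same idea through a regressor with uniform weights
knn_reg = KNeighborsRegressor(n_neighbors=4).fit(X_train, y_train)
knn_regressor = knn_reg.predict(X_eval)
```

In the notebook's terms, the vectorized line would correspond to something like `na_y.to_numpy()[neighbors].mean(axis=1)`.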
Supervised learning¶
Linear Regression - here we train a simple supervised model, a linear regressor, to predict the expected value of the missing feature from the other features.
[11]:
lr = LinearRegression().fit(na_x, na_y)
lr_input = lr.predict(na_eval_df)
lr_mse = mse(na_eval_y, lr_input)
mse_score['Supervised Learning']['Linear Regression'] = lr_mse
Iterative Imputer - Like the linear regressor, this is a supervised approach: iteratively, each feature with missing values is modeled by a regression on the other features, and the fitted model fills in the missing entries.
[12]:
# Populate dataframe with nans
na_eval['Semanas_Utilizando'] = np.nan
# Get index for later evaluation
eval_index = na_eval.index
# Concatenate with training data
na_concat = pd.concat([na_train, na_eval], axis = 0)
[13]:
# Create Fit and Predict
imp = IterativeImputer(max_iter=10, random_state=0)
imp.fit(na_concat)
# Keep only the imputed entries (row order matches na_eval)
imp_input = imp.transform(na_concat)[na_concat.isna().to_numpy()]
# Evaluate score
imp_mse = mse(na_eval_y, imp_input)
mse_score['Supervised Learning']['Iterative Imputer'] = imp_mse
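To make the masking step concrete, here is a minimal sketch on synthetic data, assuming two columns with a nearly linear relationship (the data, sizes and noise level are made up for the example). Known values are hidden, the imputer fills them, and the boolean NaN mask picks out exactly the filled entries for comparison with the hidden truth.

```python
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

# Synthetic data: second column is roughly 2 * first + 1
rng = np.random.default_rng(0)
x = rng.uniform(0, 10, size=200)
y = 2 * x + 1 + rng.normal(scale=0.1, size=200)
data = np.column_stack([x, y])

# Hide the first 20 values of the second column, keeping the truth
truth = data[:20, 1].copy()
data[:20, 1] = np.nan
mask = np.isnan(data)

imp = IterativeImputer(max_iter=10, random_state=0)
filled = imp.fit_transform(data)

# Boolean indexing returns the imputed entries in row order
imputed = filled[mask]
mean_abs_error = np.abs(imputed - truth).mean()
print(mean_abs_error)
```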
Results¶
[14]:
from plotly import graph_objects as go
fig = go.Figure()
for key in mse_score.keys():
    fig.add_trace(go.Bar(
        x=list(mse_score[key].values()),
        y=list(mse_score[key].keys()),
        orientation='h',
        name=key
    ))
fig.update_layout(
    title="Imputation of missing values methods comparison",
    xaxis_title="Mean Squared Error")
fig.show()
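Alongside the chart, the nested score dictionary can be flattened into a sorted list for a quick textual ranking. The helper `rank_scores` below is a hypothetical addition, and the numbers in `example_scores` are placeholders, not results from this notebook. The square root is included so that each score can also be read in weeks.

```python
import numpy as np

def rank_scores(nested_scores):
    """Flatten {family: {method: mse}} into rows sorted by MSE, best first.

    Each row is (family, method, mse, rmse).
    """
    rows = [
        (family, method, float(value), float(np.sqrt(value)))
        for family, methods in nested_scores.items()
        for method, value in methods.items()
    ]
    return sorted(rows, key=lambda row: row[2])

# Placeholder values only, chosen to make the ordering easy to follow
example_scores = {
    "Family A": {"method_1": 9.0, "method_2": 4.0},
    "Family B": {"method_3": 16.0},
}
ranking = rank_scores(example_scores)
for family, method, mse_value, rmse_value in ranking:
    print(f"{family:10s} {method:10s} MSE={mse_value:.2f} RMSE={rmse_value:.2f}")
```

In the notebook, the same call would take `mse_score` as its argument.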