MISSING DATA

One of the most common problems when working with data and predictive models is dealing with missing values. In this notebook, we handle the missing values in the training dataset and evaluate how good our imputation method is. The evaluation simulates missing data on samples whose values are known, so we can compare the imputed values to the actual data.

In this notebook we will:

1. Analyze the subset with the missing data
    - Is it any different from the rest of the data?
2. Create training and validation subsets
3. Use different methods to impute data
    - Simple imputation
    - Unsupervised Learning
    - Supervised Learning
4. Evaluate the results
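
The evaluation idea (hide values that are actually known, impute them, then compare against the truth) can be sketched on a toy column. This is only an illustration: the data and the names `known`, `hidden_idx` and `masked` are hypothetical, and the mean is used as a stand-in imputer.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Toy column with fully known values (hypothetical data, not the notebook's dataset)
known = pd.Series(rng.normal(loc=30, scale=10, size=1000))

# Simulate missingness: hide 10% of the known values
hidden_idx = rng.choice(known.index.to_numpy(), size=100, replace=False)
masked = known.copy()
masked.loc[hidden_idx] = np.nan

# Impute with a stand-in method (the mean of the values that remain)
imputed = masked.fillna(masked.mean())

# Compare the imputed values with the truth we hid
error = np.mean((known.loc[hidden_idx] - imputed.loc[hidden_idx]) ** 2)
print(f'MSE on the hidden values: {error:.2f}')
```

Any imputer can replace the `fillna` step; the comparison on the hidden positions stays the same.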

[1]:

import pandas as pd
import numpy as np

from sklearn.experimental import enable_iterative_imputer
from sklearn.impute import IterativeImputer
from sklearn.neighbors import NearestNeighbors
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

from plotly.offline import init_notebook_mode
init_notebook_mode(connected = True)

[2]:

df_train = pd.read_csv('../../data/train-validation/training-data.csv', index_col = 0)

# Remove Label
train_data = df_train.drop(columns = ['dano_na_plantacao']).iloc[:,1:]

[3]:

n_missing = train_data[train_data.Semanas_Utilizando.isna()].shape[0]
total = train_data.shape[0]
n_perc = n_missing/total * 100

print(f'Dataset size:\t {total}')
print(f'Total missing:\t {n_missing}')
print(f'Missing data:\t {n_perc:.3} %')

Dataset size:    56000
Total missing:   5690
Missing data:    10.2 %

Missing data

There are many approaches to handling missing data. In this case the missing feature is continuous, so these methods will be considered:

- Deletion - This is not ideal, since new samples with missing values would also have to be discarded at prediction time.
- Imputation
    - Simple Imputation - Mean, median and most frequent values.
    - Machine Learning - Both supervised and unsupervised learning can be used to impute missing data.

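Before we start treating the missing values, let’s look at the samples with missing data and compare them with the rest of the data. This will tell us whether the samples with missing data are distributed differently from the rest. If they are, simple imputation is not recommended, because a single statistic computed from the complete rows would not represent them.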

[4]:

print('\n\n\t\t\t\t\tSubset without missing Data')
display(train_data[~train_data.Semanas_Utilizando.isna()].describe())
print('\n\n\t\t\t\t\tSubset with missing Data')
display(train_data[train_data.Semanas_Utilizando.isna()].describe())



                                        Subset without missing Data

	Estimativa_de_Insetos	Tipo_de_Cultivo	Tipo_de_Solo	Categoria_Pesticida	Doses_Semana	Semanas_Utilizando	Semanas_Sem_Uso	Temporada
count	50310.000000	50310.000000	50310.000000	50310.000000	50310.000000	50310.000000	50310.000000	50310.000000
mean	1399.919579	0.283542	0.457364	2.270443	25.862353	28.651719	9.512025	1.896263
std	851.705790	0.450722	0.498184	0.465177	15.584169	12.429610	9.903693	0.703317
min	150.000000	0.000000	0.000000	1.000000	0.000000	0.000000	0.000000	1.000000
25%	731.000000	0.000000	0.000000	2.000000	15.000000	20.000000	0.000000	1.000000
50%	1212.000000	0.000000	0.000000	2.000000	20.000000	28.000000	7.000000	2.000000
75%	1898.000000	1.000000	1.000000	3.000000	40.000000	37.000000	16.000000	2.000000
max	4097.000000	1.000000	1.000000	3.000000	95.000000	67.000000	50.000000	3.000000



                                        Subset with missing Data

	Estimativa_de_Insetos	Tipo_de_Cultivo	Tipo_de_Solo	Categoria_Pesticida	Doses_Semana	Semanas_Utilizando	Semanas_Sem_Uso	Temporada
count	5690.000000	5690.000000	5690.000000	5690.000000	5690.000000	0.0	5690.000000	5690.000000
mean	1386.976801	0.288576	0.447276	2.259930	25.831283	NaN	9.547803	1.900176
std	834.701574	0.453140	0.497256	0.463573	15.589252	NaN	9.858941	0.702468
min	150.000000	0.000000	0.000000	1.000000	0.000000	NaN	0.000000	1.000000
25%	731.000000	0.000000	0.000000	2.000000	15.000000	NaN	0.000000	1.000000
50%	1212.000000	0.000000	0.000000	2.000000	20.000000	NaN	7.000000	2.000000
75%	1898.000000	1.000000	1.000000	3.000000	40.000000	NaN	16.000000	2.000000
max	4097.000000	1.000000	1.000000	3.000000	95.000000	NaN	50.000000	3.000000

There are no large discrepancies between the two subsets, which means we can try to use information from the complete rows to fill in the missing values. There are different imputation methods, and here we evaluate which one fits best.
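
As a complement to reading the two `describe()` tables side by side, the comparison can be condensed into one number per feature. The sketch below computes a standardized mean difference on hypothetical toy data (the frame `toy` and its column names are made up); values close to zero suggest the two subsets are similar on that feature.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Hypothetical toy frame: 'target' has missing values, 'feature' is always observed
toy = pd.DataFrame({
    'feature': rng.normal(size=1000),
    'target': rng.normal(size=1000),
})
toy.loc[toy.index[:100], 'target'] = np.nan

is_missing = toy['target'].isna()
observed = toy.loc[~is_missing, ['feature']]
missing = toy.loc[is_missing, ['feature']]

# Difference in means, expressed in units of the pooled standard deviation
pooled_std = np.sqrt((observed.var() + missing.var()) / 2)
smd = (observed.mean() - missing.mean()) / pooled_std
print(smd)
```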

First we drop all samples where Semanas_Utilizando is missing, which gives a new DataFrame with no missing data: df_full. We then hold out part of it to evaluate how good each method is. The evaluation sample is 10% of df_full (5031 rows), which is close in size to the set of rows with missing values (5690).

Train-Validation Subsets

[5]:

# Drop missing data
df_full = train_data[~train_data.Semanas_Utilizando.isna()]

na_train, na_eval = train_test_split(df_full.copy(), test_size=0.1)
na_eval_y = na_eval.Semanas_Utilizando.values.copy()
na_eval_df = na_eval.drop(columns = ['Semanas_Utilizando'])

With an evaluation function it is possible to measure how well we are filling in missing values. Here we use the Mean Squared Error (MSE): the smaller the value, the better our method is.

\[MSE = \frac{1}{n} \sum_{i=1}^{n}(Y_i - \hat{Y}_i)^2\]

[6]:

# Create MSE cost function
def mse(A, B):
    return np.square(np.asarray(A) - np.asarray(B)).mean(axis=0)
mse_score = {'Simple Imputing': {}, 'Supervised Learning':{}, 'Unsupervised Learning':{}}
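
A small worked example may help make the formula concrete. The three actual and imputed values below are hypothetical and chosen so the arithmetic is easy to follow by hand.

```python
import numpy as np

y_true = np.array([10.0, 20.0, 30.0])   # hypothetical actual values
y_pred = np.array([12.0, 20.0, 27.0])   # hypothetical imputed values

# Squared errors are 4, 0 and 9, so the mean is 13 / 3
mse_value = np.square(y_true - y_pred).mean()
print(mse_value)
```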

Data Imputation

Simple Imputation

[7]:

ones = np.ones(na_eval_y.shape)

# Most Frequent Value
mode = na_train.Semanas_Utilizando.mode()[0] * ones
mode_mse = mse(na_eval_y, mode)
mse_score['Simple Imputing']['Mode'] = mode_mse

# Mean Value
mean = na_train.Semanas_Utilizando.mean() * ones
mean_mse =  mse(na_eval_y, mean)
mse_score['Simple Imputing']['Mean'] = mean_mse

# Median Value
median = na_train.Semanas_Utilizando.median() * ones
median_mse = mse(na_eval_y, median)
mse_score['Simple Imputing']['Median'] = median_mse
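
The cell above builds constant arrays by hand so they can be scored against the held-out values. When the goal is to actually fill a column, scikit-learn's `SimpleImputer` offers the same three statistics through its `strategy` parameter. A minimal sketch on a hypothetical toy column (not the notebook's data):

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Hypothetical toy column with two missing entries
col = np.array([[1.0], [2.0], [2.0], [np.nan], [10.0], [np.nan]])

fills = {}
for strategy in ('mean', 'median', 'most_frequent'):
    imputer = SimpleImputer(strategy=strategy)
    filled = imputer.fit_transform(col)
    # Row 3 was missing, so it now holds the statistic of the observed values
    fills[strategy] = filled[3, 0]

print(fills)
```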

Unsupervised learning

Here we can use Nearest Neighbors to find the closest n samples for each sample with missing data and use the mean of their Semanas_Utilizando values as the imputed value.

To fit the Nearest Neighbors model we need to remove the column we want to fill. This way we can query it (kneighbors() returns the closest samples) with the subset that lacks that column.

[8]:

na_y = na_train['Semanas_Utilizando']
na_x = na_train.drop(columns = ['Semanas_Utilizando'])
nbrs = NearestNeighbors(n_neighbors=10, algorithm='ball_tree').fit(na_x)

[9]:

_, neighbors = nbrs.kneighbors(na_eval_df)

display(neighbors.shape)

(5031, 10)

The kneighbors output is a matrix containing the index positions of the n closest neighbors for each sample, so the expected output shape is (n_samples, n_neighbors). Now we iterate over the samples and take the mean of the missing column over each sample's n neighbors.

[10]:

# Column Index
col_idx = na_train.columns.get_loc('Semanas_Utilizando')

knn_input = []
# For each sample we have n_neighbors
# Loop all samples
for sample in range(neighbors.shape[0]):

    # Get Sample from data set we trained on, but here with the desired column
    # desired column index -> Semanas_Utilizando -> col_idx
    # loop on all closest n_neighbors and get mean value
    sample_mean = np.mean([na_train.iloc[i, col_idx]
                           for i in neighbors[sample,:] ])
    knn_input.append(sample_mean)

# Evaluate score
knn_mse =  mse(na_eval_y, knn_input)
mse_score['Unsupervised Learning']['Nearest Neighbors'] = knn_mse
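
The per-sample loop can also be written with NumPy fancy indexing, which gathers all neighbor values in one step. The sketch below uses hypothetical toy arrays (`X_train`, `y_train`, `X_query`) and checks that both forms give the same result.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

rng = np.random.default_rng(0)

# Hypothetical toy data: features with a known target, plus query rows to impute
X_train = rng.normal(size=(200, 3))
y_train = rng.normal(size=200)
X_query = rng.normal(size=(20, 3))

nbrs = NearestNeighbors(n_neighbors=5).fit(X_train)
_, idx = nbrs.kneighbors(X_query)        # idx has shape (n_queries, n_neighbors)

# Fancy indexing gathers every neighbor's target at once; average over the neighbor axis
vectorized = y_train[idx].mean(axis=1)

# Same computation with an explicit loop, for comparison
looped = np.array([np.mean([y_train[i] for i in row]) for row in idx])
print(np.allclose(vectorized, looped))
```

Assuming the row order of `na_y` matches the data the model was fit on, as it does in the cells above, the equivalent expression in this notebook would be `na_y.to_numpy()[neighbors].mean(axis=1)`.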

Supervised learning

Linear Regression - here we train a simple supervised regressor to predict the value of the missing feature from the other features.

[11]:

lr = LinearRegression().fit(na_x, na_y)
lr_input = lr.predict(na_eval_df)

lr_mse = mse(na_eval_y, lr_input)
mse_score['Supervised Learning']['Linear Regression'] = lr_mse

Iterative Imputer - Like the linear regressor, this is a supervised approach: iteratively, each feature with missing values is modeled as a function of the other features, and a regression fit on (x, y) predicts the missing entries.

[12]:

# Populate dataframe with nans
na_eval['Semanas_Utilizando'] = np.nan

# Get index for later evaluation
eval_index = na_eval.index

# Concatenate with training data
na_concat = pd.concat([na_train, na_eval], axis = 0)

[13]:

# Create Fit and Predict
imp = IterativeImputer(max_iter=10, random_state=0)
imp.fit(na_concat)
imp_input = imp.transform(na_concat)[na_concat.isna()]

# Evaluate score
imp_mse =  mse(na_eval_y, imp_input)
mse_score['Supervised Learning']['Iterative Imputer'] = imp_mse
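
To see the mechanics in isolation, here is a minimal sketch on hypothetical toy data where the second column is an almost linear function of the first. It hides part of that column, lets `IterativeImputer` fill it in, and compares the error with plain mean imputation. The array names are made up for the example.

```python
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401 (enables the next import)
from sklearn.impute import IterativeImputer

# Hypothetical toy data: column 1 is roughly 2 * column 0 + 1
x = np.arange(50, dtype=float)
data = np.column_stack([x, 2 * x + 1 + 0.1 * np.sin(x)])

# Hide column 1 for the last 10 rows and remember the truth
truth = data[40:, 1].copy()
data[40:, 1] = np.nan
mask = np.isnan(data)

imputer = IterativeImputer(max_iter=10, random_state=0)
filled = imputer.fit_transform(data)

# Boolean masking returns the imputed cells in row order
imputed = filled[mask]
imp_error = np.mean((truth - imputed) ** 2)
mean_error = np.mean((truth - np.nanmean(data[:, 1])) ** 2)
print(f'Iterative: {imp_error:.4f}  Mean: {mean_error:.4f}')
```

The boolean-mask step mirrors the `[na_concat.isna()]` indexing used above: because only the evaluation rows contain NaN, the selected values come back in the same order as `na_eval_y`.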

Results

[14]:

from plotly import graph_objects as go
fig = go.Figure()

for key in mse_score.keys():
    fig.add_trace(go.Bar(
                x=list(mse_score[key].values()),
                y=list(mse_score[key].keys()),
                orientation='h',
                name = key
                        ),
                 )
fig.update_layout(
    title="Imputation of missing values methods comparison",
    xaxis_title="Mean Squared Error")
fig.show()
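
Besides the bar chart, the nested dictionary can be flattened and sorted to read off the best method directly. The sketch below uses a hypothetical `scores` dictionary with the same layout as `mse_score`; the numbers are arbitrary placeholders, not results from this notebook.

```python
# Hypothetical scores with the same nested layout as mse_score; values are placeholders
scores = {
    'Simple Imputing': {'Mode': 3.0, 'Mean': 2.0},
    'Supervised Learning': {'Linear Regression': 1.0},
    'Unsupervised Learning': {'Nearest Neighbors': 1.5},
}

# Flatten to (group, method, mse) rows and sort by ascending error
ranking = sorted(
    ((group, method, value)
     for group, methods in scores.items()
     for method, value in methods.items()),
    key=lambda row: row[2],
)

best_group, best_method, best_mse = ranking[0]
for group, method, value in ranking:
    print(f'{method:<20} {group:<25} {value:.3f}')
```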
