lib package

Submodules

lib.data_processing module

lib.data_processing.copy_upsample(X, y, over_sampling)

Apply upsampling on minority data

Parameters:
  • X (pandas DataFrame) – Training Features
  • y (pandas Series) – Training Labels
Returns:

resampled data

Return type:

tuple
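
As a rough illustration of copy-based upsampling, the sketch below duplicates randomly chosen minority-class rows. It is not the library's implementation: the name copy_upsample_sketch is hypothetical, and it assumes over_sampling is the target ratio of minority to majority rows after resampling, which the documentation above does not state.

```python
import pandas as pd


def copy_upsample_sketch(X, y, over_sampling):
    """Duplicate minority rows until they number over_sampling * majority count.

    Assumption: over_sampling is a minority/majority ratio (not documented above).
    """
    counts = y.value_counts()
    majority_label, minority_label = counts.idxmax(), counts.idxmin()
    target = int(over_sampling * counts[majority_label])
    n_extra = max(target - int(counts[minority_label]), 0)

    # Draw minority rows with replacement; these are the copies to append
    extra = X[y == minority_label].sample(n=n_extra, replace=True, random_state=0)

    X_res = pd.concat([X, extra], ignore_index=True)
    y_res = pd.concat([y, y.loc[extra.index]], ignore_index=True)
    return X_res, y_res


X = pd.DataFrame({"a": range(10)})
y = pd.Series([0] * 8 + [1] * 2)
X_res, y_res = copy_upsample_sketch(X, y, over_sampling=1.0)
```

With a ratio of 1.0 the two classes end up the same size. The resampled data should only ever be used for training, never for evaluation.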

lib.data_processing.fill_missing_knn(df, na_column, n_neighbors=10, algorithm='ball_tree')

fill missing data using k Nearest Neighbors

Parameters:
  • df (pandas DataFrame) – Data Frame
  • na_column (list) – columns with NaNs
  • n_neighbors (int) – number of neighbors
  • algorithm (string) – nearest neighbors algorithm
Returns:

dataframe without nans

Return type:

pandas DataFrame
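
One plausible reading of this function is sketched below: each missing value is replaced by the mean of that column over the nearest complete rows. The helper name fill_missing_knn_sketch is hypothetical, the use of scikit-learn's NearestNeighbors is inferred from the algorithm='ball_tree' default, and the sketch assumes all other columns are numeric and free of NaNs.

```python
import numpy as np
import pandas as pd
from sklearn.neighbors import NearestNeighbors


def fill_missing_knn_sketch(df, na_column, n_neighbors=10, algorithm="ball_tree"):
    """Fill NaNs in each listed column with the mean of the nearest complete rows."""
    df = df.copy()
    # Assumption: every column not listed in na_column is a complete numeric feature
    feature_cols = [c for c in df.columns if c not in na_column]
    for col in na_column:
        missing = df[col].isna()
        if not missing.any():
            continue
        known = df.loc[~missing]
        k = min(n_neighbors, len(known))
        nn = NearestNeighbors(n_neighbors=k, algorithm=algorithm)
        nn.fit(known[feature_cols].to_numpy())
        _, idx = nn.kneighbors(df.loc[missing, feature_cols].to_numpy())
        # idx holds row positions into `known`; average the neighbors' observed values
        df.loc[missing, col] = known[col].to_numpy()[idx].mean(axis=1)
    return df


df = pd.DataFrame({"x": [0.0, 1.0, 2.0, 10.0, 11.0],
                   "v": [1.0, 1.0, np.nan, 5.0, 5.0]})
filled = fill_missing_knn_sketch(df, na_column=["v"], n_neighbors=2)
```

Here the row with x = 2 borrows its value from the rows with x = 1 and x = 0, since those are its two nearest neighbors.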

lib.data_processing.k_fold_prediction(X, y, n_splits, model, reg_params, fit_parameters, upsample_kwargs)

performs train-predict k fold

Parameters:
  • X (pandas DataFrame) – Training Features
  • y (pandas Series) – Training Labels
  • n_splits (int) – number of k splits
  • model (object) – model with fit and predict methods
  • reg_params (dictionary) – model kwargs
  • fit_parameters (dictionary) – model fit kwargs
  • upsample_kwargs (dictionary) – syntetic_sampling kwargs
Returns:

predictions

Return type:

pandas Series
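
The train-predict loop can be pictured as follows: the data is split into n_splits folds, and each fold is predicted by a model trained on the remaining folds. This is a minimal sketch, not the library's code. It assumes model is a class instantiated with reg_params, it omits the upsampling step, and both MeanThresholdModel and k_fold_prediction_sketch are hypothetical names introduced for the example.

```python
import numpy as np
import pandas as pd


class MeanThresholdModel:
    """Hypothetical stand-in model: predicts 1 when a column exceeds its training mean."""

    def __init__(self, column="f1"):
        self.column = column

    def fit(self, X, y):
        self.threshold_ = X[self.column].mean()
        return self

    def predict(self, X):
        return (X[self.column] > self.threshold_).astype(int).to_numpy()


def k_fold_prediction_sketch(X, y, n_splits, model, reg_params, fit_parameters):
    """Return out-of-fold predictions: each row is predicted by a model that never saw it."""
    rng = np.random.default_rng(0)
    folds = np.array_split(rng.permutation(len(X)), n_splits)
    preds = pd.Series(np.nan, index=X.index)
    for i, test_idx in enumerate(folds):
        train_idx = np.concatenate([f for j, f in enumerate(folds) if j != i])
        estimator = model(**reg_params)  # fresh model per fold (assumes `model` is a class)
        estimator.fit(X.iloc[train_idx], y.iloc[train_idx], **fit_parameters)
        preds.iloc[test_idx] = estimator.predict(X.iloc[test_idx])
    return preds


X = pd.DataFrame({"f1": np.arange(20.0)})
y = pd.Series((X["f1"] > 9).astype(int))
oof = k_fold_prediction_sketch(X, y, n_splits=5, model=MeanThresholdModel,
                               reg_params={"column": "f1"}, fit_parameters={})
```

Because every prediction comes from a model that did not train on that row, the returned Series can be scored against y as an estimate of generalization performance.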

lib.data_processing.ohe(df, columns, drop_first=True)

apply OneHotEncoder

Parameters:
  • df (pandas DataFrame) – Data Frame
  • columns (list) – columns to encode
  • drop_first (boolean) – drop the first one-hot column of each encoded feature
Returns:

dataframe with ohe

Return type:

pandas DataFrame
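
To show what the drop_first flag does, the example below uses pandas.get_dummies as a stand-in. Whether the library itself relies on get_dummies or on scikit-learn's OneHotEncoder is not stated here, so treat this only as an illustration of the expected output shape.

```python
import pandas as pd

df = pd.DataFrame({"color": ["red", "blue", "red"], "size": [1, 2, 3]})

# drop_first=True removes the first category ("blue") to avoid a redundant column
encoded = pd.get_dummies(df, columns=["color"], drop_first=True)
```

The result keeps the untouched size column and adds a single color_red indicator. With two categories, one indicator carries all the information, which is why dropping the first column is common for linear models.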

lib.data_processing.read_data(path, remove_nans=True, apply_ohe=True)

read and transform data

Parameters:
  • path (string) – path to csv file
  • remove_nans (boolean) – remove nans from DataFrame
  • apply_ohe (boolean) – apply OneHotEncoder on categorical data
Returns:

clean data

Return type:

pandas DataFrame with transformations

lib.data_processing.syntetic_sampling(X, y, over_sampling, under_sampling)

Apply Synthetic Minority Oversampling Technique (SMOTE) to an unbalanced class

Parameters:
  • X (pandas DataFrame) – Training Features
  • y (pandas Series) – Training Labels
Returns:

resampled data

Return type:

tuple
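
The core idea of SMOTE is that each synthetic point lies on the line segment between a minority sample and one of its nearest minority neighbors. The sketch below shows only that interpolation step in plain NumPy. The function name smote_like_samples and its parameters are hypothetical, and the library presumably delegates to a dedicated implementation, which this does not reproduce.

```python
import numpy as np


def smote_like_samples(X_minority, n_new, k=2, seed=0):
    """Create n_new synthetic points by interpolating toward nearest minority neighbors."""
    rng = np.random.default_rng(seed)
    X_minority = np.asarray(X_minority, dtype=float)

    # Pairwise Euclidean distances between minority points
    dists = np.linalg.norm(X_minority[:, None, :] - X_minority[None, :, :], axis=2)
    np.fill_diagonal(dists, np.inf)  # a point is not its own neighbor
    neighbors = np.argsort(dists, axis=1)[:, :k]

    base = rng.integers(0, len(X_minority), size=n_new)       # starting samples
    chosen = neighbors[base, rng.integers(0, k, size=n_new)]  # one neighbor each
    gap = rng.random((n_new, 1))                               # position along the segment
    return X_minority[base] + gap * (X_minority[chosen] - X_minority[base])


X_min = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
synthetic = smote_like_samples(X_min, n_new=5)
```

Unlike plain copying, the new rows are not duplicates, so the minority region is filled in rather than re-weighted.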

lib.data_processing.train_test_sample(X, y, test_size, upsample_type=None, over_sampling=None, under_sampling=None)

Splits into train and test samples and applies transformations to the train sample

Parameters:
  • X (pandas DataFrame) – Training Features
  • y (pandas Series) – Training Labels
  • test_size (float) – size of the test split, defaults to None
  • upsample_type (float) – sampling method
  • over_sampling (float) – oversample rate
  • under_sampling (float) – undersample rate
Returns:

train and test split

Return type:

tuple

lib.model module

class lib.model.BinaryClassifier

Bases: object

Binary Classifier Model

dump(path=None)

export model to file

Parameters: path (string) – path to dump model, defaults to None
fit(X, y)

fit binary classifier model

Parameters:
  • X (pandas DataFrame) – Training features
  • y (pandas Series) – Training labels
k_fold_prediction(X, y, n_splits=5)

performs train-predict k fold

Parameters:
  • X (pandas DataFrame) – Training Features
  • y (pandas Series) – Training labels
  • n_splits (int) – number of k splits, defaults to 5
Returns:

k fold predictions

Return type:

pandas Series

load(path=None)

load model from file

Parameters: path (string) – path to load the model from, defaults to None
predict(X)

predict on data

Parameters: X (pandas DataFrame) – data to predict
Returns: predictions
Return type: pandas Series

class lib.model.MultiClassifier

Bases: object

Multiclass Classifier Model

dump(path=None)

export model to file

Parameters: path (string) – path to dump model, defaults to None
fit(X, y)

fit multiclass classifier model

Parameters:
  • X (pandas DataFrame) – Training features
  • y (pandas Series) – Training labels
load(path=None)

load model from file

Parameters: path (string) – path to load the model from, defaults to None
predict(X)

predict on data

Parameters: X (pandas DataFrame) – data to predict
Returns: predictions
Return type: pandas Series

lib.utils module

lib.utils.Evaluate(true_label, predicted_label, predicted_prob, labels)

Plot the confusion matrix and display the accuracy, F1, and ROC AUC scores

Parameters:
  • true_label (array) – ground truth values
  • predicted_label (array) – predicted values
  • predicted_prob (array) – probability for each predicted class
  • labels (list) – list containing label strings
lib.utils.arg_nearest(array, value)

Find index of nearest value for a given number

Parameters:
  • array (array) – numpy array
  • value (float) – desired value
Returns:

index

Return type:

int
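
A common way to implement this lookup is to take the argmin of the absolute differences. The sketch below follows that approach. It may differ from the library's code in details such as tie-breaking, and the name arg_nearest_sketch is hypothetical.

```python
import numpy as np


def arg_nearest_sketch(array, value):
    """Return the index of the element closest to `value` (first one on ties)."""
    array = np.asarray(array)
    return int(np.abs(array - value).argmin())


thresholds = np.array([0.1, 0.4, 0.75, 0.9])
idx = arg_nearest_sketch(thresholds, 0.7)
```

Here idx is 2, since 0.75 is the entry closest to 0.7. A typical use is locating the point on a precision-recall or ROC curve that corresponds to a chosen threshold.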

lib.utils.plot_confusion_matrix(cm, labels, suptitle='Confusion Matrix')

_subplot_cm wrapper: plot the normalized and non-normalized confusion matrices

Parameters:
  • cm (array) – confusion matrix array
  • labels (list) – list containing label strings
  • suptitle (string) – plot title, defaults to Confusion Matrix
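
For reference, the normalized matrix is usually obtained by dividing each row by its total, so every row sums to 1. The snippet assumes rows correspond to true classes (the scikit-learn convention). How _subplot_cm normalizes internally is not documented here.

```python
import numpy as np

# Raw counts: rows are true classes, columns are predicted classes (assumed layout)
cm = np.array([[8, 2],
               [1, 9]])

# Row-normalize so each entry is the fraction of that true class
cm_normalized = cm / cm.sum(axis=1, keepdims=True)
```

The diagonal of cm_normalized then reads as per-class recall (0.8 and 0.9 in this example), which makes classes of different sizes comparable.
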
lib.utils.plot_precision_recall(y_true, preds_proba)

Plot precision-recall curves

Parameters:
  • y_true (array) – ground truth values
  • preds_proba (array) – probability for positive predicted class