EXPLORATORY ANALYSIS OF CLASSIFICATION MODELS WITH SKLEARN AND BANK MARKETING DATA

Abstract:

The data relates to direct marketing campaigns (phone calls) of a Portuguese banking institution. The classification goal is to predict whether the client will subscribe to a term deposit (variable y).

Source:

Dataset from: http://archive.ics.uci.edu/ml/datasets/Bank+Marketing#

Loading and getting ideas about the data

There are many categorical columns. Let's inspect the unique values of these columns and decide whether to use one-hot encoding or a different approach.

Since the columns containing "unknown" values are categorical, imputing them with a mean is not a good option. So "unknown" is kept as its own category, and all categorical columns are transformed with one-hot encoding. Now we have 64 features in total.
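A minimal sketch of this encoding step with pandas; the column names below are illustrative stand-ins, not the full bank dataset:

```python
import pandas as pd

# Toy frame standing in for the bank data: "unknown" is kept as its own
# category rather than imputed, so it simply becomes another dummy column.
df = pd.DataFrame({
    "job": ["admin.", "unknown", "technician"],
    "contact": ["cellular", "telephone", "unknown"],
})

# One dummy column per (column, category) pair.
encoded = pd.get_dummies(df)
print(encoded.columns.tolist())
```

Note that `job_unknown` and `contact_unknown` show up as ordinary features, exactly as described above.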

No categorical values and no NULL values remain. The target variable is converted to numerical form.
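The target conversion is a simple mapping (a sketch; the yes/no labels follow the dataset description above):

```python
import pandas as pd

# The target column "y" holds "yes"/"no"; map it to 1/0 for the classifiers.
y = pd.Series(["no", "yes", "no", "no"])
y_num = y.map({"yes": 1, "no": 0})
print(y_num.tolist())  # [0, 1, 0, 0]
```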

Correlation analysis

Top 10 most highly correlated columns with the target variable.

Duration: last contact duration, in seconds (numeric). Important note: this attribute highly affects the output target

Poutcome: outcome of the previous marketing campaign

Previous: number of contacts performed before this campaign and for this client (numeric)

Contact: contact communication type. Cellular seems to be more effective.

The rest are self-explanatory; some come from the one-hot encoded columns. Next, visualizing the full correlation matrix with seaborn.
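The ranking and the heatmap can be sketched as follows (on synthetic stand-in data, since the full encoded frame is not reproduced here):

```python
import numpy as np
import pandas as pd
import matplotlib
matplotlib.use("Agg")  # headless backend; assumption: no display is needed
import matplotlib.pyplot as plt
import seaborn as sns

# Synthetic stand-in for the encoded frame; "y" is the numeric target.
rng = np.random.default_rng(0)
df = pd.DataFrame(rng.normal(size=(200, 5)), columns=list("abcde"))
df["y"] = (df["a"] + 0.5 * df["b"] + rng.normal(size=200) > 0).astype(int)

# Strongest correlations with the target (absolute value, target excluded).
top = df.corr()["y"].drop("y").abs().sort_values(ascending=False).head(10)
print(top)

# Full correlation matrix as a seaborn heatmap.
sns.heatmap(df.corr(), annot=True, cmap="coolwarm")
plt.tight_layout()
```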

Splitting target and predictor variables

Standardizing the predictor variables
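These two steps together look roughly like this (a sketch on a tiny stand-in frame; in the notebook `df` holds the 63 encoded predictors plus `y`):

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Tiny stand-in for the encoded bank frame.
df = pd.DataFrame({"duration": [100, 250, 30],
                   "previous": [0, 2, 1],
                   "y": [0, 1, 0]})

X = df.drop(columns="y")   # predictor variables
y = df["y"]                # target variable

# Standardize each predictor to zero mean and unit variance.
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
print(X_scaled.mean(axis=0).round(6))
```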

Sample Distribution

The class distribution is highly imbalanced. AUC can serve as the final metric, along with other feature engineering techniques.
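Checking the balance is a one-liner; the 88/12 split here is illustrative, not the exact dataset ratio:

```python
import pandas as pd

# In the bank data "no" dominates heavily, which is why plain accuracy
# would be misleading and AUC is preferred.
y = pd.Series(["no"] * 88 + ["yes"] * 12)
print(y.value_counts(normalize=True))
```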

AUC_ROC function

The ROC (Receiver Operating Characteristic) curve is a performance measurement for a classification problem at various thresholds. Standard accuracy labels an example as positive if the prediction is >50% confident and negative if it is <50% confident, whereas the ROC-AUC method considers all thresholds. It is calculated by plotting the true positive rate against the false positive rate. This method is particularly useful when the data is imbalanced or when we need to avoid certain biases in the training process. $$ \text{TPR} = \frac{TP}{TP+FN} $$ $$ \text{FPR} = \frac{FP}{TN+FP} $$ where TP and TN are true positives and true negatives, and FP and FN are false positives and false negatives.

The AUC (Area Under the Curve) measures the area under the TPR-FPR plot and ranges from 0 to 1. A value of 0 is the worst separability, 0.5 means no class-separation capacity, and 1 means the best class-separation capacity.
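In practice both quantities come straight from sklearn; a minimal worked example with toy labels and probabilities:

```python
import numpy as np
from sklearn.metrics import roc_curve, roc_auc_score

# Toy labels and predicted probabilities; roc_curve sweeps the thresholds,
# roc_auc_score integrates the resulting TPR-FPR curve.
y_true = np.array([0, 0, 1, 1])
y_prob = np.array([0.1, 0.4, 0.35, 0.8])

fpr, tpr, thresholds = roc_curve(y_true, y_prob)
auc = roc_auc_score(y_true, y_prob)
print(auc)  # 0.75
```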

Train_test split
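A sketch of the split; the `stratify=y` argument (an assumption about the exact call used) keeps the minority-class proportion identical in both halves, which matters for data this imbalanced:

```python
import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(20).reshape(10, 2)
y = np.array([0] * 8 + [1] * 2)

# stratify=y preserves the 8:2 class ratio in both splits.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.5, stratify=y, random_state=42
)
print(y_train.sum(), y_test.sum())
```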

Training models

Okay, we will begin by training initial models to understand the baseline performance before doing some feature engineering to make the models better.

KNN

For the first model we will begin with KNN, one of the most common and simplest supervised classification techniques. KNNs are usually very good for linearly separable data, but they work with non-linear data as well. KNN works by calculating the Euclidean distance from a sample to its k nearest data points and assigning the sample to the majority group among those neighbors.

Advantages: Robust to noise

Disadvantage: Slow, Finding k value is crucial
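A baseline KNN run might look like this (on synthetic imbalanced data standing in for the bank features; hyperparameters are illustrative):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import roc_auc_score

# Synthetic imbalanced data standing in for the bank features.
X, y = make_classification(n_samples=500, n_features=10, weights=[0.88],
                           random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

knn = KNeighborsClassifier(n_neighbors=5)  # k is the crucial hyperparameter
knn.fit(X_tr, y_tr)
auc = roc_auc_score(y_te, knn.predict_proba(X_te)[:, 1])
print(round(auc, 3))
```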

Decision Tree classifier.

Decision trees are another supervised learning technique. It is a tree-structured classifier, where internal nodes represent the features of a dataset, branches represent the decision rules and each leaf node represents the outcome.

Disadvantage: prone to overfitting when the model becomes too complex, but this can be rectified with a Random Forest implementation.

Advantage: There is less requirement of data cleaning compared to other algorithms.
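The same baseline pattern with a decision tree; the `max_depth` cap (an illustrative choice) is the usual guard against the overfitting noted above:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import roc_auc_score

X, y = make_classification(n_samples=500, n_features=10, random_state=1)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=1)

# max_depth limits tree complexity to reduce overfitting.
tree = DecisionTreeClassifier(max_depth=5, random_state=1)
tree.fit(X_tr, y_tr)
auc = roc_auc_score(y_te, tree.predict_proba(X_te)[:, 1])
print(round(auc, 3))
```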

SVM

SVM is one of the most renowned machine learning models. It has achieved state-of-the-art results on several types of classification problems.
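A corresponding SVM baseline sketch; `probability=True` is needed here so that `predict_proba` is available for the AUC metric, at some extra training cost:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.metrics import roc_auc_score

X, y = make_classification(n_samples=500, n_features=10, random_state=2)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=2)

# probability=True enables predict_proba (needed for AUC).
svm = SVC(kernel="rbf", probability=True, random_state=2)
svm.fit(X_tr, y_tr)
auc = roc_auc_score(y_te, svm.predict_proba(X_te)[:, 1])
print(round(auc, 3))
```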

XGB classifier

XGBoost is an optimized distributed gradient boosting library designed to be highly efficient, flexible, and portable. It implements machine learning algorithms under the gradient boosting framework.

Thoughts

The obtained AUC scores range from 0.65 to 0.75 for all the models. Let's see if we can improve by balancing the distribution through up-sampling or down-sampling.

Feature Engineering: Up-Sampling with SMOTE

Train Set

Test Set

Feature Engineering: Down-Sampling with NearMiss

Train Set

Test Set

BankMarket Class

Now that we have trained a few models, done some feature engineering, and understood how the results look, we can create a single object that can be used to train, test, and predict with simple user inputs. This is a rather simple OOP implementation and can easily be extended with GUI elements later if needed, but since this is a Kaggle challenge that is not implemented.

The object asks for user input of the type of model to be trained. After selecting, it will train and test automatically and save the model under the same name as the input string given by the user, which can later be used to predict samples by simply inputting the file path of the model. The object takes X_train, X_test, y_train, y_test for training and testing, and any array or dataframe of shape (1, 63) for predicting a class.
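A condensed sketch of what such a class might look like. The class name follows the description above, but the model registry, method names, and the use of `joblib` for persistence are assumptions, not the original implementation:

```python
import joblib
from sklearn.metrics import roc_auc_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

class BankMarket:
    """Train, evaluate, save, and reuse a named classifier (sketch)."""

    # Hypothetical registry mapping user input strings to model classes.
    MODELS = {"knn": KNeighborsClassifier, "tree": DecisionTreeClassifier}

    def __init__(self, X_train, X_test, y_train, y_test):
        self.X_train, self.X_test = X_train, X_test
        self.y_train, self.y_test = y_train, y_test

    def train(self, name):
        # Train the requested model, report test AUC, save it under `name`.
        model = self.MODELS[name]()
        model.fit(self.X_train, self.y_train)
        auc = roc_auc_score(self.y_test,
                            model.predict_proba(self.X_test)[:, 1])
        joblib.dump(model, f"{name}.joblib")
        return auc

    @staticmethod
    def predict(filepath, sample):
        # Load a previously saved model and predict one (1, n_features) sample.
        return joblib.load(filepath).predict(sample)

# Demo on synthetic data.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=200, n_features=5, random_state=6)
bm = BankMarket(*train_test_split(X, y, random_state=6))
auc = bm.train("knn")
label = BankMarket.predict("knn.joblib", X[:1])
print(round(auc, 3), label)
```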

Original non engineered data

The same results we saw previously.

Train with upsampled data

All models improved by a big margin; the up-sampling helped performance considerably.

Train with Down-sampled data

Significantly better as well, although the up-sampled version should be used for the final model.

Predicting new inputs to get a label class

Any new values can be input, and the model will output a label.

Great. The randomly generated 63 values were predicted as a positive subscription. We now have a classifier with an AUC of 0.89 for predicting whether a subject will be a subscriber or not.