TABULAR PLAYGROUND - JULY 2021 - EXPLORATORY ANALYSIS OF AIR POLLUTANTS WITH REGRESSION MODELS¶

This notebook is for learning purposes. The data is partly from the real world and partly artificially generated. Data link here.¶

The challenge is to predict the values of air pollution measurements over time, based on basic weather information (temperature and humidity) and the input values of 5 sensors.

Predictors:¶

deg_C = temperature

relative_humidity = humidity

absolute_humidity = normalized humidity

sensors 1 to 5 = sensor data

Targets:¶

target_carbon_monoxide = CO values in air

target_benzene = benzene values in air

target_nitrogen_oxides = NOx values in air

1. Loading and Understanding Data¶

import pandas as pd  #data manipulation
import numpy as np   #linear algebra
import matplotlib.pyplot as plt   #plotting library
import seaborn as sns       #plotting library

#load data
train = pd.read_csv('../input/tabular-playground-series-jul-2021/train.csv')
test_unseen = pd.read_csv('../input/tabular-playground-series-jul-2021/test.csv')

train.head()

train.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7111 entries, 0 to 7110
Data columns (total 12 columns):
 #   Column                  Non-Null Count  Dtype  
---  ------                  --------------  -----  
 0   date_time               7111 non-null   object 
 1   deg_C                   7111 non-null   float64
 2   relative_humidity       7111 non-null   float64
 3   absolute_humidity       7111 non-null   float64
 4   sensor_1                7111 non-null   float64
 5   sensor_2                7111 non-null   float64
 6   sensor_3                7111 non-null   float64
 7   sensor_4                7111 non-null   float64
 8   sensor_5                7111 non-null   float64
 9   target_carbon_monoxide  7111 non-null   float64
 10  target_benzene          7111 non-null   float64
 11  target_nitrogen_oxides  7111 non-null   float64
dtypes: float64(11), object(1)
memory usage: 666.8+ KB

train[1:].describe() #summary statistics of the numeric columns; note that [1:] drops the first row, while describe() already skips the object column

train.isnull().values.any() #check null values

False

Thoughts:¶

The data is relatively clean. There is only one object column, i.e. date_time, which can be engineered with one hot encoding. There are no other discrete columns, so we can proceed further.
The std of the sensor columns is very high. These columns should be engineered (scaled).

total = [train,test_unseen]  #combining train and test into a single dataframe
total = pd.concat(total)
total

Feature engineer the datetime column¶

extracting months, day of the month, hours, day name, time of the day and weekend or weekday check

'''
months and day of the months 
'''

total.date_time = pd.to_datetime(total.date_time)
months = total.date_time.dt.month
monthly_days = total.date_time.dt.day
hours = total.date_time.dt.hour
day_name = total.date_time.dt.day_name()
days = pd.get_dummies(day_name)
days

'''
part of the day, separating into 6 parts based on daylight and one hot encoding the resultant
categorical values
'''

def daypart(hour):
    if hour in [2,3,4,5]:
        return "dawn"
    elif hour in [6,7,8,9]:
        return "morning"
    elif hour in [10,11,12,13]:
        return "noon"
    elif hour in [14,15,16,17]:
        return "afternoon"
    elif hour in [18,19,20,21]:
        return "evening"
    else: return "midnight"

raw_days = hours.apply(daypart)
dayparts = pd.get_dummies(raw_days)
dayparts = dayparts[['dawn','morning','noon','afternoon','evening','midnight']]
dayparts
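
The same bucketing can also be done without a row-by-row apply. What follows is a minimal sketch rather than the notebook's method: DAYPART_LABELS and daypart_vectorized are hypothetical names, and it assumes the hours are integers from 0 to 23. Shifting by 2 and wrapping modulo 24 sends hours 22, 23, 0 and 1 to the last bucket, matching the else branch of daypart().

```python
import pandas as pd

# Hypothetical names, used for illustration only.
DAYPART_LABELS = ["dawn", "morning", "noon", "afternoon", "evening", "midnight"]

def daypart_vectorized(hours: pd.Series) -> pd.Series:
    """Map integer hours 0-23 to the same six 4-hour buckets as daypart()."""
    # Hour 2 becomes bucket 0; hours 22, 23, 0 and 1 wrap around into bucket 5.
    bucket = ((hours - 2) % 24) // 4
    return bucket.map(dict(enumerate(DAYPART_LABELS)))

hours_demo = pd.Series(range(24))
parts_demo = daypart_vectorized(hours_demo)
print(parts_demo.value_counts())
```

For a frame of this size either version is fast enough; the arithmetic form mainly makes the wrap-around at midnight explicit.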

'''
check if the day is a weekend or otherwise
'''
is_weekend = day_name.apply(lambda x : 1 if x in ['Saturday','Sunday'] else 0)
is_weekend = pd.DataFrame({'is_weekend':is_weekend})
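
As a hedged alternative to comparing day names, the weekend flag can be derived from the numeric day of the week, where pandas counts Monday as 0 and Sunday as 6. The timestamps and the names stamps and is_weekend_demo below are made up for the example.

```python
import pandas as pd

# Illustrative timestamps: 2010-03-12 is a Friday, so the four days are Fri, Sat, Sun, Mon.
stamps = pd.Series(pd.to_datetime(["2010-03-12", "2010-03-13", "2010-03-14", "2010-03-15"]))

# dayofweek is 5 for Saturday and 6 for Sunday.
is_weekend_demo = (stamps.dt.dayofweek >= 5).astype(int)
print(is_weekend_demo.tolist())
```

This avoids depending on the spelling or locale of the day names.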

'''
concat the resultant columns and split the train and test df back based on the original index
'''

final = pd.concat([total, dayparts, days,is_weekend],axis=1)
final = final.drop('date_time',axis=1)
train_eng = final[:7111]
test_eng = final.iloc[7111:,:]      #the test rows start right after the 7111 train rows

print(f'train :{train_eng.shape}, test: {test_eng.shape}')

train :(7111, 25), test: (2247, 25)
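
Splitting a combined frame by row position is easy to get off by one. One possible safeguard, sketched here on toy data with made-up values, is to tag the rows when concatenating and split by that tag afterwards. The names train_demo, test_demo and combined are illustrative and not part of the notebook.

```python
import pandas as pd

# Toy stand-ins for the train and test frames (values are made up).
train_demo = pd.DataFrame({"deg_C": [13.1, 13.2, 12.6], "target": [2.5, 2.1, 2.2]})
test_demo = pd.DataFrame({"deg_C": [5.1, 5.8]})

# keys= adds an outer index level recording which frame each row came from.
combined = pd.concat([train_demo, test_demo], keys=["train", "test"])

# Feature engineering on the combined frame would go here.
combined["deg_C_rounded"] = combined["deg_C"].round()

# Split back by label instead of by row position.
train_back = combined.xs("train")
test_back = combined.xs("test")
print(train_back.shape, test_back.shape)
```

No row counts are hard-coded, so the split stays correct if the input files change size.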

train_eng

test_eng

Okay, now we have 25 columns after feature engineering. But we still need to look into the distributions of the other predictor variables before training the models.

2. Data Distributions¶

%matplotlib inline

numeric_features = ['deg_C','relative_humidity','absolute_humidity','sensor_1','sensor_2','sensor_3',
                   'sensor_4','sensor_5','target_carbon_monoxide','target_benzene','target_nitrogen_oxides']

'''
Non engineered numerical distribution
'''
 
for col in numeric_features:        #iterating through the numerical features and plotting the histogram, mean and median
    fig = plt.figure(figsize=(9, 6))
    ax = fig.gca()
    feature = train_eng[col]
    feature.hist(bins=100, ax = ax)
    ax.axvline(feature.mean(), color='magenta', linestyle='dashed', linewidth=2)
    ax.axvline(feature.median(), color='cyan', linestyle='dashed', linewidth=2)
    ax.set_title(col)
plt.show()

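The target variables and the sensor data are right-skewed: the mean and median sit to the left of the range, with a long tail to the right. This should be fixed. Min-max and standard scaling only shift and rescale the values, so they leave the shape of a distribution unchanged, while a log transform compresses the tail. Three different methods were tried: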

min_max_scaling
log_scaling
Standard_scaling
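
To see how the three methods differ in their effect on skew, here is a small sketch on a synthetic right-skewed sample. The lognormal sample only stands in for a sensor or target column; it is an assumption of the example, not the competition data.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
# Synthetic right-skewed sample (assumption: stands in for a sensor or target column).
raw = pd.Series(rng.lognormal(mean=0.0, sigma=1.0, size=5000))

min_max = (raw - raw.min()) / (raw.max() - raw.min())   # rescale to [0, 1]
standard = (raw - raw.mean()) / raw.std()               # zero mean, unit variance
logged = np.log1p(raw)                                  # log(1 + x)

skews = {
    "raw": raw.skew(),
    "min_max": min_max.skew(),
    "standard": standard.skew(),
    "log1p": logged.skew(),
}
print(skews)
```

Min-max and standard scaling are affine maps, so the skewness they report equals that of the raw sample; only the log transform changes the shape.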

'''
applying min_max scaling to the predictors and targets
'''
def min_max_scaling(df):
    df_norm = df.copy()
    for column in df_norm.columns:
        df_norm[column] = (df_norm[column] - df_norm[column].min()) / (df_norm[column].max() - df_norm[column].min())
        
    return df_norm

eng_norm = min_max_scaling(train_eng)

eng_norm.describe()

for col in numeric_features:
    fig = plt.figure(figsize=(9, 6))
    ax = fig.gca()
    feature = eng_norm[col]
    feature.hist(bins=100, ax = ax)
    ax.axvline(feature.mean(), color='magenta', linestyle='dashed', linewidth=2)
    ax.axvline(feature.median(), color='cyan', linestyle='dashed', linewidth=2)
    ax.set_title(col)
plt.show()

'''
applying log scaling to the predictors and targets
'''
def log_scaling(col):
    col = np.log1p(col)
    return col

log_scale = train_eng.copy()
for t in log_scale.columns:
    log_scale[t] = log_scaling(log_scale[t])

log_scale.describe()

for col in numeric_features:
    fig = plt.figure(figsize=(9, 6))
    ax = fig.gca()
    feature = log_scale[col]
    feature.hist(bins=100, ax = ax)
    ax.axvline(feature.mean(), color='magenta', linestyle='dashed', linewidth=2)
    ax.axvline(feature.median(), color='cyan', linestyle='dashed', linewidth=2)
    ax.set_title(col)
plt.show()

from sklearn.preprocessing import StandardScaler

standard_scale = train_eng.copy()
scaler = StandardScaler()
scaled = pd.DataFrame(scaler.fit_transform(standard_scale), columns = train_eng.columns)

for col in numeric_features:
    fig = plt.figure(figsize=(9, 6))
    ax = fig.gca()
    feature = scaled[col]
    feature.hist(bins=100, ax = ax)
    ax.axvline(feature.mean(), color='magenta', linestyle='dashed', linewidth=2)
    ax.axvline(feature.median(), color='cyan', linestyle='dashed', linewidth=2)
    ax.set_title(col)
plt.show()

Thoughts:¶

All the distributions are a bit on the tail-heavy side. Log scaling certainly looks the most plausible to work with, as it is closest to a normal distribution and the mean and the median are closer to the center.

Note: The distributions are not truly normal in the statistical sense, which would result in a smooth, symmetric "bell-curve" histogram with the mean and mode (the most common value) in the center; but they do generally indicate that most of the observations have a value somewhere near the middle.

We've explored the distribution of the numeric values in the dataset, but how do they relate to each other and to the targets? We can plot the main numeric features and the targets together in a single plot with the pandas scatter_matrix method.

from pandas.plotting import scatter_matrix

attributes = ['target_carbon_monoxide','target_nitrogen_oxides',
              'target_benzene','sensor_1','sensor_5','sensor_2','sensor_4', 'sensor_3','deg_C','relative_humidity']

axes = scatter_matrix(log_scale[attributes], figsize=(18,18))
for ax in axes.flatten():
    ax.xaxis.label.set_rotation(90)
    ax.yaxis.label.set_rotation(0)
    ax.yaxis.label.set_ha('right')

plt.tight_layout()
plt.gcf().subplots_adjust(wspace=0, hspace=0)
plt.show()

Thoughts¶

Since we decided to use log scale as the final method, the scatter matrix was plotted only for that scale. There seem to be both linear and non-linear relationships between the different features.

The sensors seem to have a linear relationship with their target variables.
The temperature and humidity both have non-linear relationships.
The target variables have a linear relationship with each other.

Let's get the correlation matrix to see whether the linear and non-linear relations make sense with Pearson correlation analysis

'''correlation of features with CO target variable'''

corr_matrix = log_scale.corr()
corr_matrix['target_carbon_monoxide'].sort_values(ascending=False)

target_carbon_monoxide    1.000000
sensor_1                  0.870870
target_nitrogen_oxides    0.843404
sensor_5                  0.829337
sensor_2                  0.779981
target_benzene            0.774434
sensor_4                  0.475681
evening                   0.329730
noon                      0.108269
afternoon                 0.100031
Friday                    0.096594
Thursday                  0.084747
Tuesday                   0.062351
Wednesday                 0.052710
morning                   0.049013
deg_C                     0.041945
absolute_humidity        -0.010726
Monday                   -0.027995
relative_humidity        -0.030062
Saturday                 -0.063538
midnight                 -0.081991
Sunday                   -0.206758
is_weekend               -0.209171
dawn                     -0.505413
sensor_3                 -0.707397
Name: target_carbon_monoxide, dtype: float64

'''correlation of features with NO target variable'''

corr_matrix['target_nitrogen_oxides'].sort_values(ascending=False)

target_nitrogen_oxides    1.000000
target_carbon_monoxide    0.843404
sensor_5                  0.778850
sensor_1                  0.711786
sensor_2                  0.652730
target_benzene            0.651093
sensor_4                  0.226679
evening                   0.203325
noon                      0.136201
morning                   0.105208
relative_humidity         0.101113
Friday                    0.077549
afternoon                 0.074427
Thursday                  0.063057
Tuesday                   0.053671
Wednesday                 0.048867
Monday                   -0.021145
Saturday                 -0.045496
midnight                 -0.069243
absolute_humidity        -0.109330
is_weekend               -0.172943
deg_C                    -0.174690
Sunday                   -0.177989
dawn                     -0.450123
sensor_3                 -0.647287
Name: target_nitrogen_oxides, dtype: float64

'''correlation of features with benzene target variable'''

corr_matrix['target_benzene'].sort_values(ascending=False)

target_benzene            1.000000
sensor_2                  0.989710
sensor_5                  0.822993
sensor_4                  0.819966
target_carbon_monoxide    0.774434
sensor_1                  0.743550
target_nitrogen_oxides    0.651093
absolute_humidity         0.341020
evening                   0.242480
deg_C                     0.143921
noon                      0.122983
afternoon                 0.105975
Tuesday                   0.084101
morning                   0.067985
Thursday                  0.054046
Wednesday                 0.031614
Friday                    0.027704
Monday                    0.003599
Saturday                 -0.017511
relative_humidity        -0.053485
midnight                 -0.072257
is_weekend               -0.156263
Sunday                   -0.184428
dawn                     -0.467419
sensor_3                 -0.899791
Name: target_benzene, dtype: float64

'''
plotting the correlation matrix with the seaborn heatmap method
'''

import matplotlib.pyplot as plt
import seaborn as sns

mask = np.triu(np.ones_like(corr_matrix, dtype=bool))
f, ax = plt.subplots(figsize=(11, 9))
cmap = sns.diverging_palette(230, 20, as_cmap=True)
sns.heatmap(corr_matrix, mask=mask, cmap=cmap, vmax=.3, center=0,
            square=True, linewidths=.5, cbar_kws={"shrink": .5})

<AxesSubplot:>

Thoughts¶

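Looks like sensors 1, 2 and 5 have a high positive correlation with all three targets, sensor_3 has a strong negative correlation, and sensor_4 correlates mainly with benzene (0.82). The targets are also strongly correlated with each other (0.65 to 0.84). Temperature and humidity show only weak linear correlation with the targets (mostly below 0.2 in absolute value, with absolute_humidity against benzene at 0.34). We saw that in the scatter matrix as well, since the sensors seemed to have a linear relationship.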
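
Pearson correlation only measures linear association, so a feature with a strong but curved relationship to a target can still score low. The sketch below, on made-up data, compares Pearson with a rank-based (Spearman-style) correlation computed as Pearson on the ranks. The variables x and y are purely illustrative.

```python
import numpy as np
import pandas as pd

x = pd.Series(np.linspace(0.0, 5.0, 200))
y = np.exp(x)   # monotonic in x, but strongly non-linear

pearson = x.corr(y)                  # linear association only
spearman = x.rank().corr(y.rank())   # Pearson on ranks, i.e. Spearman's rho

print(f"pearson={pearson:.3f}, spearman={spearman:.3f}")
```

A large gap between the two values is a hint that a relationship is monotonic but not linear, which may be worth checking for temperature and humidity.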

Train a Regression Model¶

Now that we've explored the data, it's time to use it to train a regression model that uses the features we've identified as potentially predictive to predict the targets. The first thing we need to do is to separate the train data into train and validation sets. Since there are 3 different target variables that are correlated with each other, we keep predictors and targets together in one frame and randomly split the dataframe 80/20 with DataFrame.sample instead of calling train_test_split.

Beginning with a naive linear regression model.

val = log_scale.sample(frac = 0.2)
train = log_scale.drop(val.index)

print(f'train size: {len(train)}, val size: {len(val)}')

train size: 5689, val size: 1422
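
The split above draws a different sample on every run. If repeatable numbers are wanted, one option is to pass a fixed random_state to sample. This is a sketch on a toy frame; SPLIT_SEED and the frame contents are assumptions for the example.

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame({"x": np.arange(10), "y": np.arange(10) * 2})

SPLIT_SEED = 0   # assumed value; any fixed integer works
val_demo = frame.sample(frac=0.2, random_state=SPLIT_SEED)
train_demo = frame.drop(val_demo.index)

print(f"train size: {len(train_demo)}, val size: {len(val_demo)}")
```

Dropping by index keeps the two parts disjoint, and rerunning the cell with the same seed returns the same validation rows.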

LinearRegression()¶

from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score


'''
Linear regression first. Although we saw some non linear relationship between features, it is a good starting place to
get an idea and quantify some metrics before moving to more complex models'''


model_dict = {'target_nitrogen_oxides':LinearRegression(),  #Dictionary for 3 different targets with linearregression
             'target_benzene': LinearRegression(), 
             'target_carbon_monoxide': LinearRegression()}

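for idx_target, target in enumerate(model_dict,1):    #iterate over the dictionary and split the target and predictors
    #caution: the selection is positional. The last three columns of train are Tuesday, Wednesday and is_weekend,
    #so y is a calendar dummy here and the three pollutant columns remain inside X
    X_train = train.iloc[:, :-3]           
    y_train = train.iloc[:, -idx_target]
    
    X_test = val.iloc[:, :-3]
    y_test = val.iloc[:, -idx_target]
    
    model_dict[target].fit(X_train, y_train)
    preds = model_dict[target].predict(X_test)
    mse = mean_squared_error(y_test,preds)          #mean squared error; the root is taken when printing
    r2 = r2_score(y_test,preds)
    
    print(f'{target}: rmse: {np.sqrt(mse).round(2)}, r2:  {r2}')   #results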

target_nitrogen_oxides: rmse: 0.0, r2:  1.0
target_benzene: rmse: 0.18, r2:  0.4087527397272016
target_carbon_monoxide: rmse: 0.18, r2:  0.4284852854084843

Decision Tree()¶

from sklearn.tree import DecisionTreeRegressor

'''Decision Trees can learn with non linear data'''

model_dict = {'target_nitrogen_oxides':DecisionTreeRegressor(),
             'target_benzene': DecisionTreeRegressor(),
             'target_carbon_monoxide': DecisionTreeRegressor()}

for idx_target, target in enumerate(model_dict,1):    #same positional split as in the linear regression cell, so y is again a calendar dummy
    X_train = train.iloc[:, :-3]
    y_train = train.iloc[:, -idx_target]
    
    X_test = val.iloc[:, :-3]
    y_test = val.iloc[:, -idx_target]
    
    model_dict[target].fit(X_train, y_train)
    preds = model_dict[target].predict(X_test)
    rmse = mean_squared_error(y_test,preds)
    r2 = r2_score(y_test,preds)
    
    print(f'{target}: rmse: {np.sqrt(rmse).round(2)}, r2:  {r2}')

target_nitrogen_oxides: rmse: 0.0, r2:  1.0
target_benzene: rmse: 0.24, r2:  0.00290895553331183
target_carbon_monoxide: rmse: 0.24, r2:  0.05350032825995843

RandomForestRegressor()¶

from sklearn.ensemble import RandomForestRegressor

'''An even better tree model'''


model_dict = {'target_nitrogen_oxides':RandomForestRegressor(),
             'target_benzene': RandomForestRegressor(),
             'target_carbon_monoxide': RandomForestRegressor()}

for idx_target, target in enumerate(model_dict,1):       #same positional split as in the linear regression cell, so y is again a calendar dummy
    X_train = train.iloc[:, :-3] 
    y_train = train.iloc[:, -idx_target]
    
    X_test = val.iloc[:, :-3]
    y_test = val.iloc[:, -idx_target]
    
    model_dict[target].fit(X_train, y_train)
    preds = model_dict[target].predict(X_test)
    rmse = mean_squared_error(y_test,preds)
    r2 = r2_score(y_test,preds)
    
    print(f'{target}: rmse: {np.sqrt(rmse).round(2)}, r2:  {r2}')

target_nitrogen_oxides: rmse: 0.0, r2:  1.0
target_benzene: rmse: 0.17, r2:  0.4818565637285873
target_carbon_monoxide: rmse: 0.17, r2:  0.49018064876404266

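So now we've quantified something, but not yet the ability of our models to predict the pollutants. An r2 of exactly 1.0 for target_nitrogen_oxides is too good to be true, and the cause is the column selection rather than overfitting: iloc[:, -idx_target] picks is_weekend, Wednesday and Tuesday (the last three columns of the engineered frame), while the three pollutant columns stay in X. is_weekend is fully determined by the Saturday and Sunday dummies, hence the perfect score. Selecting the targets by column name would give scores that actually measure pollutant prediction. We can probably do better!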
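
It is worth checking which columns the positional slices actually select. The sketch below rebuilds the 25-column layout of the engineered frame with random filler values and compares positional selection with selection by name. The names frame, TARGETS, X and y are illustrative, and the filler values carry no meaning.

```python
import numpy as np
import pandas as pd

# Column order of the engineered frame; the values are random filler.
columns = (["deg_C", "relative_humidity", "absolute_humidity"]
           + [f"sensor_{i}" for i in range(1, 6)]
           + ["target_carbon_monoxide", "target_benzene", "target_nitrogen_oxides"]
           + ["dawn", "morning", "noon", "afternoon", "evening", "midnight"]
           + ["Friday", "Monday", "Saturday", "Sunday", "Thursday", "Tuesday", "Wednesday"]
           + ["is_weekend"])
rng = np.random.default_rng(0)
frame = pd.DataFrame(rng.random((6, len(columns))), columns=columns)

# What iloc[:, -1], iloc[:, -2] and iloc[:, -3] pick up.
positional_targets = [frame.columns[-i] for i in (1, 2, 3)]
print(positional_targets)

# Selecting by name keeps the three pollutants out of X and puts them in y.
TARGETS = ["target_carbon_monoxide", "target_benzene", "target_nitrogen_oxides"]
X = frame.drop(columns=TARGETS)
y = frame[TARGETS]
print(X.shape, y.shape)
```

With name-based selection, a per-target loop can use y[target] directly, and the column order of the frame no longer matters.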

k-Fold implementation and more capable model- CatBoostRegressor¶

Until now we had a single arbitrary split for training and validation. Although it is random, the prediction quality could vary widely from one sample to another. Instead of relying on one such sample, let's implement k-fold validation, here with TimeSeriesSplit so that each fold validates on rows that come after its training rows. CatBoost is a high-performance open source library for gradient boosting on decision trees and may perform better than the sklearn models above.
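
The cell below uses scikit-learn's TimeSeriesSplit rather than a shuffled KFold. As a quick illustration of what that splitter produces, the sketch here prints the folds for 12 time-ordered dummy rows; the array rows is made up for the example.

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

rows = np.arange(12)   # 12 time-ordered dummy rows, illustrative only
splitter = TimeSeriesSplit(n_splits=5)

folds = [(tr.tolist(), va.tolist()) for tr, va in splitter.split(rows)]
for fold, (tr, va) in enumerate(folds):
    print(f"fold {fold}: train={tr} val={va}")
```

Each training window extends the previous one and every validation block lies strictly after its training rows, so the folds are ordered rather than random.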

from sklearn.model_selection import TimeSeriesSplit
from catboost import CatBoostRegressor

'''CatBoostRegressor serves as the state of the art for many regression problems'''



SEED = 42     #random seed

model_dict = {'target_nitrogen_oxides':CatBoostRegressor(random_state=SEED,    #model dictionary
                                learning_rate=0.05,
                                depth=8,
                               verbose=False),
             'target_benzene': CatBoostRegressor(random_state=SEED,
                                learning_rate=0.05,
                                depth=8,
                               verbose=False),
             'target_carbon_monoxide': CatBoostRegressor(random_state=SEED,
                                learning_rate=0.05,
                                depth=8,
                               verbose=False)}

splits = 5
kfold = TimeSeriesSplit(n_splits=splits)

for idx_fold, (train_idx,val_idx) in enumerate(kfold.split(train)):   #iterate over the ordered (not shuffled) folds and create train and val variables
    k_train, k_val = scaled.iloc[train_idx], scaled.iloc[val_idx]    #caution: the rows are taken from the standard-scaled frame, not from log_scale
    
    for idx_target, target in enumerate(model_dict,1):       #iterate over model dictionary
        x_train = k_train.iloc[:,:-3]
        x_val = k_val.iloc[:,:-3]
        
        y_train = k_train.iloc[:,-idx_target].astype(float)
        y_val = k_val.iloc[:,-idx_target].astype(float)
        
        
        model_dict[target].fit(x_train,y_train)      #train
        preds = model_dict[target].predict(x_val)    #predict
         
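        r2 = r2_score(y_val,preds)                    #metrics
        rmse = mean_squared_error(y_val, np.expm1(preds)) ** (1/2)   #caution: expm1 is applied to preds only, and the data here is standard-scaled rather than log-scaled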
        
        print(f'fold:{idx_fold} ; {target}: rmse: {rmse} and r2: {r2}')
    print('\n')

fold:0 ; target_nitrogen_oxides: rmse: 1.0529544665290886 and r2: 0.9974132121018654
fold:0 ; target_benzene: rmse: 1.201197100617081 and r2: 0.18798240533290678
fold:0 ; target_carbon_monoxide: rmse: 1.0788616944930722 and r2: 0.29493642598948155


fold:1 ; target_nitrogen_oxides: rmse: 1.0355463399824767 and r2: 0.9919394065562662
fold:1 ; target_benzene: rmse: 1.2723163749640798 and r2: 0.017191386726805313
fold:1 ; target_carbon_monoxide: rmse: 1.7465031271362252 and r2: 0.10507687095896012


fold:2 ; target_nitrogen_oxides: rmse: 1.190591651537904 and r2: 0.9991069760430005
fold:2 ; target_benzene: rmse: 1.432105456069224 and r2: 0.3091265990764992
fold:2 ; target_carbon_monoxide: rmse: 1.074027839347575 and r2: 0.26791220420906503


fold:3 ; target_nitrogen_oxides: rmse: 1.0858058566054205 and r2: 0.9981514826829635
fold:3 ; target_benzene: rmse: 0.9404438796719523 and r2: 0.33533835959902725
fold:3 ; target_carbon_monoxide: rmse: 1.2028703149344242 and r2: 0.30973147996458394


fold:4 ; target_nitrogen_oxides: rmse: 1.203806865889381 and r2: 0.9991057498990993
fold:4 ; target_benzene: rmse: 1.08095657271 and r2: 0.3156448589920443
fold:4 ; target_carbon_monoxide: rmse: 0.8902000126305795 and r2: 0.388304039571714

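The results look quite similar to linear regression, with slight improvements in certain folds, but the r2 for target_nitrogen_oxides still seems to be very high, whereas the other two are modest. Note that the rmse values here are not comparable with the earlier cells: this loop reads from the standard-scaled frame and applies np.expm1 to the predictions only. We can do some fine tuning by experimenting with the hyperparameters of CatBoostRegressor().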
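
When a model is trained on log1p-transformed targets, an error on the original scale needs both the true values and the predictions mapped back with expm1. A minimal sketch with made-up numbers follows; the helper rmse and all values are assumptions for the example.

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root mean squared error between two equally long sequences."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

# Made-up targets on the original scale, and pretend predictions on the log1p scale.
y_raw = np.array([2.5, 2.1, 2.2, 1.5])
y_log = np.log1p(y_raw)
preds_log = y_log + np.array([0.05, -0.02, 0.00, 0.03])

rmse_log_scale = rmse(y_log, preds_log)                       # both on the log scale
rmse_raw_scale = rmse(np.expm1(y_log), np.expm1(preds_log))   # both mapped back to original units

print(rmse_log_scale, rmse_raw_scale)
```

Either number is meaningful on its own scale; mixing a transformed y with back-transformed predictions is what makes a score hard to interpret.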

Quite decent for minimal feature engineering, given that the top submission for the Kaggle challenge was ~0.22 when this was made and the test results were unknown. Let's try to streamline the process by using high-level libraries like pycaret or AutoML.

Pycaret is a great tool for machine learning and data analysis. It's pretty cool how high-level libraries for machine learning have matured in recent years. We can create a complex regression model with just a few lines of code.

FineTuning the process with Pycaret¶

We need to train and evaluate for each individual target. First, we set up the model with the predictor features and carbon monoxide as the only target variable.

from pycaret.regression import setup, compare_models, blend_models, finalize_model,tune_model, predict_model, plot_model


setup_target_carbon_monoxide = setup(data = log_scale, target = 'target_carbon_monoxide', session_id=24,
                  
                  # Ignore features target_benzene and target_nitrogen_oxides from the experiment 
                  ignore_features = ['target_benzene','target_nitrogen_oxides'], 
                  silent = True                   
)

#Train all the regression models available in the library and choose the best ones.
best_target_carbon_monoxide = compare_models(sort = 'RMSLE', n_select = 3)

Catboost regressor seems to be the best one in pycaret implementation as well.

#Instead of choosing one, let's ensemble the top 3 models (n_select = 3)

model_target_carbon_monoxide = blend_models(best_target_carbon_monoxide)

#Hyperparameter fine tuning

Tuned_model_target_carbon_monoxide  = tune_model(model_target_carbon_monoxide )

#Plotting the final ensembled models residuals

plot_model(Tuned_model_target_carbon_monoxide)

#plotting the fitted line for the prediction

predict_model(Tuned_model_target_carbon_monoxide)
plot_model(Tuned_model_target_carbon_monoxide, plot='error')

It is that simple to train an ML model. An r^2 of 0.925 is certainly much better than the roughly 0.4 obtained from the hand-written models above.

Evaluating on the unseen data.¶

Predicting on the unseen test data. Needs some manipulation before getting the labels.

#applying log scaling to test data and removing the target variable columns before predicting with the trained model

test_eng = test_eng.drop(['target_carbon_monoxide','target_benzene','target_nitrogen_oxides'], axis=1)
log_scale_test = test_eng.copy()

for t in log_scale_test.columns:
    log_scale_test[t] = log_scaling(log_scale_test[t])

test_preds = predict_model(Tuned_model_target_carbon_monoxide,log_scale_test)
test_preds.head()

Transformation Pipeline and Model Successfully Loaded

Alright, it works. Instead of repeating the 5 cells above for every target variable, we will take this opportunity to create a single object with pycaret where users can simply input the data and the AirPollutionPrediction() object will train and return the best model. We will also include a predictor function within the object to make it easier for future use.

from pycaret.regression import *


class AirPollutionPrediction():
    def __init__(self,train,test):   #initialize input dataframes
        self.train = train
        self.test = test
        

    
    def pycaret_train(self,train,target,ignore_features):   #training method.
        
        setup(data=train,target=target, session_id=24,
                  ignore_features = ignore_features, 
                  silent = True)
        
        best = compare_models(sort = 'RMSLE', n_select = 3)    #trains several models and returns the best 3 models
        blended = blend_models(estimator_list= best)           #ensembles the best models.
        tuned  = tune_model(blended )                          #hyperparameter fine tuning
        pred_holdout = predict_model(tuned)                   #holdout validation
        final_model = finalize_model(tuned)                   #refit the tuned ensemble on the full data
        
        
        name = f'airpollutant_{target}'
        save_model(final_model, name)                        #save model with name airpollutant_{labeltype}
        return final_model

    def predict(self,test,model):                            #calls model and predict on unseen test data and plot the same
        pred_esb = predict_model(model, test)
        re = pred_esb['Label']
        
        
        
        return re
        
 
    def run(self,train,target,test):                       #run method; the target argument is overwritten by the loop below
        result  = pd.DataFrame()
        targets  = ['target_nitrogen_oxides','target_carbon_monoxide','target_benzene']
        
        for target in targets:                            #call the train and predict functions for each target variable
            ignores = [t for t in targets if t != target]   #the other two targets must not be used as features
            model = self.pycaret_train(train,target,ignores)
            result[target] = self.predict(test,model)
       
        return result
      
    def main(self):                                      #main function. Takes input to begin training with y and pass with n
        
        print('Train the model: y or n?')
        choice = input()
        while choice not in ('y','n'):                   #ask again until a valid answer is given
            print('Enter Valid choice')
            print('Train the model: y or n?')
            choice = input()
            
                           
        if choice == 'y':
            predictions = self.run(self.train,choice,self.test)
            predictions.to_csv('results.csv')                  #save test predictions to csv

pollution = AirPollutionPrediction(train = log_scale, test= log_scale_test)
pollution.main()

Transformation Pipeline and Model Successfully Saved

Loading the test predictions¶

test_results = pd.read_csv('./results.csv', index_col=[0])
final = [log_scale_test, test_results]
final = pd.concat(final,axis=1, join='inner')

final.head()
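
The concat above relies on index alignment: with axis=1 and join='inner', only rows whose index appears in both frames are kept. A toy sketch of that behaviour, with made-up values and illustrative names features_demo, preds_demo and joined:

```python
import pandas as pd

# Made-up frames whose indexes overlap only partly.
features_demo = pd.DataFrame({"deg_C": [5.1, 5.8, 5.0]}, index=[1, 2, 3])
preds_demo = pd.DataFrame({"target_benzene": [0.9, 1.1]}, index=[2, 3])

# Rows are matched by index label, not by position.
joined = pd.concat([features_demo, preds_demo], axis=1, join="inner")
print(joined)
```

If the two indexes do not line up, rows are silently dropped, so comparing the length of the result with the inputs is a cheap sanity check.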

Result thoughts¶

Okay, now we can train and call the best model to predict input values easily. The test set metrics could not be quantified because the Kaggle challenge was ongoing. I will try to update the quantification if or when the results are released. Thanks for your time.

The tables that follow are the cell outputs, in notebook order: train.head(), train[1:].describe(), total, days, dayparts, test_eng, eng_norm.describe() and the pycaret setup summary.

	date_time	deg_C	relative_humidity	absolute_humidity	sensor_1	sensor_2	sensor_3	sensor_4	sensor_5	target_carbon_monoxide	target_benzene	target_nitrogen_oxides
0	2010-03-10 18:00:00	13.1	46.0	0.7578	1387.2	1087.8	1056.0	1742.8	1293.4	2.5	12.0	167.7
1	2010-03-10 19:00:00	13.2	45.3	0.7255	1279.1	888.2	1197.5	1449.9	1010.9	2.1	9.9	98.9
2	2010-03-10 20:00:00	12.6	56.2	0.7502	1331.9	929.6	1060.2	1586.1	1117.0	2.2	9.2	127.1
3	2010-03-10 21:00:00	11.0	62.4	0.7867	1321.0	929.0	1102.9	1536.5	1263.2	2.2	9.7	177.2
4	2010-03-10 22:00:00	11.9	59.0	0.7888	1272.0	852.7	1180.9	1415.5	1132.2	1.5	6.4	121.8

	deg_C	relative_humidity	absolute_humidity	sensor_1	sensor_2	sensor_3	sensor_4	sensor_5	target_carbon_monoxide	target_benzene	target_nitrogen_oxides
count	7110.000000	7110.000000	7110.000000	7110.000000	7110.000000	7110.00000	7110.000000	7110.000000	7110.000000	7110.000000	7110.000000
mean	20.879128	47.561224	1.110358	1091.530520	938.043910	883.87910	1513.206062	998.294065	2.086160	10.236835	204.071899
std	7.937939	17.399945	0.398956	218.524793	281.993227	310.47148	350.194353	381.548478	1.447203	7.694938	193.940883
min	1.300000	8.900000	0.198800	620.300000	364.000000	310.60000	552.900000	242.700000	0.100000	0.100000	1.900000
25%	14.900000	33.700000	0.855925	930.225000	734.800000	681.02500	1320.275000	722.825000	1.000000	4.500000	76.425000
50%	20.700000	47.300000	1.083550	1060.450000	914.200000	827.80000	1513.100000	928.650000	1.700000	8.500000	141.000000
75%	25.800000	60.800000	1.404175	1215.800000	1124.100000	1008.62500	1720.350000	1224.600000	2.800000	14.200000	260.000000
max	46.100000	90.800000	2.231000	2088.300000	2302.600000	2567.40000	2913.800000	2594.600000	12.500000	63.700000	1472.300000

	date_time	deg_C	relative_humidity	absolute_humidity	sensor_1	sensor_2	sensor_3	sensor_4	sensor_5	target_carbon_monoxide	target_benzene	target_nitrogen_oxides
0	2010-03-10 18:00:00	13.1	46.0	0.7578	1387.2	1087.8	1056.0	1742.8	1293.4	2.5	12.0	167.7
1	2010-03-10 19:00:00	13.2	45.3	0.7255	1279.1	888.2	1197.5	1449.9	1010.9	2.1	9.9	98.9
2	2010-03-10 20:00:00	12.6	56.2	0.7502	1331.9	929.6	1060.2	1586.1	1117.0	2.2	9.2	127.1
3	2010-03-10 21:00:00	11.0	62.4	0.7867	1321.0	929.0	1102.9	1536.5	1263.2	2.2	9.7	177.2
4	2010-03-10 22:00:00	11.9	59.0	0.7888	1272.0	852.7	1180.9	1415.5	1132.2	1.5	6.4	121.8
...	...	...	...	...	...	...	...	...	...	...	...	...
2242	2011-04-04 10:00:00	23.2	28.7	0.7568	1340.3	1023.9	522.8	1374.0	1659.8	NaN	NaN	NaN
2243	2011-04-04 11:00:00	24.5	22.5	0.7119	1232.8	955.1	616.1	1226.1	1269.0	NaN	NaN	NaN
2244	2011-04-04 12:00:00	26.6	19.0	0.6406	1187.7	1052.4	572.8	1253.4	1081.1	NaN	NaN	NaN
2245	2011-04-04 13:00:00	29.1	12.7	0.5139	1053.2	1009.0	702.0	1009.8	808.5	NaN	NaN	NaN
2246	2011-04-04 14:00:00	27.9	13.5	0.5028	1124.6	1078.4	608.2	1061.3	816.0	NaN	NaN	NaN

	Friday	Monday	Saturday	Sunday	Thursday	Tuesday	Wednesday
0	0	0	0	0	0	0	1
1	0	0	0	0	0	0	1
2	0	0	0	0	0	0	1
3	0	0	0	0	0	0	1
4	0	0	0	0	0	0	1
...	...	...	...	...	...	...	...
2242	0	1	0	0	0	0	0
2243	0	1	0	0	0	0	0
2244	0	1	0	0	0	0	0
2245	0	1	0	0	0	0	0
2246	0	1	0	0	0	0	0

	dawn	morning	noon	afternoon	evening	midnight
0	0	0	0	0	1	0
1	0	0	0	0	1	0
2	0	0	0	0	1	0
3	0	0	0	0	1	0
4	0	0	0	0	0	1
...	...	...	...	...	...	...
2242	0	0	1	0	0	0
2243	0	0	1	0	0	0
2244	0	0	1	0	0	0
2245	0	0	1	0	0	0
2246	0	0	0	1	0	0

	deg_C	relative_humidity	absolute_humidity	sensor_1	sensor_2	sensor_3	sensor_4	sensor_5	target_carbon_monoxide	target_benzene	...	evening	midnight	Friday	Monday	Saturday	Sunday	Thursday	Tuesday	Wednesday	is_weekend
1	5.1	51.7	0.4564	1249.5	864.9	687.9	972.8	1714.0	NaN	NaN	...	0	1	0	0	1	0	0	0	0	1
2	5.8	51.5	0.4689	1102.6	878.0	693.7	941.9	1300.8	NaN	NaN	...	0	0	0	0	1	0	0	0	0	1
3	5.0	52.3	0.4693	1139.7	916.2	725.6	1011.0	1283.0	NaN	NaN	...	0	0	0	0	1	0	0	0	0	1
4	4.5	57.5	0.4650	1022.4	838.5	871.5	967.0	1142.3	NaN	NaN	...	0	0	0	0	1	0	0	0	0	1
5	4.5	53.7	0.4759	1004.0	745.5	914.2	989.1	973.8	NaN	NaN	...	0	0	0	0	1	0	0	0	0	1
...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...
2242	23.2	28.7	0.7568	1340.3	1023.9	522.8	1374.0	1659.8	NaN	NaN	...	0	0	0	1	0	0	0	0	0	0
2243	24.5	22.5	0.7119	1232.8	955.1	616.1	1226.1	1269.0	NaN	NaN	...	0	0	0	1	0	0	0	0	0	0
2244	26.6	19.0	0.6406	1187.7	1052.4	572.8	1253.4	1081.1	NaN	NaN	...	0	0	0	1	0	0	0	0	0	0
2245	29.1	12.7	0.5139	1053.2	1009.0	702.0	1009.8	808.5	NaN	NaN	...	0	0	0	1	0	0	0	0	0	0
2246	27.9	13.5	0.5028	1124.6	1078.4	608.2	1061.3	816.0	NaN	NaN	...	0	0	0	1	0	0	0	0	0	0

	deg_C	relative_humidity	absolute_humidity	sensor_1	sensor_2	sensor_3	sensor_4	sensor_5	target_carbon_monoxide	target_benzene	...	evening	midnight	Friday	Monday	Saturday	Sunday	Thursday	Tuesday	Wednesday	is_weekend
count	7111.000000	7111.000000	7111.000000	7111.000000	7111.000000	7111.000000	7111.000000	7111.000000	7111.000000	7111.000000	...	7111.000000	7111.000000	7111.000000	7111.000000	7111.000000	7111.000000	7111.000000	7111.000000	7111.000000	7111.000000
mean	0.437010	0.472051	0.448533	0.321030	0.296123	0.254034	0.406768	0.321287	0.160179	0.159388	...	0.167065	0.166924	0.145127	0.141752	0.141893	0.141752	0.145127	0.141752	0.142596	0.283645
std	0.177186	0.212439	0.196314	0.148868	0.145455	0.137565	0.148325	0.162225	0.116702	0.120982	...	0.373060	0.372935	0.352254	0.348820	0.348965	0.348820	0.352254	0.348820	0.349685	0.450798
min	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	...	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000
25%	0.303571	0.302808	0.323344	0.211138	0.191324	0.164148	0.325067	0.204154	0.072581	0.069182	...	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000
50%	0.433036	0.468864	0.435341	0.299864	0.283813	0.229174	0.406709	0.291679	0.129032	0.132075	...	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000
75%	0.546875	0.633700	0.593126	0.405654	0.392087	0.309398	0.494515	0.417535	0.217742	0.221698	...	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	1.000000
max	1.000000	1.000000	1.000000	1.000000	1.000000	1.000000	1.000000	1.000000	1.000000	1.000000	...	1.000000	1.000000	1.000000	1.000000	1.000000	1.000000	1.000000	1.000000	1.000000	1.000000
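
Every column in the summary above has a minimum of 0 and a maximum of 1, which is what min-max scaling produces. A minimal sketch of that scaling, assuming plain pandas arithmetic (the notebook may have used a different tool) and a small made-up frame named `toy`:

```python
import pandas as pd

# Toy frame; the values are illustrative and not taken from the competition data.
toy = pd.DataFrame({
    "deg_C": [5.1, 29.1, 17.1],
    "relative_humidity": [51.7, 12.7, 32.2],
})

# Min-max scaling: subtract each column's minimum and divide by its range.
# This assumes no column is constant (a zero range would divide by zero).
col_min = toy.min()
col_range = toy.max() - toy.min()
scaled = (toy - col_min) / col_range

print(scaled.describe().loc[["min", "max"]])
```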

	Description	Value
0	session_id	24
1	Target	target_carbon_monoxide
2	Original Data	(7111, 25)
3	Missing Values	False
4	Numeric Features	21
5	Categorical Features	1
6	Ordinal Features	False
7	High Cardinality Features	False
8	High Cardinality Method	None
9	Transformed Train Set	(4977, 22)
10	Transformed Test Set	(2134, 22)
11	Shuffle Train-Test	True
12	Stratify Train-Test	False
13	Fold Generator	KFold
14	Fold Number	10
15	CPU Jobs	-1
16	Use GPU	False
17	Log Experiment	False
18	Experiment Name	reg-default-name
19	USI	22e2
20	Imputation Type	simple
21	Iterative Imputation Iteration	None
22	Numeric Imputer	mean
23	Iterative Imputation Numeric Model	None
24	Categorical Imputer	constant
25	Iterative Imputation Categorical Model	None
26	Unknown Categoricals Handling	least_frequent
27	Normalize	False
28	Normalize Method	None
29	Transformation	False
30	Transformation Method	None
31	PCA	False
32	PCA Method	None
33	PCA Components	None
34	Ignore Low Variance	False
35	Combine Rare Levels	False
36	Rare Level Threshold	None
37	Numeric Binning	False
38	Remove Outliers	False
39	Outliers Threshold	None
40	Remove Multicollinearity	False
41	Multicollinearity Threshold	None
42	Remove Perfect Collinearity	True
43	Clustering	False
44	Clustering Iteration	None
45	Polynomial Features	False
46	Polynomial Degree	None
47	Trignometry Features	False
48	Polynomial Threshold	None
49	Group Features	False
50	Feature Selection	False
51	Feature Selection Method	classic
52	Features Selection Threshold	None
53	Feature Interaction	False
54	Feature Ratio	False
55	Interaction Threshold	None
56	Transform Target	False
57	Transform Target Method	box-cox
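
The transformed train and test set sizes reported above (4977 and 2134 rows) are consistent with a 70/30 split of the 7111 original rows. A quick arithmetic check, assuming the training fraction is 0.7 and that the training count is rounded down:

```python
import math

n_rows = 7111          # rows in the original data, from the setup table
train_fraction = 0.7   # assumed split fraction; it is not shown in the table itself

n_train = math.floor(train_fraction * n_rows)  # round the training count down
n_test = n_rows - n_train                      # the remainder is held out

print(n_train, n_test)
```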

	Model	MAE	MSE	RMSE	R2	RMSLE	MAPE	TT (Sec)
catboost	CatBoost Regressor	0.0788	0.0121	0.1097	0.9350	0.0565	0.1001	4.3950
lightgbm	Light Gradient Boosting Machine	0.0821	0.0134	0.1155	0.9278	0.0589	0.1039	0.2510
xgboost	Extreme Gradient Boosting	0.0845	0.0138	0.1174	0.9256	0.0600	0.1061	5.8570
et	Extra Trees Regressor	0.0830	0.0139	0.1175	0.9253	0.0603	0.1055	1.5260
rf	Random Forest Regressor	0.0865	0.0150	0.1221	0.9193	0.0621	0.1093	2.3780
gbr	Gradient Boosting Regressor	0.0876	0.0153	0.1237	0.9173	0.0629	0.1121	0.9080
knn	K Neighbors Regressor	0.1090	0.0223	0.1492	0.8795	0.0746	0.1338	0.0700
ridge	Ridge Regression	0.1096	0.0235	0.1531	0.8731	0.0777	0.1402	0.0150
lr	Linear Regression	0.1097	0.0235	0.1531	0.8731	0.0778	0.1403	0.3810
br	Bayesian Ridge	0.1097	0.0235	0.1531	0.8731	0.0778	0.1403	0.0180
dt	Decision Tree Regressor	0.1217	0.0292	0.1709	0.8423	0.0873	0.1473	0.0550
huber	Huber Regressor	0.1154	0.0302	0.1736	0.8371	0.0880	0.1455	0.1160
par	Passive Aggressive Regressor	0.1307	0.0313	0.1766	0.8312	0.0893	0.1610	0.0230
ada	AdaBoost Regressor	0.1345	0.0301	0.1734	0.8377	0.0910	0.1856	0.3660
omp	Orthogonal Matching Pursuit	0.1425	0.0392	0.1978	0.7881	0.1000	0.1773	0.0150
lar	Least Angle Regression	0.1848	0.1131	0.2427	0.4041	0.1111	0.2294	0.0190
en	Elastic Net	0.3530	0.1865	0.4317	-0.0044	0.2158	0.4897	0.0270
llar	Lasso Least Angle Regression	0.3530	0.1865	0.4317	-0.0044	0.2158	0.4897	0.0150
lasso	Lasso Regression	0.3530	0.1865	0.4317	-0.0044	0.2158	0.4897	0.0160
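
For reference, the metric columns in the comparison table can be computed by hand. The sketch below uses the standard definitions with NumPy on made-up arrays `y_true` and `y_pred`; it is meant to illustrate what each column measures, not to reproduce the library's internals.

```python
import numpy as np

# Made-up values for illustration only.
y_true = np.array([1.0, 2.0, 3.0, 4.0])
y_pred = np.array([1.5, 2.0, 2.5, 4.0])

err = y_pred - y_true

mae = np.mean(np.abs(err))                      # mean absolute error
mse = np.mean(err ** 2)                         # mean squared error
rmse = np.sqrt(mse)                             # root mean squared error
r2 = 1 - np.sum(err ** 2) / np.sum((y_true - y_true.mean()) ** 2)
# RMSLE compares log1p-transformed values, so it needs non-negative inputs.
rmsle = np.sqrt(np.mean((np.log1p(y_pred) - np.log1p(y_true)) ** 2))
# MAPE divides by the true value, so it is undefined where y_true is 0.
mape = np.mean(np.abs(err) / np.abs(y_true))

print(f"MAE={mae:.4f} MSE={mse:.4f} RMSE={rmse:.4f} "
      f"R2={r2:.4f} RMSLE={rmsle:.4f} MAPE={mape:.4f}")
```

Lower is better for every column except R2, where values closer to 1 indicate a better fit and values at or below 0 (as for the lasso-type models above) mean the model does no better than predicting the mean.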

	MAE	MSE	RMSE	R2	RMSLE	MAPE
0	0.0776	0.0110	0.1050	0.9375	0.0515	0.0941
1	0.0762	0.0109	0.1043	0.9397	0.0518	0.0874
2	0.0810	0.0132	0.1149	0.9278	0.0589	0.1025
3	0.0812	0.0135	0.1163	0.9335	0.0563	0.0929
4	0.0804	0.0143	0.1195	0.9265	0.0651	0.1267
5	0.0803	0.0120	0.1097	0.9379	0.0570	0.1040
6	0.0753	0.0110	0.1047	0.9387	0.0556	0.0987
7	0.0829	0.0131	0.1145	0.9236	0.0609	0.1114
8	0.0803	0.0121	0.1102	0.9340	0.0562	0.0976
9	0.0741	0.0111	0.1052	0.9421	0.0528	0.0857
Mean	0.0789	0.0122	0.1104	0.9341	0.0566	0.1001
SD	0.0028	0.0012	0.0053	0.0059	0.0040	0.0115
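
The Mean and SD rows summarise the ten fold scores. As a small check, the MAE column above can be recomputed from the rounded per-fold values. The SD shown matches the population standard deviation (dividing by the number of folds) rather than the sample version; that is an observation from these numbers, not something the output states.

```python
import statistics

# Per-fold MAE values copied from the table above (already rounded to 4 decimals).
fold_mae = [0.0776, 0.0762, 0.0810, 0.0812, 0.0804,
            0.0803, 0.0753, 0.0829, 0.0803, 0.0741]

mean_mae = statistics.mean(fold_mae)
sd_population = statistics.pstdev(fold_mae)  # divides by n
sd_sample = statistics.stdev(fold_mae)       # divides by n - 1, shown for comparison

print(round(mean_mae, 4), round(sd_population, 4), round(sd_sample, 4))
```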

	MAE	MSE	RMSE	R2	RMSLE	MAPE
0	0.0768	0.0108	0.1040	0.9387	0.0512	0.0934
1	0.0760	0.0109	0.1044	0.9395	0.0520	0.0874
2	0.0806	0.0129	0.1137	0.9293	0.0583	0.1018
3	0.0804	0.0132	0.1150	0.9351	0.0559	0.0923
4	0.0799	0.0140	0.1184	0.9279	0.0646	0.1256
5	0.0798	0.0118	0.1088	0.9389	0.0566	0.1033
6	0.0752	0.0109	0.1042	0.9393	0.0555	0.0988
7	0.0824	0.0129	0.1138	0.9246	0.0605	0.1109
8	0.0796	0.0120	0.1097	0.9346	0.0560	0.0968
9	0.0734	0.0108	0.1040	0.9433	0.0525	0.0851
Mean	0.0784	0.0120	0.1096	0.9351	0.0563	0.0995
SD	0.0027	0.0011	0.0051	0.0057	0.0039	0.0113

	deg_C	relative_humidity	absolute_humidity	sensor_1	sensor_2	sensor_3	sensor_4	sensor_5	dawn	...	midnight	Saturday	is_weekend	Label
1	1.808289	3.964615	0.375968	7.131299	6.763769	6.535096	6.881206	7.447168	0.000000	...	0.693359	0.693359	0.693147	5.740173
2	1.916923	3.960813	0.384514	7.006333	6.778785	6.543480	6.848960	7.171503	0.693359	...	0.000000	0.693359	0.693147	5.714057
3	1.791759	3.975936	0.384786	7.039397	6.821326	6.588376	6.919684	7.157735	0.693359	...	0.000000	0.693359	0.693147	5.752083
4	1.704748	4.069027	0.381855	6.930886	6.732806	6.771363	6.875232	7.041674	0.693359	...	0.000000	0.693359	0.693147	5.438167
5	1.704748	4.001864	0.389268	6.912743	6.615396	6.819143	6.897806	6.882232	0.693359	...	0.000000	0.693359	0.693147	5.260027
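
The feature values in this table appear to be `log1p`-transformed versions of the raw readings: for example, a temperature of 5.1 maps to 1.808289, and a 0/1 indicator maps to about 0.693 (the natural log of 2). A minimal sketch of the forward and inverse transform with NumPy, assuming this is the transform that was applied:

```python
import numpy as np

# Raw values taken from the first row of the earlier untransformed table.
raw = np.array([5.1, 51.7, 0.4564, 1.0])

# log1p(x) = ln(1 + x); well-defined at x = 0, which suits 0/1 indicator columns.
transformed = np.log1p(raw)

# expm1 is the exact inverse, used to bring log-scale predictions back to the original scale.
recovered = np.expm1(transformed)

print(np.round(transformed, 6))
print(np.round(recovered, 4))
```

If the targets are modelled on the log scale as well, predictions such as the `Label` column would need `np.expm1` applied before they are comparable to the original measurements.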

	MAE	MSE	RMSE	R2	RMSLE	MAPE
0	0.0696	0.0075	0.0866	0.9875	0.0299	0.0407
1	0.0714	0.0079	0.0889	0.9851	0.0303	0.0380
2	0.0682	0.0071	0.0841	0.9877	0.0296	0.0401
3	0.0695	0.0074	0.0862	0.9875	0.0295	0.0385
4	0.0662	0.0067	0.0821	0.9880	0.0277	0.0362
5	0.0705	0.0076	0.0869	0.9877	0.0302	0.0417
6	0.0732	0.0082	0.0904	0.9856	0.0319	0.0437
7	0.0678	0.0074	0.0858	0.9871	0.0289	0.0368
8	0.0651	0.0066	0.0815	0.9885	0.0273	0.0359
9	0.0686	0.0074	0.0860	0.9883	0.0297	0.0401
Mean	0.0690	0.0074	0.0859	0.9873	0.0295	0.0392
SD	0.0023	0.0004	0.0026	0.0011	0.0013	0.0024