TABULAR PLAYGROUND - JULY 2021- EXPLORATORY ANALYSIS OF AIR POLLUTANTS WITH REGRESSION MODELS

The challenge is to predict the values of air pollution measurements over time, based on basic weather information (temperature and humidity) and the input values of 5 sensors.

Predictors:

deg_c = temperature

relative_humidity = humidity

absolute_humidity = normalized humidity

sensors 1 to 5 = sensor data

Targets:

target_carbon_monoxide = CO values in air

target_benze = beneze values in air

target_nitrogen_oxides = NO values in air

1. Loading and Understanding Data

In [49]:
import pandas as pd  #data manipulation
import numpy as np   #linear algebra
import matplotlib.pyplot as plt   #plotting library
import seaborn as sns       #plotting library

#load data
train = pd.read_csv('../input/tabular-playground-series-jul-2021/train.csv')
test_unseen = pd.read_csv('../input/tabular-playground-series-jul-2021/test.csv')
In [50]:
train.head()
Out[50]:
date_time deg_C relative_humidity absolute_humidity sensor_1 sensor_2 sensor_3 sensor_4 sensor_5 target_carbon_monoxide target_benzene target_nitrogen_oxides
0 2010-03-10 18:00:00 13.1 46.0 0.7578 1387.2 1087.8 1056.0 1742.8 1293.4 2.5 12.0 167.7
1 2010-03-10 19:00:00 13.2 45.3 0.7255 1279.1 888.2 1197.5 1449.9 1010.9 2.1 9.9 98.9
2 2010-03-10 20:00:00 12.6 56.2 0.7502 1331.9 929.6 1060.2 1586.1 1117.0 2.2 9.2 127.1
3 2010-03-10 21:00:00 11.0 62.4 0.7867 1321.0 929.0 1102.9 1536.5 1263.2 2.2 9.7 177.2
4 2010-03-10 22:00:00 11.9 59.0 0.7888 1272.0 852.7 1180.9 1415.5 1132.2 1.5 6.4 121.8
In [51]:
train.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7111 entries, 0 to 7110
Data columns (total 12 columns):
 #   Column                  Non-Null Count  Dtype  
---  ------                  --------------  -----  
 0   date_time               7111 non-null   object 
 1   deg_C                   7111 non-null   float64
 2   relative_humidity       7111 non-null   float64
 3   absolute_humidity       7111 non-null   float64
 4   sensor_1                7111 non-null   float64
 5   sensor_2                7111 non-null   float64
 6   sensor_3                7111 non-null   float64
 7   sensor_4                7111 non-null   float64
 8   sensor_5                7111 non-null   float64
 9   target_carbon_monoxide  7111 non-null   float64
 10  target_benzene          7111 non-null   float64
 11  target_nitrogen_oxides  7111 non-null   float64
dtypes: float64(11), object(1)
memory usage: 666.8+ KB
In [52]:
train[1:].describe() #check the descriptions of the data without the object column
Out[52]:
deg_C relative_humidity absolute_humidity sensor_1 sensor_2 sensor_3 sensor_4 sensor_5 target_carbon_monoxide target_benzene target_nitrogen_oxides
count 7110.000000 7110.000000 7110.000000 7110.000000 7110.000000 7110.00000 7110.000000 7110.000000 7110.000000 7110.000000 7110.000000
mean 20.879128 47.561224 1.110358 1091.530520 938.043910 883.87910 1513.206062 998.294065 2.086160 10.236835 204.071899
std 7.937939 17.399945 0.398956 218.524793 281.993227 310.47148 350.194353 381.548478 1.447203 7.694938 193.940883
min 1.300000 8.900000 0.198800 620.300000 364.000000 310.60000 552.900000 242.700000 0.100000 0.100000 1.900000
25% 14.900000 33.700000 0.855925 930.225000 734.800000 681.02500 1320.275000 722.825000 1.000000 4.500000 76.425000
50% 20.700000 47.300000 1.083550 1060.450000 914.200000 827.80000 1513.100000 928.650000 1.700000 8.500000 141.000000
75% 25.800000 60.800000 1.404175 1215.800000 1124.100000 1008.62500 1720.350000 1224.600000 2.800000 14.200000 260.000000
max 46.100000 90.800000 2.231000 2088.300000 2302.600000 2567.40000 2913.800000 2594.600000 12.500000 63.700000 1472.300000
In [53]:
train.isnull().values.any() #check null values
Out[53]:
False

Thoughts:

  1. The data is relatively clean. Only one object column ie. datetime. Can be engineerd with one hot encoding. No other discrete columns can proceed further.

  2. std for sensor columns are very high. Should be engineered

In [54]:
total = [train,test_useen]  #merging train and test into single dataframe
total = pd.concat(total)
total
Out[54]:
date_time deg_C relative_humidity absolute_humidity sensor_1 sensor_2 sensor_3 sensor_4 sensor_5 target_carbon_monoxide target_benzene target_nitrogen_oxides
0 2010-03-10 18:00:00 13.1 46.0 0.7578 1387.2 1087.8 1056.0 1742.8 1293.4 2.5 12.0 167.7
1 2010-03-10 19:00:00 13.2 45.3 0.7255 1279.1 888.2 1197.5 1449.9 1010.9 2.1 9.9 98.9
2 2010-03-10 20:00:00 12.6 56.2 0.7502 1331.9 929.6 1060.2 1586.1 1117.0 2.2 9.2 127.1
3 2010-03-10 21:00:00 11.0 62.4 0.7867 1321.0 929.0 1102.9 1536.5 1263.2 2.2 9.7 177.2
4 2010-03-10 22:00:00 11.9 59.0 0.7888 1272.0 852.7 1180.9 1415.5 1132.2 1.5 6.4 121.8
... ... ... ... ... ... ... ... ... ... ... ... ...
2242 2011-04-04 10:00:00 23.2 28.7 0.7568 1340.3 1023.9 522.8 1374.0 1659.8 NaN NaN NaN
2243 2011-04-04 11:00:00 24.5 22.5 0.7119 1232.8 955.1 616.1 1226.1 1269.0 NaN NaN NaN
2244 2011-04-04 12:00:00 26.6 19.0 0.6406 1187.7 1052.4 572.8 1253.4 1081.1 NaN NaN NaN
2245 2011-04-04 13:00:00 29.1 12.7 0.5139 1053.2 1009.0 702.0 1009.8 808.5 NaN NaN NaN
2246 2011-04-04 14:00:00 27.9 13.5 0.5028 1124.6 1078.4 608.2 1061.3 816.0 NaN NaN NaN

9358 rows × 12 columns

Feature engineer the datetime column

extracting months, day of the month, hours, day name, time of the day and weekend or weekday check

In [55]:
'''
months and day of the months 
'''

total.date_time = pd.to_datetime(total.date_time)
months = total.date_time.dt.month
monthly_days = total.date_time.dt.day
hours = total.date_time.dt.hour
day_name = total.date_time.dt.day_name()
days = pd.get_dummies(day_name)
days
Out[55]:
Friday Monday Saturday Sunday Thursday Tuesday Wednesday
0 0 0 0 0 0 0 1
1 0 0 0 0 0 0 1
2 0 0 0 0 0 0 1
3 0 0 0 0 0 0 1
4 0 0 0 0 0 0 1
... ... ... ... ... ... ... ...
2242 0 1 0 0 0 0 0
2243 0 1 0 0 0 0 0
2244 0 1 0 0 0 0 0
2245 0 1 0 0 0 0 0
2246 0 1 0 0 0 0 0

9358 rows × 7 columns

In [56]:
'''
part of the days, seperating into 5 parts based on daylight and one hot enconding the resultant
categorical values
'''

def daypart(hour):
    if hour in [2,3,4,5]:
        return "dawn"
    elif hour in [6,7,8,9]:
        return "morning"
    elif hour in [10,11,12,13]:
        return "noon"
    elif hour in [14,15,16,17]:
        return "afternoon"
    elif hour in [18,19,20,21]:
        return "evening"
    else: return "midnight"

raw_days = hours.apply(daypart)
dayparts = pd.get_dummies(raw_days)
dayparts = dayparts[['dawn','morning','noon','afternoon','evening','midnight']]
dayparts
Out[56]:
dawn morning noon afternoon evening midnight
0 0 0 0 0 1 0
1 0 0 0 0 1 0
2 0 0 0 0 1 0
3 0 0 0 0 1 0
4 0 0 0 0 0 1
... ... ... ... ... ... ...
2242 0 0 1 0 0 0
2243 0 0 1 0 0 0
2244 0 0 1 0 0 0
2245 0 0 1 0 0 0
2246 0 0 0 1 0 0

9358 rows × 6 columns

In [57]:
'''
check if the day is a weekend or otherwise
'''
is_weekend = day_name.apply(lambda x : 1 if x in ['Saturday','Sunday'] else 0)
is_weekend = pd.DataFrame({'is_weekend':is_weekend})
In [104]:
'''
concat the resultant columns and split the train and test df back based on the original index
'''

final = pd.concat([total, dayparts, days,is_weekend],axis=1)
final = final.drop('date_time',axis=1)
train_eng = final[:7111]
test_eng = final.iloc[7112:,:]

print(f'train :{train_eng.shape}, test: {test_eng.shape}')
train :(7111, 25), test: (2246, 25)
In [59]:
train_eng
Out[59]:
deg_C relative_humidity absolute_humidity sensor_1 sensor_2 sensor_3 sensor_4 sensor_5 target_carbon_monoxide target_benzene ... evening midnight Friday Monday Saturday Sunday Thursday Tuesday Wednesday is_weekend
0 13.1 46.0 0.7578 1387.2 1087.8 1056.0 1742.8 1293.4 2.5 12.0 ... 1 0 0 0 0 0 0 0 1 0
1 13.2 45.3 0.7255 1279.1 888.2 1197.5 1449.9 1010.9 2.1 9.9 ... 1 0 0 0 0 0 0 0 1 0
2 12.6 56.2 0.7502 1331.9 929.6 1060.2 1586.1 1117.0 2.2 9.2 ... 1 0 0 0 0 0 0 0 1 0
3 11.0 62.4 0.7867 1321.0 929.0 1102.9 1536.5 1263.2 2.2 9.7 ... 1 0 0 0 0 0 0 0 1 0
4 11.9 59.0 0.7888 1272.0 852.7 1180.9 1415.5 1132.2 1.5 6.4 ... 0 1 0 0 0 0 0 0 1 0
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
7106 9.2 32.0 0.3871 1000.5 811.2 873.0 909.0 910.5 1.3 5.1 ... 1 0 1 0 0 0 0 0 0 0
7107 9.1 33.2 0.3766 1022.7 790.0 951.6 912.9 903.4 1.4 5.8 ... 1 0 1 0 0 0 0 0 0 0
7108 9.6 34.6 0.4310 1044.4 767.3 861.9 889.2 1159.1 1.6 5.2 ... 0 1 1 0 0 0 0 0 0 0
7109 8.0 40.7 0.4085 952.8 691.9 908.5 917.0 1206.3 1.5 4.6 ... 0 1 1 0 0 0 0 0 0 0
7110 8.0 41.3 0.4375 1108.8 745.7 797.1 880.0 1273.1 1.4 4.1 ... 0 1 0 0 1 0 0 0 0 1

7111 rows × 25 columns

In [60]:
test_eng
Out[60]:
deg_C relative_humidity absolute_humidity sensor_1 sensor_2 sensor_3 sensor_4 sensor_5 target_carbon_monoxide target_benzene ... evening midnight Friday Monday Saturday Sunday Thursday Tuesday Wednesday is_weekend
1 5.1 51.7 0.4564 1249.5 864.9 687.9 972.8 1714.0 NaN NaN ... 0 1 0 0 1 0 0 0 0 1
2 5.8 51.5 0.4689 1102.6 878.0 693.7 941.9 1300.8 NaN NaN ... 0 0 0 0 1 0 0 0 0 1
3 5.0 52.3 0.4693 1139.7 916.2 725.6 1011.0 1283.0 NaN NaN ... 0 0 0 0 1 0 0 0 0 1
4 4.5 57.5 0.4650 1022.4 838.5 871.5 967.0 1142.3 NaN NaN ... 0 0 0 0 1 0 0 0 0 1
5 4.5 53.7 0.4759 1004.0 745.5 914.2 989.1 973.8 NaN NaN ... 0 0 0 0 1 0 0 0 0 1
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
2242 23.2 28.7 0.7568 1340.3 1023.9 522.8 1374.0 1659.8 NaN NaN ... 0 0 0 1 0 0 0 0 0 0
2243 24.5 22.5 0.7119 1232.8 955.1 616.1 1226.1 1269.0 NaN NaN ... 0 0 0 1 0 0 0 0 0 0
2244 26.6 19.0 0.6406 1187.7 1052.4 572.8 1253.4 1081.1 NaN NaN ... 0 0 0 1 0 0 0 0 0 0
2245 29.1 12.7 0.5139 1053.2 1009.0 702.0 1009.8 808.5 NaN NaN ... 0 0 0 1 0 0 0 0 0 0
2246 27.9 13.5 0.5028 1124.6 1078.4 608.2 1061.3 816.0 NaN NaN ... 0 0 0 1 0 0 0 0 0 0

2246 rows × 25 columns

Okay, now we have 25 columns that we feature engineered. But still we need to look into the distributions of the other predictor variables before training the models.

2. Data Distributions

In [61]:
%matplotlib inline

numeric_features = ['deg_C','relative_humidity','absolute_humidity','sensor_1','sensor_2','sensor_3',
                   'sensor_4','sensor_5','target_carbon_monoxide','target_benzene','target_nitrogen_oxides']

'''
Non engineered numerical distribution
'''
 
for col in numeric_features:        #iterating through the numerical features and plotting the histogram, mean and median
    fig = plt.figure(figsize=(9, 6))
    ax = fig.gca()
    feature = train_eng[col]
    feature.hist(bins=100, ax = ax)
    ax.axvline(feature.mean(), color='magenta', linestyle='dashed', linewidth=2)
    ax.axvline(feature.median(), color='cyan', linestyle='dashed', linewidth=2)
    ax.set_title(col)
plt.show()

The target variables and the sensors data are skewed. The mean and median are skewed to the left. This should be fixed. The distribution can be changed by experimenting with different scaling formats. Three different method were done,

  1. min_max_scaling
  2. log_scaling
  3. Standard_scaling
In [62]:
'''
applying min_max scaling to the predictors and targets
'''
def min_max_scaling(df):
    df_norm = df.copy()
    for column in df_norm.columns:
        df_norm[column] = (df_norm[column] - df_norm[column].min()) / (df_norm[column].max() - df_norm[column].min())
        
    return df_norm

eng_norm = min_max_scaling(train_eng)
In [63]:
eng_norm.describe()
Out[63]:
deg_C relative_humidity absolute_humidity sensor_1 sensor_2 sensor_3 sensor_4 sensor_5 target_carbon_monoxide target_benzene ... evening midnight Friday Monday Saturday Sunday Thursday Tuesday Wednesday is_weekend
count 7111.000000 7111.000000 7111.000000 7111.000000 7111.000000 7111.000000 7111.000000 7111.000000 7111.000000 7111.000000 ... 7111.000000 7111.000000 7111.000000 7111.000000 7111.000000 7111.000000 7111.000000 7111.000000 7111.000000 7111.000000
mean 0.437010 0.472051 0.448533 0.321030 0.296123 0.254034 0.406768 0.321287 0.160179 0.159388 ... 0.167065 0.166924 0.145127 0.141752 0.141893 0.141752 0.145127 0.141752 0.142596 0.283645
std 0.177186 0.212439 0.196314 0.148868 0.145455 0.137565 0.148325 0.162225 0.116702 0.120982 ... 0.373060 0.372935 0.352254 0.348820 0.348965 0.348820 0.352254 0.348820 0.349685 0.450798
min 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 ... 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000
25% 0.303571 0.302808 0.323344 0.211138 0.191324 0.164148 0.325067 0.204154 0.072581 0.069182 ... 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000
50% 0.433036 0.468864 0.435341 0.299864 0.283813 0.229174 0.406709 0.291679 0.129032 0.132075 ... 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000
75% 0.546875 0.633700 0.593126 0.405654 0.392087 0.309398 0.494515 0.417535 0.217742 0.221698 ... 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 1.000000
max 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 ... 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000

8 rows × 25 columns

In [64]:
for col in numeric_features:
    fig = plt.figure(figsize=(9, 6))
    ax = fig.gca()
    feature = eng_norm[col]
    feature.hist(bins=100, ax = ax)
    ax.axvline(feature.mean(), color='magenta', linestyle='dashed', linewidth=2)
    ax.axvline(feature.median(), color='cyan', linestyle='dashed', linewidth=2)
    ax.set_title(col)
plt.show()
In [65]:
'''
applying log scaling to the predictors and targets
'''
def log_scaling(col):
    col = np.log1p(col)
    return col

log_scale = train_eng.copy()
for t in log_scale.columns:
    log_scale[t] = log_scaling(log_scale[t])
In [66]:
log_scale.describe()
Out[66]:
deg_C relative_humidity absolute_humidity sensor_1 sensor_2 sensor_3 sensor_4 sensor_5 target_carbon_monoxide target_benzene ... evening midnight Friday Monday Saturday Sunday Thursday Tuesday Wednesday is_weekend
count 7111.000000 7111.000000 7111.000000 7111.000000 7111.000000 7111.000000 7111.000000 7111.000000 7111.000000 7111.000000 ... 7111.000000 7111.000000 7111.000000 7111.000000 7111.000000 7111.000000 7111.000000 7111.000000 7111.000000 7111.000000
mean 3.010875 3.808586 0.727923 6.977256 6.797553 6.731233 7.291797 6.834473 1.031356 2.168513 ... 0.115784 0.115723 0.100647 0.098328 0.098389 0.098328 0.100647 0.098328 0.098877 0.196608
std 0.406481 0.404368 0.198131 0.193461 0.315017 0.323938 0.261305 0.386188 0.429461 0.761459 ... 0.258789 0.258545 0.244263 0.241821 0.241821 0.241821 0.244263 0.241821 0.242554 0.312470
min 0.832909 2.292535 0.181321 6.431814 5.899897 5.741720 6.316984 5.495938 0.095310 0.095310 ... 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000
25% 2.766319 3.546740 0.618370 6.836528 6.601094 6.525103 7.186409 6.584584 0.693147 1.704748 ... 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000
50% 3.077312 3.877432 0.734049 6.967438 6.819143 6.719979 7.322576 6.834862 0.993252 2.251292 ... 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000
75% 3.288402 4.123903 0.877196 7.103980 7.025627 6.917557 7.450893 7.111267 1.335001 2.721295 ... 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.693147
max 3.852273 4.519612 1.172792 7.644584 7.742228 7.851038 7.977556 7.861573 2.602690 4.169761 ... 0.693359 0.693359 0.693359 0.693359 0.693359 0.693359 0.693359 0.693359 0.693359 0.693147

8 rows × 25 columns

In [67]:
for col in numeric_features:
    fig = plt.figure(figsize=(9, 6))
    ax = fig.gca()
    feature = log_scale[col]
    feature.hist(bins=100, ax = ax)
    ax.axvline(feature.mean(), color='magenta', linestyle='dashed', linewidth=2)
    ax.axvline(feature.median(), color='cyan', linestyle='dashed', linewidth=2)
    ax.set_title(col)
plt.show()
In [68]:
from sklearn.preprocessing import StandardScaler

standard_scale = train_eng.copy()
scaler = StandardScaler()
scaled = pd.DataFrame(scaler.fit_transform(standard_scale), columns = train_eng.columns)
In [69]:
for col in numeric_features:
    fig = plt.figure(figsize=(9, 6))
    ax = fig.gca()
    feature = scaled[col]
    feature.hist(bins=100, ax = ax)
    ax.axvline(feature.mean(), color='magenta', linestyle='dashed', linewidth=2)
    ax.axvline(feature.median(), color='cyan', linestyle='dashed', linewidth=2)
    ax.set_title(col)
plt.show()

Thoughts:

All the distributions are bit on the tail heavy side. Log scaling looks certainly the most plausible to work with as it is closes to normal distribution and the mean and the median are closer to the center.

Note: The distributions are not truly normal in the statistical sense, which would result in a smooth, symmetric "bell-curve" histogram with the mean and mode (the most common value) in the center; but they do generally indicate that most of the observations have a value somewhere near the middle.

We've explored the distribution of the numeric values in the dataset, but what about the categorical features that we created using the one hot encoding? We can plot all of them togther in a single plot with the pandas scatter_matrix method.

In [70]:
from pandas.plotting import scatter_matrix

attributes = ['target_carbon_monoxide','target_nitrogen_oxides',
              'target_benzene','sensor_1','sensor_5','sensor_2','sensor_4', 'sensor_3','deg_C','relative_humidity']

axes = scatter_matrix(log_scale[attributes], figsize=(18,18))
for ax in axes.flatten():
    ax.xaxis.label.set_rotation(90)
    ax.yaxis.label.set_rotation(0)
    ax.yaxis.label.set_ha('right')

plt.tight_layout()
plt.gcf().subplots_adjust(wspace=0, hspace=0)
plt.show()

Thoughts

Since we decided to use log scale as final method, after plotting scatter matrix only for that scale, there seems to both linear and non linear relationships between the different features.

  1. The sensors seems to have a linear relationship with their target variables.
  2. The temperature, humidity both have non linear relationships.
  3. The target variables have linear relationship with each other.

Lets get the correlation matrix to see if the linear and non linear relations makes sense with pearson correlation analysis

In [71]:
'''correlation of features with CO target variable'''

corr_matrix = log_scale.corr()
corr_matrix['target_carbon_monoxide'].sort_values(ascending=False)
Out[71]:
target_carbon_monoxide    1.000000
sensor_1                  0.870870
target_nitrogen_oxides    0.843404
sensor_5                  0.829337
sensor_2                  0.779981
target_benzene            0.774434
sensor_4                  0.475681
evening                   0.329730
noon                      0.108269
afternoon                 0.100031
Friday                    0.096594
Thursday                  0.084747
Tuesday                   0.062351
Wednesday                 0.052710
morning                   0.049013
deg_C                     0.041945
absolute_humidity        -0.010726
Monday                   -0.027995
relative_humidity        -0.030062
Saturday                 -0.063538
midnight                 -0.081991
Sunday                   -0.206758
is_weekend               -0.209171
dawn                     -0.505413
sensor_3                 -0.707397
Name: target_carbon_monoxide, dtype: float64
In [72]:
'''correlation of features with NO target variable'''

corr_matrix['target_nitrogen_oxides'].sort_values(ascending=False)
Out[72]:
target_nitrogen_oxides    1.000000
target_carbon_monoxide    0.843404
sensor_5                  0.778850
sensor_1                  0.711786
sensor_2                  0.652730
target_benzene            0.651093
sensor_4                  0.226679
evening                   0.203325
noon                      0.136201
morning                   0.105208
relative_humidity         0.101113
Friday                    0.077549
afternoon                 0.074427
Thursday                  0.063057
Tuesday                   0.053671
Wednesday                 0.048867
Monday                   -0.021145
Saturday                 -0.045496
midnight                 -0.069243
absolute_humidity        -0.109330
is_weekend               -0.172943
deg_C                    -0.174690
Sunday                   -0.177989
dawn                     -0.450123
sensor_3                 -0.647287
Name: target_nitrogen_oxides, dtype: float64
In [73]:
'''correlation of features with benzene target variable'''

corr_matrix['target_benzene'].sort_values(ascending=False)
Out[73]:
target_benzene            1.000000
sensor_2                  0.989710
sensor_5                  0.822993
sensor_4                  0.819966
target_carbon_monoxide    0.774434
sensor_1                  0.743550
target_nitrogen_oxides    0.651093
absolute_humidity         0.341020
evening                   0.242480
deg_C                     0.143921
noon                      0.122983
afternoon                 0.105975
Tuesday                   0.084101
morning                   0.067985
Thursday                  0.054046
Wednesday                 0.031614
Friday                    0.027704
Monday                    0.003599
Saturday                 -0.017511
relative_humidity        -0.053485
midnight                 -0.072257
is_weekend               -0.156263
Sunday                   -0.184428
dawn                     -0.467419
sensor_3                 -0.899791
Name: target_benzene, dtype: float64
In [74]:
'''
plotting the corrleation matrix with the seaborn heatmap method
'''

import matplotlib.pyplot as plt
import seaborn as sns

mask = np.triu(np.ones_like(corr_matrix, dtype=bool))
f, ax = plt.subplots(figsize=(11, 9))
cmap = sns.diverging_palette(230, 20, as_cmap=True)
sns.heatmap(corr_matrix, mask=mask, cmap=cmap, vmax=.3, center=0,
            square=True, linewidths=.5, cbar_kws={"shrink": .5})
Out[74]:
<AxesSubplot:>

Thoughts

Looks like the sensors 1,2,3,4 and 5 have high correlation with the target vectors. Followed by temperature, humidity and the corresponding target variables. We saw that in the scatter matrix as well since the sensors seemed to have a linear relationship.

Train a Regression Model

Now that we've explored the data, it's time to use it to train a regression model that uses the features we've identified as potentially predictive to predict the targets. The first thing we need to do is to seperate the train data into train and validation. Since the it is multivariate analysis and there 3 different target variables and one is dependant on the other we need to use them while training, so train_test_split will not work here. So we will randomly split the dataframe into 80/20.

Beginning with a naive linear regression model.

In [75]:
val = log_scale.sample(frac = 0.2)
train = log_scale.drop(val.index)

print(f'train size: {len(train)}, val size: {len(val)}')
train size: 5689, val size: 1422

LinearRegression()

In [76]:
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score


'''
Linear regression first. Although we saw some non linear relationship between features, it is a good starting place to
get an idea and quantify some metrics before moving to more complex models'''


model_dict = {'target_nitrogen_oxides':LinearRegression(),  #Dictionary for 3 different targets with linearregression
             'target_benzene': LinearRegression(), 
             'target_carbon_monoxide': LinearRegression()}

for idx_target, target in enumerate(model_dict,1):    #iterate over the dictionary and split the target and predictors
    X_train = train.iloc[:, :-3]           
    y_train = train.iloc[:, -idx_target]
    
    X_test = val.iloc[:, :-3]
    y_test = val.iloc[:, -idx_target]
    
    model_dict[target].fit(X_train, y_train)
    preds = model_dict[target].predict(X_test)
    rmse = mean_squared_error(y_test,preds)
    r2 = r2_score(y_test,preds)
    
    print(f'{target}: rmse: {np.sqrt(rmse).round(2)}, r2:  {r2}')   #results
target_nitrogen_oxides: rmse: 0.0, r2:  1.0
target_benzene: rmse: 0.18, r2:  0.4087527397272016
target_carbon_monoxide: rmse: 0.18, r2:  0.4284852854084843

Decision Tree()

In [77]:
from sklearn.tree import DecisionTreeRegressor

'''Decision Trees can learn with non linear data'''

model_dict = {'target_nitrogen_oxides':DecisionTreeRegressor(),
             'target_benzene': DecisionTreeRegressor(),
             'target_carbon_monoxide': DecisionTreeRegressor()}

for idx_target, target in enumerate(model_dict,1):    #iterate over the dictionary and split the target and predictors
    X_train = train.iloc[:, :-3]
    y_train = train.iloc[:, -idx_target]
    
    X_test = val.iloc[:, :-3]
    y_test = val.iloc[:, -idx_target]
    
    model_dict[target].fit(X_train, y_train)
    preds = model_dict[target].predict(X_test)
    rmse = mean_squared_error(y_test,preds)
    r2 = r2_score(y_test,preds)
    
    print(f'{target}: rmse: {np.sqrt(rmse).round(2)}, r2:  {r2}')
target_nitrogen_oxides: rmse: 0.0, r2:  1.0
target_benzene: rmse: 0.24, r2:  0.00290895553331183
target_carbon_monoxide: rmse: 0.24, r2:  0.05350032825995843

RandomForestRegressor}()

In [78]:
from sklearn.ensemble import RandomForestRegressor

'''An even better tree model'''


model_dict = {'target_nitrogen_oxides':RandomForestRegressor(),
             'target_benzene': RandomForestRegressor(),
             'target_carbon_monoxide': RandomForestRegressor()}

for idx_target, target in enumerate(model_dict,1):       #iterate over the dictionary and split the target and predictors
    X_train = train.iloc[:, :-3] 
    y_train = train.iloc[:, -idx_target]
    
    X_test = val.iloc[:, :-3]
    y_test = val.iloc[:, -idx_target]
    
    model_dict[target].fit(X_train, y_train)
    preds = model_dict[target].predict(X_test)
    rmse = mean_squared_error(y_test,preds)
    r2 = r2_score(y_test,preds)
    
    print(f'{target}: rmse: {np.sqrt(rmse).round(2)}, r2:  {r2}')
target_nitrogen_oxides: rmse: 0.0, r2:  1.0
target_benzene: rmse: 0.17, r2:  0.4818565637285873
target_carbon_monoxide: rmse: 0.17, r2:  0.49018064876404266

So now we've quantified the ability of our model to predict the number of pollutants. But target_nitrogen_oxides having 1.0 is too good to be true and all things points to overfitting. It definitely has some predictive power, but we can probably do better!

k-Fold implementation and more capable model- CatBoostRegressor

Until now we had an arbitrary split for the training and validation. While random, the prediction capabilities could widely vary for different sample. Instead of relying on one such sample, lets implement kfold validation. CatBoost is a high-performance open source library for gradient boosting on decision trees and could be slightly or even much better than sklearn implementations.

In [79]:
from sklearn.model_selection import TimeSeriesSplit
from catboost import CatBoostRegressor

'''CatBoostRegressor serves as the state of the art for many regression problems'''



SEED = 42     #random seed

model_dict = {'target_nitrogen_oxides':CatBoostRegressor(random_state=SEED,    #model dictionary
                                learning_rate=0.05,
                                depth=8,
                               verbose=False),
             'target_benzene': CatBoostRegressor(random_state=SEED,
                                learning_rate=0.05,
                                depth=8,
                               verbose=False),
             'target_carbon_monoxide': CatBoostRegressor(random_state=SEED,
                                learning_rate=0.05,
                                depth=8,
                               verbose=False)}

splits = 5
kfold = TimeSeriesSplit(n_splits=splits)

for idx_fold, (train_idx,val_idx) in enumerate(kfold.split(train)):   #iterate over the folds and create a random train and val variables
    k_train, k_val = scaled.iloc[train_idx], scaled.iloc[val_idx]
    
    for idx_target, target in enumerate(model_dict,1):       #iterate over model dictionary
        x_train = k_train.iloc[:,:-3]
        x_val = k_val.iloc[:,:-3]
        
        y_train = k_train.iloc[:,-idx_target].astype(float)
        y_val = k_val.iloc[:,-idx_target].astype(float)
        
        
        model_dict[target].fit(x_train,y_train)      #train
        preds = model_dict[target].predict(x_val)    #predict
         
        r2 = r2_score(y_val,preds)                    #metrics
        rmse = mean_squared_error(y_val, np.expm1(preds)) ** (1/2)
        
        print(f'fold:{idx_fold} ; {target}: rmse: {rmse} and r2: {r2}')
    print('\n')
        
        
    
    
fold:0 ; target_nitrogen_oxides: rmse: 1.0529544665290886 and r2: 0.9974132121018654
fold:0 ; target_benzene: rmse: 1.201197100617081 and r2: 0.18798240533290678
fold:0 ; target_carbon_monoxide: rmse: 1.0788616944930722 and r2: 0.29493642598948155


fold:1 ; target_nitrogen_oxides: rmse: 1.0355463399824767 and r2: 0.9919394065562662
fold:1 ; target_benzene: rmse: 1.2723163749640798 and r2: 0.017191386726805313
fold:1 ; target_carbon_monoxide: rmse: 1.7465031271362252 and r2: 0.10507687095896012


fold:2 ; target_nitrogen_oxides: rmse: 1.190591651537904 and r2: 0.9991069760430005
fold:2 ; target_benzene: rmse: 1.432105456069224 and r2: 0.3091265990764992
fold:2 ; target_carbon_monoxide: rmse: 1.074027839347575 and r2: 0.26791220420906503


fold:3 ; target_nitrogen_oxides: rmse: 1.0858058566054205 and r2: 0.9981514826829635
fold:3 ; target_benzene: rmse: 0.9404438796719523 and r2: 0.33533835959902725
fold:3 ; target_carbon_monoxide: rmse: 1.2028703149344242 and r2: 0.30973147996458394


fold:4 ; target_nitrogen_oxides: rmse: 1.203806865889381 and r2: 0.9991057498990993
fold:4 ; target_benzene: rmse: 1.08095657271 and r2: 0.3156448589920443
fold:4 ; target_carbon_monoxide: rmse: 0.8902000126305795 and r2: 0.388304039571714


The results look quite similar to linear regression with slight improvements in certain folds, tut the r2 for target_nitrogen_oxide still seems to be very high, whereas the other two is modest. We can do some fine tuning by experimenting with the hyperparameters of the catboostregressor().

Quite decent for a minimal feature engineering given that the top sumbission for the kaggle challenge was ~0.22 when this was made and the test results were unknown. Let's try to streamline the process by using high level libraries like pycaret or AutoML.

Pycaret is a great tool for machine learning and data anaylsis. It's pretty cool how high level libraries for machine learning have matured in the recent years. We can create complex regression model with just few lines of code.

FineTuning the process with Pycaret

We need to train and evaluate for each indiviudal targets. First, setting up the model with the predictor features and only target variable as carbon monoxide.

In [80]:
from pycaret.regression import setup, compare_models, blend_models, finalize_model,tune_model, predict_model, plot_model


setup_target_carbon_monoxide = setup(data = log_scale, target = 'target_carbon_monoxide', session_id=24,
                  
                  # Ignore features target_benzene and target_nitrogen_oxides from the experiment 
                  ignore_features = ['target_benzene','target_nitrogen_oxides'], 
                  silent = True                   
)
Description Value
0 session_id 24
1 Target target_carbon_monoxide
2 Original Data (7111, 25)
3 Missing Values False
4 Numeric Features 21
5 Categorical Features 1
6 Ordinal Features False
7 High Cardinality Features False
8 High Cardinality Method None
9 Transformed Train Set (4977, 22)
10 Transformed Test Set (2134, 22)
11 Shuffle Train-Test True
12 Stratify Train-Test False
13 Fold Generator KFold
14 Fold Number 10
15 CPU Jobs -1
16 Use GPU False
17 Log Experiment False
18 Experiment Name reg-default-name
19 USI 22e2
20 Imputation Type simple
21 Iterative Imputation Iteration None
22 Numeric Imputer mean
23 Iterative Imputation Numeric Model None
24 Categorical Imputer constant
25 Iterative Imputation Categorical Model None
26 Unknown Categoricals Handling least_frequent
27 Normalize False
28 Normalize Method None
29 Transformation False
30 Transformation Method None
31 PCA False
32 PCA Method None
33 PCA Components None
34 Ignore Low Variance False
35 Combine Rare Levels False
36 Rare Level Threshold None
37 Numeric Binning False
38 Remove Outliers False
39 Outliers Threshold None
40 Remove Multicollinearity False
41 Multicollinearity Threshold None
42 Remove Perfect Collinearity True
43 Clustering False
44 Clustering Iteration None
45 Polynomial Features False
46 Polynomial Degree None
47 Trignometry Features False
48 Polynomial Threshold None
49 Group Features False
50 Feature Selection False
51 Feature Selection Method classic
52 Features Selection Threshold None
53 Feature Interaction False
54 Feature Ratio False
55 Interaction Threshold None
56 Transform Target False
57 Transform Target Method box-cox
In [81]:
#Train all the regression models available in the library and choose the best ones.
best_target_carbon_monoxide = compare_models(sort = 'RMSLE', n_select = 3) 
Model MAE MSE RMSE R2 RMSLE MAPE TT (Sec)
catboost CatBoost Regressor 0.0788 0.0121 0.1097 0.9350 0.0565 0.1001 4.3950
lightgbm Light Gradient Boosting Machine 0.0821 0.0134 0.1155 0.9278 0.0589 0.1039 0.2510
xgboost Extreme Gradient Boosting 0.0845 0.0138 0.1174 0.9256 0.0600 0.1061 5.8570
et Extra Trees Regressor 0.0830 0.0139 0.1175 0.9253 0.0603 0.1055 1.5260
rf Random Forest Regressor 0.0865 0.0150 0.1221 0.9193 0.0621 0.1093 2.3780
gbr Gradient Boosting Regressor 0.0876 0.0153 0.1237 0.9173 0.0629 0.1121 0.9080
knn K Neighbors Regressor 0.1090 0.0223 0.1492 0.8795 0.0746 0.1338 0.0700
ridge Ridge Regression 0.1096 0.0235 0.1531 0.8731 0.0777 0.1402 0.0150
lr Linear Regression 0.1097 0.0235 0.1531 0.8731 0.0778 0.1403 0.3810
br Bayesian Ridge 0.1097 0.0235 0.1531 0.8731 0.0778 0.1403 0.0180
dt Decision Tree Regressor 0.1217 0.0292 0.1709 0.8423 0.0873 0.1473 0.0550
huber Huber Regressor 0.1154 0.0302 0.1736 0.8371 0.0880 0.1455 0.1160
par Passive Aggressive Regressor 0.1307 0.0313 0.1766 0.8312 0.0893 0.1610 0.0230
ada AdaBoost Regressor 0.1345 0.0301 0.1734 0.8377 0.0910 0.1856 0.3660
omp Orthogonal Matching Pursuit 0.1425 0.0392 0.1978 0.7881 0.1000 0.1773 0.0150
lar Least Angle Regression 0.1848 0.1131 0.2427 0.4041 0.1111 0.2294 0.0190
en Elastic Net 0.3530 0.1865 0.4317 -0.0044 0.2158 0.4897 0.0270
llar Lasso Least Angle Regression 0.3530 0.1865 0.4317 -0.0044 0.2158 0.4897 0.0150
lasso Lasso Regression 0.3530 0.1865 0.4317 -0.0044 0.2158 0.4897 0.0160

Catboost regressor seems to be the best one in pycaret implementation as well.

In [82]:
#Instead of choosing one, lets ensemble top 10 models

model_target_carbon_monoxide = blend_models(best_target_carbon_monoxide)
MAE MSE RMSE R2 RMSLE MAPE
0 0.0776 0.0110 0.1050 0.9375 0.0515 0.0941
1 0.0762 0.0109 0.1043 0.9397 0.0518 0.0874
2 0.0810 0.0132 0.1149 0.9278 0.0589 0.1025
3 0.0812 0.0135 0.1163 0.9335 0.0563 0.0929
4 0.0804 0.0143 0.1195 0.9265 0.0651 0.1267
5 0.0803 0.0120 0.1097 0.9379 0.0570 0.1040
6 0.0753 0.0110 0.1047 0.9387 0.0556 0.0987
7 0.0829 0.0131 0.1145 0.9236 0.0609 0.1114
8 0.0803 0.0121 0.1102 0.9340 0.0562 0.0976
9 0.0741 0.0111 0.1052 0.9421 0.0528 0.0857
Mean 0.0789 0.0122 0.1104 0.9341 0.0566 0.1001
SD 0.0028 0.0012 0.0053 0.0059 0.0040 0.0115
In [83]:
#Hyperparameter fine tuning

Tuned_model_target_carbon_monoxide  = tune_model(model_target_carbon_monoxide ) 
MAE MSE RMSE R2 RMSLE MAPE
0 0.0768 0.0108 0.1040 0.9387 0.0512 0.0934
1 0.0760 0.0109 0.1044 0.9395 0.0520 0.0874
2 0.0806 0.0129 0.1137 0.9293 0.0583 0.1018
3 0.0804 0.0132 0.1150 0.9351 0.0559 0.0923
4 0.0799 0.0140 0.1184 0.9279 0.0646 0.1256
5 0.0798 0.0118 0.1088 0.9389 0.0566 0.1033
6 0.0752 0.0109 0.1042 0.9393 0.0555 0.0988
7 0.0824 0.0129 0.1138 0.9246 0.0605 0.1109
8 0.0796 0.0120 0.1097 0.9346 0.0560 0.0968
9 0.0734 0.0108 0.1040 0.9433 0.0525 0.0851
Mean 0.0784 0.0120 0.1096 0.9351 0.0563 0.0995
SD 0.0027 0.0011 0.0051 0.0057 0.0039 0.0113
In [84]:
#Plotting the final ensembled models residuals

plot_model(Tuned_model_target_carbon_monoxide)
In [85]:
#plotting the fitted line for the prediction

predict_model(Tuned_model_target_carbon_monoxide)
plot_model(Tuned_model_target_carbon_monoxide, plot='error')

It is as simple to train a ML model. A r^2 of 0.925 is certainly much better than the best catboost model of 0.4 from sklearns implementation.

Evaluating on the unseen data.

Predicting on the unseen test data. Needs some manipulation before getting the labels.

In [106]:
#applying log scaling to test data and removing the target variable columns before predicting with the trained model

test_eng = test_eng.drop(['target_carbon_monoxide','target_benzene','target_nitrogen_oxides'], axis=1)
log_scale_test = test_eng.copy()

for t in log_scale_test.columns:
    log_scale_test[t] = log_scaling(log_scale_test[t])
In [111]:
test_preds = predict_model(Tuned_model_target_carbon_monoxide,log_scale_test)
test_preds.head()
Transformation Pipeline and Model Successfully Loaded
Out[111]:
deg_C relative_humidity absolute_humidity sensor_1 sensor_2 sensor_3 sensor_4 sensor_5 dawn morning ... midnight Friday Monday Saturday Sunday Thursday Tuesday Wednesday is_weekend Label
1 1.808289 3.964615 0.375968 7.131299 6.763769 6.535096 6.881206 7.447168 0.000000 0.0 ... 0.693359 0.0 0.0 0.693359 0.0 0.0 0.0 0.0 0.693147 5.740173
2 1.916923 3.960813 0.384514 7.006333 6.778785 6.543480 6.848960 7.171503 0.693359 0.0 ... 0.000000 0.0 0.0 0.693359 0.0 0.0 0.0 0.0 0.693147 5.714057
3 1.791759 3.975936 0.384786 7.039397 6.821326 6.588376 6.919684 7.157735 0.693359 0.0 ... 0.000000 0.0 0.0 0.693359 0.0 0.0 0.0 0.0 0.693147 5.752083
4 1.704748 4.069027 0.381855 6.930886 6.732806 6.771363 6.875232 7.041674 0.693359 0.0 ... 0.000000 0.0 0.0 0.693359 0.0 0.0 0.0 0.0 0.693147 5.438167
5 1.704748 4.001864 0.389268 6.912743 6.615396 6.819143 6.897806 6.882232 0.693359 0.0 ... 0.000000 0.0 0.0 0.693359 0.0 0.0 0.0 0.0 0.693147 5.260027

5 rows × 23 columns

Alright, it works. Instead of repeating the 5 cells above for every target variable, we will take this oppurtunity to create a single object with pycaret where users can simply input the data and the AirPollutionPrediction() object would train and return the best model. We will also include a predictor function within the object to make it easier for future use.

In [128]:
from pycaret.regression import *


class AirPollutionPrediction():
    def __init__(self,train,test):   #initialize input dataframes
        self.train = train
        self.test = test
        

    
    def pycaret_train(self,train,target,ignore_features):   #training method.
        
        setup(data=train,target=target, session_id=24,
                  ignore_features = ignore_features, 
                  silent = True)
        
        best = compare_models(sort = 'RMSLE', n_select = 3)    #trains several models and returns the best 10 models
        blended = blend_models(estimator_list= best)           #ensembles the best models.
        tuned  = tune_model(blended )                          #hyperparameter fine tunining
        pred_holdout = predict_model(blended)                 #holdout validation
        final_model = finalize_model(blended)
        
        
        name = f'airpollutant_{target}'
        save_model(final_model, name)                        #save model with name airpollutant_{labeltype}
        return final_model

    def predict(self,test,model):                            #calls model and predict on unseen test data and plot the same
        pred_esb = predict_model(model, test)
        re = pred_esb['Label']
        
        
        
        return re
        
 
    def run(self,train,target,test):                       #run method
        result  = pd.DataFrame()
        targets  = ['target_nitrogen_oxides','target_carbon_monoxide','target_benzene']
        
        for target in targets:                            #call the train and test functions for dfifferent target variables.
            
            if target == 'target_nitrogen_oxides':
                ignores = [targets[1],targets[2]]
                model = self.pycaret_train(train,target,ignores)
                result['target_nitrogen_oxides'] = self.predict(test,model)
            elif target == 'target_carbon_monoxide':
                ignores = [targets[0],targets[2]]
                model = self.pycaret_train(train,target,ignores)           
                result['target_carbon_monoxide'] = self.predict(test,model)
            elif target == 'target_benzene':
                ignores = [targets[0],targets[1]]
                model = self.pycaret_train(train,target,ignores)
                result['benzene'] = self.predict(test,model)
       
        return result
      
    def main(self):                                      #main function. Takes input to begin training with y and pass with n
        
        print('Train the model: y or n?')
        target = input()
        while target not in ('y','n'):
            print('Enter Valid choice')
            target = input()
            print('Train the model: y or n?')
            
                           
        if target == 'y':
            predictions = self.run(self.train,target,self.test)
            predictions.to_csv('results.csv')                  #save test predictions to csv
        else: 
            pass
            
    if __name__ == 'main':
        main()
        
            
            
        
   
In [129]:
pollution = AirPollutionPrediction(train = log_scale, test= log_scale_test)
pollution.main()
MAE MSE RMSE R2 RMSLE MAPE
0 0.0696 0.0075 0.0866 0.9875 0.0299 0.0407
1 0.0714 0.0079 0.0889 0.9851 0.0303 0.0380
2 0.0682 0.0071 0.0841 0.9877 0.0296 0.0401
3 0.0695 0.0074 0.0862 0.9875 0.0295 0.0385
4 0.0662 0.0067 0.0821 0.9880 0.0277 0.0362
5 0.0705 0.0076 0.0869 0.9877 0.0302 0.0417
6 0.0732 0.0082 0.0904 0.9856 0.0319 0.0437
7 0.0678 0.0074 0.0858 0.9871 0.0289 0.0368
8 0.0651 0.0066 0.0815 0.9885 0.0273 0.0359
9 0.0686 0.0074 0.0860 0.9883 0.0297 0.0401
Mean 0.0690 0.0074 0.0859 0.9873 0.0295 0.0392
SD 0.0023 0.0004 0.0026 0.0011 0.0013 0.0024
Model MAE MSE RMSE R2 RMSLE MAPE
0 Voting Regressor 0.0698 0.0076 0.0872 0.9866 0.0296 0.0382
Transformation Pipeline and Model Successfully Saved

Loading the test predictions

In [147]:
test_results = pd.read_csv('./results.csv', index_col=[0])
final = [log_scale_test, test_results]
final = pd.concat(final,axis=1, join='inner')
In [150]:
final.head()
Out[150]:
deg_C relative_humidity absolute_humidity sensor_1 sensor_2 sensor_3 sensor_4 sensor_5 dawn morning ... Monday Saturday Sunday Thursday Tuesday Wednesday is_weekend target_nitrogen_oxides target_carbon_monoxide benzene
1 1.808289 3.964615 0.375968 7.131299 6.763769 6.535096 6.881206 7.447168 0.000000 0.0 ... 0.0 0.693359 0.0 0.0 0.0 0.0 0.693147 5.788731 0.932186 2.116958
2 1.916923 3.960813 0.384514 7.006333 6.778785 6.543480 6.848960 7.171503 0.693359 0.0 ... 0.0 0.693359 0.0 0.0 0.0 0.0 0.693147 5.733205 0.958473 2.101631
3 1.791759 3.975936 0.384786 7.039397 6.821326 6.588376 6.919684 7.157735 0.693359 0.0 ... 0.0 0.693359 0.0 0.0 0.0 0.0 0.693147 5.774099 1.009564 2.197030
4 1.704748 4.069027 0.381855 6.930886 6.732806 6.771363 6.875232 7.041674 0.693359 0.0 ... 0.0 0.693359 0.0 0.0 0.0 0.0 0.693147 5.414419 0.692009 1.978213
5 1.704748 4.001864 0.389268 6.912743 6.615396 6.819143 6.897806 6.882232 0.693359 0.0 ... 0.0 0.693359 0.0 0.0 0.0 0.0 0.693147 5.265816 0.609709 1.753950

5 rows × 25 columns

Result thoughts

Okay,Now we can train and call the best model to predict input values easily. The test set metrics could not be quantified because the kaggle challenge was ongoing. I will try to update the quantification if or when the results are released. Thanks for your time.