The challenge is to predict the values of air pollution measurements over time, based on basic weather information (temperature and humidity) and the input values of 5 sensors.
deg_c = temperature
relative_humidity = humidity
absolute_humidity = normalized humidity
sensors 1 to 5 = sensor data
target_carbon_monoxide = CO values in air
target_benze = beneze values in air
target_nitrogen_oxides = NO values in air
import pandas as pd #data manipulation
import numpy as np #linear algebra
import matplotlib.pyplot as plt #plotting library
import seaborn as sns #plotting library
#load data
train = pd.read_csv('../input/tabular-playground-series-jul-2021/train.csv')
test_unseen = pd.read_csv('../input/tabular-playground-series-jul-2021/test.csv')
train.head()
train.info()
train[1:].describe() #check the descriptions of the data without the object column
train.isnull().values.any() #check null values
The data is relatively clean. Only one object column ie. datetime. Can be engineerd with one hot encoding. No other discrete columns can proceed further.
std for sensor columns are very high. Should be engineered
total = [train,test_useen] #merging train and test into single dataframe
total = pd.concat(total)
total
extracting months, day of the month, hours, day name, time of the day and weekend or weekday check
'''
months and day of the months
'''
total.date_time = pd.to_datetime(total.date_time)
months = total.date_time.dt.month
monthly_days = total.date_time.dt.day
hours = total.date_time.dt.hour
day_name = total.date_time.dt.day_name()
days = pd.get_dummies(day_name)
days
'''
part of the days, seperating into 5 parts based on daylight and one hot enconding the resultant
categorical values
'''
def daypart(hour):
if hour in [2,3,4,5]:
return "dawn"
elif hour in [6,7,8,9]:
return "morning"
elif hour in [10,11,12,13]:
return "noon"
elif hour in [14,15,16,17]:
return "afternoon"
elif hour in [18,19,20,21]:
return "evening"
else: return "midnight"
raw_days = hours.apply(daypart)
dayparts = pd.get_dummies(raw_days)
dayparts = dayparts[['dawn','morning','noon','afternoon','evening','midnight']]
dayparts
'''
check if the day is a weekend or otherwise
'''
is_weekend = day_name.apply(lambda x : 1 if x in ['Saturday','Sunday'] else 0)
is_weekend = pd.DataFrame({'is_weekend':is_weekend})
'''
concat the resultant columns and split the train and test df back based on the original index
'''
final = pd.concat([total, dayparts, days,is_weekend],axis=1)
final = final.drop('date_time',axis=1)
train_eng = final[:7111]
test_eng = final.iloc[7112:,:]
print(f'train :{train_eng.shape}, test: {test_eng.shape}')
train_eng
test_eng
Okay, now we have 25 columns that we feature engineered. But still we need to look into the distributions of the other predictor variables before training the models.
%matplotlib inline
numeric_features = ['deg_C','relative_humidity','absolute_humidity','sensor_1','sensor_2','sensor_3',
'sensor_4','sensor_5','target_carbon_monoxide','target_benzene','target_nitrogen_oxides']
'''
Non engineered numerical distribution
'''
for col in numeric_features: #iterating through the numerical features and plotting the histogram, mean and median
fig = plt.figure(figsize=(9, 6))
ax = fig.gca()
feature = train_eng[col]
feature.hist(bins=100, ax = ax)
ax.axvline(feature.mean(), color='magenta', linestyle='dashed', linewidth=2)
ax.axvline(feature.median(), color='cyan', linestyle='dashed', linewidth=2)
ax.set_title(col)
plt.show()
The target variables and the sensors data are skewed. The mean and median are skewed to the left. This should be fixed. The distribution can be changed by experimenting with different scaling formats. Three different method were done,
'''
applying min_max scaling to the predictors and targets
'''
def min_max_scaling(df):
df_norm = df.copy()
for column in df_norm.columns:
df_norm[column] = (df_norm[column] - df_norm[column].min()) / (df_norm[column].max() - df_norm[column].min())
return df_norm
eng_norm = min_max_scaling(train_eng)
eng_norm.describe()
for col in numeric_features:
fig = plt.figure(figsize=(9, 6))
ax = fig.gca()
feature = eng_norm[col]
feature.hist(bins=100, ax = ax)
ax.axvline(feature.mean(), color='magenta', linestyle='dashed', linewidth=2)
ax.axvline(feature.median(), color='cyan', linestyle='dashed', linewidth=2)
ax.set_title(col)
plt.show()
'''
applying log scaling to the predictors and targets
'''
def log_scaling(col):
col = np.log1p(col)
return col
log_scale = train_eng.copy()
for t in log_scale.columns:
log_scale[t] = log_scaling(log_scale[t])
log_scale.describe()
for col in numeric_features:
fig = plt.figure(figsize=(9, 6))
ax = fig.gca()
feature = log_scale[col]
feature.hist(bins=100, ax = ax)
ax.axvline(feature.mean(), color='magenta', linestyle='dashed', linewidth=2)
ax.axvline(feature.median(), color='cyan', linestyle='dashed', linewidth=2)
ax.set_title(col)
plt.show()
from sklearn.preprocessing import StandardScaler
standard_scale = train_eng.copy()
scaler = StandardScaler()
scaled = pd.DataFrame(scaler.fit_transform(standard_scale), columns = train_eng.columns)
for col in numeric_features:
fig = plt.figure(figsize=(9, 6))
ax = fig.gca()
feature = scaled[col]
feature.hist(bins=100, ax = ax)
ax.axvline(feature.mean(), color='magenta', linestyle='dashed', linewidth=2)
ax.axvline(feature.median(), color='cyan', linestyle='dashed', linewidth=2)
ax.set_title(col)
plt.show()
All the distributions are bit on the tail heavy side. Log scaling looks certainly the most plausible to work with as it is closes to normal distribution and the mean and the median are closer to the center.
Note: The distributions are not truly normal in the statistical sense, which would result in a smooth, symmetric "bell-curve" histogram with the mean and mode (the most common value) in the center; but they do generally indicate that most of the observations have a value somewhere near the middle.
We've explored the distribution of the numeric values in the dataset, but what about the categorical features that we created using the one hot encoding? We can plot all of them togther in a single plot with the pandas scatter_matrix method.
from pandas.plotting import scatter_matrix
attributes = ['target_carbon_monoxide','target_nitrogen_oxides',
'target_benzene','sensor_1','sensor_5','sensor_2','sensor_4', 'sensor_3','deg_C','relative_humidity']
axes = scatter_matrix(log_scale[attributes], figsize=(18,18))
for ax in axes.flatten():
ax.xaxis.label.set_rotation(90)
ax.yaxis.label.set_rotation(0)
ax.yaxis.label.set_ha('right')
plt.tight_layout()
plt.gcf().subplots_adjust(wspace=0, hspace=0)
plt.show()
Since we decided to use log scale as final method, after plotting scatter matrix only for that scale, there seems to both linear and non linear relationships between the different features.
Lets get the correlation matrix to see if the linear and non linear relations makes sense with pearson correlation analysis
'''correlation of features with CO target variable'''
corr_matrix = log_scale.corr()
corr_matrix['target_carbon_monoxide'].sort_values(ascending=False)
'''correlation of features with NO target variable'''
corr_matrix['target_nitrogen_oxides'].sort_values(ascending=False)
'''correlation of features with benzene target variable'''
corr_matrix['target_benzene'].sort_values(ascending=False)
'''
plotting the corrleation matrix with the seaborn heatmap method
'''
import matplotlib.pyplot as plt
import seaborn as sns
mask = np.triu(np.ones_like(corr_matrix, dtype=bool))
f, ax = plt.subplots(figsize=(11, 9))
cmap = sns.diverging_palette(230, 20, as_cmap=True)
sns.heatmap(corr_matrix, mask=mask, cmap=cmap, vmax=.3, center=0,
square=True, linewidths=.5, cbar_kws={"shrink": .5})
Looks like the sensors 1,2,3,4 and 5 have high correlation with the target vectors. Followed by temperature, humidity and the corresponding target variables. We saw that in the scatter matrix as well since the sensors seemed to have a linear relationship.
Now that we've explored the data, it's time to use it to train a regression model that uses the features we've identified as potentially predictive to predict the targets. The first thing we need to do is to seperate the train data into train and validation. Since the it is multivariate analysis and there 3 different target variables and one is dependant on the other we need to use them while training, so train_test_split will not work here. So we will randomly split the dataframe into 80/20.
Beginning with a naive linear regression model.
val = log_scale.sample(frac = 0.2)
train = log_scale.drop(val.index)
print(f'train size: {len(train)}, val size: {len(val)}')
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
'''
Linear regression first. Although we saw some non linear relationship between features, it is a good starting place to
get an idea and quantify some metrics before moving to more complex models'''
model_dict = {'target_nitrogen_oxides':LinearRegression(), #Dictionary for 3 different targets with linearregression
'target_benzene': LinearRegression(),
'target_carbon_monoxide': LinearRegression()}
for idx_target, target in enumerate(model_dict,1): #iterate over the dictionary and split the target and predictors
X_train = train.iloc[:, :-3]
y_train = train.iloc[:, -idx_target]
X_test = val.iloc[:, :-3]
y_test = val.iloc[:, -idx_target]
model_dict[target].fit(X_train, y_train)
preds = model_dict[target].predict(X_test)
rmse = mean_squared_error(y_test,preds)
r2 = r2_score(y_test,preds)
print(f'{target}: rmse: {np.sqrt(rmse).round(2)}, r2: {r2}') #results
from sklearn.tree import DecisionTreeRegressor
'''Decision Trees can learn with non linear data'''
model_dict = {'target_nitrogen_oxides':DecisionTreeRegressor(),
'target_benzene': DecisionTreeRegressor(),
'target_carbon_monoxide': DecisionTreeRegressor()}
for idx_target, target in enumerate(model_dict,1): #iterate over the dictionary and split the target and predictors
X_train = train.iloc[:, :-3]
y_train = train.iloc[:, -idx_target]
X_test = val.iloc[:, :-3]
y_test = val.iloc[:, -idx_target]
model_dict[target].fit(X_train, y_train)
preds = model_dict[target].predict(X_test)
rmse = mean_squared_error(y_test,preds)
r2 = r2_score(y_test,preds)
print(f'{target}: rmse: {np.sqrt(rmse).round(2)}, r2: {r2}')
from sklearn.ensemble import RandomForestRegressor
'''An even better tree model'''
model_dict = {'target_nitrogen_oxides':RandomForestRegressor(),
'target_benzene': RandomForestRegressor(),
'target_carbon_monoxide': RandomForestRegressor()}
for idx_target, target in enumerate(model_dict,1): #iterate over the dictionary and split the target and predictors
X_train = train.iloc[:, :-3]
y_train = train.iloc[:, -idx_target]
X_test = val.iloc[:, :-3]
y_test = val.iloc[:, -idx_target]
model_dict[target].fit(X_train, y_train)
preds = model_dict[target].predict(X_test)
rmse = mean_squared_error(y_test,preds)
r2 = r2_score(y_test,preds)
print(f'{target}: rmse: {np.sqrt(rmse).round(2)}, r2: {r2}')
So now we've quantified the ability of our model to predict the number of pollutants. But target_nitrogen_oxides having 1.0 is too good to be true and all things points to overfitting. It definitely has some predictive power, but we can probably do better!
Until now we had an arbitrary split for the training and validation. While random, the prediction capabilities could widely vary for different sample. Instead of relying on one such sample, lets implement kfold validation. CatBoost is a high-performance open source library for gradient boosting on decision trees and could be slightly or even much better than sklearn implementations.
from sklearn.model_selection import TimeSeriesSplit
from catboost import CatBoostRegressor
'''CatBoostRegressor serves as the state of the art for many regression problems'''
SEED = 42 #random seed
model_dict = {'target_nitrogen_oxides':CatBoostRegressor(random_state=SEED, #model dictionary
learning_rate=0.05,
depth=8,
verbose=False),
'target_benzene': CatBoostRegressor(random_state=SEED,
learning_rate=0.05,
depth=8,
verbose=False),
'target_carbon_monoxide': CatBoostRegressor(random_state=SEED,
learning_rate=0.05,
depth=8,
verbose=False)}
splits = 5
kfold = TimeSeriesSplit(n_splits=splits)
for idx_fold, (train_idx,val_idx) in enumerate(kfold.split(train)): #iterate over the folds and create a random train and val variables
k_train, k_val = scaled.iloc[train_idx], scaled.iloc[val_idx]
for idx_target, target in enumerate(model_dict,1): #iterate over model dictionary
x_train = k_train.iloc[:,:-3]
x_val = k_val.iloc[:,:-3]
y_train = k_train.iloc[:,-idx_target].astype(float)
y_val = k_val.iloc[:,-idx_target].astype(float)
model_dict[target].fit(x_train,y_train) #train
preds = model_dict[target].predict(x_val) #predict
r2 = r2_score(y_val,preds) #metrics
rmse = mean_squared_error(y_val, np.expm1(preds)) ** (1/2)
print(f'fold:{idx_fold} ; {target}: rmse: {rmse} and r2: {r2}')
print('\n')
The results look quite similar to linear regression with slight improvements in certain folds, tut the r2 for target_nitrogen_oxide still seems to be very high, whereas the other two is modest. We can do some fine tuning by experimenting with the hyperparameters of the catboostregressor().
Quite decent for a minimal feature engineering given that the top sumbission for the kaggle challenge was ~0.22 when this was made and the test results were unknown. Let's try to streamline the process by using high level libraries like pycaret or AutoML.
Pycaret is a great tool for machine learning and data anaylsis. It's pretty cool how high level libraries for machine learning have matured in the recent years. We can create complex regression model with just few lines of code.
We need to train and evaluate for each indiviudal targets. First, setting up the model with the predictor features and only target variable as carbon monoxide.
from pycaret.regression import setup, compare_models, blend_models, finalize_model,tune_model, predict_model, plot_model
setup_target_carbon_monoxide = setup(data = log_scale, target = 'target_carbon_monoxide', session_id=24,
# Ignore features target_benzene and target_nitrogen_oxides from the experiment
ignore_features = ['target_benzene','target_nitrogen_oxides'],
silent = True
)
#Train all the regression models available in the library and choose the best ones.
best_target_carbon_monoxide = compare_models(sort = 'RMSLE', n_select = 3)
Catboost regressor seems to be the best one in pycaret implementation as well.
#Instead of choosing one, lets ensemble top 10 models
model_target_carbon_monoxide = blend_models(best_target_carbon_monoxide)
#Hyperparameter fine tuning
Tuned_model_target_carbon_monoxide = tune_model(model_target_carbon_monoxide )
#Plotting the final ensembled models residuals
plot_model(Tuned_model_target_carbon_monoxide)
#plotting the fitted line for the prediction
predict_model(Tuned_model_target_carbon_monoxide)
plot_model(Tuned_model_target_carbon_monoxide, plot='error')
It is as simple to train a ML model. A r^2 of 0.925 is certainly much better than the best catboost model of 0.4 from sklearns implementation.
Predicting on the unseen test data. Needs some manipulation before getting the labels.
#applying log scaling to test data and removing the target variable columns before predicting with the trained model
test_eng = test_eng.drop(['target_carbon_monoxide','target_benzene','target_nitrogen_oxides'], axis=1)
log_scale_test = test_eng.copy()
for t in log_scale_test.columns:
log_scale_test[t] = log_scaling(log_scale_test[t])
test_preds = predict_model(Tuned_model_target_carbon_monoxide,log_scale_test)
test_preds.head()
Alright, it works. Instead of repeating the 5 cells above for every target variable, we will take this oppurtunity to create a single object with pycaret where users can simply input the data and the AirPollutionPrediction() object would train and return the best model. We will also include a predictor function within the object to make it easier for future use.
from pycaret.regression import *
class AirPollutionPrediction():
def __init__(self,train,test): #initialize input dataframes
self.train = train
self.test = test
def pycaret_train(self,train,target,ignore_features): #training method.
setup(data=train,target=target, session_id=24,
ignore_features = ignore_features,
silent = True)
best = compare_models(sort = 'RMSLE', n_select = 3) #trains several models and returns the best 10 models
blended = blend_models(estimator_list= best) #ensembles the best models.
tuned = tune_model(blended ) #hyperparameter fine tunining
pred_holdout = predict_model(blended) #holdout validation
final_model = finalize_model(blended)
name = f'airpollutant_{target}'
save_model(final_model, name) #save model with name airpollutant_{labeltype}
return final_model
def predict(self,test,model): #calls model and predict on unseen test data and plot the same
pred_esb = predict_model(model, test)
re = pred_esb['Label']
return re
def run(self,train,target,test): #run method
result = pd.DataFrame()
targets = ['target_nitrogen_oxides','target_carbon_monoxide','target_benzene']
for target in targets: #call the train and test functions for dfifferent target variables.
if target == 'target_nitrogen_oxides':
ignores = [targets[1],targets[2]]
model = self.pycaret_train(train,target,ignores)
result['target_nitrogen_oxides'] = self.predict(test,model)
elif target == 'target_carbon_monoxide':
ignores = [targets[0],targets[2]]
model = self.pycaret_train(train,target,ignores)
result['target_carbon_monoxide'] = self.predict(test,model)
elif target == 'target_benzene':
ignores = [targets[0],targets[1]]
model = self.pycaret_train(train,target,ignores)
result['benzene'] = self.predict(test,model)
return result
def main(self): #main function. Takes input to begin training with y and pass with n
print('Train the model: y or n?')
target = input()
while target not in ('y','n'):
print('Enter Valid choice')
target = input()
print('Train the model: y or n?')
if target == 'y':
predictions = self.run(self.train,target,self.test)
predictions.to_csv('results.csv') #save test predictions to csv
else:
pass
if __name__ == 'main':
main()
pollution = AirPollutionPrediction(train = log_scale, test= log_scale_test)
pollution.main()
test_results = pd.read_csv('./results.csv', index_col=[0])
final = [log_scale_test, test_results]
final = pd.concat(final,axis=1, join='inner')
final.head()
Okay,Now we can train and call the best model to predict input values easily. The test set metrics could not be quantified because the kaggle challenge was ongoing. I will try to update the quantification if or when the results are released. Thanks for your time.