The weekly sales data is from Kaggle: https://www.kaggle.com/balaganeshm/clustering
It contains the weekly sales volume of 811 products over a whole year. The dataframe holds both normalized and un-normalized versions of the data, which makes it convenient for analysis: we can try to cluster the products into different classes and see what information we can extract.
import numpy as np # linear algebra
import pandas as pd # data processing
from matplotlib import style
df = pd.read_csv('../input/clustering/clustering/data/sales_transactions_dataset_weekly.csv')
df.head() #show first 5 rows
df.info() #show overall stats of the dataframe
print(list(df.columns)) #show the different columns
df.isnull().any().sum() #checking for null values
There are no NULL values. The dataframe has 107 columns: product_code (which serves as the index), the 52 raw weekly sales columns, MIN and MAX (the minimum and maximum number of units sold over the 52 weeks), and 52 columns that repeat the weekly sales normalized with a min-max strategy.
There are no categorical columns apart from product_code, which is just an identifier and can be dropped. The rest are all numerical, and since the data is already normalized we can work with it directly.
raw = df.iloc[:,1:53] #raw un-normalized columns
raw.describe()
normal = df.iloc[:,55:]
normal.describe() #normalized columns
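As a quick sanity check, the sketch below verifies that the normalized columns can be reconstructed from the raw ones, under the assumption that MIN and MAX sit at iloc positions 53 and 54 and that the normalization is row-wise min-max:
# sanity-check sketch: reconstruct the normalized columns from the raw ones
# (assumes row-wise min-max scaling with the MIN/MAX columns at iloc positions 53 and 54; adjust if the layout differs)
mins = df.iloc[:, 53].values.reshape(-1, 1)
maxs = df.iloc[:, 54].values.reshape(-1, 1)
reconstructed = (raw.values - mins) / (maxs - mins + 1e-9)  # small epsilon guards against MAX == MIN
print(np.abs(reconstructed - normal.values).max())          # should be close to 0 if the assumption holds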
It is not possible to visualize high-dimensional data directly, so we will apply PCA to the normalized data to project it into a lower-dimensional space and look at the explained variance. Both 2- and 3-component projections are visualized.
What is PCA?
Principal component analysis (PCA) is a technique for reducing the dimensionality of high dimensional datasets, increasing interpretability but at the same time minimizing information loss. It does so by creating new uncorrelated variables that successively maximize variance.
In our case the data has shape (811, 52). Although it is feasible to train clustering models on n-dimensional data, it is not feasible to visualize the resulting clusters; the best we can do is 2D or 3D. So we will compute PCA projections with 2 and 3 components and visualize them using plotly.
'''
compute PCA on the normalized data and store the lower-dimensional projections in new dataframes
'''
from sklearn.decomposition import PCA
import matplotlib.pyplot as plt
%matplotlib inline
pca_2 = PCA(n_components=2)
pca_3 = PCA(n_components=3)
principle_components_2 = pca_2.fit_transform(normal)
principle_components_3 = pca_3.fit_transform(normal)
pca_data_2 = pd.DataFrame(data=principle_components_2, columns = ['principle component 1', 'principle component 2'])
pca_data_3 = pd.DataFrame(data=principle_components_3, columns = ['principle component 1', 'principle component 2', 'principle component 3'])
print(f'variance of 2 dimensional PCA: {pca_2.explained_variance_ratio_}')
print(f'variance of 3 dimensional PCA: {pca_3.explained_variance_ratio_}')
plt.scatter(x = pca_data_2['principle component 1'], y = pca_data_2['principle component 2'])
plt.xlabel('principle component 1')
plt.ylabel('principle component 2')
plt.title('PCA for normalized weekly sales data')
plt.show()
import plotly.express as px
# Creating plot
px.scatter_3d(x = pca_data_3['principle component 1'], y = pca_data_3['principle component 2'],
z = pca_data_3['principle component 3'])
Okay, but selecting 2 or 3 components is arbitrary; we only do that because it is easier for us to visualize, and it might not capture most of the variance in the dataset. Sometimes it takes more than 3 components. So let's visualize the cumulative variance captured as the number of components increases.
'''
plot the cumulative explained variance against the number of components
'''
pca = PCA()
components = pca.fit_transform(normal)
exp_var_cumul = np.cumsum(pca.explained_variance_ratio_)
px.area(
x=range(1, exp_var_cumul.shape[0] + 1),
y=exp_var_cumul,
labels={"x": "# Components", "y": "Explained Variance"})
A commonly accepted threshold is to keep enough components to explain 90% of the variance of the dataset. In our case about 40 components capture 90% of the variance. But since the dataset is not that large and the remaining 12 components barely affect the training complexity, we will keep all 52 features for training anyway.
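For reference, the small sketch below reads the same number programmatically from the exp_var_cumul array computed above:
# sketch: number of components needed to reach 90% cumulative explained variance
n_components_90 = int(np.argmax(exp_var_cumul >= 0.90)) + 1
print(f'{n_components_90} components explain {exp_var_cumul[n_components_90 - 1]:.1%} of the variance')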
t-SNE is another, more modern approach to dimensionality reduction. Seriously, PCA dates back to 1933!
t-SNE is inherently different from PCA. PCA constructs new components along the directions of maximum variance and we keep only the components that clear a variance threshold, whereas t-SNE preserves the closeness of data points, balancing local neighbourhood structure against global structure.
There are two important hyperparameters for t-SNE: perplexity, which balances attention between local and global aspects of the data and typically ranges from 5 to 50 depending on the size of the dataset, and the number of iterations, which is usually tuned to keep the computation tractable.
from sklearn.manifold import TSNE
tsne_2 = TSNE(n_components=2,perplexity=5).fit_transform(normal)
tsne_3 = TSNE(n_components=3,perplexity=5).fit_transform(normal)
tsne_data_2 = pd.DataFrame(data=tsne_2, columns = ['Embedding 1', 'Embedding 2'])
tsne_data_3 = pd.DataFrame(data=tsne_3, columns = ['Embedding 1', 'Embedding 2', 'Embedding 3'])
fig, ax = plt.subplots(1,2,figsize=(15,8))
ax[0].scatter(x = pca_data_2['principle component 1'], y = pca_data_2['principle component 2'])
ax[0].set_xlabel('principle component 1')
ax[0].set_ylabel('principle component 2')
ax[0].set_title('principle component analysis')
ax[1].scatter(x = tsne_data_2['Embedding 1'], y = tsne_data_2['Embedding 2'])
ax[1].set_xlabel('Embedding 1')
ax[1].set_ylabel('Embedding 2')
ax[1].set_title('t-SNE analysis')
plt.show()
from plotly.subplots import make_subplots
px.scatter_3d(x = tsne_data_3['Embedding 1'], y = tsne_data_3['Embedding 2'],
z = tsne_data_3['Embedding 3'])
The number of clusters should be small enough to be meaningful (otherwise there is no real clustering) but large enough that the inertia becomes reasonably small. Inertia is the sum of squared distances between each data point and the centre of the cluster it is assigned to.
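To make that definition concrete, here is a minimal sketch on toy data (not the sales dataset) showing that scikit-learn's inertia_ attribute is exactly this sum of squared distances:
# sketch: inertia is the sum of squared distances from each point to its assigned cluster centre
import numpy as np
from sklearn.cluster import KMeans
X_toy = np.random.rand(100, 5)                      # toy data for illustration only
km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X_toy)
assigned_centres = km.cluster_centers_[km.labels_]  # centre assigned to each point
manual_inertia = np.sum((X_toy - assigned_centres) ** 2)
print(manual_inertia, km.inertia_)                  # the two values should match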
But we have to do some analysis before settling on the optimal cluster count (k). First we will fit K-means for every cluster count from 1 to 50, expecting the optimum to lie somewhere between 5 and 10.
from sklearn.cluster import KMeans
from sklearn.metrics.cluster import contingency_matrix
'''
cluster function: takes the number of clusters and a dataframe as input,
fits a KMeans model and returns the fitted model together with the predicted cluster labels.
'''
def cluster(nclusters, data):
    kmeans = KMeans(n_clusters=nclusters)  # initialise KMeans with the requested number of clusters
    kmeans.fit(data)                       # fit on the normalized sales data
    Z = kmeans.predict(data)               # cluster label for every product
    return kmeans, Z
max_cluster_size = 50
inertias = np.zeros(max_cluster_size)  # array that will be filled with the computed inertia for each k
for i in range(1, max_cluster_size):
    kmeans, Z = cluster(i, normal)
    inertias[i] = kmeans.inertia_
import plotly.graph_objects as go
'''
plot for the elbow method to find the optimal k
'''
fig = go.Figure(
    data=[go.Scatter(x=list(range(1, max_cluster_size)),
                     y=list(inertias[1:]),
                     mode='lines+markers',
                     name='inertia')],
    layout=go.Layout(title='Elbow for KMeans clustering',
                     xaxis={'title': 'Number of clusters'},
                     yaxis={'title': 'Inertia'}))
fig.show()
In cluster analysis, the elbow method is a heuristic used in determining the number of clusters in a data set. The method consists of plotting the explained variation as a function of the number of clusters, and picking the elbow of the curve as the number of clusters to use.
If we look at the elbow graph for our data, we can see that the number of clusters could be anywhere between 3 and 6, and we cannot be sure which of those is optimal. The elbow method works well when there is a distinct change between two points: for example, if the drop between 3 and 5 were steep, we could say 4 is the optimal k. When it is not, there is a workaround: instead of the vanilla inertia we can compute a scaled inertia. Scaling here means normalizing by the inertia at k = 1 and adding a regularization term alpha times k. Alpha should be a small number; the larger it is, the more heavily additional clusters are penalized (with a very large alpha the chosen k collapses to 1). To pick the cluster size, we simply take the argmin of the scaled inertia over all k.
$$ \text{Scaled Inertia}_k = \frac{\text{Inertia}_k}{\text{Inertia}_{k=1}} + \alpha \, k $$
def AutoKmeans(data, k, alpha_k=0.02):
    inertia_o = np.square(data.values - data.values.mean(axis=0)).sum()  # inertia for k = 1 (total sum of squares)
    kmeans = KMeans(n_clusters=k, random_state=0).fit(data)
    scaled_inertia = kmeans.inertia_ / inertia_o + alpha_k * k           # normalized inertia plus a penalty of alpha per extra cluster
    return scaled_inertia
def chooseBestKforKMeans(data, k_range):
ans = []
for k in k_range:
scaled_inertia = AutoKmeans(data, k)
ans.append((k, scaled_inertia))
results = pd.DataFrame(ans, columns = ['k','Scaled Inertia']).set_index('k')
return results
res_df = chooseBestKforKMeans(normal,range(1,50))
print(f'Best k for the data: {res_df["Scaled Inertia"].idxmin()}')
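Optionally, the scaled inertia curve itself can be plotted to see where the minimum lies, using the res_df computed above:
fig = px.line(x=res_df.index, y=res_df['Scaled Inertia'], title='Scaled inertia vs number of clusters')
fig.update_xaxes(title_text='Number of clusters (k)')
fig.update_yaxes(title_text='Scaled Inertia')
fig.show()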
Okay, so we found that 4 is the optimal cluster count using the scaled inertia method. Now we can train models with this cluster count and get their predictions.
n_clusters = 4
model, Z = cluster(n_clusters, normal)
model_pca, Z_pca = cluster(n_clusters, pca_data_2)
model_tsne, Z_tsne = cluster(n_clusters, tsne_data_2)
model_pca_3, Z_pca_3 = cluster(n_clusters, pca_data_3)
model_tsne_3, Z_tsne_3 = cluster(n_clusters, tsne_data_3)
Concatenate the predicted cluster labels with the PCA and t-SNE dataframes so we can visualize the clusters in low dimensions.
pca_data_2['class_normal'] = Z
pca_data_2['class_pca'] = Z_pca
pca_data_3['class_normal'] = Z
pca_data_3['class_pca'] = Z_pca_3
tsne_data_2['class_normal'] = Z
tsne_data_2['class_tsne'] = Z_tsne
tsne_data_3['class_normal'] = Z
tsne_data_3['class_tsne'] = Z_tsne_3
classes = ['1','2','3','4']
fig, ax = plt.subplots(1,2,figsize=(15,8))
ax[0].scatter(x = pca_data_2['principle component 1'], y = pca_data_2['principle component 2'],
c = pca_data_2['class_normal'], label=pca_data_2['class_normal'])
ax[0].set_xlabel('principle component 1')
ax[0].set_ylabel('principle component 2')
ax[0].set_title('Kmeans before PCA')
ax[1].scatter(x = pca_data_2['principle component 1'], y = pca_data_2['principle component 2'],
c = pca_data_2['class_pca'], label=pca_data_2['class_pca'])
ax[1].set_xlabel('principle component 1')
ax[1].set_ylabel('principle component 2')
ax[1].set_title('Kmeans with PCA')
plt.show()
px.scatter_3d(x = pca_data_3['principle component 1'], y = pca_data_3['principle component 2'],
z = pca_data_3['principle component 3'], color=pca_data_3['class_pca'])
Applying K-means to the PCA projection and to the raw normalized data gives very similar clusters. As we can see from the 2D and 3D plots, there are some outliers in both cases, but overall the results look good.
classes = ['1','2','3','4']
fig, ax = plt.subplots(1,2,figsize=(15,8)) #subplot for clusters found on the raw normalized data
ax[0].scatter(x = tsne_data_2['Embedding 1'], y = tsne_data_2['Embedding 2'],
c = tsne_data_2['class_normal'], label=tsne_data_2['class_normal'])
ax[0].set_xlabel('Embedding 1')
ax[0].set_ylabel('Embedding 2')
ax[0].set_title('Kmeans before t-SNE')
ax[1].scatter(x = tsne_data_2['Embedding 1'], y = tsne_data_2['Embedding 2'], #subplot for clusters found on the t-SNE embedding
c = tsne_data_2['class_tsne'], label=tsne_data_2['class_tsne'])
ax[1].set_xlabel('Embedding 1')
ax[1].set_ylabel('Embedding 2')
ax[1].set_title('Kmeans with t-SNE')
plt.show()
px.scatter_3d(x = tsne_data_3['Embedding 1'], y = tsne_data_3['Embedding 2'],
z = tsne_data_3['Embedding 3'], color=tsne_data_3['class_tsne'])
Both dimensionality reduction techniques seem to work very well. But does that mean they are correct? In one of the Stack Overflow answers here, the writer points out that t-SNE clusters must be interpreted carefully before drawing conclusions, as they can be misleading:
While clustering after t-SNE will sometimes (often?) work, you will never know whether the "clusters" you find are real, or just artifacts of t-SNE. You may just be seeing shapes in clouds.
Even though the visuals give some idea of the models' clustering capability, we need to quantify it to make it more trustworthy.
Quantifying results with metrics is easier in supervised learning; given the lack of labels in unsupervised learning, the options are limited. One of the most common metrics for clustering problems is the Silhouette Coefficient. It is given by,
$$ S = \frac{b - a}{\max(a, b)} $$
a = the mean distance between a data point and the other points in its own cluster (intra-cluster distance).
b = the mean distance between a data point and the points in the nearest neighbouring cluster (inter-cluster distance).
The Silhouette Coefficient ranges from -1 to 1: values near -1 mean points have likely been assigned to the wrong clusters, values near 0 mean the clusters overlap and are barely separated, and values near 1 mean the clusters are well separated.
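To connect the formula with the implementation, here is a small sketch on toy data (not the sales dataset) that computes a and b by hand for one point and compares the result with sklearn's silhouette_samples:
# sketch: silhouette coefficient of a single point computed by hand vs sklearn
import numpy as np
from sklearn.metrics import silhouette_samples, pairwise_distances
X_toy = np.array([[0.0, 0.0], [0.1, 0.2], [0.2, 0.1],   # cluster 0
                  [5.0, 5.0], [5.1, 4.9], [4.9, 5.2]])  # cluster 1
labels = np.array([0, 0, 0, 1, 1, 1])
D = pairwise_distances(X_toy)
i = 0                                    # point of interest
same = labels == labels[i]
a = D[i, same].sum() / (same.sum() - 1)  # mean distance to the other points in its own cluster
b = D[i, ~same].mean()                   # mean distance to the (only) other cluster
print((b - a) / max(a, b), silhouette_samples(X_toy, labels)[i])  # the two values should match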
from sklearn.metrics import silhouette_score, silhouette_samples
# class label columns were appended to the PCA/t-SNE dataframes above, so score only the embedding columns
normal_score = silhouette_score(normal, model.labels_, metric='euclidean')
pca_score_2 = silhouette_score(pca_data_2[['principle component 1', 'principle component 2']], model_pca.labels_, metric='euclidean')
tsne_score_2 = silhouette_score(tsne_data_2[['Embedding 1', 'Embedding 2']], model_tsne.labels_, metric='euclidean')
pca_score_3 = silhouette_score(pca_data_3[['principle component 1', 'principle component 2', 'principle component 3']], model_pca_3.labels_, metric='euclidean')
tsne_score_3 = silhouette_score(tsne_data_3[['Embedding 1', 'Embedding 2', 'Embedding 3']], model_tsne_3.labels_, metric='euclidean')
nl = '\n'
print(f'KMeans Non Engineered Silhouette Score: {nl} {normal_score}')
print('\n')
print(f'KMeans PCA Scaled Silhouette Score; {nl} 2 components: {pca_score_2}; {nl} 3 components: {pca_score_3}')
print('\n')
print(f'KMeans t-SNE Scaled Silhouette Score;{nl} 2 embeddings: {tsne_score_2};{nl} 3 embeddings: {tsne_score_3}')
This somewhat contradicts the impression given by the t-SNE visualizations. Interestingly, the silhouette score of the model trained on the full normalized dataframe does not match what the low-dimensional visuals suggest, which is exactly why quantification is needed. So for our data, the combination that works best is K-means + PCA.
Why do all this anyway? We can extract some useful information from the clustered data. First let's concatenate the predicted labels back onto the raw data and see if there are any patterns.
inf = raw.copy()
inf['class'] = Z
sales_group = inf.groupby('class').sum().astype(int).reset_index() #group by the predicted cluster and calculate the total sales
weekly_sales = sales_group.drop('class',axis=1)
Say we want to see the weekly sales pattern of a product. We do not have product names, only arbitrary codes like P1 and P2, but let's assume P1 is ice cream and P2 is pencils. They would obviously have different sales patterns and would likely be clustered into different groups. We want to see in which week of the year they sell the most so we can manage stock.
A typical plot of the raw data looks like this: very messy and illegible.
fig = px.line(raw.T, title='Messy data representation of weekly sales')
fig.update_xaxes(title_text='Week')
fig.update_yaxes(title_text='Sales Count')
fig.show()
Now let's take a look at our clustered and grouped data. So much better. When a new product with similar characteristics is added to the data, we can easily predict which cluster it belongs to and therefore at which time of the year it is likely to sell the most (a sketch of this follows the plot below). Of course the data is not elaborate enough to fully support this kind of pattern analysis, since it only contains weekly sales counts from a single year; it could be made more robust with additional features such as product categories, seasonal data, or data from more years.
Two of the product clusters seem to fall behind in sales as the year progresses (I wonder what those could be?), while the other two stay fairly static at low volumes.
fig = px.line(weekly_sales.T, title='After applying clustering to raw data')
fig.update_xaxes(title_text='Week')
fig.update_yaxes(title_text='Sales Count')
fig.show()
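As hinted above, once the K-means model is trained we can assign a brand-new product to one of the existing clusters from its min-max normalized weekly sales. A minimal sketch, where new_product is a made-up array of 52 weekly sales counts used purely for illustration:
# sketch: assign a hypothetical new product to an existing cluster
new_product = np.random.randint(0, 60, size=52).astype(float)  # made-up weekly sales counts
new_product_norm = (new_product - new_product.min()) / (new_product.max() - new_product.min())
predicted_cluster = model.predict(new_product_norm.reshape(1, -1))[0]
print(f'The new product falls into cluster {predicted_cluster}')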
'''
Sum the weekly data in groups of 4 columns, then group by the predicted cluster and plot the monthly sales
'''
monthly = inf.iloc[:,:52].groupby((np.arange(len(inf.iloc[:,:52].columns)) // 4) + 1, axis=1).sum()
monthly['class'] = inf['class']
monthly_group = monthly.groupby('class').sum().astype(int).reset_index()
monthly_sales = monthly_group.drop('class',axis=1)
fig = px.line(monthly_sales.T, title='Monthly sales of different clusters')
fig.update_xaxes(title_text='Month')
fig.update_yaxes(title_text='Sales Count')
fig.show()
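Finally, to tie this back to the stock-management question, here is a quick sketch that reads off the peak sales week and peak four-week period for each cluster from the grouped dataframes:
# sketch: peak sales week and peak 4-week period per cluster
peak_week = weekly_sales.idxmax(axis=1)    # column name of the best-selling week per cluster
peak_month = monthly_sales.idxmax(axis=1)  # best-selling 4-week period per cluster
for c in weekly_sales.index:
    print(f'Cluster {c}: peak week = {peak_week[c]}, peak period = {peak_month[c]}')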
Okay, so we went through two different dimensionality reduction techniques and one clustering technique in the form of K-means. There are several other clustering algorithms such as DBSCAN and agglomerative clustering, but this notebook is already quite long and the results are decent for the combination of K-means and PCA. Thank you for your attention and have a good day!