CLUSTERING ANALYSIS WITH VARIOUS DIMENSIONALITY REDUCTION TECHNIQUES AND KMEANS ALGORITHM ON WEEKLY SALES DATA.

The weekly sales data is from kaggle: https://www.kaggle.com/balaganeshm/clustering

It contains the weekly sales volume of 811 products over a full year (52 weeks). Both the raw and the min-max-normalized values live in the same dataframe, which makes it convenient to explore: we will try to cluster the products into different classes and see what information we can extract.

LOADING AND UNDERSTANDING THE DATA

In [1]:
import numpy as np # linear algebra
import pandas as pd # data processing
from matplotlib import style

df = pd.read_csv('../input/clustering/clustering/data/sales_transactions_dataset_weekly.csv')
In [2]:
df.head() #show first 5 rows
Out[2]:
product_code w0 w1 w2 w3 w4 w5 w6 w7 w8 ... normalized_42 normalized_43 normalized_44 normalized_45 normalized_46 normalized_47 normalized_48 normalized_49 normalized_50 normalized_51
0 P1 11 12 10 8 13 12 14 21 6 ... 0.06 0.22 0.28 0.39 0.50 0.00 0.22 0.17 0.11 0.39
1 P2 7 6 3 2 7 1 6 3 3 ... 0.20 0.40 0.50 0.10 0.10 0.40 0.50 0.10 0.60 0.00
2 P3 7 11 8 9 10 8 7 13 12 ... 0.27 1.00 0.18 0.18 0.36 0.45 1.00 0.45 0.45 0.36
3 P4 12 8 13 5 9 6 9 13 13 ... 0.41 0.47 0.06 0.12 0.24 0.35 0.71 0.35 0.29 0.35
4 P5 8 5 13 11 6 7 9 14 9 ... 0.27 0.53 0.27 0.60 0.20 0.20 0.13 0.53 0.33 0.40

5 rows × 107 columns

In [3]:
df.info() #show overall stats of the dataframe
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 811 entries, 0 to 810
Columns: 107 entries, product_code to normalized_51
dtypes: float64(52), int64(54), object(1)
memory usage: 678.1+ KB
In [4]:
print(list(df.columns))   #show the different columns
['product_code', 'w0', 'w1', 'w2', 'w3', 'w4', 'w5', 'w6', 'w7', 'w8', 'w9', 'w10', 'w11', 'w12', 'w13', 'w14', 'w15', 'w16', 'w17', 'w18', 'w19', 'w20', 'w21', 'w22', 'w23', 'w24', 'w25', 'w26', 'w27', 'w28', 'w29', 'w30', 'w31', 'w32', 'w33', 'w34', 'w35', 'w36', 'w37', 'w38', 'w39', 'w40', 'w41', 'w42', 'w43', 'w44', 'w45', 'w46', 'w47', 'w48', 'w49', 'w50', 'w51', 'min', 'max', 'normalized_0', 'normalized_1', 'normalized_2', 'normalized_3', 'normalized_4', 'normalized_5', 'normalized_6', 'normalized_7', 'normalized_8', 'normalized_9', 'normalized_10', 'normalized_11', 'normalized_12', 'normalized_13', 'normalized_14', 'normalized_15', 'normalized_16', 'normalized_17', 'normalized_18', 'normalized_19', 'normalized_20', 'normalized_21', 'normalized_22', 'normalized_23', 'normalized_24', 'normalized_25', 'normalized_26', 'normalized_27', 'normalized_28', 'normalized_29', 'normalized_30', 'normalized_31', 'normalized_32', 'normalized_33', 'normalized_34', 'normalized_35', 'normalized_36', 'normalized_37', 'normalized_38', 'normalized_39', 'normalized_40', 'normalized_41', 'normalized_42', 'normalized_43', 'normalized_44', 'normalized_45', 'normalized_46', 'normalized_47', 'normalized_48', 'normalized_49', 'normalized_50', 'normalized_51']
In [5]:
df.isnull().any().sum()  #checking for null values
Out[5]:
0

THOUGHTS:

No NULL values. There are 107 columns: product_code identifies each product, min and max are the minimum and maximum weekly sales over the 52 weeks, and the normalized_* columns repeat the weekly sales after min-max normalization.

There are no categorical columns apart from product_code, which only serves as an identifier and can be dropped. The rest are numerical, and since the data already comes with a normalized version we can work with that directly.
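As a quick sanity check (a small sketch, not part of the original notebook), we can verify the assumption that the normalized_* columns were produced by min-max scaling the weekly columns with the min and max columns:

# Sanity check (assumption): normalized_i == (w_i - min) / (max - min), rounded to 2 decimals
week_cols = [f'w{i}' for i in range(52)]
norm_cols = [f'normalized_{i}' for i in range(52)]

rng = (df['max'] - df['min']).replace(0, np.nan)          # avoid division by zero for flat products
recomputed = df[week_cols].sub(df['min'], axis=0).div(rng, axis=0).round(2)
recomputed.columns = norm_cols

# maximum absolute difference; should be (close to) zero up to rounding
print((recomputed - df[norm_cols]).abs().max().max())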

In [6]:
raw = df.iloc[:,1:53]       #raw un-normalized columns
raw.describe()
Out[6]:
w0 w1 w2 w3 w4 w5 w6 w7 w8 w9 ... w42 w43 w44 w45 w46 w47 w48 w49 w50 w51
count 811.000000 811.000000 811.000000 811.000000 811.000000 811.000000 811.000000 811.000000 811.000000 811.000000 ... 811.000000 811.000000 811.000000 811.000000 811.000000 811.000000 811.000000 811.000000 811.000000 811.000000
mean 8.902589 9.129470 9.389642 9.717633 9.574599 9.466091 9.720099 9.585697 9.784217 9.681874 ... 8.394575 8.318126 8.434032 8.556104 8.720099 8.670777 8.674476 8.895191 8.861899 8.889026
std 12.067163 12.564766 13.045073 13.553294 13.095765 12.823195 13.347375 13.049138 13.550237 13.137916 ... 11.348777 11.250455 11.223499 11.382041 11.621684 11.435870 11.222996 10.941375 10.492710 9.558011
min 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 ... 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000
25% 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 ... 0.000000 0.000000 0.000000 1.000000 1.000000 0.000000 1.000000 1.000000 1.000000 1.000000
50% 3.000000 3.000000 3.000000 4.000000 4.000000 3.000000 4.000000 4.000000 4.000000 4.000000 ... 4.000000 4.000000 4.000000 4.000000 4.000000 4.000000 4.000000 4.000000 5.000000 5.000000
75% 12.000000 12.000000 12.000000 13.000000 13.000000 12.500000 13.000000 12.500000 13.000000 13.000000 ... 10.000000 11.000000 11.000000 11.000000 11.000000 12.000000 12.000000 12.000000 13.000000 14.000000
max 54.000000 53.000000 56.000000 59.000000 61.000000 52.000000 56.000000 62.000000 63.000000 52.000000 ... 52.000000 50.000000 46.000000 46.000000 55.000000 49.000000 50.000000 52.000000 57.000000 73.000000

8 rows × 52 columns

In [7]:
normal = df.iloc[:,55:]
normal.describe()              #normalized columns
Out[7]:
normalized_0 normalized_1 normalized_2 normalized_3 normalized_4 normalized_5 normalized_6 normalized_7 normalized_8 normalized_9 ... normalized_42 normalized_43 normalized_44 normalized_45 normalized_46 normalized_47 normalized_48 normalized_49 normalized_50 normalized_51
count 811.000000 811.000000 811.000000 811.000000 811.000000 811.000000 811.000000 811.000000 811.00000 811.000000 ... 811.000000 811.000000 811.000000 811.000000 811.000000 811.000000 811.00000 811.000000 811.000000 811.000000
mean 0.289396 0.299100 0.306732 0.319852 0.326905 0.319420 0.332848 0.326572 0.32434 0.326843 ... 0.299149 0.287571 0.304846 0.316017 0.334760 0.314636 0.33815 0.358903 0.373009 0.427941
std 0.266307 0.281343 0.284234 0.296498 0.297291 0.292765 0.301855 0.298986 0.29320 0.292093 ... 0.266993 0.256630 0.263396 0.262226 0.275203 0.266029 0.27569 0.286665 0.295197 0.342360
min 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.00000 0.000000 ... 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.00000 0.000000 0.000000 0.000000
25% 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.00000 0.000000 ... 0.000000 0.000000 0.000000 0.020000 0.085000 0.000000 0.10500 0.100000 0.110000 0.090000
50% 0.250000 0.280000 0.290000 0.290000 0.310000 0.300000 0.310000 0.330000 0.32000 0.330000 ... 0.280000 0.270000 0.300000 0.310000 0.330000 0.310000 0.33000 0.330000 0.350000 0.430000
75% 0.500000 0.500000 0.500000 0.535000 0.550000 0.520000 0.530000 0.540000 0.53500 0.530000 ... 0.490000 0.450000 0.500000 0.500000 0.500000 0.500000 0.50000 0.550000 0.560000 0.670000
max 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.00000 1.000000 ... 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.00000 1.000000 1.000000 1.000000

8 rows × 52 columns

We cannot visualize high-dimensional data directly, so we will apply PCA to the normalized data, project it into a lower-dimensional space, and look at the explained variance. Both 2- and 3-component projections are visualized.

PCA ANALYSIS

What is PCA?

Principal component analysis (PCA) is a technique for reducing the dimensionality of high dimensional datasets, increasing interpretability but at the same time minimizing information loss. It does so by creating new uncorrelated variables that successively maximize variance.

In our case the data has shape (811, 52). Although it is feasible to train clustering models on n-dimensional data, it is not feasible to visualize the resulting clusters beyond 2D or 3D. So we will compute PCA projections with 2 and 3 components and visualize them with plotly.

In [8]:
'''
Compute PCA on the normalized data and store the lower-dimensional projections in new dataframes.'''



from sklearn.decomposition import PCA
import matplotlib.pyplot as plt 
%matplotlib inline

pca_2 = PCA(n_components=2)
pca_3 = PCA(n_components=3)

principle_components_2 = pca_2.fit_transform(normal)
principle_components_3 = pca_3.fit_transform(normal)

pca_data_2 = pd.DataFrame(data=principle_components_2, columns = ['principle component 1', 'principle component 2'])
pca_data_3 = pd.DataFrame(data=principle_components_3, columns = ['principle component 1', 'principle component 2', 'principle component 3'])


print(f'variance of 2 dimensional PCA: {pca_2.explained_variance_ratio_}')
print(f'variance of 3 dimensional PCA: {pca_3.explained_variance_ratio_}')
variance of 2 dimensional PCA: [0.33139718 0.04500733]
variance of 3 dimensional PCA: [0.33139718 0.04500733 0.02435102]

2 projections

In [9]:
ax_pca = plt.scatter(x = pca_data_2['principle component 1'], y = pca_data_2['principle component 2'])
plt.xlabel('principle component 1')
plt.ylabel('principle component 2')
plt.title('PCA for normalized weekly sales data')
plt.show()

3 projections

In [10]:
import plotly.express as px


# Creating plot
px.scatter_3d(x = pca_data_3['principle component 1'], y = pca_data_3['principle component 2'], 
             z = pca_data_3['principle component 3'])

Okay, but choosing 2 or 3 components is arbitrary; we only do it because it is easier for us to visualize, and it may not capture most of the variance in the dataset. Sometimes more than 3 components are needed. So let's visualize the cumulative variance captured as the number of components grows.

In [11]:
'''
plot the variance for all the columns
'''

pca = PCA()
components = pca.fit_transform(normal)

exp_var_cumul = np.cumsum(pca.explained_variance_ratio_)

px.area(
    x=range(1, exp_var_cumul.shape[0] + 1),
    y=exp_var_cumul,
    labels={"x": "# Components", "y": "Explained Variance"})

A commonly accepted threshold is to keep enough components to explain 90% of the variance. In our case about 40 components explain 90% of it. But since the dataset is not that large and the remaining 12 features barely affect training complexity, we will keep all of them in the training data anyway.
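The 90% figure can also be read off programmatically from the cumulative curve computed above (a small sketch, using the exp_var_cumul array from the previous cell):

# number of principal components needed to reach 90% cumulative explained variance
n_components_90 = int(np.argmax(exp_var_cumul >= 0.90) + 1)
print(f'components needed for 90% explained variance: {n_components_90}')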

t-SNE analysis

t-SNE is another, more modern approach to dimensionality reduction. Seriously, PCA dates back to 1901!

t-SNE is inherently different from PCA. PCA builds new axes (linear combinations of the features) that successively capture the most variance, and we keep only the leading ones. t-SNE, on the other hand, models pairwise similarities between data points and tries to preserve local neighborhoods, and to a lesser extent global structure, in the embedding.

There are two important hyperparameters for t-SNE: perplexity, which balances attention between local and global aspects of the data and typically ranges from 5 to 50 depending on the size of the data; and the number of iterations, which is usually tuned to trade off convergence against computational cost.
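Since perplexity can change the picture quite a bit, here is a small sketch (not in the original notebook) that embeds the normalized data with a few different perplexity values for comparison; a fixed random_state keeps the runs reproducible:

import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

fig, axes = plt.subplots(1, 3, figsize=(18, 5))
for ax, perp in zip(axes, (5, 30, 50)):
    emb = TSNE(n_components=2, perplexity=perp, random_state=0).fit_transform(normal)
    ax.scatter(emb[:, 0], emb[:, 1], s=5)
    ax.set_title(f't-SNE with perplexity = {perp}')
plt.show()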

In [12]:
from sklearn.manifold import TSNE



tsne_2 = TSNE(n_components=2,perplexity=5).fit_transform(normal)
tsne_3 = TSNE(n_components=3,perplexity=5).fit_transform(normal)

tsne_data_2 = pd.DataFrame(data=tsne_2, columns = ['Embedding 1', 'Embedding 2'])
tsne_data_3 = pd.DataFrame(data=tsne_3, columns = ['Embedding 1', 'Embedding 2', 'Embedding 3'])
In [13]:
fig, ax = plt.subplots(1,2,figsize=(15,8))
ax[0].scatter(x = pca_data_2['principle component 1'], y = pca_data_2['principle component 2'])
ax[0].set_xlabel('principle component 1')
ax[0].set_ylabel('principle component 2')
ax[0].set_title('principle component analysis')

ax[1].scatter(x = tsne_data_2['Embedding 1'], y = tsne_data_2['Embedding 2'])
ax[1].set_xlabel('Embedding 1')
ax[1].set_ylabel('Embedding 2')
ax[1].set_title('t-SNE analysis')

plt.show()
In [14]:
from plotly.subplots import make_subplots


px.scatter_3d(x = tsne_data_3['Embedding 1'], y = tsne_data_3['Embedding 2'], 
             z = tsne_data_3['Embedding 3'])

Kmeans Clustering

The number of clusters should be small enough that the clustering is meaningful, but large enough that the inertia is reasonably small. Inertia measures the within-cluster sum of squared distances, i.e. how far data points sit from the center of their assigned cluster.
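Concretely, the inertia that scikit-learn reports as kmeans.inertia_ is the within-cluster sum of squared distances, where the $x_i$ are the data points and the $\mu_j$ the cluster centers:

$$ \text{Inertia} = \sum_{i=1}^{n} \min_{\mu_j} \lVert x_i - \mu_j \rVert^2 $$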

But we have to do some analysis before settling on the optimal cluster count (k). First we will fit KMeans for every k from 1 to 50, expecting the optimum to be somewhere between 5 and 10.

In [15]:
from sklearn.cluster import KMeans
from sklearn.metrics.cluster import contingency_matrix

''' cluster() takes the number of clusters and a dataframe as input and
    returns the fitted model and its cluster assignments.
'''


def cluster(nclusters,data):
    kmeans = KMeans(n_clusters=nclusters)
    kmeans.fit(data)
    Z = kmeans.predict(data)
    return kmeans, Z

max_cluster_size = 50         

inertias = np.zeros(max_cluster_size)       # placeholder array filled with the computed inertias (index 0 stays unused)
for i in range(1, max_cluster_size):
    kmeans, Z = cluster(i,normal)
    inertias[i] = kmeans.inertia_
    
In [16]:
import plotly.graph_objects as go

'''
plot for the elbow method to find the optimal k.
'''

trace1 = go.Scatter(
    mode="lines+markers",
    name="lines+markers",
    x=list(range(1, max_cluster_size)),
    y=list(inertias[1:]),
)

layout = go.Layout(
    title="Elbow for KMeans clustering",
    xaxis={"title": "Number of clusters"},
    yaxis={"title": "Inertia"},
)

fig = go.Figure(data=[trace1], layout=layout)

In [17]:
fig.show()

Elbow Method

In cluster analysis, the elbow method is a heuristic used in determining the number of clusters in a data set. The method consists of plotting the explained variation as a function of the number of clusters, and picking the elbow of the curve as the number of clusters to use.

If we look at the elbow plot for our data, we can see that the number of clusters could be anywhere between 3 and 6; we cannot tell which of those is optimal. The elbow method works well when there is a distinct kink between two points: for example, if the drop between 3 and 5 were steep, we could say 4 is the optimal k. When the elbow is not clear there is a workaround: instead of the vanilla inertia we can compute a scaled inertia, which adds a regularization term controlled by a small alpha. The larger alpha is, the more the penalty pushes the optimum towards k = 1; as alpha goes to 0 we recover the plain inertia, which keeps decreasing with k. To pick the cluster count, we simply take the argmin over the scaled inertias.

$$ \text{Scaled Inertia}(k) = {\text{Inertia}(k) \over \text{Inertia}(k{=}1)} + \alpha \cdot k $$
In [18]:
def AutoKmeans(data,k,alpha_k=0.02):
    # total sum of squared deviations from the global mean = inertia for k = 1
    inertia_o = np.square((data.values - data.values.mean(axis=0))).sum()
    # fit KMeans for this k and penalize larger k by alpha_k * k
    kmeans = KMeans(n_clusters=k, random_state=0).fit(data)
    scaled_inertia = kmeans.inertia_ / inertia_o + alpha_k * k
    return scaled_inertia

def chooseBestKforKMeans(data, k_range):
    ans = []
    for k in k_range:
        scaled_inertia = AutoKmeans(data, k)
        ans.append((k, scaled_inertia))
    results = pd.DataFrame(ans, columns = ['k','Scaled Inertia']).set_index('k')
    
    return results
In [19]:
res_df = chooseBestKforKMeans(normal,range(1,50))
In [20]:
print(f'Best k for the data: {res_df.idxmin()[0]}')
Best k for the data: 4

Okay, so the scaled-inertia method tells us that 4 is the optimal cluster count. Now we can train models with k = 4 and get their predictions.

In [21]:
n_clusters = 4
model, Z = cluster(n_clusters, normal)

model_pca, Z_pca = cluster(n_clusters, pca_data_2)
model_tsne, Z_tsne = cluster(n_clusters, tsne_data_2)

model_pca_3, Z_pca_3 = cluster(n_clusters, pca_data_3)
model_tsne_3, Z_tsne_3 = cluster(n_clusters, tsne_data_3)

Concatenating the predicted clusters with the PCA and t-SNE data so we can visualize them in low dimensions.

In [22]:
pca_data_2['class_normal'] = Z
pca_data_2['class_pca'] = Z_pca

pca_data_3['class_normal'] = Z
pca_data_3['class_pca'] = Z_pca_3

tsne_data_2['class_normal'] = Z
tsne_data_2['class_tsne'] = Z_tsne

tsne_data_3['class_normal'] = Z
tsne_data_3['class_tsne'] = Z_tsne_3

PCA RESULTS

In [23]:
classes = ['1','2','3','4']

fig, ax = plt.subplots(1,2,figsize=(15,8))
ax[0].scatter(x = pca_data_2['principle component 1'], y = pca_data_2['principle component 2'],
                 c = pca_data_2['class_normal'], label=pca_data_2['class_normal'])
ax[0].set_xlabel('principle component 1')
ax[0].set_ylabel('principle component 2')
ax[0].set_title('Kmeans before PCA')

ax[1].scatter(x = pca_data_2['principle component 1'], y = pca_data_2['principle component 2'], 
             c = pca_data_2['class_pca'], label=pca_data_2['class_pca'])
ax[1].set_xlabel('principle component 1')
ax[1].set_ylabel('principle component 2')
ax[1].set_title('Kmeans with PCA')

plt.show()
In [24]:
px.scatter_3d(x = pca_data_3['principle component 1'], y = pca_data_3['principle component 2'], 
             z = pca_data_3['principle component 3'], color=pca_data_3['class_pca'])

THOUGHTS:

K-means on the PCA-reduced data and K-means on the full normalized data give very similar cluster assignments. As we can see from the 2D and 3D plots there are some outliers in both cases, but overall the results look good.
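That similarity can also be checked quantitatively rather than by eye. A small sketch: the contingency matrix (the import from earlier finally gets used) cross-tabulates the two labelings, and the adjusted Rand index summarizes their agreement in a single number that ignores label permutations (1.0 means identical partitions):

from sklearn.metrics import adjusted_rand_score
from sklearn.metrics.cluster import contingency_matrix

# rows: clusters from the full normalized data, columns: clusters from the 2D PCA projection
print(contingency_matrix(Z, Z_pca))
print(f'Adjusted Rand Index (full data vs. PCA 2D): {adjusted_rand_score(Z, Z_pca):.3f}')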

T-SNE RESULTS

In [25]:
classes = ['1','2','3','4']

fig, ax = plt.subplots(1,2,figsize=(15,8))        #raw data subplot
ax[0].scatter(x = tsne_data_2['Embedding 1'], y = tsne_data_2['Embedding 2'],
                 c = tsne_data_2['class_normal'], label=tsne_data_2['class_normal'])
ax[0].set_xlabel('Embedding 1')
ax[0].set_ylabel('Embedding 2')
ax[0].set_title('Kmeans before t-SNE')

ax[1].scatter(x = tsne_data_2['Embedding 1'], y = tsne_data_2['Embedding 2'],  #t-SNE data subplot
             c = tsne_data_2['class_tsne'], label=tsne_data_2['class_tsne'])
ax[1].set_xlabel('Embedding 1')
ax[1].set_ylabel('Embedding 2')
ax[1].set_title('Kmeans with t-SNE')

plt.show()
In [26]:
px.scatter_3d(x = tsne_data_3['Embedding 1'], y = tsne_data_3['Embedding 2'], 
             z = tsne_data_3['Embedding 3'], color=tsne_data_3['class_tsne'])

THOUGHTS

Both dimensionality reduction techniques seem to work very well. But does that mean they are correct? In one of the Stack Overflow answers here, the writer says that t-SNE clusters must be interpreted carefully before drawing conclusions, as they can be misleading:

While clustering after t-SNE will sometimes (often?) work, you will never know whether the "clusters" you find are real, or just artifacts of t-SNE. You may just be seeing shapes in clouds.

Even though the visuals give some idea of the models' clustering capability, we need to quantify it to make it more trustworthy.

Quantifying results with metrics is easier in supervised learning. Given the lack of labels in unsupervised learning, quantification is more limited. One of the most common metrics for clustering problems is the Silhouette Coefficient, given by

$$ S = {(b-a) \over \max(a,b)} $$

a = mean intra-cluster distance: the average distance from a point to the other points in its own cluster.

b = mean nearest-cluster distance: the average distance from a point to the points in the nearest other cluster.

The silhouette score ranges from -1 to 1: values near -1 mean points are likely assigned to the wrong cluster, values near 0 mean the clusters overlap, and values near 1 mean the clusters are well separated.

In [27]:
from sklearn.metrics import silhouette_score, silhouette_samples

normal_score = silhouette_score(normal, model.labels_, metric='euclidean')
pca_score_2 = silhouette_score(pca_data_2, model_pca.labels_, metric='euclidean')
tsne_score_2 = silhouette_score(tsne_data_2, model_tsne.labels_, metric='euclidean')

pca_score_3 = silhouette_score(pca_data_3, model_pca_3.labels_, metric='euclidean')
tsne_score_3 = silhouette_score(tsne_data_3, model_tsne_3.labels_, metric='euclidean')


nl = '\n'              
print(f'KMeans Non Engineered Silhouette Score: {nl} {normal_score}')
      
print('\n')
              
print(f'KMeans PCA Scaled Silhouette Score; {nl} 2 components: {pca_score_2}; {nl} 3 components: {pca_score_3}')
      
      
print('\n')
print(f'KMeans t-SNE Scaled Silhouette Score;{nl}  2 embeddings: {tsne_score_2};{nl} 3 embeddings: {tsne_score_3}')
     
KMeans Non Engineered Silhouette Score: 
 0.07730232765455476


KMeans PCA Scaled Silhouette Score; 
 2 components: 0.731275924926822; 
 3 components: 0.7335979125472216


KMeans t-SNE Scaled Silhouette Score;
  2 embeddings: 0.44243904464483286;
 3 embeddings: 0.26589447561946394

This somewhat contradicts the t-SNE visualizations. Interestingly, the silhouette score for the model trained on the entire dataframe does not match the impression given by the low-dimensional visuals either. This shows why quantification is needed. So for our data, the combination that works best is KMeans + PCA.
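To dig one level deeper, silhouette_samples (imported above but not used so far) gives a per-point score, which lets us see whether every cluster is equally tight or whether one weak cluster drags the average down. A small sketch on the 2D PCA model, using only the two component columns as features:

# per-sample silhouette values for the PCA-2D clustering
pca_features = pca_data_2[['principle component 1', 'principle component 2']]
sample_scores = silhouette_samples(pca_features, model_pca.labels_)

for c in np.unique(model_pca.labels_):
    print(f'cluster {c}: mean silhouette = {sample_scores[model_pca.labels_ == c].mean():.3f}')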

INFERENCE

Why do all this anyway? Because we can extract some useful information from the clustered data. First let's concatenate the predicted labels back onto the raw data and see if there are any patterns.

In [28]:
inf = raw.copy()
inf['class'] = Z

sales_group = inf.groupby('class').sum().astype(int).reset_index()    # group by the predicted cluster and sum the weekly sales
weekly_sales = sales_group.drop('class',axis=1)

Let's say we want to see the weekly sales pattern of a product. We do not have product names, only arbitrary codes like P1 and P2. But suppose P1 is ice cream and P2 is pencils: they would obviously have different sales patterns and would probably end up in different clusters. We want to see in which weeks of the year they sell the most so we can manage stock.

A typical plot of the raw data looks like this: very messy and illegible.

In [42]:
fig = px.line(raw.T, title='Messy data representation of weekly sales')
fig.update_xaxes(title_text='Week')
fig.update_yaxes(title_text='Sales Count')
fig.show()

Now let's take a look at our clustered and grouped data. So much better. When a new product is added to the data, we can easily assign it to a cluster and predict at which time of the year products of that kind sell the most. Of course, the data is not elaborate enough to fully support this kind of pattern analysis, since it only contains one year of weekly sales; it could be made more robust with additional features such as product categories, seasonal data, or data from more years.

Two types of products seem to fall behind in sales as the year progresses. I wonder what those could be? Meanwhile, the other two stay fairly static at low volume.

In [41]:
fig = px.line(weekly_sales.T, title='After applying clustering to raw data')
fig.update_xaxes(title_text='Week')
fig.update_yaxes(title_text='Sales Count')
fig.show()
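As mentioned above, once the clusters are fixed a new product can be assigned to one of them with the trained model. A small sketch with a hypothetical product (random normalized weekly sales, just a stand-in for real data):

# hypothetical new product: 52 min-max-normalized weekly sales values
new_product = pd.DataFrame(np.random.rand(1, 52), columns=normal.columns)
predicted_cluster = model.predict(new_product)[0]
print(f'new product assigned to cluster {predicted_cluster}')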
In [31]:
'''
Sum the weekly columns in blocks of 4, then group by the predicted cluster and plot the (approximate) monthly sales.'''


monthly = inf.iloc[:,:52].groupby((np.arange(len(inf.iloc[:,:52].columns)) // 4) + 1, axis=1).sum()
monthly['class'] = inf['class']

monthly_group = monthly.groupby('class').sum().astype(int).reset_index()
monthly_sales = monthly_group.drop('class',axis=1)
In [40]:
fig = px.line(monthly_sales.T, title='Monthly sales of different clusters')
fig.update_xaxes(title_text='Month')
fig.update_yaxes(title_text='Sales Count')
fig.show()

CONCLUSIONS

Okay, so we went through two different dimensionality reduction techniques and one clustering technique, KMeans. There are several other clustering algorithms such as DBSCAN and agglomerative clustering, but this notebook is already quite long and the results are decent for the combination of KMeans and PCA. Thank you for your attention and have a good day!