
K Means Clustering on High Dimensional Data.

shivangi singh
Published in The Startup
6 min read · Jan 28, 2021


KMeans is one of the most popular clustering algorithms, and scikit-learn makes it easy to implement without going too deep into the mathematical details. How does KMeans work under the hood? Maybe that’s a story for another time. Today we will see how to use KMeans to cluster data, especially data with higher dimensions. Dimensionality refers to the number of attributes or features a dataset has, and the data we are working with today is the Wine dataset, which has 13 dimensions. Each wine has 13 factors that contributed to its awesome taste, and we will make KMeans group similar wines together.


Steps that we will perform:

  1. Load the wine-dataset.
  2. Reduce the dimensions using Principal Component Analysis (PCA).
  3. Finding important features with the help of PCA.
  4. Hyperparameter tuning using the silhouette score method.
  5. Apply K Means & Visualize your beautiful wine clusters.

Full code can be found at Wine_Clustering_KMeans.

1. Load your wine dataset.

We are using pandas for that. So we have:
178 rows → each row represents one wine entry
13 columns → each column represents a wine attribute

The information about which wine belongs to which category was removed so that the clustering is based solely on the attributes. We do know, however, that there are 3 different types of wines (the original dataset specifies it), so if KMeans finds 3 clusters, our day is saved!
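
A minimal sketch of the loading step (the article’s notebook presumably reads a CSV, but the same 13-attribute data also ships with scikit-learn, which is what this sketch assumes):

# a sketch of the loading step, using the copy of the Wine dataset
# bundled with scikit-learn (the original notebook may read a CSV instead)
import pandas as pd
from sklearn.datasets import load_wine

wine = load_wine()
data = pd.DataFrame(wine.data, columns=wine.feature_names)
print(data.shape)    # (178, 13) -> 178 wines, 13 attributes
print(data.head())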

Raw wine data

Scaling the features is a crucial step when their units differ from each other. In the output above it is evident that the 13 attributes lie in very different ranges; scikit-learn’s StandardScaler() standardizes features by removing the mean and scaling to unit variance. Several other types of scaling are available, but today Standard Scaler fits our requirement.
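
A short sketch of that step, assuming the DataFrame data loaded above (the name data_scaled matches the fitting code further down):

# standardize every attribute to zero mean and unit variance
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
data_scaled = scaler.fit_transform(data)    # numpy array, shape (178, 13)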

Scaled wine data

We all know that KMeans is great, but it does not work well with higher-dimensional data, thanks to the Curse of Dimensionality :( As the number of dimensions increases, the distances between data points tend to become nearly the same, so distance stops being informative, which is a huge problem for algorithms like KMeans that rely on distance-based metrics to identify similar points for clustering. We can reduce the number of features by:

a) dropping features that are very similar to each other and keeping just one of them, or

b) combining features that carry more sensible information when considered together.

If you cannot afford to do either of the two, or the dimensions are still a mess even after applying them, we can

c) use one of the several available dimensionality reduction techniques.

2. Reduce the dimensions using Principal Component Analysis (PCA)

We are reducing the number of dimensions from 13 to 2, partly because two dimensions are easier to visualize. Remember that reducing dimensions means some loss of information. To be sure you are not losing too much, check the combined “explained variance ratio” of the components in your PCA result. Here we get 55.41%, which means the two components still capture 55.41% of the variance in the original data. What percentage you need depends entirely on your project and your requirements; right now, 55.41% seems good to me for this toy example.
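
A sketch of the reduction, assuming the scaled array from step 1 (the variable name pca_result is just for illustration):

# project the 13 scaled features onto 2 principal components
from sklearn.decomposition import PCA

pca = PCA(n_components=2)
pca_result = pca.fit_transform(data_scaled)     # shape (178, 2)
print(pca.explained_variance_ratio_.sum())      # ~0.554 -> about 55.41% of the variance retained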

Fitting PCA on the scaled data

3. Finding important features with the help of PCA.

To check which features our PCA considers important, we can use sklearn’s components_ attribute. It is essentially a matrix with the principal components as rows and the features as columns, i.e. of shape (n_components, n_features).
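
For readability, the loading matrix can be wrapped in a DataFrame (a sketch; pca and data come from the snippets above):

# rows = principal components, columns = the 13 original features
import pandas as pd

loadings = pd.DataFrame(pca.components_,
                        columns=data.columns,
                        index=['PC 1', 'PC 2'])
print(loadings.round(3))

pc1 = loadings.loc['PC 1']
print(pc1[pc1.abs() > 0.3])    # features with an absolute loading above 0.3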

PC 1 and PC 2 values for all the features

“To interpret each principal component, examine the magnitude and direction of the coefficients for the original variables. The larger the absolute value of the coefficient, the more important the corresponding variable is in calculating the component.” — source

Looking at the results, we can notice that for the first principal component, feature 6 (Total_Phenols), feature 7 (Flavonoids), feature 9 (Proanthocyanin), and feature 12 (OD280) are important, as their absolute values are noticeably higher than those of the other features (using 0.3 as the threshold to filter the important features seems fair). Hence, below are the important features as per PC 1 and PC 2.

Important features as per PC 1 and PC 2

4. Hyperparameter tuning using the silhouette score method

Apart from the curse-of-dimensionality issue, KMeans has another quirk: we need to explicitly tell the model how many clusters we want the data to be grouped into, and this hit-and-trial can be daunting, so we use the silhouette score method. You provide a list of candidate cluster counts, fit a KMeans model for each candidate, and let metrics.silhouette_score compute a score for the resulting clustering. For example, if we want to check how good our model would be if we asked it to form 2 clusters, we check the silhouette score for clusters=2.

The silhouette score ranges from -1 to 1, -1 being the worst and 1 being the best.
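
A sketch of that loop, assuming the scaled data from step 1; the candidate list is an arbitrary choice for illustration:

# fit KMeans for each candidate cluster count and score the result
from sklearn.cluster import KMeans
from sklearn import metrics

for n_clusters in [2, 3, 4, 5, 6]:    # candidate list is an assumption
    kmeans = KMeans(n_clusters=n_clusters, random_state=42)
    labels = kmeans.fit_predict(data_scaled)
    score = metrics.silhouette_score(data_scaled, labels)
    print(f"clusters = {n_clusters}, silhouette score = {score:.3f}")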

Silhouette scores for different numbers of clusters

Plotting the silhouette score against the number of clusters for our KMeans model shows that the score is highest for number of clusters = 3.

silhouette score per number of clusters

5. Apply K Means

We finally take the number of clusters that we got from the hyperparameter tuning and fit our KMeans model for the final result.

# fitting KMeans
from sklearn.cluster import KMeans

kmeans = KMeans(n_clusters=optimum_num_clusters)
kmeans.fit(data_scaled)

and visualize it by plotting the 2 PCA components (remember we reduced 13 dimensions to 2). Our KMeans was indeed able to separate the 3 different wine categories.
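
A sketch of that plot, assuming pca and pca_result from step 2 and the kmeans model fitted above; one way to overlay the centroids is to project the 13-dimensional cluster centres into the same 2-D space with pca.transform:

# scatter the 2 principal components, coloured by cluster label,
# and mark the projected cluster centres on top
import matplotlib.pyplot as plt

centroids_2d = pca.transform(kmeans.cluster_centers_)

plt.scatter(pca_result[:, 0], pca_result[:, 1], c=kmeans.labels_, s=30)
plt.scatter(centroids_2d[:, 0], centroids_2d[:, 1], c='red', marker='x', s=200, label='centroids')
plt.xlabel('PC 1')
plt.ylabel('PC 2')
plt.legend()
plt.show()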

Our 3 🍷 categories.

Now that we are done clustering, pour a glass for yourself and enjoy! 😉

*Note*: In the code, the output of PCA’s fit_transform is not fed into KMeans. That is because of 2 main reasons:

  1. We reduce the dimensions so that we can visualize our data in 2-D, and reducing dimensions means some loss of information.
  2. For the clustering itself we do not want to ignore the information that was lost in that reduction, hence we apply KMeans on the entire scaled data.

You can also notice in the code that the PCA results were used for visualizing the data, and the centroids obtained from KMeans were marked on top of them. In short, in this example PCA mainly serves to reduce the high-dimensional data for visualization and initial exploratory purposes (that might not always be the case).
