Unsupervised Machine Learning – Clustering using K-Means

Familiarization with Unsupervised Learning

Until now we have dealt only with supervised learning! This is the first time we will try out a form of Unsupervised Learning. Just as a refresher, Unsupervised Learning is a type of learning where we do not help the system during training by supplying labels; instead, the machine figures out the structure of the data on its own. Unsupervised Learning may seem complex at first, but I assure you that under the hood it is easy. You just have to prepare your mindset while dealing with Unsupervised Learning. It is similar to reading a textbook without your teacher's help: you can still roughly understand what is written, but a teacher's presence would make a big difference. Whenever you are learning complex topics, just close your eyes and create a simple analogy. This will help you a lot when we come to difficult topics such as Restricted Boltzmann Machines & AutoEncoders.

Introduction to Clustering

Clustering is a form of Unsupervised Machine Learning where, after repeated training, the model figures out how to group the data into individual clusters. Similar to kNN, the K in K-Means represents the number of clusters we want: if we specify a K of 2, the model will split the data into 2 individual clusters. To simplify, the goal of this algorithm is to find groups in the data, with the number of groups represented by the variable K. The algorithm works iteratively to assign each data point to one of the K groups based on the features that are provided. K-Means Clustering produces 2 outputs: the Centroids, which are the centers of the clusters, and the Labels, which are assigned to each of the K clusters once they are formed. This type of model saves us time: instead of sorting, arranging and manipulating the data before feeding it to the model, we simply input the data and get back useful groups determined by the value of K. This is a relatively simple algorithm; we will discuss its learning strategy later. In fact, the topics of Classical Machine Learning are relatively easier than those of Deep Learning. Over many iterations the machine understands the groups in the data very well, and at some point the algorithm ends, a.k.a. convergence of the algorithm. At that point we have the output of our clustering, i.e. the Centroids and the Labels for the clusters, along with the actual clusters.

Learning Strategy of K-Means Clustering

Every ML algorithm needs some sort of training to work properly, to understand everything that has been fed to it, to create relationships within the data, and so on. Let us see how the K-Means algorithm learns. There are 4 main steps to train a K-Means algorithm:

  • Initialization: n Centroids are generated at random locations, where n refers to the value of K that we specify.
  • Assignment: clusters are created by associating each data observation with the closest Centroid. This creates K clusters.
  • Update: the center of mass (by which I mean the center of the entire cluster) becomes the new Centroid location.
  • Iterate: this process continues until the model reaches convergence, i.e. when the clusters are correctly formed.

So let us briefly discuss its learning strategy. First, a Centroid is created at a random location. Then the distance (typically the Euclidean distance) from that centroid to the other data points is measured. Once that step is done, the model creates a cluster consisting of the data points that are closest to the centroid. Then the Update phase starts: the new center of mass is calculated for each cluster, and that center of mass becomes the new centroid location. Now we reach the Iterate phase, where the model checks if it has reached convergence. If it hasn't, the model starts again at the Assignment stage and continues to calibrate each centroid from its previous location to a more stable location. This happens to all n clusters, where again n is the value of K decided by the developer. Here is the learning process visualized.
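To make the four steps concrete, here is a minimal from-scratch sketch of the K-Means loop, assuming NumPy; the function name and parameters are just illustrative, not part of any library:

```python
import numpy as np

def kmeans(X, k, n_iters=100, seed=0):
    """A minimal K-Means: Initialization, Assignment, Update, Iterate."""
    rng = np.random.default_rng(seed)
    # Initialization: pick k random data points as the starting centroids
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    labels = np.zeros(len(X), dtype=int)
    for _ in range(n_iters):
        # Assignment: label each point with the index of its closest centroid
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Update: the center of mass of each cluster becomes the new centroid
        # (if a cluster ends up empty, we simply keep its old centroid)
        new_centroids = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        # Iterate: stop once the centroids no longer move, i.e. convergence
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return centroids, labels
```

Note how convergence is simply "the centroids stopped moving between iterations", exactly as described above.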

[Figure: the raw data that is fed into the model]
[Figure: initialization, where the centroids are randomly generated]
[Figure: the first clusters, where each centroid calculates the distance to the data points and groups the closest ones to form a cluster]
[Figure: the centroids updating their positions; once a cluster is formed, its center of mass is calculated and that center becomes the new position of the centroid]
[Figure: the clusters being recreated; the previous clusters are discarded and the distances to the points are measured once again]
[Figure: convergence; new clusters are formed from the new centroids, and the process continues until the model converges, giving us the final clusters A, B & C along with their respective centroids]

Now that you have visually understood how K-Means Clustering works, we can perform a simple demo in Python as a proof of concept. We will not use an actual dataset; instead we will make our own simple data that shows you how the K-Means Clustering Algorithm performs.

Code

[Figure: the Python code required to execute a simple clustering demo using the K-Means Clustering Algorithm]

Here we create the sample training data in the variable X. Now let us see the output, visualized using the Matplotlib library.
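Since the code screenshot does not reproduce well here, a minimal sketch of such a demo, assuming scikit-learn, NumPy and Matplotlib are available (the sample data and variable names are illustrative):

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans

# Create simple sample training data: three loose groups of 2-D points
rng = np.random.default_rng(42)
X = np.vstack([
    rng.normal((2, 2), 0.5, size=(50, 2)),
    rng.normal((8, 3), 0.5, size=(50, 2)),
    rng.normal((5, 8), 0.5, size=(50, 2)),
])

# Fit K-Means with K = 3, matching the number of groups we expect
model = KMeans(n_clusters=3, n_init=10, random_state=0)
labels = model.fit_predict(X)

# Plot the points coloured by cluster, marking each centroid with an 'X'
plt.scatter(X[:, 0], X[:, 1], c=labels)
plt.scatter(*model.cluster_centers_.T, marker='x', s=100, c='red')
plt.show()
```

The `fit_predict` call runs the whole Initialization/Assignment/Update/Iterate loop for us and returns the cluster label for each point, while `cluster_centers_` holds the final centroids.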

Output

[Figure: a visualization of all the raw data points that are fed into the algorithm]

We specified a K value of 3, so we are expecting 3 clusters. Let us see the final results.

[Figure: the output with 3 clusters; the 'X' marks are the centroids, which clearly sit at the center of each cluster]

Whoa! That is really great. We have 3 correct clusters, just as we expected!

So here we can see that our model has made 3 clusters after just a few seconds of training. Note that we were not trying to imitate the earlier walkthrough of the learning strategy; this is a completely separate example. This was the demo that I wanted to show you.

Conclusion and more information

Clustering is a truly unique form of Machine Learning. We have now learnt everything required to understand the basics of Clustering using the K-Means Algorithm. But how do we find the correct value of K? We could use trial and error, but that is not worth your time. Instead, there is a rule of thumb called the Elbow Method. Let me show it.

[Figure: a plot representing the Elbow Method]

In the Elbow Method we try different values of K and observe how much the within-cluster variance drops. For the first few values of K we see big drops, but after a K value of 4 the improvement is limited; thus 4 is the best K value to use in order to get good results while minimizing the load on the computer.

So guys, in my opinion this is all you need to know in order to understand the concept of Clustering using the K-Means Algorithm. There is good news for all of you following this blog series: we have officially ended the Classical Machine Learning series. In the next blog we will have a short discussion on other forms of Classical ML, such as Naive Bayes and Ensemble Learning. After that we will dive into the beautiful bliss that is Deep Learning. Once you learn the basics of Deep Learning, you will be confident in tackling such tasks.

Thank you for following me and reading this blog; I am pleased to share my knowledge. Have a nice day and enjoy Deep Learning. 🙂

-MANAS HEJMADI

 
