Familiarization with Unsupervised Learning
Until now we have dealt only with supervised learning! This is the first time we will try out a form of unsupervised learning. Just as a refresher: unsupervised learning is a type of learning where we do not help the system during training by supplying labels; instead, the machine figures out the structure of the data on its own. Unsupervised learning may seem complex at first, but I assure you that under the hood it is simple. You just have to prepare your mindset for it. It is similar to reading a textbook without your teacher's help: you can still more or less understand what is written, but a teacher's presence would make a big difference. Whenever you are learning complex topics, just close your eyes and create a simple analogy. This will help you a lot when we come to difficult topics such as Restricted Boltzmann Machines and Autoencoders.
Introduction to Clustering
Clustering is a form of unsupervised machine learning where, through repeated training, the model figures out how to group the data into individual clusters. Just as kNN takes a value K, K-Means takes a value K that we choose; here, K represents the number of clusters we want. If we specify K as 2, the model will split the data into 2 individual clusters.
Learning Strategy of K-Means Clustering
Every ML algorithm needs some sort of training to work properly: to understand the data that has been fed to it, to build relationships within that data, and so on. Let us see how the K-Means algorithm learns. There are 4 main steps to train a K-Means algorithm:
- Initialization: K centroids are generated at random locations, where K is the value we specify.
- Assignment: each data observation is associated with its closest centroid. This creates K clusters.
- Update: each cluster's center of mass (by which I mean the center of the entire cluster) becomes the new centroid location.
- Iteration: the Assignment and Update steps repeat until the model reaches convergence, i.e. when the centroids stop moving and the clusters are correctly formed.
So let us briefly discuss its learning strategy. First, the model creates centroids at random locations. Then the distance (typically the Euclidean distance) is measured from each centroid to the data points. Once that step is done, the model creates a cluster out of the data points that are closest to each centroid. Then the Update phase starts: the new center of mass is calculated for each cluster, and that center of mass becomes the new centroid. Now we reach the Iteration phase, where the model checks whether it has reached convergence; if it hasn't, it starts again from the Assignment stage and continues to calibrate each centroid from its previous location to a more stable one. This happens for all K clusters, where K is again the value decided by the developer. Here is the learning process visualized.
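The four steps above can be sketched in a few lines of NumPy. This is a minimal from-scratch version purely for illustration (the function name and parameters are my own); real implementations add smarter initialization and other refinements:

```python
import numpy as np

def kmeans(X, k, n_iters=100, seed=0):
    """Minimal K-Means sketch: Initialization, Assignment, Update, Iteration."""
    rng = np.random.default_rng(seed)
    # Initialization: pick k distinct data points as the starting centroids
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iters):
        # Assignment: label each point with its closest centroid (Euclidean distance)
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Update: each cluster's center of mass becomes its new centroid
        # (if a cluster ends up empty, its centroid simply stays where it is)
        new_centroids = np.array([
            X[labels == i].mean(axis=0) if np.any(labels == i) else centroids[i]
            for i in range(k)
        ])
        # Iteration: stop once the centroids no longer move (convergence)
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids
```

The `allclose` check is what "reaching convergence" means in practice: once an Update step no longer moves any centroid, further iterations would change nothing.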
Now that you have visually understood how the K-Means Clustering works, we can perform a simple demo in Python that shows the proof of concept. We will not be using an actual dataset but instead we will make our own simple data that will just show you how the K-Means Clustering Algorithm performs.
Here we create the sample training data in the variable X. Now let us see the output, visualized using the Matplotlib library.
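A snippet along these lines produces such sample data; I am assuming scikit-learn's `make_blobs` helper here, and the exact parameters (300 points, 3 centers, the random seed) are just illustrative choices:

```python
import matplotlib.pyplot as plt
from sklearn.datasets import make_blobs

# Create 300 synthetic 2-D points scattered around 3 centers
# (not a real dataset, just simple sample data for the demo)
X, _ = make_blobs(n_samples=300, centers=3, cluster_std=0.8, random_state=42)

# Plot the raw, unlabelled training data
plt.scatter(X[:, 0], X[:, 1], s=20)
plt.title("Sample training data")
plt.show()
```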
We specified a K value of 3, so we are expecting 3 clusters. Let us see the final results.
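Fitting and visualizing could look like this, assuming scikit-learn's `KMeans` class (the snippet regenerates the sample data described above so it runs on its own):

```python
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Recreate the sample training data X
X, _ = make_blobs(n_samples=300, centers=3, cluster_std=0.8, random_state=42)

# Fit K-Means with K = 3 clusters
model = KMeans(n_clusters=3, n_init=10, random_state=42)
labels = model.fit_predict(X)

# Color each point by its assigned cluster and mark the final centroids
plt.scatter(X[:, 0], X[:, 1], c=labels, s=20)
plt.scatter(model.cluster_centers_[:, 0], model.cluster_centers_[:, 1],
            c="red", marker="x", s=100)
plt.title("K-Means result with K = 3")
plt.show()
```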
Whoa! That is really great. We have 3 correct clusters, just as we expected!
So here we can see that our model has formed 3 clusters after just a few seconds of training. Note that this is a completely separate example, not an imitation of the step-by-step demo I used earlier to explain the learning strategy of K-Means.
Conclusion and more information
Clustering is a truly unique form of machine learning, and we have now covered everything required to understand its basics using the K-Means algorithm. But how do we find the correct value of K? We could use trial and error, but that is not worth your time. Instead, there is a rule of thumb called the Elbow Method. Let me show it.
In the Elbow Method we try different values of K and observe how much the within-cluster variance drops. For the first few K values we see big drops in variance, but after a K value of 4 there is only limited improvement. Thus 4 is the best K value to use in order to get good results while minimizing the load on the computer.
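Here is one way the Elbow Method could be coded up, again assuming scikit-learn; the "variance" we track is scikit-learn's `inertia_`, the within-cluster sum of squared distances, and the sample data (4 natural groups) is my own illustrative choice:

```python
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Sample data with 4 natural groups, so the "elbow" should appear at K = 4
X, _ = make_blobs(n_samples=300, centers=4, cluster_std=0.8, random_state=42)

# Fit K-Means for K = 1..8 and record the inertia
# (inertia = within-cluster sum of squared distances)
ks = list(range(1, 9))
inertias = []
for k in ks:
    km = KMeans(n_clusters=k, n_init=10, random_state=42).fit(X)
    inertias.append(km.inertia_)

# The "elbow" is the K value after which the curve flattens out
plt.plot(ks, inertias, marker="o")
plt.xlabel("K (number of clusters)")
plt.ylabel("Inertia (within-cluster variance)")
plt.title("Elbow Method")
plt.show()
```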
So, in my opinion, this is all that you need to know in order to understand the concept of clustering using the K-Means algorithm. There is good news for everyone following this blog series: we have officially finished the Classical Machine Learning series. In the next blog we will have a short discussion of other forms of classic ML, such as Naive Bayes and ensemble learning. After that we will dive into the beautiful bliss that is Deep Learning. Once you learn its basics, you will be confident tackling such tasks on your own.
Thank you for following me and reading this blog; I am pleased to share my knowledge. Have a nice day and enjoy Deep Learning. 🙂