After learning Support Vector Machines we are going to advance into the territory of kNN a.k.a KNearestNeighbors. K nearest neighbors is a simple algorithm that stores all available cases and classifies new cases based on a similarity measure (e.g., distance functions). kNN has been used in statistical estimation and pattern recognition. kNN is a type of instance-based learning, or lazy learning, where the function is only approximated locally and all computation is deferred until classification. The kNN algorithm is among the simplest of all machine learning algorithms.
Classification & Learning process
kNN is a classification algorithm. This algorithm works by finding the euclidean distance between points of the nearest classes and the new point. The number of points that it needs to find the distance between is denoted by K. if K = 3 then then the model has to find 3 closest points and the majority of the closest distance is taken and the class is identified. For example we have the value of K as 3, two classes ‘A’ & ‘B’ along with a new point named ‘C’. How do we predict which class does ‘C’ belong to? The answer here is using a kNN Classifier. Let us see a diagrammatic representation
Here we can see that because our K value is 3, we are checking the distance to other points only 3 times. We can clearly make out that the new yellow point ‘C’ is closer to the class ‘A’ than the class ‘B’ simply because the total euclidean distance from the points in class ‘A’ is lesser than the ones in class ‘B’. Thus the kNN classifies this new point to be of type ‘A’. The more the value of ‘K’ more the accuracy of the classification. However there occurs a point of diminishing returns where a high value of K doesn’t give the best results but instead taxes the computer. We Should always choose a generous amount for K. The accuracy of this model is counted by dividing the number of closest points and the total K value. In this case the accuracy is 2(closest points) / 3(value of K) so 2/3 = 0.66% so we have an accuracy of 66%. Generally we would not want to have an even K Value. Although a accuracy of 66% is not that great we have a good enough classification here. So for the purpose of this example it is fine. However to increase the accuracy, we can increase the value of K. This is a simple and intuitive understanding of how a kNN Classifies a new point.
Things to keep in mind
- The dataset that we are going to use needs to have as little noise as possible, here ‘noise’ refers to the overlapping of the target classes.
- We need to have labelled features in our dataset as a kNN is a supervised type of ML.
- Our dataset needs to contain only relevant features. If there are some features that do not contribute anything to the final answer like “patient name” etc. we have to drop it as it increases the complexity and decreases the overall accuracy.
- Avoid using large K values for a kNN working on a large dataset as it takes a lot of time to train because it needs to find the distance to other points n number of times, where n stands for the value of K.
- Some popular use cases of a kNN include:
- Recommendation Systems
- Credit Risk Analysis
- Predictive Trip Planning
- Stock price recommendation
- a kNN is a supervised Classifier
Now that you have a rough idea about what is a kNN algorithm, let us solve a simple problem in Python. We are going to work on a dataset which contains all the records of Heart Disease Patients. Our task is to classify which category does the new patient belong to! Firstly we need to download our dataset, to do that go to this link. This is a link to my Dropbox where I have added a file named ‘Heart.data’ click on download and save it to your working directory.
You may ask, where have you specified the value of K? The answer to that is on line 17 in the program. clf = neighbors.KNeighborsClassifier(7) This line creates a kNN Classifier with a K Value of 7 and stores it into a reference variable called ‘clf’. The KNeighborsClassifier is a part of the neighbors module of scikit-learn in Python 3. You may have noticed that in line 9, 10 and 11 we are removing certain columns from our dataset! This is because that contains data which is useless for our predictions as they just increase the complexity of the dataset. We are predicting on a sample array of features that we have created. Let us see the output of this model based on the sample features that we provided. For further information on the dataset you can visit here.
Predictions made by the kNN
It is sad to see that we have an accuracy of only 63 percent! However this is not completely our fault. The model itself predicts whether the person has a 50% chance of diameter narrowing. It also consisted of a lot of missing attributes and bad data. We have replaced all the missing attributes with an outlier(-99999) and thus most of our data-points are treated as outliers affecting our overall accuracy. But none the less, we have successfully created a kNN Classifier that can classify Heart Disease Patients!
To sum it all up, The KNearestNeighbors algorithm is useful in Classification of smaller chunks of data. It is a method that finds the euclidean distance between n points, where n is the value of K that the user specifies. This approach is useful in many cases. a kNN is a relatively easy type of algorithm in the ML Pipeline. We also created and used a kNN to classify Heart Disease Patients.
Hope you learnt something new today and enjoyed today’s session. In the next blog post we will be discussing about a very simple topic in the ML Pipeline, Decision Trees. Until then Have a nice day and enjoy Machine Learning! 🙂