
Machine Learning: Fundamentals of the K-Nearest Neighbours Algorithm for Classification Problems

In this article, we are going to understand the absolute fundamentals of K-Nearest Neighbours (K-NN), along with the intuition behind it.

What is K-Nearest Neighbours?

K-NN, or K-Nearest Neighbours, is a supervised learning algorithm that can be used for Classification and Regression problems.

What is classification? You have 10 bottles of wine, and you are supposed to sort them into red and white. Sorting items into categories like this, done by software, is known as a Classification Problem.

What is regression? Simply put, regression is a technique to predict a numeric value: the price of a house, or the number of units being sold today on Amazon. In the space of Machine Learning, regression simply means predicting a quantity. Don’t let these gurus confuse you!

The term K-Nearest Neighbours means the following:
1. K – An integer value that we choose and pass to the algorithm
2. Nearest Neighbours – Using the value of K from above, say for example K=5, we ask our algorithm to tell us the 5 Nearest Neighbours to a certain data point with coordinates, say, (X,Y)

K-NN Classification: A strong intuition

To begin with, it is important that we cover K-NN classification; once the intuition is understood, we will slowly move over to K-NN regression in another post.

To put things in perspective, as I always prefer, the intuition behind this algorithm is absolutely essential. Let’s take a look at the image below.

func.in: An illustration of the case at hand

In the above graph, each point is a glass, plotted by its hardness and the height it was dropped from, and we see two categories: Green and Blue. The Blue ones did not break when dropped, but the Green ones did!

Now the question is: what if I have a glass with a specific hardness, and it is dropped from a certain height? We would like to predict whether this glass will break or not. Take a look at the image below.

func.in: The objective is to find out whether the glass will break or not

Here, the point on the scatter plot with a purple outline is the glass to be categorised as one that will break or one that won’t. Simple? Maybe yes, maybe no. Let’s find out together 🙂
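To follow the steps concretely, here is a minimal Python sketch of such a dataset. The hardness and drop-height numbers, and the purple point itself, are invented purely for illustration – they are not read off the plot above.

```python
# Hypothetical toy data: (hardness, drop_height) for each glass,
# with the label "breaks" (Green) or "survives" (Blue).
# All values are made up for illustration.
glasses = [
    ((2.0, 8.0), "breaks"),    # Green points: these glasses broke
    ((2.5, 7.0), "breaks"),
    ((3.0, 9.0), "breaks"),
    ((7.0, 2.0), "survives"),  # Blue points: these glasses did not break
    ((8.0, 3.0), "survives"),
    ((6.5, 1.5), "survives"),
]

# The new glass (the purple point) whose outcome we want to predict
new_glass = (4.0, 5.0)
```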

The process behind K-NN

The process is simple, and memorising it will only help you grasp the logic better. Four main steps for you. Here we go!

Step 1

The point in question – a glass of a specific hardness which, if dropped from a specific height, will either break or not – is placed onto the scatter plot with the other glasses.

Step 2

The distance from this point to ALL other points on the scatter plot is measured, using the Euclidean distance formula:

If the points (x1, y1) and (x2, y2) are in 2-dimensional space, then the Euclidean distance between them is

d = √((x2 − x1)² + (y2 − y1)²)

func.in: Euclidean Distance Formula
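In Python, the formula translates directly into a small helper, as in this sketch:

```python
import math

def euclidean_distance(p, q):
    """Euclidean distance between two 2-D points p = (x1, y1) and q = (x2, y2)."""
    return math.sqrt((q[0] - p[0]) ** 2 + (q[1] - p[1]) ** 2)

print(euclidean_distance((1, 2), (4, 6)))  # 5.0, the classic 3-4-5 triangle
```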

Step 3

All the distances are first sorted in ascending order.

We then choose K = 5, a number we pick as part of the algorithm. It tells the algorithm to find that many nearest neighbours of the PURPLE point. In the diagram we have marked the SHORTEST FIVE DISTANCES in RED, since the value of K we have chosen is 5.
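Here is a minimal sketch of Steps 2 and 3 in Python, reusing the invented glasses data from earlier:

```python
import math

# The same hypothetical toy data as before: ((hardness, drop_height), label)
glasses = [
    ((2.0, 8.0), "breaks"), ((2.5, 7.0), "breaks"), ((3.0, 9.0), "breaks"),
    ((7.0, 2.0), "survives"), ((8.0, 3.0), "survives"), ((6.5, 1.5), "survives"),
]
new_glass = (4.0, 5.0)
K = 5

# Step 2: measure the distance from the new point to ALL other points
distances = [(math.dist(new_glass, point), label) for point, label in glasses]

# Step 3: sort the distances in ascending order and keep the K shortest
nearest = sorted(distances)[:K]
for d, label in nearest:
    print(f"{d:.2f} -> {label}")
```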

Step 4

In the image below, I have illustrated the most important steps:
1. The distances are sorted in ascending order, and the top five distances are chosen, since we have selected K=5.
2. The algorithm then checks which class occurs most frequently among those neighbours. In this case it is Green.
3. Since the most frequent class is Green, the algorithm decides: “OK, I think that this glass will break, since the highest frequency is Green!”

func.in: The most important process in the K-NN algorithm
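Putting all four steps together, a minimal end-to-end sketch (again on the made-up data, with knn_predict being just an illustrative helper, not a library function) could look like this:

```python
import math
from collections import Counter

def knn_predict(train, query, k=5):
    """Classify `query` by a majority vote among its k nearest neighbours.

    `train` is a list of ((x, y), label) pairs."""
    # Steps 2-3: compute every distance, sort ascending, keep the k shortest
    nearest = sorted((math.dist(point, query), label) for point, label in train)[:k]
    # Step 4: count the class frequencies among those neighbours, pick the winner
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

# The same invented toy data: Green glasses break, Blue glasses survive
glasses = [
    ((2.0, 8.0), "breaks"), ((2.5, 7.0), "breaks"), ((3.0, 9.0), "breaks"),
    ((7.0, 2.0), "survives"), ((8.0, 3.0), "survives"), ((6.5, 1.5), "survives"),
]
print(knn_predict(glasses, (4.0, 5.0), k=5))  # "breaks": 3 of the 5 neighbours are Green
```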

How do I choose the right value of K?

To select the appropriate value of K, we have to run the K-NN algorithm multiple times, with different values of K. Whichever value of K gives us the best accuracy and/or the fewest errors is the right choice of K. 🙂
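As a sketch of that search, here is one way to do it with scikit-learn’s KNeighborsClassifier and cross-validation – the post itself does not prescribe a library, and the Iris dataset here is just a stand-in:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)  # a stand-in dataset, just for illustration

# Try several odd values of K and keep whichever scores best on average
best_k, best_score = None, 0.0
for k in range(1, 16, 2):
    score = cross_val_score(KNeighborsClassifier(n_neighbors=k), X, y, cv=5).mean()
    print(f"K={k}: mean accuracy {score:.3f}")
    if score > best_score:
        best_k, best_score = k, score

print(f"Best K: {best_k} (mean accuracy {best_score:.3f})")
```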

Quick tips:
1. With a decreasing value of K, predictions become less stable, because a single noisy neighbour can swing the vote. Example: K=3
2. With an increasing value of K, predictions become more stable, mainly because more neighbours get a vote – although if K grows too large, distant points start to drown out the local pattern.
3. Always keep the value of K an odd number. This helps avoid tied votes between the two classes.

Conclusion

K-NN is among the simplest classification algorithms out there. It is important to begin with such simple classification algorithms to understand the absolute basics of Machine Learning. I hope you enjoyed reading this, and in case you have any questions, please ask in the comments below.
