Learn the k-nearest neighbours classifier through animated graphics

Showing 178 data points

The data science of wine

Data science is a hot topic and has been for a while now. Starting with this page, we are going to introduce some key data science algorithms. Here we will see how the k-nearest neighbours (kNN) algorithm works: it classifies data points by considering their closest neighbours. Algorithms like this one are also part of machine learning (ML). The task kNN performs on this page is called classification, meaning that new data has to be assigned a class based on the existing, labelled data. kNN can also be used for regression, which we won't cover here.

Above we see a view of the data set we will work with here. It shows the results of a chemical analysis of wines grown in the same region in Italy but derived from three different cultivars. While there are 13 columns in the file, we are only displaying 2 in the scatter plot above: alcohol content and malic acid concentration. We won't take the data science too seriously here and will focus on the algorithm itself.

k-Nearest neighbours

The red cross shows a new data point whose class we want to estimate from its k = 7 closest neighbours. The closest neighbours are shown in colour while the rest of the data is grey. The new data point is assigned the class that is most common among these neighbours. Click on play to animate a range of new data points to be classified.
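The voting step described above can be sketched in a few lines of Python. This is only a minimal illustration, not the code behind the animation: the function name `knn_classify`, the use of Euclidean distance and the tie-breaking behaviour are our assumptions, and the sample values are made up rather than taken from the wine file.

```python
import math
from collections import Counter

def knn_classify(points, labels, query, k):
    """Return the most common label among the k points closest to query."""
    # Rank all known points by their Euclidean distance to the query.
    order = sorted(range(len(points)), key=lambda i: math.dist(points[i], query))
    # Let the k closest points vote with their labels.
    votes = Counter(labels[i] for i in order[:k])
    # On a tie, Counter returns the tied label that was seen first,
    # which here is the one with the closer neighbour (an assumed rule).
    return votes.most_common(1)[0][0]

# Made-up (alcohol, malic acid) pairs, NOT values from the real data set.
points = [(13.2, 1.8), (13.5, 1.6), (14.1, 1.9),
          (12.3, 1.1), (12.1, 1.5), (12.4, 2.0),
          (13.0, 3.5), (13.4, 3.9)]
labels = ["A", "A", "A", "B", "B", "B", "C", "C"]

print(knn_classify(points, labels, (13.4, 1.7), k=3))  # prints A
```

Sorting every point for every query is the simplest approach and is fine for a data set of this size; larger data sets usually call for a spatial index instead.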

Parameter k = 3

Decision regions

Instead of classifying a single new point, we can ask what the algorithm would predict for every position in the plot. If we colour each spot of the background by the class that its k nearest neighbours vote for, the so-called decision regions of the classifier appear. The tinted areas now show at a glance where a new wine would be labelled, while the dots are the original data on top. Here we used a small k of just 3, and the result is telling: the boundaries are jagged and react strongly to individual points, so that a single outlier can carve out its own little island of colour.
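To make the idea of colouring every position concrete, here is a rough sketch that classifies each cell of a coarse grid and prints the result as text. The helper `decision_grid`, the grid spacing and the toy points are all hypothetical; the plot on this page will use a much finer grid and the real wine data.

```python
import math
from collections import Counter

def knn_classify(points, labels, query, k):
    """Majority vote among the k points closest to query (Euclidean distance)."""
    order = sorted(range(len(points)), key=lambda i: math.dist(points[i], query))
    return Counter(labels[i] for i in order[:k]).most_common(1)[0][0]

def decision_grid(points, labels, k, xs, ys):
    """Classify every (x, y) position; one row per y value, one column per x value."""
    return [[knn_classify(points, labels, (x, y), k) for x in xs] for y in ys]

# Toy data: class A sits in the lower left corner, class B in the upper right.
points = [(0, 0), (1, 0), (0, 1), (4, 4), (5, 4), (4, 5)]
labels = ["A", "A", "A", "B", "B", "B"]

xs = [0, 1, 2, 3, 4, 5]
ys = [0, 1, 2, 3, 4, 5]
grid = decision_grid(points, labels, 3, xs, ys)

# Print the rows top-down so that larger y values appear higher up, as in a plot.
for row in reversed(grid):
    print("".join(row))
```

Each printed letter plays the role of one tinted spot of the background, and the line where the letters change is the decision boundary.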

Parameter k = 9

The effect of k

This plot is made in exactly the same way, but now we let 9 neighbours vote on every position. Using more neighbours smooths out the decision boundaries and makes the classifier far less sensitive to individual noisy points, at the price of blurring the finer detail between the classes. Choosing k is therefore a balancing act: pick it too small and the model overfits to noise, pick it too large and it ignores genuine local structure. Finding a good value of k is at the heart of using the k-nearest neighbours algorithm well.
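The balancing act can be reproduced on a tiny made-up example. In the sketch below, a single class B outlier sits inside a class A cluster, and we classify a point right next to it with three different values of k. The data and the helper `knn_classify` are illustrative assumptions, not part of the wine analysis.

```python
import math
from collections import Counter

def knn_classify(points, labels, query, k):
    """Majority vote among the k points closest to query (Euclidean distance)."""
    order = sorted(range(len(points)), key=lambda i: math.dist(points[i], query))
    return Counter(labels[i] for i in order[:k]).most_common(1)[0][0]

# Five A points form a cluster; one B outlier lies inside it.
# Five more B points form a separate cluster far away.
points = [(1, 1), (1, 2), (2, 1), (2, 2), (1.5, 1.5),              # class A
          (1.6, 1.6),                                              # B outlier
          (5, 5), (5, 6), (6, 5), (6, 6), (5.5, 5.5)]              # class B
labels = ["A"] * 5 + ["B"] * 6

query = (1.7, 1.7)  # a new point right next to the outlier
predictions = {k: knn_classify(points, labels, query, k) for k in (1, 5, 11)}
print(predictions)  # prints {1: 'B', 5: 'A', 11: 'B'}
```

With k = 1 the outlier alone decides, with k = 5 the surrounding A cluster outvotes it, and with k = 11 every point votes, so the overall majority class wins regardless of where the new point lies.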

Putting it to the test

So far we have only looked at the decision regions, but how good is our classifier really? To find out we hold back a random 10% of the wines as a test set and build the decision regions from the remaining 90%, the training set. Each held-out point is then dropped onto the map and classified by its k = 9 nearest training neighbours. A cross marks a wine that was labelled correctly, while a diamond marks one the classifier got wrong. Because the model never saw these points while it was being built, they give us an honest estimate of how it would do on new, unseen wines.
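The hold-out procedure could look roughly like the sketch below. The function `holdout_accuracy`, the fixed random seed and the synthetic two-cluster data are our assumptions for illustration; the page itself works on the wine data and may split it differently.

```python
import math
import random
from collections import Counter

def knn_classify(points, labels, query, k):
    """Majority vote among the k points closest to query (Euclidean distance)."""
    order = sorted(range(len(points)), key=lambda i: math.dist(points[i], query))
    return Counter(labels[i] for i in order[:k]).most_common(1)[0][0]

def holdout_accuracy(points, labels, k, test_fraction=0.1, seed=0):
    """Hold back a random fraction as a test set and return the share classified correctly."""
    rng = random.Random(seed)
    indices = list(range(len(points)))
    rng.shuffle(indices)
    n_test = max(1, round(test_fraction * len(points)))
    test_idx, train_idx = indices[:n_test], indices[n_test:]
    # Only the training points are allowed to vote.
    train_points = [points[i] for i in train_idx]
    train_labels = [labels[i] for i in train_idx]
    correct = sum(
        knn_classify(train_points, train_labels, points[i], k) == labels[i]
        for i in test_idx
    )
    return correct / n_test

# Synthetic stand-in data: two clusters of 30 points each, NOT the wine data.
data_rng = random.Random(1)
points = [(data_rng.gauss(0, 1), data_rng.gauss(0, 1)) for _ in range(30)]
points += [(data_rng.gauss(10, 1), data_rng.gauss(10, 1)) for _ in range(30)]
labels = ["A"] * 30 + ["B"] * 30

accuracy = holdout_accuracy(points, labels, k=9)
print(f"accuracy on the held-out points: {accuracy:.2f}")
```

Because the test points are removed before any voting happens, the returned fraction reflects how the classifier does on data it has not seen. With only a handful of test points, the number can change noticeably from one random split to the next.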

A third dimension

In the panel above we have added another chemical property, the so-called Alcalinity of ash, as the third dimension of the scatter plot. You can rotate the plot with the mouse. Using three instead of two dimensions to find neighbours might improve performance, but it does not necessarily do so.

Testing in three dimensions

Here we have again used approximately 10% of the data as the test set, and if you click play, the page will estimate the performance of the classifier. In my opinion the data set is too small to judge whether we have good classifiers or not, so this is just our toy example for this page. We have not used all available columns with kNN and have focused on what we can display, but neighbours can be found in an arbitrary number of dimensions.
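That the number of dimensions is arbitrary follows from the distance computation, which simply sums over however many coordinates a point has. As a small sketch, the same function below classifies points with two and with three features; the function name and the numbers are made up for illustration and are not rows from the wine file.

```python
import math
from collections import Counter

def knn_classify(points, labels, query, k):
    """Majority vote among the k closest points; works for any number of features."""
    # math.dist accepts coordinate sequences of any (matching) length.
    order = sorted(range(len(points)), key=lambda i: math.dist(points[i], query))
    return Counter(labels[i] for i in order[:k]).most_common(1)[0][0]

# Made-up (alcohol, malic acid, alcalinity of ash) triples.
points_3d = [(13.2, 1.8, 16.0), (13.5, 1.6, 17.0), (14.1, 1.9, 15.5),
             (12.3, 1.1, 21.0), (12.1, 1.5, 20.0), (12.4, 2.0, 22.0)]
labels = ["A", "A", "A", "B", "B", "B"]

# Using all three features.
result_3d = knn_classify(points_3d, labels, (12.5, 1.6, 20.5), k=3)

# Using only the first two features of the same points.
points_2d = [p[:2] for p in points_3d]
result_2d = knn_classify(points_2d, labels, (12.5, 1.6), k=3)

print(result_2d, result_3d)  # prints B B
```

Nothing in the function refers to a specific number of columns, so passing all 13 columns of the file would work the same way.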

Thank you to the providers of the data set, which you can find here on Kaggle. The Claude Opus LLM helped create this page. There are more algorithms and data structures to be found on the main page.