k-Means clustering algorithm

22 Mar 2020 - tsp
Last update 19 Aug 2020
Reading time 4 mins

Since I’ve written a short summary about neuronal network basics a few people / students have also shown some interest in different clustering methods. One of the most simple vector compression or clustering methods is the k-means algorithm.

k-Means is used really often used for stuff like object recognition, recommender systems, etc. - anything that does categorical association. It’s one of the more important algorithms in data analysis and data mining. It’s one of the most simple to implement algorithms - and also one that’s really powerful. In many situations it can provide similar or better performance as neuronal networks. It’s also used heavily in image recognition in models like bag-of-words, etc.

Basic idea

The basic idea behind k-Means is to locate $k$ (hence the name) cluster centers that are located in the mean - one can also imagine at the center of mass - of a group of data points. This means that every point is assigned to a given cluster center which is located exactly in the center of mass of all points assigned to a given cluster.

The k-Means algorithm is used to locate these centers of mass, it in fact minimizes the function

[ E = \sum_{n \in Clusters} \sum_{i \in Cluster} (x_{i;n} - \bar{x}_n)^2 ]

I.e. it minimizes the distance between all nodes and their associated cluster centers. The easiest way to achieved this behavior is the Lloyd algorithm:

Randomly (or heuristically) initializing a fixed number of cluster centers (i.e. $k$ centers). It’s a good idea to randomly choose one of the data points as cluster center instead of sampling totally random values.
Determine to which cluster center each and every data point will get assigned. One chooses the center to which the distance (normally using the euclidean norm, i.e. $d_{j;n} = (\vec{x}_i - \vec{\bar{x}}_n)^2$) is minimized. In case multiple cluster centers are having the same distance one might randomly choose one.
Recalculate the position of cluster centers by taking the mean of all data points that are assigned to a given cluster center
Repeating and re-assigning the data points to cluster centers. That way the cluster centers move and try to minimize the distance function mentioned above.

Of course the choice of initial positions has a huge impact on the quality of cluster centers. In the most basic algorithm one might try multiple iterations with different random starting values for the cluster centers.

At the end one should calculate the variance of each of the cluster centers to get a measure for the quality of the clusters. When now one wants to do categorization one can simply

There are some modifications to this algorithm:

One might not select the cluster center that minimizes the distance to a given point but which has the smallest variance change before and after assignment of the point. This is mostly called MacQueen’s Algorithm. One might do the same iterations as before to move cluster centers into optimal spots.
One might select cluster centers using a different heuristic (see for example K-Means++ below).
Note that you do not have to calculate the square root when calculating the euclidean distance. Also the square is not really required (if one uses absolute values)
Another possible way to select initial cluster centers would be to assign all points randomly to $k$ clusters. Then one can calculate the first cluster centers from these random assignments and then perform standard k-means again.

K-Means++

This is simply an extension to the initial center guessing / determination to select clusters that have a high probability of leading to a useful result.

Initially select a center at (uniformly) random from any of the data points
Repeat till $k$ centers have been chosen:
- For each point compute the distance $D(\vec{x})$ between the data point and the nearest center that has already been chosen.
- choose a new data point random as a new center but use a weighted probability function with a probability $p \propto D(\vec{x})$
Now run standard k-means algorithm as described above and use either distance or variance for cluster assignment of the data points.

Bisecting k-means

This is an algorithm that can be used when one doesn’t know how many cluster centers one wants to use. This method starts with $k=2$ clusters. After the initial run of the k-means or k-means++ algorithm one splits the largest cluster and repeats the same procedure. Then one splits again, etc. till either the separation provided by the clusters (determined by their variance and overlap) is not usable anymore or the desired amount of clusters has been found.

There is an additional method called X-means that works the same way but without any limit on $k$ but with other criteria that decide when one should stop splitting (i.e. when there’s no more information gain from the additional clusters).

Drawbacks

The solution found depends heavily on the starting conditions selected
k-means assumes that clusters are convex.
There is no built-in detection for outliers. One can include such detection when calculating clusters of course.