Grouping with Clustering Algorithms

javascript Aug 16, 2019

A common and introductory unsupervised learning problem is that of clustering. Often, you have large datasets that you wish to organize into smaller groups, or wish to break up into logically similar groups. For instance, you can try to divide census data of household incomes into three groups: low, high, and super rich. If you feed the household income data into a clustering algorithm, you would expect to see three data points as a result, with each corresponding to the average value of your three categories. Even this one-dimensional problem of clustering household incomes may be difficult to do by hand, because you might not know where one group should end and the other should begin. You could use governmental definitions of income brackets, but there's no guarantee that those brackets are geometrically balanced; they were invented by policymakers and may not accurately represent the data.

A cluster is a group of logically similar data points. They can be users with similar behavior, citizens with similar income ranges, pixels with similar colors, and so on. The k-means algorithm is numerical and geometric, so the clusters it identifies will all be numerically similar, with data points that are geometrically close to one another. Fortunately, most data can be represented numerically, so the k-means algorithm is useful for many different problem domains.

The k-means algorithm is a powerful, fast, and popular clustering algorithm for numerical data. The name k-means is comprised of two parts: k, which represents the number of clusters that we want the algorithm to find, and means, which is the method of determining where those cluster centers are (you could, for instance, also use k-medians or k-modes). Translated into plain English, we might ask the algorithm to find us three cluster centers that are the mean values of the points they represent. In that case, k = 3 and we can tell our bosses that we did a k-means analysis with k = 3 when filing our report.

The k-means algorithm is an iterative algorithm, which means it runs a loop and continually updates its model until the model reaches steady state, at which point it will return its results. Put into narrative form, the k-means algorithm works like this: plot the data that you wish to analyze, and pick a value for k. You must know the value of k beforehand, or at least have an idea ofwhat it should be (though we'll also explore a way around this later in the article). Randomly create k points (if k = 5, create five points) and add them to your plot; these points are called the centroids, as they will ultimately represent the geometric centers of the clusters. For each data point in the plot, find the centroid closest to that point and connect or assign it to the point. Once all the assignments have been made, look at each centroid in turn and update its position to the mean position of all the points assigned to it. Repeat the assign-then-update procedure until the centroids stop moving; these final positions of the centroids are the output of the algorithm, and can be considered your cluster centers. If the narrative is hard to follow, don't worry, we'll dig into it more deeply as we build this algorithm from scratch.

In this article , we'll first discuss the concepts of average and distance and how they apply to the k-means algorithm. Then we'll describe the algorithm itself and build a JavaScript class from scratch to implement the k-means algorithm. We'll test our k-means solver with a couple of simple datasets, and then discuss what to do when you don't know the value of k beforehand. We'll build another tool that automates the discovery of the value k. We'll also discuss what the concept of error means for k-means applications, and how to design an error algorithm that helps us achieve our goals.

Average and distance

The k-means algorithm relies on two concepts in order to operate: average and distance. In order to tell you where the center of a cluster is, the algorithm will calculate an average value for these points. In this case, we will use the arithmetic mean, or the sum of values divided by the number of values, to represent our average. In ES5/classic JavaScript (I'm also being purposefully explicit in this example, for any readers who are not familiar with calculating the mean), we might write a function like this:/**

/**
 * @param {Array.<number>} numbers
 * @return {float}
 */
function mean(numbers) {
    var sum = 0, length = numbers.length;

    if (length === 0) {
        /**
         * Mathematically, the mean of an empty set is undefined,
         * so we could return early here. We could also allow the function
         * to attempt dividing 0/0, would would return NaN in JavaScript but
         * fail in some other languages (so probably a bad habit to encourage).
         * Ultimately, I would like this function to not return mixed types,
         * so instead let's throw an error.
         */
        throw new Error('Cannot calculate mean of empty set');
    }

    for (var i = 0; i < length; i++) {
        sum += numbers[i];
    }

    return sum / length;
}

This is a handy ES6 one-liner to keep in your back pocket, however, it assumes all values are already numeric and defined, and it will return NaN if you give it an empty array. If the shorthand is confusing, we can break it up like so:

const sum = (numbers) => numbers.reduce((sum, val) => sum + val, 0);
const mean = (numbers) => sum(numbers) / numbers.length;

Keep in mind we can use any type of average, including the median and mode. In fact, it's sometimes preferable to use k-medians over k-means. The median does a better job of muting outliers than the mean does. You should therefore always ask yourself which average you actually need. If you want a representation of total resources consumed, for instance, you might use the arithmetic mean. If you suspect outliers are caused by faulty measurements and should be ignored, k-medians could suit you better.

We will also need a concept of distance in this algorithm. It can be any distance measure, however, for numeric data you will mostly use the typical Euclidean distance—the standard distance measure you'd learn in high school—which in ES5 JavaScript looks like this for two dimensions:

Calculates only the 2-dimensional distance between two points a and b.
 * Each point should be an array with length = 2, and both elements defined and numeric.
 * @param {Array.number} a
 * @param {Array.number} b
 * @return {float}
 */
function distance2d(a, b) {
    // Difference between b[0] and a[0]
    var diff_0 = b[0] - a[0];
    // Difference between b[1] and a[1]
    var diff_1 = b[1] - a[1];

    return Math.sqrt(diff_0*diff_0 + diff_1*diff_1);
}

We must support many more than two dimensions, however, so we can generalize to the following:

 */
function distance(a, b) {

    var length = a.length,
        sumOfSquares = 0;

    if (length !== b.length) {
        throw new Error('Points a and b must be the same length');
    }

    for (var i = 0; i < length; i++) {
        var diff = b[i] - a[i];
        sumOfSquares += diff*diff;
    }

    return Math.sqrt(sumOfSquares);
}

We can write an ES6 one-liner for this, but it won't be as readable as the lengthier, explicit example:

const distance = (a, b) => Math.sqrt(
    a.map((aPoint, i) => b[i] - aPoint)
     .reduce((sumOfSquares, diff) => sumOfSquares + (diff*diff), 0)
);

Neeraj Dana

Experienced Software Engineer with a demonstrated history of working in the information technology and services industry. Skilled in Angular, React, React-Native, Vue js, Machine Learning