k-means clustering

k-means clustering is a technique for partitioning data points into k distinct groups. It is an example of unsupervised learning: the algorithm refines an initial guess without labeled training data to determine the groupings. The "similarity" of two data entries is usually based on some notion of difference or distance between points. Graphically, the goal of k-means clustering is easy to demonstrate by plotting points and grouping them by color:
[Figure: 2D points colored by cluster, with circles marking the centroids]

The algorithm is built around centroids, which represent the "centers" of the clusters of data (the circles in the illustration). The algorithm's inputs are the data and an initial guess at the k centroids, usually chosen at random. The algorithm then computes, for each data point, which centroid is closest, and assigns the point to that cluster. Once every point is assigned to one of the initially guessed clusters, new centroids are calculated as the center of all the points assigned to each cluster. With new centroids, the cluster assignments may have changed, so the algorithm goes back and recalculates the distances. This process repeats until the clusters no longer change.
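To make one pass of the loop concrete, here is a minimal sketch of the assignment and update steps in Octave, using the same four sample points that appear later in this post (the names X, C, D, and assign are illustrative only, not part of the implementation below):
% One assignment/update round on the sample data
X = [1 1; 2 1; 4 3; 5 4];
C = [1 1; 2 1];              % initial centroid guesses

% assignment step: distance from every point to every centroid
D = zeros(4, 2);
for j = 1:2
   D(:, j) = sqrt(sum((X - C(j, :)).^2, 2));
end
[~, assign] = min(D, [], 2); % assign is [1; 2; 2; 2]

% update step: move each centroid to the mean of its points
for j = 1:2
   C(j, :) = mean(X(assign == j, :), 1);
end
% C is now [1 1; 11/3 8/3]; repeating both steps converges to
% [1.5 1; 4.5 3.5], the same result shown at the end of this post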

Since a poor initial guess can cause the algorithm to miss the "best" clustering assignment, it is usually run with multiple different initial guesses. Some heuristic or other mechanism is then used to decide which of the resulting clustering assignments is "best".
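One common scoring heuristic is the within-cluster sum of squared distances (smaller is better). Here is a minimal sketch of such a multi-restart wrapper around the kMeansCluster function shown below; kMeansRestarts is my own name for it and is not part of the original code:
function [bestCentroids, bestClusters] = kMeansRestarts(X, k, tries)
%KMEANSRESTARTS run kMeansCluster from several random starts and keep
% the result with the lowest within-cluster sum of squared distances
bestScore = Inf;
for t = 1:tries
   % pick k distinct data points as the initial centroid guess
   idx = randperm(size(X, 1));
   [centroids, clusters] = kMeansCluster(X, X(idx(1:k), :));

   % score: sum of squared distances from each point to its centroid
   score = 0;
   for i = 1:size(X, 1)
      c = find(clusters(i, :) == 1);
      score += sum((X(i, :) - centroids(c, :)).^2);
   end

   if score < bestScore
      bestScore = score;
      bestCentroids = centroids;
      bestClusters = clusters;
   end
end
end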

I have implemented the algorithm in Octave, a software package with capabilities similar to Matlab. The implementation works on data of any dimensionality, allowing greater flexibility than classifying only two-dimensional points.
% Our data points, in this case 2d points, but could be in any dimension
% Each row represents a data point, each column represents a dimension
X = [1 1; 2 1; 4 3; 5 4];

% Our initial guess at the centroids, in this case the first two data points
centroids = [1 1; 2 1];

% Make a function call to my k-means clustering function
[centroids, clusterAssignment] = kMeansCluster(X, centroids);
The bulk of the work is done in the function kMeansCluster. It first calculates which cluster each data point belongs to, then updates the centroids accordingly, and repeats until the cluster assignment stops changing. It relies on a trivial helper function, distance, which calculates the Euclidean distance between two points, but this could be any "distance" measure appropriate for your data set (a minimal version of the helper is sketched after the listing below).
function [centroids, clusters] = kMeansCluster(X, initialCentroids)
%KMEANSCLUSTER assign clusters to the data from X based on
% Euclidean distance

% initialize centroids to the initial guess
centroids = initialCentroids;
k = size(centroids, 1); % number of clusters
m = size(X, 1);         % number of data points
% assign clusters to something random so that the
% first cluster calculation is not coincidentally the same,
% causing the algorithm to end prematurely
clusters = rand(m, k);

while 1
   % calculate the new cluster assignment based on the centroids
   % Each row represents one data point; every column will be 0
   % except the column for the cluster this point belongs to,
   % which will be 1
   newClusters = zeros(m, k);
   for i = 1:m
      % Calculate distance to each centroid
      distances = zeros(1, k);
      for j = 1:k
         distances(j) = distance(X(i, :), centroids(j, :));
      end

      % determine which centroid is closest to this data point
      [~, index] = min(distances);

      % Mark the closest cluster with a 1 (every other column in
      % this row is already 0 from the initialization above)
      newClusters(i, index) = 1;
   end

   % update centroids based on new cluster assignments
   for i = 1:k
      total = zeros(1, size(X, 2));
      count = 0;
      for j = 1:m
         if newClusters(j, i) == 1
            total += X(j, :);
            count++;
         end
      end

      if count > 0
         % calculate the average point
         centroids(i, :) = total / count;
      end
      % an empty cluster keeps its previous centroid rather than
      % being reset to the origin
   end

   if isequal(newClusters, clusters)
      % if this is the same as the last cluster, we are done
      break;
   else
      % We have a different cluster assignment, keep iterating
      clusters = newClusters;
   end
end
end
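The distance helper itself is not shown in this post; here is a minimal Euclidean version, consistent with how kMeansCluster calls it:
function d = distance(p1, p2)
%DISTANCE Euclidean distance between two points given as row vectors
% Note: this is a reconstruction; the post references the helper
% but does not include its source
d = sqrt(sum((p1 - p2).^2));
end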
Running this program will correctly determine the clusters and output the centroids too:
% Initial data
%
% 5.....
% 4....x
% 3...x.
% 2.....
% 1xx...
% |12345
X =
    1    1
    2    1
    4    3
    5    4

initialGuess = 
    1    1
    2    1

% [centroids, clusterAssignment] = kMeansCluster(X, initialGuess);
centroids = 
    1.5000   1.0000
    4.5000   3.5000

clusterAssignment = 
    1    0
    1    0
    0    1
    0    1
k-means clustering can be used for many things that may not be obvious from the visual example. For example, we could build a spam classifier. The 2 clusters would be spam and not spam, and the data points would represent email characteristics (possibly normalized), such as:
  • contains wrist watch offers
  • from someone in address book
  • ...

When we run the classifier on a mail sample, the algorithm will output 2 clusters. Some other method must then be used to determine which cluster is spam and which is not. This algorithm will probably not perform as well as Bayesian spam filtering and depends heavily on which characteristics you pick, but it demonstrates k-means clustering's capabilities.
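As a hypothetical sketch (the feature columns below are invented purely for illustration), the emails could be encoded as rows of 0/1 characteristics and fed to the same function:
% Each row is one email; columns are characteristics:
% [contains wrist watch offer, sender in address book]
emails = [1 0;   % watch offer from an unknown sender
          1 0;
          0 1;   % ordinary mail from a known sender
          0 1;
          1 1];

% one initial centroid guess per suspected group
initialGuess = [1 0; 0 1];
[centroids, assignment] = kMeansCluster(emails, initialGuess);
% assignment splits the mail into two groups; deciding which group is
% "spam" still requires labeled examples or manual inspection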

I hope this example k-means clustering implementation and explanation were helpful. Please post any questions in the comments.
