
K-means clustering

Cornell University

In unsupervised learning, we now only have features $x_1, \ldots, x_n$, where $x_i \in \mathbb{R}^d$, and no corresponding labels $y_i$. Therefore, we are not looking to make predictions. Instead, we are interested in uncovering structure in the feature vectors themselves.

In this lecture, we introduce K-means clustering. As we saw with KNN, sometimes we may suffer from the [curse of dimensionality](./2021-09-02-K-Nearest-Neighbors.md/#Curse of Dimenisionality), but any dataset that we can make predictions on must have “interesting” structure. K-means discovers the clustered structure of high dimensional data. (This is a bit different from PCA, which discovers low dimensional structure. PCA transforms high dimensional data points so that they are described by a few lower dimensional vectors. K-means, on the other hand, describes a data point through its relation to each of the clusters: $x_i \to [c_1, c_2, \dots, c_k]$, where $c_j$ describes $x_i$’s relation to cluster $j$.)

Objective

Given a set of $n$ data points $x_1, \ldots, x_n$ with $x_i \in \mathbb{R}^d$, the goal of k-means clustering is to partition the data into $k$ groups where each group contains similar objects as defined by their Euclidean distances. Formally, we want to partition the $n$ feature vectors into $k$ clusters $\mathcal{C}_1, \ldots, \mathcal{C}_k$.

We now define what it means for a clustering to be “good.” We want to find clusters where the data points within each cluster are tightly grouped (i.e., exhibit low variability). Recall that $\| x \|_2^2 = x^T x$, so $\| x_i - x_j \|_2$ is the Euclidean distance between the vectors $x_i$ and $x_j$. Mathematically, for any cluster assignment we define

$$ Z(\mathcal{C}_1,\ldots,\mathcal{C}_k) = \sum_{\ell = 1}^{k} \frac{1}{2|\mathcal{C}_\ell|} \sum_{i,j \in \mathcal{C}_\ell} \| x_i - x_j \|_2^2 $$

The smaller $Z$ is, the better we consider the clustering. Roughly speaking, we are trying to pick a clustering that minimizes the pairwise squared distances within each cluster (normalized by the cluster sizes).

The definition of $Z$ can be rewritten in terms of the centroid of each cluster:

$$ Z(\mathcal{C}_1,\ldots,\mathcal{C}_k) = \sum_{\ell = 1}^{k} Z_\ell = \sum_{\ell = 1}^{k} \sum_{i \in \mathcal{C}_\ell} \| x_i - \mu_\ell \|_2^2 $$

where

$$ \mu_\ell = \frac{1}{|\mathcal{C}_\ell|} \sum_{i \in \mathcal{C}_\ell} x_i, \qquad Z_\ell = \sum_{i \in \mathcal{C}_\ell} \| x_i - \mu_\ell \|_2^2. $$

Here $\mu_\ell$ is the mean/centroid of the points in cluster $\ell$, and $Z_\ell$ is the sum of squared distances from the points in cluster $\ell$ to their centroid.
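To make the two equivalent forms of $Z$ concrete, here is a small NumPy sketch (the function names are ours, not from the lecture) that evaluates both on random data and checks that they agree:

```python
import numpy as np

def kmeans_objective_pairwise(X, labels):
    """Z as the normalized sum of pairwise squared distances within each cluster."""
    Z = 0.0
    for c in np.unique(labels):
        Xc = X[labels == c]                      # points in cluster c
        diffs = Xc[:, None, :] - Xc[None, :, :]  # all pairwise differences
        Z += np.sum(diffs ** 2) / (2 * len(Xc))
    return Z

def kmeans_objective_centroid(X, labels):
    """Z as the sum of squared distances to each cluster's centroid."""
    Z = 0.0
    for c in np.unique(labels):
        Xc = X[labels == c]
        mu = Xc.mean(axis=0)
        Z += np.sum((Xc - mu) ** 2)
    return Z

# The two forms agree up to floating-point error:
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))
labels = rng.integers(0, 3, size=100)
print(kmeans_objective_pairwise(X, labels), kmeans_objective_centroid(X, labels))
```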

Now that we have a formal way to define a good clustering of points, we would like to solve

$$ \min_{\mathcal{C}_1,\ldots,\mathcal{C}_k} Z(\mathcal{C}_1,\ldots,\mathcal{C}_k) $$

Unfortunately, solving this problem exactly is not computationally feasible. The number of ways to partition $n$ points into $k$ clusters grows exponentially in $n$, and we cannot do much better than simply trying all possible $k$-way partitions of the data and seeing which one yields the smallest objective value, so we have to relax our goal of finding the “best” clustering.

Lloyd’s algorithm for k-means

The practical approach we will take to finding a good clustering is to use a local search algorithm.

This does not ensure convergence to the global solution.

Input: data $\{ x_i \}_{i=1}^{n}$ and $k$

Initialize $\mathcal{C}_1, \ldots, \mathcal{C}_k$

While not converged:

  1. Compute $\mu_\ell = \frac{1}{|\mathcal{C}_\ell|} \sum_{i \in \mathcal{C}_\ell} x_i$ for $\ell = 1, \ldots, k$. [$O(dn)$: each feature vector contributes to exactly one centroid.]
  2. Update $\mathcal{C}_1, \ldots, \mathcal{C}_k$ by assigning each data point $x_i$ to the cluster whose centroid $\mu_\ell$ it is closest to. [$O(dnk)$: each feature vector must compute its distance to all $k$ centroids.]
  3. If the cluster assignments didn’t change, we have converged.

Return: cluster assignments $\mathcal{C}_1, \ldots, \mathcal{C}_k$
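Below is a minimal NumPy sketch of Lloyd's algorithm as described above. The function name, the random-assignment initialization, and the `max_iter` safeguard are our choices, and the sketch does not handle clusters that become empty:

```python
import numpy as np

def lloyd_kmeans(X, k, max_iter=100, seed=0):
    """Minimal sketch of Lloyd's algorithm. X has shape (n, d)."""
    rng = np.random.default_rng(seed)
    labels = rng.integers(0, k, size=len(X))  # random initial cluster assignment
    for _ in range(max_iter):
        # Step 1: compute the centroid of each cluster, O(dn).
        centroids = np.array([X[labels == c].mean(axis=0) for c in range(k)])
        # Step 2: assign each point to its nearest centroid, O(dnk).
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        new_labels = dists.argmin(axis=1)
        # Step 3: if the assignments did not change, we have converged.
        if np.array_equal(new_labels, labels):
            break
        labels = new_labels
    return labels, centroids
```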

This process has the nice property that the objective function $Z$ is non-increasing from one iteration to the next (and only fails to decrease once the cluster assignments have converged). This monotonicity, together with the fact that there are only finitely many possible assignments, acts like a loop invariant that guarantees termination.

The cost of Lloyd's algorithm is $\mathcal{O}(ndk)$ per iteration. Note that each distance or centroid computation takes time proportional to the dimension $d$ of each $x_i$. In practice its efficiency is strongly dependent on how many iterations it takes to converge.

How many clusters and how they are initialized

These two are the implicit inputs / hyperparameters we have to decide on for the k-means algorithm.

Cluster Number $k$

We want to pick a $k$ that minimizes $Z_k = Z(\mathcal{C}_1,\ldots,\mathcal{C}_k)$.

However, this is problematic because we expect the in-cluster variation to decrease naturally with $k$. In fact, if we have exactly $n$ clusters, $Z_n = 0$ since every point is its own cluster. So, instead, we often look for the point at which $Z_k$ stops decreasing quickly.

The intuition is that as $k \rightarrow k + 1$, if a single big cluster naturally subdivides into two clear clusters we will see that $Z_{k+1} \ll Z_k$. In contrast, if we see $Z_{k+1} \approx Z_k$, this might mean we have split up a clear cluster that does not naturally subdivide.

Figure 5: Choosing the number of clusters by looking for a "knee" in the objective function.
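As a sketch of this heuristic (assuming scikit-learn is available; its `inertia_` attribute is exactly the centroid form of $Z_k$), we can compute $Z_k$ for a range of $k$ on synthetic data and look for the knee:

```python
import numpy as np
from sklearn.cluster import KMeans

# Three well-separated 2-D Gaussian blobs, so the "true" number of clusters is 3.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(loc=m, scale=0.5, size=(100, 2)) for m in [(0, 0), (4, 0), (0, 4)]])

for k in range(1, 8):
    Z_k = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X).inertia_
    print(k, round(Z_k, 1))
# Z_k drops sharply up to k = 3 and then flattens out: the "knee" in Figure 5.
```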

Initial Cluster Assignment

This is particularly important because Lloyd's algorithm only finds a local optimum. Very different initializations can therefore result in convergence to very different solutions.

Figure 6: The result of k-means clustering for a simple data set from two different initializations---one results in a good clustering and one results in a clustering that is far from optimal.

We can reduce the risk of converging to a bad local optimum by running the algorithm from several different random initializations and keeping the clustering with the smallest objective value $Z$, as sketched below.
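A short sketch of this random-restart strategy (again assuming scikit-learn; its `n_init` parameter runs Lloyd's algorithm that many times and keeps the best result):

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(loc=m, scale=0.5, size=(100, 2)) for m in [(0, 0), (4, 0), (0, 4)]])

# Run Lloyd's algorithm from 20 random initializations; .fit keeps the run with
# the smallest objective Z (reported as .inertia_).
best = KMeans(n_clusters=3, init="random", n_init=20, random_state=0).fit(X)
print(best.inertia_, best.cluster_centers_)
# The default init="k-means++" picks spread-out starting centroids and usually
# needs fewer restarts than purely random initialization.
```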

Decision boundaries

As said at the very beginning, an unsupervised learning algorithm finds underlying structure in the data points.

Observe that given two cluster centroids $\mu_\ell$ and $\mu_{\ell'}$, the set of points equidistant from the two centroids, defined as $\{ x \in \mathbb{R}^d \;|\; \| x - \mu_\ell \|_2 = \| x - \mu_{\ell'} \|_2 \}$, is a hyperplane (in two dimensions, a line): the perpendicular bisector of the segment between $\mu_\ell$ and $\mu_{\ell'}$. This means that given a set of cluster centroids $\mu_1, \ldots, \mu_k$, they implicitly partition space up into $k$ domains whose boundaries are linear.
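A short calculation (not spelled out in the notes, but immediate from the definitions) shows why this boundary is linear: expanding both sides of the equidistance condition and cancelling the $\| x \|_2^2$ terms gives

$$ \| x - \mu_\ell \|_2^2 = \| x - \mu_{\ell'} \|_2^2 \;\Longleftrightarrow\; 2(\mu_{\ell'} - \mu_\ell)^T x = \| \mu_{\ell'} \|_2^2 - \| \mu_\ell \|_2^2, $$

which is a single linear equation in $x$, i.e., a hyperplane with normal vector $\mu_{\ell'} - \mu_\ell$.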

Figure 7: Cluster partition boundaries as determined by k-means clustering---they form a Voronoi tessellation of space based on the cluster centroids.