
Principal Component Analysis

Cornell University

PCA finds the low dimensional structure of high dimensional data.

Basic Ideas

Whereas k-means clustering sought to partition the data into homogeneous subgroups, principal component analysis (PCA) seeks to find, if it exists, low-dimensional structure in the data set $\{x_i\}_{i=1}^{n}$ ($x_i \in \mathbb{R}^{d}$). The directions in which the data vary the most describe this lower-dimensional subspace.

PCA finds a low-dimensional representation of the data that captures most of the interesting behavior. Here, “interesting” is defined as variability. This amounts to computing “composite” features, i.e. linear combinations of the entries of each $x_i$, that explain most of the variability in the data.

Take a simple example: if there exists a unit vector $u \in \mathbb{R}^{d}$ such that $x_i \approx \alpha_i u$ for some scalar $\alpha_i$, then we can roughly represent every $x_i$ with a much smaller number of “features” (in this case only one, $u$). If we accept that $u$ is a valid “feature” for all of our data, we can approximate each data point using only the scalar $\alpha_i$, which describes how $x_i$ behaves along the feature $u$. More concisely, we say that the $x_i$ approximately lie in $\mathrm{span}\{u\}$, a subspace of dimension 1.

This is illustrated in the next figure, where we see two-dimensional data that approximately lies in a one-dimensional subspace $\mathrm{span}\{v\}$. If we know what $v$ is, we can approximate every data point using only a scalar $\alpha$, as sketched after the figure.

Figure 9: An example where two-dimensional data approximately lies in a one-dimensional subspace.
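To make this concrete, here is a minimal NumPy sketch on synthetic data (the direction `u`, the noise level, and the number of points are all illustrative assumptions): once $u$ is known, each two-dimensional point is summarized by the single scalar $\alpha_i = u^T x_i$.

```python
import numpy as np

# Hypothetical example: 200 two-dimensional points that lie close to span{u}.
rng = np.random.default_rng(0)
u = np.array([3.0, 1.0])
u /= np.linalg.norm(u)                 # unit vector spanning the subspace
alpha_true = rng.normal(size=200)      # positions along u
X = np.outer(u, alpha_true) + 0.05 * rng.normal(size=(2, 200))  # d x n, one point per column

# If u is known, each point x_i is summarized by the single scalar alpha_i = u^T x_i.
alpha = u @ X                          # shape (n,)
X_approx = np.outer(u, alpha)          # rank-one approximation of the data
print(np.abs(X - X_approx).max())      # small approximation error
```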

Centering the data

Typically in unsupervised learning, we are interested in understanding relationships between data points and not necessarily bulk properties of the data. If the data has a sufficiently large mean, i.e. $\mu = \frac{1}{n}\sum_i x_i$ is sufficiently far from zero, then the best one-dimensional approximation of each data point is roughly $\mu$ itself.

Figure 10: For data with a non-zero mean the best approximation is achieved using a vector similar to the mean; in contrast, most of the interesting behavior in the data may occur in completely different directions.

Therefore, to actually understand the relationships between features, we center the data before applying PCA by subtracting the mean from each data point. Specifically, we let

$$
{\hat{x}}_{i} = x_{i} - \mu
$$

where $\mu = \frac{1}{n}\sum_i x_i$. We now simply work with the centered feature vectors $\{\hat{x}_i\}_{i=1}^{n}$ and will do so throughout the remainder of these notes.
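A minimal sketch of this centering step, assuming the data are stored column-wise in a NumPy array `X` of shape `(d, n)`; the helper name `center` and the synthetic data are illustrative:

```python
import numpy as np

def center(X):
    """Subtract the sample mean from each column of the d x n data matrix X."""
    mu = X.mean(axis=1, keepdims=True)   # mu = (1/n) * sum_i x_i, shape (d, 1)
    return X - mu, mu

# Example with hypothetical data: 5-dimensional points stored column-wise.
rng = np.random.default_rng(1)
X = rng.normal(loc=10.0, scale=2.0, size=(5, 100))
X_hat, mu = center(X)
print(np.allclose(X_hat.mean(axis=1), 0.0))   # True: the centered data has zero mean
```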

Maximizing the variance

Describing Problem

To do PCA, we want to find a small set of composite features that capture most of the variability in the data. To illustrate this point, we will first consider finding a single composite feature that captures the most variability in the data set, and then proceed to find the second, the third, and so on.

Mathematically, we want to find a vector $\phi = [\alpha_1, \dots, \alpha_d] \in \mathbb{R}^{d}$ such that the sample variance of the scalars

$$
z_{i} = \phi^{T}{\hat{x}}_{i}
$$

is as large as possible. Here $\phi$ plays the role of a weight vector: for each entry $[x_i]_j$ of a data point $x_i$ we assign a weight $\alpha_j$, so $z_i$ is a weighted sum of the entries of $x_i$. We are trying to build a composite feature that varies a lot, so we want to maximize the variance of the $z_i$. Note that the mean $\hat\mu_z$ of the $z_i$ is 0 because we centered the data in the previous step, so the sample variance is

$$
\mathrm{Var}(z) = \frac{1}{n}\sum\limits_{i=1}^{n}(z_i - \hat\mu_z)^2 = \frac{1}{n}\sum\limits_{i=1}^{n} z_i^2 = \frac{1}{n}\sum\limits_{i=1}^{n}(\phi^{T}{\hat{x}}_{i})^2
$$

We also have to constrain $\parallel \phi \parallel_{2} = 1$; otherwise we could artificially inflate the variance by simply increasing the magnitude of the entries of $\phi$.
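As a quick illustration (with synthetic centered data and a randomly chosen unit vector `phi`, both hypothetical), the sample variance of the scalars $z_i$ can be computed as follows:

```python
import numpy as np

# Hypothetical centered data X_hat (d x n) and a candidate unit weight vector phi.
rng = np.random.default_rng(2)
X_hat = rng.normal(size=(3, 500))
X_hat -= X_hat.mean(axis=1, keepdims=True)   # center, so the z_i have zero mean

phi = rng.normal(size=3)
phi /= np.linalg.norm(phi)                   # enforce the constraint ||phi||_2 = 1

z = phi @ X_hat                              # z_i = phi^T x_hat_i, shape (n,)
sample_var = np.mean(z ** 2)                 # mean of z is ~0, so variance is mean(z_i^2)
print(sample_var)
```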

We can now formally define the first principal component of a data set:

First principal component: The first principal component of a data set $\{x_i\}_{i=1}^{n}$ is the vector $\phi \in \mathbb{R}^{d}$ that solves

$$
\max\limits_{\parallel \phi \parallel_{2} = 1}\frac{1}{n}\sum\limits_{i = 1}^{n}(\phi^{T}{\hat{x}}_{i})^{2}
$$

We will consider the data matrix

$$
\hat{X} = \begin{bmatrix} | & & | \\ {\hat{x}}_{1} & \cdots & {\hat{x}}_{n} \\ | & & | \\ \end{bmatrix} \in \mathbb{R}^{d \times n}
$$

This allows us to rephrase the problem as

$$
\max\limits_{\parallel \phi \parallel_{2} = 1} \parallel {\hat{X}}^{T}\phi \parallel_{2}
$$

In other words, $\phi$ is the unit vector whose image under $\hat{X}^{T}$ has the largest possible norm. (Squaring the norm and dividing by $n$ recovers the variance objective above, so the two problems have the same solution.)
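A small numerical check, on hypothetical data, that the two formulations agree: the sum of squares $\sum_i (\phi^{T}\hat{x}_i)^2$ equals $\parallel \hat{X}^{T}\phi \parallel_2^2$.

```python
import numpy as np

rng = np.random.default_rng(3)
X_hat = rng.normal(size=(4, 200))
X_hat -= X_hat.mean(axis=1, keepdims=True)   # hypothetical centered data

phi = rng.normal(size=4)
phi /= np.linalg.norm(phi)                   # a unit vector

# Two ways of writing the same objective (up to the constant factor 1/n):
sum_of_squares = np.sum((phi @ X_hat) ** 2)          # sum_i (phi^T x_hat_i)^2
matrix_norm_sq = np.linalg.norm(X_hat.T @ phi) ** 2  # ||X_hat^T phi||_2^2
print(np.isclose(sum_of_squares, matrix_norm_sq))    # True
```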

Solving Problem - 1st PC

We will show how to find the first PC (principal component) in this section using the singular value decomposition (SVD). To simplify the presentation we make the reasonable assumption that $n \geq d$. We can always decompose the matrix $\hat{X}$ as

$$
\hat{X} = U\Sigma V^{T}
$$

where $U$ is a $d \times d$ orthogonal matrix, $V$ is an $n \times d$ matrix with orthonormal columns, and $\Sigma$ is a $d \times d$ diagonal matrix with $\Sigma_{ii} = \sigma_i \geq 0$ and $\sigma_1 \geq \sigma_2 \geq \cdots \geq \sigma_d$. Recall that we have $n$ data vectors, each of dimension $d$.

$$
U = \begin{bmatrix} | & & | \\ u_{1} & \cdots & u_{d} \\ | & & | \\ \end{bmatrix},\quad \Sigma = \begin{bmatrix} \sigma_{1} & & \\ & \ddots & \\ & & \sigma_{d} \\ \end{bmatrix},\quad\text{and } V = \begin{bmatrix} | & & | \\ v_{1} & \cdots & v_{d} \\ | & & | \\ \end{bmatrix}
$$

We call the $u_i$ the left singular vectors, the $v_i$ the right singular vectors, and the $\sigma_i$ the singular values. This SVD can be computed in $\mathcal{O}(nd^{2})$ time. These are simply the standard definitions associated with the SVD.
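In NumPy this decomposition is available as `np.linalg.svd`; here is a sketch on hypothetical data (note that NumPy returns $V^{T}$ rather than $V$, and the singular values already come sorted in decreasing order):

```python
import numpy as np

rng = np.random.default_rng(4)
d, n = 5, 300
X_hat = rng.normal(size=(d, n))
X_hat -= X_hat.mean(axis=1, keepdims=True)

# Thin SVD: U is d x d, S holds the singular values sigma_1 >= ... >= sigma_d,
# and Vt is d x n (NumPy returns V^T rather than V).
U, S, Vt = np.linalg.svd(X_hat, full_matrices=False)
print(U.shape, S.shape, Vt.shape)        # (5, 5) (5,) (5, 300)
print(np.all(np.diff(S) <= 0))           # singular values come sorted in decreasing order
```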

Claim: $\max_{\parallel \phi \parallel_2 = 1} \parallel \hat{X}^{T}\phi \parallel_2$ is achieved when $\phi = u_1$.

Proof: Note that $U$ is a $d \times d$ orthogonal matrix, so its columns form a basis of $\mathbb{R}^{d}$. Since $\phi \in \mathbb{R}^{d}$, we can write $\phi$ as a linear combination of these basis vectors:

$$
\begin{align} \phi &= \sum\limits_{i = 1}^{d}a_{i}u_{i} \\ &= U\begin{bmatrix} a_1 \\ a_2 \\ \vdots \\ a_d \end{bmatrix} \end{align}
$$

where $\sum_i a_i^2 = 1$ (because we want $\parallel \phi \parallel_2 = 1$). We now observe

$$
\begin{align} \Sigma U^{T} \phi &= \Sigma U^T U \begin{bmatrix} a_1 \\ a_2 \\ \vdots \\ a_d \end{bmatrix} = \Sigma (U^T U) \begin{bmatrix} a_1 \\ a_2 \\ \vdots \\ a_d \end{bmatrix} = \Sigma I \begin{bmatrix} a_1 \\ a_2 \\ \vdots \\ a_d \end{bmatrix} = \begin{bmatrix} \sigma_1 a_1 \\ \sigma_2 a_2 \\ \vdots \\ \sigma_d a_d \\ \end{bmatrix} \end{align}
$$

Therefore,

$$
\begin{align} \parallel {\hat{X}}^{T}\phi \parallel_{2}^{2} &= \parallel V \Sigma U^{T}\phi \parallel_{2}^{2} \\ &= \left\| \begin{bmatrix} | & & | \\ v_{1} & \cdots & v_{d} \\ | & & | \\ \end{bmatrix} \begin{bmatrix} \sigma_1 a_1 \\ \sigma_2 a_2 \\ \vdots \\ \sigma_d a_d \\ \end{bmatrix} \right\|_{2}^{2} \\ &= \left\| \sigma_1 a_1 \begin{bmatrix} | \\ v_{1}\\ | \\ \end{bmatrix} + \sigma_2 a_2 \begin{bmatrix} | \\ v_{2}\\ | \\ \end{bmatrix} + \dots + \sigma_d a_d \begin{bmatrix} | \\ v_{d}\\ | \\ \end{bmatrix} \right\|_{2}^{2} \\ &= \sum\limits_{i = 1}^{d}(\sigma_{i}a_{i})^{2} \end{align}
$$

Since, by assumption, $\sigma_1 \geq \sigma_2 \geq \cdots \geq \sigma_d$, the maximum is achieved when $a_1 = 1$ and $a_i = 0$ for $i \neq 1$, i.e. when $\phi = u_1$. This completes the proof.
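A sanity check of the claim on hypothetical data: no random unit vector beats $u_1$, and the optimal value is exactly $\sigma_1$.

```python
import numpy as np

rng = np.random.default_rng(5)
X_hat = rng.normal(size=(4, 250))
X_hat -= X_hat.mean(axis=1, keepdims=True)

U, S, Vt = np.linalg.svd(X_hat, full_matrices=False)
u1 = U[:, 0]                                      # candidate maximizer from the claim

best_random = 0.0
for _ in range(10000):
    phi = rng.normal(size=4)
    phi /= np.linalg.norm(phi)                    # random unit vector
    best_random = max(best_random, np.linalg.norm(X_hat.T @ phi))

# No random unit vector beats u1, and the optimum equals sigma_1.
print(np.linalg.norm(X_hat.T @ u1) >= best_random)     # True
print(np.isclose(np.linalg.norm(X_hat.T @ u1), S[0]))  # True
```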

Solving Problem - Other PCs

So, the first left singular vector of $\hat{X}$ gives us the first principal component of the data. What about finding additional directions? Formally, we want to find $\psi$ such that $y_i = \psi^{T}\hat{x}_i$ has the next most variability. We cannot simply require $\psi \neq \phi$, because then we could pick a vector arbitrarily close to $\phi$ and capture essentially the same variability. Therefore, we force the second PC to be orthogonal to the first, i.e. $\psi^{T}\phi = 0$. It turns out that the second principal component is $\psi = u_2$, the second column of $U$ in our SVD. Fig. 11 illustrates how the first two principal components look for a stylized data set. We see that they reveal directions in which the data varies significantly.

Figure 11: Two principal components for a simple dataset.

More generally, the SVD actually gives us all the principal components of the data set $\{x_i\}_{i=1}^{n}$.

Principal components: Principal component $\ell$ of the data set $\{x_i\}_{i=1}^{n}$ is denoted $\phi_\ell$ and satisfies

$$
\phi_{\ell} = u_{\ell}
$$

where $\hat{X} = U\Sigma V^{T}$ is the SVD of $\hat{X}$.
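Putting the pieces together, here is a minimal sketch of the full recipe (the helper name `principal_components` and the synthetic data are assumptions, not a fixed API): center the data, compute the SVD, and read off $\phi_\ell$ as the $\ell$-th column of $U$.

```python
import numpy as np

def principal_components(X):
    """Return the principal components (columns of U) and singular values
    of the data matrix X (d x n, one data point per column)."""
    X_hat = X - X.mean(axis=1, keepdims=True)     # center the data
    U, S, _ = np.linalg.svd(X_hat, full_matrices=False)
    return U, S                                   # phi_l is the l-th column of U

# Hypothetical usage:
rng = np.random.default_rng(6)
X = rng.normal(size=(10, 400))
U, S = principal_components(X)
phi_1, phi_2 = U[:, 0], U[:, 1]                   # first and second principal components
```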

Explaining variability in the data

Having obtained the answer, we can look back at our problem $\max_{\parallel \phi \parallel_2 = 1} \parallel \hat{X}^{T}\phi \parallel_2$ in a cleaner way. Take $\phi = \phi_\ell = u_\ell$; since $\hat{X}\hat{X}^{T} = U\Sigma^{2}U^{T}$, this $\phi$ is an eigenvector of $\hat{X}\hat{X}^{T}$ with eigenvalue $\sigma_\ell^{2}$, and

$$
\begin{align} \parallel {\hat{X}}^{T}\phi \parallel_{2}^{2} &= ({\hat{X}}^{T}\phi)^T ({\hat{X}}^{T}\phi) \\ &= \phi^T{\hat{X}}{\hat{X}}^{T}\phi \\ &= \phi^T (\sigma_\ell^{2}\, \phi) &&\text{since ${\hat{X}}{\hat{X}}^{T}\phi = \sigma_\ell^{2}\,\phi$} \\ &= \sigma_\ell^{2}\, \phi^T \phi \\ &= \sigma_\ell^{2} &&\text{since $\parallel \phi \parallel_{2} = 1$} \end{align}
$$

The equation above reminds us that we were maximizing the (scaled) variance of the entries of $\hat{X}^{T}\phi$, and that the maximized value is an eigenvalue of the matrix $\hat{X}\hat{X}^{T}$. In other words, the squared singular value $\sigma_\ell^{2}$ reveals the variability of the scalars $\phi_\ell^{T}\hat{x}_i$ (dividing by $n$ gives their sample variance). This result holds for every PC, not just the first.
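A quick numerical confirmation, on hypothetical data, that the variability captured along $\phi_\ell$ is $\sigma_\ell^{2}$:

```python
import numpy as np

rng = np.random.default_rng(7)
X_hat = rng.normal(size=(6, 500))
X_hat -= X_hat.mean(axis=1, keepdims=True)

U, S, Vt = np.linalg.svd(X_hat, full_matrices=False)

# The variability captured along the l-th principal component is sigma_l^2
# (divide by n to get the sample variance of the scalars phi_l^T x_hat_i).
for l in range(3):
    z = U[:, l] @ X_hat
    print(np.isclose(np.sum(z ** 2), S[l] ** 2))   # True for every component
```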

If we project the data onto the span of the first $k$ principal components, the variability we retain is the sum of the first $k$ eigenvalues of $\hat{X}\hat{X}^{T}$. Formally, the total variability of the data captured by the first $k$ PCs is $\sum_{i=1}^{k}\sigma_i^{2}$. This is the amount of variability we keep with $k$ composite features.

One way to determine $k$, the number of principal components to keep, is to look at how much of the total variance is captured by the first $k$ composite features:

$$
\frac{\sum\limits_{i = 1}^{k}\sigma_{i}^{2}}{\sum\limits_{i = 1}^{d}\sigma_{i}^{2}} = \frac{\text{variance captured by the first $k$ PCs}}{\text{total variance in the data}}
$$

We want to pick a relatively small $k$ that still achieves a large ratio. In practice, we can do something similar to what we did with k-means: pick $k$ by identifying when adding more principal components yields diminishing returns in explained variance, i.e. by looking for a knee in the plot of singular values. A sketch of this bookkeeping follows.
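Here is that sketch, with illustrative singular values and an arbitrary 90% threshold (the helper names `explained_variance_ratio` and `choose_k` are hypothetical):

```python
import numpy as np

def explained_variance_ratio(S):
    """Fraction of total variance captured by the first k PCs, for every k."""
    var = S ** 2
    return np.cumsum(var) / np.sum(var)

def choose_k(S, threshold=0.9):
    """Smallest k whose first k PCs capture at least `threshold` of the variance.
    (The 0.9 threshold is an illustrative choice, not a universal rule.)"""
    ratios = explained_variance_ratio(S)
    return int(np.searchsorted(ratios, threshold) + 1)

# Hypothetical singular values with a clear knee after the second component.
S = np.array([10.0, 7.0, 0.8, 0.5, 0.2])
print(explained_variance_ratio(S))   # roughly [0.67, 0.99, ...]
print(choose_k(S))                   # 2
```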

PCA Application

We use PCA to reduce dimensions.

A common use of PCA is data visualization. In particular, if we have high-dimensional data that is hard to visualize, we can sometimes see key features of the data by plotting its projection onto a few (1, 2, or 3) principal components. For example, projecting onto the first two principal components corresponds to forming a scatter plot of the pairs $(\phi_1^{T}\hat{x}_i, \phi_2^{T}\hat{x}_i)$.
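A minimal visualization sketch using Matplotlib on hypothetical data: project every point onto the first two principal components and scatter-plot the resulting pairs.

```python
import numpy as np
import matplotlib.pyplot as plt

# Hypothetical high-dimensional data, stored one point per column.
rng = np.random.default_rng(8)
X = rng.normal(size=(50, 300))
X_hat = X - X.mean(axis=1, keepdims=True)

U, S, Vt = np.linalg.svd(X_hat, full_matrices=False)

# Project every point onto the first two principal components and scatter-plot.
coords = U[:, :2].T @ X_hat              # shape (2, n): (phi_1^T x_hat_i, phi_2^T x_hat_i)
plt.scatter(coords[0], coords[1], s=10)
plt.xlabel("first principal component")
plt.ylabel("second principal component")
plt.show()
```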