
Sparsity and Dimension Reduction

Cornell University

Ugh, I messed up and threw away the first page of my notes ( ̄艸 ̄)

Dimension Reduction

So far we’ve talked about what determines the number of iterations we have to run for an optimization algorithm to converge. However, the actual running time is the number of iterations multiplied by the cost of one update step. Take the case of Gradient Descent:

\underset{\text{\#iterations of GD}}{\mathcal O(n\kappa \log{\frac 1 \epsilon})} \rightarrow \underset{\text{running time of GD}}{\mathcal O(n\kappa d \log{\frac 1 \epsilon})}

This is the same for other algorithms too. Therefore, to speed up computation, we also want to make d small. This motivates dimension reduction: we want to transform our original problem in dimension D into one in a smaller dimension d.

Usually, dimensionality reduction is done using unsupervised learning (learning without using the labels y). Two examples, described below, are random projection (the JL transform) and the autoencoder.

Random Projection - JL Transform

This is also called the Johnson-Lindenstrauss transform, or just the JL transform, because it is a consequence of the Johnson-Lindenstrauss lemma.

For an arbitrary error tolerance 0 < ε < 1, for any set of n points in ℝ^D, and for any target dimension d > 8 log(n)/ε², there exists a linear map / matrix A: ℝ^D → ℝ^d such that for every pair of points x, y in the set,

(1 - \epsilon) \| x - y \|^2 \le \| Ax - Ay \|^2 \le (1 + \epsilon) \| x - y \|^2.

This property is so nice precisely because the target dimension d does not depend on the original dimension D.

Finding A

How can we find this matrix A, though? We will use a random matrix where each entry is drawn i.i.d. from the normal distribution N(0, σ²).

For this random matrix A, calculate the expected squared distance between two points after the transformation. (Note that A is the only random object here; x and y are just fixed constants.)

\begin{align*}
\mathbb E \left[ \|Ax - Ay\|^2 \right] &= \mathbb E \left[ \|A(x - y)\|^2 \right] \\
&= \mathbb E \left[ (x - y)^T A^T A (x - y) \right]\\
&= (x - y)^T \mathbb E \left[A^T A\right] (x - y) \\
\mathbb E \left[A^T A\right]_{ij} &= \sum^d_{k=1} \mathbb E \left[A^T_{ik} A_{kj} \right]\\
&= \sum^d_{k=1} \mathbb E \left[A_{ki} A_{kj} \right]\\
&= \begin{cases} \sum^d_{k=1} \mathbb E [A_{ki}] \mathbb E [A_{kj}] &\text{if $i \not = j$}\\ \sum^d_{k=1} \mathbb E [A_{ki}^2] &\text{if $i=j$} \end{cases}\\
&= \begin{cases} \sum^d_{k=1} 0 \cdot 0 &\text{if $i \not = j$}\\ \sum^d_{k=1} \sigma^2 &\text{if $i=j$} \end{cases}\\
&= \begin{cases} 0 &\text{if $i \not = j$}\\ d\sigma^2 &\text{if $i=j$} \end{cases}\\
\mathbb E \left[A^T A\right] &= d\sigma^2 I\\
\mathbb E \left[ \|Ax - Ay\|^2 \right] &= d\sigma^2 \|x - y\|^2
\end{align*}

From this derivation, we see that as long as we set σ² = 1/d, the random matrix A preserves squared distances in expectation, as required.
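As a sanity check (not from the original notes), here is a minimal NumPy sketch of this construction; the dimensions, the random seed, and the final ratio check are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)

n, D = 200, 5000                        # number of points and original dimension (made up)
eps = 0.2                               # error tolerance
d = int(8 * np.log(n) / eps**2) + 1     # target dimension from the lemma

X = rng.standard_normal((n, D))         # some arbitrary data points, one per row

# Each entry of A is drawn i.i.d. from N(0, sigma^2) with sigma^2 = 1/d.
A = rng.normal(loc=0.0, scale=np.sqrt(1.0 / d), size=(d, D))

Z = X @ A.T                             # projected points in R^d

# Pairwise squared distances should be preserved up to a factor of (1 +/- eps).
i, j = 3, 7
orig = np.sum((X[i] - X[j]) ** 2)
proj = np.sum((Z[i] - Z[j]) ** 2)
print(proj / orig)                      # typically lands in (1 - eps, 1 + eps)
```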

A Caveat

While using Gaussian random variables is sufficient, it is not very computationally efficient to generate the matrix A, communicate it, and multiply by it. Other constructions (for example, sparse random projections) achieve the same guarantee more cheaply, so if you want to use random projection at scale, you should consider those instead.
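For instance, one well-known alternative (due to Achlioptas, and not covered further in these notes) replaces the Gaussian entries with mostly-zero ±1 entries while keeping the same mean and variance. A rough sketch of that idea:

```python
import numpy as np

rng = np.random.default_rng(0)

def sparse_jl_matrix(d, D):
    """Achlioptas-style sparse projection: each entry is 0 with probability 2/3
    and +/- sqrt(3/d) with probability 1/6 each, so it still has mean 0 and
    variance 1/d, matching the Gaussian construction used above."""
    signs = rng.choice([-1.0, 0.0, 1.0], size=(d, D), p=[1/6, 2/3, 1/6])
    return np.sqrt(3.0 / d) * signs

A = sparse_jl_matrix(d=1000, D=5000)
print((A != 0).mean())   # roughly 1/3 of the entries are nonzero
```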

Autoencoder

We feed our data point x ∈ ℝ^D into an encoder to get a latent vector z ∈ ℝ^d, and feed this latent vector to a decoder that tries to restore the original data point as x̂ ∈ ℝ^D. The training loss is just the squared reconstruction error:

\mathcal L(x, \hat x) = \|x - \hat x\|^2

This is a classic example of unsupervised learning, where there is no label associated with each training input.

When we use an autoencoder for dimension reduction, after training we throw away the decoder and keep only the encoder.
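As an illustration (not from the notes), here is a minimal PyTorch sketch; the layer widths, optimizer, and random stand-in data are all placeholder assumptions:

```python
import torch
from torch import nn

D, d = 784, 32   # original and latent dimensions (placeholders)

encoder = nn.Sequential(nn.Linear(D, 256), nn.ReLU(), nn.Linear(256, d))
decoder = nn.Sequential(nn.Linear(d, 256), nn.ReLU(), nn.Linear(256, D))

params = list(encoder.parameters()) + list(decoder.parameters())
opt = torch.optim.Adam(params, lr=1e-3)

x = torch.randn(64, D)              # stand-in batch of data; no labels needed

for _ in range(100):                # minimize the reconstruction error
    z = encoder(x)                  # latent code in R^d
    x_hat = decoder(z)              # reconstruction in R^D
    loss = ((x - x_hat) ** 2).sum(dim=1).mean()
    opt.zero_grad()
    loss.backward()
    opt.step()

# After training, keep only the encoder for dimension reduction.
z = encoder(x).detach()
```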

The autoencoder has no theoretical guarantee whatsoever for why it works.

— Christopher De Sa

Sparsity

Previously, with dimension reduction, we dealt with the high computational cost of high-dimensional data by reducing to a lower dimension. With sparsity, we instead keep the same dimension but lower the cost of storing and operating on the data. Define:

\operatorname*{density}(x) = \frac {\text{\# nonzero entries of x}} {\text{size of x}}
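For concreteness, a tiny NumPy helper computing this quantity (not part of the original notes):

```python
import numpy as np

def density(x):
    """Fraction of the entries of x that are nonzero."""
    x = np.asarray(x)
    return np.count_nonzero(x) / x.size

print(density([0, 0, 0, 2.8, 0, 0, 3.7, 0]))   # 0.25
```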

Some common ways to make a model sparse:

Sparse Vector

If we have a sparse vector, we can represent it as a list of (index, value) pairs, one for each nonzero entry:

\begin{bmatrix} 0 & 0 & 0 & 2.8 & 0 & 0 & 3.7 & 0\end{bmatrix} = \{(4, 2.8), (7, 3.7) \}

On a computer, this is just two arrays: indices = [4, 7] and values = [2.8, 3.7].

In this way, we have transformed a vector of 8 floats into one length-2 integer array and one length-2 float array.
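Here is a small NumPy sketch of this representation and of why it helps computationally; the 1-indexed positions follow the notes, and the dense vector w is just an illustrative assumption:

```python
import numpy as np

# Sparse representation of [0, 0, 0, 2.8, 0, 0, 3.7, 0] (1-indexed positions).
indices = np.array([4, 7])          # positions of the nonzero entries
values  = np.array([2.8, 3.7])      # the nonzero entries themselves

# A dense vector to take a dot product with.
w = np.arange(1.0, 9.0)             # [1, 2, ..., 8]

# Only the nonzero entries contribute, so we touch 2 elements instead of 8.
dot = np.sum(values * w[indices - 1])   # subtract 1 to convert to 0-indexing
print(dot)                              # 2.8*4 + 3.7*7 = 37.1
```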

Sparse Matrix - COO

To store a sparse matrix, we can use a Coordinate List (COO), where each nonzero entry is stored as a (row index, column index, value) triple; compared to the sparse vector representation, we just add one more index array:

\begin{bmatrix} 0 & 0 & 5.4 & 0 & 0 & 3.6 \\ 0 & 1.7 & 0 & 0 & 0 & 0 \end{bmatrix} = \{(1, 3, 5.4), (1, 6, 3.6), (2, 2, 1.7)\}

On a computer, this is just three arrays: rows = [1, 1, 2], (column) indices = [3, 6, 2], and values = [5.4, 3.6, 1.7].

Here each nonzero entry of the matrix is represented by 3 numbers in the sparse representation, so the density of the original matrix must be below about 1/3 for the sparse format to save memory. In practice, with the extra memory overheads, we usually need density below about 1/4 to get a memory benefit.

However, this representation is awkward to compute with: the triples are in no particular order, so the only way to locate a value is to scan through the whole sparse representation. Accounting for this computational overhead, we typically need density below about 2% to get a real speed benefit.
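For illustration, here is how the example matrix could be built in COO form with scipy.sparse (note that SciPy is 0-indexed, while the notes use 1-indexed rows and columns); the matrix-vector loop below is a hand-rolled sketch showing that COO computation only touches the stored triples:

```python
import numpy as np
from scipy.sparse import coo_matrix

# The example matrix in COO form (SciPy uses 0-indexed rows and columns).
rows = np.array([0, 0, 1])
cols = np.array([2, 5, 1])
vals = np.array([5.4, 3.6, 1.7])

M = coo_matrix((vals, (rows, cols)), shape=(2, 6))
print(M.toarray())

# A matrix-vector product only loops over the stored triples, in any order.
x = np.ones(6)
y = np.zeros(2)
for r, c, v in zip(rows, cols, vals):
    y[r] += v * x[c]
print(y)        # same as the dense product: [9.0, 1.7]
```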

Sparse Matrix Improved - CSR

An improved format goes by the name Compressed Sparse Row (CSR). It stores each row as a sparse vector and concatenates them together, keeping a row-offsets array that records where each row starts in the concatenated arrays for quicker lookup.

[005.4003.601.70000]={(3,5.4),(6,3.6)},{(2,1.7)}\begin{bmatrix} 0 & 0 & 5.4 & 0 & 0 & 3.6 \\ 0 & 1.7 & 0 & 0 & 0 & 0 \end{bmatrix} = \{(3, 5.4), (6, 3.6)\}, \{(2, 1.7)\}
  1. Store rows as sparse vectors:
    • indices: [3, 6] [2]
    • values: [5.4, 3.6] [1.7]
  2. Concat these row vectors with row offset:
    • row_offsets = [1, 3, 4] (the final entry points one past the last stored element)
    • (column) indices = [3, 6, 2]
    • values = [5.4, 3.6, 1.7]

For consistency with the matrix indices, assume these arrays are 1-indexed. s = row_offsets[i] means that the i-th row starts at the s-th element of the indices/values arrays (and ends just before the row_offsets[i+1]-th element); the final entry of row_offsets points one past the last stored element, so the last row also has a well-defined end.

For example, if we want to know what is in row 1, we look up row_offsets[1] = 1 and row_offsets[2] = 3: row 1 starts at the 1st element of the indices-values pairs, and the 3rd element is the first element of row 2, so row 1 consists of the 1st and 2nd elements, namely (3, 5.4) and (6, 3.6). That is, in row 1, column 3 holds the nonzero value 5.4 and column 6 holds the nonzero value 3.6.

This representation format allows faster lookup and computation.
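To make the lookup rule concrete, here is a small NumPy sketch that mirrors the 1-indexed arrays above (including the final sentinel entry of row_offsets); the helper name row_entries is just for illustration:

```python
import numpy as np

# CSR representation of the example matrix, 1-indexed as in the notes.
row_offsets = np.array([1, 3, 4])      # last entry = (number of nonzeros) + 1
col_indices = np.array([3, 6, 2])
values      = np.array([5.4, 3.6, 1.7])

def row_entries(i):
    """Return the (column, value) pairs stored in row i (1-indexed)."""
    start, end = row_offsets[i - 1], row_offsets[i]   # 1-indexed bounds
    return [(int(col_indices[k - 1]), float(values[k - 1]))
            for k in range(start, end)]

print(row_entries(1))    # [(3, 5.4), (6, 3.6)]
print(row_entries(2))    # [(2, 1.7)]

# Sparse matrix-vector product y = M x using the CSR arrays.
x = np.ones(6)
y = np.zeros(2)
for i in range(1, len(row_offsets)):           # rows 1 and 2
    for col, val in row_entries(i):
        y[i - 1] += val * x[col - 1]
print(y)                                       # same as the dense product: [9.0, 1.7]
```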