
Sparsity and Dimension Reduction

Cornell University

Ugh, I messed up and threw away the first page of my notes ( ̄艸 ̄)

Dimension Reduction

So far we’ve talked about what determines the number of iterations we have to run for an optimization algorithm to converge. However, the actual running time is the number of iterations multiplied by the cost of one update step. Take the case of Gradient Descent:

\underset{\text{\#iterations of GD}}{\mathcal O(n\kappa \log{\frac 1 \epsilon})} \rightarrow \underset{\text{running time of GD}}{\mathcal O(n\kappa d \log{\frac 1 \epsilon})}

This is the same for other algorithms too. Therefore, to speed up computation, we also want to make d small. This motivates dimension reduction: we want to transform our original problem in dimension D into one in a smaller dimension d.

Usually, dimensionality reduction is done using unsupervised learning (learning without using the labels y). Two examples, described below, are random projection (the JL transform) and the autoencoder.

Random Projection - JL Transform

This is also called the Johnson-Lindenstrauss transform, or just the JL transform, because it is a consequence of the Johnson-Lindenstrauss lemma.

For an arbitrary error tolerance 0 < ε < 1, for any set of n points in ℝ^D, and for any target dimension d > 8 log(n)/ε², there exists a linear map / matrix A: ℝ^D → ℝ^d such that for every pair of points x, y in the set,

(1 - \epsilon) \| x - y \|^2 \le \| Ax - Ay \|^2 \le (1 + \epsilon) \| x - y \|^2.

This property is so nice precisely because the target dimension d does not depend on the original dimension D.

Finding A

How can we find this matrix A, though? We will use a random matrix where each entry is drawn i.i.d. from the normal distribution N(0, σ²).

For this random matrix A, calculate the expected squared distance between two points after the transformation. (Note that A is the only random object here; x and y are just fixed constants.)

\begin{align*}
\mathbb E \left[ \|Ax - Ay\|^2 \right] &= \mathbb E \left[ \|A(x - y)\|^2 \right] \\
&= \mathbb E \left[ (x - y)^T A^T A (x - y) \right]\\
&= (x - y)^T \mathbb E \left[A^T A\right] (x - y) \\
\mathbb E \left[A^T A\right]_{ij} &= \sum^d_{k=1} \mathbb E \left[A^T_{ik} A_{kj} \right]\\
&= \sum^d_{k=1} \mathbb E \left[A_{ki} A_{kj} \right]\\
&= \begin{cases} \sum^d_{k=1} \mathbb E [A_{ki}] \mathbb E [A_{kj}] &\text{if $i \not = j$}\\ \sum^d_{k=1} \mathbb E [A_{ki}^2] &\text{if $i=j$} \end{cases}\\
&= \begin{cases} \sum^d_{k=1} 0 \cdot 0 &\text{if $i \not = j$}\\ \sum^d_{k=1} \sigma^2 &\text{if $i=j$} \end{cases}\\
&= \begin{cases} 0 &\text{if $i \not = j$}\\ d\sigma^2 &\text{if $i=j$} \end{cases}\\
\mathbb E \left[A^T A\right] &= d\sigma^2 I\\
\mathbb E \left[ \|Ax - Ay\|^2 \right] &= d\sigma^2 \|x - y\|^2
\end{align*}

From this derivation, we see that as long as we set σ² = 1/d, the random matrix A preserves squared distances in expectation, as required.
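As a sanity check (not from the original notes), here is a minimal NumPy sketch of this construction; the dimensions, the random seed, and the final ratio check are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)

n, D = 200, 5000                        # number of points and original dimension (made up)
eps = 0.2                               # error tolerance
d = int(8 * np.log(n) / eps**2) + 1     # target dimension from the lemma

X = rng.standard_normal((n, D))         # some arbitrary data points, one per row

# Each entry of A is drawn i.i.d. from N(0, sigma^2) with sigma^2 = 1/d.
A = rng.normal(loc=0.0, scale=np.sqrt(1.0 / d), size=(d, D))

Z = X @ A.T                             # projected points in R^d

# Pairwise squared distances should be preserved up to a factor of (1 +/- eps).
i, j = 3, 7
orig = np.sum((X[i] - X[j]) ** 2)
proj = np.sum((Z[i] - Z[j]) ** 2)
print(proj / orig)                      # typically lands in (1 - eps, 1 + eps)
```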

A Caveat

While using Gaussian random variables is sufficient, it is not very computationally efficient to generate the matrix A, communicate it, and multiply by it. Other constructions (for example, sparse random projections) achieve the same guarantee more cheaply, so if you want to use random projection at scale, you should consider those instead.
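For instance, one well-known alternative (due to Achlioptas, and not covered further in these notes) replaces the Gaussian entries with mostly-zero ±1 entries while keeping the same mean and variance. A rough sketch of that idea:

```python
import numpy as np

rng = np.random.default_rng(0)

def sparse_jl_matrix(d, D):
    """Achlioptas-style sparse projection: each entry is 0 with probability 2/3
    and +/- sqrt(3/d) with probability 1/6 each, so it still has mean 0 and
    variance 1/d, matching the Gaussian construction used above."""
    signs = rng.choice([-1.0, 0.0, 1.0], size=(d, D), p=[1/6, 2/3, 1/6])
    return np.sqrt(3.0 / d) * signs

A = sparse_jl_matrix(d=1000, D=5000)
print((A != 0).mean())   # roughly 1/3 of the entries are nonzero
```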

Autoencoder

We feed our data point x ∈ ℝ^D into an encoder to get a latent vector z ∈ ℝ^d, and feed this latent vector to a decoder that tries to restore the original data point as x̂ ∈ ℝ^D. The training loss is just the squared reconstruction error:

\mathcal L(x, \hat x) = \|x - \hat x\|^2

This is a classic example of unsupervised learning, where there is no label associated with each training input.

When we use an autoencoder for dimension reduction, after training we throw away the decoder and keep only the encoder.
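As an illustration (not from the notes), here is a minimal PyTorch sketch; the layer widths, optimizer, and random stand-in data are all placeholder assumptions:

```python
import torch
from torch import nn

D, d = 784, 32   # original and latent dimensions (placeholders)

encoder = nn.Sequential(nn.Linear(D, 256), nn.ReLU(), nn.Linear(256, d))
decoder = nn.Sequential(nn.Linear(d, 256), nn.ReLU(), nn.Linear(256, D))

params = list(encoder.parameters()) + list(decoder.parameters())
opt = torch.optim.Adam(params, lr=1e-3)

x = torch.randn(64, D)              # stand-in batch of data; no labels needed

for _ in range(100):                # minimize the reconstruction error
    z = encoder(x)                  # latent code in R^d
    x_hat = decoder(z)              # reconstruction in R^D
    loss = ((x - x_hat) ** 2).sum(dim=1).mean()
    opt.zero_grad()
    loss.backward()
    opt.step()

# After training, keep only the encoder for dimension reduction.
z = encoder(x).detach()
```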

The autoencoder has no theoretical guarantee whatsoever for why it works.

— Christopher De Sa

Sparsity

Previously, with dimension reduction, we dealt with the high computational cost of high-dimensional data by reducing to a lower dimension. With sparsity, we instead keep the same dimension but lower the cost of storing and operating on the data. Define:

\operatorname*{density}(x) = \frac {\text{\# nonzero entries of x}} {\text{size of x}}
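For concreteness, a tiny NumPy helper computing this quantity (not part of the original notes):

```python
import numpy as np

def density(x):
    """Fraction of the entries of x that are nonzero."""
    x = np.asarray(x)
    return np.count_nonzero(x) / x.size

print(density([0, 0, 0, 2.8, 0, 0, 3.7, 0]))   # 0.25
```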

Some common ways to make a model sparse:

Sparse Vector

If we have a sparse vector, we can represent it as a list of (index, value) pairs, one for each nonzero entry:

\begin{bmatrix} 0 & 0 & 0 & 2.8 & 0 & 0 & 3.7 & 0\end{bmatrix} = \{(4, 2.8), (7, 3.7) \}

On a computer, this is just two arrays: indices = [4, 7] and values = [2.8, 3.7].

In this way, we have transformed a vector of 8 floats into one length-2 integer array and one length-2 float array.
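Here is a small NumPy sketch of this representation and of why it helps computationally; the 1-indexed positions follow the notes, and the dense vector w is just an illustrative assumption:

```python
import numpy as np

# Sparse representation of [0, 0, 0, 2.8, 0, 0, 3.7, 0] (1-indexed positions).
indices = np.array([4, 7])          # positions of the nonzero entries
values  = np.array([2.8, 3.7])      # the nonzero entries themselves

# A dense vector to take a dot product with.
w = np.arange(1.0, 9.0)             # [1, 2, ..., 8]

# Only the nonzero entries contribute, so we touch 2 elements instead of 8.
dot = np.sum(values * w[indices - 1])   # subtract 1 to convert to 0-indexing
print(dot)                              # 2.8*4 + 3.7*7 = 37.1
```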

Sparse Matrix - COO

To store a sparse matrix, we can use a Coordinate List (COO), where each nonzero entry is stored as a (row index, column index, value) triple; compared to the sparse vector representation, we just add one more index array:

\begin{bmatrix} 0 & 0 & 5.4 & 0 & 0 & 3.6 \\ 0 & 1.7 & 0 & 0 & 0 & 0 \end{bmatrix} = \{(1, 3, 5.4), (1, 6, 3.6), (2, 2, 1.7)\}

On a computer, this is just three arrays: rows = [1, 1, 2], (column) indices = [3, 6, 2], and values = [5.4, 3.6, 1.7].

Here each nonzero entry of the matrix is represented by 3 numbers in the sparse representation, so the density of the original matrix must be below about 1/3 for the sparse format to save memory. In practice, with the extra memory overheads, we usually need density below about 1/4 to get a memory benefit.

However, this representation is awkward to compute with: the triples are in no particular order, so the only way to locate a value is to scan through the whole sparse representation. Accounting for this computational overhead, we typically need density below about 2% to get a real speed benefit.
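For illustration, here is how the example matrix could be built in COO form with scipy.sparse (note that SciPy is 0-indexed, while the notes use 1-indexed rows and columns); the matrix-vector loop below is a hand-rolled sketch showing that COO computation only touches the stored triples:

```python
import numpy as np
from scipy.sparse import coo_matrix

# The example matrix in COO form (SciPy uses 0-indexed rows and columns).
rows = np.array([0, 0, 1])
cols = np.array([2, 5, 1])
vals = np.array([5.4, 3.6, 1.7])

M = coo_matrix((vals, (rows, cols)), shape=(2, 6))
print(M.toarray())

# A matrix-vector product only loops over the stored triples, in any order.
x = np.ones(6)
y = np.zeros(2)
for r, c, v in zip(rows, cols, vals):
    y[r] += v * x[c]
print(y)        # same as the dense product: [9.0, 1.7]
```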

Sparse Matrix Improved - CSR

An improved format goes by the name Compressed Sparse Row (CSR). It stores each row as a sparse vector and concatenates them together, keeping a row-offsets array that records where each row starts in the concatenated arrays for quicker lookup.

[005.4003.601.70000]={(3,5.4),(6,3.6)},{(2,1.7)}\begin{bmatrix} 0 & 0 & 5.4 & 0 & 0 & 3.6 \\ 0 & 1.7 & 0 & 0 & 0 & 0 \end{bmatrix} = \{(3, 5.4), (6, 3.6)\}, \{(2, 1.7)\}
  1. Store rows as sparse vectors:
    • indices: [3, 6] [2]
    • values: [5.4, 3.6] [1.7]
  2. Concat these row vectors with row offset:
    • row_offsets = [1, 3, 4] (the final entry points one past the last stored element)
    • (column) indices = [3, 6, 2]
    • values = [5.4, 3.6, 1.7]

For consistency with the matrix indices, assume these arrays are 1-indexed. s = row_offsets[i] means that the i-th row starts at the s-th element of the indices/values arrays (and ends just before the row_offsets[i+1]-th element); the final entry of row_offsets points one past the last stored element, so the last row also has a well-defined end.

For example, if we want to know what is in row 1, we look up row_offsets[1] = 1 and row_offsets[2] = 3: row 1 starts at the 1st element of the indices-values pairs, and the 3rd element is the first element of row 2, so row 1 consists of the 1st and 2nd elements, namely (3, 5.4) and (6, 3.6). That is, in row 1, column 3 holds the nonzero value 5.4 and column 6 holds the nonzero value 3.6.

This representation format allows faster lookup and computation.
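To make the lookup rule concrete, here is a small NumPy sketch that mirrors the 1-indexed arrays above (including the final sentinel entry of row_offsets); the helper name row_entries is just for illustration:

```python
import numpy as np

# CSR representation of the example matrix, 1-indexed as in the notes.
row_offsets = np.array([1, 3, 4])      # last entry = (number of nonzeros) + 1
col_indices = np.array([3, 6, 2])
values      = np.array([5.4, 3.6, 1.7])

def row_entries(i):
    """Return the (column, value) pairs stored in row i (1-indexed)."""
    start, end = row_offsets[i - 1], row_offsets[i]   # 1-indexed bounds
    return [(int(col_indices[k - 1]), float(values[k - 1]))
            for k in range(start, end)]

print(row_entries(1))    # [(3, 5.4), (6, 3.6)]
print(row_entries(2))    # [(2, 1.7)]

# Sparse matrix-vector product y = M x using the CSR arrays.
x = np.ones(6)
y = np.zeros(2)
for i in range(1, len(row_offsets)):           # rows 1 and 2
    for col, val in row_entries(i):
        y[i - 1] += val * x[col - 1]
print(y)                                       # same as the dense product: [9.0, 1.7]
```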