
Important Distributions and Linear Regression

Cornell University

Two Important Distributions in ML

Gaussian Distribution

When many independent random variables are added together, the sum tends to be Gaussian distributed, provided the summands have finite mean and finite variance. This is the Central Limit Theorem.
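As a quick illustration (a minimal NumPy sketch; the choice of uniform summands, the sample sizes, and the variable names are just for demonstration), summing many independent finite-variance random variables produces an approximately Gaussian total:

```python
import numpy as np

rng = np.random.default_rng(0)

# Sum 100 independent Uniform(0, 1) variables (finite mean 1/2, finite variance 1/12)
# and repeat the experiment 50,000 times.
n_terms, n_trials = 100, 50_000
sums = rng.uniform(0, 1, size=(n_trials, n_terms)).sum(axis=1)

# By the Central Limit Theorem the sums are approximately N(n_terms/2, n_terms/12).
standardized = (sums - n_terms / 2) / np.sqrt(n_terms / 12)
print(standardized.mean(), standardized.std())   # close to 0 and 1
print(np.mean(np.abs(standardized) < 1.96))      # close to 0.95, as for N(0, 1)
```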

Power Law Distribution

$$P(X) = X^{-k}$$

The sum of random variables with finite mean but infinite variance does not usually converge, but when it does, it gives a Power Law Distribution. Examples of power laws: wealth per person, word frequency.

A power law is scale free: select an arbitrary part of the graph and zoom into it, and it has exactly the same shape as the original graph.
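To make the scale-free claim concrete (using the same loose notation $P(X) = X^{-k}$ as above), rescale the argument by a constant factor $c$:

$$P(cX) = (cX)^{-k} = c^{-k}\,X^{-k} = c^{-k}\,P(X)$$

Zooming in by a factor of $c$ therefore only multiplies the curve by the constant $c^{-k}$; its shape is unchanged. A Gaussian density, which decays like $e^{-x^2}$, does not factor this way and so is not scale free.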

Linear Regression

Assumptions

Note: in the discussion below we assume the line passes through the origin. If it does not, we can add one extra dimension to absorb the offset, as described earlier in the Perceptron notes.


We assume that each label is a real number and depends (approximately) linearly on the features. Mathematically,

$$y_{i} \in \mathbb{R}, \quad y_{i} = \mathbf{w}^\top\mathbf{x}_i + \epsilon_i$$

where $\epsilon_i$ is the noise: we said the labels are only "somewhat" linear, so they cannot all lie exactly on the line. The $\epsilon_i$ here is the offset between the label and the line.

We also assume that the noise is Gaussian distributed

$$\epsilon_i \sim \mathcal{N}(0, \sigma^2)$$

so for a fixed distance, each point has the same probability of being that far above or below the line. A mean of 0 is also reasonable: if the mean were $> 0$, the majority of points would lie above the line, and we could simply shift the line up; similarly, if it were $< 0$, we would shift the line down. Under these assumptions, the distribution of $y_i$ is obtained by shifting the distribution of $\epsilon_i$ by the model's prediction $\mathbf{w}^\top\mathbf{x}_i$:

$$y_i \mid \mathbf{x}_i \sim \mathcal{N}(\mathbf{w}^\top\mathbf{x}_i, \sigma^2) \;\Rightarrow\; P(y_i \mid \mathbf{x}_i,\mathbf{w})=\frac{1}{\sqrt{2\pi\sigma^2}}\,e^{-\frac{(\mathbf{x}_i^\top\mathbf{w}-y_i)^2}{2\sigma^2}}$$

Now that we have the distribution of $y$, we want to find $\mathbf{w}$.
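For concreteness, this conditional density can be evaluated directly in code (a minimal NumPy sketch; `log_likelihood`, the toy data, and the choice `sigma = 0.1` are all made up for illustration):

```python
import numpy as np

def log_likelihood(w, X, y, sigma):
    """Sum of log P(y_i | x_i, w) under the Gaussian noise model.

    X holds one data point per column (d x n); y is a length-n vector.
    """
    residuals = X.T @ w - y   # x_i^T w - y_i for every i
    return np.sum(-0.5 * np.log(2 * np.pi * sigma**2)
                  - residuals**2 / (2 * sigma**2))

# Toy example: the true line y = 2x plus small Gaussian noise.
rng = np.random.default_rng(0)
X = rng.normal(size=(1, 50))
y = 2 * X[0] + rng.normal(scale=0.1, size=50)
print(log_likelihood(np.array([2.0]), X, y, sigma=0.1))  # high (near the best possible)
print(log_likelihood(np.array([0.0]), X, y, sigma=0.1))  # much lower
```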

Estimating with MLE

$$\begin{aligned}
\mathbf{w} &= \operatorname*{argmax}_{\mathbf{w}} P(y_1,\mathbf{x}_1,\dots,y_n,\mathbf{x}_n|\mathbf{w})\\
&= \operatorname*{argmax}_{\mathbf{w}} \prod_{i=1}^n P(y_i,\mathbf{x}_i|\mathbf{w}) && \textrm{Data points are independently sampled.}\\
&= \operatorname*{argmax}_{\mathbf{w}} \prod_{i=1}^n P(y_i|\mathbf{x}_i,\mathbf{w})\,P(\mathbf{x}_i|\mathbf{w}) && \textrm{Chain rule of probability.}\\
&= \operatorname*{argmax}_{\mathbf{w}} \prod_{i=1}^n P(y_i|\mathbf{x}_i,\mathbf{w})\,P(\mathbf{x}_i) && \textrm{$\mathbf{x}_i$ is independent of $\mathbf{w}$; we only model $P(y_i|\mathbf{x}_i)$.}\\
&= \operatorname*{argmax}_{\mathbf{w}} \prod_{i=1}^n P(y_i|\mathbf{x}_i,\mathbf{w}) && \textrm{$P(\mathbf{x}_i)$ is a constant and can be dropped.}\\
&= \operatorname*{argmax}_{\mathbf{w}} \sum_{i=1}^n \log\left[P(y_i|\mathbf{x}_i,\mathbf{w})\right] && \textrm{$\log$ is a monotonic function.}\\
&= \operatorname*{argmax}_{\mathbf{w}} \sum_{i=1}^n \left[ \log\left(\frac{1}{\sqrt{2\pi\sigma^2}}\right) + \log\left(e^{-\frac{(\mathbf{x}_i^\top\mathbf{w}-y_i)^2}{2\sigma^2}}\right)\right] && \textrm{Plug in the Gaussian density.}\\
&= \operatorname*{argmax}_{\mathbf{w}} -\frac{1}{2\sigma^2}\sum_{i=1}^n (\mathbf{x}_i^\top\mathbf{w}-y_i)^2 && \textrm{First term is a constant, and $\log(e^z)=z$.}\\
&= \operatorname*{argmin}_{\mathbf{w}} \frac{1}{n}\sum_{i=1}^n (\mathbf{x}_i^\top\mathbf{w}-y_i)^2 && \textrm{$\frac{1}{n}$ makes the loss interpretable (average squared error).}
\end{aligned}$$

Therefore, maximizing the likelihood with respect to $\mathbf{w}$ is equivalent to minimizing the loss function $l(\mathbf{w}) = \frac{1}{n}\sum_{i=1}^n (\mathbf{x}_i^\top\mathbf{w}-y_i)^2$. This loss is known as the squared loss, and minimizing it is known as Ordinary Least Squares (OLS). OLS has a closed form:

$$\mathbf{w} = (\mathbf{X}\mathbf{X}^\top)^{-1}\mathbf{X}\mathbf{y}^\top, \quad \text{where } \mathbf{X}=\left[\mathbf{x}_1,\dots,\mathbf{x}_n\right] \text{ and } \mathbf{y}=\left[y_1,\dots,y_n\right].$$
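A minimal NumPy sketch of this closed form, under the same convention that $\mathbf{X}$ is $d \times n$ with one data point per column (the function name `ols` and the toy data are illustrative, not part of any particular library):

```python
import numpy as np

def ols(X, y):
    """Closed-form OLS: w = (X X^T)^{-1} X y, with data points as columns of X."""
    # Solve the linear system (X X^T) w = X y instead of forming the inverse,
    # which is the numerically preferred approach.
    return np.linalg.solve(X @ X.T, X @ y)

# Toy example: recover a known weight vector from noiseless data.
rng = np.random.default_rng(0)
d, n = 3, 100
X = rng.normal(size=(d, n))
w_true = np.array([1.0, -2.0, 0.5])
y = X.T @ w_true
print(ols(X, y))   # approximately [1.0, -2.0, 0.5]
```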

Estimating with MAP

With MAP, we can encode a prior belief about $\mathbf{w}$, so we make the additional assumption:

$$\mathbf{w} \sim \mathcal{N}(0, \tau^2 I), \quad P(\mathbf{w}) = \frac{1}{(2\pi\tau^2)^{d/2}}\,e^{-\frac{\mathbf{w}^\top\mathbf{w}}{2\tau^2}}$$

This says: all entries of $\mathbf{w}$ have the same variance $\tau^2$ and are drawn independently of each other, so each weight has the same probability of being large or small.

$$\begin{aligned}
\mathbf{w} &= \operatorname*{argmax}_{\mathbf{w}} P(\mathbf{w}|y_1,\mathbf{x}_1,\dots,y_n,\mathbf{x}_n)\\
&= \operatorname*{argmax}_{\mathbf{w}} \frac{P(y_1,\mathbf{x}_1,\dots,y_n,\mathbf{x}_n|\mathbf{w})\,P(\mathbf{w})}{P(y_1,\mathbf{x}_1,\dots,y_n,\mathbf{x}_n)} && \textrm{Bayes' rule; the denominator does not depend on $\mathbf{w}$.}\\
&= \operatorname*{argmax}_{\mathbf{w}} \left[\prod_{i=1}^n P(y_i|\mathbf{x}_i,\mathbf{w})\right]P(\mathbf{w})\\
&= \operatorname*{argmax}_{\mathbf{w}} \sum_{i=1}^n \log P(y_i|\mathbf{x}_i,\mathbf{w})+ \log P(\mathbf{w})\\
&= \operatorname*{argmin}_{\mathbf{w}} \frac{1}{2\sigma^2} \sum_{i=1}^n (\mathbf{x}_i^\top\mathbf{w}-y_i)^2 + \frac{1}{2\tau^2}\mathbf{w}^\top\mathbf{w} && \textrm{Same steps as in the MLE derivation above.}\\
&= \operatorname*{argmin}_{\mathbf{w}} \frac{1}{n} \sum_{i=1}^n (\mathbf{x}_i^\top\mathbf{w}-y_i)^2 + \lambda\|\mathbf{w}\|_2^2 && \textrm{Multiply by $\frac{2\sigma^2}{n}$ and set $\lambda=\frac{\sigma^2}{n\tau^2}$.}
\end{aligned}$$

This objective is known as Ridge Regression. It has a closed form solution:

$$\mathbf{w} = (\mathbf{X}\mathbf{X}^{\top}+\lambda \mathbf{I})^{-1}\mathbf{X}\mathbf{y}^\top,$$

where $\mathbf{X}=\left[\mathbf{x}_1,\dots,\mathbf{x}_n\right]$ and $\mathbf{y}=\left[y_1,\dots,y_n\right]$.
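The ridge solution differs from the OLS sketch above only by the $\lambda \mathbf{I}$ term (again a minimal NumPy sketch under the data-as-columns convention; `ridge_lambda` is an illustrative name):

```python
import numpy as np

def ridge(X, y, ridge_lambda):
    """Closed-form ridge regression: w = (X X^T + lambda I)^{-1} X y."""
    d = X.shape[0]
    return np.linalg.solve(X @ X.T + ridge_lambda * np.eye(d), X @ y)

# As lambda -> 0 this recovers OLS; larger lambda shrinks w toward zero.
rng = np.random.default_rng(0)
X = rng.normal(size=(3, 100))
y = X.T @ np.array([1.0, -2.0, 0.5])
print(ridge(X, y, 0.0))     # ~ [1.0, -2.0, 0.5]
print(ridge(X, y, 100.0))   # noticeably shrunk toward zero
```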

Summary

Matrix Form Linear Regression

$$\begin{aligned}
\ell(\mathbf{w}) &= \frac{1}{n} \sum_{i=1}^n (\mathbf{x}_i^\top \mathbf{w} - y_i)^2 = \frac{1}{n} \|\mathbf{X}^\top\mathbf{w} - \mathbf{y}^\top\|^2\\
\nabla \ell(\mathbf{w}) &= \frac{2}{n} \sum_{i=1}^n \mathbf{x}_i (\mathbf{x}_i^\top \mathbf{w} - y_i) = \frac{2}{n} \mathbf{X} (\mathbf{X}^\top\mathbf{w} - \mathbf{y}^\top)\\
\nabla^2 \ell(\mathbf{w}) &= \frac{2}{n} \sum_{i=1}^n \mathbf{x}_i \mathbf{x}_i^\top = \frac{2}{n} \mathbf{X}\mathbf{X}^\top
\end{aligned}$$
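These matrix expressions make a gradient-descent implementation only a few lines long (a minimal sketch; the step size and iteration count are arbitrary illustrative choices):

```python
import numpy as np

def ols_gradient_descent(X, y, step_size=0.1, n_steps=500):
    """Minimize (1/n) ||X^T w - y||^2 using the matrix-form gradient above."""
    d, n = X.shape
    w = np.zeros(d)
    for _ in range(n_steps):
        grad = (2.0 / n) * X @ (X.T @ w - y)   # (2/n) X (X^T w - y)
        w -= step_size * grad
    return w

rng = np.random.default_rng(0)
X = rng.normal(size=(3, 200))
y = X.T @ np.array([1.0, -2.0, 0.5])
print(ols_gradient_descent(X, y))   # approaches the closed-form OLS solution
```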

Ordinary Least Squares: minimize $\frac{1}{n}\sum_{i=1}^n (\mathbf{x}_i^\top\mathbf{w}-y_i)^2$; closed form $\mathbf{w} = (\mathbf{X}\mathbf{X}^\top)^{-1}\mathbf{X}\mathbf{y}^\top$.

Ridge Regression: minimize $\frac{1}{n}\sum_{i=1}^n (\mathbf{x}_i^\top\mathbf{w}-y_i)^2 + \lambda\|\mathbf{w}\|_2^2$; closed form $\mathbf{w} = (\mathbf{X}\mathbf{X}^\top+\lambda\mathbf{I})^{-1}\mathbf{X}\mathbf{y}^\top$.