
Important Distributions and Linear Regression

Cornell University

Two Important Distributions in ML

Gaussian Distribution

When many independent random variables are added together, the sum tends to be Gaussian distributed, provided the summands have finite mean and finite variance. This is the Central Limit Theorem.
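As a quick illustration (a minimal NumPy sketch; the choice of uniform summands, the sample sizes, and the variable names are just for demonstration), summing many independent finite-variance random variables produces an approximately Gaussian total:

```python
import numpy as np

rng = np.random.default_rng(0)

# Sum 100 independent Uniform(0, 1) variables (finite mean 1/2, finite variance 1/12)
# and repeat the experiment 50,000 times.
n_terms, n_trials = 100, 50_000
sums = rng.uniform(0, 1, size=(n_trials, n_terms)).sum(axis=1)

# By the Central Limit Theorem the sums are approximately N(n_terms/2, n_terms/12).
standardized = (sums - n_terms / 2) / np.sqrt(n_terms / 12)
print(standardized.mean(), standardized.std())   # close to 0 and 1
print(np.mean(np.abs(standardized) < 1.96))      # close to 0.95, as for N(0, 1)
```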

Power Law Distribution

$$P(X) = X^{-k}$$

The sum of random variables with finite mean but infinite variance does not usually converge, but when it does, it gives a Power Law Distribution. Examples of power laws: wealth per person, word frequency.

A power law is scale free: select an arbitrary part of the graph and zoom into it, and it has exactly the same shape as the original graph.
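To make the scale-free claim concrete (using the same loose notation $P(X) = X^{-k}$ as above), rescale the argument by a constant factor $c$:

$$P(cX) = (cX)^{-k} = c^{-k}\,X^{-k} = c^{-k}\,P(X)$$

Zooming in by a factor of $c$ therefore only multiplies the curve by the constant $c^{-k}$; its shape is unchanged. A Gaussian density, which decays like $e^{-x^2}$, does not factor this way and so is not scale free.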

Linear Regression

Assumptions

Note: in the discussion below we assume the line passes through the origin. If it does not, we can add one extra dimension to absorb the offset, as described earlier in the Perceptron notes.


We assume that each label is a real number and depends (approximately) linearly on the features. Mathematically,

$$y_{i} \in \mathbb{R}, \quad y_{i} = \mathbf{w}^\top\mathbf{x}_i + \epsilon_i$$

where $\epsilon_i$ is the noise: we said the labels are only "somewhat" linear, so they cannot all lie exactly on the line. The $\epsilon_i$ here is the offset between the label and the line.

We also assume that the noise is Gaussian distributed

$$\epsilon_i \sim \mathcal{N}(0, \sigma^2)$$

so for a fixed distance, each point has the same probability of being that far above or below the line. A mean of 0 is also reasonable: if the mean were $> 0$, the majority of points would lie above the line, and we could simply shift the line up; similarly, if it were $< 0$, we would shift the line down. Under these assumptions, the distribution of $y_i$ is obtained by shifting the distribution of $\epsilon_i$ by the model's prediction $\mathbf{w}^\top\mathbf{x}_i$:

$$y_i \mid \mathbf{x}_i \sim \mathcal{N}(\mathbf{w}^\top\mathbf{x}_i, \sigma^2) \;\Rightarrow\; P(y_i \mid \mathbf{x}_i,\mathbf{w})=\frac{1}{\sqrt{2\pi\sigma^2}}\,e^{-\frac{(\mathbf{x}_i^\top\mathbf{w}-y_i)^2}{2\sigma^2}}$$

Now that we have the distribution of $y$, we want to find $\mathbf{w}$.
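For concreteness, this conditional density can be evaluated directly in code (a minimal NumPy sketch; `log_likelihood`, the toy data, and the choice `sigma = 0.1` are all made up for illustration):

```python
import numpy as np

def log_likelihood(w, X, y, sigma):
    """Sum of log P(y_i | x_i, w) under the Gaussian noise model.

    X holds one data point per column (d x n); y is a length-n vector.
    """
    residuals = X.T @ w - y   # x_i^T w - y_i for every i
    return np.sum(-0.5 * np.log(2 * np.pi * sigma**2)
                  - residuals**2 / (2 * sigma**2))

# Toy example: the true line y = 2x plus small Gaussian noise.
rng = np.random.default_rng(0)
X = rng.normal(size=(1, 50))
y = 2 * X[0] + rng.normal(scale=0.1, size=50)
print(log_likelihood(np.array([2.0]), X, y, sigma=0.1))  # high (near the best possible)
print(log_likelihood(np.array([0.0]), X, y, sigma=0.1))  # much lower
```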

Estimating with MLE

$$\begin{aligned}
\mathbf{w} &= \operatorname*{argmax}_{\mathbf{w}} P(y_1,\mathbf{x}_1,\dots,y_n,\mathbf{x}_n|\mathbf{w})\\
&= \operatorname*{argmax}_{\mathbf{w}} \prod_{i=1}^n P(y_i,\mathbf{x}_i|\mathbf{w}) && \textrm{Data points are independently sampled.}\\
&= \operatorname*{argmax}_{\mathbf{w}} \prod_{i=1}^n P(y_i|\mathbf{x}_i,\mathbf{w})\,P(\mathbf{x}_i|\mathbf{w}) && \textrm{Chain rule of probability.}\\
&= \operatorname*{argmax}_{\mathbf{w}} \prod_{i=1}^n P(y_i|\mathbf{x}_i,\mathbf{w})\,P(\mathbf{x}_i) && \textrm{$\mathbf{x}_i$ is independent of $\mathbf{w}$; we only model $P(y_i|\mathbf{x}_i)$.}\\
&= \operatorname*{argmax}_{\mathbf{w}} \prod_{i=1}^n P(y_i|\mathbf{x}_i,\mathbf{w}) && \textrm{$P(\mathbf{x}_i)$ is a constant and can be dropped.}\\
&= \operatorname*{argmax}_{\mathbf{w}} \sum_{i=1}^n \log\left[P(y_i|\mathbf{x}_i,\mathbf{w})\right] && \textrm{$\log$ is a monotonic function.}\\
&= \operatorname*{argmax}_{\mathbf{w}} \sum_{i=1}^n \left[ \log\left(\frac{1}{\sqrt{2\pi\sigma^2}}\right) + \log\left(e^{-\frac{(\mathbf{x}_i^\top\mathbf{w}-y_i)^2}{2\sigma^2}}\right)\right] && \textrm{Plug in the Gaussian density.}\\
&= \operatorname*{argmax}_{\mathbf{w}} -\frac{1}{2\sigma^2}\sum_{i=1}^n (\mathbf{x}_i^\top\mathbf{w}-y_i)^2 && \textrm{First term is a constant, and $\log(e^z)=z$.}\\
&= \operatorname*{argmin}_{\mathbf{w}} \frac{1}{n}\sum_{i=1}^n (\mathbf{x}_i^\top\mathbf{w}-y_i)^2 && \textrm{$\frac{1}{n}$ makes the loss interpretable (average squared error).}
\end{aligned}$$

Therefore, maximizing the likelihood with respect to $\mathbf{w}$ is equivalent to minimizing the loss function $l(\mathbf{w}) = \frac{1}{n}\sum_{i=1}^n (\mathbf{x}_i^\top\mathbf{w}-y_i)^2$. This loss is known as the squared loss, and minimizing it is known as Ordinary Least Squares (OLS). OLS has a closed form:

$$\mathbf{w} = (\mathbf{X}\mathbf{X}^\top)^{-1}\mathbf{X}\mathbf{y}^\top, \quad \text{where } \mathbf{X}=\left[\mathbf{x}_1,\dots,\mathbf{x}_n\right] \text{ and } \mathbf{y}=\left[y_1,\dots,y_n\right].$$
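A minimal NumPy sketch of this closed form, under the same convention that $\mathbf{X}$ is $d \times n$ with one data point per column (the function name `ols` and the toy data are illustrative, not part of any particular library):

```python
import numpy as np

def ols(X, y):
    """Closed-form OLS: w = (X X^T)^{-1} X y, with data points as columns of X."""
    # Solve the linear system (X X^T) w = X y instead of forming the inverse,
    # which is the numerically preferred approach.
    return np.linalg.solve(X @ X.T, X @ y)

# Toy example: recover a known weight vector from noiseless data.
rng = np.random.default_rng(0)
d, n = 3, 100
X = rng.normal(size=(d, n))
w_true = np.array([1.0, -2.0, 0.5])
y = X.T @ w_true
print(ols(X, y))   # approximately [1.0, -2.0, 0.5]
```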

Estimating with MAP

With MAP, we can encode a prior belief about $\mathbf{w}$, so we make the additional assumption:

$$\mathbf{w} \sim \mathcal{N}(0, \tau^2 I), \quad P(\mathbf{w}) = \frac{1}{(2\pi\tau^2)^{d/2}}\,e^{-\frac{\mathbf{w}^\top\mathbf{w}}{2\tau^2}}$$

This says: all entries of $\mathbf{w}$ have the same variance $\tau^2$ and are drawn independently of each other, so each weight has the same probability of being large or small.

$$\begin{aligned}
\mathbf{w} &= \operatorname*{argmax}_{\mathbf{w}} P(\mathbf{w}|y_1,\mathbf{x}_1,\dots,y_n,\mathbf{x}_n)\\
&= \operatorname*{argmax}_{\mathbf{w}} \frac{P(y_1,\mathbf{x}_1,\dots,y_n,\mathbf{x}_n|\mathbf{w})\,P(\mathbf{w})}{P(y_1,\mathbf{x}_1,\dots,y_n,\mathbf{x}_n)} && \textrm{Bayes' rule; the denominator does not depend on $\mathbf{w}$.}\\
&= \operatorname*{argmax}_{\mathbf{w}} \left[\prod_{i=1}^n P(y_i|\mathbf{x}_i,\mathbf{w})\right]P(\mathbf{w})\\
&= \operatorname*{argmax}_{\mathbf{w}} \sum_{i=1}^n \log P(y_i|\mathbf{x}_i,\mathbf{w})+ \log P(\mathbf{w})\\
&= \operatorname*{argmin}_{\mathbf{w}} \frac{1}{2\sigma^2} \sum_{i=1}^n (\mathbf{x}_i^\top\mathbf{w}-y_i)^2 + \frac{1}{2\tau^2}\mathbf{w}^\top\mathbf{w} && \textrm{Same steps as in the MLE derivation above.}\\
&= \operatorname*{argmin}_{\mathbf{w}} \frac{1}{n} \sum_{i=1}^n (\mathbf{x}_i^\top\mathbf{w}-y_i)^2 + \lambda\|\mathbf{w}\|_2^2 && \textrm{Multiply by $\frac{2\sigma^2}{n}$ and set $\lambda=\frac{\sigma^2}{n\tau^2}$.}
\end{aligned}$$

This objective is known as Ridge Regression. It has a closed form solution:

$$\mathbf{w} = (\mathbf{X}\mathbf{X}^{\top}+\lambda \mathbf{I})^{-1}\mathbf{X}\mathbf{y}^\top,$$

where $\mathbf{X}=\left[\mathbf{x}_1,\dots,\mathbf{x}_n\right]$ and $\mathbf{y}=\left[y_1,\dots,y_n\right]$.
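The ridge solution differs from the OLS sketch above only by the $\lambda \mathbf{I}$ term (again a minimal NumPy sketch under the data-as-columns convention; `ridge_lambda` is an illustrative name):

```python
import numpy as np

def ridge(X, y, ridge_lambda):
    """Closed-form ridge regression: w = (X X^T + lambda I)^{-1} X y."""
    d = X.shape[0]
    return np.linalg.solve(X @ X.T + ridge_lambda * np.eye(d), X @ y)

# As lambda -> 0 this recovers OLS; larger lambda shrinks w toward zero.
rng = np.random.default_rng(0)
X = rng.normal(size=(3, 100))
y = X.T @ np.array([1.0, -2.0, 0.5])
print(ridge(X, y, 0.0))     # ~ [1.0, -2.0, 0.5]
print(ridge(X, y, 100.0))   # noticeably shrunk toward zero
```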

Summary

Matrix Form Linear Regression

$$\begin{aligned}
\ell(\mathbf{w}) &= \frac{1}{n} \sum_{i=1}^n (\mathbf{x}_i^\top \mathbf{w} - y_i)^2 = \frac{1}{n} \|\mathbf{X}^\top\mathbf{w} - \mathbf{y}^\top\|^2\\
\nabla \ell(\mathbf{w}) &= \frac{2}{n} \sum_{i=1}^n \mathbf{x}_i (\mathbf{x}_i^\top \mathbf{w} - y_i) = \frac{2}{n} \mathbf{X} (\mathbf{X}^\top\mathbf{w} - \mathbf{y}^\top)\\
\nabla^2 \ell(\mathbf{w}) &= \frac{2}{n} \sum_{i=1}^n \mathbf{x}_i \mathbf{x}_i^\top = \frac{2}{n} \mathbf{X}\mathbf{X}^\top
\end{aligned}$$
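These matrix expressions make a gradient-descent implementation only a few lines long (a minimal sketch; the step size and iteration count are arbitrary illustrative choices):

```python
import numpy as np

def ols_gradient_descent(X, y, step_size=0.1, n_steps=500):
    """Minimize (1/n) ||X^T w - y||^2 using the matrix-form gradient above."""
    d, n = X.shape
    w = np.zeros(d)
    for _ in range(n_steps):
        grad = (2.0 / n) * X @ (X.T @ w - y)   # (2/n) X (X^T w - y)
        w -= step_size * grad
    return w

rng = np.random.default_rng(0)
X = rng.normal(size=(3, 200))
y = X.T @ np.array([1.0, -2.0, 0.5])
print(ols_gradient_descent(X, y))   # approaches the closed-form OLS solution
```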

Ordinary Least Squares: minimize $\frac{1}{n}\sum_{i=1}^n (\mathbf{x}_i^\top\mathbf{w}-y_i)^2$; closed form $\mathbf{w} = (\mathbf{X}\mathbf{X}^\top)^{-1}\mathbf{X}\mathbf{y}^\top$.

Ridge Regression: minimize $\frac{1}{n}\sum_{i=1}^n (\mathbf{x}_i^\top\mathbf{w}-y_i)^2 + \lambda\|\mathbf{w}\|_2^2$; closed form $\mathbf{w} = (\mathbf{X}\mathbf{X}^\top+\lambda\mathbf{I})^{-1}\mathbf{X}\mathbf{y}^\top$.