
Logistic Regression

Cornell University

Background Setup

In Logistic Regression, the label $y$ takes one of two values, $-1$ or $+1$, so $y\in\{+1,-1\}$. We also assume that $P(y \mid \mathbf{x}_i)$ takes exactly this form (if such an assumption strikes you as ridiculous, refer back to the first lecture where we talked about assumptions; it is indeed bold and possibly wrong, but the model works very well when the true distribution looks something like this):

$$P(y \mid \mathbf{x}_i)=\frac{1}{1+e^{-y(\mathbf{w}^T \mathbf{x}_i+b)}}$$

where $\mathbf{w}$ and $b$ are the parameters of this classifier. As a little sanity check, we can verify that $P(+1 \mid \mathbf{x}_i) + P(-1 \mid \mathbf{x}_i) = 1$.

The prediction result of Logistic Regression will be

$$h(\mathbf{x}) = P(+1 \mid \mathbf{x}) = \frac{1}{1+e^{-(\mathbf{w}^T \mathbf{x}+b)}}$$
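As a concrete illustration, here is a minimal sketch of this prediction in NumPy. The parameter values and the test point are made up for demonstration; only the formula itself comes from the notes above.

```python
import numpy as np

def predict_proba(x, w, b):
    """P(y = +1 | x) under the logistic model."""
    return 1.0 / (1.0 + np.exp(-(w @ x + b)))

# Hypothetical parameters and test point, for illustration only.
w = np.array([2.0, -1.0])
b = 0.5
x = np.array([0.3, 0.8])

p_pos = predict_proba(x, w, b)
p_neg = 1.0 - p_pos                  # equals 1 / (1 + exp(+(w^T x + b)))
print(p_pos, p_neg, p_pos + p_neg)   # the two probabilities sum to 1
```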

Logistic Regression and Perceptron

LR’s Decision Boundary is Linear

For a test point $\mathbf{x}$, we assign the label $y=+1$ if

$$\begin{aligned}
& P(y=+1\mid \mathbf{x}) \gt P(y=-1\mid \mathbf{x}) \\
\iff & 1+e^{(\mathbf{w}^T \mathbf{x}+b)} \gt 1+e^{-(\mathbf{w}^T \mathbf{x}+b)} \\
\iff & e^{(\mathbf{w}^T \mathbf{x}+b)} \gt e^{-(\mathbf{w}^T \mathbf{x}+b)} \\
\iff & \mathbf{w}^T \mathbf{x}+b \gt 0
\end{aligned}$$

As we see from the above, it is ultimately a linear inequality that determines our label assignment, so the decision boundary $\mathbf{w}^T \mathbf{x}+b = 0$ is a hyperplane.
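To make this equivalence concrete, the following sketch (with hypothetical parameters and random test points) checks that thresholding the predicted probability at 0.5 yields the same label as the sign of $\mathbf{w}^T \mathbf{x}+b$:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
w, b = np.array([2.0, -1.0]), 0.5        # hypothetical parameters

for _ in range(5):
    x = rng.normal(size=2)               # random test point
    label_from_prob = +1 if sigmoid(w @ x + b) > 0.5 else -1
    label_from_sign = +1 if (w @ x + b) > 0 else -1
    assert label_from_prob == label_from_sign
```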

Difference from Perceptron

Even though Logistic Regression has a linear decision boundary, this does not mean it finds the same boundary that another linear model (such as the Perceptron) would. In fact, LR is in many ways much better than the Perceptron, even though both draw a hyperplane to classify data points into two classes:

| Perceptron | Logistic Regression |
| --- | --- |
| Only cares which side of the hyperplane a point lies on, not its distance from the plane. A point has only two possibilities: it is classified on one side or the other. This is overly binary and ignores the variability in distance to the hyperplane. | Cares about the point's distance to the hyperplane. The predicted value represents how confident we are in our estimate. When $\mathbf{w}^T\mathbf{x}+b=0$, the point $\mathbf{x}$ lies on the plane, which is equivalent to assigning probability 0.5 to each of the two classes. |

LR and Gaussian Naive Bayes

Logistic Regression is implied by a special form of Gaussian Naive Bayes. Recall that for Gaussian Naive Bayes, the distribution of each feature given a label $c$ is $P(x_\alpha \mid y=c) = \mathcal{N}\left(\mu_{\alpha c}, \sigma^{2}_{\alpha c}\right)$. Here, let us additionally assume that each feature has the same variance regardless of the label, so we actually have

$$P(x_\alpha \mid y=c) = \mathcal{N}\left(\mu_{\alpha c}, \sigma^{2}_{\alpha}\right)$$

In this case, we can actually prove that our Gaussian Naive Bayes gives the same result as a Logistic Regression with

$$w_i = \frac{\mu_{i,+1}-\mu_{i,-1}}{\sigma^2_i}, \qquad b = \ln \frac{\pi}{1-\pi}+\sum_i \frac{\mu^2_{i,-1}-\mu^2_{i,+1}}{2\sigma^2_i},$$

where $\pi = P(y=+1)$ denotes the prior probability of the positive class.

A detailed proof can be found here in section 3.1.

Therefore, we have shown that Logistic Regression, a discriminative model, can be derived from a special case of Gaussian Naive Bayes, a generative model. We usually call Logistic Regression the discriminative counterpart of Naive Bayes.
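As a quick numerical sanity check, here is a minimal sketch (with made-up Gaussian Naive Bayes parameters, assuming NumPy and $\pi = P(y=+1)$) verifying that the posterior of such a shared-variance Gaussian Naive Bayes model is exactly the logistic form above with the stated $w_i$ and $b$:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical Gaussian Naive Bayes parameters (shared per-feature variance).
mu_pos = np.array([1.0, 2.0])    # feature means for y = +1
mu_neg = np.array([-1.0, 0.5])   # feature means for y = -1
sigma2 = np.array([1.5, 0.8])    # per-feature variance, identical for both labels
prior_pos = 0.3                  # pi = P(y = +1)

def log_gaussian(x, mu, var):
    return -0.5 * np.log(2 * np.pi * var) - (x - mu) ** 2 / (2 * var)

def gnb_posterior_pos(x):
    """P(y = +1 | x) computed directly from the generative model."""
    log_joint_pos = np.log(prior_pos) + log_gaussian(x, mu_pos, sigma2).sum()
    log_joint_neg = np.log(1 - prior_pos) + log_gaussian(x, mu_neg, sigma2).sum()
    return 1.0 / (1.0 + np.exp(log_joint_neg - log_joint_pos))

# The logistic-regression parameters implied by this generative model.
w = (mu_pos - mu_neg) / sigma2
b = np.log(prior_pos / (1 - prior_pos)) + ((mu_neg**2 - mu_pos**2) / (2 * sigma2)).sum()

for _ in range(5):
    x = rng.normal(size=2)
    assert np.isclose(gnb_posterior_pos(x), 1.0 / (1.0 + np.exp(-(w @ x + b))))
```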

Estimating $\mathbf{w}$ in LR

Throughout this section we absorb the parameter $b$ into $\mathbf{w}$ through an additional constant dimension (similar to the Perceptron).

In addition, we only give the objective that defines $\hat{\mathbf{w}}$. We will talk about methods for actually computing the minimizer in the next lecture.

Maximum Likelihood Estimate (MLE)

We want to find the parameter $\mathbf{w}$ that maximizes

$$P(\mathbf{y} \mid X; \mathbf{w}) = \prod_{i=1}^n P(y_i \mid \mathbf{x}_i; \mathbf{w})$$

where $X=\left[\mathbf{x}_1, \dots,\mathbf{x}_i, \dots, \mathbf{x}_n\right] \in \mathbb{R}^{d \times n}$ collects all the training feature vectors and $\mathbf{y}$ collects all of their labels. The likelihood factorizes into this product because the samples are drawn i.i.d. Plugging in the model from above,

$$\begin{aligned}
\hat{\mathbf{w}}_{MLE} &= \operatorname*{argmax}_{\mathbf{w}} \; \log \bigg(\prod_{i=1}^n P(y_i \mid \mathbf{x}_i; \mathbf{w})\bigg)\\
&= \operatorname*{argmax}_{\mathbf{w}} -\sum_{i=1}^n \log(1+e^{-y_i \mathbf{w}^T \mathbf{x}_i})\\
&= \operatorname*{argmin}_{\mathbf{w}} \sum_{i=1}^n \log(1+e^{-y_i \mathbf{w}^T \mathbf{x}_i})
\end{aligned}$$

Note that unlike in some other algorithms, we do not place a constraint on $\mathbf{w}$ here. That is because the magnitude $\|\mathbf{w}\|$ matters: $\mathbf{w}$ does not merely define a hyperplane; its norm also affects how quickly the predicted probability transitions from 0 to 1. When $\|\mathbf{w}\|$ is large, Logistic Regression changes faster and the sigmoid appears steeper.
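For concreteness, here is a minimal sketch of this negative log-likelihood (and its gradient, which the next lecture's optimization methods would use). It assumes NumPy, an $n \times d$ matrix X whose rows are the $\mathbf{x}_i$ (with a constant column absorbing $b$), and labels y in $\{-1,+1\}$:

```python
import numpy as np

def logistic_nll(w, X, y):
    """Negative log-likelihood: sum_i log(1 + exp(-y_i * w^T x_i))."""
    margins = y * (X @ w)
    # np.logaddexp(0, -m) = log(1 + exp(-m)), computed in a numerically stable way.
    return np.logaddexp(0.0, -margins).sum()

def logistic_nll_grad(w, X, y):
    """Gradient of the negative log-likelihood with respect to w."""
    margins = y * (X @ w)
    # d/dw log(1 + exp(-m_i)) = -y_i * x_i / (1 + exp(m_i))
    return -(X.T @ (y / (1.0 + np.exp(margins))))
```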

Maximum a Posteriori Estimate (MAP)

In the MAP estimate we treat $\mathbf{w}$ as a random variable and specify a prior belief distribution over it. A common choice is the Gaussian prior $\mathbf{w} \sim \mathcal{N}(\mathbf{0},\sigma^2 I)$, which says we have no preferred direction for the hyperplane that $\mathbf{w}$ describes, but we do make an assumption about its scale.

Our goal in MAP is to find the most likely model parameters given the data, i.e., the parameters that maximize the posterior. By Bayes' rule, $P(\mathbf{w} \mid X, \mathbf{y}) \propto P(\mathbf{y} \mid X, \mathbf{w})\,P(\mathbf{w})$, so

$$\begin{aligned}
\hat{\mathbf{w}}_{MAP} &= \operatorname*{argmax}_{\mathbf{w}} P(\mathbf{w} \mid X, \mathbf{y})\\
&= \operatorname*{argmax}_{\mathbf{w}} P(\mathbf{y} \mid X, \mathbf{w}) \, P(\mathbf{w}) \\
&= \operatorname*{argmax}_{\mathbf{w}} \log \left(P(\mathbf{y} \mid X, \mathbf{w})\, P(\mathbf{w})\right) \\
&= \operatorname*{argmin}_{\mathbf{w}} \sum_{i=1}^n \log(1+e^{-y_i\mathbf{w}^T \mathbf{x}_i})+\lambda\,\mathbf{w}^\top\mathbf{w},
\end{aligned}$$

where $\lambda = \frac{1}{2\sigma^2}$; the regularization term comes from the log of the Gaussian prior, $\log P(\mathbf{w}) = -\frac{1}{2\sigma^2}\mathbf{w}^\top\mathbf{w} + \text{const}$. Note that we implicitly keep $\mathbf{w}$ small here by also trying to minimize $\mathbf{w}^\top\mathbf{w}$.
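Correspondingly, here is a short, self-contained sketch of the resulting MAP objective (assuming NumPy, the same X and y conventions as before, and $\lambda$ derived from the assumed prior variance $\sigma^2$):

```python
import numpy as np

def map_objective(w, X, y, sigma2_prior):
    """Regularized negative log-posterior (up to additive constants)."""
    lam = 1.0 / (2.0 * sigma2_prior)              # lambda = 1 / (2 sigma^2)
    nll = np.logaddexp(0.0, -y * (X @ w)).sum()   # same logistic loss as in the MLE section
    return nll + lam * (w @ w)
```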