
Logistic Regression

Cornell University

Background Setup

In Logistic Regression, the label $y$ takes one of two values, $-1$ or $+1$, so $y\in\{+1,-1\}$. We also assume that $P(y \mid \mathbf{x}_i)$ takes exactly this form (if such an assumption strikes you as ridiculous, refer back to the first lecture where we talked about assumptions; it is indeed bold and possibly wrong, but the model works very well when the true distribution looks something like this):

$$P(y \mid \mathbf{x}_i)=\frac{1}{1+e^{-y(\mathbf{w}^T \mathbf{x}_i+b)}}$$

where $\mathbf{w}$ and $b$ are the parameters of this classifier. As a little sanity check, we can verify that $P(+1 \mid \mathbf{x}_i) + P(-1 \mid \mathbf{x}_i) = 1$.

The prediction result of Logistic Regression will be

$$h(\mathbf{x}) = P(+1 \mid \mathbf{x}) = \frac{1}{1+e^{-(\mathbf{w}^T \mathbf{x}+b)}}$$
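As a concrete illustration, here is a minimal sketch of this prediction in NumPy. The parameter values and the test point are made up for demonstration; only the formula itself comes from the notes above.

```python
import numpy as np

def predict_proba(x, w, b):
    """P(y = +1 | x) under the logistic model."""
    return 1.0 / (1.0 + np.exp(-(w @ x + b)))

# Hypothetical parameters and test point, for illustration only.
w = np.array([2.0, -1.0])
b = 0.5
x = np.array([0.3, 0.8])

p_pos = predict_proba(x, w, b)
p_neg = 1.0 - p_pos                  # equals 1 / (1 + exp(+(w^T x + b)))
print(p_pos, p_neg, p_pos + p_neg)   # the two probabilities sum to 1
```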

Logistic Regression and Perceptron

LR’s Decision Boundary is Linear

For a test point $\mathbf{x}$, we assign the label $y=+1$ if

$$\begin{aligned}
& P(y=+1\mid \mathbf{x}) \gt P(y=-1\mid \mathbf{x}) \\
\iff & 1+e^{(\mathbf{w}^T \mathbf{x}+b)} \gt 1+e^{-(\mathbf{w}^T \mathbf{x}+b)} \\
\iff & e^{(\mathbf{w}^T \mathbf{x}+b)} \gt e^{-(\mathbf{w}^T \mathbf{x}+b)} \\
\iff & \mathbf{w}^T \mathbf{x}+b \gt 0
\end{aligned}$$

As we see from the above, it is ultimately a linear inequality that determines our label assignment, so the decision boundary $\mathbf{w}^T \mathbf{x}+b = 0$ is a hyperplane.
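To make this equivalence concrete, the following sketch (with hypothetical parameters and random test points) checks that thresholding the predicted probability at 0.5 yields the same label as the sign of $\mathbf{w}^T \mathbf{x}+b$:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
w, b = np.array([2.0, -1.0]), 0.5        # hypothetical parameters

for _ in range(5):
    x = rng.normal(size=2)               # random test point
    label_from_prob = +1 if sigmoid(w @ x + b) > 0.5 else -1
    label_from_sign = +1 if (w @ x + b) > 0 else -1
    assert label_from_prob == label_from_sign
```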

Difference from Perceptron

Even though Logistic Regression has a linear decision boundary, this does not mean it finds the same boundary that another linear model (such as the Perceptron) would. In fact, LR is in many ways much better than the Perceptron, even though both draw a hyperplane to classify data points into two classes:

| Perceptron | Logistic Regression |
| --- | --- |
| Only cares which side of the hyperplane a point lies on, not its distance from the plane. A point has only two possibilities: it is classified on one side or the other. This is overly binary and ignores the variability in distance to the hyperplane. | Cares about the point's distance to the hyperplane. The predicted value represents how confident we are in our estimate. When $\mathbf{w}^T\mathbf{x}+b=0$, the point $\mathbf{x}$ lies on the plane, which is equivalent to assigning probability 0.5 to each of the two classes. |

LR and Gaussian Naive Bayes

Logistic Regression is implied by a special form of Gaussian Naive Bayes. Recall that for Gaussian Naive Bayes, the distribution of each feature given a label $c$ is $P(x_\alpha \mid y=c) = \mathcal{N}\left(\mu_{\alpha c}, \sigma^{2}_{\alpha c}\right)$. Here, let us additionally assume that each feature has the same variance regardless of the label, so we actually have

$$P(x_\alpha \mid y=c) = \mathcal{N}\left(\mu_{\alpha c}, \sigma^{2}_{\alpha}\right)$$

In this case, we can actually prove that our Gaussian Naive Bayes gives the same result as a Logistic Regression with

$$w_i = \frac{\mu_{i,+1}-\mu_{i,-1}}{\sigma^2_i}, \qquad b = \ln \frac{\pi}{1-\pi}+\sum_i \frac{\mu^2_{i,-1}-\mu^2_{i,+1}}{2\sigma^2_i},$$

where $\pi = P(y=+1)$ denotes the prior probability of the positive class.

A detailed proof can be found here in section 3.1.

Therefore, we have shown that Logistic Regression, a discriminative model, can be derived from a special case of Gaussian Naive Bayes, a generative model. We usually call Logistic Regression the discriminative counterpart of Naive Bayes.
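As a quick numerical sanity check, here is a minimal sketch (with made-up Gaussian Naive Bayes parameters, assuming NumPy and $\pi = P(y=+1)$) verifying that the posterior of such a shared-variance Gaussian Naive Bayes model is exactly the logistic form above with the stated $w_i$ and $b$:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical Gaussian Naive Bayes parameters (shared per-feature variance).
mu_pos = np.array([1.0, 2.0])    # feature means for y = +1
mu_neg = np.array([-1.0, 0.5])   # feature means for y = -1
sigma2 = np.array([1.5, 0.8])    # per-feature variance, identical for both labels
prior_pos = 0.3                  # pi = P(y = +1)

def log_gaussian(x, mu, var):
    return -0.5 * np.log(2 * np.pi * var) - (x - mu) ** 2 / (2 * var)

def gnb_posterior_pos(x):
    """P(y = +1 | x) computed directly from the generative model."""
    log_joint_pos = np.log(prior_pos) + log_gaussian(x, mu_pos, sigma2).sum()
    log_joint_neg = np.log(1 - prior_pos) + log_gaussian(x, mu_neg, sigma2).sum()
    return 1.0 / (1.0 + np.exp(log_joint_neg - log_joint_pos))

# The logistic-regression parameters implied by this generative model.
w = (mu_pos - mu_neg) / sigma2
b = np.log(prior_pos / (1 - prior_pos)) + ((mu_neg**2 - mu_pos**2) / (2 * sigma2)).sum()

for _ in range(5):
    x = rng.normal(size=2)
    assert np.isclose(gnb_posterior_pos(x), 1.0 / (1.0 + np.exp(-(w @ x + b))))
```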

Estimating $\mathbf{w}$ in LR

Throughout this section we absorb the parameter $b$ into $\mathbf{w}$ through an additional constant dimension (similar to the Perceptron).

In addition, we only give the objective that defines $\hat{\mathbf{w}}$. We will talk about methods for actually computing the minimizer in the next lecture.

Maximum Likelihood Estimate (MLE)

We want to find the parameter $\mathbf{w}$ that maximizes

$$P(\mathbf{y} \mid X; \mathbf{w}) = \prod_{i=1}^n P(y_i \mid \mathbf{x}_i; \mathbf{w})$$

where $X=\left[\mathbf{x}_1, \dots,\mathbf{x}_i, \dots, \mathbf{x}_n\right] \in \mathbb{R}^{d \times n}$ collects all the training feature vectors and $\mathbf{y}$ collects all of their labels. The likelihood factorizes into this product because the samples are drawn i.i.d. Plugging in the model from above,

$$\begin{aligned}
\hat{\mathbf{w}}_{MLE} &= \operatorname*{argmax}_{\mathbf{w}} \; \log \bigg(\prod_{i=1}^n P(y_i \mid \mathbf{x}_i; \mathbf{w})\bigg)\\
&= \operatorname*{argmax}_{\mathbf{w}} -\sum_{i=1}^n \log(1+e^{-y_i \mathbf{w}^T \mathbf{x}_i})\\
&= \operatorname*{argmin}_{\mathbf{w}} \sum_{i=1}^n \log(1+e^{-y_i \mathbf{w}^T \mathbf{x}_i})
\end{aligned}$$

Note that unlike in some other algorithms, we do not place a constraint on $\mathbf{w}$ here. That is because the magnitude $\|\mathbf{w}\|$ matters: $\mathbf{w}$ does not merely define a hyperplane; its norm also affects how quickly the predicted probability transitions from 0 to 1. When $\|\mathbf{w}\|$ is large, Logistic Regression changes faster and the sigmoid appears steeper.
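For concreteness, here is a minimal sketch of this negative log-likelihood (and its gradient, which the next lecture's optimization methods would use). It assumes NumPy, an $n \times d$ matrix X whose rows are the $\mathbf{x}_i$ (with a constant column absorbing $b$), and labels y in $\{-1,+1\}$:

```python
import numpy as np

def logistic_nll(w, X, y):
    """Negative log-likelihood: sum_i log(1 + exp(-y_i * w^T x_i))."""
    margins = y * (X @ w)
    # np.logaddexp(0, -m) = log(1 + exp(-m)), computed in a numerically stable way.
    return np.logaddexp(0.0, -margins).sum()

def logistic_nll_grad(w, X, y):
    """Gradient of the negative log-likelihood with respect to w."""
    margins = y * (X @ w)
    # d/dw log(1 + exp(-m_i)) = -y_i * x_i / (1 + exp(m_i))
    return -(X.T @ (y / (1.0 + np.exp(margins))))
```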

Maximum a Posteriori Estimate (MAP)

In the MAP estimate we treat $\mathbf{w}$ as a random variable and specify a prior belief distribution over it. A common choice is the Gaussian prior $\mathbf{w} \sim \mathcal{N}(\mathbf{0},\sigma^2 I)$, which says we have no preferred direction for the hyperplane that $\mathbf{w}$ describes, but we do make an assumption about its scale.

Our goal in MAP is to find the most likely model parameters given the data, i.e., the parameters that maximize the posterior. By Bayes' rule, $P(\mathbf{w} \mid X, \mathbf{y}) \propto P(\mathbf{y} \mid X, \mathbf{w})\,P(\mathbf{w})$, so

$$\begin{aligned}
\hat{\mathbf{w}}_{MAP} &= \operatorname*{argmax}_{\mathbf{w}} P(\mathbf{w} \mid X, \mathbf{y})\\
&= \operatorname*{argmax}_{\mathbf{w}} P(\mathbf{y} \mid X, \mathbf{w}) \, P(\mathbf{w}) \\
&= \operatorname*{argmax}_{\mathbf{w}} \log \left(P(\mathbf{y} \mid X, \mathbf{w})\, P(\mathbf{w})\right) \\
&= \operatorname*{argmin}_{\mathbf{w}} \sum_{i=1}^n \log(1+e^{-y_i\mathbf{w}^T \mathbf{x}_i})+\lambda\,\mathbf{w}^\top\mathbf{w},
\end{aligned}$$

where $\lambda = \frac{1}{2\sigma^2}$; the regularization term comes from the log of the Gaussian prior, $\log P(\mathbf{w}) = -\frac{1}{2\sigma^2}\mathbf{w}^\top\mathbf{w} + \text{const}$. Note that we implicitly keep $\mathbf{w}$ small here by also trying to minimize $\mathbf{w}^\top\mathbf{w}$.
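Correspondingly, here is a short, self-contained sketch of the resulting MAP objective (assuming NumPy, the same X and y conventions as before, and $\lambda$ derived from the assumed prior variance $\sigma^2$):

```python
import numpy as np

def map_objective(w, X, y, sigma2_prior):
    """Regularized negative log-posterior (up to additive constants)."""
    lam = 1.0 / (2.0 * sigma2_prior)              # lambda = 1 / (2 sigma^2)
    nll = np.logaddexp(0.0, -y * (X @ w)).sum()   # same logistic loss as in the MLE section
    return nll + lam * (w @ w)
```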