Logistic Regression
Background Setup
When we do Logistic Regression, the label $y$ takes two values, so $y \in \{+1, -1\}$. We also make the assumption that $P(y \mid \mathbf{x})$ takes on exactly this form:
$$P(y \mid \mathbf{x}) = \frac{1}{1 + e^{-y(\mathbf{w}^\top \mathbf{x} + b)}}$$
(If you think such an assumption is just ridiculous, refer back to the first lecture, where we talked about assumptions. It is indeed bold and possibly ridiculous, but the model works very well if the distribution really does look something like this.)
where $\mathbf{w}$ and $b$ are the parameters of this classifier. We can perform a little sanity check and find that $P(y=+1 \mid \mathbf{x}) + P(y=-1 \mid \mathbf{x}) = 1$.
The prediction result of Logistic Regression is the more probable label:
$$\hat{y} = \operatorname*{argmax}_{y \in \{+1, -1\}} P(y \mid \mathbf{x}).$$
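To make the setup concrete, here is a minimal numpy sketch of the model and its prediction rule (the helper names, weights, and test point below are made up for illustration, not from the lecture):

```python
import numpy as np

def p_label(y, x, w, b):
    """P(y | x) under the logistic model, for labels y in {+1, -1}."""
    return 1.0 / (1.0 + np.exp(-y * (np.dot(w, x) + b)))

def predict(x, w, b):
    """Return the more probable label for x."""
    return +1 if p_label(+1, x, w, b) >= p_label(-1, x, w, b) else -1

w, b = np.array([2.0, -1.0]), 0.5                    # made-up parameters
x = np.array([0.3, 0.8])                             # made-up test point
print(p_label(+1, x, w, b) + p_label(-1, x, w, b))   # sanity check: prints 1.0
print(predict(x, w, b))                              # prints 1
```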
Logistic Regression and Perceptron
LR’s Decision Boundary is Linear
For a test point $\mathbf{x}$, we assign label $+1$ to it if
$$P(y=+1 \mid \mathbf{x}) \ge P(y=-1 \mid \mathbf{x}) \;\Longleftrightarrow\; \frac{1}{1 + e^{-(\mathbf{w}^\top \mathbf{x} + b)}} \ge \frac{1}{1 + e^{\mathbf{w}^\top \mathbf{x} + b}} \;\Longleftrightarrow\; e^{-(\mathbf{w}^\top \mathbf{x} + b)} \le e^{\mathbf{w}^\top \mathbf{x} + b} \;\Longleftrightarrow\; \mathbf{w}^\top \mathbf{x} + b \ge 0.$$
As we see from above, it is ultimately a linear inequality that determines our label assignment, so the decision boundary $\mathbf{w}^\top \mathbf{x} + b = 0$ is a hyperplane.
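A quick numeric check of this equivalence, under the same made-up parameters as the sketch above (re-stated here so the snippet stands alone):

```python
import numpy as np

def p_label(y, x, w, b):
    """P(y | x) under the logistic model, for labels y in {+1, -1}."""
    return 1.0 / (1.0 + np.exp(-y * (np.dot(w, x) + b)))

w, b = np.array([2.0, -1.0]), 0.5  # made-up parameters
for x in [np.array([0.3, 0.8]), np.array([-1.0, 2.0]), np.array([3.0, -2.0])]:
    by_probability = +1 if p_label(+1, x, w, b) >= p_label(-1, x, w, b) else -1
    by_hyperplane = +1 if np.dot(w, x) + b >= 0 else -1
    print(by_probability == by_hyperplane)  # True for every point
```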
Difference from Perceptron
Even though we have a linear decision boundary, this doesn’t mean we will arrive at the same decision boundary as another model (such as the Perceptron) would give. In fact, LR is in many ways much more informative than the Perceptron, even though both of them draw a hyperplane to classify data points into two classes (a small sketch follows the table below):
| Perceptron | Logistic Regression |
|---|---|
| Only cares about which side of the hyperplane a point falls on, not its distance from the plane. A point has just two possibilities: it is classified on the “head” side or on the “tail” side. The output is purely binary and does not take into account the variability in distance to the hyperplane. | Cares about the point’s distance to the hyperplane. The predicted value $P(y \mid \mathbf{x})$ represents how confident we are in our estimate. When $\mathbf{w}^\top \mathbf{x} + b = 0$, the point lies on the plane, which is equivalent to having probability 0.5 on the “head” side and 0.5 on the “tail” side. |
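To illustrate the contrast in the table above, a small sketch (the hyperplane and points are made up, and `perceptron_output` is simply the sign of $\mathbf{w}^\top \mathbf{x} + b$ rather than a trained Perceptron):

```python
import numpy as np

w, b = np.array([2.0, -1.0]), 0.0  # a shared, made-up hyperplane

def perceptron_output(x):
    return +1 if np.dot(w, x) + b >= 0 else -1          # only the side matters

def lr_confidence(x):
    return 1.0 / (1.0 + np.exp(-(np.dot(w, x) + b)))    # P(y=+1 | x): distance matters

for x in [np.array([0.05, 0.0]), np.array([5.0, 0.0])]:  # near vs. far from the hyperplane
    print(perceptron_output(x), round(lr_confidence(x), 3))
# Both points get +1 from the perceptron, but LR is barely confident (~0.525)
# for the near point and almost certain (~1.0) for the far one.
```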
LR and Gaussian Naive Bayes
Logistic Regression is implied by a special form of Gaussian Naive Bayes. Recall that for Gaussian Naive Bayes, the distribution of each feature given a label is $P(x_\alpha \mid y) = \mathcal{N}(\mu_{\alpha,y}, \sigma_{\alpha,y}^2)$. Here, let’s assume that each feature has the same standard deviation regardless of the label given, so we actually have
$$P(x_\alpha \mid y) = \mathcal{N}(\mu_{\alpha,y}, \sigma_{\alpha}^2).$$
In this case, we can actually prove that our Gaussian Naive Bayes gives the same conditional distribution as Logistic Regression,
$$P(y \mid \mathbf{x}) = \frac{1}{1 + e^{-y(\mathbf{w}^\top \mathbf{x} + b)}},$$
with $\mathbf{w}$ and $b$ determined by the Gaussian means, the shared variances, and the class priors.
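As a hedged sketch of why this holds (the prior shorthands $\pi_{+} = P(y=+1)$ and $\pi_{-} = P(y=-1)$ are notation introduced only for this sketch): by Bayes’ rule,
$$P(y=+1 \mid \mathbf{x}) = \frac{P(\mathbf{x} \mid +1)\,\pi_{+}}{P(\mathbf{x} \mid +1)\,\pi_{+} + P(\mathbf{x} \mid -1)\,\pi_{-}} = \frac{1}{1 + \exp\!\left(\ln\frac{P(\mathbf{x} \mid -1)\,\pi_{-}}{P(\mathbf{x} \mid +1)\,\pi_{+}}\right)}.$$
Because both class-conditional Gaussians share the variance $\sigma_\alpha^2$, the quadratic $x_\alpha^2$ terms in the log ratio cancel, leaving an exponent that is linear in $\mathbf{x}$, i.e., equal to $-(\mathbf{w}^\top \mathbf{x} + b)$ with
$$w_\alpha = \frac{\mu_{\alpha,+1} - \mu_{\alpha,-1}}{\sigma_\alpha^2}, \qquad b = \ln\frac{\pi_{+}}{\pi_{-}} + \sum_\alpha \frac{\mu_{\alpha,-1}^2 - \mu_{\alpha,+1}^2}{2\sigma_\alpha^2},$$
which is exactly the Logistic Regression form above.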
A detailed proof can be found here in section 3.1.
Therefore, we have shown that Logistic Regression, a discriminative model, can actually be derived from a special case of Gaussian Naive Bayes, a generative model. We usually call Logistic Regression the discriminative counterpart of Naive Bayes.
Estimating $\mathbf{w}$ in LR
Throughout this section we absorb the parameter $b$ into $\mathbf{w}$ through an additional constant dimension (similar to the Perceptron), as sketched below.
In addition, we only give the formulation for finding $\mathbf{w}$; we will talk about methods for actually computing the minimizer in the next lecture.
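For concreteness, absorbing $b$ just means appending a constant feature; a tiny sketch with made-up data:

```python
import numpy as np

X = np.array([[0.3, 0.8], [-1.0, 2.0]])           # made-up feature vectors
X_aug = np.hstack([X, np.ones((X.shape[0], 1))])  # append a constant 1 to every point
# With w_aug = [w, b], we get w_aug^T x_aug = w^T x + b, so b lives inside the weight vector.
```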
Maximum Likelihood Estimate (MLE)
We want to find the parameter $\mathbf{w}$ that maximizes the conditional likelihood of the labels given the data,
$$P(\mathbf{y} \mid X, \mathbf{w}),$$
where $X$ holds all the training feature vectors and $\mathbf{y}$ holds the labels of all data points. We can turn it into a big product because, of course, all samples are drawn i.i.d. Plugging in the form of $P(y \mid \mathbf{x})$ from the previous equation and taking the negative logarithm,
$$\mathbf{w}_{\text{MLE}} = \operatorname*{argmax}_{\mathbf{w}} \prod_{i=1}^{n} P(y_i \mid \mathbf{x}_i, \mathbf{w}) = \operatorname*{argmin}_{\mathbf{w}} \sum_{i=1}^{n} \log\left(1 + e^{-y_i \mathbf{w}^\top \mathbf{x}_i}\right).$$
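A minimal numpy sketch of this negative log-likelihood (the function name and data are made up for illustration):

```python
import numpy as np

def neg_log_likelihood(w, X, y):
    """Sum of log(1 + exp(-y_i w^T x_i)) for labels y_i in {+1, -1}."""
    margins = y * (X @ w)                      # y_i * w^T x_i for every sample
    return np.sum(np.log1p(np.exp(-margins)))

# Tiny usage example with made-up data (constant dimension already appended):
X = np.array([[0.3, 0.8, 1.0], [-1.0, 2.0, 1.0]])
y = np.array([+1, -1])
print(neg_log_likelihood(np.zeros(3), X, y))   # 2 * log(2) when w = 0
```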
Note that, unlike in some other algorithms, we do not put a constraint on $\|\mathbf{w}\|$ here. That is because the size of $\mathbf{w}$ matters: $\mathbf{w}$ is not something that simply defines a hyperplane; its magnitude also affects how quickly (how steeply) Logistic Regression changes from 0 to 1. When $\|\mathbf{w}\|$ is big, LR changes faster and the curve appears steeper.
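A quick numeric illustration of the steepness effect (the scaling factors below are arbitrary):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

z = np.array([-1.0, -0.1, 0.0, 0.1, 1.0])  # signed distances along the direction of w
print(sigmoid(1.0 * z))    # small ||w||: probabilities drift gently around 0.5
print(sigmoid(10.0 * z))   # large ||w||: the transition around the boundary is much sharper
```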
Maximum a Posteriori Estimate (MAP)
In the MAP estimate we treat $\mathbf{w}$ as a random variable and can specify a prior belief distribution over it. We may use the Gaussian prior $\mathbf{w} \sim \mathcal{N}(\mathbf{0}, \sigma^2 I)$, which says we have no preferential direction for the hyperplane $\mathbf{w}$ describes, but we do make an assumption about the scale of $\|\mathbf{w}\|$.
Our goal in MAP is to find the most likely model parameters given the data, i.e., the parameters that maximize the posterior:
$$\mathbf{w}_{\text{MAP}} = \operatorname*{argmax}_{\mathbf{w}} P(\mathbf{w} \mid X, \mathbf{y}) = \operatorname*{argmax}_{\mathbf{w}} P(\mathbf{y} \mid X, \mathbf{w})\, P(\mathbf{w}) = \operatorname*{argmin}_{\mathbf{w}} \sum_{i=1}^{n} \log\left(1 + e^{-y_i \mathbf{w}^\top \mathbf{x}_i}\right) + \lambda\, \mathbf{w}^\top \mathbf{w},$$
where $\lambda = \frac{1}{2\sigma^2}$. Note that we are implicitly making $\|\mathbf{w}\|$ small here by also trying to minimize $\mathbf{w}^\top \mathbf{w}$.
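Continuing the earlier numpy sketch, the MAP objective just adds the prior penalty (again, names and data are illustrative):

```python
import numpy as np

def map_objective(w, X, y, lam):
    """MLE loss plus the Gaussian-prior penalty lam * w^T w, with lam = 1 / (2 * sigma**2)."""
    margins = y * (X @ w)
    return np.sum(np.log1p(np.exp(-margins))) + lam * np.dot(w, w)

X = np.array([[0.3, 0.8, 1.0], [-1.0, 2.0, 1.0]])  # made-up data, constant dimension appended
y = np.array([+1, -1])
print(map_objective(np.array([0.5, -0.2, 0.1]), X, y, lam=0.1))
```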