
MLE and MAP

Cornell University

We will talk about two ways of estimating a parameter: MLE and MAP. Their formulas look a lot like those of generative and discriminative learning, as if MLE corresponded to generative and MAP corresponded to discriminative. However, MLE/MAP has nothing to do with generative/discriminative learning. They just look similar. That's all.

MLE and Coin Toss

Say we have a coin, and we want to find $\theta = P(H)$: the probability that this coin comes up heads when we toss it.

We toss it $n = 10$ times and obtain the following sequence of outcomes: $D=\{H, T, T, H, H, H, T, T, T, T\}$. Therefore, we observed $n_H=4$ heads and $n_T=6$ tails. So, intuitively,

$P(H) \approx \frac{n_H}{n_H + n_T} = \frac{4}{10} = 0.4$

We will derive this more formally with Maximum Likelihood Estimation (MLE).

Maximum Likelihood Estimation (MLE)

The estimator we just mentioned is the Maximum Likelihood Estimate. In particular, we want to find a distribution that makes the data we observed as likely as possible. MLE is done in two steps:

  1. Make an explicit modeling assumption about what type of distribution your data was sampled from.
  2. Set the parameters of this distribution so that the data you observed is as likely as possible.

Before proceeding to MLE, note that we have already observed our data. Namely, in our coin example, $D, n, n_H, n_T$ are set values.

We will assume that the coin tosses follow a binomial distribution $Bin(n, \theta)$, where $n$ is the number of tosses we made and $\theta$ is the probability of coming up heads that we are trying to estimate. Formally, given this is a binomial distribution with probability $\theta$, we have

\begin{align} P(D\mid \theta) &= \begin{pmatrix} n_H + n_T \\ n_H \end{pmatrix} \theta^{n_H} (1 - \theta)^{n_T}. \end{align}
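
As a quick sanity check, here is a minimal Python sketch (the variable names and the grid of candidate values are ours, not from the text) that plugs the observed counts $n_H = 4$, $n_T = 6$ into this likelihood:

```python
from math import comb

n_H, n_T = 4, 6  # observed heads and tails from the sequence above

def likelihood(theta):
    """P(D | theta) under the binomial model."""
    return comb(n_H + n_T, n_H) * theta**n_H * (1 - theta)**n_T

# Evaluate the likelihood on a coarse grid of candidate theta values.
for theta in [0.1, 0.2, 0.3, 0.4, 0.5, 0.6]:
    print(f"theta = {theta:.1f}  ->  P(D | theta) = {likelihood(theta):.4f}")
# The likelihood peaks around theta = 0.4, matching the intuitive estimate.
```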

MLE Principle: Find $\hat{\theta}$ that maximizes $P(D; \theta)$, the likelihood of the data:

\begin{align} \hat{\theta}_{MLE} &= \operatorname*{argmax}_{\theta} \,P(D ; \theta) \end{align}

Note we write $P(D;\theta)$ here. The semicolon means $\theta$ is a parameter of this distribution, just like $\mu, \gamma$ are parameters in $\mathcal{N}(\mu, \gamma)$. This is different from how we will view $\theta$ in MAP, which we discuss later. You can still write this as $P(D\mid \theta)$, but just remember that when we say "given $\theta$" in an MLE context, we don't mean $\theta$ is a random variable; it should be treated as a parameter.

A common procedure for solving such a maximization problem:

  1. We don’t want to see all these products, so take the log\log of the function to convert them into sums.
  2. Compute its derivative, and equate it with zero to find the extreme point.

Applying this procedure:

\begin{align} \hat{\theta}_{MLE} &= \operatorname*{argmax}_{\theta} \,P(D; \theta) \\ &= \operatorname*{argmax}_{\theta} \begin{pmatrix} n_H + n_T \\ n_H \end{pmatrix} \theta^{n_H} (1 - \theta)^{n_T} \\ &= \operatorname*{argmax}_{\theta} \,\log\begin{pmatrix} n_H + n_T \\ n_H \end{pmatrix} + n_H \cdot \log(\theta) + n_T \cdot \log(1 - \theta) && \text{$\log$ is monotonic}\\ &= \operatorname*{argmax}_{\theta} \, n_H \cdot \log(\theta) + n_T \cdot \log(1 - \theta) && \text{the binomial coefficient is a constant in $\theta$} \end{align}

We can then solve for θ\theta by taking the derivative and equating it with zero. This results in

\begin{align} \frac{n_H}{\theta} = \frac{n_T}{1 - \theta} \Longrightarrow n_H - n_H\theta = n_T\theta \Longrightarrow \theta = \frac{n_H}{n_H + n_T} \end{align}
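
The same answer can be recovered numerically. The sketch below (assuming SciPy is available; the function and variable names are ours) minimizes the negative log-likelihood over $\theta \in (0, 1)$ and compares the result with the closed form $\frac{n_H}{n_H + n_T}$:

```python
import numpy as np
from scipy.optimize import minimize_scalar

n_H, n_T = 4, 6

def neg_log_likelihood(theta):
    # The binomial coefficient is constant in theta, so we drop it.
    return -(n_H * np.log(theta) + n_T * np.log(1 - theta))

# Numerically maximize the log-likelihood over theta in (0, 1).
result = minimize_scalar(neg_log_likelihood, bounds=(1e-6, 1 - 1e-6), method="bounded")
print("numerical MLE:", result.x)           # ~0.4
print("closed form  :", n_H / (n_H + n_T))  # 0.4
```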

Smoothing

If $n$ is large and your model/distribution is correct, then MLE finds the true parameters, but MLE can overfit the data if $n$ is small. For example, suppose you observe H, H, H, H, H. Then $\hat{\theta}_{MLE} = \frac{n_H}{n_H + n_T} = \frac{5}{5} = 1$.

Simple fix: We can add $m$ imaginary throws and get some imaginary results based on our intuition about the distribution. For example, we may have a hunch that $\theta$ is close to 0.5. Call our intuition probability $\theta'$. The $m$ imaginary throws will result in $m\theta'$ heads and $m(1-\theta')$ tails. Add these imaginary results to our data, so

$\hat{\theta} = \frac{n_H + m\theta'}{n_H + n_T + m}$

For large nn, this is an insignificant change. For small nn, it incorporates your “prior belief” about what θ\theta should be.
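
As a small illustration of the fix (the helper name and the choices $m = 4$, $\theta' = 0.5$ are ours), adding imaginary tosses pulls the estimate for the all-heads sequence back toward 0.5, while barely changing the estimate once $n$ is large:

```python
def smoothed_estimate(n_H, n_T, m, theta_prior=0.5):
    """Estimate after adding m imaginary tosses split according to theta_prior."""
    return (n_H + m * theta_prior) / (n_H + n_T + m)

print(smoothed_estimate(5, 0, m=0))      # 1.0   -> plain MLE after H,H,H,H,H overfits
print(smoothed_estimate(5, 0, m=4))      # ~0.78 -> pulled back toward 0.5
print(smoothed_estimate(400, 600, m=4))  # ~0.40 -> with large n the imaginary tosses barely matter
```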

This idea of a "prior belief" actually gives rise to another school of thinking about parameter estimation.

MAP and Coin Toss with Prior Knowledge

The Bayesian Way

Formally, Bayesians model $\theta$ as a random variable, drawn from a distribution $P(\theta)$. Note that $\theta$ is not a random variable associated with an event in a sample space. You can specify a prior belief $P(\theta)$ defining what values you believe $\theta$ is likely to take on.

Now, we can look at $P(\theta \mid D) = \frac{P(D\mid \theta) P(\theta)}{P(D)}$, where

A prior distribution reflects our uncertainty about the true value of p before observing the coin tosses. After the experiment is performed and the data are gathered, the prior distribution is updated using Bayes’ rule; this yields the posterior distribution, which reflects our new beliefs about p.

Story 8.3.3 (Beta-Binomial conjugacy) from Introduction to Probability - Joseph K. Blitzstein, Jessica Hwang

A natural choice for the prior $P(\theta)$ is the Beta distribution

\begin{align} P(\theta) = \frac{\theta^{\alpha - 1}(1 - \theta)^{\beta - 1}}{B(\alpha, \beta)} \end{align}

where $B(\alpha, \beta) = \frac{\Gamma(\alpha) \Gamma(\beta)}{\Gamma(\alpha+\beta)}$ is the normalization constant. Why do we choose this complicated thing as our prior belief?

If we choose the Beta distribution as our prior, we will see later that after performing a binomial experiment and updating to the posterior, the posterior distribution of $\theta$ after observing $n_H$ heads and $n_T$ tails is still a Beta distribution! This is a special relationship between the Beta and Binomial distributions called conjugacy: if we have a Beta prior distribution on p and data that are conditionally Binomial given p, then when going from prior to posterior, we don't leave the family of Beta distributions. We say that the Beta is the conjugate prior of the Binomial.

8.3.3 (Beta-Binomial conjugacy) from Introduction to Probability - Joseph K. Blitzstein, Jessica Hwang

For a Beta distribution, $P(\theta) \propto \theta^{\alpha - 1}(1 - \theta)^{\beta - 1}$. Roughly speaking, $\alpha$ and $\beta$ act as the relative weights on the $\theta$ and $1-\theta$ factors.

Suppose $\alpha = 3, \beta = 1$. Then $P(\theta) \propto \theta^{2}(1 - \theta)^{0} = \theta^{2}$. Notice that the larger $\theta$ is, the larger $P(\theta)$ is, i.e., larger values of $\theta$ are more likely to be drawn.
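
A small sketch of this behavior (assuming SciPy; the particular $(\alpha, \beta)$ pairs are our choice): evaluating the Beta density on a grid shows that Beta(3, 1) puts more mass on large $\theta$, while Beta(5, 5) concentrates around 0.5.

```python
import numpy as np
from scipy.stats import beta

thetas = np.linspace(0.05, 0.95, 10)

# Beta(3, 1): density proportional to theta^2, so it favors large theta.
print(np.round(beta.pdf(thetas, 3, 1), 2))

# Beta(5, 5): symmetric around 0.5, encoding a belief in a roughly fair coin.
print(np.round(beta.pdf(thetas, 5, 5), 2))
```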

Maximum a Posteriori Probability Estimation (MAP)

MAP Principle: Find $\hat{\theta}$ that maximizes the posterior distribution $P(\theta \mid D)$:

\begin{align} \hat{\theta}_{MAP} &= \operatorname*{argmax}_{\theta} \,P(\theta \mid D) \\ &= \operatorname*{argmax}_{\theta} \frac{P(D\mid \theta) P(\theta)}{P(D)} && \text{By Bayes rule} \\ &= \operatorname*{argmax}_{\theta} \, \log P(D \mid \theta) + \log P(\theta) && \text{take $\log$ and drop $P(D)$, which is constant in $\theta$} \end{align}

For our coin flipping scenario, we get:

\begin{align} \hat{\theta}_{MAP} &= \operatorname*{argmax}_{\theta} \;P(\theta \mid D) \\ &= \operatorname*{argmax}_{\theta} \;\log(P(D \mid \theta)) + \log(P(\theta)) \\ &= \operatorname*{argmax}_{\theta} \;n_H \cdot \log(\theta) + n_T \cdot \log(1 - \theta) + (\alpha - 1)\cdot \log(\theta) + (\beta - 1) \cdot \log(1 - \theta) \\ &= \operatorname*{argmax}_{\theta} \;(n_H + \alpha - 1) \cdot \log(\theta) + (n_T + \beta - 1) \cdot \log(1 - \theta) \\ &\Longrightarrow \hat{\theta}_{MAP} = \frac{n_H + \alpha - 1}{n_H + n_T + \beta + \alpha - 2} \end{align}
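
For concreteness, plugging our example counts into this closed form (the prior hyperparameters $\alpha = \beta = 5$ are our choice, encoding a belief in a roughly fair coin) shows the MAP estimate being pulled from the MLE toward the prior:

```python
n_H, n_T = 4, 6
alpha, beta = 5, 5  # hypothetical Beta prior hyperparameters: belief in a roughly fair coin

theta_mle = n_H / (n_H + n_T)
theta_map = (n_H + alpha - 1) / (n_H + n_T + alpha + beta - 2)

print("MLE:", theta_mle)  # 0.4
print("MAP:", theta_map)  # 8/18 ~ 0.444, pulled toward the prior mode 0.5
```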

A similar calculation will give us

\begin{align} P(\theta \mid D) \propto P(D \mid \theta) P(\theta) \propto \theta^{n_H + \alpha -1} (1 - \theta)^{n_T + \beta -1} \end{align}

We notice:

The posterior distribution can act as the prior if we subsequently observe additional data. To see this, we can imagine taking observations one at a time and after each observation updating the current posterior distribution by multiplying by the likelihood function for the new observation and then normalizing to obtain the new, revised posterior distribution. This sequential approach to learning arises naturally when we adopt a Bayesian viewpoint.

2.1.1 (The beta distribution) from Pattern Recognition and Machine Learning - Christopher Bishop
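
A minimal sketch of this sequential view for the coin example (the starting pseudo-counts are our choice): because the posterior stays in the Beta family, each observed toss just increments one of the two Beta parameters.

```python
# Start from a Beta(alpha, beta) prior and fold in one toss at a time.
# Each intermediate posterior serves as the prior for the next observation.
alpha, beta = 2, 2  # hypothetical prior pseudo-counts
tosses = ["H", "T", "T", "H", "H", "H", "T", "T", "T", "T"]

for toss in tosses:
    if toss == "H":
        alpha += 1  # a head raises the exponent on theta
    else:
        beta += 1   # a tail raises the exponent on (1 - theta)
    print(f"after {toss}: posterior is Beta({alpha}, {beta})")

# The end result, Beta(alpha + n_H, beta + n_T), is exactly what one batch update gives.
```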

“True” Bayesian approach

Note that MAP is only one way to get an estimator the Bayesian way. There is much more information in $P(\theta \mid D)$, and we threw most of it away by taking only the max. A true Bayesian approach is to use the full posterior distribution $P(\theta\mid D)$ to compute the predictive distribution of the label $Y$ of a test sample $X$ given the dataset $D$. Simply put, it integrates over all possible models $\theta$ to make a prediction. Mathematically, we can represent it as:

\begin{align} P[Y\mid (D,X)] = &\int_{\theta}P[(Y,\theta) \mid (D,X)] \hspace{0.05in} d\theta && (P(A) = \int_{B}P(A,b) \, db) \\ = &\int_{\theta} P[Y \mid (\theta, D, X)] \hspace{0.05in} P(\theta \mid D) \hspace{0.05in} d\theta && \textrm{(Chain rule: $P(A,B\mid C)=P(A\mid B,C)P(B\mid C)$)} \end{align}

Intuition behind each step is:

\begin{align} &P[Y\mid (D,X)] && \text{given our data $D$ and a test point $X$, estimate $X$'s label $Y$} \\ = &\int_{\theta}P[(Y,\theta) \mid (D,X)] \hspace{0.05in} d\theta && \text{how do we estimate? we use some model $\theta$} \\ = &\int_{\theta} P[Y \mid (\theta,D,X)] \hspace{0.05in} P(\theta \mid D) \hspace{0.05in} d\theta && \text{how do we get $\theta$? based on training data $D$} \end{align}

Unfortunately, the above is generally intractable.
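
For the Beta-Binomial coin, however, the integral is tractable, which makes it a nice test case. The sketch below (assuming SciPy; the prior parameters are our choice) integrates $\theta$ against the posterior and recovers the posterior mean in closed form:

```python
from scipy.integrate import quad
from scipy.stats import beta

n_H, n_T = 4, 6
a, b = 5, 5                        # hypothetical Beta prior parameters
a_post, b_post = a + n_H, b + n_T  # posterior is Beta(a + n_H, b + n_T)

# P(next toss = H | D) = integral over theta of P(H | theta) * P(theta | D)
predictive, _ = quad(lambda t: t * beta.pdf(t, a_post, b_post), 0.0, 1.0)

print("by integration:", predictive)                  # ~0.45
print("closed form   :", a_post / (a_post + b_post))  # posterior mean = 9/20 = 0.45
```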

Summary

As always, the differences are subtle. In MLE we maximize $\log\left[P(D;\theta)\right]$; in MAP we maximize $\log\left[P(D\mid\theta)\right]+\log\left[P(\theta)\right]$. So essentially in MAP we only add the term $\log\left[P(\theta)\right]$ to our optimization. This term is independent of the data and penalizes the parameters $\theta$ if they deviate too much from what we believe is reasonable.