
MLE and MAP

Cornell University

We will talk about two ways of estimating a parameter: MLE and MAP. Their formulas look a lot like those of generative and discriminative learning, as if MLE corresponded to generative and MAP corresponded to discriminative. However, MLE/MAP has nothing to do with generative/discriminative learning. They just look similar. That's all.

MLE and Coin Toss

Say we have a coin, and we want to find $\theta = P(H)$: the probability that this coin comes up heads when we toss it.

We toss it $n = 10$ times and obtain the following sequence of outcomes: $D=\{H, T, T, H, H, H, T, T, T, T\}$. Therefore, we observed $n_H=4$ heads and $n_T=6$ tails. So, intuitively,

$P(H) \approx \frac{n_H}{n_H + n_T} = \frac{4}{10} = 0.4$

We will derive this more formally with Maximum Likelihood Estimation (MLE).

Maximum Likelihood Estimation (MLE)

The estimator we just mentioned is the Maximum Likelihood Estimate. In particular, we want to find a distribution that makes the data we observed as likely as possible. MLE is done in two steps:

  1. Make an explicit modeling assumption about what type of distribution your data was sampled from.
  2. Set the parameters of this distribution so that the data you observed is as likely as possible.

Before proceeding to MLE, note that we have already observed our data. Namely, in our coin example, $D, n, n_H, n_T$ are set values.

We will assume that the coin tosses follow a binomial distribution $Bin(n, \theta)$, where $n$ is the number of tosses we made and $\theta$ is the probability of coming up heads that we are trying to estimate. Formally, given this is a binomial distribution with probability $\theta$, we have

\begin{align} P(D\mid \theta) &= \begin{pmatrix} n_H + n_T \\ n_H \end{pmatrix} \theta^{n_H} (1 - \theta)^{n_T}. \end{align}
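
As a quick sanity check, here is a minimal Python sketch (the variable names and the grid of candidate values are ours, not from the text) that plugs the observed counts $n_H = 4$, $n_T = 6$ into this likelihood:

```python
from math import comb

n_H, n_T = 4, 6  # observed heads and tails from the sequence above

def likelihood(theta):
    """P(D | theta) under the binomial model."""
    return comb(n_H + n_T, n_H) * theta**n_H * (1 - theta)**n_T

# Evaluate the likelihood on a coarse grid of candidate theta values.
for theta in [0.1, 0.2, 0.3, 0.4, 0.5, 0.6]:
    print(f"theta = {theta:.1f}  ->  P(D | theta) = {likelihood(theta):.4f}")
# The likelihood peaks around theta = 0.4, matching the intuitive estimate.
```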

MLE Principle: Find $\hat{\theta}$ that maximizes $P(D; \theta)$, the likelihood of the data:

\begin{align} \hat{\theta}_{MLE} &= \operatorname*{argmax}_{\theta} \,P(D ; \theta) \end{align}

Note we write $P(D;\theta)$ here. The semicolon means $\theta$ is a parameter of this distribution, just like $\mu, \gamma$ are parameters in $\mathcal{N}(\mu, \gamma)$. This is different from how we will view $\theta$ in MAP, which we discuss later. You can still write this as $P(D\mid \theta)$, but just remember that when we say "given $\theta$" in an MLE context, we don't mean $\theta$ is a random variable; it should be treated as a parameter.

A common procedure for solving such a maximization problem:

  1. We don’t want to see all these products, so take the log\log of the function to convert them into sums.
  2. Compute its derivative, and equate it with zero to find the extreme point.

Applying this procedure:

\begin{align} \hat{\theta}_{MLE} &= \operatorname*{argmax}_{\theta} \,P(D; \theta) \\ &= \operatorname*{argmax}_{\theta} \begin{pmatrix} n_H + n_T \\ n_H \end{pmatrix} \theta^{n_H} (1 - \theta)^{n_T} \\ &= \operatorname*{argmax}_{\theta} \,\log\begin{pmatrix} n_H + n_T \\ n_H \end{pmatrix} + n_H \cdot \log(\theta) + n_T \cdot \log(1 - \theta) && \text{$\log$ is monotonic}\\ &= \operatorname*{argmax}_{\theta} \, n_H \cdot \log(\theta) + n_T \cdot \log(1 - \theta) && \text{the binomial coefficient is a constant in $\theta$} \end{align}

We can then solve for θ\theta by taking the derivative and equating it with zero. This results in

\begin{align} \frac{n_H}{\theta} = \frac{n_T}{1 - \theta} \Longrightarrow n_H - n_H\theta = n_T\theta \Longrightarrow \theta = \frac{n_H}{n_H + n_T} \end{align}
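
The same answer can be recovered numerically. The sketch below (assuming SciPy is available; the function and variable names are ours) minimizes the negative log-likelihood over $\theta \in (0, 1)$ and compares the result with the closed form $\frac{n_H}{n_H + n_T}$:

```python
import numpy as np
from scipy.optimize import minimize_scalar

n_H, n_T = 4, 6

def neg_log_likelihood(theta):
    # The binomial coefficient is constant in theta, so we drop it.
    return -(n_H * np.log(theta) + n_T * np.log(1 - theta))

# Numerically maximize the log-likelihood over theta in (0, 1).
result = minimize_scalar(neg_log_likelihood, bounds=(1e-6, 1 - 1e-6), method="bounded")
print("numerical MLE:", result.x)           # ~0.4
print("closed form  :", n_H / (n_H + n_T))  # 0.4
```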

Smoothing

If $n$ is large and your model/distribution is correct, then MLE finds the true parameters, but MLE can overfit the data if $n$ is small. For example, suppose you observe H, H, H, H, H. Then $\hat{\theta}_{MLE} = \frac{n_H}{n_H + n_T} = \frac{5}{5} = 1$.

Simple fix: We can add $m$ imaginary throws and get some imaginary results based on our intuition about the distribution. For example, we may have a hunch that $\theta$ is close to 0.5. Call our intuition probability $\theta'$. The $m$ imaginary throws will result in $m\theta'$ heads and $m(1-\theta')$ tails. Add these imaginary results to our data, so

$\hat{\theta} = \frac{n_H + m\theta'}{n_H + n_T + m}$

For large nn, this is an insignificant change. For small nn, it incorporates your “prior belief” about what θ\theta should be.
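
As a small illustration of the fix (the helper name and the choices $m = 4$, $\theta' = 0.5$ are ours), adding imaginary tosses pulls the estimate for the all-heads sequence back toward 0.5, while barely changing the estimate once $n$ is large:

```python
def smoothed_estimate(n_H, n_T, m, theta_prior=0.5):
    """Estimate after adding m imaginary tosses split according to theta_prior."""
    return (n_H + m * theta_prior) / (n_H + n_T + m)

print(smoothed_estimate(5, 0, m=0))      # 1.0   -> plain MLE after H,H,H,H,H overfits
print(smoothed_estimate(5, 0, m=4))      # ~0.78 -> pulled back toward 0.5
print(smoothed_estimate(400, 600, m=4))  # ~0.40 -> with large n the imaginary tosses barely matter
```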

This idea of a "prior belief" actually gives rise to another school of thinking about parameter estimation.

MAP and Coin Toss with Prior Knowledge

The Bayesian Way

Formally, Bayesians model $\theta$ as a random variable, drawn from a distribution $P(\theta)$. Note that $\theta$ is not a random variable associated with an event in a sample space. You can specify a prior belief $P(\theta)$ defining what values you believe $\theta$ is likely to take on.

Now, we can look at $P(\theta \mid D) = \frac{P(D\mid \theta) P(\theta)}{P(D)}$, where

A prior distribution reflects our uncertainty about the true value of p before observing the coin tosses. After the experiment is performed and the data are gathered, the prior distribution is updated using Bayes’ rule; this yields the posterior distribution, which reflects our new beliefs about p.

Story 8.3.3 (Beta-Binomial conjugacy) from Introduction to Probability - Joseph K. Blitzstein, Jessica Hwang

A natural choice for the prior $P(\theta)$ is the Beta distribution

\begin{align} P(\theta) = \frac{\theta^{\alpha - 1}(1 - \theta)^{\beta - 1}}{B(\alpha, \beta)} \end{align}

where $B(\alpha, \beta) = \frac{\Gamma(\alpha) \Gamma(\beta)}{\Gamma(\alpha+\beta)}$ is the normalization constant. Why do we choose this complicated thing as our prior belief?

If we choose the Beta distribution as our prior, we will see later that after performing a binomial experiment and updating to the posterior, the posterior distribution of $\theta$ after observing $n_H$ heads and $n_T$ tails is still a Beta distribution! This is a special relationship between the Beta and Binomial distributions called conjugacy: if we have a Beta prior distribution on p and data that are conditionally Binomial given p, then when going from prior to posterior, we don't leave the family of Beta distributions. We say that the Beta is the conjugate prior of the Binomial.

8.3.3 (Beta-Binomial conjugacy) from Introduction to Probability - Joseph K. Blitzstein, Jessica Hwang

For a Beta distribution, $P(\theta) \propto \theta^{\alpha - 1}(1 - \theta)^{\beta - 1}$. Roughly speaking, $\alpha$ and $\beta$ act as the relative weights on the $\theta$ and $1-\theta$ factors.

Suppose $\alpha = 3, \beta = 1$. Then $P(\theta) \propto \theta^{2}(1 - \theta)^{0} = \theta^{2}$. Notice that the larger $\theta$ is, the larger $P(\theta)$ is, i.e., larger values of $\theta$ are more likely to be drawn.
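
A small sketch of this behavior (assuming SciPy; the particular $(\alpha, \beta)$ pairs are our choice): evaluating the Beta density on a grid shows that Beta(3, 1) puts more mass on large $\theta$, while Beta(5, 5) concentrates around 0.5.

```python
import numpy as np
from scipy.stats import beta

thetas = np.linspace(0.05, 0.95, 10)

# Beta(3, 1): density proportional to theta^2, so it favors large theta.
print(np.round(beta.pdf(thetas, 3, 1), 2))

# Beta(5, 5): symmetric around 0.5, encoding a belief in a roughly fair coin.
print(np.round(beta.pdf(thetas, 5, 5), 2))
```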

Maximum a Posteriori Probability Estimation (MAP)

MAP Principle: Find $\hat{\theta}$ that maximizes the posterior distribution $P(\theta \mid D)$:

\begin{align} \hat{\theta}_{MAP} &= \operatorname*{argmax}_{\theta} \,P(\theta \mid D) \\ &= \operatorname*{argmax}_{\theta} \frac{P(D\mid \theta) P(\theta)}{P(D)} && \text{By Bayes rule} \\ &= \operatorname*{argmax}_{\theta} \, \log P(D \mid \theta) + \log P(\theta) && \text{take $\log$ and drop $P(D)$, which is constant in $\theta$} \end{align}

For our coin flipping scenario, we get:

\begin{align} \hat{\theta}_{MAP} &= \operatorname*{argmax}_{\theta} \;P(\theta \mid D) \\ &= \operatorname*{argmax}_{\theta} \;\log(P(D \mid \theta)) + \log(P(\theta)) \\ &= \operatorname*{argmax}_{\theta} \;n_H \cdot \log(\theta) + n_T \cdot \log(1 - \theta) + (\alpha - 1)\cdot \log(\theta) + (\beta - 1) \cdot \log(1 - \theta) \\ &= \operatorname*{argmax}_{\theta} \;(n_H + \alpha - 1) \cdot \log(\theta) + (n_T + \beta - 1) \cdot \log(1 - \theta) \\ &\Longrightarrow \hat{\theta}_{MAP} = \frac{n_H + \alpha - 1}{n_H + n_T + \beta + \alpha - 2} \end{align}
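
For concreteness, plugging our example counts into this closed form (the prior hyperparameters $\alpha = \beta = 5$ are our choice, encoding a belief in a roughly fair coin) shows the MAP estimate being pulled from the MLE toward the prior:

```python
n_H, n_T = 4, 6
alpha, beta = 5, 5  # hypothetical Beta prior hyperparameters: belief in a roughly fair coin

theta_mle = n_H / (n_H + n_T)
theta_map = (n_H + alpha - 1) / (n_H + n_T + alpha + beta - 2)

print("MLE:", theta_mle)  # 0.4
print("MAP:", theta_map)  # 8/18 ~ 0.444, pulled toward the prior mode 0.5
```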

A similar calculation will give us

\begin{align} P(\theta \mid D) \propto P(D \mid \theta) P(\theta) \propto \theta^{n_H + \alpha -1} (1 - \theta)^{n_T + \beta -1} \end{align}

We notice:

The posterior distribution can act as the prior if we subsequently observe additional data. To see this, we can imagine taking observations one at a time and after each observation updating the current posterior distribution by multiplying by the likelihood function for the new observation and then normalizing to obtain the new, revised posterior distribution. This sequential approach to learning arises naturally when we adopt a Bayesian viewpoint.

2.1.1 (The beta distribution) from Pattern Recognition and Machine Learning - Christopher Bishop
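
A minimal sketch of this sequential view for the coin example (the starting pseudo-counts are our choice): because the posterior stays in the Beta family, each observed toss just increments one of the two Beta parameters.

```python
# Start from a Beta(alpha, beta) prior and fold in one toss at a time.
# Each intermediate posterior serves as the prior for the next observation.
alpha, beta = 2, 2  # hypothetical prior pseudo-counts
tosses = ["H", "T", "T", "H", "H", "H", "T", "T", "T", "T"]

for toss in tosses:
    if toss == "H":
        alpha += 1  # a head raises the exponent on theta
    else:
        beta += 1   # a tail raises the exponent on (1 - theta)
    print(f"after {toss}: posterior is Beta({alpha}, {beta})")

# The end result, Beta(alpha + n_H, beta + n_T), is exactly what one batch update gives.
```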

“True” Bayesian approach

Note that MAP is only one way to get an estimator the Bayesian way. There is much more information in $P(\theta \mid D)$, and we threw most of it away by taking only the max. A true Bayesian approach is to use the full posterior distribution $P(\theta\mid D)$ to compute the predictive distribution of the label $Y$ of a test sample $X$ given the dataset $D$. Simply put, it integrates over all possible models $\theta$ to make a prediction. Mathematically, we can represent it as:

\begin{align} P[Y\mid (D,X)] = &\int_{\theta}P[(Y,\theta) \mid (D,X)] \hspace{0.05in} d\theta && (P(A) = \int_{B}P(A,b) \, db) \\ = &\int_{\theta} P[Y \mid (\theta, D, X)] \hspace{0.05in} P(\theta \mid D) \hspace{0.05in} d\theta && \textrm{(Chain rule: $P(A,B\mid C)=P(A\mid B,C)P(B\mid C)$)} \end{align}

Intuition behind each step is:

\begin{align} &P[Y\mid (D,X)] && \text{given our data $D$ and a test point $X$, estimate $X$'s label $Y$} \\ = &\int_{\theta}P[(Y,\theta) \mid (D,X)] \hspace{0.05in} d\theta && \text{how do we estimate? we use some model $\theta$} \\ = &\int_{\theta} P[Y \mid (\theta,D,X)] \hspace{0.05in} P(\theta \mid D) \hspace{0.05in} d\theta && \text{how do we get $\theta$? based on training data $D$} \end{align}

Unfortunately, the above is generally intractable.
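
For the Beta-Binomial coin, however, the integral is tractable, which makes it a nice test case. The sketch below (assuming SciPy; the prior parameters are our choice) integrates $\theta$ against the posterior and recovers the posterior mean in closed form:

```python
from scipy.integrate import quad
from scipy.stats import beta

n_H, n_T = 4, 6
a, b = 5, 5                        # hypothetical Beta prior parameters
a_post, b_post = a + n_H, b + n_T  # posterior is Beta(a + n_H, b + n_T)

# P(next toss = H | D) = integral over theta of P(H | theta) * P(theta | D)
predictive, _ = quad(lambda t: t * beta.pdf(t, a_post, b_post), 0.0, 1.0)

print("by integration:", predictive)                  # ~0.45
print("closed form   :", a_post / (a_post + b_post))  # posterior mean = 9/20 = 0.45
```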

Summary

As always, the differences are subtle. In MLE we maximize $\log\left[P(D;\theta)\right]$; in MAP we maximize $\log\left[P(D\mid\theta)\right]+\log\left[P(\theta)\right]$. So essentially in MAP we only add the term $\log\left[P(\theta)\right]$ to our optimization. This term is independent of the data and penalizes the parameters $\theta$ if they deviate too much from what we believe is reasonable.