MLE and MAP
We will talk about two ways of estimating a parameter: MLE and MAP. Their formulas look similar to those of generative and discriminative learning, as if one corresponded to generative and the other to discriminative. However, MLE/MAP has nothing to do with generative/discriminative learning. The formulas just look similar; that's all.
MLE and Coin Toss¶
Say we have a coin, and we want to find $\theta$: the probability that this coin comes up heads when we toss it.
We toss it $n$ times and obtain a sequence of outcomes $D$ containing $n_H$ heads and $n_T$ tails. So, intuitively,
$$\hat\theta \approx \frac{n_H}{n_H + n_T}.$$
We will derive this more formally with Maximum Likelihood Estimation (MLE).
Maximum Likelihood Estimation (MLE)¶
The estimator we just mentioned is the Maximum Likelihood Estimate. In particular, we want to find a distribution that makes the data we observed as likely as possible. MLE is done in two steps:
- Make an explicit modeling assumption about what type of distribution your data was sampled from.
- Set the parameters of this distribution so that the data you observed is as likely as possible.
Before proceeding to MLE, note that we have already observed our data. Namely, in our coin example, $n_H$ and $n_T$ are set values.
We will assume that the coin tosses follow a binomial distribution $\text{Bin}(n, \theta)$, where $n = n_H + n_T$ is the number of tosses we made and $\theta$ is the probability of coming up heads that we are trying to estimate. Formally, given this is a binomial distribution with probability $\theta$, we have
$$P(D; \theta) = \binom{n_H + n_T}{n_H} \theta^{n_H} (1 - \theta)^{n_T}.$$
MLE Principle: Find $\hat\theta$ to maximize the likelihood of the data, $P(D; \theta)$:
$$\hat\theta_{\text{MLE}} = \operatorname*{argmax}_{\theta} P(D; \theta).$$
Note we have $P(D; \theta)$ here; the semicolon means $\theta$ is a parameter of this distribution, not a random variable. This is different from how we will view $\theta$ in MAP, which we discuss later. You can still write this as $P(D \mid \theta)$, but just remember that when we say "given $\theta$" in an MLE context, we don't mean $\theta$ is a random variable. It should be treated as a parameter.
A common procedure for solving such a maximization problem:
- We don't want to deal with all these products, so take the $\log$ of the function to convert them into sums.
- Compute the derivative, and equate it with zero to find the extremum.
Applying this procedure:
$$\log P(D; \theta) = \log \binom{n_H + n_T}{n_H} + n_H \log \theta + n_T \log(1 - \theta).$$
We can then solve for $\theta$ by taking the derivative and equating it with zero:
$$\frac{n_H}{\theta} - \frac{n_T}{1 - \theta} = 0 \;\Longrightarrow\; \hat\theta = \frac{n_H}{n_H + n_T}.$$
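As a quick sanity check, we can maximize the log-likelihood numerically on a grid and compare against the closed form. This is only a sketch; the counts $n_H = 4$, $n_T = 6$ are made-up illustrative values:

```python
# Sanity check: maximize the log-likelihood numerically and compare with the
# closed-form MLE (the counts n_H = 4, n_T = 6 are made up).
import math

n_H, n_T = 4, 6

def log_likelihood(theta):
    # log P(D; theta), dropping the constant log-binomial-coefficient term
    return n_H * math.log(theta) + n_T * math.log(1 - theta)

thetas = [i / 1000 for i in range(1, 1000)]   # grid over (0, 1)
theta_hat = max(thetas, key=log_likelihood)

print(theta_hat)             # 0.4
print(n_H / (n_H + n_T))     # 0.4
```

The binomial coefficient can be dropped because it does not depend on $\theta$ and therefore does not move the argmax.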
Smoothing¶
If $n$ is large and your model/distribution is correct, then MLE finds the true parameters; but MLE can overfit the data if $n$ is small. For example, suppose you observe H, H, H, H, H. Then $\hat\theta = \frac{5}{5} = 1$: the estimate claims the coin can never come up tails.
Simple fix: We can add $m$ imaginary throws whose results reflect our intuition about the distribution. For example, we may have a hunch that $\theta$ is close to 0.5. Call our intuition probability $\theta'$. The imaginary throws will result in $m\theta'$ heads and $m(1 - \theta')$ tails. Add these imaginary results to our data, so
$$\hat\theta = \frac{n_H + m\theta'}{n_H + n_T + m}.$$
For large $n$, this is an insignificant change. For small $n$, it incorporates your "prior belief" about what $\theta$ should be.
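The smoothed estimator above can be sketched in a few lines (the counts and the choice $\theta' = 0.5$ are purely illustrative):

```python
# A sketch of the smoothed estimator (counts and theta_prior are illustrative).
def smoothed_estimate(n_H, n_T, m, theta_prior):
    # Add m * theta_prior imaginary heads and m * (1 - theta_prior) imaginary tails.
    return (n_H + m * theta_prior) / (n_H + n_T + m)

print(smoothed_estimate(5, 0, 0, 0.5))     # 1.0 -- m = 0 recovers the plain MLE
print(smoothed_estimate(5, 0, 10, 0.5))    # (5 + 5) / 15, about 0.667
print(smoothed_estimate(500, 0, 10, 0.5))  # about 0.990 -- for large n the prior barely matters
```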
This idea of a "prior belief" gives rise to a whole other way of thinking about distributions.
MAP and Coin Toss with Prior Knowledge¶
The Bayesian Way¶
- Frequentists think $\theta$ is a fixed property of the distribution from which our data is drawn. Therefore, it should be treated as an (unknown) constant, not something random.
- Bayesians consider $\theta$ just another random variable, whose distribution can reflect our prior knowledge/assumptions about the data distribution.
Formally, they model $\theta$ as a random variable, drawn from a distribution $P(\theta)$. Note that $\theta$ is not a random variable associated with an event in a sample space. You can specify a prior belief $P(\theta)$ defining what values you believe $\theta$ is likely to take on.
Now, we can look at Bayes' rule, $P(\theta \mid D) = \frac{P(D \mid \theta)\, P(\theta)}{P(D)}$, where
- $P(\theta)$ is the prior distribution over the parameter(s) $\theta$: what I believe before seeing any data.
- $P(D \mid \theta)$ is the likelihood of the data given the parameter(s) $\theta$, the same quantity as in MLE.
- $P(\theta \mid D)$ is the posterior distribution over the parameter(s) $\theta$: what I believe after observing the data.
- $P(D)$ is the total probability of drawing such data, considering all possible distributions: $P(D) = \int_{\theta} P(D \mid \theta) P(\theta) \, d\theta$. This value is just a constant in our maximization problem, so we can ignore it.
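The pieces above can be made concrete with a small numerical sketch: evaluate an (assumed) Beta prior and the binomial likelihood on a grid, multiply them, and normalize; the normalization plays the role of dividing by $P(D)$. All numbers here are made up for illustration:

```python
# A numerical sketch of Bayes' rule on a grid (all numbers here are made up).
# Prior: Beta(alpha, beta); likelihood: theta^n_H * (1 - theta)^n_T.
# Constant factors are dropped, since dividing by P(D) removes them anyway.
alpha, beta, n_H, n_T = 2, 2, 4, 6

thetas = [(i + 0.5) / 1000 for i in range(1000)]  # midpoint grid on (0, 1)
prior = [t ** (alpha - 1) * (1 - t) ** (beta - 1) for t in thetas]
likelihood = [t ** n_H * (1 - t) ** n_T for t in thetas]

unnorm = [p * l for p, l in zip(prior, likelihood)]  # P(D|theta) * P(theta)
Z = sum(unnorm)                       # plays the role of the constant P(D)
posterior = [u / Z for u in unnorm]   # now sums to 1 over the grid

print(round(sum(posterior), 6))       # 1.0
```

Note that we never needed the constant factors (the binomial coefficient, the Beta normalizer): they cancel when we normalize, which is exactly why $P(D)$ can be ignored in the maximization.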
A prior distribution reflects our uncertainty about the true value of p before observing the coin tosses. After the experiment is performed and the data are gathered, the prior distribution is updated using Bayes’ rule; this yields the posterior distribution, which reflects our new beliefs about p.
Story 8.3.3 (Beta-Binomial conjugacy) from Introduction to Probability - Joseph K. Blitzstein, Jessica Hwang
A natural choice for the prior $P(\theta)$ is the Beta distribution:
$$P(\theta) = \frac{\theta^{\alpha - 1} (1 - \theta)^{\beta - 1}}{B(\alpha, \beta)},$$
where $B(\alpha, \beta)$ is the normalization constant. Why do we choose this complicated thing as our prior belief?
If we choose the Beta distribution as our prior and then perform a binomial experiment, we will see later that when we update to the posterior, the posterior distribution of $\theta$ after observing the data is still a Beta distribution! This is a special relationship between the Beta and Binomial distributions called conjugacy: if we have a Beta prior distribution on $p$ and data that are conditionally Binomial given $p$, then when going from prior to posterior, we don't leave the family of Beta distributions. We say that the Beta is the conjugate prior of the Binomial.
8.3.3 (Beta-Binomial conjugacy) from Introduction to Probability - Joseph K. Blitzstein, Jessica Hwang
For a Beta distribution, $\alpha$ and $\beta$ act, relatively, as weights in the "heads" and "tails" directions.
For example, when $\alpha$ is large relative to $\beta$, we observe that the density puts more mass on larger $\theta$, i.e., larger values of $\theta$ are more likely to be drawn.
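A quick sketch of this intuition, evaluating the Beta density with only Python's standard library (the choice $\alpha = 8$, $\beta = 2$ is arbitrary):

```python
# A sketch of the Beta density using only the standard library
# (the choice alpha = 8, beta = 2 is arbitrary).
import math

def beta_pdf(theta, a, b):
    B = math.gamma(a) * math.gamma(b) / math.gamma(a + b)  # normalization B(a, b)
    return theta ** (a - 1) * (1 - theta) ** (b - 1) / B

# With alpha > beta, the prior puts more mass on large theta (a heads-leaning coin):
print(beta_pdf(0.8, 8, 2) > beta_pdf(0.2, 8, 2))  # True
print(8 / (8 + 2))                                # prior mean alpha/(alpha+beta) = 0.8
```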
Maximum a Posteriori Probability Estimation (MAP)¶
MAP Principle: Find $\hat\theta$ that maximizes the posterior distribution $P(\theta \mid D)$:
$$\hat\theta_{\text{MAP}} = \operatorname*{argmax}_{\theta} P(\theta \mid D) = \operatorname*{argmax}_{\theta}\; \log P(D \mid \theta) + \log P(\theta).$$
For our coin flipping scenario, we get:
$$\hat\theta_{\text{MAP}} = \operatorname*{argmax}_{\theta}\; (n_H + \alpha - 1)\log\theta + (n_T + \beta - 1)\log(1 - \theta).$$
A similar calculation (take the derivative, equate it with zero) will give us
$$\hat\theta_{\text{MAP}} = \frac{n_H + \alpha - 1}{n_H + n_T + \alpha + \beta - 2}.$$
We notice:
- The MAP estimate is identical to MLE with $\alpha - 1$ hallucinated heads and $\beta - 1$ hallucinated tails.
- As $n \to \infty$, $\hat\theta_{\text{MAP}} \to \hat\theta_{\text{MLE}}$, as our prior knowledge $\alpha$ and $\beta$ becomes irrelevant compared to very large $n_H, n_T$.
- MAP is a great estimator if an accurate prior belief is available (and mathematically tractable).
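We can check the closed-form MAP estimate against a numerical argmax of the log-posterior. This is a sketch; $\alpha$, $\beta$, and the counts are arbitrary illustrative values:

```python
# Check the closed-form MAP estimate against a numerical argmax of the
# log-posterior (alpha, beta, and the counts are arbitrary).
import math

alpha, beta, n_H, n_T = 3, 3, 4, 6

def log_posterior(theta):
    # log P(theta | D) up to an additive constant
    return (n_H + alpha - 1) * math.log(theta) + (n_T + beta - 1) * math.log(1 - theta)

thetas = [i / 10000 for i in range(1, 10000)]
theta_numeric = max(thetas, key=log_posterior)
theta_closed = (n_H + alpha - 1) / (n_H + n_T + alpha + beta - 2)

print(theta_closed)                               # 6/14, about 0.4286
print(abs(theta_numeric - theta_closed) < 1e-3)   # True
```

With $\alpha = \beta = 3$, the prior contributes 2 hallucinated heads and 2 hallucinated tails, pulling the estimate from the MLE's $0.4$ toward $0.5$.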
The posterior distribution can act as the prior if we subsequently observe additional data. To see this, we can imagine taking observations one at a time and after each observation updating the current posterior distribution by multiplying by the likelihood function for the new observation and then normalizing to obtain the new, revised posterior distribution. This sequential approach to learning arises naturally when we adopt a Bayesian viewpoint.
2.1.1 (The beta distribution) from Pattern Recognition and Machine Learning - Christopher Bishop
“True” Bayesian approach¶
Note that MAP is only one way to get an estimator the Bayesian way. There is much more information in $P(\theta \mid D)$, and we threw most of it away by taking only the max. A true Bayesian approach is to use the posterior predictive distribution directly to make a prediction about the label $Y$ of a test sample $X$ based on dataset $D$. Simply put, it integrates over all possible models to make a prediction. Mathematically, we can represent it as:
$$P(Y \mid X, D) = \int_{\theta} P(Y \mid X, \theta)\, P(\theta \mid D)\, d\theta.$$
Intuition behind each factor: $P(\theta \mid D)$ weighs each possible model $\theta$ by how plausible it is after seeing the data, and $P(Y \mid X, \theta)$ is the prediction that particular model would make; the integral averages the predictions of all models according to these weights.
Unfortunately, the above integral is generally intractable.
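For the coin example, though, the integral is tractable, which makes it a good sanity check: with a Beta($\alpha$, $\beta$) prior the posterior is Beta($\alpha + n_H$, $\beta + n_T$), and integrating $\theta$ against it gives $P(\text{next} = H \mid D) = \frac{n_H + \alpha}{n_H + n_T + \alpha + \beta}$. A sketch with made-up numbers:

```python
# Posterior predictive probability of heads on the next toss, computed by
# numerically integrating over all models theta (all numbers are made up).
import math

alpha, beta, n_H, n_T = 2, 2, 4, 6
a, b = alpha + n_H, beta + n_T                         # posterior Beta(a, b) parameters
B = math.gamma(a) * math.gamma(b) / math.gamma(a + b)  # normalization constant

def posterior_pdf(theta):
    return theta ** (a - 1) * (1 - theta) ** (b - 1) / B

# P(next = H | D) = integral over theta of theta * P(theta | D)  (midpoint rule)
N = 100_000
pred = sum(((i + 0.5) / N) * posterior_pdf((i + 0.5) / N) for i in range(N)) / N

print(round(pred, 4))                              # 0.4286
print((n_H + alpha) / (n_H + n_T + alpha + beta))  # 6/14, about 0.4286
```

Note this is the posterior *mean* of $\theta$, not its mode: averaging over all models gives a slightly different answer than plugging in the single MAP estimate.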
Summary¶
- MLE Prediction: $P(y \mid x; \hat\theta)$. Learning: $\hat\theta = \operatorname*{argmax}_{\theta} P(D; \theta)$. Here $\theta$ is purely a model parameter.
- MAP Prediction: $P(y \mid x, \hat\theta)$. Learning: $\hat\theta = \operatorname*{argmax}_{\theta} P(\theta \mid D) \propto P(D \mid \theta) P(\theta)$. Here $\theta$ is a random variable.
- "True Bayesian" Prediction: $P(y \mid x, D) = \int_{\theta} P(y \mid x, \theta)\, P(\theta \mid D)\, d\theta$. Here $\theta$ is integrated out: our prediction takes all possible models into account.
As always, the differences are subtle. In MLE we maximize $\log P(D; \theta)$; in MAP we maximize $\log P(D \mid \theta) + \log P(\theta)$. So essentially, in MAP we only add the term $\log P(\theta)$ to our optimization. This term is independent of the data and penalizes the parameters if they deviate too much from what we believe is reasonable.