Probabilistic LatentVariable Models

The two general forms of probabilistic models are:

p(x): a typicalprobabilistic distribution. In this model, we call <spanclass="math inline">x the query.
<spanclass="math inline">p(y ∣ x): aconditional probabilistic distribution. In this model, we cal <spanclass="math inline">x the evidence and <spanclass="math inline">y the query.

Latent variable models are models that have variables other than thequery and the evidence.

<spanclass="math inline">p(x) = ∑_zp(x ∣ z) p(z)

A classic latent variable model of <spanclass="math inline">p(x) is the mixture model,where p(x) isactually a mixture of several different probabilistic model. Forexample, in the following graph, <spanclass="math inline">z is a discrete variablerepresenting which class a datapoint belongs to and is represented bydifferent colors here. <spanclass="math inline">p(x ∣ z) is eachclass’s probability distribution, where in this case can each be modeledby a Gaussian. And <spanclass="math inline">p(x) when we observe it, isjust a bunch of uncolored datapoints and is hard to fit a distributionon it. However, we can see it’s roughly spread in 3 clusters so weintroduce the latent variable representing class and we can now well fita Gaussian mixture model on it (a mixture of 3 Gaussians)

<img src="/images/Mixture-Gaussian-Distribution.png"alt="Gaussian Mixture Model" />
<spanclass="math inline">p(y ∣ x) = ∑_zp(y ∣ x, z) p(z)or <spanclass="math inline">p(y ∣ x) = ∑_zp(y ∣ z) p(z ∣ x):the conditional probability is a bit more free. You can decompose andmodel it using z as youlike.

An example of latent conditional model is the mixture densitynetwork, which we use in RL’s imitation learning to deal withmulti-modal situations each requiring a different distribution.

Latent Variable Models inGeneral

When we use latent variable models, it means we want todecompose a complicated distribution into several simple / easydistributions. By complicated, we mean it’snot possible to write it in a well-defined distribution. Bysimple / easy, we mean we can write it as awell-defined parametrized distribution, where the parameters can becomplex, but the distribution itself is easy to write (a Gaussian ofjust mean and sigma, or as a Bernoulli with just one variable, etc.)<spanclass="math display">p(x) = ∫p(x ∣ z)p(z)dz

p(z) is an “easy”prior we choose. For example a Gaussian, a categorical distribution,etc.
p(x ∣ z)should also be an easy distribution, like a Gaussian: $ p(x z) =({nn}(z), {nn}(z))$ even though the mapping from <spanclass="math inline">z to the actual parameters ofGaussian can be complex, where in this case we have to model the mappingthrough a neural network and this mapping is the learnable part.
p(x) iscomplicated, not possible to write out as any well-defined distribution.Therefore, we decompose it into the two parts above that areeasy to parametrize as a probability distribution and learn theparameters inside the distribution.

Generative models are not equal to latent variable models. We usuallymodel generative models as latent variable ones because generativemodels are usually complex probability distributions and we can make iteasier by introducing one or more latent variable.

How to Train a LatentVariable Model

Given dataset <spanclass="math inline">𝒟 = {x₁, x₂, …, x_N},to fit a typical probabilistic model <spanclass="math inline">p_θ(x),we use the maximum likelihood estimation: $$\theta \leftarrow \underset{\theta}{arg\!\max} \frac 1 N \sum_i \logp_\theta(x_i)$$ In the latent variable model set up, we can substitute thedefinition in and an MLE would look like $$\theta \leftarrow \underset{\theta}{arg\!\max} \frac 1 N\sum_i \log \left( \int p_\theta(x_i \mid z) p(z) dz \right)$$ <spanclass="math inline">p_θ(x ∣ z)and p(z) aredistributions of our choices, but this integral is still intractablewhen z is continuous. So nowit’s time to do some math tricks.

Variational Inference

Variational Approximation

First, we construct an easy / simple probability distribution <spanclass="math inline">q_i(z)to approximate <spanclass="math inline">p(z|x_i)- the posterior distribution specific to datapoint <spanclass="math inline">x_i. By easy weagain mean it is easy to parametrize (a Gaussian, a Bernoulli, etc.)

We will show that by introducing this <spanclass="math inline">q_i(z),we can actually construct a lower bound of <spanclass="math inline">log p(x_i).What’s good with this lower bound? Later on, we will also prove thisbound is sufficiently tight, so as we push up the value of this lowerbound, we push up the value of <spanclass="math inline">p(x_i)which is exactly what we want.

$$\begin{align}\log p(x_{i})&= \log\int_{z}p(x_{i}|z)p(z)\\&= \log\int_{z}p(x_{i}|z)p(z) \frac{q_i(z)}{q_i(z)} \\&= \log \mathbb E_{z\sim q_{i}(z)} \left[\frac{p(x_{i}|z)p(z)}{q_{i}(z)}\right] \\&\geq \mathbb E_{z\sim q_{i}(z)}\left[\log\frac{p(x_{i}|z)p(z)}{q_{i}(z)}\right] &\text{\# Jensen'sInequality} \\&= \mathbb E_{z\sim q_{i}(z)} \left[\log p(x_{i}|z)+\log p(z)\right] - \mathbb E_{z\sim q_{i}(z)} \left[ \log {q_{i}(z)}\right]\\&= \mathbb E_{z\sim q_{i}(z)} \left[\log p(x_{i}|z)+\log p(z)\right] + \mathcal H_{z\sim q_{i}(z)} (q_i)= \mathcal L_i(p, q_i)\end{align}$$ Recall p(x)is a difficult probability distribution, so we decompose it into twoeasy distributions <spanclass="math inline">p(x|z) and <spanclass="math inline">p(z), and use an easydistribution <spanclass="math inline">q_i(z)to approximate the posterior <spanclass="math inline">p(z|x_i).Now the good thing is: everything here is tractable: for the first term,we can fix a <spanclass="math inline">q_i(z)of our choice (recall <spanclass="math inline">q_i is adistribution we constructed), sample some <spanclass="math inline">z, and evaluate the expression. Forthe second term, we notice it is just the entropy of a distribution andfor simple distributions (we constructed <spanclass="math inline">q_i to besimple), it has a closed form (even if it doesn’t, you can simply sampleand evaluate)

We call the final lower bound we derived here the variancelower bound or evidence lower bound (ELBO).$$\begin{align}\log p(x_{i})&\geq \mathcal L_i(p, q_i) \\&= \mathbb E_{z\sim q_{i}(z)} \left[\log p(x_{i}|z)+\log p(z)\right] + \mathcal H_{z\sim q_{i}(z)} (q_i)\end{align}$$ ### Effect of Pushing Up ELBO (Intuitively)

Assume our p(⋅) is a fixedvalue, what does pushing up ELBO mean? Here, we give out an intuitiveexplanation. First, we look at the first term with thetwo log combined. $$\begin{align} &\mathbb E_{z\sim q_{i}(z)} \left[\log p(x_{i}|z)+\log p(z)\right] \\= &\mathbb E_{z\sim q_{i}(z)} \left[\log p(x_{i},z) \right]\end{align}$$ To maximize this value, we just have to find a distribution ofz, inside which we have thelargest value of <spanclass="math inline">p(x_i, z).Therefore, we want z todistribute mostly under the peak of <spanclass="math inline">p(x_i, z),Since <spanclass="math inline">q_i(z)is the distribution we currently have for z, we want <spanclass="math inline">q_i(z)to sit mostly under the peak of <spanclass="math inline">p(x_i, z).In the following graph, the y-axis is <spanclass="math inline">p(x_i, z),the distribution we try to maximize, and the x-axis is our latentvariable z. There is also a hidden axis - the probability mass(distribution) of z. Weproject this hidden axis to the y-axis in this graph. To maximize thisfirst term, we spread z’s massas much under the peak of <spanclass="math inline">p(x_i, z)as possible, which makes the green part of this graph.

Now we take the second term entropy intoconsideration. <spanclass="math display">ℒ_i(p, q_i) = 𝔼_{z ∼ q_i(z)}[log p(x_i, z)] + ℋ_{z ∼ q_i(z)}(q_i)From our entropy post, weknow entropy measures the expected code length of communicating theevent described by a random variable. So the more random this variableis, the more code words it’s required to communicate it. Therefore, themore spread out / uniform the distribution is, the higher the entropy.If we’re maxing the entropy, we don’t want the distribution to beskinny. See the following graph.

When we consider both entropy and the first term, we should achievethis probability distribution depicted in brown. If we don’t have theentropy, z will want to situnder the most likely point, but since we added entropy, <spanclass="math inline">z now tries to cover it. Inconclusion, (equal sign “=” reads “in effect”) maximize evidence lowerbound = cover most of the <spanclass="math inline">p(x_i|z)distribution = maximize approximation between <spanclass="math inline">q_i and <spanclass="math inline">p(x_i|z).

Effect of Pushing UpELBO (Analytically)

Can we measure how good our approximation is? That is, can we measurethe distance between <spanclass="math inline">p(x_i|z)and q_i? Infact, we have a nice, analytical way to look at it using KLdivergence. For two arbitrary distribution <spanclass="math inline">p, q of <spanclass="math inline">x, the KL divergence of <spanclass="math inline">q from <spanclass="math inline">p (the distance from <spanclass="math inline">q to <spanclass="math inline">p, note KL divergence is notsymmetric) is

$$\begin{align}D_{\mathrm{KL}}(q|p)&=E_{x\sim q(x)}\left[\log{\frac{q(x)}{p(x)}}\right]\\&=E_{x \sim q(x)}[\log q(x)]-E_{x \sim q(x)}[\log p(x)]\\&=-E_{x \sim q(x)}[\log p(x)]-H(q)\end{align}$$ Doesn’t this look similar to our evidence lower bound?Borrowing that explanation, minimizing KL divergence = cover most of thep(z) distribution =maximize approximation between <spanclass="math inline">q and <spanclass="math inline">p.

Having understood the definition of KL divergence, let’s use it tomeasure the distance between <spanclass="math inline">q_i(z)and <spanclass="math inline">p(z|x_i)- the distribution we want <spanclass="math inline">q_i toapproximate: $$\begin{align}D_{KL}(q_{i}(z)\|p(z \mid x_{i}))&= E_{z\simq_{i}(z)}\left[\log{\frac{q_{i}(z)}{p(z|x_{i})}}\right]\\&= E_{z\simq_{i}(z)}\left[\log{\frac{q_{i}(z)p(x_{i})}{p(x_{i},z)}}\right]\\&= -E_{z\sim q_{i}(z)}\left[\log p(x_{i}|z)+\log p(z)\right] +E_{z\sim q_{i}(z)}\log q_i(z) + E_{z\sim q_{i}(z)}\log p(x_{i})\\&= -E_{z\sim q_{i}(z)}\left[\log p(x_{i}|z)+\log p(z)\right] +\mathcal H(q_i) + E_{z\sim q_{i}(z)}\log p(x_{i})\\&= -\mathcal L(p, q_i) + \log p(x_i)\\\log p(x_i) &= \mathcal L(p, q_i) + D_{KL}(q_{i}(x_{i})\|p(z \midx_{i}))\end{align}$$ Therefore, having a good approximation of <spanclass="math inline">q_i to <spanclass="math inline">p(x_i|z)= driving KL divergence, which is always a non-negative number, to 0 =the evidence lower bound is a tight bound or even equal to <spanclass="math inline">log p(x_i)- the ultimate thing we want to optimize.

Looking at our optimization objective <spanclass="math inline">ℒ here: <spanclass="math display">ℒ(p, q_i) = log p(x_i) − D_KL(q_i(x_i)∥p(z ∣ x_i))

When we optimize w.r.t. <spanclass="math inline">q: note the first term <spanclass="math inline">log p(x_i)is independent of q, so itsvalue stays the same. We are in effect optimizing against the KLdivergence only, making the distance between our approximation <spanclass="math inline">q_i and <spanclass="math inline">p(z|x_i)smaller. The best / extreme case is we have <spanclass="math inline">D_KL = 0,so <spanclass="math inline">ℒ = log p(x_i).
When we optimize w.r.t. <spanclass="math inline">p: Recall our ultimate goal is tomake <spanclass="math inline">log p(x_i)bigger, so we make a better model in theory. Only in theory because wedon’t know whether the bound is tight or not.

The Learning Algorithm?

Therefore, when we optimize <spanclass="math inline">ℒ(p, q_i)w.r.t. q, we make the boundtighter (make ℒ a better approximationof <spanclass="math inline">log p(x_i)); when we optimize <spanclass="math inline">ℒ(p, q_i)w.r.t. p, we make a bettermodel in general.

By alternating these two steps, we have the actual learningalgorithm. Let’s review: which parts are learnable in these twodistributions?

In our latentvariable model setup, we decompose the complicated distributionp(x) into two easydistributions <spanclass="math inline">p(x|z) and <spanclass="math inline">p(z), where the mappingfrom z to actual parameters ofthis p(x|z)distribution needs to be modeled by a complex network. Therefore, theonly distribution in the ppart with learnable parameters is <spanclass="math inline">p(x|z). We denoteit with <spanclass="math inline">p_θ(x|z).
In our ELBO setup, wealso introduced a simple <spanclass="math inline">q_i(z)for each datapoint <spanclass="math inline">x_i toapproximate the posterior <spanclass="math inline">p(z|x_i).To optimize w.r.t. q, weoptimize the parameters of each distribution. If <spanclass="math inline">q_i(z) = 𝒩(μ_i, σ_i),we optimize each <spanclass="math inline">μ_i, σ_i.(we can optimize the entropy value for sure, but I’m not entirelysure how you would take gradient of the expectation term <spanclass="math inline">E_{z ∼ q_i(z)}[log p(z)])

Therefore, we have our learning algorithm: <spanclass="math display">$$\begin{align}&\text{for each $x_i$ in $\{x_1, \dots, x_N\}$: }\\&\hspace{4mm} \text{sample $z \sim q_i(z)$}\\&\hspace{4mm} \text{optimize against $p$:}\\&\hspace{4mm} \hspace{4mm} \nabla_\theta \mathcal L (p_\theta, q_i)= \nabla_\theta \log p_\theta(x_i|z) \\&\hspace{4mm} \hspace{4mm} \theta \leftarrow \theta + \alpha\nabla_\theta \mathcal L (p, q_i) \\&\hspace{4mm} \text{optimize against $q$:}\\&\hspace{4mm} \hspace{4mm} \nabla_{\mu_i, \sigma_i} \mathcal L(p_\theta, q_i) = \nabla_{\mu_i, \sigma_i} \left[\mathbb E_{z\simq_{i}(z)} \left[\log p(x_{i}|z)+\log p(z) \right] + \mathcal H_{z\simq_{i}(z)} (q_i) \right] \\&\hspace{4mm} \hspace{4mm} (\mu_i, \sigma_i) \leftarrow (\mu_i,\sigma_i) + \alpha \nabla_{\mu_i, \sigma_i} \mathcal L (p, q_i) \\\end{align}$$

There’s a problem with optimizing <spanclass="math inline">q_i though. Notewe have a separate q for eachdata point i, which means ifwe have N data points, we willhave to store <spanclass="math inline">N × (|μ_i| + |σ_i|)parameters assuming we chose <spanclass="math inline">q_i to beGaussian. In machine learning, the number of data points <spanclass="math inline">N is usually in millions, makingthis model unwieldily big. It’s true in inference time we do not useq at all (we’ll see why thisis true in the last chapter about VAE), but in training time, we stillneed them so it’s necessary to keep all these parameters.

Therefore, instead of having a separate <spanclass="math inline">q_i(⋅) toapproximate each data point’s <spanclass="math inline">P(⋅|x_i)specifically, we use a learnable model <spanclass="math inline">q_ϕ(⋅|x_i)to approximate <spanclass="math inline">p(⋅|x_i)This learnable network will take in a datapoint <spanclass="math inline">x_i, predicts thecorresponding <spanclass="math inline">μ_i, σ_i.We can then sample z from thispredicted network.

Amortized

By adapting q to be alearnable network <spanclass="math inline">q_ϕ instead,model size does not depend on the number of datapoints anymore.Therefore, it is amortized.

The variational lower bound becomes: <spanclass="math display">ℒ(p_θ(x_i|z), q_ϕ(z|x_i)) = 𝔼_{z ∼ q_ϕ(z|x_i)}[log p_θ(x_i|z) + log p(z)] + ℋ(q_ϕ(z|x_i))The learning algorithm naturally becomes: $$ $$

Gradient OverExpectation (Policy Gradient)

The question now boils down to how do we calculate this gradient<spanclass="math inline">∇_ϕℒ(p_θ, q_ϕ).

The second term entropy is easy. We purposefully chose <spanclass="math inline">q to be a simple distribution, sothere is usually a close form of its entropy and we just have to look itup.

The meat is in the first part. How do we take gradient w.r.t.parameter ϕ in the expectationterm’s distribution <spanclass="math inline">q_ϕ ? Note theterm inside expectation is independent of <spanclass="math inline">ϕ, so we can rewrite it as <spanclass="math inline">R(x_i, z) = log p_θ(x_i|z) + log p(z)and call the whole thing <spanclass="math inline">J.
<spanclass="math display">J(ϕ) = ∇_ϕ𝔼_{z ∼ q_ϕ(z|x_i)}[R(x_i, z)]We chose these namings purposefully because we encountered somethingsimilar back in the <ahref="https://slides.com/sarahdean-2/sp24-cs-4789-lec-16?token=KNeurk-c#/11/0/4">policygradient part of reinforcement learning LINK???</a>.Say we have a trajectory τ,sampled from the state transition function with learnable policy <spanclass="math inline">π_θ, the finalexpected value we can get from starting state <spanclass="math inline">s₀ can be written as thefollowing, where R(τ)is a reward function returning the reward of this trajectory. <spanclass="math display">J(θ) = V^π_θ(s₀) = 𝔼_{τ ∼ P_s₀^π_θ}[R(τ)]We can take the gradient of this value function <spanclass="math inline">V w.r.t our policy <spanclass="math inline">π_θ, so this iscalled policy gradient. If you’re unfamiliar with RL setup, you justhave to know we can derive the following gradient and we can approximateit by sampling M trajectories.$$ $$Pugging in our $q$ and $\phi$,$$ $$

Reparametrization Trick

We have our full learning algorithm and it’s ready to go now.However, there is a tiny improvement we can do.

We defined our <spanclass="math inline">q_ϕ to be anormal distribution <spanclass="math inline">𝒩(μ_ϕ, σ_ϕ)Observe that all normal distributions can be written as a function ofthe unit normal distribution. Therefore, a sample <spanclass="math inline">z is in effect: <spanclass="math display">z ∼ 𝒩(μ_ϕ, σ_ϕ) ⇔ z = μ_ϕ + ϵσ_ϕ, ϵ ∼ 𝒩(0, 1)Let’s rewrite our expectation term to now sample an <spanclass="math inline">ϵ from the unit normal distributioninstead. By decomposing z intothese two parts, we separate the stochastic part and changed <spanclass="math inline">z from a sample of some stochasticdistribution into a deterministic function <spanclass="math inline">z(ϕ, ϵ)parametrized by ϕ and randomvariable ϵ that is independentof ϕ. <spanclass="math inline">ϵ takes the stochastic part alone.Our learnable parameter ϕ nowonly parametrizes deterministic quantity. <spanclass="math display">∇_ϕJ(ϕ) = ∇_ϕ𝔼_{ϵ ∼ 𝒩(0, 1)}[R(x_i, μ_ϕ + ϵσ_ϕ)]Aside from these theoretical benefits, mathematically, we do not have totake gradient w.r.t an expectation of parametrized distribution anymore.Instead, the gradient can go straight into the expectation term now likehow we usually interchange gradient and expectation (think aboutdiscrete case, expectation is just a big sum so we can do it). <spanclass="math display">∇_ϕJ(ϕ) = 𝔼_{ϵ ∼ 𝒩(0, 1)}[∇_ϕR(x_i, μ_ϕ + ϵσ_ϕ)]Further, to approximate this expectation, we just sample some <spanclass="math inline">ϵ from this normal distribution.$$\nabla_\phi J(\phi)\approx \frac 1 M \sum_j^M \nabla_\phi R(x_i, \mu_\phi + \epsilon_j\sigma_\phi)$$

With reparametrization, we achieve a lower variance than policygradient because we are using the derivative of R. (Unfortunatelythe lecturer didn’t provide a quantitative analysis on this and I don’tknow how to prove it) On the other hand, previously, we only tookderivative w.r.t. the probability distribution. Why didn’t we usederivative of R back in RL with policy gradient? It’s not we don’t wantto but we can’t: we can’t use reparametrization in RL because in RL weusually cannot take derivative w.r.t. reward R.

Method	Formula	Approximation	Benefit	Deficit
Policy Gradient	<spanclass="math inline">∇_ϕ𝔼_{z ∼ q_ϕ(z ∣ x_i)}[R(x_i, z)]</span>	$\frac 1 M \sum_j^M \nabla_\phi[\logq_\phi(z_j \mid x_i)] R(x_i,z_j)$	works with both discrete and continuous latent variable <spanclass="math inline">z</span>	High variance, requires multiple samples & small learningrates
Reparametrization	<spanclass="math inline">𝔼_{ϵ ∼ 𝒩(0, 1)}[∇_ϕR(x_i, μ_ϕ + ϵσ_ϕ)]</span>	$\frac 1 M \sum_j^M \nabla_\phi R(x_i,\mu_\phi + \epsilon_j \sigma_\phi)$	low variance, simple to implement (we’ll see soon)	only works with continuous variable <spanclass="math inline">z</span> and have to model it with aGaussian

In fact, you can forget about the policy gradient method and simplytake it for granted that you cannot back propagate a sampled value <spanclass="math inline">∇_ϕ𝔼_{z ∼ q_ϕ(z|x_i)},so you have to find some way to make our <spanclass="math inline">z deterministic, which is what we’redoing here with our reparametrization trick.

Left is without the “reparameterization trick”, and right is with it.Red shows sampling operations that are non-differentiable. Blue showsloss layers. We forward the network by going up and back propagate it bygoing down. The forward behavior of these networks is identical, butback propagation can be applied only to the right network. Figure copiedfrom Carl Doersch: Tutorialon Variational Autoencoders

Looking at <spanclass="math inline">ℒ Directly

$$\begin{align}\mathcal L_i = \mathcal L \left( p_\theta(x_i | z), q_\phi(z | x_i)\right)&= \mathbb E_{z\sim q_\phi(z | x_i)} \left[\logp_\theta(x_{i}|z)+\log p(z) \right] + \mathcal H (q_\phi(z|x_i))\\&= \mathbb E_{z\sim q_\phi(z | x_i)} \left[\log p_\theta(x_{i}|z)\right] + \mathbb E_{z\sim q_\phi(z | x_i)} \left[\log p(z) \right] + \mathcalH (q_\phi(z|x_i))\\&= \mathbb E_{z\sim q_\phi(z | x_i)} \left[\logp_\theta(x_{i}|z)\right] - D_{KL}(q_\phi(z | x_i)\|p(z)) \\&= \mathbb E_{\epsilon \sim \mathcal N(0,1)} \left[\logp_\theta(x_{i}| \mu_\phi + \epsilon \sigma_\phi)\right] -D_{KL}(q_\phi(z | x_i)\|p(z)) \\&\approx \frac 1 M \sum_j^M \log p_\theta(x_{i}| \mu_\phi +\epsilon_j \sigma_\phi) - D_{KL}(q_\phi(z | x_i)\|p(z)) \\\end{align}$$

For the first term, we can just evaluate it. For the second KL term,since we chose both distributions to be easy (in this case Gaussian),there often is a nice analytical form for it.

Therefore, we can go ahead to maximize the variational lower boundℒ. We can also draw out the followingcomputational graph for the log term and conclude we can back propagatethis graph without any problem. On the other hand, if we didn’t do thereparametrization trick, we will get stuck at <spanclass="math inline">z: you cannot back propagate <spanclass="math inline">z - a sampled value instead of avariable. And we will have to seek help from policy gradient. Withreparametrization, we decompose <spanclass="math inline">z into two variables <spanclass="math inline">μ_ϕ, σ_ϕwe can back propagate through and one stochastic value <spanclass="math inline">ϵ we do not care about.

Variational Autoencoder

Setup and Interpretation

What we have gone though constitutes the full pipeline of avariational autoencoder.

In a variation autoencoder, we have observed variable <spanclass="math inline">x and latent variable <spanclass="math inline">z

encoder <spanclass="math inline">q_ϕ(z|x) = 𝒩(μ_ϕ(x), σ_ϕ(x))
decoder <spanclass="math inline">p_θ(x|z) = 𝒩(μ_θ(z), σ_θ(z))

In training, given an observed sample <spanclass="math inline">x_i, we encode itto latent variable <spanclass="math inline">z_i using <spanclass="math inline">q_ϕ, then triesto decode it back with decoder <spanclass="math inline">p_θ. We maximizethe variational lower bound during the process. For all <spanclass="math inline">N samples, the training objectivelooks like: (where the ϵ is asampled value) $$\max_{\phi,\theta} \frac 1 N \sum_i^N \log p_\theta\left(x_{i}|\mu_\phi(x_i) + \epsilon \sigma_\phi(x_i)\right) - D_{KL}(q_\phi(z |x_i)\|p(z)) \\$$ In inference (generation), we sample a <spanclass="math inline">z from our prior <spanclass="math inline">p(z), then decode it usingp_θ: <spanclass="math inline">z ∼ p(z), x ∼ p_θ(x|z)

Why does the variational autoencoder work? We talked about manybenefits of maximizing this variational lower bound in <ahref="#Effect-of-Pushing-Up-ELBO-(Analytically)">previous chapter</a>.Let’s look at it again in this decoder-encoder setup,. <spanclass="math display">ℒ_i = 𝔼_{z ∼ q_ϕ(z|x_i)}[log p_θ(x_i|z)] − D_KL(q_ϕ(z|x_i)∥p(z))

The first <spanclass="math inline">log p_θ termmaximizes the probability of our observed image <spanclass="math inline">x given a sample <spanclass="math inline">z, so the model makes decoder <spanclass="math inline">p_θ toreconstruct image x asaccurate as possible.
The second KL term restricts the encoding of an image to be close tothe actual prior, which makes sure at inference / generate time, we candirectly sample from the prior.

Comparison with Auto-Encoder

The VAE’s decoder is trained to convert random points in theembedding space (generated by perturbing the input encodings) tosensible outputs. By contrast, the decoder for the deterministicautoencoder only ever gets as inputs the exact encodings of the trainingset, so it does not know what to do with random inputs that are outsidewhat it was trained on. So a standard autoencoder cannot create newsamples.

The reason the VAE is better at sample is that it embeds images intoGaussians in latent space, whereas the AE embeds images into points,which are like delta functions. The advantage of using a latentdistribution is that it encourages local smoothness, since a given imagemay map to multiple nearby places, depending on the stochastic sampling.By contrast, in an AE, the latent space is typically not smooth, soimages from different classes often end up next to each other. Figurecopied from <ahref="https://probml.github.io/pml-book/book1.html">ProbabilisticMachine Learning: An Introduction - Figure 20.26</a>

We can leverage the smoothness of the latent space to perform imageinterpolation in latent space.

Reference

Most content of this blog post comes from <ahref="https://www.youtube.com/watch?v=UTMpM4orS30">Berkeley CS 285(Sergey Levine): Lecture 18, Variational Inference</a>, which I thinkorganized his lecture based on <ahref="https://arxiv.org/abs/1906.02691">An Introduction to VariationalAutoencoders</a> (2.1-2.7, and 2.9.1), or more in-depth on the author’sPhD thesis VariationalInference and Deep Learning: A New Synthesis I found this wonderfultutorial in <ahref="https://probml.github.io/pml-book/book2.html">ProbabilisticMachine Learning: Advanced Topics</a>

Some graph come from <ahref="https://probml.github.io/pml-book/book1.html">ProbabilisticMachine Learning: An Introduction</a> itself and <ahref="https://arxiv.org/abs/1606.05908">Carl Doersch: Tutorial onVariational Autoencoders</a>, which is referenced in the previousbook.

Note though the Probabilistic Machine Learning book itselfis a horrible book with extremely confusing explanations.