Objective of EM Algorithm
From Wikipedia:
"In statistics, an expectation–maximization (EM) algorithm is an iterative method to find maximum likelihood (ML) or maximum a posteriori (MAP) estimates of parameters in statistical models, where the model depends on unobserved latent variables."
In the statistical model, we have:
- $X$ - the observed data
- $Z$ - the unobserved latent data or missing values
- $\theta$ - the unknown parameters

Then the complete-data likelihood function is:

$$L(\theta; X, Z) = p(X, Z \mid \theta)$$

Since we do not observe $Z$, we need to marginalize it out:

$$L(\theta; X) = p(X \mid \theta) = \sum_{Z} p(X, Z \mid \theta)$$

In its general form, the EM algorithm maximizes the marginal likelihood of the problem:

EM Objective:

$$\hat{\theta} = \arg\max_{\theta} \log p(X \mid \theta) = \arg\max_{\theta} \log \sum_{Z} p(X, Z \mid \theta)$$
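As a concrete illustration of this objective (my addition, not from the original derivation), the sketch below evaluates the marginal log-likelihood for a toy 1-D Gaussian mixture, where the latent variable $Z$ is the mixture-component assignment. The names `log_marginal_likelihood`, `weights`, `means`, and `stds` are illustrative choices.

```python
import numpy as np
from scipy.stats import norm

def log_marginal_likelihood(x, weights, means, stds):
    """log p(X | theta) = sum_i log sum_z p(x_i, z | theta) for a toy
    1-D Gaussian mixture; the latent z is the mixture-component index."""
    # joint[k, i] = p(z_i = k) * N(x_i | mu_k, sigma_k)
    joint = weights[:, None] * norm.pdf(x[None, :], means[:, None], stds[:, None])
    return np.sum(np.log(joint.sum(axis=0)))  # marginalize z, then sum the log over data points

x = np.array([-1.2, 0.3, 4.8, 5.1])
print(log_marginal_likelihood(x, np.array([0.5, 0.5]), np.array([0.0, 5.0]), np.array([1.0, 1.0])))
```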
General Idea
Suppose we want to maximize the following marginal log-likelihood function:

$$\log p(X \mid \theta) = \log \sum_{Z} p(X, Z \mid \theta)$$

Similar to variational inference, here we approximate $\log p(X \mid \theta)$ with a family of lower bounds $\mathcal{L}(\theta, q)$, one for each distribution $q(Z)$ over the latent variables. Here, the $q$ are known as variational parameters. In other words, we find the variational lower bounds as follows, using Jensen's inequality (since $\log$ is concave):

$$\log p(X \mid \theta) = \log \sum_{Z} q(Z) \frac{p(X, Z \mid \theta)}{q(Z)} \;\geq\; \sum_{Z} q(Z) \log \frac{p(X, Z \mid \theta)}{q(Z)} = \mathcal{L}(\theta, q)$$
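As a quick numerical sanity check (my addition, reusing the toy mixture from the earlier sketch in self-contained form), the snippet below verifies that $\mathcal{L}(\theta, q) \leq \log p(X \mid \theta)$ holds for an arbitrary choice of $q$:

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)

# Toy 1-D Gaussian mixture (illustrative parameters).
x = np.array([-1.2, 0.3, 4.8, 5.1])
weights, means, stds = np.array([0.5, 0.5]), np.array([0.0, 5.0]), np.array([1.0, 1.0])

# joint[k, i] = p(x_i, z_i = k | theta)
joint = weights[:, None] * norm.pdf(x[None, :], means[:, None], stds[:, None])
log_marginal = np.sum(np.log(joint.sum(axis=0)))      # log p(X | theta)

# An arbitrary variational distribution q(z_i) for each data point.
q = rng.random(joint.shape)
q /= q.sum(axis=0, keepdims=True)

# Jensen's lower bound: L(theta, q) = sum_i sum_k q(z_i=k) log( p(x_i, z_i=k | theta) / q(z_i=k) )
lower_bound = np.sum(q * (np.log(joint) - np.log(q)))

print(log_marginal, lower_bound)   # the lower bound never exceeds the marginal log-likelihood
assert lower_bound <= log_marginal + 1e-9
```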
Initialization: Suppose we start with the point $\theta_0$ in the figure above.
E-step: We want to find, from the family of lower bounds, the best lower bound that touches the marginal log-likelihood at the point $\theta_0$.
The red lower bound in the figure below is the best lower bound touching it at the point $\theta_0$.
M-step: Once we have found the best lower bound in the E-step, as shown above, we want to find the point $\theta_1$ that maximizes this lower bound.
As shown in the figure below, $\theta_1$ is the new point.
Continue iteration: We keep alternating the E-step and M-step until convergence, as shown in the figure below; a minimal sketch of this loop follows.
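A minimal sketch of this iterate-until-convergence loop, assuming hypothetical `e_step` and `m_step` callables whose exact form depends on the model:

```python
def em(x, theta0, e_step, m_step, n_iters=100, tol=1e-6):
    """Generic EM loop: alternately pick the best lower bound at the current
    parameters (E-step) and maximize that bound over the parameters (M-step)."""
    theta, prev_ll = theta0, -float("inf")
    for _ in range(n_iters):
        q, log_lik = e_step(x, theta)      # best lower bound touching log p(X|theta) at theta
        theta = m_step(x, q)               # new theta maximizing the chosen lower bound
        if abs(log_lik - prev_ll) < tol:   # stop once the log-likelihood barely changes
            break
        prev_ll = log_lik
    return theta
```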
E-step and KL divergence
Now, in the E-step, we are trying to approximate the marginal log-likelihood $\log p(X \mid \theta)$ with a family of lower bounds, one for each distribution $q(Z)$. We want to choose the distribution $q$ that minimizes the gap shown in the figure:

$$\text{GAP} = \log p(X \mid \theta) - \mathcal{L}(\theta, q)$$

Now, if we use Jensen's lower bound from above as the approximating function, i.e. $\mathcal{L}(\theta, q) = \sum_{Z} q(Z) \log \frac{p(X, Z \mid \theta)}{q(Z)}$, then the gap becomes:

$$\log p(X \mid \theta) - \mathcal{L}(\theta, q) = \sum_{Z} q(Z) \log \frac{q(Z)}{p(Z \mid X, \theta)} = \mathrm{KL}\big(q(Z) \,\|\, p(Z \mid X, \theta)\big)$$

Therefore, to minimize the gap, we need the KL divergence to be zero, which happens only when the two distributions are the same. Therefore, we have the following:
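A self-contained numerical check (my addition, again using a toy Gaussian mixture) that the gap between the marginal log-likelihood and Jensen's lower bound equals $\mathrm{KL}(q \,\|\, p(Z \mid X, \theta))$:

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(1)

x = np.array([-1.2, 0.3, 4.8, 5.1])
weights, means, stds = np.array([0.5, 0.5]), np.array([0.0, 5.0]), np.array([1.0, 1.0])

joint = weights[:, None] * norm.pdf(x[None, :], means[:, None], stds[:, None])  # p(x_i, z_i = k)
marginal = joint.sum(axis=0)             # p(x_i | theta)
posterior = joint / marginal             # p(z_i = k | x_i, theta)

q = rng.random(joint.shape)              # arbitrary q(z_i)
q /= q.sum(axis=0, keepdims=True)

gap = np.sum(np.log(marginal)) - np.sum(q * (np.log(joint) - np.log(q)))
kl = np.sum(q * (np.log(q) - np.log(posterior)))   # KL(q || p(Z | X, theta))

print(gap, kl)
assert np.isclose(gap, kl)               # gap == KL divergence, up to floating point error
```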
E-step
For each $i$, set
$$q^{k+1}(z_i) = p(z_i \mid x_i, \theta^k)$$
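For a concrete latent-variable model, this posterior is often available in closed form. A sketch of the E-step for the toy 1-D Gaussian mixture used above (an assumed example; `e_step` and its arguments are illustrative names):

```python
import numpy as np
from scipy.stats import norm

def e_step(x, weights, means, stds):
    """E-step for a 1-D Gaussian mixture: q(z_i = k) = p(z_i = k | x_i, theta)."""
    joint = weights[:, None] * norm.pdf(x[None, :], means[:, None], stds[:, None])
    q = joint / joint.sum(axis=0, keepdims=True)   # posterior responsibilities, shape (K, N)
    log_lik = np.sum(np.log(joint.sum(axis=0)))    # marginal log-likelihood at the current theta
    return q, log_lik
```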
M-step
In the M-step, we want to find $\theta^{k+1} = \arg\max_{\theta} \mathcal{L}(\theta, q^{k+1})$. Now,

$$\mathcal{L}(\theta, q) = \sum_{Z} q(Z) \log \frac{p(X, Z \mid \theta)}{q(Z)} = \sum_{Z} q(Z) \log p(X, Z \mid \theta) - \sum_{Z} q(Z) \log q(Z)$$

The second term does not depend on $\theta$, so the M-step reduces to maximizing the expected complete-data log-likelihood:

$$\theta^{k+1} = \arg\max_{\theta} \mathbb{E}_{q^{k+1}(Z)}\left[\log p(X, Z \mid \theta)\right]$$
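For the same toy Gaussian mixture, this expectation can be maximized in closed form; a sketch (the function name and return convention are my own):

```python
import numpy as np

def m_step(x, q):
    """M-step for a 1-D Gaussian mixture: maximize E_q[log p(X, Z | theta)]
    over theta = (weights, means, stds), which has a closed-form solution."""
    nk = q.sum(axis=1)                    # effective number of points per component
    weights = nk / nk.sum()
    means = (q @ x) / nk
    variances = np.sum(q * (x[None, :] - means[:, None]) ** 2, axis=1) / nk
    return weights, means, np.sqrt(variances)
```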
Summary
In summary:
- We approximate the marginal log-likelihood using a family of lower bounds:
$$\log p(X \mid \theta) \geq \mathcal{L}(\theta, q) = \sum_{Z} q(Z) \log \frac{p(X, Z \mid \theta)}{q(Z)}$$
- In the E-step, we find the best lower bound by maximizing over this family, so that the gap between the marginal log-likelihood and the lower bound is minimized:
$$q^{k+1} = \arg\max_{q} \mathcal{L}(\theta^k, q)$$
- It turns out that this gap is equal to the KL divergence between $q(Z)$ and the posterior $p(Z \mid X, \theta)$, so the E-step reduces to setting $q^{k+1}(Z) = p(Z \mid X, \theta^k)$.
- In the M-step, we maximize the best lower bound to find the best $\theta$:
$$\theta^{k+1} = \arg\max_{\theta} \mathcal{L}(\theta, q^{k+1})$$

A complete sketch combining both steps for a toy mixture model follows.
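Putting the two steps together for the toy 1-D Gaussian mixture used in the earlier sketches (a self-contained illustration, not from the original post):

```python
import numpy as np
from scipy.stats import norm

def em_gmm_1d(x, weights, means, stds, n_iters=50):
    """EM for a 1-D Gaussian mixture: alternate closed-form E- and M-steps."""
    for _ in range(n_iters):
        # E-step: responsibilities q(z_i = k) = p(z_i = k | x_i, theta)
        joint = weights[:, None] * norm.pdf(x[None, :], means[:, None], stds[:, None])
        q = joint / joint.sum(axis=0, keepdims=True)
        # M-step: maximize E_q[log p(X, Z | theta)] in closed form
        nk = q.sum(axis=1)
        weights = nk / nk.sum()
        means = (q @ x) / nk
        stds = np.sqrt(np.sum(q * (x[None, :] - means[:, None]) ** 2, axis=1) / nk)
    return weights, means, stds

rng = np.random.default_rng(2)
x = np.concatenate([rng.normal(0.0, 1.0, 200), rng.normal(5.0, 1.0, 300)])
print(em_gmm_1d(x, np.array([0.5, 0.5]), np.array([-1.0, 1.0]), np.array([1.0, 1.0])))
```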
Convergence
The EM algorithm converges to a local maximum.
In the figure above, at the point $\theta^k$, we chose the best lower bound curve $\mathcal{L}(\theta, q^{k+1})$ during the E-step, so that $\mathcal{L}(\theta^k, q^{k+1}) = \log p(X \mid \theta^k)$. In the M-step, we found $\theta^{k+1}$ where this lower bound is maximum. The marginal log-likelihood at the point $\theta^{k+1}$ is $\log p(X \mid \theta^{k+1})$, which is higher than or equal to the lower bound at that point, i.e.

$$\log p(X \mid \theta^{k+1}) \geq \mathcal{L}(\theta^{k+1}, q^{k+1}) \geq \mathcal{L}(\theta^k, q^{k+1}) = \log p(X \mid \theta^k)$$

That is, the likelihood value at step $k+1$ is higher than or equal to the value at step $k$: at each iteration, the EM algorithm does not decrease the likelihood function. This monotonicity can also be used as a debugging tool.
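One way to exploit this monotonicity for debugging (a sketch, assuming hypothetical `e_step`/`m_step` callables like the Gaussian-mixture ones above, where the E-step also returns the current log-likelihood):

```python
def run_em_with_check(x, theta, e_step, m_step, n_iters=100):
    """Run EM and assert that the marginal log-likelihood never decreases.
    A drop between iterations signals a bug in the E-step or M-step."""
    prev_ll = -float("inf")
    for it in range(n_iters):
        q, log_lik = e_step(x, *theta)
        assert log_lik >= prev_ll - 1e-8, f"log-likelihood dropped at iteration {it}"
        theta = m_step(x, q)
        prev_ll = log_lik
    return theta
```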