1. Introduction

Statistical learning typically starts with observed data and a (parametric) model.  One common way to estimate the model parameters is maximum likelihood, that is, to pick the parameters that maximize the likelihood (probability) of the evidence.

Now while maximum likelihood leads to closed-form solutions for many models (in particular for those belonging to the exponential family) and has, at least asymptotically, many desirable properties such as consistency and sufficiency, there are models for which a direct maximization of the (log) likelihood function is infeasible.

In practice we often encounter missing data, where we lack sufficient information to specify the complete likelihood.  Another scenario in which not all variables in the model are observed are the so-called hidden or latent variable models. In this case there are variables which are essential for the model description but never observed. For example, the underlying physics of a system may contain latent processes which are essential to describe it but cannot be directly measured.  Both cases can be tackled by introducing latent variables, that is, by splitting the variables into visible ones (those for which we actually know the state) and missing ones (those whose states would or should nominally be known but are not). [2]

Expectation maximization (EM) is then an iterative approach to maximum likelihood parameter estimation in latent variable models.

In this article we will build up an understanding of the algorithm by considering the well-known Gaussian Mixture Model before giving a generalized proof of convergence and closing with a critical discussion.

Our exposition is largely based on the books "Pattern Recognition and Machine Learning" by Christopher Bishop [1] and "Bayesian Reasoning and Machine Learning" by David Barber [2].  Other useful references are indicated in the section Further Reading.

2. GMM Training

2.1 ML Estimate

Let's first derive the maximum likelihood estimate for the parameters $\theta = (\mu, C)$ of a multivariate Gaussian distribution.

Assuming a set of independent identically distributed observations $\mathcal{X}$ we need to maximize the likelihood function

$L(\theta) = \prod_{n=1}^N p\left(x_n|\theta\right)$

with respect to the parameter vector $\theta$.

As the (natural) logarithm is a strictly monotone function on $\mathbb{R}^+$ we can equivalently maximize the log likelihood function

$l(\theta) = \sum_{n=1}^N \log\left(p\left(x_n|\theta\right)\right)$

which will simplify subsequent calculations for models belonging to the exponential family.

For our running example of a multivariate Gaussian that is

$l(\mu, C) = -\frac{1}{2}\sum_{n=1}^N \left(x_n-\mu\right)^{\mathsf{T}} C^{-1} \left(x_n-\mu\right) - \frac{N}{2}\log\left|2\pi C\right|$

Taking the partial derivative with respect to $\mu$ and equating it to $0$ we obtain

$\mu_{\text{ML}} = \frac{\sum_{n=1}^N x_n}{N}$

To find $C_{\text{ML}}$ we rewrite the log likelihood as

$l(\mu, C) = -\frac{1}{2}\text{Trace}\left[C^{-1}\sum_{n=1}^N \left(x_n-\mu\right)\left(x_n-\mu\right)^{\mathsf{T}}\right] - \frac{DN}{2}\log(2\pi) + \frac{N}{2}\log\left|C^{-1}\right|$

and obtain

$C_{\text{ML}} = \frac{\sum_{n=1}^N \left(x_n-\mu_{\text{ML}}\right)\left(x_n-\mu_{\text{ML}}\right)^{\mathsf{T}}}{N}$
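The two closed-form estimates above are straightforward to check numerically. A minimal NumPy sketch (the function name `gaussian_ml` is ours, not from the text):

```python
import numpy as np

def gaussian_ml(X):
    """Closed-form ML estimates for a multivariate Gaussian.

    X: (N, D) array of observations.  Returns (mu_ML, C_ML).
    """
    N = X.shape[0]
    mu = X.mean(axis=0)            # mu_ML = (1/N) sum_n x_n
    diff = X - mu
    C = diff.T @ diff / N          # C_ML = (1/N) sum_n (x_n - mu)(x_n - mu)^T
    return mu, C
```

Note the division by $N$ rather than $N-1$: the ML covariance estimate is the biased sample covariance.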

2.2 Known Allocations

Now let's assume that we have a Gaussian Mixture Model with known allocations, as illustrated below for a three-component, two-dimensional example:

([1], Figure 9.5a)

Introducing a $K$-dimensional binary variable $z$ with a $1$-of-$K$ representation, the complete data log likelihood function is given by

$\log(p(X,Z|\mu,C,\pi)) = \sum_{n=1}^N \sum_{k=1}^K z_{n,k}\left(\log\left(\mathcal{N}\left(x_n|\mu_k,C_k\right)\right)+\log\left(\pi_k\right)\right)$

We see that this is a sum of $K$ independent components; the maximization with respect to the means and covariances therefore decouples and we have

$\mu_{k_{\text{ML}}} = \frac{\sum_{n=1}^N z_{n,k}\,x_n}{\sum_{n=1}^N z_{n,k}}$

and

$C_{k_{\text{ML}}} = \frac{\sum_{n=1}^N z_{n,k}\left(x_n-\mu_{k_{\text{ML}}}\right)\left(x_n-\mu_{k_{\text{ML}}}\right)^{\mathsf{T}}}{\sum_{n=1}^N z_{n,k}}$

Note that both relations depend directly on the number of observations assigned to the $k$-th component so that we have a sufficient statistic given as

$N_k := \sum_{n=1}^N z_{n,k}$
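With observed allocations these estimates are just per-component weighted averages. A small sketch (function and variable names are our choices) that takes a one-hot allocation matrix:

```python
import numpy as np

def gmm_known_allocations(X, Z):
    """ML estimates for a GMM when the 1-of-K allocations Z are observed.

    X: (N, D) observations; Z: (N, K) one-hot allocation matrix.
    Returns per-component means (K, D), covariances (K, D, D), and counts N_k.
    """
    N_k = Z.sum(axis=0)                              # sufficient statistic N_k
    mu = (Z.T @ X) / N_k[:, None]                    # mu_k = sum_n z_nk x_n / N_k
    K, D = mu.shape
    C = np.zeros((K, D, D))
    for k in range(K):
        diff = X - mu[k]
        # Only points allocated to component k contribute to C_k.
        C[k] = (Z[:, k, None] * diff).T @ diff / N_k[k]
    return mu, C, N_k
```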

2.3 Unknown Allocations

If the allocations are unknown we might have the following observations:

([1], Figure 9.5b)

Assuming mixing probabilities

$\pi_k := p(z_k = 1)$

the likelihood of a single observation is

$p(x) = \sum_{k=1}^K \pi_k\,\mathcal{N}\left(x|\mu_k,C_k\right)$

For $N$ IID observations the corresponding log likelihood function is

$\log(p(X|\pi,\mu,C)) = \sum_{n=1}^N \log\left(\sum_{k=1}^K \pi_k\,\mathcal{N}\left(x_n|\mu_k,C_k\right)\right)$

Thus the parameters to be found are $\left(\mu_{\{k\}}, C_{\{k\}}, \pi_{\{k\}}\right)$.

In a maximum likelihood approach we would again have to equate the derivatives $\left(\partial_{\mu_{\{k\}}}, \partial_{C_{\{k\}}}, \partial_{\pi_{\{k\}}}\right)$ to $0$ and solve for the respective parameters.

However, it is easy to see that the corresponding equations are coupled, which prevents an easy (closed-form) solution.

As we don't know the values of the allocations $z_{n,k}$, we use their expected value with respect to the posterior distribution, which by Bayes' law is

$p(Z|X,\mu,C,\pi) \propto \prod_{n=1}^N \prod_{k=1}^K \left(\pi_k\,\mathcal{N}\left(x_n|\mu_k,C_k\right)\right)^{z_{n,k}}$

Now the expected value for each individual component $z_{n,k}$ is

$\mathbb{E}\left(z_{n,k}\right) = \frac{\pi_k\,\mathcal{N}\left(x_n|\mu_k,C_k\right)}{\sum_{j=1}^K \pi_j\,\mathcal{N}\left(x_n|\mu_j,C_j\right)} =: \gamma\left(z_{n,k}\right)$

(The expectation of a binary $0,1$ variable is just its probability, so the exponents drop out.)

Note that $\gamma(z_{n,k})$ can be interpreted as the responsibility of the $k$-th component for the $n$-th data point.
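The responsibilities can be computed directly from their definition. A NumPy sketch (the helper `gaussian_pdf` and both function names are ours):

```python
import numpy as np

def gaussian_pdf(X, mu, C):
    """Density of N(x | mu, C) evaluated at each row of X."""
    D = mu.shape[0]
    diff = X - mu
    # Quadratic form (x_n - mu)^T C^{-1} (x_n - mu) for every row n.
    quad = np.einsum('nd,de,ne->n', diff, np.linalg.inv(C), diff)
    norm = np.sqrt((2.0 * np.pi) ** D * np.linalg.det(C))
    return np.exp(-0.5 * quad) / norm

def responsibilities(X, pi, mus, Cs):
    """E step: gamma(z_nk) = pi_k N(x_n|mu_k,C_k) / sum_j pi_j N(x_n|mu_j,C_j)."""
    weighted = np.stack([pi[k] * gaussian_pdf(X, mus[k], Cs[k])
                         for k in range(len(pi))], axis=1)   # shape (N, K)
    return weighted / weighted.sum(axis=1, keepdims=True)
```

By construction each row of the result sums to one, i.e. every data point distributes a total responsibility of $1$ over the $K$ components.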

Accordingly the expectation of the complete data log likelihood can be written as

$\mathbb{E}_Z(\log(p(X,Z|\mu,C,\pi))) = \sum_{n=1}^N \sum_{k=1}^K \gamma\left(z_{n,k}\right)\left(\log\left(\mathcal{N}\left(x_n|\mu_k,C_k\right)\right)+\log\left(\pi_k\right)\right)$

which is finally the quantity to be maximized by the expectation maximization algorithm.

To this end we define

$N_k := \sum_{n=1}^N \gamma\left(z_{n,k}\right)$

and rewrite the maximum likelihood estimates as

$\mu_{k_{\text{ML}}} = \frac{\sum_{n=1}^N \gamma\left(z_{n,k}\right) x_n}{N_k}$

$C_{k_{\text{ML}}} = \frac{\sum_{n=1}^N \gamma\left(z_{n,k}\right)\left(x_n-\mu_{k_{\text{ML}}}\right)\left(x_n-\mu_{k_{\text{ML}}}\right)^{\mathsf{T}}}{N_k}$

For the maximization with respect to the mixing components $\pi_{\{k\}}$ we have to enforce the summation constraint

$\sum_{k=1}^K \pi_k = 1$

and introduce a corresponding Lagrange multiplier $\lambda$ to obtain

$\pi_{k_{\text{ML}}} = \frac{N_k}{N}$

2.4 Algorithm

Now the full algorithm reads as follows:

1. Initialize the parameters $\left(\mu_{\{k\}}^{(0)}, C_{\{k\}}^{(0)}, \pi_{\{k\}}^{(0)}\right)$ for iteration index $i = 0$.
2. E step: evaluate the responsibilities $\gamma^{(i)}\left(z_{n,k}\right)$ using the current parameters.
3. M step: compute the maximum likelihood estimates $\left(\mu_{\{k\}}^{(i+1)}, C_{\{k\}}^{(i+1)}, \pi_{\{k\}}^{(i+1)}\right)$ from the responsibilities.
4. Increment $i$ and continue with step 2 until convergence.
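Putting the pieces together, the four steps can be sketched as a compact NumPy loop. This is a minimal illustration, not a production implementation: the initialization scheme (means spread along the first coordinate, shared empirical covariance), the fixed iteration count, and all names are our choices.

```python
import numpy as np

def em_gmm(X, K, n_iter=100):
    """EM for a Gaussian mixture with full covariances, following steps 1-4."""
    N, D = X.shape
    # 1. Initialize: means spread over the data range, shared covariance, uniform weights.
    idx = np.argsort(X[:, 0])
    mus = X[idx[np.linspace(0, N - 1, K).astype(int)]].copy()
    Cs = np.tile(np.cov(X.T).reshape(D, D), (K, 1, 1))
    pi = np.full(K, 1.0 / K)
    for _ in range(n_iter):
        # 2. E step: responsibilities gamma(z_nk), computed in log space for stability.
        log_w = np.zeros((N, K))
        for k in range(K):
            diff = X - mus[k]
            _, logdet = np.linalg.slogdet(Cs[k])
            quad = np.einsum('nd,de,ne->n', diff, np.linalg.inv(Cs[k]), diff)
            log_w[:, k] = np.log(pi[k]) - 0.5 * (quad + logdet + D * np.log(2.0 * np.pi))
        log_w -= log_w.max(axis=1, keepdims=True)
        gamma = np.exp(log_w)
        gamma /= gamma.sum(axis=1, keepdims=True)
        # 3. M step: responsibility-weighted ML estimates.
        Nk = gamma.sum(axis=0)
        mus = (gamma.T @ X) / Nk[:, None]
        for k in range(K):
            diff = X - mus[k]
            Cs[k] = (gamma[:, k, None] * diff).T @ diff / Nk[k]
        pi = Nk / N
    return pi, mus, Cs, gamma
```

A production version would monitor the log likelihood for convergence instead of running a fixed number of iterations.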

Upon convergence the results can be visualized by color coding the responsibilities

([1], Figure 9.5c)

3. Generalization

As this derivation may appear rather heuristic, we proceed to show that the algorithm indeed maximizes the log likelihood function in general.

For any hidden variable model the probability of the evidence can be obtained as the marginal

$p(X|\theta) = \sum_Z p(X,Z|\theta)$ ([1], eqn. 9.69)

If we introduce an arbitrary distribution $q(z)$ on the hidden variables, the corresponding log likelihood function can be decomposed as

$\log(p(X|\theta)) = \mathcal{L}(q,\theta) + \text{KL}(q\,\|\,p)$

with

$\mathcal{L}(q,\theta) = \sum_Z q(z)\log\left(\frac{p(X,Z|\theta)}{q(z)}\right)$

$\text{KL}(q\,\|\,p) = -\sum_Z q(z)\log\left(\frac{p(Z|X,\theta)}{q(z)}\right)$

where $\text{KL}(q\,\|\,p)$ is the Kullback-Leibler divergence between $q(z)$ and the posterior $p(Z|X,\theta)$. As the Kullback-Leibler divergence is nonnegative, with equality to zero only when the two distributions coincide, $\mathcal{L}(q,\theta)$ gives a lower bound on the log likelihood $\log(p(X|\theta))$.
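The decomposition itself can be verified in one step: with $\text{KL}(q\,\|\,p) = -\sum_Z q(z)\log\left(p(Z|X,\theta)/q(z)\right)$, the product rule $p(X,Z|\theta) = p(Z|X,\theta)\,p(X|\theta)$ and the normalization $\sum_Z q(z) = 1$ give

$\mathcal{L}(q,\theta) + \text{KL}(q\,\|\,p) = \sum_Z q(z)\log\left(\frac{p(X,Z|\theta)}{p(Z|X,\theta)}\right) = \sum_Z q(z)\log(p(X|\theta)) = \log(p(X|\theta))$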

Now the E step sets $q(z)$ equal to the posterior distribution $p\left(Z|X,\theta^{(i)}\right)$.  Consequently the Kullback-Leibler divergence vanishes and the lower bound can be written as

$\mathcal{L}(q,\theta) = \sum_Z p\left(Z|X,\theta^{(i)}\right)\log(p(X,Z|\theta)) - \sum_Z p\left(Z|X,\theta^{(i)}\right)\log\left(p\left(Z|X,\theta^{(i)}\right)\right)$

Here the first term is exactly the expectation of the complete data log likelihood. The second term is the negative entropy of the $q$ distribution; as it does not depend on $\theta$, it can be neglected in the optimization with respect to $\theta$ in the M step. Note that the optimization parameters appear only in the argument of the $\log$ function, which is particularly useful if the joint distribution $p(X,Z|\theta)$ is a member of the exponential family.

Now the updated values $\theta^{(i+1)}$ imply a nonzero KL divergence and lead to the next E step.

([1], Figures 9.12 and 9.13)

Thus iterating both steps monotonically increases the log likelihood function towards a (local) maximum.

4. Discussion

While the expectation maximization algorithm is a highly versatile procedure, there are some caveats:

• As the log likelihood function is in general non-convex, the algorithm can get stuck in optima that are either only local or pathological. For the example of a Gaussian mixture that means one of the components can collapse onto a single point, so that $\mu_k = x_n$, which introduces a term
$\mathcal{N}\left(x_n|x_n,\sigma_k^2 I\right) = \frac{1}{\sqrt{2\pi}\,\sigma_k}$
so that the log likelihood goes to $\infty$ as $\sigma_k \to 0$. To avoid such cases one can use heuristics to detect the collapsing component and subsequently reset its mean and covariance randomly before continuing with the optimization.  To find the best local maximum it may be advisable to evaluate several runs of the algorithm, each initialized at random.
• A further issue in finding maximum likelihood solutions arises from the fact that for any given maximum likelihood solution, a $K$-component mixture will have a total of $K!$ equivalent solutions corresponding to the $K!$ ways of assigning $K$ sets of parameters to $K$ components. This problem is known as identifiability and becomes an issue when we wish to interpret the parameter values discovered by the model. [5]
• Depending on the distributions involved, a complete maximization in both the E and M step may become infeasible, introducing the need for approximations such as generalized, partial, and stochastic EM. [3]
• The maximum likelihood nature of the procedure makes it prone to overfitting.  Also, it may be desirable to include prior knowledge, which can be done by adopting a Bayesian approach leading to the Variational Bayes framework. [4]
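The collapse-detection heuristic mentioned in the first point can be sketched as follows. This is a minimal illustration only; the variance floor, eigenvalue criterion, reset policy, and all names are our assumptions, not prescribed by the references.

```python
import numpy as np

def reset_collapsed(mus, Cs, pi, X, var_floor=1e-6, rng=None):
    """Hypothetical heuristic: detect components whose covariance has (nearly)
    collapsed onto a single point and re-initialize them before continuing EM."""
    if rng is None:
        rng = np.random.default_rng()
    D = X.shape[1]
    for k in range(len(pi)):
        # A near-zero smallest eigenvalue signals a collapsing component.
        if np.linalg.eigvalsh(Cs[k]).min() < var_floor:
            mus[k] = X[rng.integers(len(X))]      # random data point as new mean
            Cs[k] = np.cov(X.T).reshape(D, D)     # reset to empirical covariance
            pi[k] = 1.0 / len(pi)
    pi /= pi.sum()                                # renormalize mixing weights
    return mus, Cs, pi
```

Such a check would run between the M step and the next E step of each iteration.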