1. Introduction

Statistical learning typically starts with observed data and a (parametric) model. One possible way to estimate the model parameters is maximum likelihood, that is, to pick the parameters that maximize the likelihood (probability) of the evidence.

Now, while maximum likelihood leads to closed-form solutions for many models (in particular for those belonging to the exponential family) and has, at least asymptotically, many desirable properties such as consistency and sufficiency, there are models where a direct maximization of the (log) likelihood function is infeasible.

In practice we often encounter missing data, i.e. we don't have sufficient information to specify the complete likelihood. Another scenario in which not all variables in the model are observed is that of so-called hidden or latent variable models. In this case there are variables which are essential for the model description but are never observed. For example, the underlying physics of a system may involve latent processes which are essential to describe the model but cannot be directly measured. Both cases can be tackled by introducing latent variables, that is by splitting the variables into visible ones (those for which we actually know the state) and missing ones (those whose states would nominally be known but are not observed). [2]

Expectation maximization (EM) is then an iterative approach to maximum likelihood parameter estimation in latent variable models.

In this article we will build up an understanding of the algorithm by considering the well-known Gaussian Mixture Model, before giving a generalized proof of convergence and closing with a critical discussion.

Our exposition is largely based on the books "Pattern Recognition and Machine Learning" by Christopher Bishop [1] and "Bayesian Reasoning and Machine Learning" by David Barber [2]. Other useful references are indicated in the section Further Reading.

2. GMM Training

2.1 ML Estimate

Let's first derive the maximum likelihood estimate for the parameters $\mu$ and $\Sigma$ of a multivariate Gaussian distribution.

Assuming a set of independent, identically distributed observations $X = \{x_1, \dots, x_N\}$, $x_n \in \mathbb{R}^D$, we need to maximize the likelihood function

$$p(X \mid \mu, \Sigma) = \prod_{n=1}^{N} \mathcal{N}(x_n \mid \mu, \Sigma)$$

with respect to the parameter vector $\theta = (\mu, \Sigma)$.

As the (natural) logarithm is a strictly monotone function on $(0, \infty)$, we can equivalently maximize the log likelihood function

$$\ln p(X \mid \mu, \Sigma) = \sum_{n=1}^{N} \ln \mathcal{N}(x_n \mid \mu, \Sigma),$$

which will simplify subsequent calculations for models belonging to the exponential family.

For our running example of a multivariate Gaussian that is

$$\ln p(X \mid \mu, \Sigma) = -\frac{ND}{2}\ln(2\pi) - \frac{N}{2}\ln|\Sigma| - \frac{1}{2}\sum_{n=1}^{N}(x_n - \mu)^\top \Sigma^{-1}(x_n - \mu).$$

Taking the partial derivative with respect to $\mu$ and equating it to $0$ we obtain

$$\mu_{\mathrm{ML}} = \frac{1}{N}\sum_{n=1}^{N} x_n.$$

To find $\Sigma_{\mathrm{ML}}$ we rewrite the log likelihood (up to terms independent of $\Sigma$) as

$$\ln p(X \mid \mu_{\mathrm{ML}}, \Sigma) = -\frac{N}{2}\ln|\Sigma| - \frac{N}{2}\operatorname{Tr}\!\left(\Sigma^{-1} S\right) + \mathrm{const}, \qquad S = \frac{1}{N}\sum_{n=1}^{N}(x_n - \mu_{\mathrm{ML}})(x_n - \mu_{\mathrm{ML}})^\top,$$

and obtain

$$\Sigma_{\mathrm{ML}} = S = \frac{1}{N}\sum_{n=1}^{N}(x_n - \mu_{\mathrm{ML}})(x_n - \mu_{\mathrm{ML}})^\top.$$
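These closed-form estimates translate directly into code; the following is a minimal NumPy sketch (the function name fit_gaussian_ml is our own, not from any library):

    import numpy as np

    def fit_gaussian_ml(X):
        """Closed-form ML estimates for a multivariate Gaussian.

        X: array of shape (N, D) holding N i.i.d. observations.
        Returns mu_ML of shape (D,) and Sigma_ML of shape (D, D).
        """
        mu = X.mean(axis=0)                    # mu_ML = (1/N) sum_n x_n
        diff = X - mu
        sigma = diff.T @ diff / X.shape[0]     # Sigma_ML = (1/N) sum_n (x_n - mu)(x_n - mu)^T
        return mu, sigma

Note that the returned covariance uses the ML normalization $1/N$ rather than the unbiased $1/(N-1)$.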

2.2 Known Allocations

Now let's assume that we have a Gaussian Mixture Model with known allocations, as illustrated below for an example with three components and two-dimensional outputs:

([1], Figure 9.5a)

Introducing a $K$-dimensional binary variable $z_n$ that has a 1-of-$K$ representation (exactly one element $z_{nk}$ equals one and all others are zero), the complete data log likelihood function is given by

$$\ln p(X, Z \mid \mu, \Sigma, \pi) = \sum_{n=1}^{N}\sum_{k=1}^{K} z_{nk}\left\{\ln \pi_k + \ln \mathcal{N}(x_n \mid \mu_k, \Sigma_k)\right\}.$$

We see that this is a sum of $K$ independent contributions, one per mixture component; therefore the maximization with respect to the means and covariances decouples and we have

$$\mu_k = \frac{1}{N_k}\sum_{n=1}^{N} z_{nk}\, x_n$$

and

$$\Sigma_k = \frac{1}{N_k}\sum_{n=1}^{N} z_{nk}\,(x_n - \mu_k)(x_n - \mu_k)^\top.$$

Note that both relations depend directly on the number of observations assigned to the $k$-th component, so that we have a sufficient statistic given as

$$N_k = \sum_{n=1}^{N} z_{nk}.$$
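Because the allocations are observed here, these estimates are just per-component versions of the previous ones. A minimal NumPy sketch (the function name and the one-hot matrix Z are our own conventions for illustration):

    import numpy as np

    def fit_gmm_known_allocations(X, Z):
        """Per-component ML estimates for a GMM with known allocations.

        X: observations of shape (N, D).
        Z: one-hot allocation matrix of shape (N, K) with Z[n, k] = z_nk.
        Returns the component means (K, D), covariances (K, D, D) and counts N_k (K,).
        """
        K = Z.shape[1]
        Nk = Z.sum(axis=0)                           # sufficient statistic N_k = sum_n z_nk
        mu = (Z.T @ X) / Nk[:, None]                 # mu_k = (1/N_k) sum_n z_nk x_n
        sigma = np.empty((K, X.shape[1], X.shape[1]))
        for k in range(K):
            diff = X - mu[k]
            # Sigma_k = (1/N_k) sum_n z_nk (x_n - mu_k)(x_n - mu_k)^T
            sigma[k] = (Z[:, k, None] * diff).T @ diff / Nk[k]
        return mu, sigma, Nk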

2.3 Unknown Allocations

If the allocations are unknown we might have the following observations:

([1], Figure 9.5b)

Assuming mixing probabilities $\pi_k$ with

$$0 \le \pi_k \le 1, \qquad \sum_{k=1}^{K} \pi_k = 1,$$

the likelihood of a single observation is

$$p(x_n \mid \mu, \Sigma, \pi) = \sum_{k=1}^{K} \pi_k\, \mathcal{N}(x_n \mid \mu_k, \Sigma_k).$$

For $N$ IID observations the corresponding log likelihood function is

$$\ln p(X \mid \mu, \Sigma, \pi) = \sum_{n=1}^{N} \ln\left\{\sum_{k=1}^{K} \pi_k\, \mathcal{N}(x_n \mid \mu_k, \Sigma_k)\right\}.$$

Thus the parameters to be found are $\theta = \{\pi_k, \mu_k, \Sigma_k\}_{k=1}^{K}$.

In a maximum likelihood approach we would again have to equate the derivatives of the log likelihood with respect to $\mu_k$, $\Sigma_k$ and $\pi_k$ to $0$ and solve for the respective parameters.

However, it is easy to see that the corresponding equations are coupled, which prevents a simple closed-form solution.
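To make the coupling explicit, a short side calculation shows that setting the derivative of the log likelihood with respect to $\mu_k$ to zero gives

$$\sum_{n=1}^{N} \frac{\pi_k\, \mathcal{N}(x_n \mid \mu_k, \Sigma_k)}{\sum_{j=1}^{K} \pi_j\, \mathcal{N}(x_n \mid \mu_j, \Sigma_j)}\; \Sigma_k^{-1}\left(x_n - \mu_k\right) = 0,$$

so the condition for $\mu_k$ involves weights (the responsibilities introduced below) that themselves depend on all means, covariances and mixing proportions.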

As we don't observe the values of the allocation variables $z_{nk}$, we use their expected value with respect to the posterior distribution, which using Bayes' law is

$$p(Z \mid X, \mu, \Sigma, \pi) \propto \prod_{n=1}^{N}\prod_{k=1}^{K}\left[\pi_k\, \mathcal{N}(x_n \mid \mu_k, \Sigma_k)\right]^{z_{nk}}.$$

Now the expected value for each individual component $z_{nk}$ is

$$\gamma(z_{nk}) \equiv \mathbb{E}[z_{nk}] = p(z_{nk} = 1 \mid x_n) = \frac{\pi_k\, \mathcal{N}(x_n \mid \mu_k, \Sigma_k)}{\sum_{j=1}^{K} \pi_j\, \mathcal{N}(x_n \mid \mu_j, \Sigma_j)}.$$

(The expectation of a binary variable is just its probability, so the exponents $z_{nk}$ drop out.)

Note that $\gamma(z_{nk})$ can be interpreted as the responsibility of the $k$-th component for the $n$-th data point.

Accordingly, the expectation of the complete data log likelihood can be written as

$$\mathbb{E}_Z\!\left[\ln p(X, Z \mid \mu, \Sigma, \pi)\right] = \sum_{n=1}^{N}\sum_{k=1}^{K} \gamma(z_{nk})\left\{\ln \pi_k + \ln \mathcal{N}(x_n \mid \mu_k, \Sigma_k)\right\},$$

which is finally the quantity to be maximized by the expectation maximization algorithm.

 

To this end we define

$$N_k = \sum_{n=1}^{N} \gamma(z_{nk})$$

and rewrite the maximum likelihood estimates as

$$\mu_k^{\text{new}} = \frac{1}{N_k}\sum_{n=1}^{N}\gamma(z_{nk})\, x_n, \qquad \Sigma_k^{\text{new}} = \frac{1}{N_k}\sum_{n=1}^{N}\gamma(z_{nk})\,(x_n - \mu_k^{\text{new}})(x_n - \mu_k^{\text{new}})^\top.$$

For the maximization with respect to the mixing proportions $\pi_k$ we have to enforce the summation constraint

$$\sum_{k=1}^{K}\pi_k = 1$$

and introduce a corresponding Lagrange multiplier $\lambda$ to obtain

$$\pi_k^{\text{new}} = \frac{N_k}{N}.$$
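For completeness, the last result follows from a short calculation not spelled out above: setting the derivative of

$$\sum_{n=1}^{N}\sum_{k=1}^{K}\gamma(z_{nk})\ln \pi_k \;+\; \lambda\left(\sum_{k=1}^{K}\pi_k - 1\right)$$

with respect to $\pi_k$ to zero gives $N_k/\pi_k + \lambda = 0$, i.e. $\pi_k = -N_k/\lambda$; summing over $k$ and using $\sum_k \pi_k = 1$ together with $\sum_k N_k = N$ yields $\lambda = -N$ and hence $\pi_k^{\text{new}} = N_k/N$.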

2.4 Algorithm

Now the full algorithm reads as follows:

  1. Initialize the parameters $\theta^{(0)} = \{\pi_k^{(0)}, \mu_k^{(0)}, \Sigma_k^{(0)}\}$ and set the iteration index $t = 0$.
  2. E step: evaluate the responsibilities $\gamma(z_{nk})$ using the current parameters $\theta^{(t)}$.
  3. M step: compute the maximum likelihood estimates $\pi_k^{\text{new}}$, $\mu_k^{\text{new}}$, $\Sigma_k^{\text{new}}$ from these responsibilities, set $\theta^{(t+1)}$ accordingly and increment $t$.
  4. Continue with 2. until convergence, e.g. until the change in the log likelihood (or in the parameters) falls below a threshold.
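A compact, self-contained sketch of this loop in Python (NumPy/SciPy) is given below; the function name fit_gmm_em, the naive initialization and the convergence tolerance are our own choices, not prescribed by [1]:

    import numpy as np
    from scipy.stats import multivariate_normal

    def fit_gmm_em(X, K, max_iter=200, tol=1e-6, seed=0):
        """EM for a Gaussian Mixture Model.

        X: observations of shape (N, D), K: number of components.
        Returns means, covariances, mixing proportions, responsibilities and
        the sequence of log likelihood values.
        """
        rng = np.random.default_rng(seed)
        N, D = X.shape
        # 1. Initialization: random data points as means, identity covariances, uniform mixing proportions.
        mu = X[rng.choice(N, size=K, replace=False)]
        sigma = np.stack([np.eye(D) for _ in range(K)])
        pi = np.full(K, 1.0 / K)
        log_likelihoods = []
        for _ in range(max_iter):
            # 2. E step: responsibilities gamma(z_nk) = pi_k N(x_n|mu_k, Sigma_k) / sum_j pi_j N(x_n|mu_j, Sigma_j)
            dens = np.column_stack([pi[k] * multivariate_normal.pdf(X, mean=mu[k], cov=sigma[k])
                                    for k in range(K)])           # shape (N, K)
            gamma = dens / dens.sum(axis=1, keepdims=True)
            # 3. M step: re-estimate the parameters from the responsibilities.
            Nk = gamma.sum(axis=0)                                 # N_k = sum_n gamma(z_nk)
            mu = (gamma.T @ X) / Nk[:, None]
            for k in range(K):
                diff = X - mu[k]
                sigma[k] = (gamma[:, k, None] * diff).T @ diff / Nk[k]
            pi = Nk / N
            # 4. Convergence check on the log likelihood of the parameters used in the E step above.
            log_likelihoods.append(float(np.log(dens.sum(axis=1)).sum()))
            if len(log_likelihoods) > 1 and abs(log_likelihoods[-1] - log_likelihoods[-2]) < tol:
                break
        return mu, sigma, pi, gamma, log_likelihoods

The list of log likelihood values returned by such a sketch should be non-decreasing up to numerical precision, which is exactly the property established in the next section. For numerical robustness on real data one would work with log densities (log-sum-exp), which is omitted here for brevity.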

Upon convergence the results can be visualized by color-coding the responsibilities:

([1], Figure 9.5c)

3. Generalization

As this derivation may appear rather heuristic, we proceed to show that the algorithm indeed maximizes the log likelihood function in general.

For any hidden variable model the probability of the evidence can be obtained as the marginal

$$p(X \mid \theta) = \sum_{Z} p(X, Z \mid \theta).$$

([1], eqn. 9.69)

If we introduce an arbitrary distribution on the hidden variables as $q(Z)$, the corresponding log likelihood function can be decomposed as

$$\ln p(X \mid \theta) = \mathcal{L}(q, \theta) + \mathrm{KL}(q \,\|\, p)$$

with

$$\mathcal{L}(q, \theta) = \sum_{Z} q(Z)\,\ln\frac{p(X, Z \mid \theta)}{q(Z)}, \qquad \mathrm{KL}(q \,\|\, p) = -\sum_{Z} q(Z)\,\ln\frac{p(Z \mid X, \theta)}{q(Z)},$$

where $\mathrm{KL}(q \,\|\, p)$ is the Kullback-Leibler divergence between $q(Z)$ and the posterior $p(Z \mid X, \theta)$. As the Kullback-Leibler divergence is nonnegative, with equality only for equal distributions, $\mathcal{L}(q, \theta)$ gives a lower bound on the log likelihood $\ln p(X \mid \theta)$.

Now the E step sets $q(Z)$ equal to the posterior distribution $p(Z \mid X, \theta^{\text{old}})$ evaluated at the current parameters $\theta^{\text{old}}$. Consequently the Kullback-Leibler divergence vanishes and the lower bound can be written as

$$\mathcal{L}(q, \theta) = \sum_{Z} p(Z \mid X, \theta^{\text{old}})\,\ln p(X, Z \mid \theta) \;-\; \sum_{Z} p(Z \mid X, \theta^{\text{old}})\,\ln p(Z \mid X, \theta^{\text{old}}).$$

Here the first term is exactly the expectation of the complete data log likelihood. The second term is the entropy of the fixed distribution $q$; it does not depend on $\theta$ and can therefore be neglected for the optimization with respect to $\theta$ in the M step. Note that the optimization parameters appear only inside the argument of the logarithm, which is particularly useful if the joint distribution $p(X, Z \mid \theta)$ is a member of the exponential family.

Now the updated parameter values $\theta^{\text{new}}$ imply a nonzero KL divergence, since in general $p(Z \mid X, \theta^{\text{new}}) \neq q(Z) = p(Z \mid X, \theta^{\text{old}})$, and this leads to the next E step.
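Combining the two steps, with $q(Z) = p(Z \mid X, \theta^{\text{old}})$ fixed by the E step and $\theta^{\text{new}}$ the maximizer found in the M step, we obtain the chain

$$\ln p(X \mid \theta^{\text{new}}) \;=\; \mathcal{L}(q, \theta^{\text{new}}) + \mathrm{KL}\!\left(q \,\|\, p(Z \mid X, \theta^{\text{new}})\right) \;\ge\; \mathcal{L}(q, \theta^{\text{new}}) \;\ge\; \mathcal{L}(q, \theta^{\text{old}}) \;=\; \ln p(X \mid \theta^{\text{old}}),$$

where the first inequality uses the nonnegativity of the KL divergence, the second the fact that the M step maximizes $\mathcal{L}$ over $\theta$, and the final equality holds because the E step made the KL divergence vanish at $\theta^{\text{old}}$.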

([1], Figures 9.12 and 9.13)

 

Thus iterating both steps never decreases the log likelihood function and drives it towards a (local) maximum.

 

4. Discussion

While the expectation maximization algorithm is a highly versatile procedure, there are some caveats:

  • As the likelihood function is in general non-convex, the algorithm can get stuck in optima that are either only local or pathological. For the example of a Gaussian mixture this means that one of the components can collapse onto a single data point, i.e. $\mu_j = x_n$ and $\Sigma_j = \sigma_j^2 I$ with $\sigma_j \to 0$, which introduces a term

    $$\mathcal{N}(x_n \mid x_n, \sigma_j^2 I) = \frac{1}{(2\pi\sigma_j^2)^{D/2}}$$

    so that the log likelihood goes to $+\infty$ as $\sigma_j \to 0$. To avoid such cases one can use heuristics to detect the collapsing component and subsequently reset its mean and covariance randomly before continuing with the optimization (a minimal version of such a check is sketched after this list). To find the best local maximum it may be advisable to evaluate several runs of the algorithm, each initialized at random.
  • A further issue in finding maximum likelihood solutions arises from the fact that for any given maximum likelihood solution, a $K$-component mixture will have a total of $K!$ equivalent solutions corresponding to the $K!$ ways of assigning $K$ sets of parameters to $K$ components. This problem is known as identifiability and becomes an issue when we wish to interpret the parameter values discovered by the model. [5]
  • Depending on the distributions involved, an exact E step or a complete maximization in the M step may become infeasible, introducing the need for approximations such as generalized, partial and stochastic EM. [3]
  • The maximum likelihood nature of the procedure makes it prone to overfitting. Also, it may be desirable to include prior knowledge, which can be done by adopting a Bayesian approach leading to the variational Bayes framework. [4]
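As an illustration of the reset heuristic mentioned in the first caveat, here is one possible (entirely ad hoc) check that could be run after each M step of the fit_gmm_em sketch above; the threshold and the re-initialization strategy are arbitrary choices:

    import numpy as np

    def reset_collapsed_components(X, mu, sigma, min_var=1e-6, seed=0):
        """Ad hoc heuristic: if a component's covariance has (nearly) collapsed onto a
        single point, re-initialize its mean at a random data point and its covariance
        at the identity. X: (N, D) data, mu: (K, D) means, sigma: (K, D, D) covariances."""
        rng = np.random.default_rng(seed)
        N, D = X.shape
        for k in range(mu.shape[0]):
            # a (near-)zero smallest eigenvalue signals a collapsing component
            if np.linalg.eigvalsh(sigma[k]).min() < min_var:
                mu[k] = X[rng.integers(N)]
                sigma[k] = np.eye(D)
        return mu, sigma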

5. Further Reading

[1] Bishop, Christopher M. Pattern Recognition and Machine Learning. Springer, 2006.

[2] Barber, David. Bayesian Reasoning and Machine Learning. Cambridge University Press, 2014.

[3] McLachlan, Geoffrey, and Thriyambakam Krishnan. The EM Algorithm and Extensions. John Wiley & Sons, 2007.

[4] Beal, M. J. Variational Algorithms for Approximate Bayesian Inference. PhD thesis, Gatsby Computational Neuroscience Unit, University College London, 2003.

[5] Casella, George, and Roger L. Berger. Statistical Inference. Duxbury, 2002.

