Statistical learning typically starts with observed data and a (parametric) model. One possible way to estimate the model parameters is maximum likelihood, that is to pick the parameters that maximize the likelihood (probability) of the evidence.
Now while maximum likelihood leads to closed form solutions for many models (in particular for those belonging to the exponential family) and has (at least asymptotically) many desireable properties such as consistency and sufficiency there are models where a direct maximization of the (log) likelihood function is infeasible.
In practice we often encounter missing data - we don't have sufficient information to specify the complete likelihood. Another scenario in which not all variables in the model are observed are the so-called hidden or latent variable models. In this case there are variables which are essential for the model description but never observed. For example, the underlying physics of a model may contain latent processes which are essential to describe the model, but cannot be directly measured. Both cases can be tackled by introducing latent variable, that is by splitting observational variables into visible (those for which we actually know the state) and missing ones (those whose states would or should nominally be known). 
Expectation maximization is then an iterative approach to maximum likelihood parameter estimation in latent variable models.
In this article we will build up an understanding of the algorithm by considering the well known Gaussian Mixture Model before giving an generalized proof of convergence and closing with a critical discussion.
Our exposition is largely based on the books "Pattern Recognition and Machine Learning" by Christopher Bishop  and "Bayesian reasoning and machine learning" by David Barber . Other useful references are indicated in the section Further Reading.
2. GMM Training
2.1 ML Estimate
Let's first derive the maximum likelihood estimate for the parameters of a multivariate Gaussian distribution.
Assuming a set of independent identically distributed observations we need to maximize the likelihood function
with respect to the parameter vector .
As the (natural) logarithm is strictly monotone function on we can equivalently maximize the log likelihood function
which will simplify subsequent calculations for models belonging to the exponential familiy.
For our running example of a multivariate Gaussian that is
Taking the partial derivative with respect to and equating it to we obtain
To find we rewrite the log likelihood as
2.2 Known Allocations
Now let's assume that we have a Gaussian Mixture Model with known allocations as illustrated below for a three components two dimensional output example:
(, Figure 9.5a)
Introducing a -dimensional binary variable that has a of representation, the complete data log likelihood function is given by
We see that this is a sum of independent components, therefore the maximization with respect to the means and covariances decouples and we have
Note that both relations depend directly on the number of observations assigned to the -th component so that we have a sufficient statistic given as
2.3 Unknown Allocations
If the allocations are unknown we might have the following observations:
(, Figure 9.5b)
Assuming mixing probabilies
the likelihood of a single observation is
For IID observations the corresponding log likelihood function is
Thus the parameters to be found are
In a maximum likelihood approach we would again have to equate the derivatives to and solve for the respective parameters.
However it is easy to see that the corresponding equations are coupled which prevents an easy (closed form) solution.
As we don't have the values of the mixing proportions we use their expected value with respect to the posterior distribution that using Bayes law is
Now the expected value for each individual component is
(The expectation of a binary variable is just it's probability so the exponents drop out)
Note that can be interpreted as the responsibility of the -th component for the -th data point.
Accordingly the expectation of the complete data log likelihood can be written as
which is finally the quantity to be maximized by the expectation maximization algorithm.
To this end we define
and rewrite the maximum likelihood estimates as
For the maximization with respect to the mixing components we have to enforce the summation constraint
and introduce a corresponding Lagrange multiplier to obtain
Now the full alorithm reads as follows:
- Initialize the parameters for iteration index .
- Evaluate the responsibilities
- Compute the maximum likelihood estimates
- Continue with 2. until convergence.
Upon convergence the results can be visualized by color coding the responsibilities
(, Figure 9.5c)
As this derivation may appear rather heuristic we proceed to show that the algorithm indeed maximizes the log likelihood function in general.
For any hidden variable model the probability of the evidence can be obtained as the marginal
(, eqn. 9.69)
If we introduce an arbitrary distribution on the hidden variables as , the corresponding log likelihood function can be decomposed as
where is the Kullback Leibler divergence between and the posterior . As the Kullback Leibler is nonegative, with equality only for equal distributions gives a lower bound to the complete log likelihood.
Now the E step does exactly set equal to the posterior distribution . Consequently the Kullback-Leibler divergence vanishes and the lower bound can be written as
Here the first term is exactly the expectation of the complete data log likelihood. The second term is the negative entropy of the distribution and can be neglected for the optimization with respect to in the M step. Note that the optimization parameters appear only in the argument of the function which is particulary useful if the joint distribution is a member of the exponential familiy.
Now the updated values imply a non zero KL divergence and lead to the next E step.
(, Figures 9.12 and 9.13)
Thus iteration of both steps consistently increases the log likelihood function towards a (local) maximum.
While the expectation maximization algorithm is a highly versatile procedure there are some caveats
- As the complete data log likelihood function is in general non-convex the algorithm can get stuck in optima that are either only local or pathological. For the example of a Gaussian Mixture that means that one of the components can collapse into a single point so that which introduces a term
so that the complete log likelihood goes to as . To avoid such cases one can use heuristics to detect the collapsing component and subsequently reset its mean and covariance randomly before continuing with the optimization. To find the best local maximum it may be advisable to evaluate several runs of the algorithm each initialized at random.
- A further issue in finding maximum likelihood solutions arises from the fact that for any given maximum likelihood solution, a -component mixture will have a total of equivalent solutions corresponding to the ways of assigning sets of parameters to components. This problem is known as identifiability and becomes an issue when we whish to intrepret the parameter values discovered by the model. 
- Depending on the distributions involved a complete maximization in both the E and M step may become infeasible introducing the need for approximations such as generalized, partial and stochastic EM. 
- The maximum likelihood nature of the procedure makes it prone to overfitting. Also it may be desirable to include prior knowledge which can be done by adopting a Bayesian approach leading to the Variational Bayes Framework 
5. Further Reading
 Bishop, Christopher M. Pattern recognition and machine learning. Springer, 2006.
 Barber, David. Bayesian reasoning and machine learning. Cambridge University Press, 2014.
 McLachlan, Geoffrey, and Thriyambakam Krishnan. The EM algorithm and extensions. John Wiley & Sons, 2007.
 Beal, M.J. Variational Algorithms for Approximate Bayesian Inference PhD. Thesis, Gatsby Computational Neuroscience Unit, University College London, 2003.
 Casella, George, and Roger L. Berger. Statistical inference. Duxbury, 2002.