Statistical learning typically starts with observed data and a (parametric) model. One possible way to estimate the model parameters is maximum likelihood, that is to pick the parameters that maximize the likelihood (probability) of the evidence.

Now while maximum likelihood leads to closed form solutions for many models (in particular for those belonging to the exponential family) and has (at least asymptotically) many desireable properties such as consistency and sufficiency there are models where a direct maximization of the (log) likelihood function is infeasible.

In practice we often encounter missing data - we don't have sufficient information to specify the complete likelihood. Another scenario in which not all variables in the model are observed are the so-called hidden or latent variable models. In this case there are variables which are essential for the model description but never observed. For example, the underlying physics of a model may contain latent processes which are essential to describe the model, but cannot be directly measured. Both cases can be tackled by introducing latent variable, that is by splitting observational variables into visible (those for which we actually know the state) and missing ones (those whose states would or should nominally be known). [2]

Expectation maximization is then an iterative approach to maximum likelihood parameter estimation in latent variable models.

In this article we will build up an understanding of the algorithm by considering the well known Gaussian Mixture Model before giving an generalized proof of convergence and closing with a critical discussion.

Our exposition is largely based on the books "Pattern Recognition and Machine Learning" by Christopher Bishop [1] and "Bayesian reasoning and machine learning" by David Barber [2]. Other useful references are indicated in the section Further Reading.

2. GMM Training

2.1 ML Estimate

Let's first derive the maximum likelihood estimate for the parameters of a multivariate Gaussian distribution.

Assuming a set of independent identically distributed observations we need to maximize the likelihood function

with respect to the parameter vector .

As the (natural) logarithm is strictly monotone function on we can equivalently maximize the log likelihood function

which will simplify subsequent calculations for models belonging to the exponential familiy.

For our running example of a multivariate Gaussian that is

Taking the partial derivative with respect to and equating it to we obtain

To find we rewrite the log likelihood as

and obtain

2.2 Known Allocations

Now let's assume that we have a Gaussian Mixture Model with known allocations as illustrated below for a three components two dimensional output example:

([1], Figure 9.5a)

Introducing a -dimensional binary variable that has a of representation, the complete data log likelihood function is given by

We see that this is a sum of independent components, therefore the maximization with respect to the means and covariances decouples and we have

and

Note that both relations depend directly on the number of observations assigned to the -th component so that we have a sufficient statistic given as

2.3 Unknown Allocations

If the allocations are unknown we might have the following observations:

([1], Figure 9.5b)

Assuming mixing probabilies

the likelihood of a single observation is

For IID observations the corresponding log likelihood function is

Thus the parameters to be found are

In a maximum likelihood approach we would again have to equate the derivatives to and solve for the respective parameters.

However it is easy to see that the corresponding equations are coupled which prevents an easy (closed form) solution.

As we don't have the values of the mixing proportions we use their expected value with respect to the posterior distribution that using Bayes law is

Now the expected value for each individual component is

(The expectation of a binary variable is just it's probability so the exponents drop out)

Note that can be interpreted as the responsibility of the -th component for the -th data point.

Accordingly the expectation of the complete data log likelihood can be written as

which is finally the quantity to be maximized by the expectation maximization algorithm.

To this end we define

and rewrite the maximum likelihood estimates as

For the maximization with respect to the mixing components we have to enforce the summation constraint

and introduce a corresponding Lagrange multiplier to obtain

2.4 Algorithm

Now the full alorithm reads as follows:

Initialize the parameters for iteration index .

Evaluate the responsibilities

Compute the maximum likelihood estimates

Continue with 2. until convergence.

Upon convergence the results can be visualized by color coding the responsibilities

([1], Figure 9.5c)

3. Generalization

As this derivation may appear rather heuristic we proceed to show that the algorithm indeed maximizes the log likelihood function in general.

For any hidden variable model the probability of the evidence can be obtained as the marginal

([1], eqn. 9.69)

If we introduce an arbitrary distribution on the hidden variables as , the corresponding log likelihood function can be decomposed as

with

where is the Kullback Leibler divergence between and the posterior . As the Kullback Leibler is nonegative, with equality only for equal distributions gives a lower bound to the complete log likelihood.

Now the E step does exactly set equal to the posterior distribution . Consequently the Kullback-Leibler divergence vanishes and the lower bound can be written as

Here the first term is exactly the expectation of the complete data log likelihood. The second term is the negative entropy of the distribution and can be neglected for the optimization with respect to in the M step. Note that the optimization parameters appear only in the argument of the function which is particulary useful if the joint distribution is a member of the exponential familiy.

Now the updated values imply a non zero KL divergence and lead to the next E step.

([1], Figures 9.12 and 9.13)

Thus iteration of both steps consistently increases the log likelihood function towards a (local) maximum.

4. Discussion

While the expectation maximization algorithm is a highly versatile procedure there are some caveats

As the complete data log likelihood function is in general non-convex the algorithm can get stuck in optima that are either only local or pathological. For the example of a Gaussian Mixture that means that one of the components can collapse into a single point so that which introduces a term

so that the complete log likelihood goes to as . To avoid such cases one can use heuristics to detect the collapsing component and subsequently reset its mean and covariance randomly before continuing with the optimization. To find the best local maximum it may be advisable to evaluate several runs of the algorithm each initialized at random.

A further issue in finding maximum likelihood solutions arises from the fact that for any given maximum likelihood solution, a -component mixture will have a total of equivalent solutions corresponding to the ways of assigning sets of parameters to components. This problem is known as identifiability and becomes an issue when we whish to intrepret the parameter values discovered by the model. [5]

Depending on the distributions involved a complete maximization in both the E and M step may become infeasible introducing the need for approximations such as generalized, partial and stochastic EM. [3]

The maximum likelihood nature of the procedure makes it prone to overfitting. Also it may be desirable to include prior knowledge which can be done by adopting a Bayesian approach leading to the Variational Bayes Framework [4]

5. Further Reading

[1] Bishop, Christopher M. Pattern recognition and machine learning. Springer, 2006.

[2] Barber, David. Bayesian reasoning and machine learning. Cambridge University Press, 2014.

[3] McLachlan, Geoffrey, and Thriyambakam Krishnan. The EM algorithm and extensions. John Wiley & Sons, 2007.

[4] Beal, M.J. Variational Algorithms for Approximate Bayesian Inference PhD. Thesis, Gatsby Computational Neuroscience Unit, University College London, 2003.

[5] Casella, George, and Roger L. Berger. Statistical inference. Duxbury, 2002.

A stochastic process is defined as a sequence of random variables . The random variable is called state of the process at time . We assume a discrete time process here, i.e. the state is observed only at these time instances. The simplest example of such a process is a sequence of independent observations:

Sequence of independent observations [1]

Assuming all observations as independent would however mean loosing all time dependent information. To overcome this, we regard a process where the current state depends on the states observed before. To fully describe such a process, it is necessary to define the probability distribution for the initial state as well as for all following states the conditional distribution .

Dependent observations [1] (altered)

As we can see from the figure, with a growing number of states we quickly get confusingly many dependencies. A model where a state depends on all the states visited before is therefore not feasible in a practical system.

2. Markov models

One special type of a stochastic process is the so called Markov chain. It has the convenient property that the conditional probability of the current state depends only on the previous state:

As we can see, the model stays simple even when the number of states is increased.

Markov model [1]

Let's assume we want to predict the weather with a Markov model. Our model should have two states: "rainy" and "sunny". It is known that after a rainy day the next day will be sunny with a probability of 30%. After a sunshine day, the next day will bring rain with a 40% chance. The state diagram for this example is shown below.

Based on the weather of the current day, we can now make a prediction of the weather on the following day.

3. Hidden markov models

In the previous example, the weather and the state respectively can be observed directly. For many applications however, this is not the case. We thus extend our model to a so called Hidden markov model (HMM). The model consists of an underlying Markov process with states which cannot be observed directly. What can be observed are the emissions . The emission is a probabilistic function of the state . One important assumption inherent with Hidden markov models is that the current state contains all information about previous observations.

Hidden markov model [1]

Let's go back to our weather example from before and extend it to an HMM. There is a prisoner locked into a dungeon without windows. He cannot see if it's a rainy or a sunny day. The only thing that he can observe is whether the shoes of the guards are clean or dirty. He knows that if it's raining, the guards' shoes are dirty in 90% of the cases, but when it's sunny they are dirty only with a 60% chance. The state space diagram now looks as follows:

The prisoner can exploit this knowledge and make predictions about the weather outside, just by looking at the guards' shoes. Leaving the prison again and entering the outside world, one can find many applications for HMMs. They are often used as a model for temporal pattern recognition applications. In this context they are frequently applied for speech recognition tasks.

References

[1] Bishop, Christopher M. Pattern recognition and machine learning. Vol. 4. No. 4. New York: springer, 2006.

[2] DeGroot, Morris H., et al. Probability and statistics. Vol. 2. Reading, MA: Addison-Wesley, 1986.

[3] Rabiner, Lawrence. "A tutorial on hidden Markov models and selected applications in speech recognition." Proceedings of the IEEE 77.2 (1989): 257-286.

Given a Hidden Markov Model and a sequence of observations, the Viterbi Algorithm determines the most probable sequence of hidden states very efficiently. For instance in speech recognition, it is used in the language model to find the most probable word out of a phonemes sequence.

1 Motivation

In many applications of Hidden Markov Models, it is often of interest to find the most probable state sequence if an observation sequence is given. A naive approach would be to compute the probability of each combination of states and then to choose the most probable state combination. However, an increase of the number of states comes along with an increase of the number of combinations. This again results in an increase of the computational costs in an exponential manner. Consequently in real life, it is only possible to determine the most probable state sequence for small . In contrast to the naive approach, the Viterbi-Algorithm is also suitable for greater since its computational costs grow only linearly. The basic idea of this algorithm is to find the most probable state sequence iteratively. For each state at time , the most probable path is searched depending only on the most probable path to state at time and the transition probability from state to state .

2 Problem Definition

Before providing you more information about the concept of the Viterbi Algorithm, the general problem is outlined in this section. Let us assume that a Hidden Markov Model is given. Then let be the number of states in the model and the number of discrete observations . At time , the observation can be made and it is assumed that the Hidden Markov Model is in state . Furthermore, is defined as the transition probability from state to state

and is defined as the observation probability distribution

and is the initial state distribution

The goal of the Viterbi Algorithm is to find the optimal state sequence such that its probability is maximal given an observation sequence . Recall that is defined as the state at time t.

3 Basic Idea

The main idea of the Viterbi Algorithm is that at every time and state the most probable state is determined in dependence of the most probable state sequence at time and the transition probability . For simplicity, let's define the variable , which gives the probability of the optimal state sequence from to state at time . Mathematically, is defined as:

where and stands for state at time . The advantage of this definition is that can be iteratively calculated by

The probability of the desired optimal state sequence is equal to . However, cannot give you any information about the optimal state sequence. Hence, a new variable is introduced, which should deliver the most probable following state at time for a given state at time :

Finally, the optimal state sequence can be determined by backtracking:

4 The Algorithm

Summarizing the ideas of the last section, the Viterbi Algorithm for finding the optimal state sequence reads:

Initialization:

Recursion: for

Termination:

The input of the algorithm is a given Hidden Markov Model with its parameters and an observation sequence. The output is the most probable state sequence .

5 A Small Example

The goal of the following example is to determine the most probable word given the phonemes sequence . For this, a Hidden Markov Model with three states (, , ) is used. It is given in the figure below. For initialization, it is assumed that the three states are initially equally probable, e.g. .

5.1 Initialization

For initialization, , and are determined first. Hereby, it is assumed that observation occurs.

5.2 Recursion

Now , and are determined iteratively for .

time step t = 2:

time step t=3:

time step t=4:

5.3 Termination

By backtracking, the most probable state sequence can be determined:

Hence, the most probable state sequence is which corresponds to the word "Anne".

6 References

[1] Bishop, C. (2006). Pattern Recognition and Machine Learning.

[2] Duda, R. & Hart, P. & Strok, D., Pattern Classification.

[3] Lee, D. (2014). Lecture notes of the lecture Machine Learning in Robotics

The basis for Gaussian Mixture Models is the Normal or Gaussian distribution with the probability density function (pdf)

with mean and variance . Plotting this one dimensional function with and leads to the famous Gauss curve

2. Multivariate normal distribution

This distribution can be extended to a multidimensional case. Let's assume a n-dimensional random variable with Gaussian distribution .

The pdf is now given by

where is the mean vector and the covariance matrix.

We now regard a two dimensional Gaussian distribution with

.

The diagonal elements describe the variance of one Gaussian random variable, the off-diagonal elements the covariance between the random variables. These are zero here, hence the two random variables are independent. The distribution or more precisely the pdf of the distribution, can be visualized in different ways. The first plot shows how the 2-d distribution is formed by two 1-d marginal distributions. The same distribution is again shown in the second plot, where the floor is given the two dimensions of the distribution and the third dimension depicts the probability. Instead of plotting the 3-d shape of the Gauss bell, one can also "cut" several times through the bell and plot the contour lines. The resulting ellipses represent lines of equal probability.

3. Gaussian mixture models

However the multivariate Gaussian distribution still has clear limitations when it comes to modelling real data, which does not necessarily follow the Gauss curve. A much more flexible approach is to use a superposition of several Gaussian distributions, each weighted by a coefficient . This is called a mixture model and the Gaussian densities are called components. Each component has its own mean and covariance . The pdf for components is given by

.

We also note that the sum of the mixing coefficients adds up to one.

By increasing the number of components and adjusting the parameters of each component, almost any set of data can be represented by a Gaussian mixture model (GMM) with high accuracy.

4. Fitting with sample data

To illustrate the advantage of GMMs over multivariate Gaussians, peakly distributed data with two dimensions has been generated (first plot). Now we want to find a model for this bimodal data set. The second figure shows a multivariate Gaussian distribution fitted to the data. It is obvious that the distribution does not well represent the original data.

Now we try the same with a Gaussian mixture model. Therefore the data has to be split into clusters first. Each cluster represents one component of the mixture model. Clustering can be done with the k-means Algorithm. After the data has been separated into clusters, each cluster is represented by a multivariate Gaussian distribution. The distribution is fitted to the data using a technique called Expectation-Maximization (EM).

Our sample data splits into two clusters. The last figure shows a GMM fitted to the data. As we can see, in contrast to simple multivariate Gaussians, the Gaussian mixture model represents the data very well. In speech recognition tasks, GMMs are frequently used to model the acoustic feature vectors. The article "Gaussian mixture models for speech recognition" deals with this application.

References

[1] Bishop, Christopher M. Pattern recognition and machine learning. Vol. 4. No. 4. New York: springer, 2006.

[2] DeGroot, Morris H., et al. Probability and statistics. Vol. 2. Reading, MA: Addison-Wesley, 1986.