# 1. Introduction

A Hidden Markov Model describes the joint probability of a collection of 'hidden' and 'observed' random variables. It relies on the assumption that the distribution of hidden variables depends only on the immediate predecessors, that the realization of each variable  is independent of previous hidden variables if the value of the hidden variable in the previous timestep is known (this is often called first-order Markov property). Furthermore the current observation variables is assumed to depend only on the current hidden state (this assumption is actually a little optimistic, see the pages Acoustic Model and Discrimiative Training for further information).

The Baum–Welch algorithm uses the EM algorithm to find the maximum likelihood estimate of the parameters of a hidden Markov model given a set of observations.

More mathematically, training a Hidden Markov Model is basically a parameter estimation task, we want to find the parameters characterizing the joint distribution $p[X,&space;Z&space;\left|&space;\lambda]$ ($\lambda=&space;\left(\pi,&space;A,&space;\Phi&space;\right)$, initial state distibution $\pi$, state transition matrix $A$ and observation parameters $\Phi$) based on observations $X$.

In a  maximum likelihood approach we set the values to the value maximizing the probability of the evidence (i.e. the observed data).

If labeled training data is available the parameters could (theoretically) be determined from the data likelihood obtained through marginalization over the hidden variables.  In practice this is infeasible for non trivial cases because the joint distribution does not factorize over time indices and the computational effort therefore grows exponentially with the length of the observation sequence. (see (Bishop 2006, sec. 13.2.1) for an extensive discussion of that matter).

In this article we present the Baum Welch Algorithm, a special case of the general expectation maximization (EM) algorithm, which gives us an iterative procedure to infer the parameters in realistic scenarios.

Our presentation is to great extent based on (Bishop 2006), additional references can be found in section Further Reading

# 2. Baum Welch Algorithm

## 2.1 Preliminiaries

We assume a discrete state Hidden Markov Model with a known number of states $K$. The $K$ hidden states of the latent variables can be represented using a $1$ of $K$ coding scheme such that for each realization we are given a vector $z$ with exactly one element equal to one and the remaining equal to zero.

We represent the transition probabilities in matrix form as

$p\left(z_{n,k}=1|z_{n-1,j}=1\right)\to&space;A_{j,k}$

and obtain
$p\left(z_n|z_{n-1},A\right)\to&space;\prod&space;_{k=1}^K&space;\prod&space;_{j=1}^J&space;A_{j,k}^{z_{n-1,j}&space;z_{n,k}}$

Similarily the emission probalities can be written as

$p\left(x_n|z_n,\Phi&space;\right)\to&space;\prod&space;_{k=1}^K&space;p\left(x_n|\Phi&space;_k\right){}^{z_{n,k}}$

where $\Phi_k$ bundles the parameters for the $k$th component (For $D$ dimensional observations $x$ and Gaussian emissions $\Phi_k$ that would be $\mu_k&space;\in&space;\mathbb{R}^{D},&space;C_k&space;\in&space;\mathbb{R}^{D&space;\times&space;D}$, for a multinoulli model (see also (Bishop 2006, Appendix B)) simply a vector $\mu_k&space;\in&space;\mathbb{R}^{D}$).  Intuitively we can interpret $h_t$ as a gating variable that selects the distribution to use (see also the article on Gaussian Mixture Models).

The joint probability distribution over both latent and observed variables is then
$p(X,Z|\theta&space;)=p\left(\left.z_1\right|\pi&space;\right)&space;\left(\prod&space;_{n=2}^N&space;p\left(z_n|z_{n-1},A\right)\right)&space;\prod&space;_{m=1}^N&space;p\left(x_m|z_m,\Phi&space;\right)$
(We bundle observation and hidden variables in matrices $X&space;\in&space;\mathbb{R}^{N&space;\times&space;D},&space;Z&space;\in&space;\mathbb{R}^{N&space;\times&space;K}$ and let $\lambda$ denote the collection of parameters $(\pi,&space;A,&space;\Phi)$, where $\pi$ gives the probability distribution over the intial state)

## 2.2 Expectation Maximization

Recall that the EM Algorithm iterates maximization of a lower bound on the complete data log likelihood and maximum likelihood estimation of the parameters.  More formally the corresponding objective function is

$Q\left(\theta&space;,\overset{\text{(old)}}{\theta&space;}\right)\to&space;\sum&space;_Z&space;p\left(Z|X,\overset{\text{(old)}}{\theta&space;}\right)&space;\log&space;(p(X,Z|\theta&space;))$

$\gamma&space;\left(z_n\right)\to&space;p\left(\left.z_n\right|X,\overset{\text{(old)}}{\theta&space;}\right)$

$\zeta&space;\left(z_{n-1},z_n\right)\to&space;p\left(z_{n-1},\left.z_n\right|X,\overset{\text{(old)}}{\theta&space;}\right)$

we have

$\sum&space;_z&space;\gamma&space;\left(z_n\right)&space;z_{n,k}\to&space;\mathbb{E}\left(z_{n,k}\right)\to&space;\gamma&space;\left(z_{n,k}\right)$

and

$\sum&space;_{z_{n-1},z_n}&space;z_{n-1,j}&space;z_{n,k}&space;\zeta&space;\left(z_{n-1},z_n\right)\to&space;\mathbb{E}\left(z_{n-1,j}&space;z_{n,k}\right)\to&space;\zeta&space;\left(z_{n-1,j},z_{n,k}\right)$
(The expectation of a binary variable is just the probability that it takes the value $1$)

such that

$\inline&space;Q\left(\theta&space;,\overset{\text{old}}{\theta&space;}\right)\to&space;\sum&space;_{k=1}^K&space;\log&space;\left(\pi&space;_k\right)&space;\gamma&space;\left(z_{1,k}\right)+\sum&space;_{n=2}^N&space;\sum&space;_{j=1}^K&space;\sum&space;_{k=1}^K&space;\log&space;\left(A_{j,k}\right)&space;\zeta&space;\left(z_{n-1,j},z_{n,k}\right)+\sum&space;_{n=1}^N&space;\sum&space;_{k=1}^K&space;\gamma&space;\left(z_{n,k}\right)\text{Log}\left[p\left(x_n|\phi&space;_k\right)\right.$

### 2.2.1 E-Step

Now the E-Step is actually almost completely given by the Forward-Backward Algorithm, we have

$\gamma&space;\left(z_n\right)\to&space;\frac{p\left(z_n\right)&space;p\left(X\left|z_n\right.\right)}{p(X)}&space;\to&space;\frac{\alpha&space;\left(z_n\right)&space;\beta&space;\left(z_n\right)}{p(X)}$

where

$p(X)\to&space;\sum&space;_{n=1}^N&space;\alpha&space;\left(z_n\right)$

and

$\zeta&space;\left(z_{n-1},z_n\right)\to&space;\frac{\alpha&space;\left(z_{n-1}\right)&space;\beta&space;\left(z_n\right)&space;p\left(z_n|z_{n-1}\right)&space;p\left(x_n|z_n\right)}{p(X)}$

### 2.2.2 M-Step

In the M step we maximize $Q\left(\theta&space;,\overset{\text{old}}{\theta&space;}\right)$ with respect to the parameters $\lambda$, treating $\gamma&space;\left(z_n\right)$ and $\zeta&space;\left(z_{n-1},z_n\right)$ as constant

If the parameters of the emission densities $\Phi_k$ are independent the problem decouples and we can find the ML estimates as follows

#### 2.2.2.1 Discrete Emissions

For a multinoulli emission model each observation corresponds to a binary vector of length $D$ and we have the conditional observation distribution

$p(x|z)&space;\to&space;\prod&space;_{d=1}^D&space;\prod&space;_{k=1}^K&space;\mu&space;_{d,k}{}^{x_d,z_k}$

and the corresponding estimate is

$\mu&space;_{d,k}\to&space;\frac{\sum&space;_{n=1}^N&space;x_{n,d}&space;\gamma&space;\left(z_{n,k}\right)}{\sum&space;_{n=1}^N&space;\gamma&space;\left(z_{n,k}\right)}$

#### 2.2.2.2 Continuous Emissions

A simple but practical relevant case of a continuous observation density is a Gaussian distribution

$p\left(x\left|\phi&space;_k\right.\right)\to&space;\mathcal{N}\left(x\left|\mu&space;_k\right.,C_k\right)$

for which we obtain

$\mu&space;_k\to&space;\frac{\sum&space;_{n=1}^N&space;x_n&space;\gamma&space;\left(z_{n,k}\right)}{\sum&space;_{n=1}^N&space;\gamma&space;\left(z_{n,k}\right)}$

and

$\Sigma&space;_k\to&space;\frac{\sum&space;_{n=1}^N&space;\gamma&space;\left(z_{n,k}\right)&space;\left(x_n-\mu&space;_k\right).\left(x_n-\mu&space;_k\right){}^{\mathsf{T}}}{\sum&space;_{n=1}^N&space;\gamma&space;\left(z_{n,k}\right)}$

## 2.3 Example

For illustration let's consider a two state  ($K&space;=&space;2$) HMM with Gaussian emissions $p(x&space;\left|&space;z_k)&space;\sim&space;\mathcal{N}(x&space;|&space;\mu_k,&space;C_k)$.

We take true parameters

$&space;a\to&space;\left(&space;\begin{array}{c}&space;0.9&space;\\&space;0.1&space;\\&space;\end{array}&space;\right),&space;\quad&space;A\to&space;\left(&space;\begin{array}{cc}&space;0.95&space;&&space;0.05&space;\\&space;0.75&space;&&space;0.25&space;\\&space;\end{array}&space;\right)&space;$

$\mu_1&space;\to&space;\left(&space;\begin{array}{c}&space;-1&space;\\&space;-1&space;\\&space;\end{array}&space;\right),&space;\quad&space;C_1&space;\to&space;\left(&space;\begin{array}{cc}&space;1&space;&&space;0.3&space;\\&space;0.3&space;&&space;0.8&space;\\&space;\end{array}&space;\right)&space;$

$\mu_2&space;\to&space;\left(&space;\begin{array}{c}&space;2&space;\\&space;2&space;\\&space;\end{array}&space;\right),&space;\quad&space;C_2&space;\to&space;\left(&space;\begin{array}{cc}&space;0.7&space;&&space;0.6&space;\\&space;0.6&space;&&space;1.0&space;\\&space;\end{array}&space;\right)&space;$

A Matlab implementation based on code from Sebastien Paris (available on the Mathworks FileExchange) converges to estimated parameter values

$&space;a\to&space;\left(&space;\begin{array}{c}&space;0.999&space;\\&space;0.001&space;\\&space;\end{array}&space;\right),&space;\quad&space;A\to&space;\left(&space;\begin{array}{cc}&space;0.9247&space;&&space;0.0753&space;\\&space;0.2232&&space;0.7768&space;\\&space;\end{array}&space;\right)&space;$

$\mu_1&space;\to&space;\left(&space;\begin{array}{c}&space;-1.1382&space;\\&space;-1.0514&space;\\&space;\end{array}&space;\right),&space;\quad&space;C_1&space;\to&space;\left(&space;\begin{array}{cc}&space;0.9570&space;&&space;0.2242&space;\\&space;0.2242&space;&&space;0.7643&space;\\&space;\end{array}&space;\right)&space;$

$\mu_2&space;\to&space;\left(&space;\begin{array}{c}&space;2.1227&space;\\&space;1.8584&space;\\&space;\end{array}&space;\right),&space;\quad&space;C_2&space;\to&space;\left(&space;\begin{array}{cc}&space;0.9784&space;&&space;0.7754&space;\\&space;0.7754&space;&&space;0.9633&space;\\&space;\end{array}&space;\right)&space;$

(Parameters initialized at random, 100 training samples, 30 Iterations of the algorithm)

The resulting clustering can be visualized as follows

# 3. Discussion

As EM methods are notorious for getting stuck in local optima (the underlying problem is nonconvex) we want to reference alternative approaches and mention some caveats

•  Extensions Staying with the EM approach deterministic annealing (see i.e Rao 2001) can mitigate the danger of local minima.  Also it is advisable to spend some effort on initialisation
• Use fully labeled data for the initial settings of the parameters
• Initially ignore the time dependencies but treat the observations as IID random variables and apply standard methods (e.g. K-Means)
• Initialize at random but run the algorithm multiple times and pick the best solution.
• A comparative analysis of appropriate stopping (convergence criteria) can be found in (Abbi 2008)
• Bayesian Approaches: EM returns a maximum a posteriori (likelihood) estimate. Alternatives (more Bayesian approaches) are

• Markov Chain Monte Carlo (MCMC): Generate hidden paths using forwards-filtering and backwards sampling and obtain the parameters from their posteriors conditioned on the simulated data (see (Fruhwirt-Schnatter 2007) for details)

•  Variational Bayes EM (Beal 2003): Use the posterior mean parameters instead of the MAP estimates in the E step and update the parameters of the conjugate posteriors in the M step

• Alternative gradient based methods allowing online learning are described in (Baldi 1994)

(Abbi 2008) Revlin Abbi, Analysis of stopping criteria for the EM algorithm in the context of patient grouping according to length of stay, 4th International IEEE Conference on Intelligent Systems, 2008.

(Baldi 1994) Patrick Balid, Smooth online learning algorithms for hidden Markov models, Journal on Neural Computation, Issue 6, 1994

(Barber 2014): David Barber, Bayesian reasoning and machine learning, Cambridge University Press, 2014.

(Beal 2003) Matthew Beal, Variational Algorithms for Approximate Bayesian Inference, Ph.D. Thesis, Gatsby University, 2003

(Bishop 2006): Christopher Bishop, Pattern recognition and machine learning,  Springer,  2006.

(Fruhwirt 2007): Sabine Fruhwirt-Schnatter, Finite Mixture and Markov Switching Models, Springer 2007.

(Murphy 2012): Kevin Murphy, Machine Learning - A probabilistic perspective, MIT Press, 2012.