1. Introduction

In this article we explain how to infer the posterior marginals of the hidden state variables in an HMM from a sequence of observations/emissions. The algorithm makes use of the principle of dynamic programming (i.e. it breaks the problem down into simpler subproblems) during two passes over the data. The first pass goes forward in time while the second goes backward in time; hence the name forward–backward algorithm.

(We assume fixed parameters; please see the article on the Baum–Welch algorithm for ways to estimate them.)

We will build up the full algorithm incrementally by

  • Deriving conditional independence assumptions on the underlying graphical model
  • Applying Bayes' rule sequentially to incorporate previous observations (forward pass)
  • Adding a backward pass so that we condition on both past and future observations

Our presentation is to a great extent based on [2], from which we also adopt large parts of the notation.

2. Forward-Backward Algorithm

2.1 Preliminaries

A first-order Hidden Markov Model has hidden variables $h_{1:T}$ and visible variables $v_{1:T}$ corresponding to the observed data.

 Figure 13.4 from (Barber, 2014)

Recall that we can factorize any joint distribution $p(x_{1:T})$ as

$$p(x_{1:T}) = \prod_{t=1}^{T} p(x_t \mid x_{1:t-1})$$

(This is sometimes called the "chain rule of probability"; the Matlab-like colon notation $x_{1:T}$ denotes the ordered index set $x_1, x_2, \ldots, x_T$.)

Applying the rule to our model, the joint probability of hidden and observed variables is given by

$$p(h_{1:T}, v_{1:T}) = p(h_1)\, p(v_1 \mid h_1) \prod_{t=2}^{T} p(v_t \mid h_t)\, p(h_t \mid h_{t-1})$$

To simplify the factored expressions we want to establish conditional independence relations. While these can be formally shown by d-separation ([4], sec. 10.5.1) or by tedious algebraic manipulation, a more intuitive way to assert independence graphically is given by the Bayes ball rules.


Figure 10.9 from ([4], 2012)

The rules drawn above are the most important ones for a Hidden Markov Model. Since blocked paths (filled circles) render the corresponding nodes conditionally independent, $v_t$ is independent of $v_{1:t-1}$ given $h_t$, and $h_t$ is independent of $v_{1:t-1}$ given $h_{t-1}$.

2.2 Forward Filtering

The goal in forward filtering is to calculate $\alpha(h_t) \equiv p(h_t, v_{1:t})$, that is, the joint probability of the hidden state at the current time $t$ together with all observations so far.

To this end we first marginalize over the previous state

$$p(h_t, v_{1:t}) = \sum_{h_{t-1}} p(h_t, h_{t-1}, v_{1:t-1}, v_t)$$

and apply the chain rule to obtain

$$p(h_t, v_{1:t}) = \sum_{h_{t-1}} p(v_t \mid h_t, h_{t-1}, v_{1:t-1})\, p(h_t \mid h_{t-1}, v_{1:t-1})\, p(h_{t-1}, v_{1:t-1})$$

Using the conditional independence assumptions of the Markov model we can simplify this equation to read

$$p(h_t, v_{1:t}) = \sum_{h_{t-1}} p(v_t \mid h_t)\, p(h_t \mid h_{t-1})\, p(h_{t-1}, v_{1:t-1})$$

This corresponds to a recursive formulation of the joint probability $\alpha(h_t) \equiv p(h_t, v_{1:t})$:

$$\alpha(h_t) = p(v_t \mid h_t) \sum_{h_{t-1}} p(h_t \mid h_{t-1})\, \alpha(h_{t-1}), \qquad t \ge 2$$

The base case is

$$\alpha(h_1) = p(h_1, v_1) = p(v_1 \mid h_1)\, p(h_1)$$

The $\alpha$-recursion can be understood as a predictor–corrector method: the filtered distribution $p(h_{t-1} \mid v_{1:t-1})$ from the previous timestep is propagated forward by the dynamics for one timestep, yielding a prior distribution at time $t$. This distribution is then modulated by the observation $v_t$, which serves to incorporate the new evidence into a posterior distribution.
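The $\alpha$-recursion can be sketched in a few lines of NumPy. The two-state transition matrix `A`, emission matrix `B`, and initial distribution `pi` below are hypothetical placeholder values for illustration, not parameters from the text:

```python
import numpy as np

def forward(A, B, pi, obs):
    """Alpha-recursion: alpha[t, i] = p(h_t = i, v_{1:t})."""
    T, K = len(obs), len(pi)
    alpha = np.zeros((T, K))
    alpha[0] = pi * B[:, obs[0]]              # base case: p(v_1 | h_1) p(h_1)
    for t in range(1, T):
        # predictor: propagate by the dynamics; corrector: weight by emission
        alpha[t] = B[:, obs[t]] * (alpha[t - 1] @ A)
    return alpha

# Hypothetical two-state, two-symbol example
A = np.array([[0.9, 0.1], [0.2, 0.8]])   # A[i, j] = p(h_t = j | h_{t-1} = i)
B = np.array([[0.7, 0.3], [0.4, 0.6]])   # B[i, k] = p(v_t = k | h_t = i)
pi = np.array([0.5, 0.5])                # p(h_1)
alpha = forward(A, B, pi, [0, 1, 1])
filtered = alpha / alpha.sum(axis=1, keepdims=True)  # p(h_t | v_{1:t})
```

Normalizing each row of `alpha` turns the joint $p(h_t, v_{1:t})$ into the filtered posterior $p(h_t \mid v_{1:t})$.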

2.3 Backward Smoothing

If we do not only take into account previous observations but process the data offline, we can possibly obtain a better (smoother) estimate. Parallel smoothing incorporates information from past and future, using the fact that $h_t$ (when observed) d-separates the past from the future. Formally, that is

$$p(h_t, v_{1:T}) = p(h_t, v_{1:t})\, p(v_{t+1:T} \mid h_t) = \alpha(h_t)\, \beta(h_t)$$

We see that the desired posterior $p(h_t \mid v_{1:T}) \propto p(h_t, v_{1:T})$ can be factored into the past and future contributions $\alpha(h_t) = p(h_t, v_{1:t})$ and $\beta(h_t) \equiv p(v_{t+1:T} \mid h_t)$.

A recursive update formula for $\beta(h_t)$ can be derived as follows. Starting from the definition and marginalizing over $h_t$,

$$\beta(h_{t-1}) = p(v_{t:T} \mid h_{t-1}) = \sum_{h_t} p(v_t, v_{t+1:T}, h_t \mid h_{t-1})$$

Repeatedly applying the chain rule we have

$$\beta(h_{t-1}) = \sum_{h_t} p(v_t \mid v_{t+1:T}, h_t, h_{t-1})\, p(v_{t+1:T} \mid h_t, h_{t-1})\, p(h_t \mid h_{t-1})$$

so that, using the conditional independences of the model, we can identify the recursion

$$\beta(h_{t-1}) = \sum_{h_t} p(v_t \mid h_t)\, p(h_t \mid h_{t-1})\, \beta(h_t), \qquad 2 \le t \le T,$$

with initialisation $\beta(h_T) = 1$.
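In code the $\beta$-recursion runs backward from $t = T$; with 0-based array indices the update fills `beta[t]` from `beta[t + 1]`. The parameters below are hypothetical placeholders, not values from the text:

```python
import numpy as np

def backward(A, B, obs):
    """Beta-recursion: beta[t, i] = p(v_{t+1:T} | h_t = i)."""
    T, K = len(obs), A.shape[0]
    beta = np.ones((T, K))                # initialisation: beta(h_T) = 1
    for t in range(T - 2, -1, -1):
        # beta(h_t) = sum_{h'} p(v_{t+1} | h') p(h' | h_t) beta(h')
        beta[t] = A @ (B[:, obs[t + 1]] * beta[t + 1])
    return beta

# Hypothetical two-state, two-symbol example
A = np.array([[0.9, 0.1], [0.2, 0.8]])   # p(h_t | h_{t-1})
B = np.array([[0.7, 0.3], [0.4, 0.6]])   # p(v_t | h_t)
beta = backward(A, B, [0, 1, 1])
```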

2.4 Complete Forward–Backward Algorithm

Together the $\alpha$ and $\beta$ recursions make up the forward–backward algorithm, and the smoothed posterior distribution is given by

$$p(h_t \mid v_{1:T}) = \frac{\alpha(h_t)\, \beta(h_t)}{\sum_{h_t} \alpha(h_t)\, \beta(h_t)}$$
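Putting both recursions together, the smoothed marginals can be computed in one self-contained routine. Again, the parameter values are hypothetical placeholders:

```python
import numpy as np

def forward_backward(A, B, pi, obs):
    """Smoothed posteriors p(h_t | v_{1:T}) via the alpha and beta recursions."""
    T, K = len(obs), len(pi)
    # forward pass: alpha[t, i] = p(h_t = i, v_{1:t})
    alpha = np.zeros((T, K))
    alpha[0] = pi * B[:, obs[0]]
    for t in range(1, T):
        alpha[t] = B[:, obs[t]] * (alpha[t - 1] @ A)
    # backward pass: beta[t, i] = p(v_{t+1:T} | h_t = i)
    beta = np.ones((T, K))
    for t in range(T - 2, -1, -1):
        beta[t] = A @ (B[:, obs[t + 1]] * beta[t + 1])
    # combine and normalize: p(h_t | v_{1:T}) ∝ alpha(h_t) beta(h_t)
    post = alpha * beta
    return post / post.sum(axis=1, keepdims=True)

A = np.array([[0.9, 0.1], [0.2, 0.8]])
B = np.array([[0.7, 0.3], [0.4, 0.6]])
pi = np.array([0.5, 0.5])
gamma = forward_backward(A, B, pi, [0, 1, 1])
```

Note that at $t = T$ the smoothed posterior coincides with the filtered one, since $\beta(h_T) = 1$.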

3. Illustration

As an illustrative toy example, take a look at the "dishonest casino" HMM described in [4] and [3].

Figure 10.9 from ([4])

Imagine a dishonest casino which may occasionally use a loaded (L) die skewed towards 6. (For a fair (F) die the emission probabilities are given by a uniform distribution over the integers $1, \ldots, 6$; for the loaded die the probability of a 6 is much higher, e.g. $p(v = 6) = 1/2$ in our example.) Our filtering/smoothing task in this setting is to infer whether the loaded die is in play just by considering a sequence of games.

Typical emissions in that setting may look as follows:
Rolls: 664153216162115234653214356634261655234232315142464156663246
the corresponding (hidden) ground truth would be:
The resulting output $p(h_t = \text{L} \mid v_{1:t})$ after forward filtering is
Figure 10.9a from ([4], 2012)
After forward–backward smoothing we obtain $p(h_t = \text{L} \mid v_{1:T})$:
Figure 10.9b from (Murphy, 2012)
We see that forward–backward smoothing indeed gives a better (smoother) estimate. If we threshold the estimates at $0.5$ and compare to the true sequence, the smoothed method makes fewer errors than the filtered one.
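The casino example can be reproduced in a short script. The text does not list the transition probabilities, so the values below follow a common textbook choice and should be treated as assumptions; the roll sequence is the one shown above:

```python
import numpy as np

# Assumed parameters: sticky dynamics, fair die uniform, loaded die p(6) = 1/2
A = np.array([[0.95, 0.05],    # fair   -> fair / loaded
              [0.10, 0.90]])   # loaded -> fair / loaded
B = np.vstack([np.full(6, 1 / 6),                 # fair die: uniform over 1..6
               [0.1, 0.1, 0.1, 0.1, 0.1, 0.5]])   # loaded die favours 6
pi = np.array([0.5, 0.5])

rolls = "664153216162115234653214356634261655234232315142464156663246"
obs = [int(c) - 1 for c in rolls]      # map die faces 1..6 to indices 0..5

T, K = len(obs), 2
alpha = np.zeros((T, K))
beta = np.ones((T, K))
alpha[0] = pi * B[:, obs[0]]
alpha[0] /= alpha[0].sum()
for t in range(1, T):                  # normalized forward pass (filtering)
    a = B[:, obs[t]] * (alpha[t - 1] @ A)
    alpha[t] = a / a.sum()
for t in range(T - 2, -1, -1):         # normalized backward pass
    b = A @ (B[:, obs[t + 1]] * beta[t + 1])
    beta[t] = b / b.sum()
post = alpha * beta                    # smoothed p(h_t | v_{1:T})
post /= post.sum(axis=1, keepdims=True)
loaded = post[:, 1] > 0.5              # threshold the smoothed estimate
```

The per-timestep normalizations only rescale $\alpha$ and $\beta$ by constants, which cancel in the final row-wise normalization of the posterior.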

4. Discussion

  • As both forward and backward recursions involve repeated multiplication by factors smaller than one, the $\alpha$ and $\beta$ values underflow quickly, so it is advisable to work in the log domain. Alternatively, if one is only interested in the posteriors, normalizing at each stage such that $\sum_{h_t} \hat{\alpha}(h_t) = 1$ is also a viable approach, since the normalization constants cancel in $p(h_t \mid v_{1:T})$.
  • An alternative to the parallel approach presented here is correction smoothing ([2], sec. 23.2.4), which forms a direct recursion for the posterior $p(h_t \mid v_{1:T})$. This procedure, also referred to as the Rauch–Tung–Striebel method, is sequential since the $\alpha$ recursions must be completed before the smoothing recursions may be started.
  • In relation to Viterbi decoding, which computes the maximum a posteriori (MAP) state sequence $\operatorname{argmax}_{h_{1:T}} p(h_{1:T} \mid v_{1:T})$, the approach presented here can be used to compute the maximizer of the posterior marginals (MPM) $\operatorname{argmax}_{h_t} p(h_t \mid v_{1:T})$. The advantage of the joint MAP estimate is that it is always globally consistent, which is desirable if we require that the data be explained by a single consistent (e.g. linguistically plausible) path. On the other hand, the MPM is more robust, since for each node we average over the values of its neighbours rather than conditioning on a specific value (see [4] for additional details). With regard to time and space complexity, both algorithms are generally of order $O(T K^2)$ in time and $O(T K)$ in space, for $T$ timesteps and $K$ hidden states.
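The normalization remedy from the first point above can be sketched as follows; the parameters are hypothetical, and the key point is that the scaled recursion stays finite on long sequences where the unscaled $\alpha$ would underflow:

```python
import numpy as np

def forward_normalized(A, B, pi, obs):
    """Scaled alpha-recursion: returns hat_alpha[t] = p(h_t | v_{1:t})."""
    T, K = len(obs), len(pi)
    hat = np.zeros((T, K))
    a = pi * B[:, obs[0]]
    hat[0] = a / a.sum()
    for t in range(1, T):
        a = B[:, obs[t]] * (hat[t - 1] @ A)
        hat[t] = a / a.sum()       # renormalize each step to avoid underflow
    return hat

A = np.array([[0.9, 0.1], [0.2, 0.8]])
B = np.array([[0.7, 0.3], [0.4, 0.6]])
pi = np.array([0.5, 0.5])
hat = forward_normalized(A, B, pi, [0, 1] * 200)   # 400 steps, no underflow
```

Accumulating the logs of the per-step normalizers would additionally recover the log-likelihood $\log p(v_{1:T})$ if it were needed.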

5. Further Reading

[1]: Christopher Bishop, Pattern recognition and machine learning,  Springer,  2006.

[2]: David Barber, Bayesian reasoning and machine learning, Cambridge University Press, 2014.

[3]: Richard Durbin, Biological sequence analysis: probabilistic models of proteins and nucleic acids, Cambridge University Press, 1998.

[4]: Kevin Murphy, Machine learning: a probabilistic perspective, MIT Press, 2012.