In this article we explain how to infer the posterior marginals of the hidden state variables in a HMM from a sequence of observations/emissions. The algorithm makes use of the principle of dynamic programming (i.e. it breaks down the problem into simpler ones) during two passes over the data. The first pass goes forward in time while the second goes backward in time; hence the name forward–backward algorithm.
(We assume fixed parameters please see the article on Baum Welch Algorithm for ways to estimate the parameters.)
We will build up the full algorithm incrementally by
- Deriving conditional independence assumptions on the underlying graphical model
- Applying Bayes sequentially to incorporate previous observations (forward pass)
- Adding a backward pass so that we condition on past and previous observations
Our presentation is to great extent based on  from which we also adopted large parts of notation.
2. Forward-Backward Algorithm
A first order Hidden Markov Model has hidden variables and visible variables corresponding to observed data.
Figure 13.4 from (Barber, 2014)
Recall that we can factorize any joint distribution as
(This is sometimes called "chain rule of probability", the Matlab like notation denotes the ordered index set )
Applying the rule to our model the joint probability of hidden and observed variables is given by
To simplify the factored expressions we want to establish conditional independence relations. While these can be formally shown by D-Separation (, sec. 10.5.1) or tedious algebraic manipulations a more intuitive way to assert independence graphically is given by the Bayes ball rules.
Figure 10.9 from (, 2012)
The rules drawn above show the most important rules in a Hidden Markov Model. As blocked paths (filled cycles) render the corresponding nodes independent, is independent of given as well as and and are.
2.2 Forward Filtering
The goal in forward filtering is to calculate that is the joint probability of the hidden state at the current time together with all previous observations
To this end we first marginalize over the previous states
and apply the chain rule to obtain
Using the conditional independence assumptions of the Markov model we can simplify this equation to read
This corresponds to a recursive formulation of the joint probability
The base case is
The -Recursion can be understood as a predictor corrector method
The filtered distribution from the previous timestep is propagated forwards by the dynamics for one timestep to reveal a prior distribution at time . This distribution is then modulated by the observation , which serves to incorporate the new evidence into a posterior distribution.
2.3 Backward Smoothing
If we do not only take into account previous observations but process the data offline we can possibly obtain a better (smoother) estimate. Parallel smoothing incorporates information from past and future using the fact that ( when observed ) d-separates the past from the future. Formally that is
We see that the desired posterior can be factored into the past and future contribution and .
A recursive update formula for can be derived as follows
Repeatedly applying the chain rule we have
So that we can identify the recursion
2.5 Complete Forward Backward Algorithm
Together the recursions make up the forward backward algorithm and the smoothed posterior distribution is given by
As an illustrative toy example take a look at the "dishonest casino HMM" described by ( respectively )
Figure 10.9 from ()
Imagine a dishonest casino which may occasionally use a loaded (L) die skewed towards . (For a fair (F) die the emission probabilities are given by a uniform distributions over the integers , for the loaded die the probablity is much higher e.g. in our example) Our filtering / smoothing task in this setting would be to infer whether this is the case just by considering a sequence of games.
- As both forward and backward recursion involve repeated multiplication with factors it is advisable to work in the log domain. Alternatively if one is only interested into the posteriors normalization at each stage such that is also a viable approach.
- An alternative to the parallel approach presented here is correction smoothing (Barber 2014, sec. 23.2.4) which forms a direct recursion for the posterior . This procedure also referred to as Rauch-Tung-Striebel method is sequential because since the recursions must be completed before the recursions may be started.
- In relation to the Viterbi-Decoding which computes maximum a posteriori state sequence (MAP) the approach presented can be used to compute the maximizer of the posterior marginals (MPM) . The advantage of the joint MAP estimate is that it is always globally consistent which is desireable if we have the requirement that data should be explained by a single consistent (e.g. lingualistically plausible) path. On the other hand the MPM is more robust since for each node we average over the values of its neighbour rather than conditioning on a specific value (see (Murphy 2012, sec. 184.108.40.206)) for additional details. With regards to time and space complexity both algorithms generally are of order and .
4. Further Reading
: Christopher Bishop, Pattern recognition and machine learning, Springer, 2006.
: David Barber, Bayesian reasoning and machine learning, Cambridge University Press, 2014.
: Richard Durbin, Biological sequence analysis: probabilistic models of proteins and nucleic acids, Cambridge University Press, 1998.
: Kevin Murphy, Machine Learning - A probabilistic perspective, MIT Press, 2012.