# 1. Introduction

In this article we explain how to infer the posterior marginals of the hidden state variables in a HMM from a sequence of observations/emissions.  The algorithm makes use of the principle of dynamic programming (i.e. it breaks down the problem into simpler ones) during two passes over the data. The first pass goes forward in time while the second goes backward in time; hence the name forward–backward algorithm.

(We assume fixed parameters please see the article on Baum Welch Algorithm for ways to estimate the parameters.)

We will build up the full algorithm incrementally by

• Deriving conditional independence assumptions on the underlying graphical model
• Applying Bayes sequentially to incorporate previous observations (forward pass)
• Adding a backward pass so that we condition on past and previous observations

Our presentation is to great extent based on [2] from which we also adopted large parts of notation.

# 2. Forward-Backward Algorithm

## 2.1 Preliminaries

A first order Hidden Markov Model has hidden variables $\inline&space;h_t$ and visible variables $v_t$ corresponding to observed data.

Figure 13.4 from (Barber, 2014)

Recall that we can factorize any joint distribution  $p(x_{1:V})$ as

$p(x_{1:V})&space;=&space;p(x_1)&space;p(x_2&space;|&space;x_1)&space;p(x_3&space;|&space;x_{1:2}),&space;\ldots&space;p(x_V&space;|&space;x_{1:V-1})&space;$

(This is sometimes called "chain rule of probability", the Matlab like notation $i:j$ denotes the ordered index set $\{i,&space;i+1,&space;i+2,&space;i+3,&space;\ldots,&space;j-1,&space;j\}$)

Applying the rule to our model the joint probability of hidden and observed variables is given by

$p\left[h_{1:T},v_{1:T}\right]=&space;p\left(h_1\right)&space;p\left(v_1|h_1\right)&space;\prod&space;_{t=2}^T&space;p\left(h_t|h_{t-1}\right)&space;p\left(v_t|h_t\right)$

To simplify the factored expressions we want to establish conditional independence relations. While these can be formally shown by D-Separation ([4], sec. 10.5.1) or tedious algebraic manipulations a more intuitive way to assert independence graphically is given by the Bayes ball rules.

Figure 10.9 from ([4], 2012)

The rules drawn above show the most important rules in a Hidden Markov Model. As blocked paths (filled cycles) render the corresponding nodes independent, $h_{t+1}$ is independent of $h_{t-1}$ given $h_t$ as well as $v_{t-1}$ and $v_{t}$ and$v_{t+1}$ are.

## 2.2 Forward Filtering

The goal in forward filtering is to calculate $\alpha(h_t)&space;:=&space;p(h_t,&space;v_{1:t})$ that is the joint probability of the hidden state at the current time $t$ together with all previous observations

To this end we first marginalize over the previous states

$p\left[h_t,v_{1:t}\right]\to&space;\sum&space;_{h_{t-1}}&space;p\left[h_t,h_{t-1},v_{1:t}\right]$

and apply the chain rule to obtain

$p\left[h_t,v_{1:t}\right]\to&space;\sum&space;_{h_{t-1}}&space;p\left[v_t|v_{1:t-1},h_t,h_{t-1}\right]p\left[h_t|}v_{1:t-1},h_{t-1}\right]p\left[v_{1:t-1},h_{t-1}\right]$

Using the conditional independence assumptions of the Markov model we can simplify this equation to read

$p\left[h_t,v_{1:t}\right]\to&space;\sum&space;_{h_{t-1}}&space;p\left(v_t|h_t\right)p\left(h_t|h_{t-1}\right)p\left[v_{1:t-1},h_{t-1}\right]$

This corresponds to a recursive formulation of the joint probability $\alpha(h_t)&space;:=&space;p(h_t,&space;v_{1:t})$

$p\left[h_t,v_{1:t}\right]\to&space;\alpha(h_t)&space;\to&space;p(v_t&space;|&space;h_t)&space;\sum\limits_{h_{t-1}}&space;p(h_t&space;|&space;h_{t&space;-&space;1})&space;\alpha(h_{t-1})&space;\quad&space;t&space;>&space;1&space;$

The base case is

$\alpha&space;\left(h_1\right)\to&space;p\left(h_1,v_1\right)\to&space;p\left(h_1\right)&space;p\left(v_1|h_1\right)$

### Interpretation

The $\alpha$-Recursion can be understood as a predictor corrector method

$\alpha(h_t)&space;\to&space;\underbrace{p\left(v_t|h_t\right)}_{\text{Corrector}}\underbrace{\sum&space;_{h_{t-1}}&space;p\left(h_t|h_{t-1}\right)&space;\alpha&space;\left(h_{t-1}\right)}_{\text{Predictor}}$

The filtered distribution $\alpha(h_{t-1})$ from the previous timestep is propagated forwards by the dynamics for one timestep to reveal a prior distribution at time $t$. This distribution is then modulated by the observation $v_t$, which serves to incorporate the new evidence into a posterior distribution.

## 2.3 Backward Smoothing

If we do not only take into account previous observations but process the data offline we can possibly obtain a better (smoother) estimate. Parallel smoothing incorporates information from past and future using the fact that ( when observed ) $h_t$ d-separates the past from the future.  Formally that is

$p\left[h_t,v_{1:T}\right]\to&space;p\left[h_t,v_{1:t},v_{t+1:T}\right]\to&space;\underbrace{p\left[h_t,v_{1:t}\right]}_{\alpha({h_t})}\underbrace{p\left[v_{t+1:T}|h_t\right]}_{\beta(h_t)}$

We see that the desired posterior $p(h_t,&space;v_{1:T})$ can be factored into the past and future contribution $\alpha(h_t)&space;:=&space;p(h_t,&space;v_{1:T})$ and $\beta(h_t)&space;:=&space;p(v_{t&space;+&space;1&space;:&space;T}&space;|&space;h_t)$

A recursive update formula for $\beta(h_t)$ can be derived as follows

$p\left(v_{t:T}|h_{t-1}\right)\to&space;\sum&space;_{h_t}&space;p\left[v_t,v_{t+1:T},h_t|h_{t-1}\right]$

Repeatedly applying the chain rule we have

$p\left(v_{t:T}|h_{t-1}\right)\to&space;\sum&space;_{h_t}&space;p\left[v_t|v_{t+1:T},h_t,h_{t-1}\right]p\left[v_{t+1:T},h_t|h_{t-1}\right]&space;$

and

$p\left(v_{t:T}|h_{t-1}\right)\to&space;\sum&space;_{h_t}&space;p\left(v_t|h_t\right)p\left[v_{t+1:T}|h_t,h_{t-1}\right]p\left(h_t|h_{t-1}\right)&space;$

So that we can identify the recursion

$\beta&space;\left(h_{t-1}\right)&space;\leftarrow&space;\sum&space;_{h_t}&space;p\left(v_t|h_t\right)&space;\beta&space;\left(h_t\right)&space;p\left(h_t|h_{t-1}\right)&space;\quad&space;2&space;\leq&space;t&space;\leq&space;T$

with initialisation $\beta(h_T)&space;\leftarrow&space;1$

## 2.5 Complete Forward Backward Algorithm

Together the $\alpha-\beta$ recursions make up the forward backward algorithm and the smoothed posterior distribution is given by

$p\left[h_t|v_{1:T}\right]\text{:=}&space;\frac{\alpha&space;\left(h_t\right)&space;\beta&space;\left(h_t\right)}{\sum&space;_{h_t}&space;\alpha&space;\left(h_t\right)&space;\beta&space;\left(h_t\right)}&space;$

## 3. Illustration

As an illustrative toy example take a look at the "dishonest casino HMM" described by ([4] respectively [3])

Figure 10.9 from ([4])

Imagine a dishonest casino which may occasionally use a loaded (L) die skewed towards $6$.  (For a fair (F) die the emission probabilities are given by a uniform distributions over the integers $\{1:6\}$, for the loaded die the probablity is much higher e.g. $\inline&space;\frac{1}{2}$ in our example) Our filtering / smoothing task in this setting would be to infer whether this is the case just by considering a sequence of games.

Typical emissions in that setting may look as follows:

Rolls: 664153216162115234653214356634261655234232315142464156663246

the corresponding (hidden) ground truth would be:

Die:   LLLLLLLLLLLLLLFFFFFFLLLLLLLLLLLLLLFFFFFFFFFFFFFFFFFFLLLLLLLL

The resulting output $p(h_{t}&space;|&space;v_{1:t})$ after forward filtering is

Figure 10.9a from ([4], 2012)
After forward-backward filtering we obtain $p(h_{t}&space;|&space;v_{1:T})$:

Figure 10.9b from (Murphy, 2012)

We see that forwards-backwards smoothing gives indeed a better (smoother) estimate. If we threshold the estimates at $0.5$ and compare to the true sequence, we find that the filtered method makes $71$ errors out of $300$, and the smoothed method makes $49/300$

# 3. Discussion

• As both forward and backward recursion involve repeated multiplication with factors $\leq&space;1$ it is advisable to work in the log domain.  Alternatively if one is only interested into the posteriors normalization at each stage such that $\sum&space;_{h_t}&space;\beta&space;\left(h_t\right)=1$ is also a viable approach.
• An alternative to the parallel approach presented here is correction smoothing (Barber 2014, sec. 23.2.4) which forms a direct recursion for the posterior $\inline&space;\gamma&space;\left(h_t\right)=\sum&space;_{h_{t+1}}&space;p\left[h_t|h_{t+1},v_{1:t}\right]\gamma&space;\left(h_{t+1}\right)$.  This procedure also referred to as Rauch-Tung-Striebel method is sequential because since the $\alpha$ recursions must be completed before the $\gamma$ recursions may be started.
• In relation to the Viterbi-Decoding which computes maximum a posteriori state sequence (MAP) $\operatorname{argmax}_{h_{1:T}}p(h_{1:T}|v_{1:T})$ the approach presented can be used to compute the maximizer of the posterior marginals (MPM) $\left[\operatorname{argmax}_{h_{1}}p(h_{1}|v_{1:T}),&space;\ldots,&space;\operatorname{argmax}_{h_{T}}p(h_{T}|v_{1:T})\right]$.  The advantage of the joint MAP estimate is that it is always globally consistent which is desireable if we have the requirement that data should be explained by a single consistent (e.g. lingualistically plausible) path.  On the other hand the MPM is more robust since for each node we average over the values of its neighbour rather than conditioning on a specific value (see (Murphy 2012, sec. 17.4.4.1)) for additional details.  With regards to time and space complexity both algorithms generally are of order $O(K^2&space;T)$ and $O(K&space;T)$.

[1]: Christopher Bishop, Pattern recognition and machine learning,  Springer,  2006.

[2]: David Barber, Bayesian reasoning and machine learning, Cambridge University Press, 2014.

[3]: Richard Durbin, Biological sequence analysis: probabilistic models of proteins and nucleic acids, Cambridge University Press, 1998.

[4]: Kevin Murphy, Machine Learning - A probabilistic perspective, MIT Press, 2012.