# Viterbi Algorithm

Given a Hidden Markov Model and a sequence of observations, the Viterbi Algorithm efficiently determines the most probable sequence of hidden states. In speech recognition, for instance, it is used to find the most probable word for a given sequence of phonemes.

# 1 Motivation

In many applications of Hidden Markov Models, it is of interest to find the most probable state sequence for a given observation sequence. A naive approach would be to compute the probability of every possible state sequence and then choose the most probable one. However, the number of such sequences grows exponentially with the length of the observation sequence, and the computational costs grow with it. Consequently, the naive approach is only feasible for very short sequences and small state spaces. In contrast, the Viterbi Algorithm remains practical for larger problems, since its computational costs grow only linearly in the sequence length (and quadratically in the number of states $N$). The basic idea of the algorithm is to construct the most probable state sequence iteratively: for each state $s_j$ at time $t$, the most probable path is determined using only the most probable paths to the states $s_i$ at time $t-1$ and the transition probabilities $a_{ij}$ from state $s_i$ to state $s_j$.

# 2 Problem Definition

Before turning to the algorithm itself, this section outlines the general problem. Assume that a Hidden Markov Model is given. Let $N$ be the number of states $\left\{s_1, s_2,...,s_N\right\}$ in the model and $M$ the number of discrete observation symbols $\left\{\nu_1, \nu_2,...,\nu_M\right\}$. At time $t$, the observation $o_t$ is made and the Hidden Markov Model is assumed to be in state $q_t$. Furthermore, $a_{ij}$ is defined as the transition probability from state $s_i$ to state $s_j$

$a_{ij} = p\left(q_t = s_j|q_{t-1} = s_i\right),$

and $b_j(k)$ is defined as the observation probability distribution

$b_j(k) = p\left(o_t = \nu_k|q_t = s_j\right)$

and $\pi_i$ is the initial state distribution

$\pi_i = p\left(q_1 = s_i\right).$

The goal of the Viterbi Algorithm is to find the optimal state sequence $\mathbf{q}^* = \left\{q_1^*, q_2^*,...,q_T^* \right\}$, i.e. the sequence whose probability $p(\mathbf{q}^*)$ is maximal given an observation sequence $\left\{o_1, o_2,...,o_T\right\}$. Recall that $q_t$ denotes the state occupied at time $t$.
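The parameters above can be represented directly as arrays. A minimal sketch in Python; all concrete numbers here are arbitrary placeholders, not values from the text:

```python
# Illustrative HMM with N = 3 states and M = 3 observation symbols.
# All probability values below are made up for demonstration.
pi = [0.5, 0.3, 0.2]              # initial distribution: pi[i] = p(q_1 = s_i)
A = [[0.6, 0.3, 0.1],             # transitions: A[i][j] = a_ij = p(q_t = s_j | q_{t-1} = s_i)
     [0.2, 0.5, 0.3],
     [0.1, 0.4, 0.5]]
B = [[0.7, 0.2, 0.1],             # emissions: B[j][k] = b_j(k) = p(o_t = v_k | q_t = s_j)
     [0.1, 0.6, 0.3],
     [0.2, 0.2, 0.6]]

# pi and every row of A and B must each be a probability distribution.
for row in [pi] + A + B:
    assert abs(sum(row) - 1.0) < 1e-9
```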

# 3 Basic Idea

The main idea of the Viterbi Algorithm is that at every time step $t$, the most probable path to each state $s_j$ is determined from the most probable paths at time $t-1$ and the transition probabilities $a_{ij}$. For this purpose, the variable $\delta_t(i)$ is introduced: it gives the probability of the most probable state sequence that ends in state $s_i$ at time $t$ and accounts for the first $t$ observations. Mathematically, $\delta_t(i)$ is defined as:

$\delta_t(i) = \max_{q_1,...,q_{t-1}} p\left(q_1,...,q_{t-1}, q_t = s_i, o_1, o_2,...,o_t\right)$

where $1 \leq i \leq N$. The advantage of this definition is that $\delta_{t+1}$ can be calculated iteratively:

$\delta_1(i) = \pi_i \cdot b_i(o_1) \\ \delta_{t+1}(j) = \max_i \left[\delta_t(i)\, a_{ij}\right] \cdot b_j(o_{t+1}).$
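The recursion is a single update step from $\delta_t$ to $\delta_{t+1}$. A sketch in Python, where `delta`, `A` and `B` are plain nested lists; the function name and all numbers are illustrative:

```python
def viterbi_step(delta, A, B, obs):
    """One recursion step: compute delta_{t+1} from delta_t for observation index obs."""
    N = len(delta)
    new_delta = []
    for j in range(N):
        best = max(delta[i] * A[i][j] for i in range(N))  # max_i delta_t(i) * a_ij
        new_delta.append(best * B[j][obs])                 # ... * b_j(o_{t+1})
    return new_delta

# Tiny two-state example with made-up numbers:
delta = [0.5, 0.1]
A = [[0.9, 0.1], [0.2, 0.8]]
B = [[0.7, 0.3], [0.4, 0.6]]
step = viterbi_step(delta, A, B, 0)  # first entry: max(0.45, 0.02) * 0.7 = 0.315
```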

The probability of the desired optimal state sequence $\mathbf{q}^*$ is equal to $\max_i \delta_T(i)$. However, $\delta_T(i)$ alone does not reveal the state sequence itself. Hence, a second variable $\psi$ is introduced, which records for state $s_j$ at time $t+1$ the most probable predecessor state $s_i$ at time $t$:

$\psi_{t+1}(j) = \arg\max_{1\leq i \leq N} \left[ \delta_t(i)\, a_{ij}\right].$

Finally, the optimal state sequence $\mathbf{q}^*$ can be determined by backtracking:

$q_t^* = \psi_{t+1}\left(q_{t+1}^*\right) \quad \text{for } t = T-1,...,1.$

# 4 The Algorithm

Summarizing the ideas of the last section, the Viterbi Algorithm for finding the optimal state sequence reads:

• Initialization:

$\delta_1(i) = \pi_i \cdot b_i(o_1) \quad \text{for } 1 \leq i \leq N$

$\psi_1(i) = 0 \quad \text{(no previous states)}$

• Recursion: for $t = 2,...,T$ and $j = 1,...,N$

$\delta_t(j) = \max_{1\leq i \leq N} \left[\delta_{t-1}(i)\, a_{ij}\right] \cdot b_j(o_t)$

$\psi_t(j) = \arg\max_{1\leq i \leq N} \left[ \delta_{t-1}(i)\, a_{ij}\right]$

• Termination:

$q_T^* = \arg\max_{1\leq i\leq N} \left[ \delta_T(i)\right]$

$q_t^* = \psi_{t+1}\left(q_{t+1}^*\right) \quad \text{for } t = T-1,...,1$

The input of the algorithm is a given Hidden Markov Model with its parameters and an observation sequence. The output is the most probable state sequence $\mathbf{q}^*$.
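The pseudocode above translates almost line by line into code. A minimal sketch in Python; the function name and the 0-based indexing are our choices, not part of the original formulation:

```python
def viterbi(pi, A, B, observations):
    """Most probable state sequence for a list of observation indices.

    pi: initial distribution, A: transition matrix a_ij,
    B: emission matrix b_j(k), all as plain nested lists (0-based).
    """
    N, T = len(pi), len(observations)
    delta = [[0.0] * N for _ in range(T)]
    psi = [[0] * N for _ in range(T)]

    # Initialization: delta_1(i) = pi_i * b_i(o_1)
    for i in range(N):
        delta[0][i] = pi[i] * B[i][observations[0]]

    # Recursion: delta_t(j) = max_i [delta_{t-1}(i) * a_ij] * b_j(o_t)
    for t in range(1, T):
        for j in range(N):
            scores = [delta[t - 1][i] * A[i][j] for i in range(N)]
            psi[t][j] = max(range(N), key=lambda i: scores[i])
            delta[t][j] = scores[psi[t][j]] * B[j][observations[t]]

    # Termination and backtracking
    q = [0] * T
    q[T - 1] = max(range(N), key=lambda i: delta[T - 1][i])
    for t in range(T - 2, -1, -1):
        q[t] = psi[t + 1][q[t + 1]]
    return q
```

For long sequences, the products of probabilities underflow quickly, so practical implementations usually work with log probabilities and replace the products by sums; the sketch above omits this for clarity.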

# 5 A Small Example

The goal of the following example is to determine the most probable word given the phoneme sequence $\left\{p_1, p_2, p_2, p_3\right\}$. For this, a Hidden Markov Model with three states ($A$, $n$, $e$) is used. It is given in the figure below. For initialization, it is assumed that the three states are initially equally probable, i.e. $\pi_1 = \pi_2 = \pi_3 = 1/3$.

## 5.1 Initialization

For initialization, $\delta_1(1)$, $\delta_1(2)$ and $\delta_1(3)$ are determined first. Hereby, it is assumed that observation $p_1$ occurs.

$\delta_1(1) = \pi_1 \cdot b_1(p_1) = \frac{1}{3} \cdot \frac{8}{10} = \frac{8}{30}$

$\delta_1(2) = \pi_2 \cdot b_2(p_1) = \frac{1}{3} \cdot \frac{1}{10} = \frac{1}{30}$

$\delta_1(3) = \pi_3 \cdot b_3(p_1) = \frac{1}{3} \cdot \frac{1}{10} = \frac{1}{30}$

## 5.2 Recursion

Now $\delta_t(1)$, $\delta_t(2)$ and $\delta_t(3)$ are determined iteratively for $t = 2,3,4$.

time step t = 2:

$\delta_2(1) = \max\left\{\delta_1(1)\cdot a_{11}, \delta_1(2)\cdot a_{21}, \delta_1(3)\cdot a_{31}\right\}\cdot b_1(p_2) \\ = \max\left\{\frac{8}{30}\cdot\frac{2}{10}, \frac{1}{30}\cdot \frac{3}{10}, \frac{1}{30}\cdot \frac{4}{10}\right\} \cdot \frac{1}{10} = \frac{16}{300} \cdot \frac{1}{10} = \frac{16}{3000}$

$\delta_2(2) = \max\left\{\delta_1(1)\cdot a_{12}, \delta_1(2)\cdot a_{22}, \delta_1(3)\cdot a_{32}\right\}\cdot b_2(p_2) \\ = \max\left\{\frac{8}{30}\cdot\frac{4}{10}, \frac{1}{30}\cdot \frac{4}{10}, \frac{1}{30}\cdot \frac{4}{10}\right\} \cdot \frac{8}{10} = \frac{32}{300} \cdot \frac{8}{10} = \frac{256}{3000}$

$\delta_2(3) = \max\left\{\delta_1(1)\cdot a_{13}, \delta_1(2)\cdot a_{23}, \delta_1(3)\cdot a_{33}\right\}\cdot b_3(p_2) \\ = \max\left\{\frac{8}{30}\cdot\frac{4}{10}, \frac{1}{30}\cdot \frac{3}{10}, \frac{1}{30}\cdot \frac{2}{10}\right\} \cdot \frac{1}{10} = \frac{32}{300} \cdot \frac{1}{10} = \frac{32}{3000}$

$\psi_2(1) = \arg\max\left\{\frac{8}{30}\cdot\frac{2}{10}, \frac{1}{30}\cdot \frac{3}{10}, \frac{1}{30}\cdot \frac{4}{10}\right\} = 1$

$\psi_2(2) = \arg\max\left\{\frac{8}{30}\cdot\frac{4}{10}, \frac{1}{30}\cdot \frac{4}{10}, \frac{1}{30}\cdot \frac{4}{10}\right\} = 1$

$\psi_2(3) = \arg\max\left\{\frac{8}{30}\cdot\frac{4}{10}, \frac{1}{30}\cdot \frac{3}{10}, \frac{1}{30}\cdot \frac{2}{10}\right\} = 1$

time step t=3:

$\delta_3(1) = \max\left\{\frac{16}{3000}\cdot\frac{2}{10}, \frac{256}{3000}\cdot \frac{3}{10}, \frac{32}{3000}\cdot \frac{4}{10}\right\} \cdot \frac{1}{10} = \frac{768}{3\cdot10^5}$

$\delta_3(2) = \max\left\{\frac{16}{3000}\cdot\frac{4}{10}, \frac{256}{3000}\cdot \frac{4}{10}, \frac{32}{3000}\cdot \frac{4}{10}\right\} \cdot \frac{8}{10} = \frac{8192}{3\cdot10^5}$

$\delta_3(3) = \max\left\{\frac{16}{3000}\cdot\frac{4}{10}, \frac{256}{3000}\cdot \frac{3}{10}, \frac{32}{3000}\cdot \frac{2}{10}\right\} \cdot \frac{1}{10} = \frac{768}{3\cdot10^5}$

$\psi_3(1) = \arg\max\left\{\frac{16}{3000}\cdot\frac{2}{10}, \frac{256}{3000}\cdot \frac{3}{10}, \frac{32}{3000}\cdot \frac{4}{10}\right\} = 2$

$\psi_3(2) = \arg\max\left\{\frac{16}{3000}\cdot\frac{4}{10}, \frac{256}{3000}\cdot \frac{4}{10}, \frac{32}{3000}\cdot \frac{4}{10}\right\} = 2$

$\psi_3(3) = \arg\max\left\{\frac{16}{3000}\cdot\frac{4}{10}, \frac{256}{3000}\cdot \frac{3}{10}, \frac{32}{3000}\cdot \frac{2}{10}\right\} = 2$

time step t=4:

$\delta_4(1) = \max\left\{\frac{768}{3\cdot10^5}\cdot\frac{2}{10}, \frac{8192}{3\cdot10^5}\cdot \frac{3}{10}, \frac{768}{3\cdot10^5}\cdot \frac{4}{10}\right\} \cdot \frac{1}{10} = \frac{24576}{3\cdot10^7}$

$\delta_4(2) = \max\left\{\frac{768}{3\cdot10^5}\cdot\frac{4}{10}, \frac{8192}{3\cdot10^5}\cdot \frac{4}{10}, \frac{768}{3\cdot10^5}\cdot \frac{4}{10}\right\} \cdot \frac{1}{10} = \frac{32768}{3\cdot10^7}$

$\delta_4(3) = \max\left\{\frac{768}{3\cdot10^5}\cdot\frac{4}{10}, \frac{8192}{3\cdot10^5}\cdot \frac{3}{10}, \frac{768}{3\cdot10^5}\cdot \frac{2}{10}\right\} \cdot \frac{8}{10} = \frac{196608}{3\cdot10^7}$

$\psi_4(1) = \arg\max\left\{\frac{768}{3\cdot10^5}\cdot\frac{2}{10}, \frac{8192}{3\cdot10^5}\cdot \frac{3}{10}, \frac{768}{3\cdot10^5}\cdot \frac{4}{10}\right\} = 2$

$\psi_4(2) = \arg\max\left\{\frac{768}{3\cdot10^5}\cdot\frac{4}{10}, \frac{8192}{3\cdot10^5}\cdot \frac{4}{10}, \frac{768}{3\cdot10^5}\cdot \frac{4}{10}\right\} = 2$

$\psi_4(3) = \arg\max\left\{\frac{768}{3\cdot10^5}\cdot\frac{4}{10}, \frac{8192}{3\cdot10^5}\cdot \frac{3}{10}, \frac{768}{3\cdot10^5}\cdot \frac{2}{10}\right\} = 2$

## 5.3 Termination

By backtracking, the most probable state sequence can be determined:

$q_4^* = \arg\max\left\{\delta_4(1), \delta_4(2), \delta_4(3)\right\} = 3$

$q_3^* = \psi_4(3) = 2$

$q_2^* = \psi_3(2) = 2$

$q_1^* = \psi_2(2) = 1$

Hence, the most probable state sequence $\mathbf{q}^*$ is $\left\{1, 2, 2, 3\right\}$, which corresponds to the word "Anne".
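The hand computation can also be checked mechanically. The following sketch uses the transition and observation probabilities that appear in the calculations above (the figure itself is not reproduced here); states 1, 2, 3 correspond to $A$, $n$, $e$ and the observation symbols to $p_1$, $p_2$, $p_3$:

```python
# Parameters read off from the worked example (0-based indices in code).
pi = [1/3, 1/3, 1/3]
A = [[0.2, 0.4, 0.4],   # a_ij: rows/columns are states A, n, e
     [0.3, 0.4, 0.3],
     [0.4, 0.4, 0.2]]
B = [[0.8, 0.1, 0.1],   # b_j(k): rows are states A, n, e; columns are p1, p2, p3
     [0.1, 0.8, 0.1],
     [0.1, 0.1, 0.8]]
obs = [0, 1, 1, 2]      # the phoneme sequence p1, p2, p2, p3

# Initialization, recursion, and backtracking exactly as in the algorithm.
delta = [[pi[i] * B[i][obs[0]] for i in range(3)]]
psi = [[0, 0, 0]]
for t in range(1, 4):
    row, back = [], []
    for j in range(3):
        scores = [delta[t - 1][i] * A[i][j] for i in range(3)]
        i_best = max(range(3), key=lambda i: scores[i])
        back.append(i_best)
        row.append(scores[i_best] * B[j][obs[t]])
    delta.append(row)
    psi.append(back)

q = [max(range(3), key=lambda i: delta[3][i])]
for t in range(3, 0, -1):
    q.insert(0, psi[t][q[0]])
print([i + 1 for i in q])  # prints [1, 2, 2, 3], i.e. A-n-n-e
```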
