# 1. Introduction

A Hidden Markov Model (HMM) describes the joint probability of a collection of 'hidden' and 'observed' random variables. It relies on two assumptions: first, the distribution of each hidden variable depends only on its immediate predecessor, i.e. given the value of the hidden variable at the previous timestep, the current hidden variable is independent of all earlier hidden variables (the first-order Markov property). Second, the current observation is assumed to depend only on the current hidden state (this assumption is actually a little optimistic; see the pages Acoustic Model and Discriminative Training for further information).

The Baum–Welch algorithm uses the EM algorithm to find the maximum likelihood estimate of the parameters of a hidden Markov model given a set of observations.

More mathematically, training a Hidden Markov Model is a parameter estimation task: we want to find the parameters characterizing the joint distribution $p(X, Z \mid \lambda)$, where $\lambda = (\pi, A, \Phi)$ collects the initial state distribution $\pi$, the state transition matrix $A$ and the observation parameters $\Phi$, based on observations $X$.

In a maximum likelihood approach we set the parameters to the values maximizing the probability of the evidence (i.e. the observed data).

In principle the parameters could be determined by directly maximizing the data likelihood obtained through marginalization over the hidden variables. In practice this is infeasible for nontrivial cases because the joint distribution does not factorize over time indices, so the computational effort grows exponentially with the length of the observation sequence (see (Bishop 2006, sec. 13.2.1) for an extensive discussion).

In this article we present the Baum–Welch algorithm, a special case of the general expectation maximization (EM) algorithm, which gives us an iterative procedure to infer the parameters in realistic scenarios.

Our presentation is to a great extent based on (Bishop 2006); additional references can be found in the section Further Reading.

# 2. Baum Welch Algorithm

## 2.1 Preliminaries

We assume a discrete-state Hidden Markov Model with a known number of states $K$. The $K$ hidden states of the latent variables can be represented using a $1$-of-$K$ coding scheme such that each realization is a vector $z$ with exactly one element equal to one and the remaining elements equal to zero.

We represent the transition probabilities in matrix form as

$p\left(z_{n,k}=1 \mid z_{n-1,j}=1\right) = A_{j,k}$

and obtain
$p\left(z_n \mid z_{n-1}, A\right) = \prod_{k=1}^K \prod_{j=1}^K A_{j,k}^{z_{n-1,j}\, z_{n,k}}$
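The double product looks opaque, but with $1$-of-$K$ vectors all exponents vanish except one, so the expression simply selects a single matrix entry. A small numerical sketch, with a hypothetical $3$-state transition matrix:

```python
import numpy as np

# Hypothetical 3-state transition matrix (rows sum to 1).
A = np.array([[0.8, 0.1, 0.1],
              [0.2, 0.6, 0.2],
              [0.3, 0.3, 0.4]])

z_prev = np.array([0, 1, 0])  # previous state j = 1 (1-of-K coding)
z_curr = np.array([0, 0, 1])  # current state  k = 2

# prod_{j,k} A[j,k]^(z_prev[j] * z_curr[k]): all exponents are 0
# except at (j=1, k=2), so the product equals A[1, 2].
p = np.prod(A ** np.outer(z_prev, z_curr))
print(p)  # 0.2
```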

Similarly the emission probabilities can be written as

$p\left(x_n \mid z_n, \Phi\right) = \prod_{k=1}^K p\left(x_n \mid \Phi_k\right)^{z_{n,k}}$

where $\Phi_k$ bundles the parameters for the $k$th component (for $D$-dimensional observations $x$ and Gaussian emissions that would be $\mu_k \in \mathbb{R}^{D}$ and $C_k \in \mathbb{R}^{D \times D}$; for a multinoulli model (see also (Bishop 2006, Appendix B)) simply a vector $\mu_k \in \mathbb{R}^{D}$). Intuitively we can interpret $z_n$ as a gating variable that selects the distribution to use (see also the article on Gaussian Mixture Models).

The joint probability distribution over both latent and observed variables is then
$p(X,Z \mid \theta) = p\left(z_1 \mid \pi\right) \left(\prod_{n=2}^N p\left(z_n \mid z_{n-1}, A\right)\right) \prod_{m=1}^N p\left(x_m \mid z_m, \Phi\right)$
(We bundle observation and hidden variables in matrices $X \in \mathbb{R}^{N \times D}$, $Z \in \mathbb{R}^{N \times K}$ and let $\theta$ denote the collection of parameters $(\pi, A, \Phi)$, where $\pi$ gives the probability distribution over the initial state.)
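To make the factorization concrete, here is a minimal sketch that evaluates the complete-data log likelihood for a hypothetical discrete-emission HMM; all parameter values are made up for illustration:

```python
import numpy as np

# Hypothetical 2-state HMM with 3 discrete observation symbols.
pi = np.array([0.6, 0.4])                 # initial state distribution
A  = np.array([[0.7, 0.3],
               [0.4, 0.6]])               # transition matrix
B  = np.array([[0.5, 0.4, 0.1],
               [0.1, 0.3, 0.6]])          # emission probabilities B[k, x]

def complete_data_log_likelihood(x, z):
    """log p(X, Z | theta) for observation sequence x and state path z."""
    ll = np.log(pi[z[0]]) + np.log(B[z[0], x[0]])
    for n in range(1, len(x)):
        ll += np.log(A[z[n - 1], z[n]]) + np.log(B[z[n], x[n]])
    return ll

x = [0, 1, 2]       # observation sequence
z = [0, 0, 1]       # hidden state path
print(complete_data_log_likelihood(x, z))
```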

## 2.2 Expectation Maximization

Recall that the EM algorithm alternates between computing the posterior over the hidden variables under the current parameters (E-step) and maximizing the resulting expected complete-data log likelihood with respect to the parameters (M-step). More formally, the corresponding objective function is

$Q\left(\theta, \theta^{\text{(old)}}\right) = \sum_Z p\left(Z \mid X, \theta^{\text{(old)}}\right) \log p(X, Z \mid \theta)$

Defining the marginal and pairwise posteriors

$\gamma\left(z_n\right) = p\left(z_n \mid X, \theta^{\text{(old)}}\right)$

$\zeta\left(z_{n-1}, z_n\right) = p\left(z_{n-1}, z_n \mid X, \theta^{\text{(old)}}\right)$

we have

$\sum_{z_n} \gamma\left(z_n\right) z_{n,k} = \mathbb{E}\left[z_{n,k}\right] = \gamma\left(z_{n,k}\right)$

and

$\sum_{z_{n-1}, z_n} z_{n-1,j}\, z_{n,k}\, \zeta\left(z_{n-1}, z_n\right) = \mathbb{E}\left[z_{n-1,j}\, z_{n,k}\right] = \zeta\left(z_{n-1,j}, z_{n,k}\right)$
(The expectation of a binary variable is just the probability that it takes the value $1$.)

such that

$Q\left(\theta, \theta^{\text{(old)}}\right) = \sum_{k=1}^K \log\left(\pi_k\right) \gamma\left(z_{1,k}\right) + \sum_{n=2}^N \sum_{j=1}^K \sum_{k=1}^K \log\left(A_{j,k}\right) \zeta\left(z_{n-1,j}, z_{n,k}\right) + \sum_{n=1}^N \sum_{k=1}^K \gamma\left(z_{n,k}\right) \log p\left(x_n \mid \Phi_k\right)$

### 2.2.1 E-Step

The E-step is almost completely given by the Forward–Backward algorithm; we have

$\gamma\left(z_n\right) = \frac{p\left(z_n\right) p\left(X \mid z_n\right)}{p(X)} = \frac{\alpha\left(z_n\right) \beta\left(z_n\right)}{p(X)}$

where

$p(X) = \sum_{z_N} \alpha\left(z_N\right)$

and

$\zeta\left(z_{n-1}, z_n\right) = \frac{\alpha\left(z_{n-1}\right) \beta\left(z_n\right) p\left(z_n \mid z_{n-1}\right) p\left(x_n \mid z_n\right)}{p(X)}$
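A compact sketch of this E-step for a small discrete-emission HMM with hypothetical parameters. The $\alpha$ and $\beta$ recursions are computed without scaling, which is fine for short sequences but underflows on long ones; practical implementations rescale or work in the log domain:

```python
import numpy as np

# Hypothetical 2-state HMM with discrete emissions.
pi = np.array([0.6, 0.4])
A  = np.array([[0.7, 0.3],
               [0.4, 0.6]])
B  = np.array([[0.5, 0.4, 0.1],
               [0.1, 0.3, 0.6]])
x  = [0, 1, 2]
b  = B[:, x].T          # shape (N, K): b[n, k] = p(x_n | z_n = k)

N, K = b.shape
alpha = np.zeros((N, K))
beta  = np.ones((N, K))

# Forward pass: alpha(z_n) = p(x_1..x_n, z_n)
alpha[0] = pi * b[0]
for n in range(1, N):
    alpha[n] = b[n] * (alpha[n - 1] @ A)

# Backward pass: beta(z_n) = p(x_{n+1}..x_N | z_n)
for n in range(N - 2, -1, -1):
    beta[n] = A @ (b[n + 1] * beta[n + 1])

pX = alpha[-1].sum()                       # p(X) = sum_{z_N} alpha(z_N)
gamma = alpha * beta / pX                  # posteriors gamma(z_{n,k})
zeta = np.array([                          # pairwise posteriors zeta
    np.outer(alpha[n], b[n + 1] * beta[n + 1]) * A / pX
    for n in range(N - 1)
])

print(gamma.sum(axis=1))   # each row sums to 1
```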

### 2.2.2 M-Step

In the M-step we maximize $Q\left(\theta, \theta^{\text{(old)}}\right)$ with respect to the parameters $\theta$, treating $\gamma\left(z_n\right)$ and $\zeta\left(z_{n-1}, z_n\right)$ as constant. Maximizing with respect to $\pi$ and $A$ (using Lagrange multipliers for the normalization constraints) yields

$\pi_k = \frac{\gamma\left(z_{1,k}\right)}{\sum_{j=1}^K \gamma\left(z_{1,j}\right)}, \qquad A_{j,k} = \frac{\sum_{n=2}^N \zeta\left(z_{n-1,j}, z_{n,k}\right)}{\sum_{l=1}^K \sum_{n=2}^N \zeta\left(z_{n-1,j}, z_{n,l}\right)}$

If the parameters of the emission densities $\Phi_k$ are independent, the problem decouples and we can find the ML estimates as follows.

#### 2.2.2.1 Discrete Emissions

For a multinoulli emission model each observation corresponds to a binary vector of length $D$ and we have the conditional observation distribution

$p(x \mid z) = \prod_{d=1}^D \prod_{k=1}^K \mu_{d,k}^{x_d z_k}$

and the corresponding estimate is

$\mu_{d,k} = \frac{\sum_{n=1}^N x_{n,d}\, \gamma\left(z_{n,k}\right)}{\sum_{n=1}^N \gamma\left(z_{n,k}\right)}$
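This update is a one-liner given the observations and the posteriors; in the sketch below both are hypothetical random values, purely to show the shapes:

```python
import numpy as np

rng = np.random.default_rng(0)

N, D, K = 6, 4, 2
X = rng.integers(0, 2, size=(N, D)).astype(float)   # binary observations
gamma = rng.random((N, K))
gamma /= gamma.sum(axis=1, keepdims=True)           # posteriors gamma(z_{n,k})

# mu[d, k] = sum_n x[n, d] * gamma[n, k] / sum_n gamma[n, k]
mu = (X.T @ gamma) / gamma.sum(axis=0)
print(mu.shape)  # (4, 2)
```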

#### 2.2.2.2 Continuous Emissions

A simple but practically relevant case of a continuous observation density is the Gaussian distribution

$p\left(x \mid \Phi_k\right) = \mathcal{N}\left(x \mid \mu_k, C_k\right)$

for which we obtain

$\mu_k = \frac{\sum_{n=1}^N \gamma\left(z_{n,k}\right) x_n}{\sum_{n=1}^N \gamma\left(z_{n,k}\right)}$

and

$C_k = \frac{\sum_{n=1}^N \gamma\left(z_{n,k}\right) \left(x_n - \mu_k\right)\left(x_n - \mu_k\right)^{\mathsf{T}}}{\sum_{n=1}^N \gamma\left(z_{n,k}\right)}$
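The Gaussian updates are weighted sample means and covariances. A sketch with hypothetical posteriors $\gamma$ (random here, purely for illustration; in a real run they come from the E-step):

```python
import numpy as np

rng = np.random.default_rng(1)

N, D, K = 50, 2, 2
X = rng.normal(size=(N, D))                       # observations
gamma = rng.random((N, K))
gamma /= gamma.sum(axis=1, keepdims=True)         # posteriors from the E-step

Nk = gamma.sum(axis=0)                            # effective counts per state
mu = (gamma.T @ X) / Nk[:, None]                  # weighted means, shape (K, D)

C = np.zeros((K, D, D))
for k in range(K):
    Xc = X - mu[k]                                # centered observations
    C[k] = (gamma[:, k, None] * Xc).T @ Xc / Nk[k]  # weighted covariance

print(mu.shape, C.shape)
```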

## 2.3 Example

For illustration let's consider a two-state ($K = 2$) HMM with Gaussian emissions $p(x \mid z_k) = \mathcal{N}(x \mid \mu_k, C_k)$.

We take true parameters

$\pi = \left( \begin{array}{c} 0.9 \\ 0.1 \\ \end{array} \right), \quad A = \left( \begin{array}{cc} 0.95 & 0.05 \\ 0.75 & 0.25 \\ \end{array} \right)$

$\mu_1 = \left( \begin{array}{c} -1 \\ -1 \\ \end{array} \right), \quad C_1 = \left( \begin{array}{cc} 1 & 0.3 \\ 0.3 & 0.8 \\ \end{array} \right)$

$\mu_2 = \left( \begin{array}{c} 2 \\ 2 \\ \end{array} \right), \quad C_2 = \left( \begin{array}{cc} 0.7 & 0.6 \\ 0.6 & 1.0 \\ \end{array} \right)$

A Matlab implementation based on code from Sebastien Paris (available on the Mathworks FileExchange) converges to estimated parameter values

$\pi = \left( \begin{array}{c} 0.999 \\ 0.001 \\ \end{array} \right), \quad A = \left( \begin{array}{cc} 0.9247 & 0.0753 \\ 0.2232 & 0.7768 \\ \end{array} \right)$

$\mu_1 = \left( \begin{array}{c} -1.1382 \\ -1.0514 \\ \end{array} \right), \quad C_1 = \left( \begin{array}{cc} 0.9570 & 0.2242 \\ 0.2242 & 0.7643 \\ \end{array} \right)$

$\mu_2 = \left( \begin{array}{c} 2.1227 \\ 1.8584 \\ \end{array} \right), \quad C_2 = \left( \begin{array}{cc} 0.9784 & 0.7754 \\ 0.7754 & 0.9633 \\ \end{array} \right)$

(Parameters were initialized at random; 100 training samples, 30 iterations of the algorithm.)

The resulting clustering can be visualized as follows

# 3. Discussion

As EM methods are notorious for getting stuck in local optima (the underlying problem is nonconvex), we want to reference alternative approaches and mention some caveats:

• Extensions: Staying within the EM approach, deterministic annealing (see e.g. (Rao 2001)) can mitigate the danger of local optima. It is also advisable to spend some effort on initialisation:

• Use fully labeled data for the initial settings of the parameters.

• Initially ignore the time dependencies, treat the observations as IID random variables and apply standard methods (e.g. K-means).

• Initialize at random, run the algorithm multiple times and pick the best solution.

• A comparative analysis of appropriate stopping (convergence) criteria can be found in (Abbi 2008).

• Bayesian approaches: EM returns a point estimate (maximum likelihood, or maximum a posteriori when a prior is used). More Bayesian alternatives are:

• Markov Chain Monte Carlo (MCMC): Generate hidden paths using forward filtering and backward sampling and obtain the parameters from their posteriors conditioned on the simulated data (see (Fruhwirth-Schnatter 2007) for details).

• Variational Bayes EM (Beal 2003): Use the posterior mean parameters instead of the MAP estimates in the E-step and update the parameters of the conjugate posteriors in the M-step.

• Alternative gradient-based methods allowing online learning are described in (Baldi 1994).

(Abbi 2008) Revlin Abbi, Analysis of stopping criteria for the EM algorithm in the context of patient grouping according to length of stay, 4th International IEEE Conference on Intelligent Systems, 2008.

(Baldi 1994) Pierre Baldi, Smooth on-line learning algorithms for hidden Markov models, Neural Computation, vol. 6, 1994.

(Barber 2014): David Barber, Bayesian reasoning and machine learning, Cambridge University Press, 2014.

(Beal 2003) Matthew Beal, Variational Algorithms for Approximate Bayesian Inference, Ph.D. Thesis, Gatsby Computational Neuroscience Unit, University College London, 2003.

(Bishop 2006): Christopher Bishop, Pattern recognition and machine learning,  Springer,  2006.

(Fruhwirth-Schnatter 2007): Sabine Fruhwirth-Schnatter, Finite Mixture and Markov Switching Models, Springer, 2007.

(Murphy 2012): Kevin Murphy, Machine Learning - A probabilistic perspective, MIT Press, 2012.

## Dynamic time warping

In speech recognition, parameter training determines the model parameters of the acoustic model. However, it can introduce unnecessary complexity. In such situations Dynamic Time Warping (DTW), a template-based method for decoding in speech recognition, makes it possible to decode the feature vectors of a word sequence without training.

# 1. Motivation

In automatic speech recognition, speech signals are represented as temporal sequences of feature vectors, which are compared with all the template words in a vocabulary using a distance metric. In order to find the template word corresponding to a feature vector sequence, two main problems need to be solved: missed endpoints of the word and distortions during speech. Furthermore, the speed and acceleration of speech can differ for a given word [1]. Under these circumstances, dynamic time warping is introduced.

# 2. Dynamic time warping algorithm

In the DTW algorithm, the feature vector sequence of a speech signal and a template word are aligned by warping the time axis iteratively until an optimal match between the two sequences is found. As in Figure 1, the two sequences are $A(i)$ ($i = 1, \ldots, n$) and $B(j)$ ($j = 1, \ldots, m$). So, how does the algorithm work in detail?

Figure 1: The path for the best alignment between the two discrete feature vectors

## 2.1 Obtaining the optimal path mathematically

As in Figure 1, to get the best alignment between the two discrete feature vector sequences (there can be a small difference in their lengths), an optimal path $P = p_1, \ldots, p_s, \ldots, p_k$ (a warping function) through the grid needs to be found.

The optimal path is the path that minimizes the time-normalized distance $D(A,B)$ between the two sequences A and B, which is given by:

$D\left(A, B\right) = \frac{\sum_{s=1}^{k} d(p_{s})\, w_{s}}{\sum_{s=1}^{k} w_{s}}$

where $p_s$ is a point along the path, $d(p_s)$ is the distance between $i_s$ and $j_s$ at this point, and $w_s > 0$ is a weight [2].

## 2.2 Constraints to the optimal path

The number of possible paths grows exponentially with the lengths of A and B. Thus, five constraints on the path are necessary.

1. Monotonicity condition: The indices $i$ and $j$ never decrease.

2. Continuity condition: In each step along the path, the indices $i$ and $j$ can increase by at most one.

3. Boundary condition: As in Figure 1, the path starts at the bottom left and ends at the top right.

4. Warping window condition: As in Figure 1, the path never strays too far from the diagonal.

5. Slope constraint: The slope can be neither too shallow nor too steep [2].

## 2.3 Tracing-back

The advantage of the DTW algorithm is that it always keeps track of the cost of the path to each point it reaches. Sometimes the cost increase is the same when the path is extended from a certain point in two different ways, which leads to multiple candidates for the optimal path during the calculation. In this situation, after reaching the endpoint, a trace-back is needed to recover the optimal path.

## 2.4 The complexity of DTW algorithm

Suppose the length of the feature vector sequence of a speech signal is $m$ and there are in total $t$ template word sequences in the vocabulary with average length $n$; then the complexity of the DTW algorithm is $O(m \cdot n \cdot t)$. With optimizations such as pre-selecting the most likely template words before running DTW, a considerable decrease in complexity can be achieved.
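The dynamic program above can be sketched as follows. This is a minimal version using the absolute difference as the local distance and symmetric weights $w_s$ (diagonal steps weighted $2$, so $\sum_s w_s = n + m$); real recognizers use distances between feature vectors and add the window and slope constraints described earlier:

```python
import numpy as np

def dtw(a, b):
    """Time-normalized DTW distance between 1-D sequences a and b,
    with symmetric step weights (diagonal steps counted twice)."""
    n, m = len(a), len(b)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = abs(a[i - 1] - b[j - 1])
            # monotonic, continuous steps: vertical, horizontal, diagonal
            D[i, j] = min(D[i - 1, j] + d,
                          D[i, j - 1] + d,
                          D[i - 1, j - 1] + 2 * d)
    return D[n, m] / (n + m)   # normalize by the sum of weights

print(dtw([1, 2, 3], [1, 2, 2, 3]))  # 0.0 -- a perfect warp exists
```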

# 3.The application of dynamic time warping

Any kind of data that can be converted into temporal sequences can be processed with DTW [3]. Besides its application to automatic speech recognition, it has been applied to online signature recognition as well.

In speech recognition, there can be multiple different feature vector sequences for the same words or phrases. In order to find out which feature vectors are unreliable and to improve their joint likelihood, one first aligns the multiple feature vector sequences with least distortion using the multi-pattern dynamic time warping (MPDTW) algorithm and then tracks the HMM (Introduction to Hidden Markov Models) evolution of the feature vectors along the optimum MPDTW path [4]. Figure 2 gives an example of an MPDTW path for three utterances, where each axis corresponds to one utterance.

Figure 2: Example MPDTW path for 3 utterances

# 4. References

[1] "Dynamische Programmierung in der Spracherkennung" (Dynamic programming in speech recognition), Damir Culjat, FU Berlin

[2] "Dynamic time warping algorithm for gene expression time series", presentation slides, Elena Tsiporkova

[3] "Dynamic time warping", Wikipedia, http://en.wikipedia.org/wiki/Dynamic_time_warping

[4] "Forward/Backward algorithms for joint multi pattern speech recognition", Nishanth Ulhas Nair and T. V. Sreenivas, Department of Electrical Communication Engineering, Indian Institute of Science, Bangalore

# 1. Introduction

Until recent years, most speech recognition systems used Hidden Markov Models (HMMs) to deal with the temporal variability of speech and Gaussian Mixture Models (GMMs) to determine how well each state of each HMM fits a frame, or a short window of frames, of coefficients representing the acoustic input [1]. GMMs and HMMs co-evolved as a way of doing speech recognition when computers were too slow to explore more computationally intensive approaches such as neural networks [2]. Better training algorithms and faster processors led to renewed interest in neural networks for speech recognition. In 2012, a clear breakthrough came with the introduction of the hybrid context-dependent deep neural network hidden Markov model (CD-DNN-HMM), which outperformed hybrid GMM-HMMs in all usage scenarios [3].

This article describes the CD-DNN-HMM architecture in detail. First, the architecture of the CD-DNN-HMM is explained. Then, the advantages of the CD-DNN-HMM architecture compared to GMM-HMMs are pointed out. In the last section, different experiments confirm the theoretical considerations about the reasons for the strength of the CD-DNN-HMM.

# 2. Architecture

The context dependent deep neural network (CD-DNN) simply replaces the GMM in common GMM-HMM systems. Figure 1 shows the general architecture of a state-of-the-art speech recognition system. Figure 2 shows the architecture of the CD-DNN-HMM introduced in [3].

Figure 1: General architecture of a speech recognition system. Figure 2: Architecture of the CD-DNN-HMM [4].

The neural network consists of an input layer, several hidden layers and an output layer. The input of the neural network is composed of the feature vectors of several frames in the context window. For example, the context window in [3] consists of the feature vectors of 5 predecessor frames, 1 central frame and 5 successor frames. The weights and neurons in the hidden layers give the model enough degrees of freedom to model speech. The model in [3] has 5 hidden layers with 2048 neurons each. The last layer implements the softmax function and produces the output posterior probabilities for the different HMM states. For every target HMM state, the model needs one output neuron. The HMM models the temporal structure of speech by managing the transition probabilities between different states. In [3], the HMM states are senones (tied context-dependent triphone states). With the algorithm of Bengio's work [5], the whole CD-DNN-HMM can be optimized globally using backpropagation.
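A toy forward pass through such a network, with made-up layer sizes (real systems use thousands of senones and 2048-unit hidden layers, and early CD-DNN-HMMs used sigmoid rather than ReLU units):

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

# Hypothetical sizes: 40-dim features, 5+1+5 context window, three
# hidden layers of 128 units, 10 output senone classes.
dim, context, hidden, senones = 40, 11, 128, 10

rng = np.random.default_rng(0)
frames = rng.normal(size=(context, dim))          # context window of frames
x = frames.ravel()                                # concatenated input vector

Ws = [rng.normal(scale=0.1, size=(hidden, context * dim)),
      rng.normal(scale=0.1, size=(hidden, hidden)),
      rng.normal(scale=0.1, size=(hidden, hidden)),
      rng.normal(scale=0.1, size=(senones, hidden))]

h = x
for W in Ws[:-1]:
    h = np.maximum(W @ h, 0)      # hidden layers (ReLU nonlinearity)
posteriors = softmax(Ws[-1] @ h)  # output layer: senone posteriors
print(posteriors.sum())           # posteriors sum to 1
```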

As we can see in the experiments section, the neural network outperforms all GMM-HMM models in different usage scenarios. Since we obtain better results by replacing the GMM with a CD-DNN, the question arises why the CD-DNN outperforms the GMM in all usage scenarios.

### Efficient representation

Speech is produced by modulating a relatively small number of parameters of a dynamical system. This implies that its true underlying structure is much lower-dimensional than is immediately apparent in a window that contains hundreds of coefficients. GMMs are easy to fit to data using the EM algorithm. They can model any distribution given enough components, but they use their parameters inefficiently [2]: each parameter only applies to a very small fraction of the data, and therefore a large number of components has to be trained. For DNNs, the output on each training case is sensitive to a large fraction of the weights, i.e. each parameter of a DNN is influenced by a large part of the training data. In other words, a neural network can share similar portions of the input space to train some hidden units while keeping other units sensitive to a subset of the input features that are significant for recognition. This means that DNNs use their parameters more efficiently than GMMs. Because fewer parameters have to be trained, DNNs can model speech with much less training data.

### Modelling of correlated input data

Almost the entire gain of the DNN is connected to its ability to use a long context window. This long context window consists of input feature vectors that are concatenated from several consecutive speech frames and provides more context information to distinguish among different phones. These concatenated frames are highly correlated because of large overlaps in the speech analysis windows (see Figure 2).

The high correlation leads to ill-conditioned covariance matrices in the GMMs. DNNs use simple perceptrons as their basic classification units and are quite capable of exploiting these highly correlated features. As long as there is no good way to use highly correlated features in GMMs, it seems difficult for GMMs to compete with DNNs [6].

Figures 3 and 4 visualize the extraction of the feature vectors in the context window and the construction of the input for the DNN.

Figure 3: Visualisation of a context window [7]. Figure 4: Visualisation of the input for the neural network [7].

# 4. Experiments

### The small vocabulary  PSC Task

In [6], several experiments related to the effect of the context window size have been conducted.

First, the same input features are used for the GMM and the DNN model. Since the GMM cannot handle highly correlated features, it does not make sense to use 11 concatenated frames, and only the current frame is used. The results for the small PSC task are shown in Table 1. It is quite surprising that the DNN does not yield any better performance than discriminatively trained GMMs if both use the current frame only. If we extend the context window to include more neighboring frames as DNN input features, the word error rate (WER) improves dramatically from 18.0% to 13.4% (using 11 frames). This gain can be attributed to the DNN's ability to use the highly correlated concatenated input feature vectors from several consecutive speech frames within a relatively long context window.

Table 1: The character error rate (CER) for different sizes of the context window and different numbers of hidden layers applied to the small vocabulary PSC task [6].

### Effects on the large vocabulary switchboard task

The experiments were repeated on the switchboard task. The results in Table 2 lead to the same conclusion. If only the current frame is used, the DNN does not surpass GMMs even when the number of hidden layers is increased up to 5. However, the performance of the DNN improves significantly with a longer (5+1+5) context window: the WER is quickly brought down from 42.6% to 31.2% on Hub98 and from 33.1% to 23.7% on Hub01.

Table 2: The character error rate (CER) for different sizes of the context window and different numbers of hidden layers applied to the large vocabulary switchboard task [6].

### Effects on the small vocabulary TIMIT corpus

In [2], varying the number of frames in the input window shows that the best performance on the development set is achieved using 17 frames. Much smaller (7 frames) and much bigger (37 frames) windows give remarkably worse performance. The 17-frame context window (110 ms to 270 ms) covers the average size of phones or syllables. Smaller input windows miss important discriminative information, while networks with larger windows are probably distracted by the almost irrelevant information far from the center of the window.

Figure 5: Phone error rate as a function of the context window size and the number of layers [2].

# References

[1] G. Hinton, L. Deng, D. Yu, G. Dahl, "Deep Neural Networks for Acoustic Modeling in Speech Recognition: The Shared Views of Four Research Groups", IEEE Signal Processing Magazine, vol. 8, issue 6, pp. 82 - 97, Nov. 2012

[2] A. Mohamed, G. Dahl, G. Hinton, "Acoustic Modeling Using Deep Belief Networks", IEEE Transactions on Audio, Speech, and Language Processing, pp. 14 - 22, vol. 20, Jan. 2012

[3] G. Dahl, D. Yu, L. Deng, A. Acero, "Context-Dependent Pre-Trained Deep Neural Networks for Large-Vocabulary Speech Recognition", IEEE Transactions on Audio, Speech and language processing, vol. 20, 30 - 42, Jan. 2012

[4] L. Deng, D. Yu, “Deep Learning: Methods and Applications”, Foundations and Trends in Signal Processing, vol. 7, issues 3-4, p. 249, Jun. 2014.

[5] Y. Bengio, R. de Mori, G. Flammia, R. Kompe, "Global optimization of a neural network-hidden Markov model hybrid", IEEE Transactions on Neural Networks, vol. 3, issue 2, pp. 252 - 259, Mar. 1992

[6] J. Pan, C. Liu, Z. Wang, Y. Hu, H. Jiang, "Investigation of Deep Neural Networks (DNN) for Large Vocabulary Continuous Speech Recognition: Why DNN Surpasses GMMs in Acoustic Modeling", International Symposium on Chinese Spoken Language Processing (ISCSLP), vol. 8, pp. 301 - 305, Dec. 2012

[7] O. Abdel-Hamid, A. Mohamed, H. Jiang, L. Deng, G. Penn, D. Yu, "Convolutional neural networks for speech recognition", IEEE/ACM Transactions on Audio, Speech and Language Processing (TASLP), vol. 22, issue 10, pp. 1533-1545, Oct. 2014

## Discriminative Training

In speech recognition, modeling the speech signal using a Hidden Markov Model (Hidden Markov Models for Speech Recognition) is not entirely accurate. For example, the first-order assumption (the next state depends only on the current state) and the output independence assumption (the current output (observation) is statistically independent of previous outputs (observations)) used in the expectation maximization approach (Expectation Maximization) are not realistic for human speech, despite their computational and statistical convenience.

In order to minimize the recognition error rate, another parameter estimation method, discriminative training, has been developed.

# 1. Basic idea

Discriminative training is used to optimize the model parameters to minimize the recognition error rate on training data. In discriminative training, an objective function with respect to the model parameters is used to express the recognition error: the larger the value of the objective function, the smaller the recognition error. Typical objective functions involving the reference and its competing hypotheses are maximum mutual information (MMI), minimum phone error (MPE) and minimum classification error (MCE) [1].

# 2. Maximum mutual information (MMI)

According to the Bayes rule, the posterior probability is defined as:

$P(W|X)=\frac{P(X|W)P(W)}{\sum_{W'}P(X|W')P(W')}$

where $P(X|W)$ is determined by the HMM (Introduction to Hidden Markov Models) and $P(W)$ is given by the language model.

Now the logarithmic mutual information between $X$, the observed feature vector sequence, and $W$, the reference word sequence, is

$I(W,X)=\log\frac{P(X|W)P(W)}{P(X)P(W)}=\log\frac{P(W|X)}{P(W)}$

If we assume that the prior probability $P(W)$ is uniformly distributed, maximizing the mutual information is equivalent to maximizing the posterior probability. With this assumption, maximizing the posterior probability amounts to maximizing the following objective function:

$F_{MMI}(\theta)=\log P_{r}(X;\theta)-\log P_{c}(X;\theta)$

where $P_{r}(X;\theta) = P(X|W;\theta)P(W)$ is the likelihood of the reference word sequence $W$, $P_{c}(X;\theta) = \sum_{W_{i}'}P(X|W_{i}';\theta)P(W_{i}')$ is the likelihood of the competing hypotheses $W_{1}', W_{2}', \ldots, W_{n}'$, and $\theta$ denotes the parameters to be optimized. In case the prior probability is not uniformly distributed, the priors $P(W)$ and $P(W')$ should be taken into account when computing the objective function.
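A toy numerical sketch of the MMI objective; the likelihood and prior values are made up, and here the denominator sums over all hypotheses (reference included), matching the posterior given by the Bayes rule above:

```python
import numpy as np

# Hypothetical per-hypothesis scores: acoustic likelihoods p(X|W) and
# language-model priors P(W) for the reference and two competitors.
p_x_given_w = {'ref': 1e-4, 'alt1': 4e-5, 'alt2': 1e-5}
p_w         = {'ref': 0.5,  'alt1': 0.3,  'alt2': 0.2}

p_ref  = p_x_given_w['ref'] * p_w['ref']
p_comp = sum(p_x_given_w[w] * p_w[w] for w in p_x_given_w)  # all hypotheses

F_mmi = np.log(p_ref) - np.log(p_comp)  # = log posterior of the reference
print(np.exp(F_mmi))                    # posterior P(ref | X)
```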

To conclude, MMI is computationally more demanding than the maximum likelihood approach because of the generation of a set of competing hypotheses.

# 3. Minimum phone error (MPE)

Unlike MMI, each competing hypothesis contributes differently in MPE, which optimizes the phone error. Its objective function is expressed as:

$F_{MPE}(\theta)=\sum_{i}\frac{\sum_{W_{i}'}P(X|W_{i}';\theta)P(W_{i}')A(W_{i}',W_{i})}{\sum_{W_{i}'}P(X|W_{i}';\theta)P(W_{i}')}$

where $A(W_{i}',W_{i})$ refers to the raw phone accuracy of the competing hypothesis $W_{i}'$ given $W_{i}$, the reference word sequence of the $i$th utterance in the training set.

# 4. Minimum classification error (MCE)

The main difference of MCE from MMI and MPE is that it introduces a notion of distance:

$d(X,\theta)=\log\frac{P(X|W;\theta)}{\left[\frac{1}{N}\sum_{i=1,W_{i}\neq W}^{N}P^{\eta}(X|W_{i};\theta)\right]^{\frac{1}{\eta}}}$

where $N$ is the number of competing hypotheses.

The objective function of MCE based on this "distance" is a sigmoid function:

$F_{MCE}(\theta)=\frac{1}{1+\exp(-(a\, d(X,\theta)+b))}$

The sigmoid function focuses the optimization on the utterances that can be corrected most easily; the parameters $a$ and $b$ control the region of interest for the optimization and the rate of the optimization. The optimization method is gradient descent [1].
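A toy numerical sketch of the MCE distance and objective with made-up likelihoods; with this definition, a positive $d$ means the reference beats the smoothed maximum over its competitors, pushing the sigmoid above $0.5$:

```python
import numpy as np

# Hypothetical likelihoods for the reference and N = 2 competing hypotheses.
p_ref = 1e-4
p_comp = np.array([4e-5, 1e-5])
eta, a, b = 2.0, 1.0, 0.0      # smoothing exponent and sigmoid parameters

# "Distance": log ratio of the reference likelihood to a smoothed
# (eta-norm) maximum over the competing hypotheses.
soft_max = np.mean(p_comp ** eta) ** (1.0 / eta)
d = np.log(p_ref / soft_max)

F_mce = 1.0 / (1.0 + np.exp(-(a * d + b)))
print(F_mce)   # > 0.5: the reference beats its competitors
```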

# 5. A unified view of discriminative training

Given the similarities among the three objective functions (MMI, MPE and MCE), can they be combined into a global objective function? Recent research has shown that a unified objective function can represent these three main discriminative training criteria [2]. Let $O^{(i)}$ be the acoustic observation of the $i$th utterance in the training set; the definitions of $W_{i}$, $W_{i}'$ and $A(W_{i}',W_{i})$ are the same as in the MPE case. Then the unified objective function that optimizes the parameters $\theta$ can be formulated as:

$F_{unified}(\theta)=\sum_{i}f\left(\log\frac{\sum_{W_{i}'}P_{\theta}(O^{(i)},W_{i}')A(W_{i}',W_{i})}{\sum_{W_{i}'\in S^{(i)}}P_{\theta}(O^{(i)},W_{i}')}\right)$

In this function, $P_{\theta}(O^{(i)},W_{i}')$ denotes the joint probability of the observation sequence $O^{(i)}$ and the competing hypothesis sequence $W_{i}'$, $f(\cdot)$ denotes a smoothing function and $S^{(i)}$ denotes the space of competing hypotheses. As we can see from Table 1, with different choices of $f(\cdot)$, $S^{(i)}$ and $A(W_{i}',W_{i})$, various discriminative training criteria can be expressed by $F_{unified}(\theta)$ [3].

| Criterion | $f(\cdot)$ | Hyp. space $S^{(i)}$ | Accuracy |
|---|---|---|---|
| MMI | $x$ | all | $\delta(W_{i}',W_{i})$ |
| MCE | $\frac{-1}{1+\exp(\rho x)}$ | all but $W_{i}$ | $\delta(W_{i}',W_{i})$ |
| MPE | $\exp(x)$ | all | $A(W_{i}',W_{i})$ |

Table 1: Choice of parameters for different discriminative training criteria
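To make the table concrete, here is a minimal sketch of the unified criterion for a single utterance, with the choices of $f(\cdot)$ plugged in. The function names and the log-domain arithmetic are illustrative assumptions, not code from any of the cited systems:

```python
import numpy as np

def unified_objective(log_joint, accuracy, f):
    """Unified criterion for one utterance.

    log_joint : log P_theta(O^(i), W') for each hypothesis W' in S^(i)
    accuracy  : A(W', W_i) per hypothesis (a 0/1 delta for MMI and MCE)
    f         : smoothing function applied to the log-ratio
    """
    log_joint = np.asarray(log_joint, dtype=float)
    m = log_joint.max()
    # accuracy-weighted log-sum minus total log-sum, shifted by m for stability
    num = np.log((np.exp(log_joint - m) * np.asarray(accuracy)).sum()) + m
    den = np.log(np.exp(log_joint - m).sum()) + m
    return f(num - den)

# Choices of f(.) from Table 1 (rho is the MCE smoothing constant):
f_mmi = lambda x: x
f_mpe = np.exp
f_mce = lambda x, rho=1.0: -1.0 / (1.0 + np.exp(rho * x))
```

With the 0/1 delta accuracy, the MMI choice reduces to the log-posterior of the reference hypothesis, as expected from the criterion's definition.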

# 6. Drawbacks and potential solutions

## 6.1 Drawbacks of discriminative training

Discriminative training has the following two disadvantages:

• The complexity of the objective function requires heuristics in the optimization algorithm, and these heuristics come at a large computational cost.
• The additional consideration of competing hypotheses increases the training time considerably.

## 6.2 Potential solutions to the drawbacks

• We can use the generalized Baum–Welch algorithm (GBW) to convert the optimization problem into a simpler convex problem [1].

• Instead of ordinary word lattices, we can use transducer-based lattices for discriminative training, which have the following two advantages [4]:
1. The operations and algorithms developed for weighted finite-state transducers (WFSTs) can be used to manipulate the lattices, increasing the flexibility and efficiency of discriminative training.
2. Since HMM (Introduction to Hidden Markov Models) states are the input symbols of the transducer-based lattice, the space of competing hypotheses can be represented more efficiently, because HMM-state lattices can encode more competing hypotheses than word lattices [3].

# 7. References

[1] R. Hsiao, "Generalized Discriminative Training for Speech Recognition", Language Technologies Institute, School of Computer Science, Carnegie Mellon University.

[2] R. Schlüter, W. Macherey, B. Müller, and H. Ney, "Comparison of discriminative training criteria and optimization methods for speech recognition", Speech Commun., vol. 34, pp. 287–310, 2001.

[3] Y. Zhao, A. Ljolje, D. Caseiro, and B.-H. Juang, "A general discriminative training algorithm for speech recognition using weighted finite-state transducers", Department of Electrical and Computer Engineering, Georgia Institute of Technology, Atlanta, USA.

[4] W. Macherey, L. Haferkamp, R. Schlüter, and H. Ney, "Investigations on error minimizing training criteria for discriminative training in automatic speech recognition", in Proc. Interspeech, 2005.

# 1. Introduction

Restricted Boltzmann Machines are a variant of neural networks. They can be used to classify input data after they have been trained by a training algorithm. Basically, an RBM (Restricted Boltzmann Machine) learns the probability distribution of a data set in the training phase, and can then classify unknown data using this distribution. The idea of Restricted Boltzmann Machines is not new: they were first proposed in 1986, but only recently attracted the interest of researchers, after Geoffrey Hinton and his colleagues invented fast learning algorithms that make use of RBMs [2].

RBMs are the foundation for the different layers in Deep Belief Networks (DBN). In these networks, several RBMs are stacked together to form multiple layers. Each layer learns features from the features extracted by the layer beneath, resulting in a high-level representation of the data.

In contrast to deterministic networks, an RBM is a stochastic network: its neurons are not activated by thresholds, but by probabilities.

An RBM consists of nodes of visible units $x$ and hidden units $h$. Each visible unit is connected to every hidden unit and vice versa (a bipartite undirected graph). Each connection has a weight $w$ associated with it, which is a parameter that is optimized during training. Furthermore, each visible unit carries a bias term $c$ and each hidden unit a bias term $b$. The visible units represent observable data and the hidden units describe the dependencies between the observed variables.

In order to get the probability distribution of the input data, an energy function has to be defined.

# 1.1 Definition

RBMs are often described with the help of an Energy-Based Model, which means that a scalar energy is associated with each configuration of the variables of interest. For RBMs, the goal is to get low levels of energy for trained values of the variable (i.e. vectors that should be recognized) and high levels of energy for all other values. The energy function of the RBM can be written as follows [6]:

$E(\mathbf{x},\mathbf{h}) = -\sum_j\sum_k W_{j,k} h_{j} x_{k} - \sum_k c_k x_k - \sum_j b_j h_j$

Where $W_{j,k}$ is the weight of the connection between hidden unit $h_j$ and visible unit $x_k$, and $c$ and $b$ are the bias vectors of the visible and hidden layers. To obtain a probability distribution from this function, the exponential function is used:

$p(\mathbf{x},\mathbf{h}) = \exp(-E(\mathbf{x},\mathbf{h}))/Z$

Where $Z$ is the partition function: the sum of $\exp(-E(\mathbf{x},\mathbf{h}))$ over all possible configurations of $x$ and $h$. Even though $x$ and $h$ are binary in this example, $Z$ involves an exponential number of terms and is therefore intractable to compute exactly (there are methods to approximate it). From the formula it can be seen that real data (which the neural network should recognize) gets low energy and therefore high probability, while other data gets low probability.
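The energy function and the brute-force computation of $Z$ can be sketched for a toy-sized binary RBM (the function names are our own; the enumeration over all configurations is exactly what becomes infeasible at realistic sizes):

```python
import itertools
import numpy as np

def energy(x, h, W, b, c):
    """E(x, h) = -h^T W x - c^T x - b^T h for binary vectors x and h."""
    return -(h @ W @ x) - (c @ x) - (b @ h)

def partition_function(W, b, c):
    """Z by explicit enumeration of all 2^K visible and 2^H hidden configs.
    Only feasible for toy sizes, which is exactly why Z is intractable."""
    H, K = W.shape
    total = 0.0
    for x in itertools.product([0, 1], repeat=K):
        for h in itertools.product([0, 1], repeat=H):
            total += np.exp(-energy(np.array(x), np.array(h), W, b, c))
    return total
```

With all parameters zero, every configuration has energy 0, so $Z$ is simply the number of configurations, $2^{K+H}$.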

# 1.2 Probabilities

The conditional distributions of $h$ given $x$ (and vice versa) do not depend on $Z$ and factorize into a product:

$p(\mathbf{h}|\mathbf{x})=\prod_j p(h_j|\mathbf{x})$

Because $h$ and $x$ are binary, the probability of each hidden unit can be written as:

$p(h_j=1|\mathbf{x}) = \frac{1}{1+\exp(-(b_j+\mathbf{W}_{j\cdot}\mathbf{x}))}$

So calculating the probability of each $h_j$ is simple. Because the Restricted Boltzmann Machine is symmetric (an undirected graph), it also holds that:

$p(\mathbf{x}|\mathbf{h})= \prod_k p(x_k|\mathbf{h})$

and

$p(x_k=1|\mathbf{h})= \frac{1}{1+\exp(-(c_k+\mathbf{h}^\intercal \mathbf{W}_{\cdot k}))}$
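The two conditional distributions amount to one sigmoid per unit. A minimal sketch, assuming $W$ has one row per hidden unit and one column per visible unit:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def p_h_given_x(x, W, b):
    """Vector of p(h_j = 1 | x): one independent sigmoid per hidden unit."""
    return sigmoid(b + W @ x)

def p_x_given_h(h, W, c):
    """Vector of p(x_k = 1 | h), by the symmetry of the undirected graph."""
    return sigmoid(c + W.T @ h)
```

With zero weights and biases, every unit is on with probability 0.5, as the sigmoid of 0 dictates.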

The unconditional probability $p(\mathbf{x})$ can be written as:

$p(\mathbf{x})= \sum_{\mathbf{h}\in \{0,1\}^H} p(\mathbf{x},\mathbf{h}) = \sum_{\mathbf{h}\in \{0,1\}^H}\exp(-E(\mathbf{x},\mathbf{h}))/Z \\ = \exp\left(\mathbf{c}^\intercal \mathbf{x} + \sum_{j=1}^{H}\log(1+\exp(b_j + \mathbf{W}_{j\cdot}\mathbf{x}))\right)/Z \\ = \exp\left(\mathbf{c}^\intercal \mathbf{x} + \sum_{j=1}^{H}\mathrm{softplus}(b_j + \mathbf{W}_{j\cdot}\mathbf{x})\right)/Z$

From the formula for $p(\mathbf{x})$ it can be seen that, in order to obtain a high probability for an input vector $x$, the corresponding parameters $c$, $b$ and $W$ have to be aligned to $x$.
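The softplus form above can be checked numerically against the explicit sum over hidden configurations: both compute $p(\mathbf{x})$ up to the common factor $1/Z$. A sketch (function names are our own; $\mathrm{softplus}(z)=\log(1+\exp(z))$ is computed with `logaddexp` for stability):

```python
import itertools
import numpy as np

def unnormalized_p_x(x, W, b, c):
    """exp(c^T x + sum_j softplus(b_j + W_j. x)): p(x) up to the factor 1/Z."""
    return np.exp(c @ x + np.logaddexp(0.0, b + W @ x).sum())

def unnormalized_p_x_bruteforce(x, W, b, c):
    """The same quantity via the explicit sum over all 2^H hidden vectors."""
    H = W.shape[0]
    total = 0.0
    for hs in itertools.product([0, 1], repeat=H):
        h = np.array(hs)
        total += np.exp(h @ W @ x + c @ x + b @ h)
    return total
```

The agreement follows because the sum over $\mathbf{h}$ factorizes into a product of per-unit terms $(1+\exp(b_j+\mathbf{W}_{j\cdot}\mathbf{x}))$.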

# 1.3 Training objective

In order to obtain the parameters of the RBM, it has to be trained. As shown in the last section, the probability of real input data has to be maximized. Equivalently, the average negative log-likelihood has to be minimized:

$\frac{1}{T}\sum_t -\log p(\mathbf{x}^{(t)})$

To minimize it, stochastic gradient descent is used, with the gradient:

$\frac{\partial(-\log p(\mathbf{x}^{(t)}))}{\partial \theta} = E_\mathbf{h}\left[ \frac{\partial E(\mathbf{x}^{(t)},\mathbf{h})}{\partial \theta}\,\bigg\vert\, \mathbf{x}^{(t)}\right] - E_{\mathbf{x},\mathbf{h}}\left[ \frac{\partial E(\mathbf{x},\mathbf{h})}{\partial \theta}\right]$

Where $\theta$ is the parameter of the RBM that is being optimized ($w$, $b$ or $c$). The positive phase

$E_\mathbf{h}\left[ \frac{\partial E(\mathbf{x}^{(t)},\mathbf{h})}{\partial \theta}\,\bigg\vert\, \mathbf{x}^{(t)}\right]$

of the formula is easily calculated, but the negative phase

$E_{\mathbf{x},\mathbf{h}}\left[ \frac{\partial E(\mathbf{x},\mathbf{h})}{\partial \theta}\right]$

depends on $Z$ and therefore has to be approximated. For this approximation, the Contrastive Divergence algorithm can be used [3].
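A minimal sketch of one Contrastive Divergence step (CD-1), where the negative phase is approximated by a single Gibbs step from the data; the function name, learning rate, and use of probabilities (rather than samples) in the update are illustrative choices, not the only possible variant:

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def cd1_update(x, W, b, c, lr=0.1):
    """One CD-1 step on a single visible vector x."""
    # Positive phase: hidden probabilities driven by the data vector x
    ph_data = sigmoid(b + W @ x)
    # One Gibbs step: sample h, reconstruct the visible layer, re-infer h
    h = (rng.random(ph_data.shape) < ph_data).astype(float)
    px = sigmoid(c + W.T @ h)
    ph_model = sigmoid(b + W @ px)
    # Parameter updates: <h x^T> under the data minus under the reconstruction
    W += lr * (np.outer(ph_data, x) - np.outer(ph_model, px))
    b += lr * (ph_data - ph_model)
    c += lr * (x - px)
    return W, b, c
```

Repeating this update over the training set drives the model's reconstructions toward the data, without ever computing $Z$.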

# 2. Example

To understand more clearly how an RBM works, consider the following video. In it, digits from the MNIST database [5] are presented to a Restricted Boltzmann Machine. In this case the RBM is given an image recognition task, which is not the topic of this website, but voice (or voice features) could be used just as well; images simply make for a better visualization. The video shows the input digit on the left and the RBM's representation on the right.

The video starts with random weights. As can be seen clearly, the output is only noise and not usable. Next, an RBM that has been pre-trained is used. In this case, the regenerated output is quite good, even though the Restricted Boltzmann Machine sometimes confuses digits that look alike (e.g. 2 and 7). In the last part, the weights have been fine-tuned with the back-propagation algorithm, showing good results with occasional ambiguities.

# References

[2] Hinton, G. (2006). A fast learning algorithm for deep belief nets. Neural computation, 18(7), 1527-1554.

[3] Hinton, G. (2010). A practical guide to training restricted Boltzmann machines. Momentum, 9(1), 926.