# Discriminative Training

In speech recognition, modeling the speech signal with a Hidden Markov Model (Hidden Markov Models for Speech Recognition) is only an approximation. For example, the first-order assumption (the next state depends only on the current state) and the output-independence assumption (the current output (observation) is statistically independent of the previous outputs (observations)), which make maximum likelihood estimation via Expectation Maximization (Expectation Maximization) tractable, are not realistic for human speech, however convenient they are computationally and statistically.

To minimize the recognition error rate directly, an alternative parameter estimation method, discriminative training, has been developed.

# 1. Basic idea

Discriminative training optimizes the model parameters to minimize the recognition error rate on the training data. An objective function of the model parameters is used to express the recognition error: the larger the value of the objective function, the smaller the recognition error. Typical objective functions, defined over the reference and its competing hypotheses, are maximum mutual information (MMI), minimum phone error (MPE) and minimum classification error (MCE) [1].

# 2. Maximum mutual information (MMI)

According to Bayes' rule, the posterior probability is defined as:

$P(W|X)=\tfrac{P(X|W)P(W)}{\sum_{{W}'}P(X|{W}')P({W}')}$

where $P(X|W)$ is determined by the HMM (Introduction to Hidden Markov Models) and $P(W)$ is given by the language model.
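As a small numerical sketch of Bayes' rule above (the log-scores below are hypothetical, not taken from the references), the posterior $P(W|X)$ can be computed from acoustic log-likelihoods and language-model log-priors, normalizing in the log domain for stability:

```python
import math

# Hypothetical log-scores for three candidate word sequences:
# acoustic log-likelihoods log P(X|W) from the HMM and
# language-model log-priors log P(W).
log_acoustic = {"w1": -120.0, "w2": -123.0, "w3": -130.0}  # log P(X|W)
log_prior    = {"w1": -5.0,   "w2": -4.0,   "w3": -6.0}    # log P(W)

# Joint log-scores: log P(X|W) + log P(W)
log_joint = {w: log_acoustic[w] + log_prior[w] for w in log_acoustic}

# Normalize with the log-sum-exp trick to obtain posteriors P(W|X)
m = max(log_joint.values())
log_z = m + math.log(sum(math.exp(v - m) for v in log_joint.values()))
posterior = {w: math.exp(v - log_z) for w, v in log_joint.items()}

print(posterior)  # the posteriors sum to 1
```

The log-sum-exp normalization mirrors the denominator $\sum_{W'}P(X|W')P(W')$ of the posterior formula while avoiding underflow of the tiny raw probabilities.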

Now the logarithmic mutual information between $X$, the observation sequence of feature vectors, and $W$, the reference word sequence, is

$I(W,X)=\log\frac{P(X|W)P(W)}{P(X)P(W)}=\log\left(P(W|X)\times\frac{1}{P(W)}\right)$

If we assume that the prior probability $P(W)$ is uniformly distributed, maximizing the mutual information can be regarded as maximizing the posterior probability. Under this assumption, maximizing the posterior probability is the same as maximizing the following objective function:

$F_{MMI}(\theta)=\log P_{r}(X;\theta)-\log P_{c}(X;\theta)$

where $P_{r}(X;\theta)=P(X|W;\theta)P(W)$ is the likelihood of the reference word sequence $W$ given the observations, $P_{c}(X;\theta)=\sum_{W_{i}'}P(X|W_{i}';\theta)P(W_{i}')$ is the total likelihood of the competing hypotheses over the competing word sequences $W_{1}', W_{2}', \ldots, W_{n}'$, and $\theta$ is the parameter set to be optimized. If the prior probability is not uniformly distributed, the priors $P(W)$ and $P(W')$ must be included when computing the objective function.
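The MMI objective above can be evaluated directly from joint log-scores. A minimal sketch, using hypothetical scores (whether the reference is included in the competing set is a design choice; it commonly is):

```python
import math

def log_sum_exp(xs):
    """Stable log(sum(exp(x))) used to sum likelihoods in the log domain."""
    m = max(xs)
    return m + math.log(sum(math.exp(x - m) for x in xs))

def mmi_objective(log_ref, log_competing):
    """F_MMI = log P_r(X) - log P_c(X), where log_ref is the joint log-score
    log P(X|W)P(W) of the reference and log_competing holds the joint
    log-scores of the competing hypotheses."""
    return log_ref - log_sum_exp(log_competing)

# Hypothetical joint log-scores; the reference is included among the competitors.
log_ref = -125.0
log_competing = [-125.0, -127.0, -136.0]
print(mmi_objective(log_ref, log_competing))
```

When the reference is included in the competing set, the objective is a log-posterior and is at most zero; training pushes it toward zero by concentrating probability mass on the reference.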

To conclude, MMI is computationally more expensive than the maximum likelihood approach because it requires generating a set of competing hypotheses.

# 3. Minimum phone error (MPE)

Unlike MMI, MPE treats each competing hypothesis differently, weighting it by its phone accuracy, and thereby optimizes the phone error directly. Its objective function is expressed as:

$F_{MPE}(\theta)=\sum_{i}\frac{\sum_{W_{i}^{'}}P(X|W_{i}^{'};\theta)P(W_{i}^{'})A(W_{i}^{'},W_{i})}{\sum_{W_{i}^{'}}P(X|W_{i}^{'};\theta)P(W_{i}^{'})}$

where $A(W_{i}^{'},W_{i})$ is the raw phone accuracy of the competing hypothesis $W_{i}^{'}$ given $W_{i}$, the reference word sequence of the $i$th utterance in the training set.
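The per-utterance term of the MPE objective is an expected phone accuracy under the posterior over hypotheses. A sketch for a single utterance, with hypothetical lattice scores and accuracies:

```python
import math

def mpe_objective(log_joint, phone_accuracy):
    """Expected phone accuracy for one utterance:
       sum_W' P(X|W')P(W') A(W',W) / sum_W' P(X|W')P(W'),
    computed from joint log-scores, shifted by the max for stability."""
    m = max(log_joint)
    weights = [math.exp(x - m) for x in log_joint]   # unnormalized posteriors
    z = sum(weights)
    return sum(w * a for w, a in zip(weights, phone_accuracy)) / z

# Hypothetical lattice of 3 competing hypotheses for one utterance;
# phone_accuracy holds the raw accuracies A(W', W), e.g. correct-phone counts.
log_joint = [-125.0, -127.0, -136.0]
phone_accuracy = [5.0, 3.0, 1.0]
print(mpe_objective(log_joint, phone_accuracy))
```

Because the weights form a posterior distribution, the result always lies between the worst and best accuracy in the lattice; maximizing it shifts posterior mass toward hypotheses with fewer phone errors.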

# 4. Minimum classification error (MCE)

The main difference between MCE and both MMI and MPE is that MCE introduces a notion of distance:

$d(X,\theta)=\log\frac{P(X|W;\theta)}{\left[\frac{1}{N}\sum_{i=1,W_{i}\neq W}^{N}P^{\eta}(X|W_{i};\theta)\right]^{\frac{1}{\eta}}}$

where $N$ is the number of competing hypotheses.

The objective function of MCE, built on this "distance", is a sigmoid function:

$F_{MCE}(\theta)=\frac{1}{1+e^{-(a\,d(X,\theta)+b)}}$

The sigmoid function focuses the optimization on the utterances that can be corrected most easily; the parameters $a$ and $b$ control the region of interest and the rate of the optimization. The optimization method is gradient descent [1].
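The distance and its sigmoid can be sketched as follows, in the log domain and following the sign convention of the formulas above (the scores and the choice $\eta=1$ are illustrative assumptions):

```python
import math

def mce_distance(log_ref, log_competing, eta=1.0):
    """d(X) = log P(X|W) - (1/eta) * log[(1/N) * sum_i P(X|W_i)^eta],
    where log_ref and log_competing are acoustic log-likelihoods."""
    n = len(log_competing)
    scaled = [eta * x for x in log_competing]
    m = max(scaled)
    lse = m + math.log(sum(math.exp(x - m) for x in scaled))
    return log_ref - (lse - math.log(n)) / eta

def mce_sigmoid(log_ref, log_competing, a=1.0, b=0.0, eta=1.0):
    """Sigmoid of the distance; a and b set the region of interest and
    the optimization rate, as in the MCE objective above."""
    d = mce_distance(log_ref, log_competing, eta)
    return 1.0 / (1.0 + math.exp(-(a * d + b)))

# Hypothetical scores: the reference outscores both competitors.
print(mce_sigmoid(-125.0, [-127.0, -136.0]))
```

Large $\eta$ makes the soft-max over competitors approach the single best competitor; $\eta=1$ averages their likelihoods, as in the formula's $\frac{1}{N}\sum$ term.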

# 5. A unified view of discriminative training

Given the similarities among the three objective functions, MMI, MPE and MCE, can we combine them into one global objective function? Recent research has shown that a unified objective function can represent these three main discriminative training criteria [2]. Let $O^{(i)}$ denote the acoustic observation of the $i$th utterance in the training set; the definitions of $W_{i}$, $W_{i}'$ and $A(W_{i}',W_{i})$ are the same as in MPE. The unified objective function over the parameters $\theta$ can then be formulated as:

$F_{unified}(\theta)=\sum_{i}f\left(\log\frac{\sum_{W_{i}'\in S^{(i)}}P_{\theta}(O^{(i)},W_{i}')A(W_{i}',W_{i})}{\sum_{W_{i}'\in S^{(i)}}P_{\theta}(O^{(i)},W_{i}')}\right)$

In this function, $P_{\theta}(O^{(i)},W_{i}')$ denotes the joint probability of the observation sequence $O^{(i)}$ and the competing hypothesis $W_{i}'$, $f(\cdot)$ denotes the smoothing function, and $S^{(i)}$ denotes the space of competing hypotheses. As Table 1 shows, with different choices of $f(\cdot)$, $S^{(i)}$ and $A(W_{i}',W_{i})$, various discriminative training criteria can be expressed by $F_{unified}(\theta)$ [3].

| Criterion | $f(x)$ | Hyp. space $S^{(i)}$ | Accuracy |
| --- | --- | --- | --- |
| MMI | $x$ | all | $\delta(W_{i}',W_{i})$ |
| MCE | $\frac{-1}{1+\exp(\rho x)}$ | all but $W_{i}$ | $\delta(W_{i}',W_{i})$ |
| MPE | $\exp(x)$ | all | $A(W_{i}',W_{i})$ |

Table 1: Choice of parameters for different discriminative training criteria
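The unified view is easy to check numerically: plugging the choices from Table 1 into one function recovers each criterion. A sketch for a single utterance with hypothetical scores (hypothesis 0 is the reference):

```python
import math

def unified_objective(log_joint, accuracy, f):
    """One utterance's term of F_unified:
       f(log( sum_W' P(O,W') A(W',W) / sum_W' P(O,W') )),
    with the joint probabilities given as log-scores."""
    m = max(log_joint)
    weights = [math.exp(x - m) for x in log_joint]   # unnormalized posteriors
    z = sum(weights)
    num = sum(w * a for w, a in zip(weights, accuracy))
    return f(math.log(num / z))

# Hypothetical utterance: hypothesis 0 is the reference.
log_joint = [-125.0, -127.0, -136.0]
delta     = [1.0, 0.0, 0.0]    # delta(W', W): 1 only for the reference
phone_acc = [5.0, 3.0, 1.0]    # raw phone accuracies A(W', W)

mmi = unified_objective(log_joint, delta, f=lambda x: x)   # MMI row: f(x) = x
mpe = unified_objective(log_joint, phone_acc, f=math.exp)  # MPE row: f(x) = exp(x)
```

With the delta accuracy and $f(x)=x$, the term reduces to the log-posterior of the reference (the MMI objective); with raw accuracies and $f(x)=\exp(x)$, it reduces to the expected phone accuracy (the MPE objective).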

# 6. Drawbacks and potential solutions

## 6.1 Drawbacks of discriminative training

Discriminative training has two main disadvantages:

• The complexity of the objective function forces heuristics into the optimization algorithm, and these heuristics carry a large computational cost.
• The additional work of evaluating the competing hypotheses increases the training time considerably.

## 6.2 Potential solutions to the drawbacks

• We can use the generalized Baum-Welch algorithm (GBW) to convert the optimization problem into a simpler convex problem [1].
• Instead of ordinary word lattices, we can use transducer-based lattices for discriminative training, which have the following two advantages [4]:
1. The operations and algorithms developed for weighted finite-state transducers (WFSTs) can be used to manipulate the lattices, increasing the flexibility and efficiency of discriminative training.
2. Since HMM (Introduction to Hidden Markov Models) states are the input symbols of a transducer-based lattice, the space of competing hypotheses can be represented more efficiently, because HMM-state lattices can encode more competing hypotheses than word lattices [3].

# 7. References

[1] R. Hsiao, "Generalized Discriminative Training for Speech Recognition", Language Technologies Institute, School of Computer Science, Carnegie Mellon University.

[2] R. Schlüter, W. Macherey, B. Müller, and H. Ney, "Comparison of discriminative training criteria and optimization methods for speech recognition", Speech Commun., vol. 34, pp. 287–310, 2001.

[3] Y. Zhao, A. Ljolje, D. Caseiro, and B.-H. (Fred) Juang, "A general discriminative training algorithm for speech recognition using weighted finite-state transducers", Department of Electrical and Computer Engineering, Georgia Institute of Technology, Atlanta, USA.

[4] W. Macherey, L. Haferkamp, R. Schlüter, and H. Ney, "Investigations on error minimizing training criteria for discriminative training in automatic speech recognition", in Proc. Interspeech, 2005.