# Legal Disclosure

Information in accordance with section 5 TMG

Antonio Maiolo
Lerchenauerstraße 249c
80995 München

https://antonio-maiolo.com

## Contact

Telephone: 015753029343

## Disclaimer

Accountability for content
The contents of our pages have been created with the utmost care. However, we cannot guarantee the contents' accuracy, completeness or topicality. According to statutory provisions, we are furthermore responsible for our own content on these web pages. In this context, please note that we are accordingly not obliged to monitor merely the transmitted or saved information of third parties, or investigate circumstances pointing to illegal activity. Our obligations to remove or block the use of information under generally applicable laws remain unaffected by this as per §§ 8 to 10 of the Telemedia Act (TMG).

Responsibility for the content of external links (to web pages of third parties) lies solely with the operators of the linked pages. No violations were evident to us at the time of linking. Should any legal infringement become known to us, we will remove the respective link immediately.

Our web pages and their contents are subject to German copyright law. Unless expressly permitted by law (§ 44a et seq. of the copyright law), every form of utilizing, reproducing or processing works subject to copyright protection on our web pages requires the prior consent of the respective owner of the rights. Individual reproductions of a work are allowed only for private use, so must not serve either directly or indirectly for earnings. Unauthorized utilization of copyrighted works is punishable (§ 106 of the copyright law).

# 1. Introduction

A large number of benchmark datasets have been developed for testing different ASR architectures. Figure 1, compiled by NIST, shows the word error rate (WER) over time for a number of increasingly difficult speech recognition tasks. These WERs were obtained using GMM-HMM technology. For one particularly difficult large vocabulary task (Switchboard), the curve stays flat over many years. In 2009, the success of DNNs on the TIMIT task motivated more ambitious experiments with much larger vocabularies and more varied speaking styles. In 2011, the word error rate dropped sharply (marked by the red star in Figure 1) with the introduction of CD-DNN-HMMs [1].

 Figure 1: The evolution of the word error rate (WER) of different speech recognition tasks compiled by NIST [1]

Meanwhile, new large vocabulary benchmark tasks have been developed by three of the biggest speech research groups. This article examines the benchmarks that these groups use to compare the performance of GMMs, DNNs and CNNs on large vocabulary tasks. First, the general procedure for finding the best setup of each architecture is described. After that, five large vocabulary continuous speech recognition (LVCSR) tasks are introduced, together with the setup of the GMM, DNN and CNN for each. The results of all benchmark tasks are summarized in the last section.

# 2. Procedure

The TIMIT Acoustic-Phonetic Continuous Speech Corpus, provided by the Linguistic Data Consortium (LDC), supplies a simple and convenient way of testing new approaches to speech recognition. TIMIT is a good starting point for developing a new approach, especially one that requires a challenging amount of computation. The training set is small enough to make it feasible to try many variations of a new method, and many existing techniques have already been benchmarked on TIMIT's core test set. With TIMIT, it is easy to see whether a new approach is promising by comparing it with existing techniques that have been implemented by their proponents [2]. To find the best configuration of each architecture, the research groups ran a range of experiments and evaluations on TIMIT. They then used the best configuration for the very challenging LVCSR tasks and compared the results to their best GMM-HMM model. The GMMs, DNNs and CNNs that worked best on the TIMIT data formed the starting point for subsequent experiments on much more challenging large vocabulary tasks, which were too computationally intensive to allow extensive exploration of variations in the architecture, the representation of the acoustic input or the training procedure [2].

# 3. Benchmarks on LVCSR Tasks

## 3.1 Switchboard Recognition Task

After the success of convolutional neural networks on the TIMIT data set, Sainath et al. [3] applied hybrid CNN-HMMs to the very challenging Switchboard Recognition Task. The Switchboard collection focused primarily on GSM cellular phone technology and provides over 300 hours of training data. The project's goal was to recruit 190 subjects, balanced by gender and recorded under varied environmental conditions, to participate in five to six minute conversations on GSM cellular phones. The speech data was collected for research, development, and evaluation of automatic speech-to-text systems and allows rigorous comparisons among different ASR techniques. Development is done on the Hub5'00 set, while testing is done on the rt03 set. Performance is reported separately on the Switchboard (SWB) and Fisher (FSH) portions of the set.

### Setup of the GMM-HMM, DNN-HMM and CNN-HMM

The HMM uses 8,260 quinphone states and is combined with different models (GMM, DNN, CNN). The GMM consists of 372K Gaussians, uses feature space maximum likelihood linear regression (fMLLR) for more general speaker adaptation and vocal tract length normalization (VTLN) for male-female normalization, and is trained with expectation maximization (EM). The pre-trained hybrid DNN system uses the same fMLLR features and 8,260 states as the GMM. It takes an 11-frame context around the current frame and uses six hidden layers (2,048 sigmoidal units per layer). The DNN hybrid system is pre-trained, followed by cross-entropy and sequence training. The CNN system is trained with 40-dimensional VTLN-warped mel filter-bank features. The CNN has two convolutional layers with 424 units, four fully connected layers with 2,048 units and a softmax layer with 512 output units. The number of parameters of the CNN matches that of the DNN. No pre-training is performed, only cross-entropy and sequence training. After 40-dimensional features are extracted with principal component analysis (PCA), maximum likelihood GMM training is done, followed by discriminative training [3].

### Results

Table 1 shows the performance of the hybrid CNN-HMM compared to the hybrid DNN-HMM and GMM-HMM. The hybrid CNN offers a 13-33% relative improvement over the GMM-HMM system and a 4-7% relative improvement over the hybrid DNN-HMM.

 Table 1: The performance of the hybrid CNN-HMM compared to the hybrid DNN-HMM and GMM-HMMs [3]

## 3.2 English Broadcast News Recognition Task

Sainath et al. [3] also applied hybrid CNN-HMMs to another challenging task, the English Broadcast News Recognition Task. The English Broadcast News Speech Corpora of the Linguistic Data Consortium contain more than 100 hours of broadcasts from the ABC, CNN and CSPAN television networks and the NPR and PRI radio networks, with corresponding transcripts. The primary motivation for this collection is to provide training data for the DARPA "HUB4" project on continuous speech recognition in the broadcast domain. The acoustic models are trained on 50 hours of data and evaluated with the EARS 2004 development set. Testing is done on the DARPA EARS rt04 evaluation set [3].

### Setup of the GMM-HMM, DNN-HMM and CNN-HMM

The raw acoustic features are 19-dimensional perceptual linear predictive (PLP) features with speaker-based mean, variance, and vocal tract length normalization (VTLN), followed by a linear discriminant analysis (LDA) and then fMLLR. The GMMs are feature-space and model-space discriminatively trained using the boosted maximum mutual information (BMMI) criterion. The GMM-HMM uses 5,999 quinphone states and 150K diagonal-covariance Gaussians. The generatively pre-trained DNN hybrid system has five hidden layers with 1,024 sigmoidal units and one output layer with 512 units. It uses the same fMLLR features and 5,999 quinphone states as the GMM system, but a 9-frame context around the current frame. The DNN training begins with greedy layerwise pre-training, followed by cross-entropy training and then sequence training. The number of parameters of the CNN matches that of the DNN. No pre-training is performed for the CNN, only cross-entropy and sequence training. After 40-dimensional features are extracted with PCA, maximum likelihood GMM training is applied, followed by discriminative training using the BMMI criterion [3].

### Results

Table 2 shows the performance of the hybrid CNN-HMM compared to the hybrid DNN-HMM and GMM-HMM. The hybrid CNN offers a 13-18% relative improvement over the GMM-HMM system and a 10-12% relative improvement over the DNN-HMM [3].

## 3.3 Bing Voice Search Recognition Task

The Switchboard and Broadcast News Recognition Tasks are based on publicly available test sets. Microsoft researchers developed a data set containing 18 hours of Bing mobile voice search data to test the new techniques in their own usage scenarios. The data set has a high degree of acoustic variability caused by noise, music, side-speech, accents, sloppy pronunciation, hesitation, repetition, interruptions, and mobile phone differences [4].

### Setup of the DNN-HMM and CNN-HMM

Initially, a conventional state-tied triphone HMM was built. The HMM state labels were used as the targets in training the DNNs and CNNs, which both followed the standard recipe. Abdel-Hamid et al. [4] investigated the effects of pretraining using a restricted Boltzmann machine (RBM) for the fully connected layers and a convolutional RBM (CRBM), as described in section 4-B of [4], for the convolution and pooling layers. The DNN's first hidden layer had 2000 units, to match the increased number of units in the CNN. The other two hidden layers had 1000 units each. The CNN layer used limited weight sharing, in which the convolution nodes that are pooled together form a section and share their weights. Each section had 84 feature maps. The layer had a filter size of 8, a pooling size of 6, and a shift size of 2. The context window had 11 frames. Frame energy features were not used in these experiments [4].

### Results

Table 3 shows that the CNN improves word error rate (WER) performance over the DNN regardless of whether pretraining is used. The CNN provides about 8% relative error reduction over the DNN in the voice search task without pretraining. With pretraining, the relative word error rate reduction is about 6%. The results show that pretraining the CNN can improve its performance, but the effect of pretraining for the CNN is not as strong as that for the DNN [4].

 Table 3: WER with and without pretraining (PT) on the Bing Voice Search Task [4]

## 3.4 Google Voice Input Recognition Task

Google Voice Input transcribes voice search queries, short messages, e-mails, and user actions from mobile devices. This is a large vocabulary task that uses a language model designed for a mixture of search queries and dictation. Jaitly et al. [5] used approximately 5,870 hours of aligned training data to evaluate the effect of replacing the GMM in Google's conventional GMM-HMM model with a DNN.

### Setup of the GMM-HMM and DNN-HMM

Google's model for this task, which was built from a very large corpus, is a speaker-independent GMM-HMM model composed of context-dependent triphone HMMs with 7,969 senone states. This model uses PLP features that have been transformed by LDA as acoustic input. Semi-tied covariances (STCs) [6] are used in the GMMs to model the LDA-transformed features, and BMMI was used to train the model discriminatively. The DNN had four hidden layers with 2,560 fully connected units per layer and a final "softmax" layer with 7,969 alternative states. The input was 11 contiguous frames of 40 log filter-bank features with no temporal derivatives. Each DNN layer was pretrained for one epoch as a restricted Boltzmann machine (RBM), and the resulting DNN was discriminatively fine-tuned for one epoch. After pretraining, discriminative fine-tuning of the neural network was performed using MMI [2].
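To get a feeling for the size of the DNN described above, the following sketch counts the weights of a plain fully connected network with the stated layer sizes. It assumes one bias per unit and no other parameters, which the source does not specify; the helper name `dense_param_count` is made up for this illustration.

```python
def dense_param_count(layer_sizes):
    """Weights plus biases of a fully connected network (assumed: one bias per unit)."""
    return sum(n_in * n_out + n_out
               for n_in, n_out in zip(layer_sizes[:-1], layer_sizes[1:]))

# 11 contiguous frames of 40 log filter-bank features form the input vector.
input_dim = 11 * 40
# Four hidden layers with 2,560 units each and a softmax over 7,969 states.
layer_sizes = [input_dim] + [2560] * 4 + [7969]

total_params = dense_param_count(layer_sizes)
print(f"input dimension: {input_dim}")
print(f"parameters:      {total_params:,}")
```

Under these assumptions, roughly half of the parameters sit in the output layer, because the number of senone states is much larger than the width of a hidden layer.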

### Results

On a test set of anonymized utterances from the live Voice Input system, the hybrid DNN-HMM system achieved a WER of 12.3%, a 23% relative reduction compared to the best GMM-based system for this task [2].
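All improvements in this article are quoted as relative reductions. As a small worked example, the sketch below shows how a relative reduction is computed and which baseline WER is implied by the two numbers given above. The function names are chosen for this illustration only, and the implied baseline is an inference from the rounded figures, not a number taken from [2].

```python
def relative_reduction(baseline_wer, new_wer):
    """Relative WER reduction of new_wer with respect to baseline_wer."""
    return (baseline_wer - new_wer) / baseline_wer

def implied_baseline(new_wer, rel_reduction):
    """Baseline WER implied by a new WER and a relative reduction."""
    return new_wer / (1.0 - rel_reduction)

dnn_wer = 12.3        # percent, from the text
reported_rel = 0.23   # 23% relative reduction, from the text

gmm_wer = implied_baseline(dnn_wer, reported_rel)
print(f"implied GMM-HMM WER: {gmm_wer:.1f}%")
print(f"round trip:          {relative_reduction(gmm_wer, dnn_wer):.2f}")
```

Note that a 23% relative reduction corresponds to an absolute reduction of only a few percentage points, so the two kinds of figures should not be mixed when comparing tasks.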

# 4. Main results

Table 4 summarizes the results of the large vocabulary benchmark tasks described above. It shows that DNN-HMMs consistently outperform GMM-HMMs that are trained on the same amount of data. Moreover, CNNs with the same number of parameters outperform DNNs.

 Table 4: Results of the large vocabulary benchmark tasks

# References

[1] L. Deng, D. Yu, “Deep Learning: Methods and Applications”, Foundations and Trends in Signal Processing, vol. 7, issues 3-4, p. 249, June 2014.

[2] G. Hinton, L. Deng, D. Yu, G. Dahl, "Deep Neural Networks for Acoustic Modeling in Speech Recognition: The Shared Views of Four Research Groups", IEEE Signal Processing Magazine, vol. 29, issue 6, pp. 82 - 97, Nov. 2012

[3] T. Sainath, A. Mohamed, B. Kingsbury, B. Ramabhadran, "Deep convolutional neural networks for LVCSR", IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 8614 - 8618, May 2013

[4] O. Abdel-Hamid, A. Mohamed, H. Jiang, L. Deng, G. Penn, D. Yu, "Convolutional neural networks for speech recognition", IEEE/ACM Transactions on Audio, Speech and Language Processing (TASLP), vol. 22, issue 10, pp. 1533 - 1545, Oct. 2014

[5] N. Jaitly, P. Nguyen, A. Senior, V. Vanhoucke, “An application of pretrained deep neural networks to large vocabulary speech recognition”, Proceedings of Interspeech 2012, 2012

[6] M. Gales, "Semi-tied covariance matrices for hidden Markov models", IEEE Transactions on Speech and Audio Processing, vol. 7, issue 3, pp. 272 - 281, Aug. 2002

# 1. Introduction

In this article we explain how to infer the posterior marginals of the hidden state variables in an HMM from a sequence of observations (emissions). The algorithm makes use of the principle of dynamic programming (i.e. it breaks the problem down into simpler ones) during two passes over the data. The first pass goes forward in time while the second goes backward in time; hence the name forward-backward algorithm.

(We assume fixed parameters; please see the article on the Baum-Welch algorithm for ways to estimate them.)

We will build up the full algorithm incrementally by

• Deriving conditional independence assumptions on the underlying graphical model
• Applying Bayes sequentially to incorporate previous observations (forward pass)
• Adding a backward pass so that we condition on past and future observations

Our presentation is to a great extent based on [2], from which we also adopted large parts of the notation.

# 2. Forward-Backward Algorithm

## 2.1 Preliminaries

A first order Hidden Markov Model has hidden variables $h_t$ and visible variables $v_t$ corresponding to observed data.

Figure 13.4 from (Barber, 2014)

Recall that we can factorize any joint distribution $p(x_{1:V})$ as

$p(x_{1:V}) = p(x_1)\, p(x_2 | x_1)\, p(x_3 | x_{1:2}) \cdots p(x_V | x_{1:V-1})$

(This is sometimes called the "chain rule of probability"; the Matlab-like notation $i:j$ denotes the ordered index set $\{i, i+1, i+2, \ldots, j-1, j\}$.)

Applying the rule to our model, the joint probability of hidden and observed variables is given by

$p(h_{1:T}, v_{1:T}) = p(h_1)\, p(v_1 | h_1) \prod_{t=2}^{T} p(h_t | h_{t-1})\, p(v_t | h_t)$
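The factorization can be evaluated directly for a small discrete model. The following sketch uses a hypothetical 2-state, 3-symbol HMM; the arrays `pi`, `A` and `B` and their values are assumptions made for illustration and do not come from the references. It checks that the joint sums to one over all hidden and visible sequences of length 3.

```python
import itertools
import numpy as np

def joint_prob(h, v, pi, A, B):
    """p(h_{1:T}, v_{1:T}) for a state sequence h and an observation sequence v.

    pi[i]   = p(h_1 = i)
    A[i, j] = p(h_t = j | h_{t-1} = i)
    B[i, k] = p(v_t = k | h_t = i)
    """
    p = pi[h[0]] * B[h[0], v[0]]
    for t in range(1, len(h)):
        p *= A[h[t - 1], h[t]] * B[h[t], v[t]]
    return p

# Hypothetical model parameters, for illustration only.
pi = np.array([0.6, 0.4])
A = np.array([[0.7, 0.3],
              [0.2, 0.8]])
B = np.array([[0.5, 0.4, 0.1],
              [0.1, 0.3, 0.6]])

T = 3
total = sum(joint_prob(h, v, pi, A, B)
            for h in itertools.product(range(2), repeat=T)
            for v in itertools.product(range(3), repeat=T))
print(total)  # equals 1 up to floating-point error
```

Enumerating all sequences is only feasible for tiny $T$; the recursions derived in the next sections avoid this exponential cost.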

To simplify the factored expressions we want to establish conditional independence relations. While these can be formally shown by d-separation ([4], sec. 10.5.1) or tedious algebraic manipulations, a more intuitive way to assert independence graphically is given by the Bayes ball rules.

Figure 10.9 from ([4], 2012)

The figure above shows the rules that matter most in a Hidden Markov Model. Since blocked paths (through filled, i.e. observed, circles) render the corresponding nodes independent, $h_{t+1}$ is independent of $h_{t-1}$ given $h_t$, and likewise $v_{t-1}$, $v_t$ and $v_{t+1}$ are independent of each other given $h_t$.

## 2.2 Forward Filtering

The goal in forward filtering is to calculate $\alpha(h_t) := p(h_t, v_{1:t})$, that is, the joint probability of the hidden state at the current time $t$ together with all observations up to time $t$.

To this end we first marginalize over the previous state

$p(h_t, v_{1:t}) = \sum_{h_{t-1}} p(h_t, h_{t-1}, v_{1:t})$

and apply the chain rule to obtain

$p(h_t, v_{1:t}) = \sum_{h_{t-1}} p(v_t | v_{1:t-1}, h_t, h_{t-1})\, p(h_t | v_{1:t-1}, h_{t-1})\, p(v_{1:t-1}, h_{t-1})$

Using the conditional independence assumptions of the Markov model we can simplify this equation to read

$p(h_t, v_{1:t}) = \sum_{h_{t-1}} p(v_t | h_t)\, p(h_t | h_{t-1})\, p(v_{1:t-1}, h_{t-1})$

This corresponds to a recursive formulation of the joint probability $\alpha(h_t) := p(h_t, v_{1:t})$:

$p(h_t, v_{1:t}) = \alpha(h_t) = p(v_t | h_t) \sum_{h_{t-1}} p(h_t | h_{t-1})\, \alpha(h_{t-1}), \quad t > 1$

The base case is

$\alpha(h_1) = p(h_1, v_1) = p(h_1)\, p(v_1 | h_1)$

### Interpretation

The $\alpha$-recursion can be understood as a predictor-corrector method:

$\alpha(h_t) = \underbrace{p(v_t | h_t)}_{\text{corrector}} \underbrace{\sum_{h_{t-1}} p(h_t | h_{t-1})\, \alpha(h_{t-1})}_{\text{predictor}}$

The filtered distribution $\alpha(h_{t-1})$ from the previous timestep is propagated forwards by the dynamics for one timestep to reveal a prior distribution at time $t$. This distribution is then modulated by the observation $v_t$, which serves to incorporate the new evidence into a posterior distribution.
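The predictor-corrector structure maps directly onto two array operations. The sketch below is a minimal NumPy version of the $\alpha$-recursion for a discrete HMM, checked against a brute-force sum over all hidden paths. The model parameters are hypothetical values chosen for illustration, and time is 0-based in the code.

```python
import itertools
import numpy as np

def forward(v, pi, A, B):
    """alpha[t, i] = p(h_t = i, v_{1:t}), with A[i, j] = p(h_t = j | h_{t-1} = i)."""
    T, K = len(v), len(pi)
    alpha = np.zeros((T, K))
    alpha[0] = pi * B[:, v[0]]                # base case p(h_1) p(v_1 | h_1)
    for t in range(1, T):
        predictor = alpha[t - 1] @ A          # sum over h_{t-1}
        alpha[t] = B[:, v[t]] * predictor     # corrector p(v_t | h_t)
    return alpha

def brute_force_likelihood(v, pi, A, B):
    """p(v_{1:T}) by summing the joint over all hidden paths (tiny T only)."""
    total = 0.0
    for h in itertools.product(range(len(pi)), repeat=len(v)):
        p = pi[h[0]] * B[h[0], v[0]]
        for t in range(1, len(v)):
            p *= A[h[t - 1], h[t]] * B[h[t], v[t]]
        total += p
    return total

# Hypothetical model parameters, for illustration only.
pi = np.array([0.6, 0.4])
A = np.array([[0.7, 0.3],
              [0.2, 0.8]])
B = np.array([[0.5, 0.4, 0.1],
              [0.1, 0.3, 0.6]])
v = [0, 2, 1, 2]

alpha = forward(v, pi, A, B)
# Normalizing each row gives the filtered posterior p(h_t | v_{1:t}).
filtered = alpha / alpha.sum(axis=1, keepdims=True)
print(filtered)
print(alpha[-1].sum(), brute_force_likelihood(v, pi, A, B))
```

Summing $\alpha(h_T)$ over the final state yields the likelihood $p(v_{1:T})$, which is what the last line compares against the brute-force value.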

## 2.3 Backward Smoothing

If we do not only take into account previous observations but process the data offline, we can possibly obtain a better (smoother) estimate. Parallel smoothing incorporates information from past and future using the fact that $h_t$ (when observed) d-separates the past from the future. Formally, that is

$p(h_t, v_{1:T}) = p(h_t, v_{1:t}, v_{t+1:T}) = \underbrace{p(h_t, v_{1:t})}_{\alpha(h_t)} \underbrace{p(v_{t+1:T} | h_t)}_{\beta(h_t)}$

We see that the desired joint $p(h_t, v_{1:T})$ can be factored into the past and future contributions $\alpha(h_t) := p(h_t, v_{1:t})$ and $\beta(h_t) := p(v_{t+1:T} | h_t)$.

A recursive update formula for $\beta(h_t)$ can be derived as follows:

$p(v_{t:T} | h_{t-1}) = \sum_{h_t} p(v_t, v_{t+1:T}, h_t | h_{t-1})$

Repeatedly applying the chain rule we have

$p(v_{t:T} | h_{t-1}) = \sum_{h_t} p(v_t | v_{t+1:T}, h_t, h_{t-1})\, p(v_{t+1:T}, h_t | h_{t-1})$

and

$p(v_{t:T} | h_{t-1}) = \sum_{h_t} p(v_t | h_t)\, p(v_{t+1:T} | h_t, h_{t-1})\, p(h_t | h_{t-1})$

Since $p(v_{t+1:T} | h_t, h_{t-1}) = p(v_{t+1:T} | h_t) = \beta(h_t)$ by conditional independence, we can identify the recursion

$\beta(h_{t-1}) = \sum_{h_t} p(v_t | h_t)\, p(h_t | h_{t-1})\, \beta(h_t), \quad 2 \leq t \leq T$

with initialisation $\beta(h_T) = 1$.
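The $\beta$-recursion is equally short in code. The following sketch implements it with NumPy and verifies $\beta(h_1)$ against a brute-force sum over all continuations of the hidden path. As before, the model parameters are hypothetical and time is 0-based in the code.

```python
import itertools
import numpy as np

def backward(v, A, B):
    """beta[t, i] = p(v_{t+1:T} | h_t = i), with A[i, j] = p(h_t = j | h_{t-1} = i)."""
    T, K = len(v), A.shape[0]
    beta = np.ones((T, K))                    # initialisation beta(h_T) = 1
    for t in range(T - 1, 0, -1):
        # beta(h_{t-1}) = sum_{h_t} p(v_t | h_t) p(h_t | h_{t-1}) beta(h_t)
        beta[t - 1] = A @ (B[:, v[t]] * beta[t])
    return beta

def brute_force_beta_first(i, v, A, B):
    """p(v_{2:T} | h_1 = i) by enumerating all hidden continuations (tiny T only)."""
    total = 0.0
    for rest in itertools.product(range(A.shape[0]), repeat=len(v) - 1):
        path = (i,) + rest
        p = 1.0
        for t in range(1, len(v)):
            p *= A[path[t - 1], path[t]] * B[path[t], v[t]]
        total += p
    return total

# Hypothetical model parameters, for illustration only.
A = np.array([[0.7, 0.3],
              [0.2, 0.8]])
B = np.array([[0.5, 0.4, 0.1],
              [0.1, 0.3, 0.6]])
v = [0, 2, 1, 2]

beta = backward(v, A, B)
expected_first = np.array([brute_force_beta_first(i, v, A, B) for i in range(2)])
print(beta[0], expected_first)
```

Note that the first observation never enters $\beta$: it is accounted for by $\alpha(h_1)$ in the forward pass.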

## 2.4 Complete Forward-Backward Algorithm

Together the $\alpha$-$\beta$ recursions make up the forward-backward algorithm, and the smoothed posterior distribution is given by

$p(h_t | v_{1:T}) = \frac{\alpha(h_t)\, \beta(h_t)}{\sum_{h_t} \alpha(h_t)\, \beta(h_t)}$

# 3. Illustration

As an illustrative toy example, take a look at the "dishonest casino HMM" described in [4] and [3].

Figure 10.9 from ([4])

Imagine a dishonest casino which may occasionally use a loaded (L) die skewed towards $6$. (For a fair (F) die the emission probabilities are given by a uniform distribution over the integers $\{1, \ldots, 6\}$; for the loaded die the probability of rolling a $6$ is much higher, e.g. $\frac{1}{2}$ in our example.) Our filtering/smoothing task in this setting is to infer, just from a sequence of rolls, when the loaded die was in use.

Typical emissions in that setting may look as follows:

Rolls: 664153216162115234653214356634261655234232315142464156663246

the corresponding (hidden) ground truth would be:

Die:   LLLLLLLLLLLLLLFFFFFFLLLLLLLLLLLLLLFFFFFFFFFFFFFFFFFFLLLLLLLL

The resulting output $p(h_t | v_{1:t})$ after forward filtering is

Figure 10.9a from ([4], 2012)

After forward-backward smoothing we obtain $p(h_t | v_{1:T})$:

Figure 10.9b from (Murphy, 2012)

We see that forward-backward smoothing indeed gives a better (smoother) estimate. If we threshold the estimates at $0.5$ and compare to the true sequence, we find that the filtered estimate makes $71$ errors out of $300$, and the smoothed estimate makes $49$ out of $300$.
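The experiment can be tried on the 60 rolls printed above. The sketch below runs both passes and counts thresholding errors against the printed ground truth. The source only states the loaded-die probability of $\frac{1}{2}$ for a six; the switching probabilities (0.05 and 0.10), the uniform initial state and the even split of the remaining loaded-die mass are assumptions made for this illustration. Because only 60 rolls are used, the error counts will not match the figures quoted for 300 rolls.

```python
import numpy as np

rolls = "664153216162115234653214356634261655234232315142464156663246"
truth = "LLLLLLLLLLLLLLFFFFFFLLLLLLLLLLLLLLFFFFFFFFFFFFFFFFFFLLLLLLLL"

# States: 0 = fair (F), 1 = loaded (L).
pi = np.array([0.5, 0.5])              # assumed initial distribution
A = np.array([[0.95, 0.05],            # assumed switching probabilities
              [0.10, 0.90]])
B = np.array([[1 / 6] * 6,             # fair die: uniform
              [0.1] * 5 + [0.5]])      # loaded die: p(6) = 1/2, rest assumed equal

v = np.array([int(c) - 1 for c in rolls])
T, K = len(v), 2

alpha = np.zeros((T, K))
beta = np.ones((T, K))
alpha[0] = pi * B[:, v[0]]
for t in range(1, T):                  # forward pass
    alpha[t] = B[:, v[t]] * (alpha[t - 1] @ A)
for t in range(T - 1, 0, -1):          # backward pass
    beta[t - 1] = A @ (B[:, v[t]] * beta[t])

filtered = alpha / alpha.sum(axis=1, keepdims=True)    # p(h_t | v_{1:t})
smoothed = alpha * beta
smoothed /= smoothed.sum(axis=1, keepdims=True)        # p(h_t | v_{1:T})

is_loaded = np.array([c == "L" for c in truth])
filter_errors = int(np.sum((filtered[:, 1] > 0.5) != is_loaded))
smooth_errors = int(np.sum((smoothed[:, 1] > 0.5) != is_loaded))
print(f"filtering errors: {filter_errors}/{T}")
print(f"smoothing errors: {smooth_errors}/{T}")
```

For a sequence of this length the unnormalized recursions are still numerically safe; for much longer sequences the values underflow and some form of normalization becomes necessary.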

# 4. Discussion

• As both the forward and the backward recursion involve repeated multiplication with factors $\leq 1$, it is advisable to work in the log domain. Alternatively, if one is only interested in the posteriors, normalizing at each stage such that $\sum_{h_t} \beta(h_t) = 1$ (and likewise for $\alpha$) is also a viable approach.
• An alternative to the parallel approach presented here is correction smoothing (Barber 2014, sec. 23.2.4), which forms a direct recursion for the posterior $\gamma(h_t) := p(h_t | v_{1:T})$, namely $\gamma(h_t) = \sum_{h_{t+1}} p(h_t | h_{t+1}, v_{1:t})\, \gamma(h_{t+1})$. This procedure, also referred to as the Rauch-Tung-Striebel method, is sequential because the $\alpha$ recursions must be completed before the $\gamma$ recursions can be started.
• In contrast to Viterbi decoding, which computes the maximum a posteriori (MAP) state sequence $\operatorname{argmax}_{h_{1:T}} p(h_{1:T} | v_{1:T})$, the approach presented here can be used to compute the maximizer of the posterior marginals (MPM) $\left[\operatorname{argmax}_{h_1} p(h_1 | v_{1:T}), \ldots, \operatorname{argmax}_{h_T} p(h_T | v_{1:T})\right]$. The advantage of the joint MAP estimate is that it is always globally consistent, which is desirable if the data should be explained by a single consistent (e.g. linguistically plausible) path. On the other hand, the MPM is more robust, since for each node we average over the values of its neighbours rather than conditioning on a specific value (see (Murphy 2012, sec. 17.4.4.1) for additional details). With regard to complexity, both algorithms generally require $O(K^2 T)$ time and $O(K T)$ space, where $K$ is the number of hidden states.
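The normalization mentioned in the first point can be sketched as follows. The function below rescales $\alpha$ and $\beta$ at every step and accumulates the log-likelihood from the forward scaling constants. It is one possible variant under stated assumptions rather than the exact scheme of any of the references; the model parameters are hypothetical, and the function name is made up for this illustration.

```python
import numpy as np

def forward_backward_scaled(v, pi, A, B):
    """Smoothed posteriors p(h_t | v_{1:T}) and log p(v_{1:T}) with per-step scaling."""
    T, K = len(v), len(pi)
    alpha_hat = np.zeros((T, K))
    beta_hat = np.full((T, K), 1.0 / K)        # each row sums to 1
    log_lik = 0.0
    for t in range(T):
        prior = pi if t == 0 else alpha_hat[t - 1] @ A
        a = B[:, v[t]] * prior
        c = a.sum()                            # c_t = p(v_t | v_{1:t-1})
        alpha_hat[t] = a / c                   # p(h_t | v_{1:t})
        log_lik += np.log(c)
    for t in range(T - 1, 0, -1):
        b = A @ (B[:, v[t]] * beta_hat[t])
        beta_hat[t - 1] = b / b.sum()          # only the direction of beta matters
    post = alpha_hat * beta_hat
    post /= post.sum(axis=1, keepdims=True)
    return post, log_lik

# Hypothetical model parameters, for illustration only.
pi = np.array([0.6, 0.4])
A = np.array([[0.7, 0.3],
              [0.2, 0.8]])
B = np.array([[0.5, 0.4, 0.1],
              [0.1, 0.3, 0.6]])

# A long random sequence on which the unscaled recursions would underflow.
rng = np.random.default_rng(0)
v_long = rng.integers(0, 3, size=5000)
post_long, log_lik_long = forward_backward_scaled(v_long, pi, A, B)

# Cross-check against the unscaled recursions on a short sequence.
v_short = [0, 2, 1, 2]
alpha = np.zeros((4, 2))
beta = np.ones((4, 2))
alpha[0] = pi * B[:, v_short[0]]
for t in range(1, 4):
    alpha[t] = B[:, v_short[t]] * (alpha[t - 1] @ A)
for t in range(3, 0, -1):
    beta[t - 1] = A @ (B[:, v_short[t]] * beta[t])
reference = alpha * beta / (alpha * beta).sum(axis=1, keepdims=True)
post_short, log_lik_short = forward_backward_scaled(v_short, pi, A, B)
print(np.abs(post_short - reference).max(), log_lik_long)
```

Scaling leaves the posteriors unchanged because any per-time-step constant cancels in the final normalization, while the product of the forward constants $c_t$ recovers the likelihood.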

[1]: Christopher Bishop, Pattern recognition and machine learning,  Springer,  2006.

[2]: David Barber, Bayesian reasoning and machine learning, Cambridge University Press, 2014.

[3]: Richard Durbin, Biological sequence analysis: probabilistic models of proteins and nucleic acids, Cambridge University Press, 1998.

[4]: Kevin Murphy, Machine Learning - A probabilistic perspective, MIT Press, 2012.

# 1. Introduction

Convolutional Neural Networks (CNNs) have shown success in achieving translation invariance through local filtering and max-pooling in many image processing tasks. It seems intuitive to use CNNs for speech recognition to improve the hybrid CD-DNN-HMM model. In [1], the concepts of CNNs were applied in the frequency domain to normalize speaker variance and thereby achieve higher multi-speaker speech recognition performance.

The proposed CNN architecture is evaluated on a speaker-independent speech recognition task using the standard TIMIT data sets. Experimental results show that the proposed CNN method can achieve over 10% relative error reduction on the speaker-independent TIMIT test sets compared with a regular CD-DNN using the same number of hidden layers and weights. All experimental results can be found in the benchmarks section.

In this article we explain the three concepts that a CNN adds to a simple fully connected feed-forward NN (local filters, max-pooling and weight sharing) and how these concepts are applied to speech recognition.

# 2. Architecture

A CNN consists of one or more pairs of convolution and pooling layers. A convolution layer applies a set of filters to small local parts of the whole input space. A max-pooling layer takes the maximum filter activation from different positions within a specified window. The outcome is a lower resolution version of the convolution layer. This approach adds translation invariance and tolerance to small differences in the positions of object parts. Higher layers work on lower resolution inputs and process the already extracted higher-level representation of the input. The last layers are fully connected layers which combine inputs from all positions to classify the overall input [2].

Figure 1 shows the architecture of the CD-CNN-HMM and visualizes the different concepts of CNNs. These different concepts of a CNN and its application to speech recognition will be explained in the following sections.

 Figure 1: The architecture of the CD-CNN-HMM and the different concepts: local filters, weight sharing, max-pooling [1]

## 2.1 Local filters

Speech signals have local characteristics along the frequency axis. Different phonemes produce different energy patterns in different local bands along the frequency axis. These local patterns become the critical components for distinguishing different phonemes. Each node in the convolutional layer of a CNN receives its input only from a limited bandwidth of the whole speech spectrum. The weights of the node's receptive field can be configured to detect specific local patterns. The detected patterns are stored in feature maps, which form an alternative representation of the speech signal. These feature maps are eventually used to recognize each phone. This strategy is better than representing the entire frequency spectrum as a whole, as is done in a GMM. Another benefit of local filters is better robustness against noise. When noise is concentrated in specific regions of the frequency spectrum, local filters in cleaner regions can still detect speech features to distinguish between different phones. Linear spectrum, Mel-scale spectrum or filter-bank features are all suitable for local filtering. MFCCs cannot be used for CNNs because the DCT-based decorrelation destroys the local characteristics of the signal [3].
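The noise robustness argument can be made concrete with a small sketch. It applies one local filter along the frequency axis of a single frame and shows that noise confined to the upper bands leaves the units looking at the lower bands untouched. The use of 40 bands, a single frame and random values are simplifying assumptions for illustration; the systems in [3] also include several context frames per band.

```python
import numpy as np

def conv_along_frequency(x, w, b=0.0):
    """Apply one local filter at every position along the frequency axis.

    x: (num_bands,) filter-bank energies of one frame
    w: (filter_size,) weights of the receptive field
    Returns (num_bands - filter_size + 1,) sigmoid activations (one feature map).
    """
    n = len(x) - len(w) + 1
    z = np.array([x[i:i + len(w)] @ w for i in range(n)]) + b
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
x = rng.normal(size=40)      # assumed: 40 filter-bank bands
w = rng.normal(size=8)       # filter size 8
feature_map = conv_along_frequency(x, w)

# Add strong noise to bands 30..39 only.
x_noisy = x.copy()
x_noisy[30:] += rng.normal(scale=5.0, size=10)
noisy_map = conv_along_frequency(x_noisy, w)

# Units 0..22 only see bands 0..29 and are therefore unaffected.
unchanged = np.isclose(feature_map, noisy_map)
print(unchanged)
```

In a fully connected layer every unit sees every band, so the same noise would perturb all activations.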

## 2.2 Max-pooling

As mentioned above, a speech spectrum includes many local structures, and each local structure usually appears within a limited frequency range. For example, the central frequencies of the formants of the same phoneme may vary within a limited range across different speakers and across different utterances from the same speaker. These shifts are difficult to handle with other models such as GMMs and DNNs. In CNNs, feature values computed at different locations are pooled together via max-pooling and represented by one value [3]. The max-pooling function outputs the maximum value of its receptive field. This way, the max-pooling layer generates a lower resolution version of the convolution layer by performing this maximization every n bands, where n is the sub-sampling factor. This lower resolution version retains the most useful information, which can be further processed by higher layers in the NN hierarchy. Much of the variability caused by different speakers can be handled this way [2].
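A minimal sketch of the pooling operation illustrates the tolerance to small frequency shifts. It pools one feature map with a window of 6 bands moved in steps of 2 and compares a single activation peak at band 10 with the same peak shifted to band 11. The feature map length of 33 and the peak positions are assumptions chosen for this example.

```python
import numpy as np

def max_pool(feature_map, pool_size, shift):
    """Maximum over windows of pool_size adjacent bands, moved by shift bands."""
    n = (len(feature_map) - pool_size) // shift + 1
    return np.array([feature_map[i * shift:i * shift + pool_size].max()
                     for i in range(n)])

# A single strong activation (e.g. a formant peak) at band 10 ...
a = np.zeros(33)
a[10] = 1.0
# ... and the same peak shifted up by one band.
b = np.zeros(33)
b[11] = 1.0

pooled_a = max_pool(a, pool_size=6, shift=2)
pooled_b = max_pool(b, pool_size=6, shift=2)
print(pooled_a)
print(pooled_b)
```

Both peaks fall into the same pooling windows, so the pooled representations are identical although the inputs differ. Larger shifts would eventually move the peak into different windows, which is why the invariance only holds within a limited range.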

## 2.3 Weight sharing

### Full weight sharing (FWS)

The weight sharing scheme in figure 2.1 is full weight sharing (FWS). That means each feature map in the convolution layer applies the same filter weights at every position of the input space, as shown in figure 2.2. The output of the convolution layer can be seen as a convolution of the filter weights and the input signals. This is the standard for CNNs in image processing, since the same patterns (edges, blobs, etc.) could appear at any location in an image [3]. For example, a set of filters that work as edge detectors can be applied to the whole image irrespective of any particular position [2].

 Figure 2.1: Illustration of the architecture for full weight sharing [2]
 Figure 2.2: The weight matrix for 80 feature maps looking for 80 different features in each local band. The filter weights in each band are the same. [3]
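Full weight sharing along the frequency axis can be sketched in a few lines. The function below applies the same filter bank at every position of a single frame; the sizes (40 bands, 150 feature maps, filter size 8) and the random values are assumptions for illustration. The comparison with `np.convolve` shows that each feature map is an ordinary convolution with the flipped filter.

```python
import numpy as np

def fws_conv(x, W):
    """Full weight sharing: the same filters W are applied at every position.

    x: (num_bands,) input frame
    W: (num_maps, filter_size) one filter per feature map
    Returns (num_maps, num_positions) pre-activations.
    """
    num_maps, f = W.shape
    windows = np.stack([x[i:i + f] for i in range(len(x) - f + 1)])  # (positions, f)
    return W @ windows.T

rng = np.random.default_rng(0)
x = rng.normal(size=40)           # assumed: 40 frequency bands
W = rng.normal(size=(150, 8))     # 150 feature maps, filter size 8
conv = fws_conv(x, W)

# np.convolve flips its second argument, so the filter is reversed first.
reference = np.convolve(x, W[0][::-1], mode="valid")
print(conv.shape, np.abs(conv[0] - reference).max())
```

The number of weights is independent of the number of positions, which is what makes this scheme economical for images.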

### Limited weight sharing (LWS)

In speech signals, the local structures appearing in different frequency bands are quite different. Therefore, it may be better to limit weight sharing to local filters that are close to each other and will be pooled together in the max-pooling layer. That means that one set of filter weights is used for each pooling band. This weight sharing strategy is called limited weight sharing (LWS). As a result, the convolution layer is divided into a number of convolution sections. Figure 3.1 visualizes the LWS scheme for CNNs. Only the convolution units that are processed by the same pooling unit share the same filter weights. These convolution units need to share their weights so that they compute the same features, which may then be pooled together. Figure 3.2 shows the weight matrix for limited weight sharing.

 Figure 3.1: Illustration of the architecture for limited weight sharing [2]
 Figure 3.2: The weight matrix for 80 feature maps looking for 80 different features in each local band. The filter weights in each band are the same. [3]
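The following sketch contrasts limited with full weight sharing. Each section owns its own filter set, applies it to the positions that feed one pooling unit and takes the maximum over them. The layout of the sections (start positions spaced by the shift size), the sizes and the omission of the nonlinearity are simplifying assumptions for illustration, not the exact configuration of [3].

```python
import numpy as np

def lws_conv_pool(x, W, pool_size, shift):
    """Limited weight sharing: one filter set per section, max-pooled per section.

    x: (num_bands,) input frame
    W: (num_sections, num_maps, filter_size)
    Returns (num_sections, num_maps) pooled pre-activations.
    """
    num_sections, num_maps, f = W.shape
    out = np.empty((num_sections, num_maps))
    for s in range(num_sections):
        start = s * shift
        windows = np.stack([x[start + i:start + i + f] for i in range(pool_size)])
        out[s] = (W[s] @ windows.T).max(axis=1)   # max over the section's positions
    return out

rng = np.random.default_rng(0)
num_bands, f, pool_size, shift, num_maps = 40, 8, 6, 2, 80
num_sections = (num_bands - f + 1 - pool_size) // shift + 1
x = rng.normal(size=num_bands)
W = rng.normal(size=(num_sections, num_maps, f))
pooled = lws_conv_pool(x, W, pool_size, shift)

# Sanity check: with identical filters in every section, the result equals
# full weight sharing followed by max-pooling.
W_shared = np.repeat(W[:1], num_sections, axis=0)
conv = W[0] @ np.stack([x[i:i + f] for i in range(num_bands - f + 1)]).T
fws_pooled = np.stack([conv[:, s * shift:s * shift + pool_size].max(axis=1)
                       for s in range(num_sections)])
print(pooled.shape, np.allclose(lws_conv_pool(x, W_shared, pool_size, shift), fws_pooled))
```

The price of the extra flexibility is that the number of filter weights grows with the number of sections, while under full weight sharing it does not depend on the number of bands.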

# 3. Experiments

The experiments in this section were conducted on the TIMIT data set to evaluate the effectiveness of CNNs in ASR. For the results of this approach on more challenging large vocabulary tasks, see the benchmarks section [3].

## 3.1 Configuration of the CNN

The experiments on the CNNs in [3] were conducted using both the full weight sharing (FWS) and the limited weight sharing (LWS) scheme. First, the ASR performance of CNNs is evaluated under different settings of the CNN parameters. In these experiments, one convolution layer, one pooling layer and two fully connected hidden layers on top are used. The fully connected layers had 1000 units each. The convolution and pooling parameters were: a filter size of 8, 150 feature maps for FWS and 80 feature maps per frequency band for LWS, a pooling size of 6 and a shift size of 2.

## 3.2 Effects of Varying CNN Parameters

In this section, we analyze the effects of changing different CNN parameters. The results of these experiments on both the core test set and the development set are shown in figure 4.1 and figure 4.2. Pooling size and the number of feature maps have the most significant impact on the final ASR performance. All configurations reach better performance with increasing pooling size up to 6. A larger number of feature maps usually leads to better performance, especially with FWS. The results also show that LWS can achieve better performance with a smaller number of feature maps than FWS, due to its ability to learn different feature patterns for different frequency bands. This indicates that the LWS scheme is more efficient in terms of the number of hidden units [3]. With a full weight sharing CNN, the relative reduction in phone error rate (PER) compared to the DNN without convolution was more than 5%. With limited weight sharing, the relative reduction exceeded 10%.

 Figure 4.1: Effects of the pooling size on phone error rate (PER) [3]
 Figure 4.2: Effects of the number of feature maps on PER  [3]

## 3.3 Overall Performance Experiments

In [3], the overall performance of different CNN configurations is compared with a baseline DNN system on the same TIMIT task. All results of the comparison are listed in Table 1, along with the number of parameters (weights) and computations (op's) in each model. The first two rows show the results for DNNs of different depth. The CNN with the best performance in the previous section is used in rows 3 and 4. This CNN has a filter size of 8, a pooling size of 6, and a shift size of 2. The number of feature maps was 150 for LWS and 360 for FWS. The CNN with LWS gave more than an 8% relative reduction in PER over the DNN. LWS was slightly better than FWS even with less than half the number of units in the pooling layer.

In the fifth row two pairs of convolution and pooling layers with FWS in addition to two fully connected hidden layers on top were used. Row 6 shows the performance for the same model when the second convolution layer uses LWS. These tests show only minor differences to the performance of one convolution layer, but using two convolution layers tends to result in a smaller number of parameters as the fourth column shows.

 Table 1: Performance on TIMIT with different CNN configurations. 'm' is the number of feature maps. 'p' is the pooling size, 's' is the shift size and 'f' is the filter size [3].

# References

[1] O. Abdel-Hamid, L. Deng, D. Yu, "Exploring convolutional neural network structures and optimization techniques for speech recognition", Interspeech 2013

[2] O. Abdel-Hamid, A. Mohamed, H. Jiang, G. Penn, "Applying Convolutional Neural Networks concepts to hybrid NN-HMM model for speech recognition", IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 4277 - 4280, March 2012

[3] O. Abdel-Hamid, A. Mohamed, H. Jiang, L. Deng, G. Penn, D. Yu, "Convolutional Neural Networks for Speech Recognition", IEEE/ACM Transactions on Audio, Speech and Language Processing (TASLP), vol. 22, issue 10, pp. 1533-1545, Oct. 2014

## HMM Algorithms

content based on

@book{murphy2012,
title={Machine learning: a probabilistic perspective},
author={Murphy, Kevin P},
year={2012},
publisher={MIT press}
}