1. Introduction

Convolutional Neural Networks (CNNs) have shown success in achieving translation invariance through local filtering and max-pooling in many image processing tasks. It therefore seems intuitive to use CNNs in speech recognition to improve the hybrid CD-DNN-HMM model. In [1], CNN concepts were applied in the frequency domain to normalize speaker variability and achieve higher multi-speaker speech recognition performance.

The proposed CNN architecture is evaluated on a speaker-independent speech recognition task using the standard TIMIT data sets. Experimental results show that the proposed CNN method achieves over 10% relative error reduction on the speaker-independent TIMIT test sets compared with a regular CD-DNN using the same number of hidden layers and weights. All experimental results can be found in the benchmarks section.

This article explains the three concepts that CNNs add over a simple fully connected feed-forward NN, and how these concepts are applied to speech recognition.

2. Architecture

A CNN consists of one or more pairs of convolution and pooling layers. A convolution layer applies a set of filters to small local regions of the whole input space. A max-pooling layer takes the maximum filter activation within a specified window at different positions, yielding a lower-resolution version of the convolution layer. This approach adds translation invariance and tolerance to small differences in the positions of object parts. Higher layers operate on these lower-resolution inputs and process the already extracted higher-level representation of the input. The last layers are fully connected layers which combine inputs from all positions to classify the overall input [2].
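As a rough sketch (plain Python, not the papers' implementation), a single convolution-plus-pooling pair over a one-dimensional input can be written as:

```python
def convolve(inputs, filters):
    """Slide each filter over the input; with full weight sharing the
    same filter weights are reused at every position."""
    f = len(filters[0])
    return [[sum(w * x for w, x in zip(flt, inputs[i:i + f]))
             for i in range(len(inputs) - f + 1)]
            for flt in filters]

def max_pool(feature_map, size, shift):
    """Keep only the maximum activation in each window of `size` positions,
    moving the window by `shift` -- a lower-resolution summary."""
    return [max(feature_map[i:i + size])
            for i in range(0, len(feature_map) - size + 1, shift)]
```

Stacking further convolution/pooling pairs on the pooled outputs, with fully connected layers on top, yields the overall architecture of figure 1.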

Figure 1 shows the architecture of the CD-CNN-HMM and visualizes the different CNN concepts. These concepts and their application to speech recognition are explained in the following sections.



Figure 1: The architecture of the CD-CNN-HMM and the different concepts: local filters, weight sharing, max-pooling [1]

2.1 Local filters

Speech signals have local characteristics along the frequency axis. Different phonemes produce different energy patterns in different local bands along the frequency axis, and these local patterns become the critical components for distinguishing phonemes. A node in the convolutional layer of a CNN receives its input only from a limited bandwidth of the whole speech spectrum. The weights of a node's receptive field can be configured to detect specific local patterns. These patterns are stored in feature maps and form an alternative representation of the speech signal, which is eventually used to recognize each phone. This strategy is better than representing the entire frequency spectrum as a whole, as is done in a GMM. Another benefit of local filters is better robustness against noise: when noise is concentrated in specific regions of the frequency spectrum, local filters in cleaner regions can still detect speech features that distinguish between phones. Linear spectrum, Mel-scale spectrum, or filter-bank features are all well suited for local filtering. MFCCs cannot be used for CNNs because the DCT-based decorrelation destroys the local characteristics of the signal [3].
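The noise-robustness of local filters can be illustrated with a minimal sketch (the filter weights and band values here are made up for the example): a convolution node reads only its own band, so noise outside that band leaves its activation untouched.

```python
def node_output(frame, start, weights):
    """Activation of one convolution node: it sees only the bands
    [start, start + len(weights)) of the spectrum, nothing else."""
    return sum(w * x for w, x in zip(weights, frame[start:]))

clean = [1.0, 2.0, 3.0, 0.0, 0.0]   # e.g. filter-bank energies
noisy = clean[:3] + [9.0, 9.0]      # noise confined to the high bands
w = [0.5, -1.0, 0.5]                # arbitrary local filter weights

# The low-band node is unaffected by the high-band noise:
assert node_output(clean, 0, w) == node_output(noisy, 0, w)
```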

2.2 Max-pooling

As mentioned above, a speech spectrum includes many local structures, and each local structure usually appears within a limited range around one particular frequency. For example, the central frequencies of formants for the same phoneme may vary within a limited range across different speakers, and even across different utterances of this formant from the same speaker. Such shifts are difficult to handle with other models such as GMMs and DNNs. In CNNs, feature values computed at different locations are pooled together via max-pooling and represented by one value [3]. The max-pooling function outputs the maximum value of its receptive field. The max-pooling layer thus generates a lower-resolution version of the convolution layer by performing this maximization operation every n bands, where n is the sub-sampling factor. This lower-resolution version retains the most useful information and can be further processed by higher layers in the NN hierarchy. In this way, much of the variability introduced by different speakers is absorbed [2].
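This shift tolerance can be sketched as follows (assuming, for simplicity, non-overlapping pooling windows): as long as a formant peak moves within one pooling window, the pooled output is unchanged.

```python
def pooled(spectrum, size):
    """Max-pool over non-overlapping windows of `size` bands
    (i.e. shift size equal to pooling size)."""
    return [max(spectrum[i:i + size]) for i in range(0, len(spectrum), size)]

a = [9, 0, 0, 0, 0, 0]   # formant peak at band 0
b = [0, 9, 0, 0, 0, 0]   # same peak shifted by one band (another speaker)

assert pooled(a, 3) == pooled(b, 3)   # pooled representations agree
```

If the peak crosses a window boundary the pooled outputs do differ, which is why the pooling size must be matched to the expected amount of shift.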

2.3 Weight sharing

Full weight sharing (FWS)

The weight sharing scheme in figure 2.1 is full weight sharing (FWS): each node of one feature map in the convolution layer uses the same filter weights, applied at different positions across the whole input space, as shown in figure 2.2. The output of the convolution layer can thus be seen as a convolution of the filter weights with the input signals. This is the standard for CNNs in image processing, since the same patterns (edges, blobs, etc.) can appear at any location in an image [3]. For example, a set of filters that work as edge detectors can be applied to the whole image irrespective of position [2].
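One consequence of full weight sharing is that the number of filter weights is independent of the input width. A back-of-the-envelope sketch (ignoring biases; the dense layer size is an illustrative comparison, not a figure from the papers):

```python
def fws_weight_count(num_maps, filter_size):
    """With FWS, each feature map has one filter reused at every position."""
    return num_maps * filter_size

def dense_weight_count(num_inputs, num_units):
    """A fully connected layer, by contrast, grows with the input width."""
    return num_inputs * num_units

# 150 feature maps with filter size 8 need only 1200 filter weights,
# no matter how many frequency bands the input spans:
assert fws_weight_count(150, 8) == 1200
```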



Figure 2.1: Illustration of the architecture for full weight sharing [2]
Figure 2.2: The weight matrix for 80 feature maps looking for 80 different features in each local band. The filter weights in each band are the same. [3]




Limited weight sharing (LWS)

In speech signals, the local structures appearing in different frequency bands are quite different. Therefore, it may be better to limit weight sharing to local filters that are close to each other and are pooled together in the max-pooling layer; that is, one set of filter weights is used per pooling band. This weight sharing strategy is called limited weight sharing (LWS). As a result, the convolution layer is divided into a number of convolution sections. Figure 3.1 visualizes the LWS scheme for CNNs: only the convolution units that feed the same pooling unit share the same filter weights. These units need to share their weights so that they compute the same features, which can then be pooled together. Figure 3.2 shows the weight matrix for limited weight sharing.
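A sketch of one LWS convolution section (plain Python, illustrative only): all positions feeding the same pooling unit apply that section's shared filter, and the pooling unit keeps their maximum.

```python
def lws_section(inputs, start, weights, pool_size):
    """One convolution section under limited weight sharing: the section's
    shared filter is applied at the `pool_size` positions feeding a single
    pooling unit, whose output is the maximum activation."""
    f = len(weights)
    activations = [sum(w * x for w, x in zip(weights, inputs[start + i:start + i + f]))
                   for i in range(pool_size)]
    return max(activations)
```

A neighbouring section would call `lws_section` with its own `weights`, since under LWS filters are not shared across sections.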










Figure 3.1: Illustration of the architecture for limited weight sharing [2]
Figure 3.2: The weight matrix for limited weight sharing: 80 feature maps look for 80 different features in each local band, but the filter weights differ from band to band. [3]



3. Experiments

The experiments in this section were conducted on the TIMIT data set to evaluate the effectiveness of CNNs in ASR. For results of this approach on more challenging large-vocabulary tasks, see the benchmarks section [3].

3.1 Configuration of the CNN

The experiments on CNNs in [3] were conducted using both the full weight sharing (FWS) and limited weight sharing (LWS) schemes. First, the ASR performance of CNNs is evaluated under different settings of the CNN parameters. In these experiments, one convolution layer, one pooling layer, and two fully connected hidden layers (1000 units each) on top are used. The convolution and pooling parameters were: filter size of 8, 150 feature maps for FWS and 80 feature maps per frequency band for LWS, pooling size of 6, and a shift size of 2.
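With these settings, the layer sizes follow from standard sliding-window arithmetic; as a sketch (the 40-band input size is an assumed example, not a figure stated here):

```python
def conv_positions(num_bands, filter_size):
    """Number of positions a filter of `filter_size` can take along the bands."""
    return num_bands - filter_size + 1

def pool_units(positions, pool_size, shift):
    """Number of pooling outputs for a given pooling size and shift."""
    return (positions - pool_size) // shift + 1

bands = 40                          # assumed number of filter-bank bands
p = conv_positions(bands, 8)        # filter size 8 -> 33 positions
assert pool_units(p, 6, 2) == 14    # pooling size 6, shift 2 -> 14 units
```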

3.2 Effects of Varying CNN Parameters

In this section, we analyze the effects of changing different CNN parameters. The results of these experiments on both the core test set and the development set are shown in figures 4.1 and 4.2. Pooling size and the number of feature maps have the most significant impact on the final ASR performance. All configurations reach better performance with increasing pooling size up to 6. A larger number of feature maps usually leads to better performance, especially with FWS. The results also show that LWS can achieve better performance with a smaller number of feature maps than FWS, owing to its ability to learn different feature patterns for different frequency bands; this indicates that the LWS scheme is more efficient in terms of the number of hidden units [3]. With a full weight sharing CNN, a relative reduction in phone error rate (PER) of more than 5% over the DNN without convolution was achieved. With limited weight sharing, the relative reduction exceeded 10%.

Figure 4.1: Effects of the pooling size on phone error rate (PER) [3]
Figure 4.2: Effects of the number of feature maps on PER  [3]



3.3 Overall Performance Experiments

In [3], the overall performance of different CNN configurations is compared with a baseline DNN system on the same TIMIT task. All results of the comparison are listed in Table 1, along with the number of parameters (weights) and computations (ops) in each model. The first two rows show the results for DNNs of different depth. The best-performing CNN from the previous section is used in rows 3 and 4; it has a filter size of 8, a pooling size of 6, and a shift size of 2. The number of feature maps was 150 for LWS and 360 for FWS. The CNN with LWS gave more than an 8% relative reduction in PER over the DNN. LWS was slightly better than FWS even with less than half the number of units in the pooling layer.

In the fifth row, two pairs of convolution and pooling layers with FWS, plus two fully connected hidden layers on top, were used. Row 6 shows the performance of the same model when the second convolution layer uses LWS. These tests show only minor differences from the performance of a single convolution layer, but using two convolution layers tends to result in a smaller number of parameters, as the fourth column shows.

Table 1: Performance on TIMIT with different CNN configurations. 'm' is the number of feature maps. 'p' is the pooling size, 's' is the shift size and 'f' is the filter size [3].




[1] O. Abdel-Hamid, L. Deng, D. Yu, "Exploring convolutional neural network structures and optimization techniques for speech recognition", Interspeech 2013

[2] O. Abdel-Hamid, A. Mohamed, H. Jiang, G. Penn, "Applying Convolutional Neural Networks concepts to hybrid NN-HMM model for speech recognition", IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 4277 - 4280, March 2012

[3] O. Abdel-Hamid, A. Mohamed, H. Jiang, L. Deng, G. Penn, D. Yu, "Convolutional Neural Networks for Speech Recognition", IEEE/ACM Transactions on Audio, Speech and Language Processing (TASLP), vol. 22, issue 10, pp. 1533-1545, Oct. 2014