1 Introduction

The algorithms introduced in this section extract speech features that aim to preserve the phonetic information of the speech signal. The feature extraction algorithms described previously are Mel-Frequency Cepstral Coefficients (MFCC), wavelet-based features obtained with the wavelet transform (WT), non-negative matrix factorization (NMF) and artificial neural networks (ANN). These methods fall into two classes: methods that compute coefficients from a particular transformation, and learning methods that derive hidden features.

2 Comparison

MFCC and wavelet-based feature extraction belong to the first class, while ANN and NMF are learning methods.

2.1 Comparison of Transformations and Computational Cost

For MFCC and WT-based features, one common characteristic is that the input signal is segmented into 30 ms speech segments. The two methods differ, however, in the transformation they use. MFCC applies the Fourier transform (more precisely, the STFT), whereas wavelet-based features use a wavelet transform, which only scales and translates a mother wavelet. The wavelet is usually chosen from an orthonormal family.
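The segmentation and STFT step described above can be sketched as follows. This is a minimal illustration, not the setup used in the cited work: the 10 ms hop, the Hamming window, the 16 kHz sample rate and the function names `frame_signal` and `stft_magnitude` are all assumptions made for the example; only the 30 ms segment length comes from the text.

```python
import numpy as np

def frame_signal(x, sample_rate, frame_ms=30.0, hop_ms=10.0):
    """Split a 1-D signal into overlapping frames (hop_ms is an assumed value)."""
    frame_len = int(round(sample_rate * frame_ms / 1000.0))
    hop_len = int(round(sample_rate * hop_ms / 1000.0))
    if len(x) < frame_len:
        raise ValueError("signal is shorter than one frame")
    n_frames = 1 + (len(x) - frame_len) // hop_len
    # Row i holds the sample indices of frame i.
    idx = np.arange(frame_len)[None, :] + hop_len * np.arange(n_frames)[:, None]
    return x[idx]

def stft_magnitude(x, sample_rate, frame_ms=30.0, hop_ms=10.0):
    """Magnitude STFT: window each frame, then take a real FFT per frame."""
    frames = frame_signal(x, sample_rate, frame_ms, hop_ms)
    window = np.hamming(frames.shape[1])  # assumed window choice
    return np.abs(np.fft.rfft(frames * window, axis=1))

# Example: one second of a 1 kHz tone sampled at 16 kHz (assumed rate).
sr = 16000
t = np.arange(sr) / sr
tone = np.sin(2 * np.pi * 1000.0 * t)
spec = stft_magnitude(tone, sr)  # shape: (number of frames, number of frequency bins)
```

Every frame has the same length, so every row of `spec` has the same frequency spacing (here the sample rate divided by the 480 samples per frame).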

Note that the spectrogram obtained by the STFT has the same time and frequency resolution at all frequencies, whereas for the wavelet transform the time resolution increases with frequency [1]. The two time-frequency tilings are shown in Figure 1.

Figure 1: Time and frequency resolution of STFT and WT [1]
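The frequency-dependent time resolution of the wavelet transform can be illustrated with a small multi-level decomposition. The sketch below uses the orthonormal Haar wavelet purely as an assumed example; the text does not say which wavelet the cited work uses, and `haar_dwt` is a hypothetical helper name.

```python
import numpy as np

def haar_dwt(x, levels):
    """Multi-level orthonormal Haar DWT.

    Returns the final approximation and a list of detail bands,
    ordered from the finest scale (highest frequencies) to the coarsest.
    len(x) must be divisible by 2**levels.
    """
    approx = np.asarray(x, dtype=float)
    if len(approx) % (2 ** levels) != 0:
        raise ValueError("signal length must be divisible by 2**levels")
    details = []
    for _ in range(levels):
        even, odd = approx[0::2], approx[1::2]
        # Detail band: scaled differences of neighbouring samples.
        details.append((even - odd) / np.sqrt(2.0))
        # Approximation band: scaled sums, passed on to the next, coarser level.
        approx = (even + odd) / np.sqrt(2.0)
    return approx, details

approx, details = haar_dwt(np.arange(8.0), levels=3)
band_lengths = [len(d) for d in details]  # finest band has the most coefficients
```

The finest detail band keeps one coefficient per two input samples, while each coarser band keeps half as many as the one before it. High-frequency content is therefore localized more finely in time than low-frequency content.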

Because of the different transforms, the computational demands of MFCC and WT differ considerably. For MFCC, the temporal signal is first transformed to the frequency domain with the Fourier transform (computation time: O(N^2) for a direct DFT of an N-sample segment, O(N log N) with the FFT), then the logarithm is taken, after which the discrete cosine transform is applied either directly (computation time: O(N^2)) or via the FFT (computation time: O(N log N)). The WT, in contrast, is a temporal approximation using the chosen wavelet and a moving temporal window. Its computational demand is therefore not high, and the extracted features contain scaling and time-shift information.
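The MFCC chain for a single segment can be sketched as below. Several details are assumptions, not taken from the text: the mel mapping 2595 * log10(1 + f / 700) (one common convention), the triangular filter bank with 26 filters, the 13 retained coefficients, the Hamming window and the small offset added before the logarithm. The DCT is written in its direct matrix form to make the quadratic cost visible.

```python
import numpy as np

def hz_to_mel(f):
    # One widely used mel mapping (assumed here).
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filterbank(n_filters, n_fft, sample_rate):
    """Triangular filters whose centres are equally spaced on the mel scale."""
    mel_points = np.linspace(hz_to_mel(0.0), hz_to_mel(sample_rate / 2.0), n_filters + 2)
    hz_points = mel_to_hz(mel_points)
    freqs = np.fft.rfftfreq(n_fft, d=1.0 / sample_rate)
    fbank = np.zeros((n_filters, len(freqs)))
    for i in range(n_filters):
        lo, mid, hi = hz_points[i], hz_points[i + 1], hz_points[i + 2]
        rising = (freqs - lo) / (mid - lo)
        falling = (hi - freqs) / (hi - mid)
        fbank[i] = np.clip(np.minimum(rising, falling), 0.0, None)
    return fbank

def dct_ii(x):
    """Direct (unnormalized) DCT-II of a 1-D array; O(N^2) matrix product."""
    n = x.shape[-1]
    k = np.arange(n)[:, None]
    j = np.arange(n)[None, :]
    basis = np.cos(np.pi * k * (2 * j + 1) / (2 * n))
    return x @ basis.T

def mfcc_frame(frame, sample_rate, n_filters=26, n_coeffs=13):
    """MFCCs of one segment: FFT power -> mel filter bank -> log -> DCT."""
    windowed = frame * np.hamming(len(frame))
    power = np.abs(np.fft.rfft(windowed)) ** 2
    mel_energies = mel_filterbank(n_filters, len(frame), sample_rate) @ power
    log_mel = np.log(mel_energies + 1e-10)  # small offset avoids log(0)
    return dct_ii(log_mel)[:n_coeffs]

# Example: one 30 ms segment (480 samples at an assumed 16 kHz) of a 440 Hz tone.
segment = np.sin(2 * np.pi * 440.0 * np.arange(480) / 16000.0)
coeffs = mfcc_frame(segment, 16000)
```

In practice the FFT-based DCT would be preferred for long vectors; with only a few dozen filter-bank outputs per segment, the direct form shown here is usually not the bottleneck.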

NMF and ANN are both learning methods from the field of machine learning. Both are iterative: they obtain the final result by repeatedly estimating the error and updating the components.

NMF, as a matrix factorization method, is used to factorize a speech spectrogram computed by the STFT. The error is estimated with a particular cost function between the reconstructed and the original signal. The factors of standard NMF are a weight matrix and a hidden variable, as shown in Figure 2. This is somewhat similar to a single-layer feed-forward ANN, as shown in Figure 3. Similarly, an ANN has an activation function, a loss function and a corresponding update rule for the weights. Note, however, that none of the data in NMF can be negative, while the weights between neurons have no such constraint. Moreover, for both methods the optimization of the loss function can only guarantee a local minimum, not a global one [5,6].

The computational demands of both learning methods are high. The computation time depends on the number of iterations and on the complexity of the loss function and the update rule.
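The iterative scheme of error estimation and component update can be sketched for NMF with multiplicative updates and a Euclidean cost, in the spirit of [6]. The cost function, the random initialization, the iteration count and the function name `nmf` are assumptions for illustration; other cost functions lead to different update rules.

```python
import numpy as np

def nmf(V, rank, n_iter=200, seed=0, eps=1e-9):
    """Factorize a non-negative matrix V (e.g. a magnitude spectrogram) as W @ H.

    Uses multiplicative updates for the Euclidean cost ||V - W @ H||.
    Returns W, H and the reconstruction error after every iteration.
    """
    rng = np.random.default_rng(seed)
    n_rows, n_cols = V.shape
    W = rng.random((n_rows, rank)) + eps  # non-negative random start
    H = rng.random((rank, n_cols)) + eps
    errors = []
    for _ in range(n_iter):
        # Component update: multiplying by non-negative ratios keeps W and H non-negative.
        H *= (W.T @ V) / (W.T @ W @ H + eps)
        W *= (V @ H.T) / (W @ H @ H.T + eps)
        # Error estimation between the original and the reconstructed matrix.
        errors.append(float(np.linalg.norm(V - W @ H)))
    return W, H, errors

# Example on random non-negative data standing in for a spectrogram.
V_demo = np.random.default_rng(1).random((20, 30))
W_demo, H_demo, err_demo = nmf(V_demo, rank=5)
```

Each iteration costs a handful of matrix products, so the run time grows with the number of iterations, the matrix size and the chosen rank. Different random seeds can end in different local minima.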

 

Figure 2: Hidden variable model of NMF; the lines represent the basis [Wikipedia: NMF]

Figure 3: One-layer ANN; the arrows represent weights [Wikipedia: ANN]

2.2 Combination of Methods with Respect to Word Error Rate

To evaluate the quality of a feature extraction method, one common criterion is the word error rate (WER).
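WER is commonly computed as the word-level edit distance between the recognized and the reference word sequence, divided by the number of reference words. The sketch below follows that common definition; the function name `word_error_rate` and the whitespace tokenization are assumptions, and the cited papers may count errors differently.

```python
def word_error_rate(reference, hypothesis):
    """WER = (substitutions + deletions + insertions) / number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] is the edit distance between the processed reference prefix and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)

# One substitution ("two" -> "too") and one deletion ("four") over four reference words.
wer_demo = word_error_rate("one two three four", "one too three")
```

Under the convention that word accuracy is 1 - WER (an assumption; some works define accuracy without counting insertions), the example above gives a WA of 0.5.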

One characteristic of MFCC is its use of the mel scale, which maps frequency from hertz to mel logarithmically. This makes the frequency scale follow the behaviour of the human auditory system, so that a mel filter bank directly extracts the components of speech that matter for the perceived signal. MFCC is therefore the most widely used feature extraction method in speech processing. As a mature technique, it has been used in mobile phones for the recognition of spoken numbers [2].

The drawback of MFCC is its sensitivity to additive noise. However, the WER or word accuracy (WA) can be improved by combining it with other techniques, as shown in Figures 4 and 5 [7]. For clean speech MFCC performs well, while in the noisy case its recognition rate drops (from 80% to 60%). The results of that paper also show that combining WT with MFCC reduces the WER for both clean and noisy speech.

Figure 4: Average success percentage of each word for the noisy speech [7]

Figure 5: Average success percentage of each word for the clean speech [7]

The combination of MFCC and NMF was used in an experiment on noise-robust recognition of spelled letters inside a car [4]. The results in Figure 6 show the WA for different combinations of trained model and test condition. The noise conditions are: smooth inner-city road (CTY), highway (HWY) and cobbled road (COB). M denotes MFCC features and M+N denotes MFCC combined with NMF features. In the different noise conditions, the combined features show a clear improvement in WA.

Figure 6: WA of in-car spelling recognition for the letters "a" to "z" [4]

3 Conclusion

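                      | MFCC                                       | Wavelet based                        | NMF                                           | ANN
----------------------+--------------------------------------------+--------------------------------------+-----------------------------------------------+---------------------
Transformation        | STFT                                       | WT                                   | STFT                                          | x
Computational demands | ++                                         | +                                    | +++                                           | +++
Variability           | log mel amplitude offset before DCT        | different wavelets                   | adding different constraints to cost function | structure of neurons
Particular cases      | number recognition in mobile communication | human voice and music classification | speech separation                             | robust features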

 

References:

[1] R. Sarikaya (2000) "High Resolution Speech Feature Parametrization for Monophone-Based Stressed Speech Recognition", IEEE Signal Processing Letters, vol. 7, no. 7, pp. 182-185

[2] European Telecommunications Standards Institute (2003) "Speech Processing, Transmission and Quality Aspects (STQ); Distributed speech recognition; Front-end feature extraction algorithm; Compression algorithms", Technical standard ES 201 108, v1.1.3

[3] R. Sarikaya (1998) "Wavelet packet transform features with application to speaker recognition"

[4] B. Schuller (2010) "Non-negative matrix factorization as noise-robust feature extraction for speech recognition", ICASSP, pp. 4562-4565

[5] C. Plahl (2014) "Neural Network based Feature Extraction for Speech and Image Recognition", Ph.D. dissertation, Dept. Computer Science, RWTH Aachen, Aachen, Ger

[6] D. D. Lee (2000) "Algorithms for non-negative matrix factorization"

[7] M. A. Anusuya (2011) "Comparison of Different Speech Feature Extraction Techniques with and without Wavelet Transform to Kannada Speech Recognition", International Journal of Computer Applications, vol. 26, no. 4, pp. 19-24

 

 

