In 1971, W. H. Lawton and E. A. Sylvestre presented a method to find the shapes of two overlapping functions from observations of their additive mixtures [1]. The method, called Self Modeling Curve Resolution (SMCR), was frequently used in spectrophotometry, where the measured signal is an additive mixture of two unknown, non-negative, linearly independent functions (the "basis"). The mixing coefficients of these functions are likewise constrained to be non-negative. The feasibility of finding the "basis" of such mixtures was verified with PCA. SMCR, which considers only two basis functions, is the precursor of non-negative matrix factorization (NMF). An extension to an arbitrary number of basis functions was introduced by P. Paatero and U. Tapper in 1994 under the name positive matrix factorization (PMF) [2]. NMF became widely known in 1999 through the work of D. Lee and H. Seung, who argued that perception of the whole is based on perception of its parts [3] and provided two algorithms to learn such parts-based representations.

This page first introduces the basic idea of NMF and the parts-based concept, then presents different algorithms to compute NMF, and finally discusses NMF-based applications for speech recognition (SR).

1 Introduction of NMF and Parts-based Idea

NMF is a matrix factorization method with a non-negativity constraint: all matrices involved in the factorization must be non-negative. (Note that a non-negative matrix is different from a positive definite matrix, which fulfills $x^T A x > 0$ for all $x \neq 0$; in a non-negative matrix all elements are simply greater than or equal to zero.)

Given a non-negative matrix $V \in \mathbb{R}^{n \times m}$, find non-negative matrix factors $W$ and $H$ such that $V \approx WH$ [4]. Here $V$ is a data set containing $m$ samples of dimension $n$, and the factor matrices are $W \in \mathbb{R}^{n \times r}$ and $H \in \mathbb{R}^{r \times m}$, where the rank $r$ should be smaller than $n$ or $m$. NMF can also be written column-wise as $v \approx Wh$: each column of $V$ is approximated by a linear combination of the columns of $W$, so $W$ acts as the basis matrix. This is the basic idea of NMF.
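As a minimal illustration of the factorization $V \approx WH$, the following Python sketch (assuming NumPy and scikit-learn are available; the matrix sizes are made-up example values) factorizes a small non-negative matrix into r = 3 bases:

```python
import numpy as np
from sklearn.decomposition import NMF

# Hypothetical non-negative data matrix V: n = 64 frequency bins, m = 200 frames
rng = np.random.default_rng(0)
V = rng.random((64, 200))

# Factorize V ~= W H with a reduced rank r = 3
model = NMF(n_components=3, init="random", random_state=0, max_iter=500)
W = model.fit_transform(V)   # basis matrix, shape (64, 3)
H = model.components_        # weight / hidden-variable matrix, shape (3, 200)

print(W.shape, H.shape)
print("reconstruction error:", np.linalg.norm(V - W @ H))
```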

In order to illustrate how NMF works for speech signals, what each matrix means and what "parts-based" means, a small example of NMF deconvolution for sound object extraction is quoted here [5] (the deconvolution aspect can be neglected for our purposes).

Given the spectrogram X of a segment of a sound signal (the spectrogram is usually computed via the Short Time Fourier Transform, STFT), NMF is used to find the linear approximation WH of X.

The lower right plot is the magnitude spectrogram X; it contains 256 DFT slices, with an overlap of 128 points between consecutive slices. The NMF performed here decomposes X into W with 3 bases (the 3 leftmost plots, each spanning 10 DFT slices) and a weight matrix H, also called the hidden variable matrix, which can be seen as the invisible layer of the original signal.

In terms of the parts-based concept, the 3 bases of W represent parts of this signal X. In this case each DFT slice of X is represented by the multiplication of the parts-bases W with a corresponding weight vector h. For fixed bases W, the hidden variable matrix H contains the features hidden in the speech segment; ideally these are our desired speech features.

From this example it is easy to see that the input signal, which is represented in both time and frequency, is divided by NMF into two factors: the frequency bases (W) and the features (H), which are purely temporal.

2 Algorithms to Compute NMF

There are two ways to obtain the factorization. One is exact NMF [6], which relies on extra conditions on the original matrix and is not discussed here. The other is to iteratively update W and H. Since this approach does not yield an exact solution, an objective (cost) function is needed to judge whether the current W and H are a good representation of the original signal, and to derive the update rules. Mathematically such a function is called a metric only if it is symmetric; otherwise it can be considered a divergence.

2.1 Cost Functions

There exist many divergences that can measure the distance between two matrices. A standard choice is the squared Euclidean distance [4]:

$$\|A - B\|^2 = \sum_{ij} (A_{ij} - B_{ij})^2 \qquad (1)$$

where A = V, B = WH, and i, j are simply indices into the original matrix (they are not related to the update indices used below). This distance is based on a norm and therefore fulfills homogeneity, which means the Euclidean distance does not capture any statistical properties of the data.

The second method is based on the Kullback–Leibler divergence [4]:

$$D_{KL}(A \,\|\, B) = \sum_{ij} \left( A_{ij} \log\frac{A_{ij}}{B_{ij}} - A_{ij} + B_{ij} \right) \qquad (2)$$

It is also called relative entropy and describes the difference between two statistical distributions. Let A and B be two probability distributions; since the logarithm is concave, a short reformulation (Jensen's inequality) shows that the KL divergence is always greater than or equal to zero, with equality if and only if A equals B. Since the KL divergence fulfills neither symmetry nor the triangle inequality, it is not a true distance.

The third method is based on the Itakura-Saito (IS) divergence [7]:

$$D_{IS}(A \,\|\, B) = \sum_{ij} \left( \frac{A_{ij}}{B_{ij}} - \log\frac{A_{ij}}{B_{ij}} - 1 \right) \qquad (3)$$

This divergence reflects the perceptual difference between the original spectrum and the approximated spectrum. Under the assumption that the given signal is a superposition of Gaussian components, IS-NMF has a statistical interpretation: the maximum likelihood estimation of W and H from a signal V that is a sum of Gaussian components is exactly the NMF of V under the IS divergence; the proof is given in [7].

2.2 Update Rules

For the cost functions given above, update rules have to be found that converge quickly to a minimum while saving computation. Note that these cost functions are convex in W alone and in H alone (where B = WH), but not in both jointly, so only convergence to a local minimum can be guaranteed.

A direct approach is gradient descent, but it converges slowly to a local minimum. Another option is the conjugate gradient method, which converges faster but is more complicated to implement. A further drawback of gradient-based methods is their sensitivity to the step size, which is impractical for applications [4]. The following multiplicative update rules, introduced by Lee & Seung for the first two cost functions, together with the multiplicative rule for IS-NMF, strike a good balance between convergence speed and implementation complexity.

The update rules of W and H for the Euclidean distance cost function are as follows:

$$H_{a\mu} \leftarrow H_{a\mu} \frac{(W^T V)_{a\mu}}{(W^T W H)_{a\mu}} \qquad (4)$$

$$W_{ia} \leftarrow W_{ia} \frac{(V H^T)_{ia}}{(W H H^T)_{ia}} \qquad (5)$$

where i indexes the rows of V (and thus of W), µ indexes the columns of V (and thus of H), and a runs over the reduced rank r of W and H.
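A minimal NumPy sketch of these multiplicative updates for the Euclidean cost (a random non-negative V is assumed; the small epsilon in the denominators is a common practical guard against division by zero, not part of the original rules):

```python
import numpy as np

rng = np.random.default_rng(0)
n, m, r = 64, 200, 3
V = rng.random((n, m))                 # non-negative data matrix
W = rng.random((n, r))                 # random non-negative initialization
H = rng.random((r, m))
eps = 1e-9

for it in range(200):
    # Eq. (4): scale H by the ratio (W^T V) / (W^T W H)
    H *= (W.T @ V) / (W.T @ W @ H + eps)
    # Eq. (5): scale W by the ratio (V H^T) / (W H H^T)
    W *= (V @ H.T) / (W @ H @ H.T + eps)

print("Euclidean cost:", np.sum((V - W @ H) ** 2))
```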

The update rules for the KL divergence are:

$$H_{a\mu} \leftarrow H_{a\mu} \frac{\sum_i W_{ia}\, V_{i\mu} / (WH)_{i\mu}}{\sum_k W_{ka}} \qquad (6)$$

$$W_{ia} \leftarrow W_{ia} \frac{\sum_\mu H_{a\mu}\, V_{i\mu} / (WH)_{i\mu}}{\sum_\nu H_{a\nu}} \qquad (7)$$

The additional indices k and ν appear only in the denominators, which constrain the columns of W and the rows of H to sum to unity. This conveniently eliminates the degeneracy associated with the invariance of WH under the transformation $W \rightarrow W\Lambda$, $H \rightarrow \Lambda^{-1} H$, where $\Lambda$ is a diagonal matrix [3].

The proof that the first two cost functions are non-increasing under these update rules is described in detail in [4]. Both update rules are multiplicative; in addition, a simple additive update can be used for H, which is also shown in that paper.
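For comparison, a short NumPy sketch of the KL updates (6) and (7) under the same assumptions as the previous snippet (random non-negative V, epsilon only for numerical safety):

```python
import numpy as np

rng = np.random.default_rng(1)
V = rng.random((64, 200))
W = rng.random((64, 3))
H = rng.random((3, 200))
eps = 1e-9

for it in range(200):
    R = V / (W @ H + eps)                              # element-wise ratio V_ij / (WH)_ij
    H *= (W.T @ R) / (W.sum(axis=0)[:, None] + eps)    # Eq. (6): normalize by column sums of W
    R = V / (W @ H + eps)
    W *= (R @ H.T) / (H.sum(axis=1)[None, :] + eps)    # Eq. (7): normalize by row sums of H

KL = np.sum(V * np.log((V + eps) / (W @ H + eps)) - V + W @ H)
print("KL divergence:", KL)
```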

The IS-NMF multiplicative update rules are (IS-NMF/MU) [7]:

$$H \leftarrow H \otimes \frac{W^T \big( (WH)^{\cdot -2} \otimes V \big)}{W^T (WH)^{\cdot -1}} \qquad (8)$$

$$W \leftarrow W \otimes \frac{\big( (WH)^{\cdot -2} \otimes V \big) H^T}{(WH)^{\cdot -1} H^T} \qquad (9)$$

where $\otimes$ and the fraction bar denote element-wise multiplication and division, and $(WH)^{\cdot -p}$ is the element-wise power.

An advantage of IS-NMF compared with EU/KL-NMF is scale invariance. To see this property we first need the definition of the $\beta$-divergence, where $\beta = 1$ corresponds to the KL divergence and $\beta = 2$ to the Euclidean distance. From the definition of the $\beta$-divergence it is easy to show that

$$d_\beta(\gamma x \,|\, \gamma y) = \gamma^{\beta}\, d_\beta(x \,|\, y) \qquad (10)$$

holds for all $\beta$-divergences, and scale invariance holds only for $\beta = 0$ (the IS divergence). This means the cost of a bad factorization of low-power coefficients of x is the same as the cost for high-power coefficients. In contrast, for $\beta > 0$ the cost function relies more on the larger coefficients, which leads to less accuracy in the reconstruction of low-power components in audio spectra [7].
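A sketch of the IS-NMF/MU updates (8) and (9) in NumPy, under the same assumptions as the previous snippets (element-wise operations written explicitly; the epsilon is again only a numerical guard):

```python
import numpy as np

rng = np.random.default_rng(2)
V = rng.random((64, 200)) + 1e-3       # e.g. a power spectrogram, strictly positive
W = rng.random((64, 3))
H = rng.random((3, 200))
eps = 1e-12

for it in range(200):
    WH = W @ H + eps
    # Eq. (8): H <- H * [W^T((WH)^-2 * V)] / [W^T(WH)^-1]
    H *= (W.T @ (V / WH**2)) / (W.T @ (1.0 / WH) + eps)
    WH = W @ H + eps
    # Eq. (9): W <- W * [((WH)^-2 * V)H^T] / [(WH)^-1 H^T]
    W *= ((V / WH**2) @ H.T) / ((1.0 / WH) @ H.T + eps)

WH = W @ H + eps
print("IS divergence:", np.sum(V / WH - np.log(V / WH) - 1))
```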

Additionally, there is an Expectation-Maximization variant of IS-NMF (IS-NMF/EM) based on the SAGE algorithm, which is suited to specially structured data (here the superimposed Gaussian components mentioned above). It avoids the drawback of a plain EM approach, where all factorization parameters have to be updated in each iteration; with SAGE only a subset of the parameters is updated at a time.

Given a sum of independent Gaussian components (each signal frame $x_n$ is modeled as a superposition of zero-mean Gaussian components),

$$x_n = \sum_{k=1}^{K} c_{k,n}, \qquad c_{k,n} \sim \mathcal{N}\big(0,\; h_{kn}\,\mathrm{diag}(w_k)\big) \qquad (11)$$

where K is the number of basis functions.

The parameter set is partitioned into subsets $\theta_k = \{w_k, h_k\}$, one per component. SAGE chooses for each $\theta_k$ a hidden-data space $C_k = \{c_{k,n}\}_n$, and cycling through the union of these subsets yields the IS-NMF/EM estimate of X.

The corresponding update rules are:

$$h_{kn} \leftarrow \frac{1}{F} \sum_{f} \frac{v_{k,fn}}{w_{fk}}, \qquad w_{fk} \leftarrow \frac{1}{N} \sum_{n} \frac{v_{k,fn}}{h_{kn}} \qquad (12)$$

where $v_{k,fn} = |\mu_{k,fn}|^2 + \lambda_{k,fn}$, and $\mu_{k,fn}$ and $\lambda_{k,fn}$ are the posterior mean and variance of the component $c_{k,fn}$, computed from the most up-to-date parameters.

This model gives a good estimation of musical pitches compared with KL-NMF, EU-NMF and even IS-NMF/MU [7].

3 Applications of NMF for SR

Because of the parts-based characteristic of NMF and the non-negativity of speech spectra, NMF can be widely used in speech signal processing in combination with further constraints, such as a statistical model of the input signal or a sparsity constraint on the H matrix.

3.1 Speech Denoising 

Combined with statistical parameters of the signal, NMF can be used for denoising.

As mentioned in the preprocessing part, noise in speech processing can be stationary or non-stationary, and most filtering methods only deal with the former. By using NMF with priors [8], environmental noise, which is non-stationary, can be largely eliminated.

In the training step, the spectrograms of the source signals, noise and speech, are available individually, namely $V_{speech}$ and $V_{noise}$. Using the KL divergence, the cost function is minimized for each source separately, yielding $W_{speech}$ and $H_{speech}$ with sizes $F \times r_{speech}$ and $r_{speech} \times N$, and likewise $W_{noise}$ and $H_{noise}$. After the NMF, the statistics of $H$ are also estimated; a Gaussian representation of the signals was chosen for convenience. The empirical means and covariances of the log values of $H$ are computed, yielding $\mu_{speech}, \Sigma_{speech}$ and $\mu_{noise}, \Sigma_{noise}$, where each $\mu$ is a mean vector and each $\Sigma$ is a covariance matrix.

Based on the statistics of the coefficients $H$, a regularized cost function is introduced in which the negative log likelihood of $H$ is added:

$$C = D_{KL}(V \,\|\, WH) \;-\; \lambda \sum_{n} \log \mathcal{N}\big(\log \mathbf{h}_n ;\, \boldsymbol{\mu}, \boldsymbol{\Sigma}\big) \qquad (13)$$

such that $H$ stays consistent with the statistics learned from the training data [8].

In the denoising stage, $W$ is fixed, assuming the learned basis functions describe speech and noise well. Assuming speech and noise are independent, the bases are concatenated to form $W = [W_{speech}\; W_{noise}]$ with the corresponding codes $H = [H_{speech};\, H_{noise}]$.

Finally, to reconstruct the denoised spectrogram, one simply computes $W_{speech} H_{speech}$.
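A compact NumPy sketch of this denoising scheme (without the prior term of Eq. (13), i.e. plain KL updates, and with made-up matrix sizes), showing the training of separate bases, the fixed-W decomposition of the mixture, and the reconstruction of the speech part:

```python
import numpy as np

def nmf_kl(V, W, H, n_iter=200, update_W=True, eps=1e-9):
    """Multiplicative KL-divergence updates; W can be kept fixed (denoising stage)."""
    for _ in range(n_iter):
        H *= (W.T @ (V / (W @ H + eps))) / (W.sum(axis=0)[:, None] + eps)
        if update_W:
            W *= ((V / (W @ H + eps)) @ H.T) / (H.sum(axis=1)[None, :] + eps)
    return W, H

rng = np.random.default_rng(0)
F, N, r = 64, 300, 8                                   # hypothetical sizes
V_speech, V_noise = rng.random((F, N)), rng.random((F, N))

# training: learn separate bases for speech and noise
W_s, H_s = nmf_kl(V_speech, rng.random((F, r)), rng.random((r, N)))
W_n, H_n = nmf_kl(V_noise,  rng.random((F, r)), rng.random((r, N)))

# denoising: fix the concatenated bases and only fit the codes of the noisy mixture
V_mix = V_speech + V_noise
W_fix = np.hstack([W_s, W_n])                          # W = [W_speech  W_noise]
_, H_mix = nmf_kl(V_mix, W_fix, rng.random((2 * r, N)), update_W=False)

V_denoised = W_s @ H_mix[:r]                           # reconstruct with the speech part only
print(V_denoised.shape)
```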

A simple example illustrates NMF with priors: both ranks are set to 2, i.e. speech and noise each consist of a high- and a low-frequency component. The original signals show that for speech the high and low frequencies are strongly correlated, while for noise they are somewhat negatively correlated; the underlying assumption is that the statistics of speech and noise are quite different. The results show the potential of this approach: by regularizing the KL divergence with a log likelihood term, such an NMF algorithm performs well for non-stationary denoising of speech signals.

3.2 Single Channel Speech Separation using Sparse NMF(SNMF)

As a learning method, NMF can learn a dictionary from given training data. With an additional sparsity constraint on the code matrix H, the reconstruction can perform speech separation of a mixture.

The basic idea of speech separation using SNMF is to restrict H to be sparse while the data is still well approximated.

Consider a mixture as a sum of source signals, $x = \sum_{i} x_i$, where each source can be decomposed into its own basis functions $W_i$ and code matrix $H_i$. By enforcing $H$ to be sparse, there should exist a total dictionary $W = [W_1, \ldots, W_P]$, concatenated from the individual $W_i$, such that the mixture code can be separated into the contributions $W_i H_i$ of the individual sources.

A sparsity term is added to the cost function. Since no statistics of the data are considered, it is more convenient to use the Euclidean distance. The new cost function is:

$$E = \|V - WH\|^2 + \lambda \sum_{ij} H_{ij} \qquad (14)$$

And the corresponding update rules are:

$$H \leftarrow H \otimes \frac{W^T V}{W^T W H + \lambda} \qquad (15)$$

$$W \leftarrow W \otimes \frac{V H^T}{W H H^T}, \quad \text{followed by rescaling the columns of } W \text{ to unit norm} \qquad (16)$$

where $\lambda$ controls the degree of sparsity: a larger $\lambda$ enforces a sparser H, and $\lambda = 0$ recovers standard NMF. The relaxation from the $\ell_0$ norm to the $\ell_1$ norm still yields the sparsest reconstruction if and only if the dictionary fulfills the null space property.
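A minimal sketch of sparse NMF in NumPy with the $\ell_1$-penalized Euclidean cost of Eq. (14) (the column renormalization of W is one common way to keep the penalty meaningful; all sizes are made up):

```python
import numpy as np

def sparse_nmf(V, r, lam=0.1, n_iter=300, eps=1e-9, seed=0):
    """Euclidean NMF with an L1 sparsity penalty on H (cf. Eq. (14))."""
    rng = np.random.default_rng(seed)
    F, N = V.shape
    W, H = rng.random((F, r)), rng.random((r, N))
    for _ in range(n_iter):
        H *= (W.T @ V) / (W.T @ W @ H + lam + eps)            # Eq. (15): penalty in denominator
        W *= (V @ H.T) / (W @ H @ H.T + eps)                  # Eq. (16)
        W /= np.linalg.norm(W, axis=0, keepdims=True) + eps   # keep columns at unit norm
    return W, H

V = np.abs(np.random.default_rng(1).standard_normal((64, 400)))
W, H = sparse_nmf(V, r=20, lam=0.5)
print("fraction of near-zero code entries:", np.mean(H < 1e-3))
```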

SNMF is applied to the training data of the individual speakers to learn their dictionaries. In the separation step the merged over-complete dictionary is kept fixed and the sparse code H of the mixture is obtained; the separated sources are then reconstructed from the over-complete dictionary and H.

Note that computing a dictionary per speaker is computationally demanding. Due to the phoneme-based nature of speech, dictionaries can instead be learned at the phoneme level by using an HMM phoneme recognizer.

For each phoneme the dictionary is over-complete; it turns out that each phoneme exhibits some general frequency properties.

Intuitively, by restricting the sparsity of H and regarding the total over-complete dictionary as a concatenation of phoneme-level dictionaries, such a sparse H can be treated as our speech features, given suitable training data.

The SNR is computed by comparing the separated waveforms with the clean sources.

$\lambda$ is important for the separation performance: it controls the sparsity directly and thereby also influences the dictionary size needed for a good reconstruction.

The drawback of NMF is its computational demand. The decomposition generally needs 500 to 1000 iterations, and each iteration includes the evaluation of the cost function and the parameter updates. Moreover, this process has to be carried out in both the training and the testing phase, although in the testing phase the basis functions are not updated. This makes the choice of divergence, and of correspondingly fast-converging update rules, a hard decision.

On the other hand, the possibility of regularizing the cost functions provides a flexible and practical way to address different demands and different statistics of the data.

To end this chapter, one question deserves to be mentioned: when does NMF give a correct decomposition into parts? (See the further reading below.) In other words, under which conditions does NMF provide a unique and correct recovery? These conditions should hold for all NMF algorithms.

 

References:

[1] W. H. Lawton, E. A. Sylvestre (1971), "Self Modeling Curve Resolution", Technometrics, vol. 13, no. 3, pp. 617-622

[2] P. Paatero, U. Tapper (1994), "Positive matrix factorization: a non-negative factor model with optimal utilization of error estimates of data values", Environmetrics, vol. 5, issue 2, pp. 111-126

[3] D. D. Lee, H. S. Seung (1999), "Learning the parts of objects by non-negative matrix factorization", Nature, vol. 401, pp. 788-791

[4] D. D. Lee, H. S. Seung (2000), "Algorithms for non-negative matrix factorization"

[5] P. Smaragdis (2004), "Non-negative matrix factor deconvolution; extraction of multiple sound sources from monophonic inputs"

[6] S. L. Campbell, G. D. Poole (1981), "Computing nonnegative rank factorizations", Linear Algebra and its Applications, vol. 35, pp. 175-182

[7] C. Fevotte (2008), "Nonnegative matrix factorization with the Itakura-Saito divergence. With application to music analysis"

[8] K. Wilson (2008), "Speech denoising using nonnegative matrix factorization with priors"

[9] M. N. Schmidt (2006), "Single-channel speech separation using sparse non-negative matrix factorization"

Further reading: D. Donoho, "When does non-negative matrix factorization give a correct decomposition into parts?"


The chapter "Feature Extraction" describes feature extraction methods for speech signals. It introduces the motivation for feature extraction from speech signals, describes various low-level features, and presents various combinations of low-level features. Finally, the chapter describes the feature extraction component of RWTH Aachen, which combines the previously described methods.

1. Introduction

After the preprocessing step, feature extraction is the second component of automatic speech recognition (ASR) systems. This component derives descriptive features from the windowed and enhanced speech signal to enable a classification of sounds. Feature extraction is needed because the raw speech signal contains information besides the linguistic message and has a high dimensionality. Both characteristics make the raw speech signal unsuitable for the classification of sounds and would result in a high word error rate. Therefore, the feature extraction algorithm derives a characteristic feature vector of lower dimensionality, which is used for the classification of sounds. [1,2]

A feature vector should emphasize the important information regarding the specific task and suppress all other information. As the goal of automatic speech recognition is to transcribe the linguistic message, the information about this message needs to be emphasized [3]. Speaker-dependent characteristics and the characteristics of the environment and of the recording equipment should be suppressed, because they do not contain any information about the linguistic message. Including this non-linguistic information would introduce additional variability, which could have a negative impact on the separability of the phone classes. Furthermore, the feature extraction should reduce the dimensionality of the data to reduce the computation time and the number of required training samples [1].

2. Low Level Features

Until now, many different features highlighting different aspects of the speech signal have been proposed. These features can mostly be divided into linguistic and acoustic features. Acoustic features are only relevant for the classification of non-verbal vocal outbursts such as laughter or sighs. Linguistic features are more relevant to ASR systems because these systems try to transcribe the linguistic message [6]. Examples of linguistic features are intensity, Linear Predictive Coding (LPC) [4], Perceptual Linear Predictive coefficients (PLP) [5], Mel-Frequency Cepstral Coefficients (MFCC), Linear Prediction Cepstral Coefficients (LPCC), wavelet based features and Non-Negative Matrix Factorization features. Many of the previously mentioned low-level features use speech signal frames in the range from 10 ms to 30 ms due to their quasi-stationary characteristics [7]. Furthermore, many of these features are biologically inspired and extract the features from the spectrum, because human speech production controls the spectrum of the signal and the ear acts as a spectrum analyzer (for more information see Speech Production and Sense of Hearing) [1]. The articles on Mel-Frequency Cepstral Coefficients, Wavelet Based Features and Non-Negative Matrix Factorization give an extensive description of the corresponding feature extraction methods. A comparison of these three low-level features is given in the article Comparison of Different Feature Extraction Algorithms. Further information regarding the other feature extraction methods, which are not described in this wiki, can be found in the corresponding references.

3. Combination of Features

The previously described low-level features all have different shortcomings. To compensate for the individual shortcomings, various combinations of low-level features have been proposed. This section presents an overview of these combinations.

3.1 Longer Time Frames

Many of the algorithms described in section 2 use a quasi-stationary speech signal frame of 10 ms to 30 ms. These quasi-stationary signals correspond to a static configuration of the vocal tract. However, steady configurations of the vocal tract contain relatively little linguistic information, and the fundamental linguistic units are likely to be longer. Furthermore, it is hard to distinguish a short-term stationary speech signal from a long-term stationary noise signal within a frame of 10 ms. Therefore, longer time frames of the speech signal seem more appropriate for ASR systems. These longer time frames also correspond to the auditory system of mammals, which processes auditory signals on the order of 200 ms. [1] To achieve feature vectors spanning a longer time frame, n sequential low-level feature vectors within a sliding window are combined into a single feature vector and processed simultaneously [8].
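A small NumPy sketch of this temporal stacking (the frame dimension and window size are arbitrary example values):

```python
import numpy as np

def stack_frames(features, n=9):
    """Stack n sequential feature vectors (centered sliding window) into one long vector.

    features: array of shape (T, d)  -> returns array of shape (T, n * d)
    """
    T, d = features.shape
    half = n // 2
    padded = np.pad(features, ((half, half), (0, 0)), mode="edge")  # repeat edge frames
    return np.stack([padded[t:t + n].reshape(-1) for t in range(T)])

mfcc = np.random.default_rng(0).standard_normal((100, 16))   # e.g. 100 frames of 16-dim MFCCs
stacked = stack_frames(mfcc, n=9)
print(stacked.shape)    # (100, 144): each frame now carries 9 frames of temporal context
```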

3.2 Feature Stacking

Besides the use of longer time frames, feature stacking is used to obtain more robust features for ASR. Feature stacking combines the extracted features of different algorithms as shown in Figure 1. These stacked features should result in a lower word error rate, because the error characteristics differ between feature extraction methods and thus compensate for the limitations of each individual method. [7] Most commonly, the dimensionality of the stacked feature vector is reduced by Principal Component Analysis (PCA) or Linear Discriminant Analysis (LDA) [8]. An example of this method is described in the section on the tandem architecture within the article "Artificial Neural Networks for Feature Extraction".

Figure 1: The features of different feature extraction methods are stacked to achieve a better word error rate [9]

3.3 Artificial Neural Networks

Another approach to generate more robust features for ASR is the post processing of the low-level features with an artificial neural network. Different architectures for this post processing are described in the article "Artificial Neural Networks for Feature Extraction".

4. Modern Feature Extraction

Many modern feature extraction components for ASR systems use all of the previously described combinations of low-level features. For example, the feature extraction component of RWTH Aachen, which yielded the best word error rate for English and German in 2011, uses sequential low-level features combined into a single feature vector, feature stacking of different feature streams, and post-processing of the low-level features with an artificial neural network using the tandem architecture with bottle-neck dimension reduction. As low-level features, this component uses Mel-Frequency Cepstral Coefficients (MFCC) and Perceptual Linear Predictive coefficients (PLP). The first 16 cepstral coefficients of the MFCCs are computed with 20 band-pass filters, cepstral mean normalization and variance normalization. Nine MFCC feature vectors within a sliding window are stacked and the dimensionality is reduced to 45 dimensions using LDA. The additional feature stream of PLP features is extracted and processed similarly to the MFCC features. Finally, the phone posterior probabilities are computed by a hierarchical bottle-neck feed-forward ANN. The probabilistic bottle-neck features are decorrelated by PCA and added to the feature vector. [10]

References

[1] H. Bourlard, H. Hermansky, N. Morgan, “Towards increasing speech recognition error rates”, Speech Communications, vol. 18, pp. 205–231, 1995.

[2] D. O’Shaughnesssy, “Invited paper: Automatic speech recognition: History, methods and challenges”, Pattern Recognition, vol. 41, no. 10, pp. 2965–2979, 2008.

[3] H. Niemann, Klassifikation von Mustern, 2nd ed., Berlin, New York, Tokyo: Springer, 2003.

[4] E. Schukat – Talamazzini, Automatische Spracherkennung, Braunschweig, Wiesbaden: Vieweg Verlag, 1995.

[5] H. Hermansky, B. A. Hanson, H. Wakita, “Perceptually based linear predictive analysis of speech”, Acoustics, Speech, and Signal Processing , vol. 10, pp. 509-512, 1985.

[6] B. Schuller, “Voice and speech analysis in search of states and traits”, in Computer Analysis of Human Behaviour , A. A. Salah, T. Gevers (eds.), Berlin, New York, Tokyo: Springer ,2011, ch. 9, pp. 227-253.

[7] L. Rabiner and B.-H. Juang, Fundamentals of speech recognition, Englewood Cliffs: Prentice-Hall International, 1993.

[8] C. Plahl, “Neural Network based Feature Extraction for Speech and Image Recognition”, Ph.D. dissertation, Dept. Computer Science, RWTH Aachen, Aachen, Ger, 2014 .

[9] A. Hagen, A. Morris, “Recent advances in the multi-stream HMM/ANN hybrid approach to noise robust ASR”, Computer Speech and Language,vol. 19, no. 1, pp. 3–30, 2005.

[10] M. Sundermeyer, M. Nußbaum-Thom, S. Wiesler, C. Plahl, A. El-Desoky Mousa, S. Hahn, D. Nolden, R. Schlüter, H. Ney, “The RWTH 2010 Quaero ASR Evaluation System for English, French, and German”, IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), pp. 2212–2215, 2011.

1 Introduction

The algorithms introduced in this chapter try to extract speech features that preserve the phonetic information of the speech signal. The previously described feature extraction algorithms are Mel-Frequency Cepstral Coefficients (MFCC), wavelet based features using the wavelet transform (WT), Non-negative Matrix Factorization (NMF) and Artificial Neural Networks (ANN). These methods can be classified into two classes: coefficients computed from particular transformations, and hidden features obtained by learning methods.

2 Comparison

MFCC and Wavelet based Feature extraction belong to the first class, while ANN and NMF are learning methods.

2.1 Comparison of Transformations and Computational Cost

For MFCC and WT based features, one common characteristic is that the input signal is segmented into speech segments of roughly 30 ms. However, these two methods use different transformations: MFCC is an application of the Fourier transform (more precisely the STFT), whereas wavelet based features use a wavelet transform, which only scales and translates a mother wavelet. The wavelet is usually chosen from a certain orthonormal family.

Note that the spectrogram obtained by the STFT has a fixed time and frequency resolution, while for the wavelet transform the time resolution increases with frequency [1]. The resolutions are illustrated in Figure 1.


Figure 1: Time and frequency resolution of the STFT and WT [1]

With respect to the transforms used, the computational demands of MFCC and WT differ considerably. For MFCC, the temporal signal is first transformed to the frequency domain with the FFT (computation time O(N log N)) and then processed logarithmically, after which the discrete cosine transform is applied either directly (computation time O(N²)) or via the FFT (computation time O(N log N)). The WT, in contrast, is a temporal approximation using a chosen wavelet and a moving window; its fast implementation requires only O(N) operations, and the extracted features contain scaling and time-shift information.

NMF and ANN are two learning methods in the field of machine learning.

Both of them are iterative processes, which obtain the final results by iterative error estimation and component update. 

NMF, as a matrix factorization method, is used to factorize a speech spectrogram computed by the STFT. The error estimation is realized by particular cost functions comparing the reconstructed and the original signal. The factors of standard NMF are a basis matrix and a hidden variable matrix, as shown in Figure 2. This is somewhat similar to a single-layer feed-forward ANN, as shown in Figure 3: such an ANN has an activation function, a loss function and a corresponding weight update. However, it has to be noted that none of the data in NMF may be negative, while the weights between neurons have no such constraint. Moreover, for both methods the loss functions only guarantee a local minimum, not a global one. [5,6]

The computational demands for these two learning methods are high for both. The computation time depends on the number of iterations and the complexity of the loss function as well as the update function.

 

Figure 2: Hidden variable model of NMF; the lines represent the bases [Wikipedia - NMF]

Figure 3: One-layer ANN; the arrows represent weights [Wikipedia - ANN]

2.2 Combination of Methods with respect to Word Error Rate

To evaluate the quality of a feature extraction method, one common standard is the comparison of the word error rate (WER).

One characteristic of MFCC is the use of the mel scale, which maps frequency from hertz to mel with a logarithmic mapping. This makes the frequency axis follow the behaviour of the human hearing system, so that a mel filter bank can directly extract the perceptually relevant components of human speech. MFCC is therefore the most widely used feature extraction method in speech processing, and as a mature technique it has been used in mobile phones for the recognition of spoken numbers. [2]
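For illustration, one commonly used convention for the mel mapping and a small check in Python:

```python
import numpy as np

def hz_to_mel(f_hz):
    # widely used convention: mel = 2595 * log10(1 + f / 700)
    return 2595.0 * np.log10(1.0 + np.asarray(f_hz) / 700.0)

print(hz_to_mel([100, 1000, 4000]))   # low frequencies are spread out, high ones compressed
```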

The drawback of MFCC is its sensitivity to additive noise. However, by combining it with other techniques, the WER can be reduced and the word accuracy (WA) enhanced, as shown in Figures 4 and 5 [7]. For clean speech MFCC performs well, while in the noisy case the average success percentage drops (from about 80% to 60%). The results of this paper also show that by combining the WT with MFCC the WER can be decreased in both the clean and the noisy case.

Figure 4: Average success percentage of each word for noisy speech [7]

Figure 5: Average success percentage of each word for clean speech [7]

The combination of MFCC and NMF was used in an experiment on noise-robust in-car recognition of spelled letters [4]. The results in Figure 6 show the WA for different combinations of training model and test condition. The noise cases are: smooth inner-city road (CTY), highway (HWY), and cobble road (COB). For the different noisy situations, NMF-combined MFCC features show a clear improvement in WA. M denotes MFCC and M+N denotes MFCC plus NMF features.

Figure 6: WA of in-car spelling recognition for the letters "a" to "z" [4]

3. Conclusion

|                       | MFCC                                       | Wavelet Based                        | NMF                                              | ANN                  |
|-----------------------|--------------------------------------------|--------------------------------------|--------------------------------------------------|----------------------|
| Transformation        | STFT                                       | WT                                   | STFT                                             | x                    |
| Computational demands | ++                                         | +                                    | +++                                              | +++                  |
| Variability           | log mel amplitude offset before DCT        | different wavelets                   | different constraints added to the cost function | structure of neurons |
| Particular cases      | number recognition in mobile communication | human voice and music classification | speech separation                                | robust features      |

 

References:

[1] R. Sarikaya (2000) “High Resolution Speech Feature Parametrization for Monophone-Based Stressed Speech Recognition“ IEEE signal processing letters, vol.7 no.7 pp. 182-185

[2] European Telecommunications Standards Institute (2003), Speech Processing, Transmission and Quality Aspects (STQ); Distributed speech recognition; Front-end feature extraction algorithm; Compression algorithms. Technical standard ES 201 108, v1.1.3.

[3] R. Sarikaya (1998) "Wavelet packet transform features with application to speaker recognition"

[4] B. Schuller (2010) „Non-negative matrix factorization as noise-robust feature extraction for speech recognition“ ICASSP pp.4562-4565

[5] C. Plahl, (2014) “Neural Network based Feature Extraction for Speech and Image Recognition”, Ph.D. dissertation, Dept. Computer Science, RWTH Aachen, Aachen, Ger

[6] D. D. Lee (2000) „Algorithms for non-negative matrix factorization“

[7] M.A.Anusuya (2011) "Comparison of Different Speech Feature Extraction Techniques with and without Wavelet Transform to Kannada Speech Recognition" International Journal of Computer Applications, vol. 26 No.4 pp. 19-24

 

 

Wavelets give a short-term multi-resolution analysis of time, energy and frequency in a speech signal [1]. This article presents wavelet based features as a parametric representation of a speech signal for automatic speech recognition tasks. In the following, the motivation for the wavelet transform and its theory are presented, pointing out the development from the Fourier transform to the wavelet transform. Then the practical realization with a filter bank is introduced. The article concludes with an explanation of a few approaches using wavelet based features in automatic speech recognition.

1. Motivation

Typically, signal processing transforms a time-domain signal into another domain, with the purpose of extracting the characteristic information embedded within the time series that is otherwise not readily observable in its original form. Mathematically, this can be achieved by representing the time-domain signal as a series of coefficients, based on a comparison between the signal and a set of known template functions [2]. These coefficients can then be used by recognition systems such as neural networks or hidden Markov models for stochastic pattern matching.

 

1.1 Fourier Transform (FT)

 

In 1822 Joseph Fourier stated that every signal (periodic or aperiodic) can be described as a weighted composition of periodic functions (sine and cosine waves) that differ in their amplitude and frequency. This weighted composition of a series of periodic functions is defined as the Fourier transform and is mathematically expressed as

$$X(f) = \int_{-\infty}^{\infty} x(t)\, e^{-j 2\pi f t}\, dt$$

with the complex exponential $e^{-j 2\pi f t}$. The term $X(f)$ is denoted as the spectrum of the signal $x(t)$.

The inverse transformation is given by

$$x(t) = \int_{-\infty}^{\infty} X(f)\, e^{j 2\pi f t}\, df$$

A major shortcoming of the Fourier transform with regard to speech processing is that by transforming the signal to the frequency domain, the time information is lost: one cannot relate a frequency in the signal to the time instance at which it occurred. The Fourier transform is therefore a reasonable choice when analysing stationary signals, but since speech signals are highly non-stationary and their fundamental frequencies change over time, valuable time information of the speech signal would be lost. An illustration of this statement is given below, showing a non-stationary sine wave and its spectrum.

Figure 1: A nonstationary signal represented in the time and frequency domain

1.2 Short-Time Fourier Transform (STFT)

The Short-Time Fourier Transform is an extension to the Fourier transform in order to provide temporal information on the spectral analysis of the transformed signal. It is realized by a windowing of the signal and an application of the Fourier transform to this windowed signal. The STFT can be expressed as

$$X(\tau, f) = \int_{-\infty}^{\infty} x(t)\, w(t - \tau)\, e^{-j 2\pi f t}\, dt$$

(Note the difference between the signal $x(t)$ and the window function $w(t - \tau)$.)

The width of the window determines the resolution in time and frequency. If you apply a wide window (consider the extreme case of a window covering the entire signal), you obtain good frequency resolution (a small frequency bandwidth, meaning two quite similar frequencies can still be distinguished) but bad time resolution (the extreme case of a window covering the entire signal simply means applying the Fourier transform, which as described above has no time resolution at all). In other words, the signal cannot be localized arbitrarily well in both time and frequency. This property is related to the Heisenberg uncertainty principle and is called the uncertainty principle of signal processing, mathematically expressed as $\Delta t \, \Delta f \geq \frac{1}{4\pi}$ [3]. The shortcoming of the STFT is therefore its fixed resolution: a trade-off has to be made between good frequency resolution with bad time resolution, or good time resolution with bad frequency resolution. The figure below illustrates the STFT and its inherent tiling of the time-frequency plane. The area of each box corresponds to the product $\Delta t \, \Delta f$, which is a positive constant.

Figure 2: The Short-Time Fourier Transform and its time-frequency resolution

2. Wavelet Transform in theory

Since most biological signals [4], such as speech, contain low-frequency components for a large amount of time and high-frequency components only for short periods, it is desirable to have good time resolution for the high-frequency events and, at the same time, good frequency resolution for the low-frequency events. This is where the wavelet transform comes into play, since it provides exactly that.

2.1 Wavelet Transform (WT)

The general idea of the wavelet transform is similar to the STFT. A windowing function, which for the WT is a wavelet, i.e. a "small" wave of finite length, is multiplied with a segment of the speech signal and then shifted in time until the entire signal is processed. The outcome are coefficients that represent the similarity between the signal segments and the wavelet. In addition, the width of the window is changed by scaling the wavelet, and the transform is applied to the signal again. The wavelet transform is therefore able to analyse the signal at different frequencies with different resolutions. The tiling of the time-frequency plane is shown in the following figure. Side remark: the depicted tiling corresponds to the dyadic scaling applied by the discrete wavelet transform.

Figure 3: Dyadic time-frequency tiling of the (discrete) Wavelet Transform

2.2 Continuous Wavelet Transform

The continuous wavelet transform is expressed as

$$C(a, b) = \int_{-\infty}^{\infty} x(t)\, \psi^{*}_{a,b}(t)\, dt$$

The higher the coefficient C, the larger the similarity between the wavelet and the signal segment. The function $\psi$ is denoted as the mother wavelet or prototype function, and the scaled and shifted versions are expressed as

$$\psi_{a,b}(t) = \frac{1}{\sqrt{|a|}}\, \psi\!\left(\frac{t - b}{a}\right)$$

with $a$ being the parameter for scale and $b$ the parameter for translation. Every wavelet used in the transform is therefore a scaled and/or time-shifted version of the mother wavelet. An explicit form of a wavelet is given in the subsection "Some properties of wavelets" in this article, but most wavelets do not have an explicit form, since they are realized with digital filters.

The algorithm of the continuous Wavelet Transform is as follows and depicted in the figure below

  1. Take a wavelet and compare it to a section at the start of the original signal. Calculate C(a,b).
  2. Shift the wavelet to the right and repeat step 1 until you covered the whole signal.
  3. Scale the wavelet and repeat steps 1 and 2.
    Figure 4: The continuous Wavelet Transform
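A naive NumPy sketch of these three steps, using a real Morlet-like wavelet as an assumed example prototype (the test signal, scales and wavelet parameter are arbitrary illustration values):

```python
import numpy as np

def morlet(t, w0=5.0):
    """Real Morlet-like prototype: a cosine modulated by a Gaussian window."""
    return np.cos(w0 * t) * np.exp(-0.5 * t ** 2)

def cwt(signal, scales, w0=5.0):
    """Naive continuous wavelet transform following the three steps above:
    correlate scaled and shifted versions of the mother wavelet with the signal."""
    n = len(signal)
    k = np.arange(n) - n // 2                     # centered sample index
    coeffs = np.zeros((len(scales), n))
    for i, a in enumerate(scales):                # step 3: loop over scales
        psi = morlet(k / a, w0) / np.sqrt(a)      # scaled, normalized wavelet
        # steps 1 + 2: compare the wavelet with every shift of the signal
        coeffs[i] = np.convolve(signal, psi[::-1], mode="same")
    return coeffs

fs = 1000.0
t = np.arange(0, 1, 1 / fs)
x = np.sin(2 * np.pi * 50 * t * (1 + t))          # chirp: frequency rises over time
C = cwt(x, scales=np.geomspace(2, 64, 16))        # scales given in samples
print(C.shape)                                    # (16, 1000): one row of coefficients per scale
```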

2.3 Discrete Wavelet Transform

In speech processing the signals are digitized and therefore consist of discrete samples. For the sake of completeness the discrete Wavelet transform is briefly introduced.

In the discrete case the scale is dyadic and only integer shifts are possible. This can be written as $a = 2^{j}$ and $b = k\, 2^{j}$ with $j, k \in \mathbb{Z}$. The wavelet transform in the discrete case can then be expressed as

$$C(j, k) = \sum_{n} x[n]\, \psi_{j,k}[n]$$

with $\psi_{j,k}[n]$ being a discrete wavelet defined as $\psi_{j,k}[n] = 2^{-j/2}\, \psi\!\left(2^{-j} n - k\right)$.

2.4 Some properties of Wavelets

  1. Wavelets have compact support. That means outside a compact set, the wavelet's value is zero.
  2. The average value of a wavelet in the time domain is zero; this is the zero-mean property, $\int_{-\infty}^{\infty} \psi(t)\, dt = 0$. Therefore it must have an oscillatory character, like a wave.
  3. The Fourier spectrum of a wavelet must have a bandpass-like character. This is the admissibility condition [ref.], which ensures the existence of an inverse transform: $C_\psi = \int_{-\infty}^{\infty} \frac{|\Psi(\omega)|^2}{|\omega|}\, d\omega < \infty$.

Depicted below are two exemplary wavelets.

The Morlet wavelet (also Gabor wavelet) depicted here is also referred to as the real Morlet wavelet. The prototype function is a composition of a complex exponential multiplied by a Gaussian window. This wavelet is closely related to the human perception of sound [ref.].

The Daubechies 4 wavelet belongs to the Daubechies wavelet family, one of the most used wavelet families due to its simple implementation using the fast wavelet transform, which is realized as a filter bank. The db4 wavelet is realized as a quadrature mirror filter whose coefficients are given in [6]. Depicted below are the wavelet and scaling function of the db4 wavelet. The need for a scaling function becomes obvious when considering the discrete wavelet transform: in the introduction it was said that making the window wider reduces the covered frequency band (= smaller frequency resolution). For the DWT, in every step the wavelet is stretched by a factor of 2, so each step covers only half of the remaining frequency spectrum. This would mean an infinite number of stretches is needed to cover the entire spectrum. To overcome this problem, the scaling function is used as an averaging filter that covers, from one step on, the remaining low-pass part of the spectrum [7].

Figure 5: The real Morlet wavelet function
Figure 6: The Daubechies 4 wavelet and scaling function

3. Fast Wavelet Transform (practical implementation)

Figure 7: The Fast Wavelet Transform realized as a 2-channel filterbank

 

The Fast Wavelet Transform (FWT) realizes the transform of a sampled waveform in the time domain into a sequence of coefficients based on a mother wavelet. It is implemented as a two-channel filter bank using quadrature mirror filters, i.e. filters whose magnitude response is, to a certain extent, a mirror image of that of another filter. The filter coefficients are determined by the mother wavelet. In each step of the filter bank, the output of the downsampled high-pass filtered signal gives the corresponding wavelet feature coefficients, also denoted as the detail coefficients. The output of the downsampled low-pass filtered signal, the approximation coefficients, is the input of the next step of the filter bank. The number of steps of the filter bank corresponds to how often the wavelet is stretched and compared to the signal.

If the detail coefficients are also used as input to a subsequent step of the filter bank, spanning the entire binary tree, the transform is denoted as the wavelet packet decomposition. The FWT can therefore be considered a subset of this more general, more versatile transform. The wavelet packet decomposition maps the time samples into wavelet coefficients. By pruning this wavelet packet tree in a specific way, one can choose an optimal basis of coefficients.
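A short example of a multi-level FWT and a wavelet packet decomposition, assuming the PyWavelets package (pywt) is available; the test signal and decomposition depth are arbitrary example choices:

```python
import numpy as np
import pywt

fs = 8000
t = np.arange(0, 1, 1 / fs)
x = np.sin(2 * np.pi * 440 * t) + 0.1 * np.random.default_rng(0).standard_normal(t.size)

# Fast Wavelet Transform: 3-level filter bank with a Daubechies wavelet
coeffs = pywt.wavedec(x, "db4", level=3)
cA3, cD3, cD2, cD1 = coeffs                  # approximation + detail coefficients per level
print([c.size for c in coeffs])

# Wavelet packet decomposition: the detail branches are split further as well
wp = pywt.WaveletPacket(data=x, wavelet="db4", maxlevel=3)
leaf_nodes = wp.get_level(3, order="freq")   # all 8 subbands at depth 3
print(len(leaf_nodes))
```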

4. Particular approaches and results with respect to speech recognition

A rather simple template pattern matching based on a cross-correlation measure using Daubechies 8 wavelet features for phoneme recognition is presented by Gamulkiewicz et al. [8], demonstrating the applicability of wavelet based features to speech recognition.

Long et al. [5] propose a wavelet based phonetic feature extractor that uses a best basis algorithm [9], which aims to extract the maximum information of a signal by projecting it onto bases in which that signal is best represented. With regard to classification, a basis that most uniquely represents a given class of signals in the presence of other known classes is most desirable. They propose a method that chooses between two dictionaries of features, smooth localised cosine packets and wavelet packets. The selected features are then used to train a two-layer feed-forward neural network. It is reported that the extracted features are suitable for classification tasks, with experiments on small data sets giving 100% accuracy at a 90% confidence threshold.

Ramalingam et al. [10] use different wavelet features to classify whether an audio signal is speech or music. The performance of two classification models, Support Vector Machines and Gaussian Mixture Models, is compared. The former is an effective method for general pattern recognition, mapping the features into a higher-dimensional feature space and finding a linear separating hyperplane that separates the data optimally into different classes. The latter is a parametric classifier that models the probability distribution of the feature vectors. In brief, it is reported that the best classification performance was achieved with the Daubechies 8 wavelet, compared to the Haar and the Symlet 2 wavelets, and that the Gaussian Mixture Model, with an accuracy of 95.4%, was slightly better than the Support Vector Machine.

References

[1] B. Schuller. Voice and speech analysis in search of states and traits. In A. A. Salah, T. Gevers (eds.) Computer Analysis of Human Behaviour, Advances in Pattern Recognition, chapter 9, pp. 227-253, Springer, Berlin (2011).

[2] R. X. Gao, R. Yan, From Fourier Transform to Wavelet Transform: A Historical Perspective. In Wavelets Theory and Applications for Manufacturing, Chapter 2, pp. 17-32, Springer US, Dec 2010.

[3] J. D. Jackson, Classical Electrodynamics, 2nd edition (Wiley, NY, 1975), pp. 299-306

[4] P. S. Addison, J. Walker, R. C. Guido. Time-Frequency Analysis of Biosignals. In Engineering in Medicine and Biology Magazine IEEE, vol. 28, pp. 14-29, Sept 2009.

[5] C. J. Long, S. Datta. Wavelet Based Feature Extraction for Phoneme Recognition. In Fourth International Conference on Spoken Language, pp. 264-267, vol. 1, Oct 1996.

[6] M. J. Mohlenkamp, A Tutorial on Wavelets and Their Applications. University of Colorado at Boulder, Department of Applied Mathematics.

[7] C. Valens. A Really Friendly Guide to Wavelets. 1999.

[8] B. Gamulkiewicz, M. Weeks. Wavelet Based Speech Recognition. Computer Science Department, Georgia State University, 2003.

[9] R. Coifman, M. Wickerhauser. Entropy-based algorithms for best-basis selection. In IEEE Trans. on Information Theory, vol. 38, no. 2, 1992.

[10] T. Ramalingam, P. Dhanalakshmi. Speech/Music Classification using Wavelet Based Feature Extraction Techniques. In Journal of Computer Science 10 (1), pp. 34-44, 2014.

 

1. Introduction

Modern feature extraction algorithms for speech signals incorporate an artificial neural network (ANN). These algorithms take the feature vectors of Mel-Frequency Cepstral Coefficients (MFCC), Linear Prediction or a wavelet transform, computed for time frames in the range of 10 ms, and post-process them using an ANN. To incorporate the perspective of longer time frames, the n MFCC vectors within a sliding window of size n are used as input for the ANN. The number of input neurons therefore varies with the size of the sliding window and the dimensionality of the feature vector. The trained target classes, the number of hidden layers and the number of output neurons can be varied. [1] Generally, two different architectures, the tandem architecture and the hybrid architecture, can be used to include ANNs in feature extraction.

2. ANN Architectures

2.1 Hybrid Architecture

The flow chart of an automatic speech recognition (ASR) system using a hybrid architecture is shown in Figure 1. First, a low-level feature extraction method is used to compute n sequential feature vectors. These feature vectors are used as input for the ANN, which computes the posterior probability of the trained classes. The posterior probability is described by

$$p(s_i \mid x_j)$$

where $s_i$ is the i-th class and $x_j$ is the j-th feature vector of the previously used feature extraction algorithm. The probability distribution is used as input for a Hidden Markov Model (HMM) to transcribe the spoken word. [1,2] A complete description of the hybrid approach can be found in the article Context-Dependent Deep Neural Networks: Breakthrough of Neural Networks for Speech Recognition.

Figure 1: Block diagram of the hybrid ANN architecture

 

2.2 Tandem Architecture

The tandem architecture extends the low-level feature extraction methods with the probability distributions of the target classes. The flow chart of the tandem approach is shown in Figure 2.

Figure 2: Block diagram of the tandem ANN architecture

 

The n short-time feature vectors are used as inputs for the ANN, and the probability distribution of the target classes is computed. This probability distribution extends the feature vector: the extended feature vector consists of the n feature vectors from the short-time feature extraction algorithm and the probability distribution computed by the ANN. Principal Component Analysis (PCA) or Linear Discriminant Analysis (LDA) reduces the high dimensionality of the extended feature vector. Figure 3 shows two potential methods to stack the separate feature streams into a single feature vector: the first method (Figure 3.a) reduces the dimension of each feature stream separately, while the second method (Figure 3.b) reduces the dimension of both feature streams in a single global transformation. [1]

Figure 3: a) The dimension of each feature stream is reduced separately. b) The dimension of the feature vector is reduced in a global transformation. [1]
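A minimal sketch of the tandem combination corresponding to Figure 3.b (one global transformation), assuming the low-level features and the ANN posteriors are already available as NumPy arrays and using scikit-learn's PCA; all dimensions are example values:

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
T = 1000                                              # number of frames
mfcc_stacked = rng.standard_normal((T, 144))          # e.g. 9 stacked 16-dim MFCC vectors
ann_posteriors = rng.dirichlet(np.ones(40), size=T)   # e.g. 40 phone-class posteriors per frame

# tandem feature vector: low-level features extended by the ANN posteriors
extended = np.hstack([mfcc_stacked, ann_posteriors])  # shape (T, 184)

# Figure 3.b style: one global dimensionality reduction over the combined stream
reduced = PCA(n_components=45).fit_transform(extended)
print(reduced.shape)                                  # (1000, 45) vectors fed to the HMM back-end
```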

 

3.0 Bottle-neck Features

A different non-linear dimension reduction technique for ANNs is bottle-neck processing. Some research suggests that bottle-neck dimension reduction can learn a better low-dimensional representation than PCA [1]. For bottle-neck processing, a hidden layer within a deep neural network is replaced by a layer whose number of neurons corresponds to the target dimensionality. Figure 4.a shows a bottle-neck layer with 4 neurons within the hidden layers. During training the complete ANN is trained on the target classes. After training, all layers beyond the bottle-neck layer are removed and the outputs of the neurons in the bottle-neck layer are used as features. This method is referred to as probabilistic bottle-neck features. [1,3,4] Experimental results showed that an additional decorrelation step using LDA or PCA is necessary to achieve competitive results [1].

Figure 4: Artificial neural network with a bottle-neck layer of 4 neurons within the hidden layers during training (a) and decoding (b). [1]
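A hedged sketch of the bottle-neck idea using PyTorch (assumed to be available); the layer sizes, the 4-dimensional bottle-neck and the number of phone classes are illustrative, and the training loop is omitted:

```python
import torch
import torch.nn as nn

IN_DIM, BN_DIM, N_CLASSES = 9 * 16, 4, 40   # stacked input, bottle-neck width, phone classes (examples)

class BottleneckNet(nn.Module):
    def __init__(self):
        super().__init__()
        # layers kept after training: input up to (and including) the bottle-neck layer
        self.front = nn.Sequential(nn.Linear(IN_DIM, 512), nn.Sigmoid(),
                                   nn.Linear(512, BN_DIM))
        # layers removed after training: bottle-neck activation up to the class outputs
        self.back = nn.Sequential(nn.Sigmoid(),
                                  nn.Linear(BN_DIM, 512), nn.Sigmoid(),
                                  nn.Linear(512, N_CLASSES))

    def forward(self, x):                    # used during training with a cross-entropy loss
        return self.back(self.front(x))

    def bottleneck_features(self, x):        # used during decoding (Figure 4.b)
        return self.front(x)

net = BottleneckNet()
frames = torch.randn(100, IN_DIM)            # 100 stacked low-level feature vectors
feats = net.bottleneck_features(frames)      # 4-dim bottle-neck features (decorrelate with PCA/LDA)
print(feats.shape)                           # torch.Size([100, 4])
```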

References

[1] C. Plahl, “Neural Network based Feature Extraction for Speech and Image Recognition”, Ph.D. dissertation, Dept. Computer Science, RWTH Aachen, Aachen, Ger, 2014 .

[2] A. Hagen, A. Morris, “Recent advances in the multi-stream HMM/ANN hybrid approach to noise robust ASR”, Computer Speech and Language,vol. 19, no. 1, pp. 3–30, 2005.

[3] D. E. Rumelhart, G. E. Hinton, R. J. Williams, “Learning internal representations by error propagation”, Parallel Distributed Processing: Explorations in the Microstructure of Cognition, vol. 1, pp. 318–362, 1985.

[4] F. Grézl, M. Karafiat, S. Kontar, J. Cernock, “Probabilistic and Bottle- Neck Features for LVCSR of Meetings”, IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP) vol. 4, pp. 757–760, 2007.
