Wavelets give a short-term multi-resolution analysis of time, energy and frequency in a speech signal [1]. This article presents wavelet-based features as a parametric representation of a speech signal for automatic speech recognition tasks. In the following, the motivation for the wavelet transform and its theory are presented, tracing the development from the Fourier transform to the wavelet transform. Then the practical realization with a filterbank is introduced. The article concludes with an explanation of a few approaches using wavelet-based features in automatic speech recognition.

1. Motivation

Typically, signal processing transforms a time-domain signal into another domain, with the purpose of extracting characteristic information embedded within the time series that is otherwise not readily observable in its original form. Mathematically, this can be achieved by representing the time-domain signal as a series of coefficients, based on a comparison between the signal and a set of known template functions [2]. These coefficients can then be used by recognition systems such as neural networks or hidden Markov models for stochastic pattern matching.


1.1 Fourier Transform (FT)


In 1822 Joseph Fourier stated that every signal (periodic or aperiodic) can be described by a weighted composition of periodic functions (sine and cosine waves) that differ in their amplitude and frequency. This weighted composition of a series of periodic functions is defined as the Fourier transform, and it is mathematically expressed as

    X(f) = \int_{-\infty}^{\infty} x(t)\, e^{-j 2\pi f t}\, dt

with the complex exponential e^{-j 2\pi f t}. The term X(f) is denoted as the spectrum of the signal x(t).

The inverse transformation is given by

    x(t) = \int_{-\infty}^{\infty} X(f)\, e^{j 2\pi f t}\, df

A major shortcoming of the Fourier transform with regard to speech processing is that, by transforming the signal to the frequency domain, the time information is lost: one cannot relate a frequency component of the signal to the time instant at which it occurred. The Fourier transform is therefore a reasonable choice when analysing stationary signals, but speech signals are highly non-stationary, with fundamental frequencies changing over time, so valuable time information of the speech signal would be lost. An illustration of this statement is given below, showing a nonstationary sine wave and its spectrum.

Figure 1: A nonstationary signal represented in the time and frequency domain
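This loss of time information can be made tangible with a small numerical sketch (assuming NumPy; the sampling rate and signal are illustrative choices, not taken from the article): a signal that switches from 50 Hz to 200 Hz shows both peaks in its spectrum, but the spectrum alone does not reveal when each frequency was present.

```python
import numpy as np

fs = 1000                           # assumed sampling rate in Hz
t = np.arange(0, 2.0, 1 / fs)       # 2 s time axis

# Nonstationary signal: 50 Hz during the first second, 200 Hz during the second.
x = np.where(t < 1.0, np.sin(2 * np.pi * 50 * t), np.sin(2 * np.pi * 200 * t))

# Magnitude spectrum of the whole signal.
X = np.abs(np.fft.rfft(x))
f = np.fft.rfftfreq(len(x), 1 / fs)

# Both frequencies show up as peaks, but the spectrum alone does not tell
# us *when* each of them occurred.
print(np.sort(f[np.argsort(X)[-2:]]))   # [ 50. 200.]
```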

1.2 Short-Time Fourier Transform (STFT)

The Short-Time Fourier Transform is an extension of the Fourier transform that provides temporal information alongside the spectral analysis of the transformed signal. It is realized by windowing the signal and applying the Fourier transform to each windowed segment. The STFT can be expressed as

    X(\tau, f) = \int_{-\infty}^{\infty} x(t)\, w(t - \tau)\, e^{-j 2\pi f t}\, dt

where w(t) is the window function (note the difference between the running time t and the window position \tau).

The width of the window determines the resolution in time and frequency. If you apply a wide window (assume the extreme case of a window covering the entire signal), you get good frequency resolution (the frequency bandwidth is small, so two quite similar frequencies can still be distinguished) but bad time resolution (the extreme case of a window covering the entire signal simply reduces to the Fourier transform described above, i.e. no time resolution at all). In other words, the signal cannot be localized arbitrarily well in both time and frequency. This property is related to the Heisenberg uncertainty principle and is called the uncertainty principle of signal processing, mathematically expressed as \Delta t \, \Delta f \geq \frac{1}{4\pi} [3]. The shortcoming of the STFT is therefore its fixed resolution: you have to make a tradeoff, either good frequency resolution with bad time resolution, or good time resolution with bad frequency resolution. The figure below illustrates the STFT and its inherent tiling of the time-frequency plane. The area of each box corresponds to the product \Delta t \, \Delta f, which is a positive constant.

Figure 2: The Short-Time Fourier Transform and its time-frequency resolution
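The tradeoff can be observed directly with a naive STFT sketch (again assuming NumPy; window lengths and hop sizes are illustrative): the wide window yields fine frequency bins but few time frames, the narrow window the opposite.

```python
import numpy as np

def naive_stft(x, win_len, hop):
    """Hann-windowed frames, one FFT per frame (illustrative sketch)."""
    win = np.hanning(win_len)
    frames = [x[i:i + win_len] * win
              for i in range(0, len(x) - win_len + 1, hop)]
    return np.abs(np.fft.rfft(np.asarray(frames), axis=1))

fs = 1000
t = np.arange(0, 2.0, 1 / fs)
x = np.where(t < 1.0, np.sin(2 * np.pi * 50 * t), np.sin(2 * np.pi * 200 * t))

# Wide window: fine frequency bins (fs/512 ~ 2 Hz) but few time frames.
S_wide = naive_stft(x, win_len=512, hop=256)
# Narrow window: coarse frequency bins (fs/64 ~ 16 Hz) but many time frames.
S_narrow = naive_stft(x, win_len=64, hop=32)
print(S_wide.shape, S_narrow.shape)   # (6, 257) vs. (61, 33)
```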

2. Wavelet Transform in theory

Most biological signals [4], such as speech, contain low-frequency components that last for relatively long durations and high-frequency components that occur only in short bursts. It is therefore desirable to have good time resolution for the high-frequency events and, at the same time, good frequency resolution for the low-frequency events. This is where the wavelet transform comes into play, since it provides exactly that.

2.1 Wavelet Transform (WT)

The general idea of the Wavelet transform is similar to the STFT. A windowing function, which for the WT is a wavelet, i.e. a "small" wave of finite length, is multiplied with a segment of the speech signal and then shifted in time until the entire signal is processed. The outcome is a set of coefficients that represent the similarity between the signal segments and the wavelet. In addition, the width of the window is changed by scaling the wavelet, and the transform is applied to the signal again. The Wavelet transform is therefore able to analyse the signal at different frequencies with different resolutions. The tiling of the time-frequency plane is shown in the following figure. Side remark: the depicted tiling results from the dyadic scaling applied by the discrete Wavelet transform.

Figure 3: Dyadic time-frequency tiling of the (discrete) Wavelet Transform

2.2 Continuous Wavelet Transform

The continuous Wavelet transform is expressed as

    C(a, b) = \int_{-\infty}^{\infty} x(t)\, \psi_{a,b}^{*}(t)\, dt

The higher the coefficient C, the greater the similarity between the wavelet and the signal segment. The function \psi(t) is denoted as the mother wavelet or prototype function, and the scaled and shifted versions used in the transform are expressed as

    \psi_{a,b}(t) = \frac{1}{\sqrt{a}}\, \psi\!\left(\frac{t - b}{a}\right)

with a being the parameter for scale and b the parameter for translation. Every wavelet used in the transform is therefore a scaled and/or time-shifted version of the mother wavelet. An explicit form of a wavelet is given in the subsection "Some properties of Wavelets" in this article, but most wavelets do not have an explicit form, since they are realized with digital filters.

The algorithm of the continuous Wavelet Transform is as follows; it is depicted in the figure below, and a code sketch follows after the figure caption.

  1. Take a wavelet and compare it to a section at the start of the original signal. Calculate C(a,b).
  2. Shift the wavelet to the right and repeat step 1 until you have covered the whole signal.
  3. Scale the wavelet and repeat steps 1 and 2.

Figure 4: The continuous Wavelet Transform
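A direct, unoptimized implementation of these three steps might look as follows (a sketch assuming NumPy; the real Morlet-like prototype and all parameters are illustrative choices, not a prescribed implementation):

```python
import numpy as np

def morlet(t):
    """Real Morlet-like prototype: a cosine modulated by a Gaussian (illustrative)."""
    return np.cos(5 * t) * np.exp(-t**2 / 2)

def naive_cwt(x, fs, scales):
    """Direct implementation of steps 1-3: correlate, shift, rescale."""
    n = len(x)
    coeffs = np.zeros((len(scales), n))
    for i, a in enumerate(scales):                  # step 3: loop over scales
        half = int(4 * a * fs)                      # cover the effective support
        psi = morlet(np.arange(-half, half + 1) / (a * fs)) / np.sqrt(a)
        for b in range(n):                          # steps 1-2: shift over the signal
            lo, hi = b - half, b + half + 1
            seg = x[max(lo, 0):min(hi, n)]
            w = psi[max(0, -lo):len(psi) - max(0, hi - n)]
            coeffs[i, b] = np.sum(seg * w) / fs     # approximates C(a, b)
    return coeffs

fs = 100
t = np.arange(0, 1, 1 / fs)
x = np.sin(2 * np.pi * 5 * t)
print(naive_cwt(x, fs, scales=[0.05, 0.1, 0.2]).shape)   # (3, 100)
```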

2.3 Discrete Wavelet Transform

In speech processing the signals are digitized and therefore consist of discrete samples. For the sake of completeness the discrete Wavelet transform is briefly introduced.

In the discrete case the scale is dyadic and only integer shifts are possible. This can be written as a = 2^{j} and b = k \cdot 2^{j} with j, k \in \mathbb{Z}. The wavelet transform in the discrete case can then be expressed as

    C(j, k) = \sum_{n} x[n]\, \psi_{j,k}^{*}[n]

with \psi_{j,k}[n] being a discrete wavelet defined as

    \psi_{j,k}[n] = 2^{-j/2}\, \psi(2^{-j} n - k)
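A single level of this dyadic decomposition can be tried out with the third-party PyWavelets package (pywt), which implements the discrete wavelet transform; the example below is a minimal sketch using the Haar wavelet:

```python
import numpy as np
import pywt

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0])

# One level of the DWT with the Haar wavelet: approximation (low-pass)
# and detail (high-pass) coefficients, each half the input length.
cA, cD = pywt.dwt(x, 'haar')
print(cA)   # pairwise averages scaled by sqrt(2)
print(cD)   # pairwise differences scaled by sqrt(2)

# The transform is invertible: reconstruction recovers the input.
print(pywt.idwt(cA, cD, 'haar'))
```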

2.4 Some properties of Wavelets

  1. Wavelets have compact support. That means that outside a compact set, the wavelet's value is zero.
  2. The average value of a wavelet in the time domain is zero; this is known as the zero mean property:

         \int_{-\infty}^{\infty} \psi(t)\, dt = 0

     A wavelet must therefore have an oscillatory character, like a wave.
  3. The Fourier spectrum \Psi(\omega) of a wavelet must have a bandpass-like character. This is known as the admissibility condition [ref.], which ensures the existence of an inverse transform:

         C_{\psi} = \int_{-\infty}^{\infty} \frac{|\Psi(\omega)|^{2}}{|\omega|}\, d\omega < \infty

     These properties can be checked numerically, as sketched after this list.
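The sketch below verifies the zero mean and bandpass properties for a real Morlet-like function (assuming NumPy; the constant 5 is an illustrative choice for the centre frequency):

```python
import numpy as np

t = np.linspace(-8, 8, 4096)
dt = t[1] - t[0]
# Real Morlet-like wavelet: cosine modulated by a Gaussian window.
psi = np.cos(5 * t) * np.exp(-t**2 / 2)

# Property 2: (numerically) zero mean.
print(psi.sum() * dt)                     # ~ 0

# Property 3: bandpass character of the spectrum.
Psi = np.abs(np.fft.rfft(psi))
f = np.fft.rfftfreq(len(t), dt)
print(Psi[0])                             # near-zero DC component
print(f[np.argmax(Psi)])                  # peak near 5 / (2*pi) ~ 0.8 Hz, not at 0
```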

Depicted below are two exemplary wavelets.

The Morlet wavelet (also Gabor wavelet) depicted here is also referred to as the real Morlet wavelet. The prototype function is a complex exponential multiplied by a Gaussian window; the real Morlet wavelet uses its real part, a cosine. This wavelet is closely related to the human perception of sound [ref.].

The Daubechies 4 wavelet is a wavelet from the Daubechies wavelet family, one of the most widely used families due to their simple implementation using the fast Wavelet transform, which is realized as a filterbank. The db4 wavelet is realized as a quadrature mirror filter with the coefficients

    h = \left( \frac{1+\sqrt{3}}{4\sqrt{2}},\; \frac{3+\sqrt{3}}{4\sqrt{2}},\; \frac{3-\sqrt{3}}{4\sqrt{2}},\; \frac{1-\sqrt{3}}{4\sqrt{2}} \right)  [6]

Depicted below are the wavelet and scaling function of the db4 wavelet. The need for a scaling function becomes obvious if you consider the discrete Wavelet transform. In the introduction it was said that making the window wider reduces the covered frequency band (i.e. a narrower bandwidth, hence finer frequency resolution). For the DWT, in every step the wavelet is stretched by a factor of 2, so each step covers only half of the remaining frequency spectrum. This would require an infinite number of stretches to cover the entire frequency spectrum. To overcome this problem, the scaling function is used as an averaging filter that covers, from a given step on, the remaining low-pass part of the frequency spectrum [7].

Figure 5: The real Morlet wavelet function
Figure 6: The Daubechies 4 wavelet and scaling function
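The quadrature mirror construction and the characteristic filter properties can be verified directly from the four coefficients given above (a sketch assuming NumPy):

```python
import numpy as np

s3, s2 = np.sqrt(3), np.sqrt(2)

# Low-pass (scaling) filter of the 4-tap Daubechies wavelet.
h = np.array([1 + s3, 3 + s3, 3 - s3, 1 - s3]) / (4 * s2)

# High-pass (wavelet) filter: reversed low-pass with alternating signs
# (the quadrature mirror construction).
g = h[::-1] * np.array([1, -1, 1, -1])

print(np.sum(h))            # sqrt(2): preserves the signal mean
print(np.sum(g))            # 0: zero mean property of the wavelet
print(np.dot(h, h))         # 1: orthonormality
print(np.dot(h, g))         # 0: the two channels are orthogonal
```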

3. Fast Wavelet Transform (practical implementation)

Figure 7: The Fast Wavelet Transform realized as a 2-channel filterbank


The Fast Wavelet Transform (FWT) realizes the transform of a sampled waveform in the time domain into a sequence of coefficients based on a mother wavelet. It is implemented as a two-channel filterbank using quadrature mirror filters, i.e. filters whose magnitude responses are, to a certain extent, mirror images of each other. The filter coefficients are determined by the mother wavelet. For each step of the filterbank, the outcome of the downsampled high-pass (g) filtered signal are the corresponding wavelet-based feature coefficients, also denoted as the detail coefficients. The outcome of the downsampled low-pass (h) filtered signal, the approximation coefficients, is the input of the next step of the filterbank. The number of steps of the filterbank corresponds to how often the wavelet is stretched and compared to the signal.
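A minimal sketch of such a filterbank step and its cascade is given below (assuming NumPy and reusing the 4-tap Daubechies filters shown earlier; handling of the signal borders is deliberately simplified):

```python
import numpy as np

def analysis_step(x, h, g):
    """One filterbank step: filter with h/g, then keep every second sample."""
    a = np.convolve(x, h)[::2]   # approximation (low-pass) coefficients
    d = np.convolve(x, g)[::2]   # detail (high-pass) coefficients
    return a, d

def fwt(x, h, g, levels):
    """Cascade: feed the approximation into the next step, keep the details."""
    details = []
    a = np.asarray(x, dtype=float)
    for _ in range(levels):
        a, d = analysis_step(a, h, g)
        details.append(d)
    return a, details

s3, s2 = np.sqrt(3), np.sqrt(2)
h = np.array([1 + s3, 3 + s3, 3 - s3, 1 - s3]) / (4 * s2)
g = h[::-1] * np.array([1, -1, 1, -1])

a, details = fwt(np.sin(np.linspace(0, 8 * np.pi, 64)), h, g, levels=3)
print([len(d) for d in details], len(a))
```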

If the detail coefficients are at the same time also used as input to a subsequent step of the filterbank, spanning the entire binary tree, the transform is denoted as the Wavelet Packet decomposition. The FWT can therefore be considered a subset of this generalized, more versatile transform. The Wavelet Packet decomposition maps N time samples into N wavelet coefficients. By pruning the wavelet packet tree in a specific way, one can choose an optimal basis of coefficients.
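Spanning the full binary tree only requires recursing on both channels; the sketch below extends the previous example (again with simplified border handling, as an illustration rather than a production implementation):

```python
import numpy as np

def analysis_step(x, h, g):
    """One filterbank step: filter, then keep every second sample."""
    return np.convolve(x, h)[::2], np.convolve(x, g)[::2]

def wavelet_packet(x, h, g, levels):
    """Full binary tree: recurse on both the approximation and the detail branch."""
    if levels == 0:
        return [x]
    a, d = analysis_step(x, h, g)
    return (wavelet_packet(a, h, g, levels - 1)
            + wavelet_packet(d, h, g, levels - 1))

s3, s2 = np.sqrt(3), np.sqrt(2)
h = np.array([1 + s3, 3 + s3, 3 - s3, 1 - s3]) / (4 * s2)
g = h[::-1] * np.array([1, -1, 1, -1])

# 2^3 = 8 leaf nodes, together covering the whole frequency axis.
print(len(wavelet_packet(np.sin(np.linspace(0, 8 * np.pi, 64)), h, g, 3)))   # 8
```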

4. Particular approaches and results with respect to speech recognition

A rather simple template pattern matching based on a cross-correlation measure, using Daubechies 8 wavelet features for phoneme recognition, is presented by Gamulkiewicz et al. [8], demonstrating the applicability of wavelet-based features to speech recognition.

Long et al. [5] propose a wavelet-based phonetic feature extractor that uses a best basis algorithm [9], which aims to extract the maximum information from a signal by projecting it onto the bases in which that signal is best represented. With regard to classification, a basis that most uniquely represents a given class of signal in the presence of other known classes is most desirable. They propose a method that decides between two dictionaries of features, smooth localised cosine packets and wavelet packets. The selected features are then used to train a two-layer feed-forward neural network. It is reported that the extracted features are suitable for classification tasks, with experiments on small data sets giving 100% accuracy at a 90% confidence threshold.

Ramalingam et al. [10] use different wavelet features to classify whether an audio signal is speech or music. The performance of two classification models, Support Vector Machines and Gaussian Mixture Models, is compared. The former is an effective method for general pattern recognition that maps the features into a higher-dimensional feature space and finds a linear separating hyperplane that optimally separates the data into different classes. The latter is a parametric classifier that models the probability distribution of the feature vectors. In brief, it is reported that the best classification performance was achieved using the Daubechies 8 wavelet, compared to the Haar and Symlet 2 wavelets, and that the results of the Gaussian Mixture Model, with an accuracy of 95.4%, were slightly better than the classification results using Support Vector Machines.

References

[1] B. Schuller. Voice and speech analysis in search of states and traits. In A. A. Salah, T. Gevers (eds.) Computer Analysis of Human Behaviour, Advances in Pattern Recognition, chapter 9, pp. 227-253, Springer, Berlin (2011).

[2] R. X. Gao, R. Yan, From Fourier Transform to Wavelet Transform: A Historical Perspective. In Wavelets Theory and Applications for Manufacturing, Chapter 2, pp. 17-32, Springer US, Dec 2010.

[3] J. D. Jackson, Classical Electrodynamics, 2nd edition (Wiley, NY, 1975), pp. 299-306.

[4] P. S. Addison, J. Walker, R. C. Guido. Time-Frequency Analysis of Biosignals. In Engineering in Medicine and Biology Magazine IEEE, vol. 28, pp. 14-29, Sept 2009.

[5] C. J. Long, S. Datta. Wavelet Based Feature Extraction for Phoneme Recognition. In Fourth International Conference on Spoken Language Processing (ICSLP), pp. 264-267, vol. 1, Oct 1996.

[6] M. J. Mohlenkamp, A Tutorial on Wavelets and Their Applications. University of Colorado at Boulder, Department of Applied Mathematics.

[7] C. Valens. A Really Friendly Guide to Wavelets. 1999.

[8] B. Gamulkiewicz, M. Weeks. Wavelet Based Speech Recognition. Computer Science Department, Georgia State University, 2003.

[9] R. Coifman, M. Wickerhauser. Entropy-based algorithms for best-basis selection. In IEEE Trans. on Information Theory, vol. 38, no. 2, 1992.

[10] T. Ramalingam, P. Dhanalakshmi. Speech/Music Classification using Wavelet Based Feature Extraction Techniques. In Journal of Computer Science 10 (1), pp. 34-44, 2014.
