## Cepstral mean normalization

Variable channel characteristics are a typical problem for speech recognition systems, especially when both stationary and non-stationary noise are present. Cepstral Mean Normalization (CMN) was developed to compensate for these variations. In speech recognition, CMN is fundamental to feature extraction operations such as Mel-Frequency Cepstral Coefficients.

# 1. Basic idea

In the CMN algorithm, the mean of the cepstral coefficients over the whole utterance (the set of cepstral vectors) is subtracted from each frame (a single cepstral vector): $x_{t}=z_{t}-m$, where $m$ is the mean value and $x_{t}$ and $z_{t}$ are the normalized and original cepstral vectors of frame $t$.

One drawback of this process is that no frame can be processed before the whole utterance has been received, which makes it unsuitable for live speech recognition.
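The subtraction above can be sketched in a few lines of NumPy (a minimal illustration; the array shapes and the constant channel bias are assumptions for the demo):

```python
import numpy as np

def cmn(cepstra):
    """Cepstral mean normalization over a whole utterance.

    cepstra: array of shape (T, D) -- T frames, D cepstral coefficients.
    Returns the normalized frames x_t = z_t - m.
    """
    m = cepstra.mean(axis=0)              # mean cepstral vector over the utterance
    return cepstra - m

# toy check: a constant channel offset is removed
rng = np.random.default_rng(0)
z = rng.normal(size=(100, 13)) + 5.0      # 5.0 mimics a stationary channel bias
x = cmn(z)
print(np.allclose(x.mean(axis=0), 0.0))   # the normalized mean is (numerically) zero
```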

# 2. Dynamic CMN

To overcome this drawback, the dynamic CMN algorithm was introduced for live speech. It uses an IIR filter to maintain a running average of the cepstral coefficients of the incoming utterance. Initial cepstral mean values are derived from the training data and stored for later use. If these initial values turn out to be inaccurate for the received utterance, the filter is updated accordingly before the running average is computed. In addition, a parameter specifies the number of frames the IIR filter must process before its average is used for normalization [1].
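A minimal sketch of such an online scheme, with a first-order IIR (exponential) running mean standing in for the filter described above; the parameter names `alpha` and `warmup` and the fallback behaviour before warm-up are assumptions, not from a specific toolkit:

```python
import numpy as np

def dynamic_cmn(frames, m_init, alpha=0.02, warmup=10):
    """Online CMN sketch: an IIR running mean replaces the utterance-level mean.

    m_init: mean vector stored from training data.
    warmup: number of frames the filter processes before its average is used.
    """
    m = m_init.astype(float).copy()
    out = np.empty_like(frames, dtype=float)
    for t, z in enumerate(frames):
        m = (1.0 - alpha) * m + alpha * z     # IIR update of the running mean
        # before warm-up, fall back to the stored training mean
        out[t] = z - (m if t >= warmup else m_init)
    return out

rng = np.random.default_rng(0)
frames = rng.normal(size=(50, 13)) + 3.0      # biased cepstral frames
m0 = np.zeros(13)
norm = dynamic_cmn(frames, m0)
```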

# 3. Efficient cepstral normalization techniques

In this section we introduce other cepstrum-based compensation techniques that achieve higher accuracy at a higher computational cost. Comparing them with cepstral mean normalization gives a clearer picture of cepstral normalization techniques in general.

## 3.1 SDCN

SNR-Dependent Cepstral Normalization applies an additive correction vector in the cepstral domain that depends only on the instantaneous SNR of the signal. This correction vector captures the spectral difference, as a function of SNR, between simultaneously recorded speech samples from the training and testing environments. At low SNRs the vector subtracts the noise, while at higher SNRs it compensates for the spectral difference between the two environments. The algorithm is simple but requires environment-specific training [2].

## 3.2 CDCN

In Codeword-Dependent Cepstral Normalization, codewords of the speech are generated from the training database. Expectation Maximization is then applied to obtain maximum-likelihood (ML) estimates of the parameters describing the additive noise and the linear filtering. The ML estimator produces cepstral coefficients that best match the cepstral coefficients of speech in the testing environment to the positions of the codewords in the training environment.

One advantage of CDCN is that it works without prior knowledge of the testing environment. Its disadvantage is the higher computational expense: the structural knowledge about the nature of the speech signal degradations is difficult to obtain, but it improves the accuracy of the algorithm considerably [2].

## 3.3 FCDCN

Fixed Codeword-Dependent Cepstral Normalization uses the instantaneous SNR of the signal to generate an additive correction vector (like SDCN), and this vector also differs from codeword to codeword in the training environment (like CDCN).

$x = z + r[k,l]$    (1)

As in (1), for each frame of the speech signal, $z$ is the cepstral vector of the corrupted signal, $x$ is the cepstral vector of the compensated signal, $l$ is the SNR index, $k$ is the index of the VQ codeword and $r[k,l]$ is the correction vector.

$\left\| z + r[k,l] - c[k] \right\|^{2}$    (2)

The vector $c[k]$ is the codebook vector chosen from the training database such that (2) is minimized. Using Expectation Maximization, new correction vectors are estimated so that they maximize the likelihood of the data. The algorithm proceeds as follows:

I. Estimate the initial values of $r'[k,l]$ and $\sigma^{2}[l]$.

II. Estimate the posterior probabilities of the mixture components:

$f_{i}[k]=\frac{\exp\left(-\frac{1}{2\sigma^{2}[l_{i}]}\left\|z_{i}+r'[k,l_{i}]-c[k]\right\|^{2}\right)}{\sum_{p=0}^{K-1}\exp\left(-\frac{1}{2\sigma^{2}[l_{i}]}\left\|z_{i}+r'[p,l_{i}]-c[p]\right\|^{2}\right)}$    (3)

where $i$ is the frame index and $l_{i}$ is the instantaneous SNR of the $i$th frame.

III. Maximize the likelihood of the data with new estimates of $r[k,l]$ and $\sigma^{2}[l]$:

$r[k,l]=\frac{\sum_{i=0}^{N-1}(x_{i}-z_{i})f_{i}[k]\delta[l-l_{i}]}{\sum_{i=0}^{N-1}f_{i}[k]\delta[l-l_{i}]}$    (4)

$\sigma^{2}[l]=\frac{\sum_{i=0}^{N-1}\sum_{k=0}^{K-1}\left\|x_{i}-z_{i}-r[k,l]\right\|^{2}f_{i}[k]\delta[l-l_{i}]}{\sum_{i=0}^{N-1}\sum_{k=0}^{K-1}f_{i}[k]\delta[l-l_{i}]}$    (5)

IV. Go to step II if the algorithm has not converged; otherwise stop.

Convergence is typically achieved within two or three iterations if the initial correction vectors are taken from the SDCN algorithm [2].
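Steps I-IV can be sketched with NumPy on synthetic "stereo" data, i.e. clean cepstra $x_i$ and noisy cepstra $z_i$ recorded in parallel, as the estimation of $r[k,l]$ requires both. All dimensions, the synthetic corruption model and the variance floor are assumptions for illustration only:

```python
import numpy as np

# synthetic stereo training data
rng = np.random.default_rng(1)
K, L, D, N = 4, 2, 3, 200            # codewords, SNR bins, cepstral dim, frames
c = rng.normal(size=(K, D))          # VQ codebook of the training environment
li = rng.integers(0, L, size=N)      # instantaneous SNR bin l_i of each frame
x = c[rng.integers(0, K, size=N)] + 0.1 * rng.normal(size=(N, D))   # "clean" cepstra
true_r = rng.normal(size=(L, D))     # per-SNR-bin corruption (synthetic ground truth)
z = x - true_r[li] + 0.05 * rng.normal(size=(N, D))                 # noisy cepstra

r = np.zeros((K, L, D))              # step I: initial correction vectors r'[k,l]
var = np.ones(L)                     # ... and variances sigma^2[l]
for _ in range(5):
    # step II, eq. (3): posterior f_i[k] of each codeword given frame i
    d2 = ((z[None, :, :] + r[:, li, :] - c[:, None, :]) ** 2).sum(-1)   # (K, N)
    logw = -0.5 * d2 / var[li]
    logw -= logw.max(axis=0)
    f = np.exp(logw)
    f /= f.sum(axis=0)
    # step III, eqs. (4)-(5): re-estimate r[k,l] and sigma^2[l] per SNR bin
    diff = x - z                     # available because the training data is stereo
    for l in range(L):
        sel = li == l                # plays the role of delta[l - l_i]
        w = f[:, sel]                                              # (K, N_l)
        r[:, l, :] = (w[:, :, None] * diff[sel]).sum(1) / w.sum(1)[:, None]
        res = diff[sel][None, :, :] - r[:, l, :][:, None, :]
        var[l] = max(((res ** 2).sum(-1) * w).sum() / w.sum(), 1e-6)

print(np.abs(r.mean(axis=0) - true_r).max())   # small: r[k,l] recovers the corruption
```

With this synthetic setup, the correction vectors settle within very few iterations, matching the convergence behaviour stated above.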

## 3.4 MFCDCN

Multiple Fixed Codeword-Dependent Cepstral Normalization is an extension of FCDCN that does not require environment-specific training. In MFCDCN the correction vector becomes $r[k,l,m]$, where $m$ specifies the environment in which the correction vector was trained. That is, for an input utterance from an unknown environment, the correction vectors of each candidate environment are applied in turn so as to minimize the average residual VQ distortion given by (6) [2].

$\left\| z + r[k,l,m] - c[k] \right\|^{2}$    (6)

# 4. Example of application

This section presents an example of cepstral normalization: segmental cepstral mean and variance normalization.

Since the Mel-Frequency Cepstral Coefficients can be altered by stationary noise, noise robustness is obtained by making the distribution of the cepstral coefficients invariant to the noise condition [3]. In cepstral mean and variance normalization (CMVN), the cepstral coefficients are linearly transformed to have zero mean and unit variance, which makes them robust against noise. Since in our example the transformation parameters are calculated segmentally, the method is called segmental cepstral mean and variance normalization.

The feature vector is normalized as in (7) before training or testing [4]:

$\hat{x}_{t}[i]=\frac{x_{t}[i]-\mu_{t}[i]}{\sigma_{t}[i]}$    (7)

where $x_{t}[i]$ is the $i$th component of the input feature vector. The mean $\mu_{t}[i]$ and the standard deviation $\sigma_{t}[i]$ of the corresponding component are estimated over a sliding window of length $N$ as in (8) and (9):

$\mu_{t}[i]=\frac{1}{N}\sum_{n=t-N/2}^{t+N/2-1}x_{n}[i]$    (8)

$\sigma^{2}_{t}[i]=\frac{1}{N}\sum_{n=t-N/2}^{t+N/2-1}(x_{n}[i]-\mu_{t}[i])^{2}$    (9)
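A minimal NumPy sketch of (7)-(9); clipping the window at the utterance boundaries is one possible convention, chosen here as an assumption since the formulas above do not fix it:

```python
import numpy as np

def segmental_cmvn(x, N=100):
    """Sliding-window CMVN: each component of frame t is normalized by the
    mean and standard deviation over a window of up to N frames centred on t.

    x: array of shape (T, D) -- T frames, D feature components.
    """
    T, D = x.shape
    out = np.empty_like(x, dtype=float)
    for t in range(T):
        lo, hi = max(0, t - N // 2), min(T, t + N // 2)
        seg = x[lo:hi]
        mu = seg.mean(axis=0)                 # eq. (8) over the clipped window
        sigma = seg.std(axis=0) + 1e-8        # eq. (9); guard against zero variance
        out[t] = (x[t] - mu) / sigma          # eq. (7)
    return out

rng = np.random.default_rng(0)
feats = rng.normal(size=(1000, 13)) * 2.0 + 4.0    # biased, scaled features
out = segmental_cmvn(feats)
```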

The experiment on segmental cepstral mean and variance normalization uses a two-speaker, two-channel database recorded with two microphones, PARAT and H374(V)5, and the task is to recognize Norwegian digit strings of known length from different speakers. The results show that when the CMVN algorithm is combined with a two-stage mel-warped Wiener filtering algorithm and the global diagonal covariance (GV) model, word recognition accuracy is sufficiently high not only under the stationary noise condition, CV90, but also under the non-stationary noise condition, Babble [4].

# 5. References

[1] "Cepstral mean normalization", http://staffhome.ecm.uwa.edu.au/~00014742/research/speech/local/entropic/HAPIBook/node85.html

[2] F.-H. Liu, R. M. Stern, X. Huang, A. Acero. "Efficient cepstral normalization for robust speech recognition". Department of Electrical and Computer Engineering, School of Computer Science, Carnegie Mellon University, Pittsburgh, PA 15213.

[3] J. P. Openshaw, J. S. Mason. "On the limitations of cepstral features in noise". Proc. ICASSP-94, Vol. 2, pp. 49-52, 1994.

[4] O. M. Strand, A. Egeberg. "Cepstral mean and variance normalization in the model domain". Norwegian University of Science and Technology, 7491 Trondheim, Norway.

## Preprocessing

The preprocessing stage in speech recognition systems increases the efficiency of the subsequent feature extraction and classification stages and thereby improves the overall recognition performance. Preprocessing commonly includes a sampling step, a windowing step and a denoising step. At the end of the preprocessing, the compressed and filtered speech frames are forwarded to the feature extraction stage. The general preprocessing pipeline is depicted in the following figure.

Due to increasing mobile use, speech recognition systems need to be robust with respect to their acoustic environment. Together with the feature extraction stage, the aim of preprocessing is to generate a parametric representation of the speech signal that is very compact while still retaining all the information necessary for automatic speech recognition.

# 1. Sampling

In order for a computer to process the speech signal, it first has to be digitized. Therefore the time-continuous speech signal is sampled and quantized, resulting in a time- and value-discrete signal. According to the Nyquist-Shannon sampling theorem, a time-continuous signal $x(t)$ that is bandlimited to a finite maximum frequency $f_{max}$ must be sampled with a sampling frequency of at least $2f_{max}$; it can then be reconstructed from its time-discrete signal $x[n]$. Studies by Sanderson et al. have shown that the sampling frequency in combination with the feature vector size has a direct effect on recognition accuracy.

Since human speech has a relatively low bandwidth (mostly between 100 Hz and 8 kHz - see the chapter on speech for details), a sampling frequency of 16 kHz is sufficient for speech recognition tasks.

To obtain a value-discrete signal, the sampled values are quantized, which leads to a significant reduction of data. Speech recognition systems usually encode the samples with 8 or 16 bits per sample, depending on the available processing power. 8 bits per sample yield $2^8 = 256$ quantization levels, while 16 bits per sample provide $2^{16} = 65536$ quantization levels. Hence, if enough processing power is available, a higher bit resolution for the sampled values is preferable.
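The effect of the bit resolution can be illustrated with a small sketch (a uniform mid-tread quantizer on a synthetic tone; the signal, sampling rate and quantizer convention are assumptions for the demo):

```python
import numpy as np

fs = 16000                       # 16 kHz sampling rate, enough for the speech band
t = np.arange(0, 0.01, 1 / fs)   # 10 ms of signal
x = 0.5 * np.sin(2 * np.pi * 440 * t)       # a 440 Hz tone in [-1, 1]

def quantize(x, bits):
    """Uniform quantization of a signal in [-1, 1] to 2**bits levels."""
    scale = 2 ** (bits - 1) - 1
    return np.round(x * scale) / scale

err8 = np.abs(x - quantize(x, 8)).max()     # coarse: 256 levels
err16 = np.abs(x - quantize(x, 16)).max()   # fine: 65536 levels
print(err8 > err16)                         # 16-bit quantization has smaller error
```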

# 2. Windowing and frame formation

Speech is a non-stationary, time-variant signal. A short but precise explanation of stationary and non-stationary signals, as well as the non-stationary nature of speech, is given here. In brief, a signal is considered stationary if its frequency or spectral components do not change over time. We assume that human speech is built from a dictionary of phonemes, and for most phonemes the properties of speech remain invariant over a short period of time (~5-100 ms) [ref.]. Thus we assume (and hope) that the signal behaves stationary within those time frames. To obtain frames, we multiply the speech signal with a windowing function. This function weights the signal in the time domain and divides it into a sequence of partial signals. By doing so we retain time information for every partial signal, keeping in mind that an important step of the preprocessing and feature extraction is a spectral analysis of each frame.

The resulting short-time signal is $s_K(n,q) = s(n)\,w(q-n)$, where

• $s(n)$ denotes the sampled speech signal,
• $Q$ is the frame length,
• $K$ is the window length,
• $q$ is the sample point at which the window is applied,
• and $w(n)$ is the windowing function.

As can be seen in the figure above, the windows may overlap. The frame length and window length depend on the application and the algorithms used. In speech processing, the frame length $Q$ typically varies between 5 and 25 ms and the window length $K$ between 20 and 25 ms [ref.]. A smaller overlap means a larger time shift of the window, and therefore a lower processing demand, but the parameter values (e.g. feature vectors) of neighbouring frames can differ more strongly. A larger overlap results in a smoother change of the parameter values across frames, at the cost of higher processing power.

There are various windowing functions, each with different characteristics, weighting the original signal in a different way and therefore producing different windowed signals. In speech processing, however, the exact shape of the window function is not crucial; usually a soft window such as the von Hann or Hamming window is used [ref.] in order to reduce discontinuities of the speech signal at the edges of each frame. The Hamming window is given by $w(n)=0.54-0.46\cos\left(\frac{2\pi n}{K-1}\right)$ with $n=0,\dots,K-1$ and is depicted below.
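Frame formation with overlapping Hamming windows can be sketched as follows (the hop and window lengths are picked from the typical ranges quoted above; the noise stand-in signal is an assumption for the demo):

```python
import numpy as np

def frame_signal(s, fs, frame_ms=10, window_ms=25):
    """Split a sampled signal into overlapping Hamming-windowed frames.

    frame_ms: the hop, i.e. the frame length Q in milliseconds.
    window_ms: the window length K in milliseconds.
    """
    Q = int(fs * frame_ms / 1000)        # hop in samples
    K = int(fs * window_ms / 1000)       # window length in samples
    n = np.arange(K)
    w = 0.54 - 0.46 * np.cos(2 * np.pi * n / (K - 1))   # Hamming window
    starts = range(0, len(s) - K + 1, Q)
    return np.stack([s[q:q + K] * w for q in starts])

fs = 16000
s = np.random.default_rng(0).normal(size=fs)   # 1 s of noise as a stand-in signal
frames = frame_signal(s, fs)
print(frames.shape)                            # (number of frames, K)
```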

From now on each frame can be analyzed independently and later on in the feature extraction stage represented by a single feature vector.

# 3. Denoising & speech enhancement

The stage of denoising or noise reduction, also referred to as enhancement of speech degraded by noise, aims to improve the quality of the speech signal. The objective is to improve intelligibility, a measure of how comprehensible speech is. Noise corrupting speech signals can be grouped coarsely into the following three classes:

1. Microphone related noise
2. Electrical noise (e.g. electromagnetically induced or radiated noise)
3. Environmental noise

The first two types of noise can easily be compensated by training the speech recognizer on correspondingly noisy speech samples, but compensating environmental noise is not as straightforward, due to its high variability. The basic problem of noise reduction is to reduce the external noise without disturbing the unvoiced and low-intensity noise-like components of the speech signal itself [4].

Noise reduction algorithms can be grouped into three fundamental classes. In the following, exemplary algorithms for denoising are briefly described.

1. Filtering Techniques
2. Spectral Restoration (speech enhancement)
3. Speech-Model-Based

Filtering Techniques. Prominent algorithms based on filtering techniques are adaptive Wiener filtering and the spectral subtraction methods. Adaptive Wiener filtering adapts the filter transfer function from sample to sample based on the speech signal statistics (mean and variance). Spectral subtraction methods estimate the spectrum of the clean signal by subtracting the estimated noise magnitude spectrum from the noisy signal magnitude spectrum while keeping the phase spectrum of the noisy signal [6].
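The spectral subtraction idea can be sketched for a single windowed frame; the idealized known noise spectrum, the spectral floor and the test signal are assumptions for the demo:

```python
import numpy as np

def spectral_subtraction(noisy, noise_mag, floor=0.01):
    """Basic magnitude spectral subtraction for one windowed frame.

    noise_mag: estimate of the noise magnitude spectrum (e.g. averaged over
    speech-free frames). The phase of the noisy frame is kept.
    """
    spec = np.fft.rfft(noisy)
    mag = np.abs(spec) - noise_mag                    # subtract noise magnitude
    mag = np.maximum(mag, floor * np.abs(spec))       # floor against musical noise
    return np.fft.irfft(mag * np.exp(1j * np.angle(spec)), n=len(noisy))

rng = np.random.default_rng(0)
n = 512
clean = np.sin(2 * np.pi * 50 * np.arange(n) / n)     # one clean frame
noise = 0.3 * rng.normal(size=n)
noise_mag = np.abs(np.fft.rfft(noise))                # idealized: noise spectrum known
enhanced = spectral_subtraction(clean + noise, noise_mag)
print(np.mean((enhanced - clean) ** 2) < np.mean(noise ** 2))   # noise is reduced
```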

Spectral Restoration. Spectral restoration refers to the insertion of missing spectral components of nonverbal sounds by adding noise to increase intelligibility [7].

Speech-Model-Based. Harmonic decomposition is a denoising technique that uses a harmonic+noise model of speech, assuming that the speech signal is composed of a periodic/voiced part and a random/unvoiced part. By processing the components separately and recombining them, the speech signal can be enhanced. An exemplary realization is described in greater detail in the harmonic decomposition article. Nonnegative matrix factorization algorithms factorize the Mel magnitude spectra of noisy speech into a nonnegative weighted linear combination of speech and noise basis functions; the weights are then used to obtain an estimate of the clean speech [8]. Nonnegative matrix factorization can also be used as a feature extraction technique in speech processing.

In conclusion, the general measure of quality of a noise reduction system is its improvement in signal-to-noise ratio (SNR); with respect to speech recognition, however, the best measure is the improvement in recognition performance.

# References

[1] C. E. Shannon. Communication in the presence of noise. In Proc. Institute of Radio Engineers, vol. 37, no. 1, pp. 10–21, Jan. 1949.

[2] C. Sanderson, K. K. Paliwal. Effect of different sampling rates and feature vector sizes on speech recognition performance. In TENCON '97. IEEE Region 10 Annual Conference. Speech and Image Technologies for Computing and Telecommunications., Proceedings of IEEE, vol. 1, pp. 161-164, Dec. 1997.

[3] N. A. Meseguer. Speech Analysis for Automatic Speech Recognition. Norwegian University of Science and Technology, Department of Electronics and Telecommunications, July 2009.

[4] A. G. Maher, R. W. Kind, J.G. Rathmell. A Comparison of Noise Reduction Techniques for Speech Recognition in Telecommunications Environments. In The Institution of Engineers Australia Communications Conference, Sydney, October, 1992.

[5] J. Benesty, M. M. Sondhi, Y. Huang. Springer Handbook of Speech Processing, pp.843-869. Springer, 2007.

[6] M. A. Abd El-Fattah, M. I. Dessouky, S. M. Diab, F. E. Abd El-samie. Adaptive Wiener Filtering Approach for Speech Enhancement. In Ubiquitous Computing and Communication Journal, Vol. 3, No. 2, pp. 1-8. April 2008.

[7] R. M. Warren, K. R. Hainsworth, B. S. Brubaker, J. A. Bashford, E. W. Healy. Spectral restoration of speech: intelligibility is increased by inserting noise in spectral gaps. In Perception and psychophysics, 59 (2) (1997), pp. 275-283.

[8] J. T. Geiger, J. F. Gemmeke, B. Schuller, G. Rigoll. Investigating NMF Speech Enhancement for Neural Network based Acoustic Models. In Proc. Interspeech 2014, ISCA, Singapore, 2014.

## Wiener Filtering

In speech recognition, noise such as microphone-related noise and electrical noise can be a key factor in deteriorating recognition quality, so filtering methods are introduced to deal with it. Wiener filtering adapts the coefficients of the Wiener filter to the statistics of the speech signal in order to generate a desired output signal.

# 1. Introduction

To solve the linear optimum filtering problem [1], the Wiener filter, a type of adaptive filter, was developed based on the Wiener theory established by Norbert Wiener in the 1940s. The basic principle of the Wiener filter is to optimize its coefficients so as to minimize the average squared distance between a desired signal and the filter output. Typical applications of the Wiener filter are additive noise reduction, linear prediction, signal restoration and system identification [2]. This article mainly focuses on additive noise reduction using the FIR Wiener filter.

# 2. Motivation

As in the figure above, given a time series x(0), x(1), x(2), ..., x(n) as the input of the filter, we want to design a linear discrete-time filter whose output y(n) is the best estimate of the desired signal d(n). We restrict ourselves to an FIR filter, because designing an adaptive IIR filter requires a large amount of computation due to the combination of adaptivity and feedback. The criterion of the statistical optimization is the minimization of the mean square distance between y(n) and d(n), which depends on the unknown filter coefficients [2]. The main issue is therefore how to obtain the optimal filter coefficients.

# 3. Basic Algorithm of FIR Wiener Filter

With the filter coefficients $w(n)$, we obtain the output $y(n)$ and the estimation error $e(n)$:

$y(n)=w(n)*x(n)=\sum_{l=0}^{N-1}w(l)x(n-l)$    (1)

$e(n)=d(n)-y(n)=d(n)-\sum_{l=0}^{N-1}w(l)x(n-l)$    (2)

The coefficients of the Wiener filter $w(n)$ are optimal if they minimize the mean square error:

$J_{MSE}(w)=E\left\{\left|e(n)\right|^{2}\right\}=E\left\{\left|d(n)-y(n)\right|^{2}\right\}$    (3)

Taking the derivative of $J_{MSE}(w)$ with respect to $w^{*}(k)$ and setting it to zero yields the orthogonality principle:

$E\left\{e(n)x^{*}(n-k)\right\}=0,\quad k=0,1,\dots,N-1$    (4)

Rearranging equation (4) gives equation (5), where $r_{dx}(k)=E\left\{d(n)x^{*}(n-k)\right\}$ and $r_{x}(k-l)=E\left\{x(n-l)x^{*}(n-k)\right\}$, and its matrix form (6), using $r_{x}(k)=r_{x}^{*}(-k)$:

$\sum_{l=0}^{N-1}w(l)r_{x}(k-l)=r_{dx}(k),\quad k=0,1,\dots,N-1$    (5)

$\begin{bmatrix} r_{x}(0) & r_{x}^{*}(1) & \cdots & r_{x}^{*}(N-1)\\ r_{x}(1) & r_{x}(0) & \cdots & r_{x}^{*}(N-2)\\ \vdots & \vdots & \ddots & \vdots\\ r_{x}(N-1) & r_{x}(N-2) & \cdots & r_{x}(0) \end{bmatrix}\begin{bmatrix} w(0)\\ w(1)\\ \vdots\\ w(N-1) \end{bmatrix}=\begin{bmatrix} r_{dx}(0)\\ r_{dx}(1)\\ \vdots\\ r_{dx}(N-1) \end{bmatrix}$    (6)

Equation (6) is the Wiener-Hopf equation. Solving it yields the optimal filter coefficients $w(n)$:

$R_{x}w=r_{dx}$    (7)

From equations (2), (3) and (4), the minimum mean square error is [3]:

$J_{MMSE}(w)=E\left\{e(n)d^{*}(n)\right\}=r_{d}(0)-\sum_{l=0}^{N-1}w(l)r_{dx}^{*}(l)$    (8)
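The Wiener-Hopf solution (7) can be sketched with sample-estimated correlations for real signals; the filter length, the sinusoidal desired signal and the noise level are assumptions for the demo:

```python
import numpy as np

def wiener_fir(x, d, N=16):
    """Estimate the N-tap FIR Wiener filter from sample statistics by
    solving the Wiener-Hopf equations R_x w = r_dx of (7)."""
    T = len(x)
    # sample autocorrelation r_x(k) and cross-correlation r_dx(k)
    r_x = np.array([np.dot(x[k:], x[:T - k]) / T for k in range(N)])
    r_dx = np.array([np.dot(d[k:], x[:T - k]) / T for k in range(N)])
    R = np.array([[r_x[abs(k - l)] for l in range(N)] for k in range(N)])  # Toeplitz
    w = np.linalg.solve(R, r_dx)
    y = np.convolve(x, w)[:T]        # filter output y(n) = sum_l w(l) x(n-l), eq. (1)
    return w, y

rng = np.random.default_rng(0)
d = np.sin(2 * np.pi * 0.05 * np.arange(4000))   # desired clean signal
x = d + 0.5 * rng.normal(size=4000)              # noisy observation
w, y = wiener_fir(x, d)
mse_in = np.mean((x - d) ** 2)
mse_out = np.mean((y - d) ** 2)
print(mse_out < mse_in)    # the Wiener filter reduces the mean square error
```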

# 4. Example of its application

This section presents a simple example in which Wiener filtering is used to choose the input word signal that best approximates a desired output signal.

Suppose we have two recorded reference signals x1(n) and x2(n) for the same word and a desired signal d(n). We need to decide which of the two signals generates an output y(n) with the smaller minimum mean square error with respect to d(n).

Following the Wiener filtering algorithm, from x1(n) and d(n) we calculate rdx1, Rx1 and rd(0) and obtain the optimal filter coefficients w1(n) via equation (7); w2(n) is obtained in the same way. We then compute the minimum mean square errors corresponding to w1(n) and w2(n). Since a smaller MMSE means a better estimate, the system selects, from the two reference signals, the one with the smaller MMSE [3].

With Wiener filtering, the input word signal that is most likely to generate the desired output signal despite the noise is thus chosen from the reference signals. Furthermore, more reference signals can be introduced to make the MMSE between the optimal output signal and the desired one even smaller.
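The selection step can be sketched directly from (7) and (8); the two synthetic references, their noise levels and the filter length are assumptions for the demo:

```python
import numpy as np

def mmse_of_reference(x, d, N=16):
    """MMSE (8) when x is used to estimate d with an N-tap FIR Wiener filter:
    J = r_d(0) - sum_l w(l) r_dx(l), with w solving R_x w = r_dx of (7)."""
    T = len(x)
    r_x = np.array([np.dot(x[k:], x[:T - k]) / T for k in range(N)])
    r_dx = np.array([np.dot(d[k:], x[:T - k]) / T for k in range(N)])
    R = np.array([[r_x[abs(k - l)] for l in range(N)] for k in range(N)])
    w = np.linalg.solve(R, r_dx)
    return np.dot(d, d) / T - np.dot(w, r_dx)     # r_d(0) - sum_l w(l) r_dx(l)

rng = np.random.default_rng(1)
d = np.sin(2 * np.pi * 0.03 * np.arange(2000))    # desired word signal
x1 = d + 0.2 * rng.normal(size=2000)              # cleaner recording
x2 = d + 0.8 * rng.normal(size=2000)              # noisier recording
j1, j2 = mmse_of_reference(x1, d), mmse_of_reference(x2, d)
print(j1 < j2)   # the cleaner reference yields the smaller MMSE and is chosen
```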

# 5. References

[1] H. Rey, L. Rey Vega. A Rapid Introduction to Adaptive Filtering, Chapter 2: Wiener Filtering. Springer, 2012.

[2] D. Tsai. Introduction of Wiener Filter. Graduate Institute of Electronics Engineering, National Taiwan University, Taipei, Taiwan, ROC.

[3] T. Yang. The Algorithms of Speech Recognition, Programming and Simulating in MATLAB.

## Harmonic Decomposition

# 1. Introduction

This article deals with harmonic decomposition as a denoising technique for speech processing. In the following, the idea of harmonic decomposition is described briefly, followed by one approach to implementing it as a preprocessing step of a speech recognition system.

Harmonic decomposition uses the fact that a speech signal consists mainly of harmonic frequencies and accordingly adapts the weights of the harmonic and random components of the speech signal. By decomposing the speech signal into a harmonic and a random component, the two components can be processed independently and then recombined to generate an enhanced signal [3].

Harmonic decomposition is based on the property of periodicity of speech signals and the classification of the speech into voiced and unvoiced states, which is described in the Speech Production article [2] [4]. By exploiting these inherent properties of the speech signal itself and not making any assumptions about the noise, a priori knowledge of the noise is not needed. Therefore harmonic decomposition is able to improve robustness of speech recognition in situations where occurring noise is stationary as well as non-stationary [1].

# 2. Exemplary realization

## 2.1 Model of Speech

A clean speech signal is considered to be the sum of a harmonic signal and a random signal

$x = x_h + x_r,$

where the harmonic signal can be modelled as a weighted sum of harmonically related sinusoids

$x_{h}(t) = \sum\limits_{k=1}^K a_{k}\cos(k w_{0} t) + b_{k}\sin(k w_{0} t)$ [1]

with $w_0$ being the fundamental frequency, also called the pitch, and $K$ being the total number of harmonics in the signal. The fundamental frequency of a periodic signal is its lowest frequency, and a harmonic frequency is an integer multiple of the fundamental. For every frame of the speech signal, an estimate of $w_0$ and of the amplitude parameters $\{a_1,a_2,...,a_K,b_1,b_2,...,b_K\}$ is needed.

The pitch $w_0$ is estimated using a pitch detection algorithm (PDA), e.g. RAPT, a robust algorithm for pitch tracking [5]. $x$ is approximated by $x_h$, and then the equation

$\textbf{x} = A b = \begin{bmatrix} \cos(1\cdot w_0\,t_0) & \cdots & \cos(K\cdot w_0\,t_0) & \sin(1\cdot w_0\,t_0) & \cdots & \sin(K\cdot w_0\,t_0)\\ \vdots & & \vdots & \vdots & & \vdots\\ \cos(1\cdot w_0\,t_N) & \cdots & \cos(K\cdot w_0\,t_N) & \sin(1\cdot w_0\,t_N) & \cdots & \sin(K\cdot w_0\,t_N) \end{bmatrix}\begin{bmatrix} a_1\\ \vdots\\ a_K\\ b_1\\ \vdots\\ b_K \end{bmatrix}$

is solved, where $\textbf{x}$ is a vector of $N$ samples. Since the number of samples $N > 2K$, this yields an overdetermined system, and an approximate solution can be found using a least squares approach:

$\hat{b} = (A^{T}A)^{-1}A^{T}\textbf{x}$

This set of amplitude parameters is used to approximate the periodic component $x_h$

$\hat{x}_h = A\hat{b}$

and by simply subtracting $\hat{x}_h$ from $x$ we obtain an estimate of the residual or random component $x_r$:

$\hat{x}_r = x - \hat{x}_h$
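The least squares decomposition above can be sketched on a synthetic voiced frame; the pitch (assumed known here, in practice delivered by a PDA), the number of harmonics and the frame parameters are assumptions for the demo:

```python
import numpy as np

fs = 8000
N = 256
t = np.arange(N) / fs
w0 = 2 * np.pi * 200                 # assumed pitch of 200 Hz (from a PDA in practice)
K = 5                                # number of harmonics kept in the model
# synthetic voiced frame: three harmonics plus a small random component
x = (1.0 * np.cos(1 * w0 * t) + 0.5 * np.sin(2 * w0 * t) + 0.25 * np.cos(3 * w0 * t)
     + 0.05 * np.random.default_rng(0).normal(size=N))

# build the matrix A of harmonically related cosines and sines
k = w0 * np.arange(1, K + 1)
A = np.hstack([np.cos(np.outer(t, k)), np.sin(np.outer(t, k))])
b_hat = np.linalg.lstsq(A, x, rcond=None)[0]   # least squares: (A^T A)^{-1} A^T x
x_h = A @ b_hat                                # harmonic component estimate
x_r = x - x_h                                  # residual / random component
print(np.mean(x_h ** 2) > np.mean(x_r ** 2))   # most energy is harmonic here
```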

## 2.2 Harmonic+Noise case

For the case of a noisy speech signal $y$, the decomposition is done as follows:

$y = y_h + y_r = x_h + n_h + x_r + n_r,$

where $n_h$ is the noise in the harmonic component and $n_r$ is the noise in the random component. A block diagram of the denoising approach based on harmonic decomposition is depicted below.

The windowed speech segments $y_w[n]$ are decomposed into approximations of $y_h[n]$ and $y_r[n]$ using the least squares approach described above. Assuming that the harmonic and random components are uncorrelated, and likewise that the clean speech and the noise are uncorrelated, the speech signal is then transformed into the Mel spectrum representation. This transformation consists of the first two stages of feature extraction using Mel Frequency Cepstral Coefficients (MFCC). Briefly summarized, the windowed signals are transformed into the frequency domain and the computed power spectrum of each window is filtered using triangular overlapping (bandpass) filters. The outcome is called the Mel frequency spectrum. This spectrum can be considered the power spectrum of the windowed signals mapped onto the mel scale [4]; it is described in greater detail in the respective article. The Mel frequency spectrum can therefore be treated as a measure of energy.

Denote by $X$ the Mel frequency spectrum of a clean speech segment $x$. Due to the uncorrelatedness assumptions above, the following equation can be derived: $Y = Y_h + Y_r$.

It is now possible to derive an estimate of the clean speech component of a noise-corrupted speech segment $Y$ as a weighted sum of its harmonic and random components:

$\hat{X} = \alpha_{h} Y_{h} + \alpha_{r} Y_{r}$ with $0 \leq \alpha_h, \alpha_r \leq 1$

The final task is to determine $\alpha_h$ and $\alpha_r$. For a highly voiced speech segment, for example, it is assumed that the clean speech is captured almost entirely by the harmonic component, while the random component captures mostly noise.

An estimate for $\alpha_h$ can therefore be found by using the energy of the harmonic component of every frame: $\alpha_h = \frac{\sum_i y_h(i)^2}{\sum_i y(i)^2}$ with $i$ being the sample index within a frame.

If noise corrupts the signal, the energy of the harmonic component $y_h(i)^2$ will be smaller while the energy of the random component $y_r(i)^2$ will be bigger.

For the weighting factor $\alpha_r$ of the random component it is not as easy to find an estimate, since this component has no predictable underlying structure. An estimate can be learned from data by empirically studying the recognition performance of a subsequent classifier for different values of $\alpha_r$. Results from [1] and [3] show the best results for $\alpha_r = 0.1$. The following figure shows the results of Seltzer et al.: the improvement in word accuracy of an MFCC-based recognition system over different signal-to-noise ratios.
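The weighted recombination can be sketched in the Mel spectral domain. Note that $\alpha_h$ is computed here from per-frame Mel-band energies as a stand-in for the time-domain energy ratio given above, and the spectra are random placeholders; both are assumptions for the demo:

```python
import numpy as np

rng = np.random.default_rng(0)
Y_h = rng.random((20, 40))    # Mel spectra of the harmonic component (frames x bands)
Y_r = rng.random((20, 40))    # Mel spectra of the random component

# per-frame weight: fraction of the frame's energy captured by the harmonic part
alpha_h = Y_h.sum(axis=1) / (Y_h.sum(axis=1) + Y_r.sum(axis=1))
alpha_r = 0.1                 # empirically best value reported in [1] and [3]

# estimated clean Mel spectrum: X_hat = alpha_h * Y_h + alpha_r * Y_r
X_hat = alpha_h[:, None] * Y_h + alpha_r * Y_r
```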

# References

[1] M. Seltzer, J. Droppo, A. Acero. A harmonic-model-based front end for robust speech recognition. In Proc. Eurospeech Conference, Geneva, Switzerland, September 2003. International Speech Communication Association.

[2] Q. Hu, M. Liang. On the harmonic-plus-noise decomposition of speech. In Signal Processing, 2006 8th International Conference, volume 1, 2006.

[3] U. Imtiaz. Robust Speech Recognition Using Harmonic Components. Catholic University of Leuven, Belgium, June, 2004.

[4] B. Schuller. Voice and speech analysis in search of states and traits. In A. A. Salah, T. Gevers (eds.) Computer Analysis of Human Behaviour, Advances in Pattern Recognition, chapter 9, pp. 227-253, Springer, Berlin (2011).

[5] D. Talkin. A robust algorithm for pitch tracking (RAPT). In Speech Coding and Synthesis, pp. 495-517, 1995.