# 1. Introduction

This article deals with harmonic decomposition as a denoising technique for speech processing. In the following, the idea of harmonic decomposition is briefly described, followed by one approach to implementing it as a preprocessing step of a speech recognition system.

Harmonic decomposition exploits the fact that a speech signal consists mainly of harmonic frequencies, and accordingly adapts the weights of the harmonic and random components of the signal. By decomposing the speech signal into a harmonic and a random component, the two components can be processed independently and then recombined to generate an enhanced signal [3].

Harmonic decomposition is based on the periodicity of speech signals and on the classification of speech into voiced and unvoiced states, which is described in the Speech Production article [2], [4]. Because it exploits these inherent properties of the speech signal itself and makes no assumptions about the noise, no a priori knowledge of the noise is needed. Harmonic decomposition is therefore able to improve the robustness of speech recognition whether the occurring noise is stationary or non-stationary [1].

# 2. Exemplary realization

## 2.1 Model of Speech

A clean speech signal is considered as the sum of a harmonic signal and a random signal

$x = x_h + x_r,$

where the harmonic signal can be modelled as a weighted sum of harmonically related sinusoids

$x_{h}(t) = \sum\limits_{k=1}^{K} \left[ a_{k} \cos(k w_{0} t) + b_{k} \sin(k w_{0} t) \right]$ [1]

with $w_0$ being the fundamental frequency, also called the pitch, and $K$ being the total number of harmonics in the signal. The fundamental frequency of a periodic signal is defined as its lowest frequency, and a harmonic frequency is an integer multiple of this fundamental frequency. For every frame of the speech signal, estimates of $w_0$ and the amplitude parameters $\{a_1, a_2, \ldots, a_K, b_1, b_2, \ldots, b_K\}$ are needed.

The pitch $w_0$ is estimated using a pitch detection algorithm (PDA), e.g. RAPT, a robust algorithm for pitch tracking [5]. $x$ is approximated by $x_h$, and the equation
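RAPT itself is fairly involved; as a stand-in, a minimal autocorrelation-based pitch estimate can illustrate the idea (the function name, search range, and test tone are our own choices, not part of the original method):

```python
import numpy as np

def autocorr_pitch(frame, fs, fmin=60.0, fmax=400.0):
    """Crude autocorrelation pitch estimate (a simple stand-in for RAPT [5]).

    Searches for the autocorrelation peak between the lags that correspond
    to fmax and fmin, and returns the estimated fundamental frequency in Hz.
    """
    frame = frame - np.mean(frame)
    # One-sided autocorrelation for non-negative lags
    ac = np.correlate(frame, frame, mode="full")[len(frame) - 1:]
    lo, hi = int(fs / fmax), int(fs / fmin)   # plausible pitch-period lags
    lag = lo + np.argmax(ac[lo:hi + 1])       # strongest periodicity
    return fs / lag

# Hypothetical test frame: a pure 100 Hz tone sampled at 8 kHz
frame = np.cos(2 * np.pi * 100.0 * np.arange(800) / 8000.0)
f0_hat = autocorr_pitch(frame, fs=8000)
```

A real PDA such as RAPT adds normalized cross-correlation, dynamic programming over candidate tracks, and voicing decisions on top of this basic idea.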

$\textbf{x} = A\textbf{b} = \begin{bmatrix} \cos(1 \cdot w_0 \, t_0) & \cdots & \cos(K w_0 \, t_0) & \sin(1 \cdot w_0 \, t_0) & \cdots & \sin(K w_0 \, t_0) \\ \vdots & & \vdots & \vdots & & \vdots \\ \cos(1 \cdot w_0 \, t_N) & \cdots & \cos(K w_0 \, t_N) & \sin(1 \cdot w_0 \, t_N) & \cdots & \sin(K w_0 \, t_N) \end{bmatrix} \begin{bmatrix} a_1 \\ \vdots \\ a_K \\ b_1 \\ \vdots \\ b_K \end{bmatrix}$

is solved, where $\textbf{x}$ is a vector of $N$ samples. Since the number of samples $N > 2K$, this yields an overdetermined system, and an approximate solution can therefore be found by a least-squares approach:

$\hat{b} = (A^{T}A)^{-1}A^{T}\textbf{x},$

where this set of amplitude parameters is used to approximate the periodic component $x_h$

$\hat{x}_h = A \hat{b},$

and by simply subtracting $\hat{x}_h$ from $x$, an estimate of the residual or random component $x_r$ is obtained:

$\hat{x}_r = x - \hat{x}_h$
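The whole least-squares decomposition above can be sketched in a few lines of NumPy. This is an illustrative sketch only: the function name, frame length, and synthetic test signal are our own choices, and `np.linalg.lstsq` is used instead of forming $(A^T A)^{-1}A^T$ explicitly, which is numerically equivalent but better conditioned.

```python
import numpy as np

def harmonic_decomposition(x, w0, K, fs):
    """Least-squares split of one frame x into harmonic and random parts.

    w0 : fundamental frequency in rad/s
    K  : number of harmonics
    fs : sampling rate in Hz
    """
    N = len(x)
    t = np.arange(N) / fs                     # sample times t_0 .. t_{N-1}
    k = np.arange(1, K + 1)                   # harmonic indices 1..K
    # Design matrix A: K cosine columns followed by K sine columns
    A = np.hstack([np.cos(np.outer(t, k * w0)),
                   np.sin(np.outer(t, k * w0))])
    # Amplitudes b_hat solving the overdetermined system in the LS sense
    b_hat, *_ = np.linalg.lstsq(A, x, rcond=None)
    x_h = A @ b_hat                           # harmonic estimate
    x_r = x - x_h                             # residual / random estimate
    return x_h, x_r

# Synthetic frame: two harmonics of 100 Hz plus white noise
fs, f0 = 8000, 100.0
t = np.arange(400) / fs
clean = np.cos(2 * np.pi * f0 * t) + 0.5 * np.sin(2 * np.pi * 2 * f0 * t)
noisy = clean + 0.1 * np.random.default_rng(0).standard_normal(len(t))
x_h, x_r = harmonic_decomposition(noisy, 2 * np.pi * f0, K=4, fs=fs)
```

On this synthetic frame, $\hat{x}_h$ recovers the clean harmonic part and $\hat{x}_r$ absorbs most of the added noise, which is exactly the behaviour the denoising stage relies on.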

## 2.2 Harmonic+Noise case

For the case of a noisy speech signal $y$ the decomposition is done as follows

$y = y_h + y_r = x_h + n_h + x_r + n_r,$

where $n_h$ is the noise in the harmonic component and $n_r$ is the noise in the random component. A complete block diagram of the denoising approach based on harmonic decomposition is depicted below.

The windowed speech segments $y_w[n]$ are decomposed, using the least-squares approach described above, into approximations of $y_h[n]$ and $y_r[n]$. Assuming that the harmonic and random components are uncorrelated, and likewise that clean speech and noise are uncorrelated, the speech signal is then transformed into the Mel-spectral representation. This transformation consists of the first two stages of feature extraction with Mel Frequency Cepstral Coefficients (MFCC): the windowed signals are transformed into the frequency domain, and the computed power spectrum of each window is filtered using triangular overlapping (bandpass) filters. The outcome is the Mel-frequency spectrum, which can be considered as the power spectrum of the windowed signals mapped onto the mel scale [4]; it is described in greater detail in the respective article. The Mel-frequency spectrum can therefore be regarded as a measure of energy.
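These first two MFCC stages can be sketched as follows (a simplified sketch: the number of mel bands, FFT size, and window choice are our own assumptions, and real front ends typically add pre-emphasis and more careful filter normalization):

```python
import numpy as np

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_power_spectrum(frame, fs, n_mels=20, n_fft=512):
    """Power spectrum of a windowed frame mapped through a triangular
    mel filterbank -- the first two MFCC stages described above."""
    spec = np.abs(np.fft.rfft(frame * np.hamming(len(frame)), n_fft)) ** 2
    # Filter centres equally spaced on the mel scale, converted to FFT bins
    mels = np.linspace(hz_to_mel(0.0), hz_to_mel(fs / 2.0), n_mels + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mels) / fs).astype(int)
    fbank = np.zeros((n_mels, n_fft // 2 + 1))
    for m in range(1, n_mels + 1):
        l, c, r = bins[m - 1], bins[m], bins[m + 1]
        fbank[m - 1, l:c] = (np.arange(l, c) - l) / max(c - l, 1)  # rising edge
        fbank[m - 1, c:r] = (r - np.arange(c, r)) / max(r - c, 1)  # falling edge
    return fbank @ spec   # mel-frequency spectrum: one energy per band

# Hypothetical frame: a 100 Hz tone, whose energy lands in the lowest bands
frame = np.cos(2 * np.pi * 100.0 * np.arange(400) / 8000.0)
mel_y = mel_power_spectrum(frame, fs=8000)
```

Applied to $y_h[n]$ and $y_r[n]$ separately, this yields the band energies $Y_h$ and $Y_r$ used in the recombination below.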

Denote by $X$ the Mel-frequency spectrum of a clean speech segment $x$. Due to the uncorrelatedness assumptions, the relation $Y = Y_h + Y_r$ holds for the spectra of the noisy signal.

It is now possible to derive an estimate of the clean speech spectrum from the spectrum $Y$ of a noise-corrupted speech segment simply as a weighted sum of its harmonic and random components:

$\hat{X} = \alpha_{h} Y_{h} + \alpha_{r} Y_{r}$ with $0 \leq \alpha_h, \alpha_r \leq 1$

The final task is to determine $\alpha_h$ and $\alpha_r$. For a highly voiced speech segment, for example, it is assumed that the clean speech is captured almost entirely by the harmonic component, while the random component captures mostly noise.

An estimate for $\alpha_h$ can therefore be found from the energy of the harmonic component for every frame: $\alpha_h = \frac{\sum_i y_h(i)^2}{\sum_i y(i)^2},$ with $i$ being the sample index within a frame.
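The energy-ratio weight and the recombination from the previous equation can be sketched together (the frame below is a hypothetical strongly voiced example of our own making; $\alpha_r = 0.1$ follows the results from [1] and [3] discussed below):

```python
import numpy as np

# Hypothetical voiced frame: a tone plus a little noise
fs = 8000
t = np.arange(400) / fs
y_h = np.sin(2 * np.pi * 120.0 * t)                       # harmonic part
y = y_h + 0.05 * np.random.default_rng(1).standard_normal(len(t))

# alpha_h: ratio of harmonic frame energy to total frame energy
alpha_h = np.sum(y_h ** 2) / np.sum(y ** 2)               # close to 1 when voiced

def enhance(Y_h, Y_r, alpha_h, alpha_r=0.1):
    """Weighted recombination of mel-spectral vectors: X_hat = a_h*Y_h + a_r*Y_r."""
    return alpha_h * Y_h + alpha_r * Y_r

# Toy mel-spectral vectors, just to show the weighting
X_hat = enhance(np.array([2.0, 4.0]), np.array([1.0, 1.0]), alpha_h=0.5)
```

For a strongly voiced frame like this one, `alpha_h` comes out near 1, so the harmonic band energies pass through almost unchanged while the random component is heavily attenuated.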

If noise corrupts the signal, the energy of the harmonic component $y_h(i)^2$ becomes smaller while the energy of the random component $y_r(i)^2$ becomes larger.

For the weighting factor $\alpha_r$ of the random component, it is not as easy to find an estimate, since the underlying structure of this component is not predictable. An estimate can instead be learned from data by empirically studying the recognition performance of a subsequent classifier for different values of $\alpha_r$. The experiments in [1] and [3] achieved the best results with $\alpha_r = 0.1$. The following figure shows the results of Seltzer et al., namely the improvement in word accuracy of an MFCC-based recognition system over different signal-to-noise ratios.

# References

[1] M. Seltzer, J. Droppo, A. Acero. A harmonic-model-based front end for robust speech recognition. In Proc. Eurospeech Conference, Geneva, Switzerland, September 2003. International Speech Communication Association.

[2] Q. Hu, M. Liang. On the harmonic-plus-noise decomposition of speech. In Proc. 8th International Conference on Signal Processing, volume 1, 2006.

[3] U. Imtiaz. Robust Speech Recognition Using Harmonic Components. Catholic University of Leuven, Belgium, June, 2004.

[4] B. Schuller. Voice and speech analysis in search of states and traits. In A. A. Salah, T. Gevers (eds.) Computer Analysis of Human Behaviour, Advances in Pattern Recognition, chapter 9, pp. 227-253, Springer, Berlin (2011).

[5] D. Talkin. A robust algorithm for pitch tracking (RAPT). In Speech Coding and Synthesis, pp. 495-517, 1995.