# Cepstral mean normalization

Variation in channel characteristics is a common problem for speech recognition systems, especially when both stationary and non-stationary noise are present. Cepstral Mean Normalization (CMN) was developed to reduce the influence of these variations. In speech recognition, CMN is a basic normalization step applied to cepstral features such as Mel-Frequency Cepstral Coefficients (MFCCs).

# 1. Basic idea

In the CMN algorithm, the mean of the cepstral coefficients over the whole utterance (a sequence of cepstral vectors) is subtracted from each frame (a single cepstral vector): $x_{t}=z_{t}-m$, where $m$ is the mean vector, $z_{t}$ is the cepstral vector of frame $t$ before normalization and $x_{t}$ is the normalized cepstral vector.

One drawback of this process is that no frame can be normalized until the whole utterance has been received, which makes it unsuitable for live speech recognition.
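The subtraction $x_{t}=z_{t}-m$ can be sketched in a few lines of NumPy. This is a minimal illustration, not a reference implementation: the function name, the array layout (one row per frame) and the toy values are assumptions made for the example. The second call illustrates why the method helps under the assumption that the channel adds the same offset to every cepstral vector: the offset ends up in the mean and is removed.

```python
import numpy as np

def cepstral_mean_normalize(z):
    """Subtract the utterance-level mean from every cepstral frame.

    z: array of shape (T, D), one D-dimensional cepstral vector per frame.
    Returns x with x[t] = z[t] - m, where m is the mean over all T frames.
    """
    z = np.asarray(z, dtype=float)
    m = z.mean(axis=0)  # requires the complete utterance
    return z - m

# Toy utterance (hypothetical values): 3 frames, 2 coefficients each.
z = np.array([[1.0, 4.0], [2.0, 6.0], [3.0, 8.0]])
x = cepstral_mean_normalize(z)

# A constant offset h, standing in for a fixed channel, does not change the result.
h = np.array([0.5, -2.0])
x_shifted = cepstral_mean_normalize(z + h)
```

Because `m` is computed from all `T` frames, the function cannot emit the first normalized frame before the last input frame has arrived.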

# 2. Dynamic CMN

To address this drawback, the dynamic CMN algorithm was introduced for processing live speech. In dynamic CMN, an IIR filter generates an instantaneous (running) average of the cepstral coefficients from the part of the utterance received so far. An initial cepstral mean is derived from the training data and stored for later use. If this initial mean turns out to be inaccurate for the received utterance, the filter must be updated accordingly before the instantaneous average is calculated. In addition, whenever the calculation of cepstral coefficients starts, a parameter specifies the number of frames that the IIR filter must process before its average is used for normalization [1].
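A running-mean version might look like the sketch below. The source does not specify the filter, so a first-order recursive average is assumed here as one simple IIR filter; the names `alpha` (smoothing constant) and `delay` (number of frames passed through the filter before its output is used) are hypothetical, and using the stored training mean during the delay period is also an assumption.

```python
import numpy as np

def dynamic_cmn(frames, initial_mean, alpha=0.995, delay=0):
    """Normalize cepstral frames one at a time with a running mean.

    frames:       iterable of D-dimensional cepstral vectors, in arrival order
    initial_mean: mean vector estimated beforehand from training data
    alpha:        smoothing constant of the assumed first-order IIR filter
    delay:        number of frames filtered before the running mean is used;
                  until then the stored initial mean is subtracted instead
    """
    stored = np.asarray(initial_mean, dtype=float)
    mean = stored.copy()
    out = []
    for t, z in enumerate(frames):
        z = np.asarray(z, dtype=float)
        mean = alpha * mean + (1.0 - alpha) * z  # IIR update of the average
        m = mean if t >= delay else stored
        out.append(z - m)
    return np.array(out)

# Toy stream (hypothetical values): two one-dimensional frames.
y = dynamic_cmn([[2.0], [2.0]], initial_mean=[0.0], alpha=0.5)
```

Each frame is normalized as soon as it arrives, at the price of a mean estimate that is only approximate early in the utterance.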

# 3. Efficient cepstral normalization techniques

This section introduces several other cepstrum-based compensation techniques that achieve higher accuracy at a higher computational cost. Comparing them with cepstral mean normalization gives a clearer understanding of cepstral normalization techniques.

## 3.1 SDCN

SNR-Dependent Cepstral Normalization (SDCN) applies an additive correction vector in the cepstral domain that depends only on the instantaneous SNR of the signal. The correction vector represents the spectral difference, at a given SNR, between simultaneously recorded speech samples from the training environment and from the testing environment. At low SNRs the vector subtracts the noise, while at higher SNRs it compensates for the spectral difference between the two environments. The algorithm is simple but requires environment-specific training [2].
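As a rough sketch of this idea, the correction vectors can be estimated as the average cepstral difference per SNR bin and then looked up at test time. Several things here are assumptions rather than details taken from [2]: the SNR is taken to be already quantized into integer bins, the correction is written with a plus sign, and all names (`train_sdcn`, `apply_sdcn`, `snr_bin`) are hypothetical.

```python
import numpy as np

def train_sdcn(x_train, z_test, snr_bin, num_bins):
    """Estimate one correction vector per SNR bin from simultaneous recordings.

    x_train: (N, D) cepstra recorded in the training environment
    z_test:  (N, D) the same frames recorded in the testing environment
    snr_bin: (N,) integer SNR bin of each frame
    """
    x_train = np.asarray(x_train, dtype=float)
    z_test = np.asarray(z_test, dtype=float)
    snr_bin = np.asarray(snr_bin)
    w = np.zeros((num_bins, x_train.shape[1]))
    for l in range(num_bins):
        mask = snr_bin == l
        if mask.any():  # bins without data keep a zero correction
            w[l] = (x_train[mask] - z_test[mask]).mean(axis=0)
    return w

def apply_sdcn(z, snr_bin, w):
    """Add the correction vector selected by each frame's SNR bin."""
    return np.asarray(z, dtype=float) + w[np.asarray(snr_bin)]

# Toy data (hypothetical): the offset between environments depends only on the bin.
x_train = np.array([[1.0, 1.0], [2.0, 2.0], [3.0, 3.0], [4.0, 4.0]])
snr_bin = np.array([0, 0, 1, 1])
offset = np.array([[1.0, 1.0], [0.5, 0.5]])
z_test = x_train - offset[snr_bin]

w = train_sdcn(x_train, z_test, snr_bin, num_bins=2)
x_hat = apply_sdcn(z_test, snr_bin, w)
```

The need for simultaneous recordings in `train_sdcn` is what makes the method environment-specific.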

## 3.2 CDCN

In Codeword-Dependent Cepstral Normalization (CDCN), codewords of the speech are generated from the training database. Expectation Maximization (EM) is applied to obtain maximum-likelihood (ML) estimates of the parameters that describe the additive noise and the linear filtering. Compensating the incoming speech with these estimates produces cepstral coefficients that best match the cepstral coefficients of speech in the testing environment to the positions of the codewords in the training environment.

One advantage of CDCN is that it works without prior knowledge of the testing environment. Its disadvantage is a higher computational cost: it relies on structural knowledge about the nature of the speech signal degradations, which is difficult to obtain, although this knowledge improves the accuracy of the algorithm considerably [2].

## 3.3 FCDCN

Fixed Codeword-Dependent Cepstral Normalization (FCDCN) uses an additive correction vector that depends on the instantaneous SNR of the signal (as in SDCN) and also differs from codeword to codeword in the training environment (as in CDCN).

$x=z+r[k,l]$                                                                                                                                                                                    (1)

As in (1), for each frame of a speech signal, $z$ is the cepstral vector of the corrupted signal, $x$ is the cepstral vector of the compensated signal, $l$ is the SNR index, $k$ is the index of the VQ codeword and $r[k,l]$ is the correction vector.
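Applying (1) to a single frame can be sketched as follows, with the codeword index chosen by the minimum-distortion criterion that the text gives as (2). The array shapes and the function name are assumptions for illustration; the correction vectors `r` and the codebook `c` are taken as already trained.

```python
import numpy as np

def fcdcn_compensate(z, l, r, c):
    """Compensate one frame with the FCDCN correction x = z + r[k, l].

    z: (D,) cepstral vector of the corrupted frame
    l: integer SNR index of the frame
    r: (K, L, D) correction vectors, indexed by codeword k and SNR index l
    c: (K, D) VQ codebook from the training environment
    Returns the compensated vector x and the selected codeword index k.
    """
    z = np.asarray(z, dtype=float)
    residual = z + r[:, l, :] - c                # (K, D), one row per codeword
    k = int(np.argmin(np.sum(residual ** 2, axis=1)))
    return z + r[k, l], k

# Toy setup (hypothetical values): K = 2 codewords, L = 1 SNR index, D = 2.
c = np.array([[0.0, 0.0], [10.0, 10.0]])
r = np.array([[[1.0, 1.0]], [[-1.0, -1.0]]])
x, k = fcdcn_compensate([10.5, 10.5], 0, r, c)
```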

$\left\|z+r\left[k,l\right]-c\left[k\right]\right\|^{2}$                                                                                                                                                                       (2)

$c[k]$ is the codebook vector from the training database for which (2) is minimized. The correction vectors are re-estimated with Expectation Maximization so that they maximize the likelihood of the data. The algorithm proceeds as follows:

I. Estimate the initial values of $r^{'}[k,l]$ and $\sigma^{2}[l]$.

II. Estimate the posterior probabilities of the mixture components,

$f_{i}\left[k\right]=\frac{\exp\left(-\frac{1}{2\sigma^{2}\left[l_{i}\right]}\left\|z_{i}+r^{'}\left[k,l_{i}\right]-c\left[k\right]\right\|^{2}\right)}{\sum_{p=0}^{K-1}\exp\left(-\frac{1}{2\sigma^{2}\left[l_{i}\right]}\left\|z_{i}+r^{'}\left[p,l_{i}\right]-c\left[p\right]\right\|^{2}\right)}$                                                                                                    (3)

where $i$ is the frame index and $l_{i}$ is the instantaneous SNR of the $i$th frame.

III. Maximize the likelihood of the data with new estimates of $r[k,l]$ and $\sigma^{2}[l]$:

$r[k,l]=\frac{\sum_{i=0}^{N-1}(x_{i}-z_{i})f_{i}[k]\delta[l-l_{i}]}{\sum_{i=0}^{N-1}f_{i}[k]\delta[l-l_{i}]}$                                                                                                                                     (4)

$\sigma^{2}[l]=\frac{\sum_{i=0}^{N-1}\sum_{k=0}^{K-1}\left\|x_{i}-z_{i}-r[k,l]\right\|^{2}f_{i}[k]\delta[l-l_{i}]}{\sum_{i=0}^{N-1}\sum_{k=0}^{K-1}f_{i}[k]\delta[l-l_{i}]}$                                                                                                  (5)

IV. If the algorithm has not converged, go to step II; otherwise stop.

Convergence can be achieved in two or three iterations if the initial values of the correction vectors are taken from the SDCN algorithm [2].
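The four steps can be sketched in NumPy as below. This is an illustrative sketch under stated assumptions, not the implementation from [2]: paired frames $x_{i}$ and $z_{i}$ are assumed to be available (as (4) and (5) imply), the SNR is assumed to be quantized to integer indices, and the function name, the variance floor, the stopping tolerance and the iteration cap are hypothetical choices.

```python
import numpy as np

def fcdcn_em(x, z, l_idx, c, r_init, sigma2_init, max_iter=20, tol=1e-6):
    """Re-estimate FCDCN correction vectors r[k, l] and variances sigma2[l].

    x, z:   (N, D) paired cepstra from the training and testing environments
    l_idx:  (N,) integer SNR index of each frame
    c:      (K, D) VQ codebook; r_init: (K, L, D); sigma2_init: (L,)
    """
    x, z, c = (np.asarray(a, dtype=float) for a in (x, z, c))
    l_idx = np.asarray(l_idx)
    r = np.array(r_init, dtype=float)            # step I: initial values
    sigma2 = np.array(sigma2_init, dtype=float)
    diff = x - z
    for _ in range(max_iter):
        # Step II: posteriors f[i, k] as in (3), computed stably in the log domain.
        resid = z[:, None, :] + r[:, l_idx, :].transpose(1, 0, 2) - c[None, :, :]
        logits = -np.sum(resid ** 2, axis=2) / (2.0 * sigma2[l_idx])[:, None]
        logits -= logits.max(axis=1, keepdims=True)
        f = np.exp(logits)
        f /= f.sum(axis=1, keepdims=True)
        # Step III: new estimates as in (4) and (5), one SNR index at a time.
        r_new = r.copy()
        for l in range(r.shape[1]):
            mask = l_idx == l
            if not mask.any():
                continue                          # no frames at this SNR index
            w = f[mask]                           # (n_l, K)
            denom = w.sum(axis=0)                 # (K,)
            ok = denom > 1e-12                    # keep old r where no frame votes
            r_new[ok, l, :] = (w.T @ diff[mask])[ok] / denom[ok][:, None]
            err = diff[mask][:, None, :] - r_new[None, :, l, :]
            var = np.sum(np.sum(err ** 2, axis=2) * w) / w.sum()
            sigma2[l] = max(var, 1e-6)            # floor avoids division by zero
        # Step IV: stop once the correction vectors no longer change.
        change = np.max(np.abs(r_new - r))
        r = r_new
        if change < tol:
            break
    return r, sigma2

# Toy data (hypothetical): the offset x - z depends only on the SNR index.
x = np.array([[0.0, 0.0], [0.1, 0.0], [5.0, 5.0], [5.0, 5.1]])
l_idx = np.array([0, 1, 0, 1])
offset = np.array([[1.0, 1.0], [0.5, 0.5]])
z = x - offset[l_idx]
c = np.array([[0.0, 0.0], [5.0, 5.0]])
r, sigma2 = fcdcn_em(x, z, l_idx, c, np.zeros((2, 2, 2)), np.ones(2))
```

In this toy case every frame at a given SNR index has the same offset, so the estimates settle after the first update and the loop exits on the second pass.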

## 3.4 MFCDCN

Multiple Fixed Codeword-Dependent Cepstral Normalization (MFCDCN) is an extension of FCDCN that does not require environment-specific training. In MFCDCN, the correction vector becomes $r[k,l,m]$, where $m$ specifies the environment in which the correction vector was trained. For an input utterance from an unknown environment, the correction vectors of each candidate environment are applied in turn, and the environment that minimizes the average residual VQ distortion given by (6) is selected [2].

$\left\|z+r\left[k,l,m\right]-c\left[k\right]\right\|^{2}$                                                                                                                                                                (6)
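The environment selection can be sketched as follows: for every candidate environment the residual VQ distortion of each frame is computed with the best codeword, averaged over the utterance, and the environment with the smallest average is used for compensation. The array layout `r[k, l, m]`, the function name and the toy values are assumptions for illustration.

```python
import numpy as np

def mfcdcn_compensate(z, l_idx, r, c):
    """Pick the environment with the lowest average residual VQ distortion.

    z:     (N, D) cepstra of the input utterance
    l_idx: (N,) integer SNR index of each frame
    r:     (K, L, M, D) correction vectors for M candidate environments
    c:     (K, D) VQ codebook from the training environment
    Returns the selected environment index m and the compensated frames.
    """
    z = np.asarray(z, dtype=float)
    l_idx = np.asarray(l_idx)
    # resid[i, k, m] = z_i + r[k, l_i, m] - c[k]
    resid = z[:, None, None, :] + r[:, l_idx].transpose(1, 0, 2, 3) - c[None, :, None, :]
    dist = np.sum(resid ** 2, axis=3)            # (N, K, M)
    best_k = dist.argmin(axis=1)                 # best codeword per frame and environment
    avg_dist = dist.min(axis=1).mean(axis=0)     # (M,) average residual distortion
    m = int(avg_dist.argmin())
    x = z + r[best_k[:, m], l_idx, m]
    return m, x

# Toy setup (hypothetical): K = 1, L = 1, M = 2 environments, D = 1.
c = np.array([[0.0]])
r = np.zeros((1, 1, 2, 1))
r[0, 0, 0] = 1.0
r[0, 0, 1] = -2.0
m, x = mfcdcn_compensate([[2.0], [2.2]], [0, 0], r, c)
```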

# 4. Example of application

This section presents an example of cepstral normalization: segmental cepstral mean and variance normalization.

Since Mel-Frequency Cepstral Coefficients can be altered by stationary noise, noise robustness is obtained by making the distribution of the cepstral coefficients invariant to the noise condition [3]. In cepstral mean and variance normalization (CMVN), the cepstral coefficients are linearly transformed to have zero mean and unit variance, which improves their robustness against noise. Because the transformation parameters in this example are calculated segmentally, the method is called segmental cepstral mean and variance normalization.

The feature vector is normalized as in (7) before training or testing [4]:

$\hat{x}_{t}[i]=\frac{x_{t}[i]-\mu_{t}[i]}{\sigma_{t}[i]}$                                                                                                                                                                        (7)

where $x_{t}[i]$ is the $i$th component of the input feature vector at time $t$. The mean $\mu_{t}[i]$ and the standard deviation $\sigma_{t}[i]$ of each component are estimated over a sliding window of length $N$ as in (8) and (9):

$\mu_{t}[i]=\frac{1}{N}\sum_{n=t-N/2}^{t+N/2-1}x_{n}[i]$                                                                                                                                                                (8)

$\sigma^{2}_{t}[i]=\frac{1}{N}\sum_{n=t-N/2}^{t+N/2-1}(x_{n}[i]-\mu_{t}[i])^{2}$                                                                                                                                        (9)
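Equations (7) to (9) translate into a short sliding-window routine. The sketch below makes two assumptions that the equations leave open: near the start and end of the utterance the window is truncated to the available frames (and the statistics are divided by the actual frame count), and a small floor `eps` protects against a zero standard deviation. The function name is hypothetical.

```python
import numpy as np

def segmental_cmvn(x, N, eps=1e-8):
    """Normalize each frame with the mean and variance of a sliding window.

    x: (T, D) feature vectors, one row per frame
    N: window length in frames (the window covers n = t - N/2 ... t + N/2 - 1)
    """
    if N < 2:
        raise ValueError("window length N must be at least 2")
    x = np.asarray(x, dtype=float)
    T = x.shape[0]
    half = N // 2
    out = np.empty_like(x)
    for t in range(T):
        lo = max(0, t - half)          # truncate the window at the utterance start
        hi = min(T, t + half)          # exclusive upper bound, truncated at the end
        seg = x[lo:hi]
        mu = seg.mean(axis=0)          # (8)
        sigma = seg.std(axis=0)        # square root of (9), population variance
        out[t] = (x[t] - mu) / np.maximum(sigma, eps)   # (7)
    return out

# Toy utterance (hypothetical values): 4 frames, 1 coefficient, window length 4.
feats = np.array([[0.0], [2.0], [4.0], [6.0]])
normed = segmental_cmvn(feats, N=4)
```

A vectorized version based on cumulative sums would be faster for long utterances; the explicit loop is kept here because it mirrors the equations directly.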

In the experiment on segmental cepstral mean and variance normalization, a two-speaker, two-channel database recorded with two microphones, PARAT and H374(V)5, is used to recognize Norwegian digit strings of known length from different speakers. The results show that when the CMVN algorithm is combined with a two-stage mel-warped Wiener filtering algorithm and a globally pooled diagonal covariance (GV), the word recognition accuracy is sufficiently high not only under the stationary noise, CV90, but also under the non-stationary noise, Babble [4].

# 5. References

[1] "Cepstral mean normalization", http://staffhome.ecm.uwa.edu.au/~00014742/research/speech/local/entropic/HAPIBook/node85.html

[2] "Efficient cepstral normalization for robust speech recognition", Fu-Hua Liu, Richard M. Stern, Xuedong Huang, Alejandro Acero, Department of Electrical and Computer Engineering, School of Computer Science, Carnegie Mellon University, Pittsburgh, PA 15213

[3] "On the limitations of cepstral features in noise", Openshaw, J. P. and Mason, J. S., Proc. ICASSP-94, Vol. 2, pp. 49-52, 1994

[4] "Cepstral mean and variance normalization in the model domain", Ole Morten Strand, Andreas Egeberg, Norwegian University of Science and Technology, 7491 Trondheim, Norway