# Non-negative Matrix Factorization (2)

In 1971, W. H. Lawton and E. A. Sylvestre presented a method to find the shapes of two overlapping functions from observations of their additive mixtures[1]. This method, called Self Modeling Curve Resolution (SMCR), was frequently used in spectrophotometry, where the measured outcome is an additive mixture of two unknown, non-negative, linearly independent functions („basis“). The combination coefficients of these functions must of course also be non-negative. The feasibility of this method for finding the „basis“ of mixtures was verified with PCA. SMCR, which considers only two basis functions, is the germ of non-negative matrix factorization (NMF). The extension to an arbitrary number of basis functions was introduced by P. Paatero and U. Tapper in 1994 under the name positive matrix factorization (PMF)[2]. NMF became widely known in 1999 through the viewpoint of D. Lee & H. Seung that perception of the whole is based on perception of its parts[3]. They also provided two algorithms to learn the parts of objects.

This page first introduces the basic idea of NMF and the parts-based concept, then different algorithms to compute NMF, and finally NMF-based applications for speech recognition (SR).

# 1 Introduction of NMF and Parts-based Idea

NMF is a matrix factorization method with a non-negativity constraint: all matrices in this process must be non-negative. (A non-negative matrix is different from a positive semidefinite matrix, which fulfills $x^{T}Ax \ge 0$ for all $x$; here, instead, all elements of the matrix are greater than or equal to zero.)

Given a non-negative matrix $V$, find non-negative matrix factors $W$ and $H$ such that ${V}\approx{WH}$[4]. $V$ is a data set containing $m$ samples from $\mathbb{R}^{n}$, so $V \in \mathbb{R}^{n \times m}$. The factor matrices satisfy $W \in \mathbb{R}^{n \times k}$ and $H \in \mathbb{R}^{k \times m}$, where the rank $k$ should be smaller than $n$ and $m$. NMF can also be written column-wise, namely ${v}\approx{Wh}$: each column of $V$ is approximated by a linear combination of the columns of $W$, so $W$ is the basis matrix. This is the basic idea of NMF.
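A minimal numpy sketch of the dimensions involved (the matrix $V$ here is constructed exactly as a product, so the factorization is exact; real data is only approximately factorizable):

```python
# Dimensional sketch of V ≈ WH: a rank-k non-negative factorization of an
# n x m non-negative matrix. All sizes and values are illustrative.
import numpy as np

n, m, k = 20, 30, 5
rng = np.random.default_rng(0)
W = rng.random((n, k))   # basis matrix: k non-negative basis vectors in R^n
H = rng.random((k, m))   # weight matrix: each column mixes the k bases
V = W @ H                # data matrix: m samples in R^n, columns of V

# column-wise view: v ≈ W h, each sample is a non-negative combination of bases
v0 = V[:, 0]
print(np.allclose(v0, W @ H[:, 0]), V.min() >= 0)
```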

To illustrate how this works for a speech signal, what each matrix means, and what parts-based means, a small example of NMF deconvolution for sound object extraction is quoted here[5] (the deconvolution aspect can be neglected here).

Given the spectrogram $X$ of a segment of a sound signal (the spectrogram is usually computed via the Short Time Fourier Transform, STFT), NMF is used to find the linear approximation $WH$ of $X$.

The lower right plot is the magnitude spectrogram $X$; it contains 256-point DFT slices, and successive slices overlap by 128 points. The NMF performed here decomposes $X$ into $W$ with 3 bases (the 3 leftmost plots, each spanning 10 DFT slices) and a weight matrix $H$, also called the hidden variable matrix, which is the invisible layer of the original signal.

$X \approx WH$

Connecting to the parts-based concept, the 3 bases of $W$ represent parts of this signal $X$: each DFT slice in $X$ is represented by the multiplication of the parts-bases $W$ with the corresponding weight vector $h$. For fixed bases $W$, the hidden variable matrix $H$ contains features hidden in the speech segment; ideally these are our desired speech features.

From this example it is easy to see that an input signal represented in both time and frequency can be divided via NMF into two factors: frequency bases ($W$) and features ($H$) which are purely temporal.

# 2 Algorithms to Compute NMF

There are two ways to obtain the factorization. One is exact NMF[6], which relies on extra conditions on the original matrix and is not discussed here. The other is to iteratively update $W$ and $H$. Since this method does not yield an exact solution, we must measure whether the current $W$ and $H$ are a good representation of the original signal. An objective/cost function must therefore be given for the update rules. Mathematically such a function is called a metric only when it is symmetric and fulfills the triangle inequality; otherwise it is a divergence.

## 2.1 Cost Functions

Many divergences can measure the distance between two matrices. A standard choice is the squared Euclidean distance[4]:

$D_{EU}(A\|B)=\sum_{ij}\left(A_{ij}-B_{ij}\right)^{2}$                                                                           (1)

where $A=V$, $B=WH$, and $i,j$ are simply indices into the matrices (not related to the update indices below). This distance is based on a norm and therefore fulfills homogeneity, which means the EU distance does not reflect the statistical properties of the data.

The second method is based on the Kullback–Leibler divergence[4]:

$D_{KL}(A\|B)=\sum_{ij}\left(A_{ij}\log\frac{A_{ij}}{B_{ij}}-A_{ij}+B_{ij}\right)$                                             (2)

It is also called relative entropy and describes the difference between two statistical distributions. Let $A$ and $B$ be two pdfs; since $-\log$ is a convex function, it is easy to show after reformulation (via Jensen's inequality) that the KL divergence is always greater than or equal to zero, with equality if and only if $A = B$. Since the KL divergence fulfills neither symmetry nor the triangle inequality, it is not a true distance.

The third method is based on the Itakura-Saito divergence[7]:

$D_{IS}(A\|B)=\sum_{ij}\left(\frac{A_{ij}}{B_{ij}}-\log\frac{A_{ij}}{B_{ij}}-1\right)$                                                         (3)

This divergence reflects the perceptual difference between the original spectrum and the approximated spectrum. Under the assumption that the given signal is a superposition of Gaussian components, IS-NMF also has a statistical interpretation, as illustrated in [7]: the maximum likelihood estimation of $W$ and $H$ from a $V$ that is a sum of Gaussian components is exactly the NMF of $V$ under the IS divergence; the proof is given in that paper.
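The three cost functions (1)-(3) are straightforward to compute element-wise; a small numpy sketch, assuming strictly positive entries so that the logs and ratios are defined:

```python
# Squared Euclidean distance, KL divergence and IS divergence between two
# equally-sized positive matrices A (data) and B (approximation WH).
import numpy as np

def d_eu(A, B):
    return np.sum((A - B) ** 2)                  # Eq. (1)

def d_kl(A, B):
    return np.sum(A * np.log(A / B) - A + B)     # Eq. (2)

def d_is(A, B):
    return np.sum(A / B - np.log(A / B) - 1)     # Eq. (3)

A = np.array([[1.0, 2.0], [3.0, 4.0]])
print(d_eu(A, A), d_kl(A, A), d_is(A, A))   # all three vanish when A == B
```

All three are zero exactly when the two matrices coincide and positive otherwise, which is what an update rule needs from a cost function.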

## 2.2 Update Rules

For the cost functions given above, update rules must be found that converge quickly to a minimum and save computation. Note that these cost functions are convex in $W$ alone or in $H$ alone (where $B = WH$), but not in both together, which means only convergence to a local minimum can be guaranteed.

A direct approach is gradient descent, but it converges slowly to the local minimum. The conjugate gradient method converges faster but is more complicated to implement. A further drawback of gradient-based methods is sensitivity to the step size, which is not practical for applications[4]. The following multiplicative update rules, introduced by Lee & Seung for the first two cost functions, and the multiplicative rule for IS-NMF, strike a good balance between convergence speed and implementation complexity.

The update rules of W and H for the Euclidean distance cost function are as follows:

$H_{a\mu}\leftarrow H_{a\mu}\frac{(W^{T}V)_{a\mu}}{(W^{T}WH)_{a\mu}}$                                                                                     (4)

$W_{ia}\leftarrow W_{ia}\frac{(VH^{T})_{ia}}{(WHH^{T})_{ia}}$                                                                                      (5)

where $i$ runs over the rows of $V$, which become the rows of $W$; $\mu$ runs over the columns of $V$, which become the columns of $H$; and $a$ runs over the reduced-rank dimension shared by $W$ and $H$.
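Written with matrix products, rules (4)-(5) take only a few lines; a bare-bones numpy sketch (the small constant `eps` is an implementation detail, assumed here to guard against division by zero, and the toy data is random):

```python
# Lee & Seung multiplicative updates for the Euclidean cost, Eqs. (4)-(5).
import numpy as np

def nmf_eu(V, k, n_iter=200, eps=1e-9, seed=0):
    rng = np.random.default_rng(seed)
    n, m = V.shape
    W = rng.random((n, k))
    H = rng.random((k, m))
    for _ in range(n_iter):
        H *= (W.T @ V) / (W.T @ W @ H + eps)   # Eq. (4)
        W *= (V @ H.T) / (W @ H @ H.T + eps)   # Eq. (5)
    return W, H

V = np.random.default_rng(1).random((10, 15))
W, H = nmf_eu(V, k=3)
err = np.linalg.norm(V - W @ H, "fro") / np.linalg.norm(V, "fro")
print(err)  # relative approximation error
```

Because the updates only multiply by non-negative ratios, $W$ and $H$ stay non-negative automatically, which is the appeal of the multiplicative form.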

The update rules for the KL divergence are:

$H_{a\mu}\leftarrow H_{a\mu}\frac{\sum_{i}W_{ia}V_{i\mu}/(WH)_{i\mu}}{\sum_{k}W_{ka}}$                                                                       (6)

$W_{ia}\leftarrow W_{ia}\frac{\sum_{\mu}H_{a\mu}V_{i\mu}/(WH)_{i\mu}}{\sum_{\nu}H_{a\nu}}$                                                                       (7)

The additional indices $k$ and $\nu$ enforce the constraint that the columns of $W$ and the rows of $H$ both sum to unity. This conveniently eliminates the degeneracy associated with the invariance of $WH$ under ${W}\rightarrow{W\Lambda}$, ${H}\rightarrow{\Lambda^{-1}H}$, where ${\Lambda}$ is a diagonal matrix[3].

The proof that the first two cost functions are non-increasing under these update rules is described in detail in [4]. Both update rules are multiplicative; in particular, a simple additive update can also be used for $H$, as shown in the same paper.
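Rules (6)-(7) can likewise be vectorized, with the normalizing sums broadcast along the appropriate axis; a numpy sketch under the same assumptions as before (`eps` and the random toy data are illustration details):

```python
# Lee & Seung multiplicative updates for the KL divergence, Eqs. (6)-(7).
import numpy as np

def nmf_kl(V, k, n_iter=200, eps=1e-9, seed=0):
    rng = np.random.default_rng(seed)
    n, m = V.shape
    W = rng.random((n, k))
    H = rng.random((k, m))
    for _ in range(n_iter):
        # Eq. (6): numerator sums over i, denominator is the column sums of W
        H *= (W.T @ (V / (W @ H + eps))) / (W.sum(axis=0)[:, None] + eps)
        # Eq. (7): numerator sums over mu, denominator is the row sums of H
        W *= ((V / (W @ H + eps)) @ H.T) / (H.sum(axis=1)[None, :] + eps)
    return W, H

V = np.random.default_rng(2).random((8, 12)) + 0.1
W, H = nmf_kl(V, k=3)
rel = np.linalg.norm(V - W @ H, "fro") / np.linalg.norm(V, "fro")
print(rel)
```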

The IS-NMF multiplicative update rules are (IS-NMF/MU)[7], where "." denotes element-wise multiplication, the fractions are element-wise, and $\nabla^{2}(-\log(WH))$ is the element-wise second derivative, i.e. $(WH)^{.-2}$:

$H\leftarrow H.\frac{W^{T}\left(\nabla^{2}(-\log(WH)).V\right)}{W^{T}\left(\nabla^{2}(-\log(WH)).WH\right)}$                                                             (8)

$W\leftarrow W.\frac{\left(\nabla^{2}(-\log(WH)).V\right)H^{T}}{\left(\nabla^{2}(-\log(WH)).WH\right)H^{T}}$                                                            (9)

An advantage of IS-NMF compared with EU/KL-NMF is scale invariance. To see this property we first need the definition of the $\beta$-divergence, where $\beta=1$ corresponds to the KL divergence, $\beta=2$ to the EU distance, and $\beta=0$ to the IS divergence. From the definition of the $\beta$-divergence it is easy to get:

$d_{\beta}(\gamma x\,|\,\gamma y)=\gamma^{\beta}\,d_{\beta}(x\,|\,y)$                                                                                        (10)

which holds for every $\beta$-divergence, so scale invariance holds only when $\beta=0$. This means the cost of a bad factorization of the low-power coefficients of $x$ is the same as for the high-power coefficients. In contrast, for $\beta>0$ the cost relies more on the larger coefficients, which leads to less accurate reconstruction of the low-power components in audio spectra[7].
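Property (10) is easy to verify numerically; a scalar numpy sketch using the standard $\beta$-divergence definition (the specific numbers are arbitrary):

```python
# Check Eq. (10): d_beta(gamma*x | gamma*y) == gamma**beta * d_beta(x | y),
# so only beta = 0 (Itakura-Saito) is scale invariant.
import numpy as np

def d_beta(x, y, beta):
    if beta == 0:                                  # IS divergence
        return x / y - np.log(x / y) - 1
    if beta == 1:                                  # KL divergence
        return x * np.log(x / y) - x + y
    # general case; beta = 2 gives (x - y)**2 / 2, the EU cost up to a factor
    return (x**beta + (beta - 1) * y**beta - beta * x * y**(beta - 1)) / (beta * (beta - 1))

x, y, gamma = 3.0, 2.0, 10.0
for beta in (0, 1, 2):
    lhs = d_beta(gamma * x, gamma * y, beta)
    rhs = gamma**beta * d_beta(x, y, beta)
    print(beta, np.isclose(lhs, rhs))
```

For $\beta=0$ the factor $\gamma^{0}=1$, so scaling the data does not change the cost at all, which is the scale invariance claimed above.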

Additionally, there is an extension of Expectation Maximization IS-NMF (IS-NMF/EM): the SAGE algorithm, which is for specially structured data (here the superimposed Gaussian components mentioned above). It overcomes the disadvantage of IS-NMF/EM that all the factorization parameters have to be updated in each iteration; instead, only a subset of the parameters is updated.

Given a signal composed of independent Gaussian components:

$X\in\mathbb{C}^{F\times N},\quad x_{n}=\sum_{k=1}^{K}c_{k,n},\quad\text{where}\quad c_{k,n}\sim N_{c}\left(0,\,h_{kn}\,\mathrm{diag}(w_{k})\right)$            (11)

K is the number of basis functions.

Define $C_{k}:=\{c_{k,1},\dots,c_{k,N}\}$ as the hidden-data space of the parameter subset $\theta_{k}=\{w_{k},h_{k}\}$. SAGE chooses for each $\theta_{k}$ a hidden-data space $C_{k}$, and the union of the $\theta_{k}$ estimates is the IS-NMF/EM of $X$.

The corresponding update rules are:

$h_{kn}^{(i+1)}=\frac{1}{F}\sum_{f}\frac{v'_{k,fn}}{w_{fk}^{(i)}},\qquad w_{fk}^{(i+1)}=\frac{1}{N}\sum_{n}\frac{v'_{k,fn}}{h_{kn}^{(i+1)}}$                                                                                          (12)

where $v'_{k,fn}=\left|\mu_{k,fn}^{post}\right|^{2}+\lambda_{k,fn}^{post}$, with $\mu$ and $\lambda$ the posterior mean and variance of $c_{k,fn}$; $v'_{k,fn}$ is computed from the most up-to-date parameters.
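Given the posterior powers $v'_{k,fn}$ (computing them requires the posterior/Wiener step of [7], omitted here), update (12) for one component $k$ is a one-liner per factor; a numpy sketch with made-up toy vectors:

```python
# Eq. (12) for a single component k: v_post is the F x N matrix of posterior
# powers v'_{k,fn}, w_k a length-F basis, h_k a length-N activation vector.
import numpy as np

def sage_update(v_post, w_k, h_k):
    h_new = (v_post / w_k[:, None]).mean(axis=0)    # (1/F) sum_f v'/w_fk
    w_new = (v_post / h_new[None, :]).mean(axis=1)  # (1/N) sum_n v'/h_kn^(i+1)
    return w_new, h_new

# sanity check: a rank-one posterior power v' = w h^T is a fixed point
w_true = np.linspace(0.5, 1.5, 4)
h_true = np.linspace(0.2, 1.0, 6)
w_new, h_new = sage_update(np.outer(w_true, h_true), w_true, h_true)
print(np.allclose(w_new, w_true), np.allclose(h_new, h_true))
```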

This model gives a better estimation of musical pitches than KL-NMF, EU-NMF, and even IS-NMF/MU[7].

# 3 Applications of NMF for SR

Because of the parts-based characteristic of NMF and the non-negativity of speech spectra, NMF is widely used for speech signal processing, combined with further constraints such as a statistical model of the input signal or a sparsity constraint on the $H$ matrix.

## 3.1 Speech Denoising

Combined with statistical parameters of the signal, NMF can be used for denoising.

As mentioned in the preprocessing part, noise in speech processing can be stationary or non-stationary, and most filtering methods only deal with the former. Using NMF with priors[8], environmental noise, which is non-stationary, can be largely eliminated.

In the training step, the spectrograms of the source signals, noise and speech, are available individually, namely $V_{speech}$ and $V_{noise}$. Using the KL divergence, the cost function is minimized for each source separately. This yields $H_{speech}$ and $H_{noise}$ with sizes $n_{b}\times n_{st}$ and $n_{b}\times n_{nt}$, and likewise $W_{speech}$ and $W_{noise}$, both of size $n_{f}\times n_{b}$. After the NMF, the statistics of $H_{speech}$ and $H_{noise}$ are estimated; here a Gaussian representation of the signals is chosen for convenience. The empirical means and covariances of their log values are then computed, yielding $\mu_{speech}$, $\mu_{noise}$, $\Lambda_{speech}$ and $\Lambda_{noise}$, where each $\mu$ is a vector and each $\Lambda$ a covariance matrix.

Based on these statistics $\mu$, $\Lambda$, a regularized cost function is introduced by adding the negative of the log-likelihood $L(H)$ of $\log H$:

$L(H)=-\frac{1}{2}\sum_{k}\left\{\left(\log H_{all,k}-\mu_{all}\right)^{T}\Lambda_{all}^{-1}\left(\log H_{all,k}-\mu_{all}\right)+\log\left[\left(2\pi\right)^{2n_{b}}\left|\Lambda_{all}\right|\right]\right\}$ (13)

where $H_{all,k}$ denotes the $k$-th column of $H_{all}$; this keeps $H_{all}$ consistent with the statistics of $H_{speech}$ and $H_{noise}$[8].

In the denoising stage, $W_{speech}$ and $W_{noise}$ are fixed, under the assumption that the basis functions describe speech and noise well. Assuming speech and noise are independent, the statistics are concatenated to form $\mu_{all}=[\mu_{speech};\,\mu_{noise}]$ and the block-diagonal $\Lambda_{all}=\begin{bmatrix}\Lambda_{speech}&0\\0&\Lambda_{noise}\end{bmatrix}$.
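Forming the joint prior statistics is a simple concatenation; a numpy sketch with toy dimensions ($n_{b}=2$ bases per source; the actual values are made up for illustration):

```python
# Concatenate per-source Gaussian statistics into the joint prior:
# mu_all stacks the mean vectors, Lambda_all is block diagonal because
# speech and noise are assumed independent.
import numpy as np

n_b = 2
mu_speech, mu_noise = np.zeros(n_b), np.ones(n_b)
Lam_speech, Lam_noise = np.eye(n_b), 2 * np.eye(n_b)

mu_all = np.concatenate([mu_speech, mu_noise])            # length 2*n_b
Lam_all = np.block([[Lam_speech, np.zeros((n_b, n_b))],
                    [np.zeros((n_b, n_b)), Lam_noise]])   # block diagonal
print(mu_all.shape, Lam_all.shape)
```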

Finally, to reconstruct the denoised spectrogram, compute ${\hat{V}_{speech}}={W_{speech}}{H_{speech}}$, where $H_{speech}$ is the speech part of $H_{all}$.

A simple example illustrates NMF with priors: let $n_{b}$ and $n_{f}$ both equal 2, i.e. the speech and the noise signal each consist of a high and a low frequency component. The original signals show that for speech the high and low frequencies are strongly correlated, while in noise they are somewhat negatively correlated, so the assumption that the statistics of speech and noise differ clearly is met. The results show the potential of this approach: by regularizing the KL divergence with a log-likelihood, such an NMF algorithm performs well for non-stationary denoising of speech signals.

## 3.2 Single Channel Speech Separation using Sparse NMF(SNMF)

As a learning method, NMF can learn a dictionary from given training data; with the additional constraint of sparsity on the code matrix $H$, the reconstruction performs speech separation of a mixture.

The basic idea of speech separation using SNMF is to restrict the sparsity of $H$ while the data remain well approximated.

Consider a mixture as a sum of source signals:

$\mathcal{Y}=\sum_{i=1}^{R}\mathcal{Y}_{i}$

where each source can be decomposed into its own basis functions ${\mathcal{D}_{i}}$ and code matrix ${\mathcal{H}_{i}}$. Thus, by enforcing ${\mathcal{H}_{total}}$ to be sparse, there should exist a total dictionary, concatenated from the $\mathcal{D}_{i}$, such that $\mathcal{Y}$ can be separated into the $\mathcal{Y}_{i}$.

A new sparsity term is added to the cost function. Since no statistics of the data are considered here, it is more convenient to use the EU divergence. The new cost function is:

$\mathcal{E}=\left\|Y-\bar{D}H\right\|_{F}^{2}+\lambda\sum_{ij}H_{ij}$                                                                          (14)

The corresponding update rules are:

$H_{ij}\leftarrow H_{ij}\cdot\frac{Y_{i}^{T}\bar{D}_{j}}{R_{i}^{T}\bar{D}_{j}+\lambda}$                                                                                        (15)

$D_{j}\leftarrow D_{j}\cdot\frac{\sum_{i}H_{ij}\left[Y_{i}+\left(R_{i}^{T}\bar{D}_{j}\right)\bar{D}_{j}\right]}{\sum_{i}H_{ij}\left[R_{i}+\left(Y_{i}^{T}\bar{D}_{j}\right)\bar{D}_{j}\right]}$                                                            (16)

where $\lambda$ controls the degree of sparsity: a larger $\lambda$ enforces stronger sparsity. The relaxation from the zero norm to the one norm yields the optimal sparsest reconstruction iff the parameter space of the cost fulfills the null space property.
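The $H$ update (15) can be written matrix-wise, with the $\lambda$ penalty added to the denominator; a numpy sketch (the function name, `eps` guard, and random toy data are illustration assumptions, and the column normalization of $\bar{D}$ used in the full algorithm is omitted here):

```python
# One sparse-NMF update of the code matrix H in the spirit of Eq. (15):
# H <- H .* (D^T Y) ./ (D^T R + lambda), with R = D H the reconstruction.
import numpy as np

def snmf_h_step(Y, D, H, lam, eps=1e-9):
    R = D @ H                                       # current reconstruction
    return H * (D.T @ Y) / (D.T @ R + lam + eps)    # lambda shrinks H toward 0

rng = np.random.default_rng(0)
Y = rng.random((6, 10))    # mixture spectrogram
D = rng.random((6, 4))     # (concatenated) dictionary
H = rng.random((4, 10))    # sparse code
H_new = snmf_h_step(Y, D, H, lam=0.1)
print(H_new.shape, H_new.min() >= 0)
```

With $\lambda=0$ this reduces to the plain Euclidean update (4); the added constant in the denominator is what drives small entries of $H$ toward zero.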

SNMF is applied to the training data of each individual speaker to learn the dictionaries. In the separation step, the merged over-complete dictionary is kept fixed and the sparse code $H$ of the mixture is obtained. The separated sources are then reconstructed from the over-complete dictionary and $H$.

Note that computing a dictionary per speaker is computationally demanding. Because of the phoneme-based structure of speech, dictionaries can instead be learned at the phoneme level, using an HMM phoneme recognizer.

For each phoneme the dictionary is over-complete; it turns out that there are some general frequency properties per phoneme.

Intuitively, when restricting the sparsity of $H$, the total over-complete dictionary can be regarded as a concatenation of phoneme-level dictionaries, which means such a sparse $H$ can be treated as our speech features, given suitable training data.

The SNR is computed by comparing the separated waveforms with the clean sources.

$\lambda$ is important for the separation performance. It controls the sparsity directly and the size of the dictionary indirectly: for smaller $\lambda$ the dictionary size increases.

The drawback of NMF is evident: its computational demand. A decomposition generally needs 500 to 1000 iterations, and each iteration includes evaluating the cost function and updating the parameters. Moreover, this process runs in both the training and the testing phase, although in the testing phase the basis functions are not updated. This makes the choice of divergence, and of correspondingly fast-converging update rules, a hard decision.

On the other hand, the possibilities for regularizing the cost functions provide flexible and practical choices for different demands and different statistics of the data.

To end this work, one question deserves to be mentioned: when does NMF give a correct decomposition into parts? (See Further reading.) That is, under which conditions does NMF achieve a unique and correct recovery? This question applies to all NMF algorithms.

References:

[1] W. H. Lawton, E. A. Sylvestre (1971) "Self Modeling Curve Resolution", Technometrics vol. 13, no. 3, 617-622

[2] P. Paatero, U. Tapper (1994) "Positive matrix factorization: a non-negative factor model with optimal utilization of error estimates of data values", Environmetrics vol. 5, issue 2, 111-126

[3] D. D. Lee, H. S. Seung (1999) "Learning the parts of objects by non-negative matrix factorization", Nature vol. 401, 788-791

[4] D. D. Lee, H. S. Seung (2000) "Algorithms for non-negative matrix factorization"

[5] P. Smaragdis (2004) "Non-negative matrix factor deconvolution; extraction of multiple sound sources from monophonic inputs"

[6] Campbell, S.L.; G.D. Poole (1981) "Computing nonnegative rank factorizations.". Linear Algebra Appl. 35: 175–182

[7] C. Févotte (2008) "Nonnegative matrix factorization with the Itakura-Saito divergence: with application to music analysis"

[8] K. Wilson (2008) "Speech denoising using nonnegative matrix factorization with priors"

[9] M. N. Schmidt (2006) „Single-channel speech separation using sparse non-negative matrix factorization“

Further reading: D. Donoho "When does non-negative matrix factorization give a correct decomposition into parts?"