
Computational Intelligence & Neuroscience

Advances in Nonnegative Matrix and Tensor Factorization

Guest Editors: Andrzej Cichocki, Morten Mørup, Paris Smaragdis, Wenwu Wang, and Rafal Zdunek


Copyright © 2008 Hindawi Publishing Corporation. All rights reserved.

This is a special issue published in volume 2008 of “Computational Intelligence and Neuroscience.” All articles are open access articles distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.


Editor-in-Chief

Andrzej Cichocki, RIKEN Brain Science Institute, Japan

Advisory Editors

Georg Adler, Germany
Shun-ichi Amari, Japan
Remi Gervais, France
Johannes Pantel, Germany

Associate Editors

Fabio Babiloni, Italy
Sylvain Baillet, France
Theodore W. Berger, USA
Vince D. Calhoun, USA
Seungjin Choi, Korea
S. Antonio Cruces-Alvarez, Spain
Deniz Erdogmus, USA
Simone G. O. Fiori, Italy
Shangkai Gao, China
Lars Kai Hansen, Denmark
Shiro Ikeda, Japan
Pasi A. Karjalainen, Finland
Yuanqing Li, Singapore
Hiroyuki Nakahara, Japan
Karim G. Oweiss, USA
Rodrigo Quian Quiroga, UK
Saeid Sanei, UK
Hiroshige Takeichi, Japan
Akaysha Tang, USA
Fabian Joachim Theis, Japan
Shiro Usui, Japan
Marc Van Hulle, Belgium
Yoko Yamaguchi, Japan
Liqing Zhang, China


Contents

Advances in Nonnegative Matrix and Tensor Factorization, A. Cichocki, M. Mørup, P. Smaragdis, W. Wang, and R. Zdunek, Volume 2008, Article ID 852187, 3 pages

Probabilistic Latent Variable Models as Nonnegative Factorizations, Madhusudana Shashanka, Bhiksha Raj, and Paris Smaragdis, Volume 2008, Article ID 947438, 8 pages

Fast Nonnegative Matrix Factorization Algorithms Using Projected Gradient Approaches for Large-Scale Problems, Rafal Zdunek and Andrzej Cichocki, Volume 2008, Article ID 939567, 13 pages

Theorems on Positive Data: On the Uniqueness of NMF, Hans Laurberg, Mads Græsbøll Christensen, Mark D. Plumbley, Lars Kai Hansen, and Søren Holdt Jensen, Volume 2008, Article ID 764206, 9 pages

Nonnegative Matrix Factorization with Gaussian Process Priors, Mikkel N. Schmidt and Hans Laurberg, Volume 2008, Article ID 361705, 10 pages

Extended Nonnegative Tensor Factorisation Models for Musical Sound Source Separation, Derry FitzGerald, Matt Cranitch, and Eugene Coyle, Volume 2008, Article ID 872425, 15 pages

Gene Tree Labeling Using Nonnegative Matrix Factorization on Biomedical Literature, Kevin E. Heinrich, Michael W. Berry, and Ramin Homayouni, Volume 2008, Article ID 276535, 12 pages

Single-Trial Decoding of Bistable Perception Based on Sparse Nonnegative Tensor Decomposition, Zhisong Wang, Alexander Maier, Nikos K. Logothetis, and Hualou Liang, Volume 2008, Article ID 642387, 10 pages

Pattern Expression Nonnegative Matrix Factorization: Algorithm and Applications to Blind Source Separation, Junying Zhang, Le Wei, Xuerong Feng, Zhen Ma, and Yue Wang, Volume 2008, Article ID 168769, 10 pages

Robust Object Recognition under Partial Occlusions Using NMF, Daniel Soukup and Ivan Bajla, Volume 2008, Article ID 857453, 14 pages


Hindawi Publishing Corporation
Computational Intelligence and Neuroscience
Volume 2008, Article ID 852187, 3 pages
doi:10.1155/2008/852187

Editorial

Advances in Nonnegative Matrix and Tensor Factorization

A. Cichocki,1 M. Mørup,2 P. Smaragdis,3 W. Wang,4 and R. Zdunek5

1 Laboratory for Advanced Brain Signal Processing, RIKEN Brain Science Institute, Saitama 351-0198, Japan
2 Department of Informatics and Mathematical Modeling, Technical University of Denmark, Richard Petersens Plads, Building 321, 2800 Lyngby, Denmark
3 Advanced Technology Labs, Adobe Systems Inc., 275 Grove Street, Newton, MA 02466, USA
4 Centre for Vision, Speech, and Signal Processing, University of Surrey, Guildford GU2 7XH, UK
5 Institute of Telecommunications, Teleinformatics, and Acoustics, Wroclaw University of Technology, Wybrzeze Wyspianskiego 27, 50-370 Wroclaw, Poland

Correspondence should be addressed to W. Wang, [email protected]

Received 16 June 2008; Accepted 16 June 2008

Copyright © 2008 A. Cichocki et al. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

Nonnegative matrix factorization (NMF) and its extension known as nonnegative tensor factorization (NTF) are recently proposed, rapidly developing techniques. The goal of NMF/NTF is to decompose a nonnegative data matrix into a product of lower-rank nonnegative matrices or tensors (i.e., multiway arrays). Like independent component analysis (ICA) and sparse component analysis (SCA), NMF is a useful and promising approach for decomposing high-dimensional datasets into a lower-dimensional space. A great deal of interest has recently been devoted to NMF models and techniques because of their ability to provide new insights and relevant information on the complex latent relationships in experimental datasets, and to produce meaningful components with physical or physiological interpretations. For example, in bioinformatics, NMF and its extensions have been successfully applied to gene expression, sequence analysis, functional characterization of genes, clustering, and text mining. The main difference between NMF and other classical factorizations such as PCA, SCA, or ICA lies in the nonnegativity, usually combined with additional constraints such as sparseness, smoothness, and/or orthogonality imposed on the models. These constraints tend to lead to a parts-based representation of the data, because they allow only additive, not subtractive, combinations of data items. In this way, the nonnegative components or factors produced by this approach can be interpreted as parts of the data; that is, NMF yields nonnegative factors, which is advantageous for the interpretability of the estimated components.

Furthermore, in many real applications, data have a multiway (multiway array, or tensor) structure. Examples are video streams (rows, columns, RGB color coordinates, time), EEG in neuroscience (channels, frequency, time, samples, conditions, subjects), bibliographic text data (keywords, papers, authors, journals), and so on. Conventional methods preprocess multiway data by arranging them into a matrix. Recently, there has been a great deal of research on multiway analysis which preserves the original multiway structure of the data. These techniques have been shown to be very useful in a number of applications, such as signal separation, feature extraction, audio coding, speech classification, image compression, spectral clustering, neuroscience, and biomedical signal analysis.

This special issue focuses on the most recent advances in NMF/NTF methods, with emphasis on the efforts made particularly by researchers from the signal processing and neuroscience areas. It reports novel theoretical results, efficient algorithms, and their applications. It also provides insight into current challenging areas and identifies future research directions.

This issue includes several important contributions which cover a wide range of approaches and techniques for NMF/NTF and their applications. These contributions are summarized as follows.

The first paper, entitled “Probabilistic latent variable models as nonnegative factorizations” by M. Shashanka et al., presents a family of probabilistic latent variable models that can be used for the analysis of nonnegative data. The paper shows that there are strong ties between NMF and this family, and provides some straightforward extensions


which can help in dealing with shift invariances, higher-order decompositions, and sparsity constraints. Furthermore, it argues through these extensions that the use of this approach allows for rapid development of complex statistical models for analyzing nonnegative data.

The second paper, entitled “Fast nonnegative matrix factorization algorithms using projected gradient approaches for large-scale problems” by R. Zdunek and A. Cichocki, investigates the applicability of projected gradient (PG) methods to NMF, motivated by the observation that PG methods are highly efficient in solving large-scale convex minimization problems subject to linear constraints, and that the minimization problems underlying NMF of large matrices match this class of problems well. In particular, the paper investigates several modified and adapted methods, including the projected Landweber method, Barzilai-Borwein gradient projection, projected sequential subspace optimization, the interior-point Newton algorithm, and the sequential coordinatewise minimization algorithm, and compares their performance in terms of signal-to-interference ratio and elapsed time, using a simple benchmark of mixed, partially dependent nonnegative signals.

The third paper, entitled “Theorems on positive data: on the uniqueness of NMF” by H. Laurberg et al., investigates the conditions under which NMF is unique and introduces several theorems which can determine whether a decomposition is in fact unique. Several examples are provided to show the use of the theorems and their limitations. The paper also shows that corruption of a unique NMF matrix by additive noise leads to a noisy estimation of the noise-free unique solution. Moreover, it uses a stochastic view of NMF to analyze which characterization of the underlying model will result in an NMF with small estimation errors.

The fourth paper, entitled “Nonnegative matrix factorization with Gaussian process priors” by M. N. Schmidt and H. Laurberg, presents a general method for including prior knowledge in NMF, based on Gaussian process priors. It assumes that the nonnegative factors in the NMF are linked by a strictly increasing function to an underlying Gaussian process specified by its covariance function. The NMF decompositions are found to be in agreement with prior knowledge of the distribution of the factors, such as sparseness, smoothness, and symmetries.

The fifth paper, entitled “Extended nonnegative tensor factorisation models for musical sound source separation” by D. FitzGerald et al., presents a new additive-synthesis-based NTF approach which allows the use of linear-frequency spectrograms as well as the imposition of strict harmonic constraints. This results in an improved model compared with some existing shift-invariant tensor factorization algorithms, in which the use of log-frequency spectrograms to allow shift invariance in frequency causes problems when attempting to resynthesize the separated sources. The paper further studies the addition of a source-filter model to the factorization framework, and presents an extended model which is capable of separating mixtures of pitched and percussive instruments simultaneously.

The sixth paper, entitled “Gene tree labeling using nonnegative matrix factorization on biomedical literature” by K. E. Heinrich et al., addresses a challenging problem for biological applications, namely identifying functional groups of genes. It examines the NMF technique for labeling hierarchical trees. It proposes a generic labeling algorithm as well as an evaluation technique, and discusses the effects of different NMF parameters with regard to convergence and labeling accuracy. The primary goals of this paper are to provide a qualitative assessment of NMF and its various parameters and initializations, to provide an automated way to classify biomedical data, and to provide a method for evaluating labeled data assuming a static input tree. The paper also proposes a method for generating gold standard trees.

The seventh paper, entitled “Single-trial decoding of bistable perception based on sparse nonnegative tensor decomposition” by Z. Wang et al., presents a sparse-NTF-based method to extract features from the local field potential (LFP), collected from the middle temporal visual cortex of a macaque monkey, for decoding its bistable structure-from-motion perception. The advantages of the sparse-NTF-based feature-extraction approach lie in its capability to yield components common across the space, time, and frequency domains, yet discriminative across different conditions, without prior knowledge of the discriminating frequency bands and temporal windows for a specific subject. The results suggest that imposing sparseness constraints on the NTF improves extraction of the gamma-band feature, which carries the most discriminative information for bistable perception.

The eighth paper, entitled “Pattern expression nonnegative matrix factorization: algorithm and applications to blind source separation” by J. Zhang et al., presents a pattern-expression NMF (PE-NMF) approach from the viewpoint of using basis vectors most effectively to express patterns. Two regularization (penalty) terms are added to the original loss function of standard NMF for effective expression of patterns with basis vectors in the PE-NMF. A learning algorithm is presented, and the convergence of the algorithm is proved theoretically. Three illustrative examples of blind source separation, including heterogeneity correction for gene microarray data, indicate that the sources can be successfully recovered with the proposed PE-NMF when the two parameters are suitably chosen from prior knowledge of the problem.

The last paper, entitled “Robust object recognition under partial occlusions using NMF” by D. Soukup and I. Bajla, studies NMF methods for recognition tasks with occluded objects. The paper analyzes the influence of sparseness on recognition rates for various dimensions of the subspaces generated for two image databases, the ORL face database and the USPS handwritten digit database. It also studies the behavior of four types of distances between a projected unknown image object and feature vectors in NMF subspaces generated for training data. In the recognition phase, partial occlusions in the test images have been modeled by placing two randomly sized, randomly positioned black rectangles into each test image.


Acknowledgments

The guest editors of this special issue are extremely grateful to all the reviewers who took the time to carefully read the submitted manuscripts and to provide critical comments, which helped to ensure the high quality of this issue. The guest editors are also much indebted to the authors for their important contributions. All these tremendous efforts and dedication have helped to make this issue a reality.

A. Cichocki
M. Mørup
P. Smaragdis
W. Wang
R. Zdunek


Hindawi Publishing Corporation
Computational Intelligence and Neuroscience
Volume 2008, Article ID 947438, 8 pages
doi:10.1155/2008/947438

Research Article

Probabilistic Latent Variable Models as Nonnegative Factorizations

Madhusudana Shashanka,1 Bhiksha Raj,2 and Paris Smaragdis3

1 Mars Incorporated, 800 High Street, Hackettstown, New Jersey 07840, USA
2 Mitsubishi Electric Research Laboratories, 201 Broadway, Cambridge, MA 02139, USA
3 Adobe Systems Incorporated, 275 Grove Street, Newton, MA 02466, USA

Correspondence should be addressed to Paris Smaragdis, [email protected]

Received 21 December 2007; Accepted 13 February 2008

Recommended by Rafal Zdunek

This paper presents a family of probabilistic latent variable models that can be used for analysis of nonnegative data. We show that there are strong ties between nonnegative matrix factorization and this family, and provide some straightforward extensions which can help in dealing with shift invariances, higher-order decompositions, and sparsity constraints. We argue through these extensions that the use of this approach allows for rapid development of complex statistical models for analyzing nonnegative data.

Copyright © 2008 Madhusudana Shashanka et al. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

1. Introduction

Techniques to analyze nonnegative data are required in several applications, such as the analysis of images, text corpora, and audio spectra, to name a few. A variety of techniques have been proposed for the analysis of such data, such as nonnegative PCA [1], nonnegative ICA [2], nonnegative matrix factorization (NMF) [3], and so on. The goal of all of these techniques is to explain the given nonnegative data as a guaranteed nonnegative linear combination of a set of nonnegative “bases” that represent realistic “building blocks” for the data. Of these, probably the most developed is nonnegative matrix factorization, with much recent research devoted to the topic [4–6]. All of these approaches view each data vector as a point in an N-dimensional space and attempt to identify the bases that best explain the distribution of the data within this space. For the sake of clarity, we will refer to data that represent vectors in any space as point data.

A somewhat related, but separate, topic that has garnered much research over the years is the analysis of histograms of multivariate data. Histogram data represent the counts of occurrences of a set of events in a given dataset. The aim here is to identify the statistical factors that affect the occurrence of data through the analysis of these counts and appropriate modeling of the distributions underlying them. Such analysis is often required in the analysis of text, behavioral patterns, and so on. A variety of techniques, such as probabilistic latent semantic analysis [7], latent Dirichlet allocation [8], and their derivatives, have lately become quite popular. Most, if not all, of them can be related to a class of probabilistic models, known in the behavioral sciences community as latent class models [9–11], that attempt to explain the observed histograms as having been drawn from a set of latent classes, each with its own distribution. For clarity, we will refer to histograms and collections of histograms as histogram data.

In this paper, we argue that techniques meant for the analysis of histogram data can be equally effectively employed for the decomposition of nonnegative point data, by interpreting the latter as scaled histograms rather than vectors. Specifically, we show that the algorithms used for estimating the parameters of a latent class model are numerically equivalent to the update rules for one form of NMF. We also propose alternate latent variable models for histogram decomposition that are similar to those commonly employed in the analysis of text, apply them to decompose point data, and show that these too are identical to the


update rules for NMF. We will generically refer to the application of histogram-decomposition techniques to point data as probabilistic decompositions. (This must not be confused with approaches that model the distribution of the set of vectors. In our approach, the vectors themselves are histograms or, alternately, scaled probability distributions.)

Beyond simple equivalences to NMF, the probabilistic decomposition approach has several advantages, as we explain. Nonnegative PCA/ICA and NMF are primarily intended for matrix-like two-dimensional characterizations of data: the analysis is obtained for matrices that are formed by laying data vectors side by side. They do not naturally extend to higher-dimensional tensorial representations; this has often been accomplished by implicitly unwrapping the tensors into a matrix. The probabilistic decomposition, however, naturally extends from matrices to tensors of arbitrary dimensions.

It is often desired to control the form or structure of the learned bases and their projections. Since the procedure for learning the bases that represent the data is statistical, probabilistic decomposition affords control over the form of the learned bases through the imposition of a priori probabilities, as we will show. Constraints such as sparsity can also be incorporated through these priors.

We also describe extensions to the basic probabilistic decomposition framework that permit shift invariance along one or more of the dimensions (of the data tensor) and that can abstract convolutively combined bases from the data.

The rest of the paper is organized as follows. Since the probabilistic decomposition approach we promote in this paper is most analogous to nonnegative matrix factorization (NMF) among all techniques that analyze nonnegative point data, we begin with a brief discussion of NMF. We present the family of latent variable models that we employ for probabilistic decompositions in Section 3. We present tensor generalizations in Section 4.1 and convolutive factorizations in Section 4.2. In Section 4.3, we discuss extensions such as the incorporation of sparsity, and in Section 4.4, we present aspects of the geometric interpretation of these decompositions.

2. Nonnegative Matrix Factorization

Nonnegative matrix factorization was introduced by [3] to find nonnegative parts-based representations of data. Given an M × N matrix V, where each column corresponds to a data vector, NMF approximates it as a product of nonnegative matrices W and H, that is, V ≈ WH, where W is an M × K matrix and H is a K × N matrix. The above approximation can be written column by column as vn ≈ Whn, where vn and hn are the nth columns of V and H, respectively. In other words, each data vector vn is approximated by a linear combination of the columns of W, weighted by the entries of hn. The columns of W can be thought of as basis vectors that, when combined with appropriate mixture weights (the entries of the columns of H), provide a linear approximation of V.

The optimal choice of matrices W and H is defined by those nonnegative matrices that minimize the reconstruction error between V and WH. Different error functions have been proposed, which lead to different update rules (e.g., [3, 12]). Shown below are the multiplicative update rules derived by [3] using an error measure similar to the Kullback-Leibler divergence:

\[
W_{mk} \leftarrow W_{mk}\sum_{n}\frac{V_{mn}}{(WH)_{mn}}H_{kn},
\qquad
W_{mk} \leftarrow \frac{W_{mk}}{\sum_{m}W_{mk}},
\qquad
H_{kn} \leftarrow H_{kn}\sum_{m}W_{mk}\frac{V_{mn}}{(WH)_{mn}},
\tag{1}
\]

where A_{ij} denotes the entry in the ith row and jth column of matrix A.
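
To make the updates in (1) concrete, the following is a minimal NumPy sketch of these KL-type multiplicative updates; the random initialization, iteration count, and the small constant added for numerical stability are illustrative choices, not part of the original formulation.

```python
import numpy as np

def nmf_kl(V, K, n_iter=500, eps=1e-12, seed=0):
    """Multiplicative updates of (1) for V ~ W H, with the columns of W renormalized."""
    rng = np.random.default_rng(seed)
    M, N = V.shape
    W = rng.random((M, K)) + eps
    H = rng.random((K, N)) + eps
    for _ in range(n_iter):
        R = V / (W @ H + eps)                    # elementwise ratio V_mn / (WH)_mn
        W *= R @ H.T                             # W_mk <- W_mk * sum_n R_mn H_kn
        W /= W.sum(axis=0, keepdims=True) + eps  # W_mk <- W_mk / sum_m W_mk
        R = V / (W @ H + eps)
        H *= W.T @ R                             # H_kn <- H_kn * sum_m W_mk R_mn
    return W, H
```

With W column-normalized in this way, H carries the overall scale of the data, which is the convention that the probabilistic reinterpretation below relies on.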

3. Latent Variable Models

In its simplest form, NMF expresses an M × N data matrix V as the product of nonnegative matrices W and H. The idea is to express the data vectors (columns of V) as combinations of a set of basis components or latent factors (columns of W). Below, we show that a class of probabilistic models employing latent variables, known in the field of social and behavioral sciences as latent class models (e.g., [9, 11, 13]), is equivalent to NMF.

Let us represent the two dimensions of the matrix V by x1 and x2, respectively. We can consider the nonnegative entries V_{x1x2} as having been generated by an underlying probability distribution P(x1, x2). Variables x1 and x2 are multinomial random variables, where x1 can take one out of a set of M values in a given draw and x2 can take one out of a set of N values in a given draw. In other words, one can model V_{mn}, the entry in row m and column n, as the number of times the features x1 = m and x2 = n were picked in a set of repeated draws from the distribution P(x1, x2). Unlike NMF, which tries to characterize the observed data directly, latent class models characterize the underlying distribution P(x1, x2). This subtle difference of interpretation preserves all the advantages of NMF, while overcoming some of its limitations by providing a framework that is easy to generalize, extend, and interpret.

There are two ways of modeling P(x1, x2), and we consider them separately below.

3.1. Symmetric Factorization

Latent class models enable one to attribute the observations to hidden or latent factors. The main characteristic of these models is conditional independence: multivariate data are modeled as belonging to latent classes such that the random variables within a latent class are independent of one another. The model expresses a multivariate distribution such as P(x1, x2) as a mixture where each component of the mixture is a product of one-dimensional marginal distributions. In the case of two-dimensional data such as V, the model can be written mathematically as

\[
P(x_1, x_2) = \sum_{z\in\{1,2,\ldots,K\}} P(z)\,P(x_1\mid z)\,P(x_2\mid z).
\tag{2}
\]


In (2), z is a latent variable that indexes the hidden components and takes values from the set {1, . . . , K}. This equation assumes the principle of local independence, whereby the latent variable z renders the observed variables x1 and x2 independent. This model was presented independently as probabilistic latent component analysis (PLCA) by [14]. The aim of the model is to characterize the distribution underlying the data as shown above by learning the parameters, so that hidden structure present in the data becomes explicit.

The model can be expressed as a matrix factorization. Representing the parameters P(x1|z), P(x2|z), and P(z) as entries of matrices W, G, and S, respectively, where

(i) W is an M × K matrix such that Wmk corresponds to the probability P(x1 = m|z = k);

(ii) G is a K × N matrix such that Gkn corresponds to the probability P(x2 = n|z = k); and

(iii) S is a K × K diagonal matrix such that Skk corresponds to the probability P(z = k);

one can write the model of (2) in matrix form as

\[
P = WSG, \quad\text{or equivalently,}\quad P = WH,
\tag{3}
\]

where the entries of matrix P correspond to P(x1, x2) and H = SG. Figure 1 illustrates the model schematically.

Parameters can be estimated using the EM algorithm. The update equations for the parameters can be written as

\[
\begin{aligned}
P(z\mid x_1, x_2) &= \frac{P(z)\,P(x_1\mid z)\,P(x_2\mid z)}{\sum_{z}P(z)\,P(x_1\mid z)\,P(x_2\mid z)},\\[4pt]
P(x_i\mid z) &= \frac{\sum_{x_j:\, j\neq i} V_{x_1 x_2}\,P(z\mid x_1, x_2)}{\sum_{x_1, x_2} V_{x_1 x_2}\,P(z\mid x_1, x_2)},\\[4pt]
P(z) &= \frac{\sum_{x_1, x_2} V_{x_1 x_2}\,P(z\mid x_1, x_2)}{\sum_{z, x_1, x_2} V_{x_1 x_2}\,P(z\mid x_1, x_2)}.
\end{aligned}
\tag{4}
\]

Writing the above update equations in matrix form using W and H from (3), we obtain

\[
\begin{aligned}
W_{mk} &\leftarrow W_{mk}\sum_{n}\frac{V_{mn}}{(WH)_{mn}}H_{kn}, &
W_{mk} &\leftarrow \frac{W_{mk}}{\sum_{m}W_{mk}},\\[4pt]
H_{kn} &\leftarrow H_{kn}\sum_{m}W_{mk}\frac{V_{mn}}{(WH)_{mn}}, &
H_{kn} &\leftarrow \frac{H_{kn}}{\sum_{k,n}H_{kn}}.
\end{aligned}
\tag{5}
\]

The above equations are identical to the NMF update equations of (1) up to a scaling factor in H. This is because the probabilistic model decomposes P, which is equivalent to a normalized version of the data V. Reference [14] presents a detailed derivation of the update algorithms and a comparison with the NMF update equations. This model has been used in analyzing image and audio data, among other applications (e.g., [14–16]).
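
For readers who prefer code to update equations, the following is a small NumPy sketch of the EM iterations in (4) for the symmetric model (2); the tensor-shaped posterior is kept explicit for clarity rather than efficiency, and the initialization and iteration count are arbitrary choices of ours.

```python
import numpy as np

def plca_2d(V, K, n_iter=200, eps=1e-12, seed=0):
    """EM updates (4) for P(x1, x2) = sum_z P(z) P(x1|z) P(x2|z), given count data V (M x N)."""
    rng = np.random.default_rng(seed)
    M, N = V.shape
    Pz = np.full(K, 1.0 / K)
    Px1_z = rng.random((M, K)); Px1_z /= Px1_z.sum(axis=0)
    Px2_z = rng.random((N, K)); Px2_z /= Px2_z.sum(axis=0)
    for _ in range(n_iter):
        # E-step: posterior P(z | x1, x2) for every cell, shape (M, N, K)
        joint = Pz[None, None, :] * Px1_z[:, None, :] * Px2_z[None, :, :]
        post = joint / (joint.sum(axis=2, keepdims=True) + eps)
        # M-step: weight the posteriors by the observed counts and renormalize
        weighted = V[:, :, None] * post
        Px1_z = weighted.sum(axis=1); Px1_z /= Px1_z.sum(axis=0) + eps
        Px2_z = weighted.sum(axis=0); Px2_z /= Px2_z.sum(axis=0) + eps
        Pz = weighted.sum(axis=(0, 1)); Pz /= Pz.sum() + eps
    return Pz, Px1_z, Px2_z
```

In the notation of (3), W holds the columns of Px1_z, S = diag(Pz), and G is the transpose of Px2_z, so that WSG approximates the normalized data.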

3.2. Asymmetric Factorization

The latent class model of (2) considers each dimension symmetrically for factorization.

Figure 1: Latent variable model of (2) as matrix factorization.

Figure 2: Latent variable model of (6) as matrix factorization.

The two-dimensional distribution P(x1, x2) is expressed as a mixture of two-dimensional latent factors, where each factor is a product of one-dimensional marginal distributions. Now, consider the following factorization of P(x1, x2):

\[
\begin{aligned}
P(x_1, x_2) &= P(x_i)\,P(x_j\mid x_i),\\
P(x_j\mid x_i) &= \sum_{z} P(x_j\mid z)\,P(z\mid x_i),
\end{aligned}
\tag{6}
\]

where i, j ∈ {1, 2}, i ≠ j, and z is a latent variable. This version of the model, with asymmetric factorization, is popularly known as probabilistic latent semantic analysis (PLSA) in the topic-modeling literature [7].

Without loss of generality, let j = 1 and i = 2. We can write the above model in matrix form as qn = Wgn, where qn is a column vector indicating P(x1|x2), gn is a column vector indicating P(z|x2), and W is a matrix whose (m, k)th element corresponds to P(x1 = m|z = k). If z takes K values, W is an M × K matrix. Concatenating the column vectors qn and gn as matrices Q and G, respectively, one can write the model as

\[
Q = WG, \quad\text{or equivalently,}\quad V = WGS = WH,
\tag{7}
\]

where S is an N × N diagonal matrix whose nth diagonal element is the sum of the entries of vn (the nth column of V), and H = GS. Figure 2 provides a schematic illustration of the model.

Given the data matrix V, the parameters P(x1|z) and P(z|x2) are estimated by iterating equations derived using the EM algorithm:

\[
\begin{aligned}
P(z\mid x_1, x_2) &= \frac{P(z\mid x_2)\,P(x_1\mid z)}{\sum_{z}P(z\mid x_2)\,P(x_1\mid z)},\\[4pt]
P(x_1\mid z) &= \frac{\sum_{x_2} V_{x_1 x_2}\,P(z\mid x_1, x_2)}{\sum_{x_1, x_2} V_{x_1 x_2}\,P(z\mid x_1, x_2)},\\[4pt]
P(z\mid x_2) &= \frac{\sum_{x_1} V_{x_1 x_2}\,P(z\mid x_1, x_2)}{\sum_{x_1} V_{x_1 x_2}}.
\end{aligned}
\tag{8}
\]


Writing the above equations in matrix form using W and H from (7), we obtain

\[
\begin{aligned}
W_{mk} &\leftarrow W_{mk}\sum_{n}\frac{V_{mn}}{(WH)_{mn}}H_{kn},\\[4pt]
W_{mk} &\leftarrow \frac{W_{mk}}{\sum_{m}W_{mk}},\\[4pt]
H_{kn} &\leftarrow H_{kn}\sum_{m}W_{mk}\frac{V_{mn}}{(WH)_{mn}}.
\end{aligned}
\tag{9}
\]

The above set of equations is identical to the NMF update equations of (1). See [17, 18] for a detailed derivation of the update equations. The equivalence between NMF and PLSA has also been pointed out by [19]. The model has been used for the analysis of audio spectra (e.g., [20]), images (e.g., [17, 21]), and text corpora (e.g., [7]).
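
As a small illustration of the correspondence in (7), the snippet below converts an arbitrary NMF solution V ≈ WH into the PLSA parameters P(x1|z) and P(z|x2); the function name and the handling of an unnormalized W are our own additions, not part of the paper.

```python
import numpy as np

def nmf_to_plsa(V, W, H, eps=1e-12):
    """Map NMF factors V ~ W H to PLSA parameters P(x1|z) and P(z|x2), following (7)."""
    d = W.sum(axis=0, keepdims=True) + eps   # column sums of W
    Px1_z = W / d                            # normalize columns so each is a distribution over x1
    H_scaled = d.T * H                       # absorb the scale so the product W H is unchanged
    S = V.sum(axis=0, keepdims=True) + eps   # diagonal of S in (7): column sums of V
    Pz_x2 = H_scaled / S                     # G = H S^{-1}, i.e., P(z|x2) column by column
    return Px1_z, Pz_x2
```

Each column of Pz_x2 sums to one only to the extent that WH reproduces the column sums of V, which is exactly the scaling relationship noted above.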

4. Model Extensions

The popularity of NMF comes mainly from its empirical success in finding “useful components” in data. As pointed out by several researchers, NMF has certain important limitations despite this success. We have presented probabilistic models that are numerically closely related or identical to one of the widely used NMF update algorithms. Despite the numerical equivalence, the methodological difference between the approaches is important. In this section, we outline some advantages of using this alternate probabilistic view of NMF.

The first and most straightforward implication of using a probabilistic approach is that it provides a theoretical basis for the technique. More importantly, the probabilistic underpinning enables one to utilize all the tools and machinery of statistical inference for estimation. This is crucial for extensions and generalizations of the method. Beyond these obvious advantages, below we discuss some specific examples where this approach is more useful.

4.1. Tensorial Factorization

NMF was introduced to analyze two-dimensional data. However, there are several domains with nonnegative multidimensional data where a multidimensional correlate of NMF could be very useful. This problem has been termed nonnegative tensor factorization (NTF). Several extensions of NMF have been proposed to handle multidimensional data (e.g., [4–6, 22]). Typically, these methods flatten the tensor into a matrix representation and proceed with the analysis. Conceptually, NTF is a natural generalization of NMF, but the estimation algorithms for learning the parameters do not lend themselves to extensions easily. Several issues contribute to this difficulty. We do not present the reasons here due to lack of space, but a detailed discussion can be found in [6].

Now, consider the symmetric factorization case of the latent variable model presented in Section 3.1. This model is naturally suited for generalization to multiple dimensions. In its general form, the model expresses a K-dimensional distribution as a mixture, where each K-dimensional component of the mixture is a product of one-dimensional marginal distributions. Mathematically, it can be written as

\[
P(x) = \sum_{z} P(z)\prod_{j=1}^{K} P(x_j\mid z),
\tag{10}
\]

where P(x) is a K-dimensional distribution of the random variable x = (x1, x2, . . . , xK), z is the latent variable indexing the mixture components, and the P(xj|z) are one-dimensional marginal distributions. The parameters are estimated by iterating equations derived using the EM algorithm:

\[
\begin{aligned}
R(x, z) &= \frac{P(z)\prod_{j=1}^{K} P(x_j\mid z)}{\sum_{z'}P(z')\prod_{j=1}^{K} P(x_j\mid z')},\\[4pt]
P(z) &= \sum_{x} P(x)\,R(x, z),\\[4pt]
P(x_j\mid z) &= \frac{\sum_{x_i:\, i\neq j} P(x)\,R(x, z)}{P(z)}.
\end{aligned}
\tag{11}
\]

In the two-dimensional case, these update equations reduce to (4).

To illustrate the kind of output of this algorithm, consider the following toy example. The input P(x) was the 3-dimensional distribution shown in the upper left plot of Figure 3. This distribution can also be seen as a rank-3 positive tensor. It is clearly composed of two components, each an isotropic Gaussian, with means μ1 = (11, 11, 9) and μ2 = (14, 14, 16) and variances σ1² = 1 and σ2² = 1/2, respectively. The bottom row of plots shows the sets of P(xj|z) derived using the estimation procedure we just described. We can see that each of them is composed of a Gaussian at the expected position and with the expected variance. The approximated P(x) using this model is shown in the top right. Other examples of applications on more complex data and a detailed derivation of the algorithm can be found in [14, 23].
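
A compact NumPy sketch of the three-dimensional case of (10)-(11) is given below; it operates on a normalized 3-way array and keeps the full posterior tensor in memory, which is fine for toy examples like the one in Figure 3 but not for large tensors. The names and defaults are ours.

```python
import numpy as np

def plca_3way(P, K, n_iter=200, eps=1e-12, seed=0):
    """EM updates (11) for P(x1,x2,x3) ~ sum_z P(z) P(x1|z) P(x2|z) P(x3|z)."""
    rng = np.random.default_rng(seed)
    dims = P.shape
    Pz = np.full(K, 1.0 / K)
    marg = [rng.random((d, K)) for d in dims]
    for m in marg:
        m /= m.sum(axis=0)
    for _ in range(n_iter):
        # E-step: responsibility R(x, z) for every tensor cell, shape dims + (K,)
        joint = (Pz[None, None, None, :]
                 * marg[0][:, None, None, :]
                 * marg[1][None, :, None, :]
                 * marg[2][None, None, :, :])
        R = joint / (joint.sum(axis=-1, keepdims=True) + eps)
        weighted = P[..., None] * R
        # M-step: marginalize the weighted responsibilities
        Pz = weighted.sum(axis=(0, 1, 2)); Pz /= Pz.sum() + eps
        for j in range(3):
            other = tuple(a for a in range(3) if a != j)
            marg[j] = weighted.sum(axis=other)
            marg[j] /= marg[j].sum(axis=0) + eps
    return Pz, marg
```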

4.2. Convolutive Decompositions

Given a two-dimensional dataset, NMF finds hidden structure along one dimension (columnwise) that is characteristic of the entire dataset. Consider a scenario where there is localized structure along both dimensions (rows and columns) that has to be extracted from the data. An example dataset would be an acoustic spectrogram of human speech, which has structure along both frequency and time. Traditional NMF is unable to find structure across both dimensions, and several extensions have been proposed to handle such datasets (e.g., [24, 25]).

The latent variable model can be extended to such datasets, and the parameter estimation still follows a simple EM algorithm based on the principle of maximum likelihood. The model, known as the shift-invariant version of PLCA, can be written mathematically as [23]

\[
P(x) = \sum_{z}\Bigl(P(z)\int P(w, \tau\mid z)\,P(h - \tau\mid z)\,d\tau\Bigr),
\tag{12}
\]


Figure 3: An example of a higher-dimensional positive data decomposition. An isosurface of the original input is shown at the top left, the approximation by the model in (10) is shown in the top right, and the extracted marginals (or factors) are shown in the lower plots.

where the kernel distribution P(w, τ|z) = 0 for all τ ∉ R, where R defines a local convex region along the dimensions of x. Similar to the simple model of (2), the model expresses P(x) as a mixture of latent components. But instead of each component being a simple product of one-dimensional distributions, the components are convolutions between a multidimensional “kernel distribution” and a multidimensional “impulse distribution”. The update equations for the parameters are

\[
\begin{aligned}
R(x, \tau, z) &= \frac{P(z)\,P(w, \tau\mid z)\,P(h - \tau\mid z)}{\sum_{z'}P(z')\int P(w, \tau'\mid z')\,P(h - \tau'\mid z')\,d\tau'},\\[4pt]
P(z) &= \int P(x)\,R(x, \tau, z)\,dx\,d\tau,\\[4pt]
P(w, \tau\mid z) &= \frac{\int P(x)\,R(x, \tau, z)\,dh}{P(z)},\\[4pt]
P(h\mid z) &= \frac{\int P(w, h + \tau)\,R(w, h + \tau, \tau, z)\,dw\,d\tau}{\int P(w, h' + \tau)\,R(w, h' + \tau, \tau, z)\,dh'\,dw\,d\tau}.
\end{aligned}
\tag{13}
\]

A detailed derivation of the algorithm can be found in [14]. The above model is able to deal with tensorial data just as well as matrix data. To illustrate this model, consider the picture in the top left of Figure 4. This particular image is a rank-3 tensor (x, y, color). We wish to discover the underlying components that make up this image. The components are the digits 1, 2, and 3, which appear in various spatial locations, thereby necessitating a “shift-invariant” approach.

Figure 4: An example of a higher-dimensional shift-invariant positive data decomposition. The original input is shown at the top left, the approximation by the model in (12) is shown in the top middle, and the extracted kernels and impulses are shown in the lower plots.

Figure 5: Example of the effect of the entropic prior on a set of kernel and impulse distributions. If no constraint is imposed, the information is evenly distributed among the two distributions (left column); if sparsity is imposed on the impulse distribution, most information lies in the kernel distribution (middle column), and vice versa if we request a sparse kernel distribution (right column).

Using the aforementioned algorithm, we obtain the results shown in Figure 4. Other examples of such decompositions on more complex data are shown in [23].
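
The shift-invariant estimator above is derived in [14, 23]; as a rough, simplified stand-in, the sketch below implements convolutive NMF with shift invariance along the columns only, using KL-type multiplicative updates in the spirit of the NMF deconvolution of [24], where each component has a time-extended kernel W[t] and an activation (impulse) row in H. It is not the authors' PLCA estimator, and all names and defaults are ours.

```python
import numpy as np

def shift(M, t):
    """Shift the columns of M right by t (zero-filled); negative t shifts left."""
    out = np.zeros_like(M)
    if t == 0:
        out[:] = M
    elif t > 0:
        out[:, t:] = M[:, :-t]
    else:
        out[:, :t] = M[:, -t:]
    return out

def conv_nmf(V, K, T, n_iter=200, eps=1e-9, seed=0):
    """Convolutive NMF: V (M x N) ~ sum_t W[t] @ shift(H, t)."""
    rng = np.random.default_rng(seed)
    M, N = V.shape
    W = rng.random((T, M, K)) + eps
    H = rng.random((K, N)) + eps
    ones = np.ones_like(V)
    for _ in range(n_iter):
        Lam = sum(W[t] @ shift(H, t) for t in range(T)) + eps
        Ratio = V / Lam
        # update the activations H using contributions from every shift t
        num = sum(W[t].T @ shift(Ratio, -t) for t in range(T))
        den = sum(W[t].T @ shift(ones, -t) for t in range(T)) + eps
        H *= num / den
        Lam = sum(W[t] @ shift(H, t) for t in range(T)) + eps
        Ratio = V / Lam
        for t in range(T):  # update each time slice of the kernel
            W[t] *= (Ratio @ shift(H, t).T) / (ones @ shift(H, t).T + eps)
    return W, H
```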

The example above illustrates shift invariance, but it is conceivable that the “components” that form the input might occur with transformations such as rotations and/or scaling in addition to translations (shifts). It is possible to extend this model to incorporate invariance to such transformations.


Figure 6: Illustration of the latent variable model. The panel shows 3-dimensional data distributions as points within the standard 2-simplex given by {(001), (010), (100)}. The model approximates data distributions as points lying within the convex hull formed by the components (basis vectors). Also shown are two data points (marked by + and ×) and their approximations by the model (shown by ♦ and □, respectively).

The derivation follows naturally from the approach outlined above, but we omit further discussion here due to space constraints.

4.3. Extensions in the Form of Priors

One of the more apparent limitations of NMF is related to the quality of the components that are extracted. Researchers have pointed out that NMF, as introduced by Lee and Seung, does not have an explicit way to control the “sparsity” of the desired components [26]. In fact, the inability to impose sparsity is just a specific instance of a more general limitation: NMF does not provide a way to impose known or hypothesized structure about the data during estimation.

To elaborate, let us consider the example of sparsity. Several extensions of NMF have been proposed to incorporate sparsity (e.g., [26–28]). The general idea in these methods is to impose a cost function during estimation that incorporates an additional constraint quantifying the sparsity of the obtained factors. While sparsity is usually specified as the L0 norm of the derived factors [29], the actual constraints used consider an L1 norm, since the L0 norm is not amenable to optimization within a procedure that primarily attempts to minimize the L2 norm of the error between the original data and the approximation given by the estimated factors. In the probabilistic formulation, the relationship of the sparsity constraint to the actual objective function being optimized is more direct. We characterize sparsity through the entropy of the derived factors, as originally specified in [30]. A sparse code is defined as a set of basis vectors such that any given data point can be largely explained by only a few bases from the set, so that the required contribution of the rest of the bases to the data point is minimal; that is, the entropy of the mixture weights by which the bases are combined to explain the data point is low.

A sparse code can now be obtained by imposing the entropic prior over the mixture weights. For a given distribution θ, the entropic prior is defined as P(θ) ∝ e^{−βH(θ)}, where H(θ) is the entropy. Imposing this prior (with a positive β) on the mixture weights simply means that we obtain solutions in which mixture weights with low entropy are more likely to occur; a low entropy ensures that only a few entries of the vector are significant. Sparsity has been imposed in latent variable models by utilizing the entropic prior and has been shown to provide a better characterization of the data [17, 18, 23, 31]. Detailed derivations and estimation algorithms can be found in [17, 18]. Notice that priors can be imposed on any set of parameters during estimation.
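
The entropic prior itself is simple to evaluate. The toy function below computes the unnormalized prior weight e^{−βH(θ)} for a discrete distribution θ, which is enough to see why a positive β favors peaked, sparse mixture weights; the function name and the small ε guard are ours.

```python
import numpy as np

def entropic_prior_weight(theta, beta, eps=1e-12):
    """Unnormalized entropic prior P(theta) ∝ exp(-beta * H(theta)) for a discrete distribution theta."""
    theta = np.asarray(theta, dtype=float)
    H = -np.sum(theta * np.log(theta + eps))   # Shannon entropy of theta
    return np.exp(-beta * H)

# A peaked (sparse) distribution receives a larger prior weight than a flat one when beta > 0:
print(entropic_prior_weight([0.9, 0.05, 0.05], beta=1.0))  # roughly exp(-0.39) ~ 0.67
print(entropic_prior_weight([1/3, 1/3, 1/3], beta=1.0))    # roughly exp(-1.10) ~ 0.33
```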

Information theoretically, entropy is a measure of information content. One can consider the entropic prior as providing an explicit way to control the amount of “information content” desired in the components. We illustrate this idea using a simple shift-invariance case. Consider an image which is composed of scattered plus-sign characters. Upon analysis of that image, we would expect the kernel distribution to be a “+” and the impulse distribution to be a set of delta functions placing it appropriately in space. However, using the entropic prior we can move information from the kernel distribution to the impulse distribution or vice versa. We show the results of this analysis in Figure 5 for three cases: where no entropic prior is used (left panels), where it is used to make the impulse distribution sparse (middle panels), and where it is used to make the kernel sparse (right panels). In the left panels, information about the data is distributed in both the kernel (top) and the impulse distribution (bottom). In the other two cases, we were able to concentrate all the information either in the kernel or in the impulse distribution by making use of the entropic prior.

Other prior distributions that have been used in various contexts include the Dirichlet [8, 32] and log-normal [33] distributions, among others. The ability to utilize prior distributions during estimation provides a way to incorporate information known about the problem. More importantly, the probabilistic framework provides proven methods of statistical inference that one can employ for parameter estimation. We point out that these extensions work with all the generalizations presented in the previous sections.

4.4. Geometrical Interpretation

We also want to point out briefly that probabilistic models can sometimes provide insights that are helpful for an intuitive understanding of the workings of the model.

Consider the asymmetric factorization case of the latent variable model as given by (6). Let us refer to the normalized columns of the data matrix V (obtained by scaling the entries of every column to sum to 1), vn, as data distributions. It can be shown that learning the model is equivalent to estimating the parameters such that the model P(x1|x2) best approximates each data distribution vx2. Notice that the data distributions vx2, the model approximations P(x1|x2), and the components P(x1|z) are all M-dimensional vectors that sum


to unity, and hence are points in an (M − 1)-simplex. The model expresses P(x1|x2) as points within the convex hull formed by the components P(x1|z). Since it is constrained to lie within this convex hull, P(x1|x2) can model vx2 accurately only if the latter also lies within the convex hull. Thus, the objective of the model is to estimate the P(x1|z) as corners of a convex hull such that all the data distributions lie within it. This is illustrated in Figure 6 for a toy dataset of 400 three-dimensional data distributions.

Not all probabilistic formulations provide such a clean geometric interpretation, but in certain cases, as outlined above, it can lead to interpretations that are intuitively helpful.

5. Discussion and Conclusions

In this paper, we presented a family of latent variable models and showed their utility in the analysis of nonnegative data. We showed that the latent variable model decompositions are numerically identical to the NMF algorithm that optimizes a Kullback-Leibler metric. Unlike previously reported results [34], the proof of equivalence requires no assumption about the distribution of the data, or indeed any assumption about the data besides nonnegativity. The algorithms presented in this paper primarily compute a probabilistic factorization of nonnegative data that optimizes the KL distance between the factored approximation and the actual data. We argue that the use of this approach presents a much more straightforward way to build easily extensible models. (It is not clear that the approach can be extended to similarly derive factorizations that optimize other Bregman divergences such as the L2 metric; this is a topic for further investigation.)

To demonstrate this, we presented extensions that deal with tensorial data and shift invariances, and that use priors in the estimation. The purpose of this paper is not to highlight the use of these approaches, nor to present them thoroughly, but rather to demonstrate a methodology which allows easier experimentation with nonnegative data analysis and opens up possibilities for more stringent and probabilistic modeling than before. A rich variety of real-world applications and derivations of these and other models can be found in the references.

Acknowledgment

Madhusudana Shashanka acknowledges the support and helpful feedback received from Michael Giering at Mars, Inc.

References

[1] M. D. Plumbley and E. Oja, “A ‘nonnegative PCA’ algorithm for independent component analysis,” IEEE Transactions on Neural Networks, vol. 15, no. 1, pp. 66–76, 2004.

[2] M. D. Plumbley, “Geometrical methods for non-negative ICA: manifolds, Lie groups and toral subalgebras,” Neurocomputing, vol. 67, pp. 161–197, 2005.

[3] D. D. Lee and H. S. Seung, “Learning the parts of objects by non-negative matrix factorization,” Nature, vol. 401, no. 6755, pp. 788–791, 1999.

[4] M. Heiler and C. Schnorr, “Controlling sparseness in non-negative tensor factorization,” in Proceedings of the 9th European Conference on Computer Vision (ECCV ’06), vol. 3951, pp. 56–67, Graz, Austria, May 2006.

[5] A. Cichocki, R. Zdunek, S. Choi, R. Plemmons, and S. Amari, “Nonnegative tensor factorization using alpha and beta divergences,” in Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP ’07), Honolulu, Hawaii, USA, April 2007.

[6] A. Shashua and T. Hazan, “Non-negative tensor factorization with applications to statistics and computer vision,” in Proceedings of the 22nd International Conference on Machine Learning (ICML ’05), pp. 793–800, Bonn, Germany, August 2005.

[7] T. Hofmann, “Unsupervised learning by probabilistic latent semantic analysis,” Machine Learning, vol. 42, no. 1-2, pp. 177–196, 2001.

[8] D. Blei, A. Ng, and M. Jordan, “Latent Dirichlet allocation,” The Journal of Machine Learning Research, vol. 3, pp. 993–1022, 2003.

[9] P. Lazarsfeld and N. Henry, Latent Structure Analysis, Houghton Mifflin, Boston, Mass, USA, 1968.

[10] J. Rost and R. Langeheine, Eds., Applications of Latent Trait and Latent Class Models in the Social Sciences, Waxmann, New York, NY, USA, 1997.

[11] L. A. Goodman, “Exploratory latent structure analysis using both identifiable and unidentifiable models,” Biometrika, vol. 61, no. 2, pp. 215–231, 1974.

[12] D. Lee and H. Seung, “Algorithms for non-negative matrix factorization,” in Proceedings of the 14th Annual Conference on Advances in Neural Information Processing Systems (NIPS ’01), Vancouver, BC, Canada, December 2001.

[13] B. F. Green Jr., “Latent structure analysis and its relation to factor analysis,” Journal of the American Statistical Association, vol. 47, no. 257, pp. 71–76, 1952.

[14] P. Smaragdis and B. Raj, “Shift-invariant probabilistic latent component analysis,” to appear in Journal of Machine Learning Research.

[15] P. Smaragdis, B. Raj, and M. Shashanka, “Supervised and semi-supervised separation of sounds from single-channel mixtures,” in Proceedings of the 7th International Conference on Independent Component Analysis and Blind Signal Separation (ICA ’07), pp. 414–421, London, UK, September 2007.

[16] P. Smaragdis, B. Raj, and M. Shashanka, “A probabilistic latent variable model for acoustic modeling,” in Proceedings of the Advances in Models for Acoustic Processing Workshop (NIPS ’06), Whistler, BC, Canada, December 2006.

[17] M. Shashanka, B. Raj, and P. Smaragdis, “Sparse overcomplete latent variable decomposition of counts data,” in Proceedings of the 21st Annual Conference on Neural Information Processing Systems (NIPS ’07), Vancouver, BC, Canada, December 2007.

[18] M. Shashanka, “Latent variable framework for modeling and separating single-channel acoustic sources,” Ph.D. dissertation, Boston University, Boston, Mass, USA, 2007.

[19] E. Gaussier and C. Goutte, “Relation between PLSA and NMF and implications,” in Proceedings of the 28th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR ’05), pp. 601–602, Salvador, Brazil, August 2005.


[20] B. Raj and P. Smaragdis, “Latent variable decomposition of spectrograms for single channel speaker separation,” in Proceedings of the IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA ’05), pp. 17–20, New Paltz, NY, USA, October 2005.

[21] M. Shashanka, B. Raj, and P. Smaragdis, “Probabilistic latent variable model for sparse decompositions of non-negative data,” to appear in IEEE Transactions on Pattern Analysis and Machine Intelligence.

[22] M. Welling and M. Weber, “Positive tensor factorization,” Pattern Recognition Letters, vol. 22, no. 12, pp. 1255–1261, 2001.

[23] P. Smaragdis, B. Raj, and M. Shashanka, “Sparse and shift-invariant feature extraction from non-negative data,” in Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP ’08), Las Vegas, Nev, USA, March-April 2008.

[24] P. Smaragdis, “Non-negative matrix factor deconvolution; extraction of multiple sound sources from monophonic inputs,” in Proceedings of the 5th International Conference on Independent Component Analysis and Blind Signal Separation (ICA ’04), vol. 3195, pp. 494–499, Granada, Spain, September 2004.

[25] P. Smaragdis, “Convolutive speech bases and their application to supervised speech separation,” IEEE Transactions on Audio, Speech and Language Processing, vol. 15, no. 1, pp. 1–12, 2007.

[26] P. O. Hoyer, “Non-negative matrix factorization with sparseness constraints,” The Journal of Machine Learning Research, vol. 5, pp. 1457–1469, 2004.

[27] M. Mørup and M. Schmidt, “Sparse non-negative matrix factor 2-D deconvolution,” Tech. Rep., Technical University of Denmark, Lyngby, Denmark, 2006.

[28] J. Eggert and E. Korner, “Sparse coding and NMF,” in Proceedings of the IEEE International Joint Conference on Neural Networks (IJCNN ’04), vol. 4, pp. 2529–2533, Budapest, Hungary, July 2004.

[29] D. Donoho, “For most large underdetermined systems of linear equations the minimal l1-norm solution is also the sparsest solution,” Communications on Pure and Applied Mathematics, vol. 59, no. 7, pp. 903–934, 2006.

[30] B. A. Olshausen and D. J. Field, “Emergence of simple-cell receptive field properties by learning a sparse code for natural images,” Nature, vol. 381, no. 6583, pp. 607–609, 1996.

[31] M. V. S. Shashanka, B. Raj, and P. Smaragdis, “Sparse overcomplete decomposition for single channel speaker separation,” in Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP ’07), vol. 2, pp. 641–644, Honolulu, Hawaii, USA, April 2007.

[32] B. Raj, M. V. S. Shashanka, and P. Smaragdis, “Latent Dirichlet decomposition for single channel speaker separation,” in Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP ’06), vol. 5, Toulouse, France, May 2006.

[33] D. Blei and J. Lafferty, “Correlated topic models,” in Proceedings of the 20th Annual Conference on Neural Information Processing Systems (NIPS ’06), Vancouver, BC, Canada, December 2006.

[34] J. Canny, “GaP: a factor model for discrete data,” in Proceedings of the 27th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR ’04), Sheffield, UK, July 2004.


Hindawi Publishing Corporation
Computational Intelligence and Neuroscience
Volume 2008, Article ID 939567, 13 pages
doi:10.1155/2008/939567

Research Article

Fast Nonnegative Matrix Factorization Algorithms Using Projected Gradient Approaches for Large-Scale Problems

Rafal Zdunek1 and Andrzej Cichocki2, 3, 4

1 Institute of Telecommunications, Teleinformatics and Acoustics, Wroclaw University of Technology, Wybrzeze Wyspianskiego 27, 50-370 Wroclaw, Poland
2 Laboratory for Advanced Brain Signal Processing, Brain Science Institute RIKEN, Wako-shi, Saitama 351-0198, Japan
3 Institute of Theory of Electrical Engineering, Measurement and Information Systems, Faculty of Electrical Engineering, Warsaw University of Technology, 00-661 Warsaw, Poland
4 Systems Research Institute, Polish Academy of Science (PAN), 01-447 Warsaw, Poland

Correspondence should be addressed to Rafal Zdunek, [email protected]

Received 15 January 2008; Revised 18 April 2008; Accepted 22 May 2008

Recommended by Wenwu Wang

Recently, a considerable growth of interest in projected gradient (PG) methods has been observed due to their high efficiency in solving large-scale convex minimization problems subject to linear constraints. Since the minimization problems underlying nonnegative matrix factorization (NMF) of large matrices match this class of minimization problems well, we investigate and test some recent PG methods in the context of their applicability to NMF. In particular, the paper focuses on the following modified methods: projected Landweber, Barzilai-Borwein gradient projection, projected sequential subspace optimization (PSESOP), interior-point Newton (IPN), and sequential coordinate-wise. The proposed and implemented NMF PG algorithms are compared with respect to their performance in terms of signal-to-interference ratio (SIR) and elapsed time, using a simple benchmark of mixed, partially dependent nonnegative signals.

Copyright © 2008 R. Zdunek and A. Cichocki. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

1. Introduction and Problem Statement

Nonnegative matrix factorization (NMF) finds nonnegative factors (matrices) A = [a_{ij}] ∈ R^{I×J} and X = [x_{jt}] ∈ R^{J×T}, with a_{ij} ≥ 0 and x_{jt} ≥ 0, such that Y ≅ AX, given the observation matrix Y = [y_{it}] ∈ R^{I×T}, the lower rank J, and possibly other statistical information on the observed data or the factors to be estimated.

This method has found a variety of real-world applications in areas such as blind separation of images and nonnegative signals [1–6], spectra recovering [7–10], pattern recognition and feature extraction [11–16], dimensionality reduction, segmentation and clustering [17–32], language modeling, text mining [25, 33], music transcription [34], and neurobiology (gene separation) [35, 36].

Depending on the application, the estimated factors may have different interpretations. For example, Lee and Seung [11] introduced NMF as a method for decomposing an image (face) into parts-based representations (parts reminiscent of features such as lips, eyes, nose, etc.). In blind source separation (BSS) [1, 37, 38], the matrix Y represents the observed mixed (superposed) signals or images, A is a mixing operator, and X is a matrix of true source signals or images. Each row of Y or X is a signal or 1D image representation, where I is the number of observed mixed signals and J is the number of hidden (source) components. The index t usually denotes a sample (discrete time instant), where T is the number of available samples. In BSS, we usually have T ≫ I ≥ J, and J is known or can be relatively easily estimated using SVD or PCA.

Our objective is to estimate the mixing matrix A andsources X subject to nonnegativity constraints of all theentries, given Y and possibly the knowledge on a statisticaldistribution of noisy disturbances.

Obviously, NMF is not unique in general case, and it ischaracterized by a scale and permutation indeterminacies.


These problems have been addressed recently by many researchers [39-42], and they will not be discussed in this paper. However, we have shown by extensive computer simulations that, in practice, with overwhelming probability we are able to achieve a unique nonnegative factorization (neglecting unavoidable scaling and permutation ambiguities) if the data are sufficiently sparse and a suitable NMF algorithm is applied [43]. This is consistent with very recent theoretical results [40].

The noise distribution is strongly application-dependent; however, in many BSS applications Gaussian noise is expected. Here our considerations are restricted to this case; alternative NMF algorithms optimized for different noise distributions exist and can be found, for example, in [37, 44, 45].

NMF was proposed by Paatero and Tapper [46, 47], but Lee and Seung [11, 48] popularized this method by using simple multiplicative algorithms to perform NMF. An extensive study on the convergence of multiplicative algorithms for NMF can be found in [49]. In general, the multiplicative algorithms are known to converge very slowly for large-scale problems. For this reason, there is a need to search for more efficient and faster algorithms for NMF. Many approaches have been proposed in the literature to relax these problems. One of them is to apply projected gradient (PG) algorithms [50-53] or projected alternating least-squares (ALS) algorithms [33, 54, 55] instead of multiplicative ones. Lin [52] suggested applying the Armijo rule to estimate the learning parameters in projected gradient updates for NMF. The NMF PG algorithms proposed by us in [53] also address the issue of selecting a learning parameter that gives the steepest descent while keeping some distance to the boundary of the nonnegative orthant (subspace of real nonnegative numbers). Another very robust technique exploits the information from the second-order Taylor expansion term of a cost function to speed up convergence. This approach was proposed in [45, 56], where the mixing matrix A is updated with the projected Newton method, and the sources in X are computed with the projected least-squares method (the fixed point algorithm).

In this paper, we extend the approach to NMF that we initiated in [53]. We have investigated, extended, and tested several recently proposed PG algorithms such as the oblique projected Landweber [57], Barzilai-Borwein gradient projection [58], projected sequential subspace optimization [59, 60], interior-point Newton [61], and sequential coordinate-wise [62]. All the algorithms presented in this paper are quite efficient for solving large-scale minimization problems subject to nonnegativity and sparsity constraints.

The main objective of this paper is to develop, extend, and/or modify some of the most promising PG algorithms for the standard NMF problem and to find optimal conditions or parameters for this class of NMF algorithms. The second objective is to compare the performance and complexity of these algorithms for NMF problems, and to identify the most efficient and promising algorithms. We would like to emphasize that most of the discussed algorithms have not been implemented, used, or even tested before for NMF problems; rather, they have been considered for solving a standard system of algebraic equations $A x^{(k)} = y^{(k)}$ (for only $k = 1$), where the matrix $A$ and the vectors $y$ are known. In this paper, we consider a much more difficult and complicated problem in which we have two sets of parameters and additional constraints of nonnegativity and/or sparsity. So it was not clear until now whether such algorithms would work efficiently for NMF problems, and if so, which kind of projected algorithm is the most efficient. To the best of our knowledge, only the Lin-PG NMF algorithm is widely used and known for NMF problems. We demonstrate experimentally that there are several novel PG algorithms which are much more efficient and consistent than the Lin-PG algorithm.

In Section 2, we briefly discuss the PG approach to NMF. Section 3 describes the tested algorithms. The experimental results are illustrated in Section 4. Finally, some conclusions are given in Section 5.

2. Projected Gradient Algorithms

In contrast to the multiplicative algorithms, the class of PG algorithms has additive updates. The algorithms discussed here approximately solve nonnegative least squares (NNLS) problems with the basic alternating minimization technique that is used in NMF:

$$\min_{x_t \ge 0} D_F(y_t \| A x_t) = \frac{1}{2}\|y_t - A x_t\|_2^2, \quad t = 1,\ldots,T, \qquad (1)$$

$$\min_{\bar a_i \ge 0} D_F(\bar y_i \| X^T \bar a_i) = \frac{1}{2}\|\bar y_i - X^T \bar a_i\|_2^2, \quad i = 1,\ldots,I, \qquad (2)$$

or in the equivalent matrix form

$$\min_{x_{jt} \ge 0} D_F(Y\|AX) = \frac{1}{2}\|Y - AX\|_F^2, \qquad (3)$$

$$\min_{a_{ij} \ge 0} D_F(Y^T\|X^T A^T) = \frac{1}{2}\|Y^T - X^T A^T\|_F^2, \qquad (4)$$

where $A = [a_1,\ldots,a_J] \in \mathbb{R}^{I\times J}$, $A^T = [\bar a_1,\ldots,\bar a_I] \in \mathbb{R}^{J\times I}$, $X = [x_1,\ldots,x_T] \in \mathbb{R}^{J\times T}$, $X^T = [\bar x_1,\ldots,\bar x_J] \in \mathbb{R}^{T\times J}$, $Y = [y_1,\ldots,y_T] \in \mathbb{R}^{I\times T}$, $Y^T = [\bar y_1,\ldots,\bar y_I] \in \mathbb{R}^{T\times I}$, and usually $I \ge J$. The matrix $A$ is assumed to be a full-rank matrix, so there exists a unique solution $X^* \in \mathbb{R}^{J\times T}$ for any matrix $Y$ since the NNLS problem is strictly convex (with respect to one set of variables $\{X\}$).

The solution $x_t^*$ to (1) satisfies the Karush-Kuhn-Tucker (KKT) conditions:

$$x_t^* \ge 0, \quad g_X(x_t^*) \ge 0, \quad g_X(x_t^*)^T x_t^* = 0, \qquad (5)$$

or in the matrix notation

$$X^* \ge 0, \quad G_X(X^*) \ge 0, \quad \operatorname{tr}\{G_X(X^*)^T X^*\} = 0, \qquad (6)$$

where $g_X$ and $G_X$ are the corresponding gradient vector and gradient matrix:

$$g_X(x_t) = \nabla_{x_t} D_F(y_t\|A x_t) = A^T(A x_t - y_t), \qquad (7)$$

$$G_X(X) = \nabla_X D_F(Y\|AX) = A^T(AX - Y). \qquad (8)$$
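
To make the role of these optimality conditions concrete, the following minimal Matlab sketch (not taken from the paper; the variable names and the tolerance are illustrative) monitors the three KKT residuals of (6) for the Frobenius-norm subproblem:

% Minimal sketch (not from the paper): KKT residuals of (6) for the subproblem
% min_X 0.5*||Y - A*X||_F^2 subject to X >= 0; the tolerance is illustrative.
GX = A'*(A*X - Y);                     % gradient matrix, cf. (8)
r_feas = norm(min(X, 0), 'fro');       % primal feasibility:  X >= 0
r_grad = norm(min(GX, 0), 'fro');      % dual feasibility:    G_X(X) >= 0
r_comp = abs(sum(sum(GX.*X)));         % complementarity:     tr{G_X(X)^T X} = 0
kkt_satisfied = max([r_feas, r_grad, r_comp]) < 1e-6;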


Similarly, the KKT conditions for the solution $\bar a_i^*$ to (2), and the solution $A^*$ to (4), are as follows:

$$\bar a_i^* \ge 0, \quad g_A(\bar a_i^*) \ge 0, \quad g_A(\bar a_i^*)^T \bar a_i^* = 0, \qquad (9)$$

or in the matrix notation:

$$A^* \ge 0, \quad G_A(A^*) \ge 0, \quad \operatorname{tr}\{A^* G_A(A^*)^T\} = 0, \qquad (10)$$

where $g_A$ and $G_A$ are the gradient vector and gradient matrix of the objective function:

$$g_A(\bar a_i) = \nabla_{\bar a_i} D_F(\bar y_i\|X^T \bar a_i) = X(X^T \bar a_i - \bar y_i),$$
$$G_A(A) = \nabla_A D_F(Y^T\|X^T A^T) = (AX - Y)X^T. \qquad (11)$$

There are many approaches to solve the problems (1) and (2), or equivalently (3) and (4). In this paper, we discuss selected projected gradient methods that can be generally expressed by the iterative updates:

$$X^{(k+1)} = P_\Omega\big[X^{(k)} - \eta_X^{(k)} P_X^{(k)}\big],$$
$$A^{(k+1)} = P_\Omega\big[A^{(k)} - \eta_A^{(k)} P_A^{(k)}\big], \qquad (12)$$

where $P_\Omega[\xi]$ is a projection of $\xi$ onto the convex feasible set $\Omega = \{\xi \in \mathbb{R} : \xi \ge 0\}$, namely, the nonnegative orthant $\mathbb{R}_+$ (the subspace of nonnegative real numbers), $P_X^{(k)}$ and $P_A^{(k)}$ are descent directions for $X$ and $A$, and $\eta_X^{(k)}$ and $\eta_A^{(k)}$ are the corresponding learning rates (step sizes), respectively.

The projection $P_\Omega[\xi]$ can be performed in several ways. One of the simplest techniques is to replace all negative entries in $\xi$ by zero values, or in practical cases, by a small positive number $\varepsilon$ to avoid numerical instabilities. Thus,

$$P_\Omega[\xi] = \max\{\varepsilon, \xi\}. \qquad (13)$$

However, this is not the only way to carry out the projection $P_\Omega[\xi]$. It is typically more efficient to choose the learning rates $\eta_X^{(k)}$ and $\eta_A^{(k)}$ so as to preserve nonnegativity of the solutions. The nonnegativity can also be maintained by solving least-squares problems subject to the constraints (6) and (10). Here we present exemplary PG methods that work quite efficiently for NMF problems; we have implemented them in our Matlab toolbox NMFLAB/NTFLAB for signal and image processing [43]. For simplicity, we focus our considerations on updating the matrix $X$, assuming that the updates for $A$ can be obtained in a similar way. Note that the updates for $A$ can be readily obtained by solving the transposed system $X^T A^T = Y^T$, with $X$ fixed (updated in the previous step).
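
As a concrete illustration of the generic scheme (12) with the "half-rectified" projection (13), the following Matlab sketch alternates one projected gradient step on X and one on A; the fixed step sizes and the iteration count are placeholders chosen for illustration, not values recommended by the paper:

% Minimal sketch of the alternating PG scheme (12) with projection (13).
% The fixed step sizes and the iteration count are illustrative placeholders.
eta_X = 1e-2;  eta_A = 1e-2;
for step = 1:100
    GX = A'*(A*X - Y);              % gradient with respect to X, cf. (8)
    X  = max(eps, X - eta_X*GX);    % projected update of X, cf. (12)-(13)
    GA = (A*X - Y)*X';              % gradient with respect to A, cf. (11)
    A  = max(eps, A - eta_A*GA);    % projected update of A
end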

3. Algorithms

3.1. Oblique Projected Landweber Method

The Landweber method [63] performs gradient-descent minimization with the following iterative scheme:

$$X^{(k+1)} = X^{(k)} - \eta\, G_X^{(k)}, \qquad (14)$$

Set A, X,                                  % Initialization
η_j^(X) = 2 / (A^T A 1_J)_j,               % step sizes, cf. (16)
For k = 1, 2, ...                          % Inner iterations for X
    G_X = A^T (A X − Y),                   % Gradient with respect to X
    X ← P_Ω[ X − diag{η_j} G_X ],          % Updating
End

Algorithm 1: (OPL).

where the gradient is given by (8), and the learning rate $\eta \in (0, \eta_{\max})$. The updating formula ensures asymptotic convergence to the minimal-norm least squares solution for the convergence radius defined by

$$\eta_{\max} = \frac{2}{\lambda_{\max}(A^T A)}, \qquad (15)$$

where $\lambda_{\max}(A^T A)$ is the maximal eigenvalue of $A^T A$. Since $A$ is a nonnegative matrix, we have $\lambda_{\max}(A^T A) \le \max_j [A^T A \mathbf{1}_J]_j$, where $\mathbf{1}_J = [1,\ldots,1]^T \in \mathbb{R}^J$. Thus the modified Landweber iterations can be expressed by the formula

$$X^{(k+1)} = P_\Omega\Big[X^{(k)} - \operatorname{diag}\{\eta_j\}\, G_X^{(k)}\Big], \quad \text{where } \eta_j = \frac{2}{(A^T A \mathbf{1}_J)_j}. \qquad (16)$$

In the oblique projected Landweber (OPL) method [57], which can be regarded as a particular case of the PG iterative formula (12), the solution obtained with (14) in each iterative step is projected onto the feasible set. Finally, we have Algorithm 1 for updating X.
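
A compact Matlab rendering of Algorithm 1 might look as follows; the function name nmf_opl and the calling convention are illustrative assumptions rather than code from the paper:

% Oblique projected Landweber (OPL) update for X, following Algorithm 1.
% The function name and calling convention are illustrative only.
function X = nmf_opl(A, Y, X, no_iter)
eta = 2 ./ (A'*A*ones(size(A,2),1));        % eta_j = 2 / (A'A 1_J)_j, cf. (16)
for k = 1:no_iter
    GX = A'*(A*X - Y);                      % gradient with respect to X
    X  = max(eps, X - diag(eta)*GX);        % oblique projected Landweber step
end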

3.2. Projected Gradient

One of the fundamental PG algorithms for NMF was proposed by Lin in [52]. This algorithm, which we refer to as the Lin-PG, uses the Armijo rule along the projection arc to determine the step lengths $\eta_X^{(k)}$ and $\eta_A^{(k)}$ in the iterative updates (12). For the cost function being the squared Euclidean distance, $P_X = (A^{(k)})^T(A^{(k)} X^{(k)} - Y)$ and $P_A = (A^{(k)} X^{(k+1)} - Y)(X^{(k+1)})^T$.

For the computation of $X$, the step size is chosen as

$$\eta_X^{(k)} = \beta^{m_k}, \qquad (17)$$

where $m_k$ is the first nonnegative integer $m$ that satisfies

$$D_F\big(Y\|A X^{(k+1)}\big) - D_F\big(Y\|A X^{(k)}\big) \le \sigma \operatorname{tr}\Big[\nabla_X D_F\big(Y\|A X^{(k)}\big)^T\big(X^{(k+1)} - X^{(k)}\big)\Big]. \qquad (18)$$

The parameters $\beta \in (0,1)$ and $\sigma \in (0,1)$ decide about the convergence speed. In this algorithm, we set $\sigma = 0.01$ and $\beta = 0.1$ experimentally as defaults.

The Matlab implementation of the Lin-PG algorithm is given in [52].
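
The fragment below sketches the Armijo rule along the projection arc of (17)-(18) for the X update; it is an illustrative reimplementation (using the default β and σ quoted above and an assumed cap of 20 trials), not Lin's reference code:

% Sketch of the Armijo rule along the projection arc for updating X, cf. (17)-(18).
% Illustrative only; for the reference implementation see Lin [52].
beta = 0.1; sigma = 0.01; eta = 1;
GX = A'*(A*X - Y);                        % gradient at X, cf. (8)
D0 = 0.5*norm(Y - A*X,'fro')^2;
for m = 0:20                              % try eta = beta^m
    Xn = max(eps, X - eta*GX);            % projected trial point
    Dn = 0.5*norm(Y - A*Xn,'fro')^2;
    if Dn - D0 <= sigma*sum(sum(GX.*(Xn - X)))
        break;                            % sufficient decrease achieved
    end
    eta = beta*eta;
end
X = Xn;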


Set A, X, α_min, α_max, α^(0) ∈ [α_min, α_max], α^(0) ∈ R^T     % Initialization
For k = 1, 2, ...                                               % Inner iterations
    Δ^(k) = P_Ω[ X^(k) − α^(k) ∇_X D_F(Y||A X^(k)) ] − X^(k),
    λ^(k) = arg min_{λ_t ∈ [0,1]} D_F( Y || A (X^(k) + Δ^(k) diag{λ}) ),
        where λ = [λ_t] ∈ R^T,
    X^(k+1) = X^(k) + Δ^(k) diag{λ^(k)},
    γ^(k) = diag{ (Δ^(k))^T A^T A Δ^(k) },
    If γ_t^(k) = 0:  α_t^(k+1) = α_max,
    Else:  α_t^(k+1) = min{ α_max, max{ α_min, [(Δ^(k))^T Δ^(k)]_tt / γ_t^(k) } },
End                                                             % Inner iterations

Algorithm 2: (GPSR-BB).

3.3. Barzilai-Borwein Gradient Projection

The Barzilai-Borwein gradient projection method [58, 64] is motivated by the quasi-Newton approach, that is, the inverse of the Hessian is replaced with a scaled identity matrix $H_k = \alpha_k I$, where the scalar $\alpha_k$ is selected so that the inverse Hessian has similar behavior to the true Hessian over the most recent iteration. Thus,

$$X^{(k+1)} - X^{(k)} \approx \alpha_k\Big(\nabla_X D\big(Y\|A^{(k)} X^{(k+1)}\big) - \nabla_X D\big(Y\|A^{(k)} X^{(k)}\big)\Big). \qquad (19)$$

In comparison to, for example, Lin's method [52], this method does not ensure that the objective function decreases at every iteration, but its general convergence has been proven analytically [58]. The general scheme of the Barzilai-Borwein gradient projection algorithm for updating X is given in Algorithm 2.

Since $D_F(Y\|AX)$ is a quadratic function, the line search parameter $\lambda^{(k)}$ can be derived in the following closed form:

$$\lambda^{(k)} = \max\Bigg\{0,\ \min\Bigg\{1,\ -\frac{\operatorname{diag}\big\{(\Delta^{(k)})^T \nabla_X D_F(Y\|AX^{(k)})\big\}}{\operatorname{diag}\big\{(\Delta^{(k)})^T A^T A\, \Delta^{(k)}\big\}}\Bigg\}\Bigg\}, \qquad (20)$$

where the division and the max/min operations are performed entrywise on the diagonal elements.

The Matlab implementation of the GPSR-BB algorithm, which solves the system AX = Y for multiple measurement vectors subject to nonnegativity constraints, is given in Algorithm 4 (see also NMFLAB).
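
One possible way to embed this inner solver in the alternating NMF scheme of Section 2 is sketched below; the outer-loop structure, the inner iteration counts, and the optional column normalization of A are illustrative assumptions:

% Illustrative outer loop: alternate GPSR-BB updates of X and A (via the
% transposed system X^T A^T = Y^T, as noted in Section 2). Counts are placeholders.
for step = 1:100
    X  = nmf_gpsr_bb(A, Y, X, 5);          % inner iterations for X
    At = nmf_gpsr_bb(X', Y', A', 5);       % inner iterations for A^T
    A  = At';
    A  = A*diag(1./max(sum(A,1), eps));    % optional column normalization of A
end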

3.4. Projected Sequential Subspace Optimization

The projected sequential subspace optimization (PSESOP) method [59, 60] carries out a projected minimization of a smooth objective function over a subspace spanned by several directions, which include the current gradient, gradients from the previous iterations, and the Nemirovski directions. Nemirovski [65] suggested that convex smooth unconstrained optimization is optimal if the optimization is performed over a subspace which includes the current gradient $g(x)$, the direction $d_1^{(k)} = x^{(k)} - x^{(0)}$, and the linear combination of the previous gradients $d_2^{(k)} = \sum_{n=0}^{k-1} w_n g(x^{(n)})$ with the coefficients $w_n$, $n = 0,\ldots,k-1$. The directions should be orthogonal to the current gradient. Narkiss and Zibulevsky [59] also suggested including another direction, $p^{(k)} = x^{(k)} - x^{(k-1)}$, which is motivated by a natural extension of the conjugate gradient (CG) method to a nonquadratic case. However, our practical observations showed that this direction does not have a strong impact on the NMF components, thus we neglected it in our NMF-PSESOP algorithm. Finally, we have Algorithm 3 for updating $x_t$, which is a single column vector of $X$.

Set A, x_t^(0), p                                   % Initialization
For k = 1, 2, ...                                   % Inner iterations
    d_1^(k) = x^(k) − x^(0),
    g^(k) = ∇_{x_t} D_F(y_t || A x_t),
    G^(p) = [g^(k−1), g^(k−2), ..., g^(k−p)] ∈ R^{J×p},
    w_k = 1                            if k = 1,
    w_k = 1/2 + sqrt(1/4 + w_{k−1}^2)  if k > 1,
    w^(k) = [w_k, w_{k−1}, ..., w_{k−p+1}]^T ∈ R^p,
    d_2^(k) = G^(p) w^(k),
    D^(k) = [d_1^(k), d_2^(k), g^(k), G^(p)],
    α_*^(k) = arg min_α D_F(y_t || A(x_t^(k) + D^(k) α)),
    x^(k+1) = P_Ω[ x^(k) + D^(k) α_*^(k) ]
End                                                 % Inner iterations

Algorithm 3: (NMF-PSESOP).

The parameter p denotes the number of previous iterates that are taken into account to determine the current update.

The line search vector $\alpha_*^{(k)}$, derived in closed form for the objective function $D_F(y_t\|A x_t)$, is as follows:

$$\alpha_*^{(k)} = -\Big((D^{(k)})^T A^T A\, D^{(k)} + \lambda I\Big)^{-1} (D^{(k)})^T \nabla_{x_t} D_F\big(y_t\|A x_t\big). \qquad (21)$$

The regularization parameter $\lambda$ can be set to a very small constant to avoid instabilities when inverting a rank-deficient matrix in the case that $D^{(k)}$ has zero-valued or dependent columns.
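
A compact Matlab sketch of the resulting NMF-PSESOP update of a single column x of X, following Algorithm 3 and (21), is given below; the function name, the value of the regularization constant, and the handling of the first iterations are illustrative assumptions:

% NMF-PSESOP update of one column x of X (Algorithm 3 with the closed-form
% step (21)). Function name, lambda, and start-up handling are illustrative.
function x = nmf_psesop(A, y, x, no_iter, p)
lambda = 1e-12;  x0 = x;  H = A'*A;  b = A'*y;
Gp = [];  w = [];  wk = 1;
for k = 1:no_iter
    if k > 1, wk = 0.5 + sqrt(0.25 + wk^2); end
    w  = [wk, w];  w = w(1:min(p, length(w)));   % weights w_k, ..., w_{k-p+1}
    g  = H*x - b;                                % current gradient
    d1 = x - x0;                                 % direction d_1^(k)
    if isempty(Gp)
        d2 = zeros(size(x));
    else
        d2 = Gp*w(1:size(Gp,2))';                % d_2^(k) = G^(p) w^(k)
    end
    D  = [d1, d2, g, Gp];                        % subspace of search directions
    alpha = -(D'*H*D + lambda*eye(size(D,2))) \ (D'*g);   % cf. (21)
    x  = max(eps, x + D*alpha);                  % projected update
    Gp = [g, Gp];  Gp = Gp(:, 1:min(p, size(Gp,2)));  % keep p previous gradients
end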

3.5. Interior Point Newton Algorithm

The interior point Newton (IPN) algorithm [61] solves the NNLS problem (1) by searching for a solution satisfying the KKT conditions (5), which can equivalently be expressed by the nonlinear equations

$$D(x_t)\, g(x_t) = 0, \qquad (22)$$

where $D(x_t) = \operatorname{diag}\{d_1(x_t),\ldots,d_J(x_t)\}$, $x_t \ge 0$, and

$$d_j(x_t) = \begin{cases} x_{jt} & \text{if } g_j(x_t) \ge 0, \\ 1 & \text{otherwise.} \end{cases} \qquad (23)$$


% Barzilai-Borwein gradient projection (GPSR-BB) algorithm
%
function [X] = nmf_gpsr_bb(A,Y,X,no_iter)
%
% [X] = nmf_gpsr_bb(A,Y,X,no_iter) finds such matrix X that solves
% the equation AX = Y subject to nonnegativity constraints.
%
% INPUTS:
% A - system matrix of dimension [I by J]
% Y - matrix of observations [I by T]
% X - matrix of initial guess [J by T]
% no_iter - maximum number of iterations
%
% OUTPUTS:
% X - matrix of estimated sources [J by T]
%
% #########################################################################
% Parameters
alpha_min = 1E-8; alpha_max = 1;
alpha = .1*ones(1,size(Y,2));
B = A'*A; Yt = A'*Y;

for k=1:no_iter
    G = B*X - Yt;
    delta = max(eps, X - repmat(alpha,size(G,1),1).*G) - X;
    deltaB = B*delta;
    lambda = max(0, min(1, -sum(delta.*G,1)./(sum(delta.*deltaB,1) + eps)));
    X = max(eps, X + delta.*repmat(lambda,size(delta,1),1));
    gamma = sum(delta.*deltaB,1) + eps;
    if gamma
        alpha = min(alpha_max, max(alpha_min, sum(delta.^2,1)./gamma));
    else
        alpha = alpha_max;
    end
end

Algorithm 4

Applying the Newton method to (22), we have in the $k$th iterative step

$$\Big(D_k(x_t)\, A^T A + E_k(x_t)\Big)\, p_k = -D_k(x_t)\, g_k(x_t), \qquad (24)$$

where

$$E_k(x_t) = \operatorname{diag}\{e_1(x_t),\ldots,e_J(x_t)\}. \qquad (25)$$

In [61], the entries of the matrix $E_k(x_t)$ are defined by

$$e_j(x_t) = \begin{cases} g_j(x_t) & \text{if } 0 \le g_j(x_t) < x_{jt}^{\gamma}, \text{ or } \big(g_j(x_t)\big)^{\gamma} > x_{jt}, \\ 1 & \text{otherwise,} \end{cases} \qquad (26)$$

for $1 < \gamma \le 2$. If the solution is degenerate, that is, for some $t \in \{1,\ldots,T\}$ there exists $j$ such that $x_{jt}^* = 0$ and $g_j(x_t^*) = 0$, the matrix $D_k(x_t) A^T A + E_k(x_t)$ may be singular. To avoid such a case, the system of equations has been rescaled to the following form:

$$W_k(x_t)\, D_k(x_t)\, M_k(x_t)\, p_k = -W_k(x_t)\, D_k(x_t)\, g_k(x_t) \qquad (27)$$

with

$$M_k(x_t) = A^T A + D_k(x_t)^{-1} E_k(x_t),$$
$$W_k(x_t) = \operatorname{diag}\{w_1(x_t),\ldots,w_J(x_t)\},$$
$$w_j(x_t) = \big(d_j(x_t) + e_j(x_t)\big)^{-1}, \qquad (28)$$

for $x_t > 0$. In [61], the system (27) is solved with an inexact Newton method, which leads to the following updates:

$$W_k(x_t)\, D_k(x_t)\, M_k(x_t)\, p_k = -W_k(x_t)\, D_k(x_t)\, g_k(x_t) + r_k(x_t), \qquad (29)$$

$$\hat{p}_k = \max\Big\{\sigma,\ 1 - \big\|P_\Omega\big[x_t^{(k)} + p_k\big] - x_t^{(k)}\big\|_2\Big\}\Big(P_\Omega\big[x_t^{(k)} + p_k\big] - x_t^{(k)}\Big), \qquad (30)$$

$$x_t^{(k+1)} = x_t^{(k)} + \hat{p}_k, \qquad (31)$$

where $\sigma \in (0,1)$, $r_k(x_t) = A^T(A x_t - y_t)$, and $P_\Omega[\cdot]$ is a projection onto the feasible set $\Omega$.


The transformation of the normal matrix $A^T A$ by the matrix $W_k(x_t) D_k(x_t)$ in (27) means that the system matrix $W_k(x_t) D_k(x_t) M_k(x_t)$ is no longer symmetric and positive-definite. There are many methods for handling such systems of linear equations, like QMR [66], BiCG [67, 68], BiCGSTAB [69], or GMRES-like methods [70]; however, they are more complicated and computationally demanding than, for example, the basic CG algorithm [71]. To apply the CG algorithm, the system matrix in (27) must be converted to a positive-definite symmetric matrix, which can easily be done with normal equations. Methods like CGLS [72] or LSQR [73] are therefore suitable for such tasks. The transformed system has the form

$$Z_k(x_t)\, \bar{p}_k = -S_k(x_t)\, g_k(x_t) + \bar{r}_k(x_t), \qquad (32)$$

$$S_k(x_t) = \sqrt{W_k(x_t)\, D_k(x_t)}, \qquad (33)$$

$$Z_k(x_t) = S_k(x_t)\, M_k(x_t)\, S_k(x_t) = S_k(x_t)\, A^T A\, S_k(x_t) + W_k(x_t)\, E_k(x_t), \qquad (34)$$

with $\bar{p}_k = S_k(x_t)^{-1} p_k$ and $\bar{r}_k = S_k(x_t)^{-1} r_k(x_t)$.

in a single step is performed with combining the projectedNewton step with the constrained scaled Cauchy step that isgiven in the form

p(C)k = −τkDk

(xt)

gk(

xt), τk > 0. (35)

Assuming x(k)t + p(C)

k > 0, τk is chosen as being eitherthe unconstrained minimizer of the quadratic functionψk(−τkDk(xt)gk(xt)) or a scalar proportional to the distanceto the boundary along −Dk(xt)gk(xt), where

ψk(p) = 12

pTMk(

xt)

p + pTgk(

xt)

= 12

pT(

ATA+D−1k

(xt)

Ek(

xt))

p+pTAT(

Ax(k)t −yt

).

(36)

Thus

$$\tau_k = \begin{cases}
\tau_1 = \arg\min_{\tau}\, \psi_k\big(-\tau D_k(x_t) g_k(x_t)\big) & \text{if } x_t^{(k)} - \tau_1 D_k(x_t) g_k(x_t) > 0, \\[4pt]
\tau_2 = \theta\, \min_j \Big\{ \dfrac{x_{jt}^{(k)}}{\big(D_k(x_t) g_k(x_t)\big)_j} : \big(D_k(x_t) g_k(x_t)\big)_j > 0 \Big\} & \text{otherwise,}
\end{cases} \qquad (37)$$

where $\tau_1 = \dfrac{\big(g_k(x_t)\big)^T D_k(x_t)\, g_k(x_t)}{\big(D_k(x_t) g_k(x_t)\big)^T M_k(x_t)\, D_k(x_t) g_k(x_t)}$ and $\theta \in (0,1)$. For $\psi_k\big(p_k^{(C)}\big) < 0$, global convergence is achieved if $\operatorname{red}\big(x_t^{(k+1)} - x_t^{(k)}\big) \ge \beta$, $\beta \in (0,1)$, with

$$\operatorname{red}(p) = \frac{\psi_k(p)}{\psi_k\big(p_k^{(C)}\big)}. \qquad (38)$$

The usage of the constrained scaled Cauchy step leads to the following updates:

$$s_t^{(k)} = t\big(p_k^{(C)} - \hat{p}_k\big) + \hat{p}_k, \qquad x_t^{(k+1)} = x_t^{(k)} + s_t^{(k)}, \qquad (39)$$

with $t \in [0,1)$, where $\hat{p}_k$ and $p_k^{(C)}$ are given by (30) and (35), respectively, and $t$ is the smaller root (lying in $(0,1)$) of the quadratic equation

$$\pi(t) = \psi_k\big(t(p_k^{(C)} - \hat{p}_k) + \hat{p}_k\big) - \beta\, \psi_k\big(p_k^{(C)}\big) = 0. \qquad (40)$$

The Matlab code of the IPN algorithm, which solves the system $A x_t = y_t$ subject to nonnegativity constraints, is given in Algorithm 5. To solve the transformed system (32), we use the LSQR method implemented in Matlab 7.0.
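
To update the whole matrix X with this routine, one can simply sweep over the columns of Y, as in the illustrative usage sketch below (the number of inner iterations is a placeholder):

% Illustrative usage of the IPN routine (Algorithm 5): update X column by column.
for t = 1:size(Y,2)
    X(:,t) = nmf_ipn(A, Y(:,t), X(:,t), 5);   % 5 inner iterations per column (placeholder)
end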

3.6. Sequential Coordinate-Wise Algorithm

The NNLS problem (1) can be expressed in terms of the following quadratic problem (QP) [62]:

$$\min_{x_t \ge 0} \Psi(x_t), \quad t = 1,\ldots,T, \qquad (41)$$

where

$$\Psi(x_t) = \frac{1}{2} x_t^T H x_t + c_t^T x_t, \qquad (42)$$

with $H = A^T A$ and $c_t = -A^T y_t$.

with H = ATA and ct = −ATyt.The sequential coordinate-wise algorithm (SCWA) pro-

posed first by Franc et al. [62] solves the QP problem givenby (41) updating only single variable xjt in one iterative step.It should be noted that the sequential updates can be easilydone, if the function Ψ(xt) is equivalently rewritten as

Ψ(

xt) = 1

2

p∈I

r∈I

xptxrt(

ATA)pr +

p∈I

xpt(

ATyt)pt

= 12x2jt

(ATA

)j j + xjt

(ATyt

)jt

+ xjt∑

p∈I\{ j}xpt(

ATA)p j +

p∈I\{ j}xpt(

ATyt)pt

+12

p∈I\{ j}

r∈I\{ j}xptxrt

(ATA

)pr

= 12x2jth j j + xjtβjt + γjt,

(43)

where $I = \{1,\ldots,J\}$, and

$$h_{jj} = \big(A^T A\big)_{jj},$$
$$\beta_{jt} = (c_t)_j + \sum_{p\in I\setminus\{j\}} x_{pt}\big(A^T A\big)_{pj} = \big[A^T A\, x_t + c_t\big]_j - \big(A^T A\big)_{jj}\, x_{jt},$$
$$\gamma_{jt} = \sum_{p\in I\setminus\{j\}} x_{pt}(c_t)_p + \frac{1}{2}\sum_{p\in I\setminus\{j\}}\sum_{r\in I\setminus\{j\}} x_{pt} x_{rt}\big(A^T A\big)_{pr}. \qquad (44)$$


% Interior Point Newton (IPN) algorithm
%
function [x] = nmf_ipn(A,y,x,no_iter)
%
% [x] = nmf_ipn(A,y,x,no_iter) finds such x that solves the equation Ax = y
% subject to nonnegativity constraints.
%
% INPUTS:
% A - system matrix of dimension [I by J]
% y - vector of observations [I by 1]
% x - vector of initial guess [J by 1]
% no_iter - maximum number of iterations
%
% OUTPUTS:
% x - vector of estimated sources [J by 1]
%
% #########################################################################
% Parameters
s = 1.8; theta = 0.5; rho = .1; beta = 1;
H = A'*A; yt = A'*y; J = size(x,1);

% Main loop
for k=1:no_iter
    g = H*x - yt;
    d = ones(J,1); d(g >= 0) = x(g >= 0);
    ek = zeros(J,1); ek(g >= 0 & g < x.^s) = g(g >= 0 & g < x.^s);
    M = H + diag(ek./d);
    dg = d.*g;
    tau1 = (g'*dg)/(dg'*M*dg); tau_2vec = x./dg;
    tau2 = theta*min(tau_2vec(dg > 0));
    tau = tau1*ones(J,1); tau(x - tau1*dg <= 0) = tau2;
    w = 1./(d + ek); sk = sqrt(w.*d); pc = -tau.*dg;
    Z = repmat(sk,1,J).*M.*repmat(sk',J,1);
    rt = -g./sk;
    [pt,flag,relres,iter,resvec] = lsqr(Z, rt - g.*sk, 1E-8);
    p = pt.*sk;
    phx = max(0, x + p) - x;
    ph = max(rho, 1 - norm(phx))*phx;
    Phi_pc = .5*pc'*M*pc + pc'*g; Phi_ph = .5*ph'*M*ph + ph'*g;
    red_p = Phi_ph/Phi_pc; dp = pc - ph;

    if red_p >= beta
        t = 0;
    else
        ax = .5*dp'*M*dp; bx = dp'*(M*ph + g);
        cx = Phi_ph - beta*Phi_pc;
        Deltas = sqrt(bx^2 - 4*ax*cx);
        t1 = .5*(-bx + Deltas)/ax; t2 = .5*(-bx - Deltas)/ax;
        t1s = t1 > 0 & t1 < 1; t2s = t2 > 0 & t2 < 1; t = min(t1, t2);
        if (t <= 0)
            if t1s
                t = t1s;
            else
                t = t2s;
            end
        end
    end

    sk = t*dp + ph;
    x = x + sk;
end % for k

Algorithm 5


Figure 1: Dataset: (a) original 4 source signals, (b) observed 8 mixed signals.

Considering the optimization of $\Psi(x_t)$ with respect to the selected variable $x_{jt}$, the following analytical solution can be derived:

$$x_{jt}^* = \arg\min_{x_{jt}\ge 0} \Psi\big([x_{1t},\ldots,x_{jt},\ldots,x_{Jt}]^T\big) = \arg\min_{x_{jt}\ge 0}\, \frac{1}{2} x_{jt}^2 h_{jj} + x_{jt}\beta_{jt} + \gamma_{jt} = \max\Big(0, -\frac{\beta_{jt}}{h_{jj}}\Big) = \max\Bigg(0,\ x_{jt} - \frac{\big[A^T A\, x_t\big]_j - \big[A^T y_t\big]_j}{\big(A^T A\big)_{jj}}\Bigg). \qquad (45)$$

Updating only the single variable $x_{jt}$ in one iterative step, we have

$$x_{pt}^{(k+1)} = x_{pt}^{(k)}, \quad \forall p \in I\setminus\{j\}, \qquad x_{jt}^{(k+1)} \ne x_{jt}^{(k)}. \qquad (46)$$

Any optimal solution to the QP (41) satisfies the KKT conditions given by (5) and the stationarity condition of the following Lagrange function:

$$L(x_t, \lambda_t) = \frac{1}{2} x_t^T H x_t + c_t^T x_t - \lambda_t^T x_t, \qquad (47)$$

where $\lambda_t \in \mathbb{R}^J$ is a vector of Lagrange multipliers (or dual variables) corresponding to the vector $x_t$. Thus, $\nabla_{x_t} L(x_t,\lambda_t) = H x_t + c_t - \lambda_t = 0$. In the SCWA, the Lagrange

multipliers are updated in each iteration according to the formula

$$\lambda_t^{(k+1)} = \lambda_t^{(k)} + \big(x_{jt}^{(k+1)} - x_{jt}^{(k)}\big)\, h_j, \qquad (48)$$

where $h_j$ is the $j$th column of $H$, and $\lambda_t^{(0)} = c_t$.

Finally, the SCWA can take the following updates:

$$x_{jt}^{(k+1)} = \max\Bigg(0,\ x_{jt}^{(k)} - \frac{\lambda_j^{(k)}}{\big(A^T A\big)_{jj}}\Bigg),$$
$$x_{pt}^{(k+1)} = x_{pt}^{(k)}, \quad \forall p \in I\setminus\{j\},$$
$$\lambda_t^{(k+1)} = \lambda_t^{(k)} + \big(x_{jt}^{(k+1)} - x_{jt}^{(k)}\big)\, h_j. \qquad (49)$$
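
A minimal Matlab sketch of the SCWA updates (49) for one column x of X is given below; the function name, the cyclic sweep over the coordinates, and the multiplier initialization for a nonzero starting point are illustrative choices:

% Sequential coordinate-wise algorithm (SCWA) for min 0.5*x'*H*x + c'*x, x >= 0,
% following the updates (49). Function name and sweep order are illustrative.
function x = nmf_scwa(A, y, x, no_iter)
H = A'*A;  c = -A'*y;
lambda = H*x + c;                      % multipliers; equals lambda^(0) = c for x = 0
J = length(x);
for k = 1:no_iter
    for j = 1:J                        % cyclic sweep over coordinates
        xj_new = max(0, x(j) - lambda(j)/H(j,j));      % cf. (45), (49)
        lambda = lambda + (xj_new - x(j))*H(:,j);      % update multipliers, cf. (48)
        x(j)   = xj_new;
    end
end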

4. Simulations

All the proposed algorithms were implemented in our NMFLAB and evaluated with numerical tests related to typical BSS problems. We used a synthetic benchmark of 4 partially dependent nonnegative signals (with only T = 1000 samples) which are illustrated in Figure 1(a). The signals are mixed by a random, uniformly distributed nonnegative matrix


$A \in \mathbb{R}^{8\times 4}$ with the condition number $\operatorname{cond}\{A\} = 4.11$. The matrix $A$ is displayed in

$$A = \begin{bmatrix}
0.0631 & 0.7666 & 0.0174 & 0.6596\\
0.2642 & 0.6661 & 0.8194 & 0.2141\\
0.9995 & 0.1309 & 0.6211 & 0.6021\\
0.2120 & 0.0954 & 0.5602 & 0.6049\\
0.4984 & 0.0149 & 0.2440 & 0.6595\\
0.2905 & 0.2882 & 0.8220 & 0.1834\\
0.6728 & 0.8167 & 0.2632 & 0.6365\\
0.9580 & 0.9855 & 0.7536 & 0.1703
\end{bmatrix}. \qquad (50)$$

The mixed signals are shown in Figure 1(b). Because the number of variables in $X$ is much greater than in $A$, that is, $I \times J = 32$ and $J \times T = 4000$, we test the projected gradient algorithms only for updating $A$. The variables in $X$ are updated with the standard projected fixed point alternating least squares (FP-ALS) algorithm that is extensively analyzed in [55].

In general, the FP-ALS algorithm solves the least-squares problem

$$X^* = \arg\min_X \Big\{\frac{1}{2}\|Y - AX\|_F^2\Big\} \qquad (51)$$

with the Moore-Penrose pseudoinverse of the system matrix, that is, in our case, the matrix $A$. Since in NMF usually $I \ge J$, we formulate the normal equations as $A^T A X = A^T Y$, and the least-squares solution of minimal $l_2$-norm to the normal equations is $X_{LS} = (A^T A)^{-1} A^T Y = A^+ Y$, where $A^+$ is the Moore-Penrose pseudoinverse of $A$. The projected FP-ALS algorithm is obtained with a simple "half-rectified" projection, that is,

$$X = P_\Omega\big[A^+ Y\big]. \qquad (52)$$

The alternating minimization is nonconvex in spite of the cost function being convex with respect to one set of variables. Thus, most NMF algorithms may get stuck in local minima, and hence the initialization plays a key role. In the performed tests, we applied the multistart initialization described in [53] with the following parameters: N = 10 (number of restarts), Ki = 30 (number of initial alternating steps), and Kf = 1000 (number of final alternating steps). Each initial sample of A and X has been randomly generated from a uniform distribution. Each algorithm has been tested for two cases of inner iterations, that is, with k = 1 and k = 5. The number of inner iterations is the number of iterative steps performed to update only A (with X fixed, i.e., before going to the update of X, and vice versa). Additionally, the multilayer technique [53, 54] with 3 layers (L = 3) is applied.

The multilayer technique can be regarded as a multistep decomposition. In the first step, we perform the basic decomposition $Y = A_1 X_1$ using any available NMF algorithm, where $A_1 \in \mathbb{R}^{I\times J}$ and $X_1 \in \mathbb{R}^{J\times T}$ with $I \ge J$. In the second stage, the results obtained from the first stage are used to perform a similar decomposition, $X_1 = A_2 X_2$, where $A_2 \in \mathbb{R}^{J\times J}$ and $X_2 \in \mathbb{R}^{J\times T}$, using the same or different update rules, and so on. We continue the decomposition taking into account only the most recently obtained components. The process can be repeated arbitrarily many times until some stopping criterion is satisfied. In each step, we usually obtain a gradual improvement of the performance. Thus, our model has the form $Y = A_1 A_2 \cdots A_L X_L$ with the basis matrix defined as $A = A_1 A_2 \cdots A_L \in \mathbb{R}^{I\times J}$. Physically, this means that we build up a system that has many layers or a cascade connection of $L$ mixing subsystems.
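
The following Matlab fragment sketches this multilayer scheme with the projected FP-ALS update (52) used for X at each layer; update_A stands for any of the PG updates of A from Section 3 and, like the constants J, L, and the number of alternating steps, is a placeholder:

% Multilayer NMF (L layers): Y = A1*A2*...*AL*XL, with the basis A accumulated
% across layers. 'update_A', J, L, and the step count are placeholders.
L = 3;  Z = Y;  A = [];
for layer = 1:L
    A_l = rand(size(Z,1), J);                    % random nonnegative initialization
    for step = 1:1000                            % alternating steps
        X_l = max(eps, pinv(A_l)*Z);             % projected FP-ALS update of X, cf. (52)
        A_l = update_A(A_l, X_l, Z);             % any PG update of A (placeholder)
    end
    if isempty(A), A = A_l; else A = A*A_l; end  % accumulate the basis matrix
    Z = X_l;                                     % decompose the new components next
end
X = Z;                                           % final components: Y ~ A*X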

There are many stopping criteria for terminating the alternating steps. We stop the iterations if $s \ge K_f = 1000$ or if the condition $\|A^{(s)} - A^{(s-1)}\|_F < \varepsilon$ holds, where $s$ stands for the number of the alternating step and $\varepsilon = 10^{-5}$. Note that the condition (20) can also be used as a stopping criterion, especially as the gradient is computed in each iteration of the PG algorithms.

The algorithms have been evaluated with signal-to-interference ratio (SIR) measures, calculated separately for each source signal and each column of the mixing matrix. Since NMF suffers from scale and permutation indeterminacies, the estimated components are adequately permuted and rescaled. First, the source and estimated signals are normalized to a uniform variance, and then the estimated signals are permuted to keep the same order as the source signals. In NMFLAB [43], each estimated signal is compared to each source signal, which results in the performance (SIR) matrix that is used to construct the permutation matrix. Let $x_j$ and $\hat{x}_j$ be the $j$th source and its corresponding (reordered) estimated signal, respectively. Analogously, let $a_j$ and $\hat{a}_j$ be the $j$th column of the true and the estimated mixing matrix, respectively. Thus, the SIRs for the sources are given by

$$\mathrm{SIR}_j^{(X)} = -20\log\Bigg\{\frac{\|x_j - \hat{x}_j\|_2}{\|x_j\|_2}\Bigg\}\ \text{[dB]}, \quad j = 1,\ldots,J, \qquad (53)$$

and similarly for each column of $A$ we have

$$\mathrm{SIR}_j^{(A)} = -20\log\Bigg\{\frac{\|a_j - \hat{a}_j\|_2}{\|a_j\|_2}\Bigg\}\ \text{[dB]}, \quad j = 1,\ldots,J. \qquad (54)$$
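
For completeness, an illustrative Matlab computation of the SIR measures (53)-(54) is shown below; it assumes that the estimated factors Ahat and Xhat have already been rescaled and permuted to match the true A and X, and it uses the base-10 logarithm implied by the dB scale:

% Illustrative SIR computation, cf. (53)-(55); Ahat/Xhat are assumed to be
% already rescaled and permuted to match A/X.
J = size(A,2);
SIR_A = zeros(1,J);  SIR_X = zeros(1,J);
for j = 1:J
    SIR_A(j) = -20*log10( norm(A(:,j) - Ahat(:,j)) / norm(A(:,j)) );
    SIR_X(j) = -20*log10( norm(X(j,:) - Xhat(j,:)) / norm(X(j,:)) );
end
mean_SIR_A = mean(SIR_A);   mean_SIR_X = mean(SIR_X);   % mean-SIR values, cf. (55)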

We test the algorithms with a Monte Carlo (MC) analysis, running each algorithm 100 times. Each run has been initialized with the multistart procedure. The algorithms have been evaluated with the mean-SIR values that are calculated as follows:

$$\overline{\mathrm{SIR}}_X = \frac{1}{J}\sum_{j=1}^{J} \mathrm{SIR}_j^{(X)}, \qquad \overline{\mathrm{SIR}}_A = \frac{1}{J}\sum_{j=1}^{J} \mathrm{SIR}_j^{(A)}, \qquad (55)$$

for each MC sample. The mean-SIRs for the worst (with the lowest mean-SIR values) and best (with the highest mean-SIR values) samples are given in Table 1. The number $k$ denotes the number of inner iterations for updating $A$, and


Table 1: Mean-SIRs [dB] obtained with 100 samples of Monte Carlo analysis for the estimation of the sources and the columns of the mixing matrix from noise-free mixtures of the signals in Figure 1. Sources X are estimated with the projected pseudoinverse. The number of inner iterations for updating A is denoted by k, and the number of layers (in the multilayer technique) by L. The notation best, mean, or worst in parentheses that follows the algorithm name means that the mean-SIR value is calculated for the best, average, or worst sample from the Monte Carlo analysis, respectively. In the last column, the elapsed time [in seconds] is given for each algorithm with k = 1 and L = 1.

Algorithm        | Mean-SIR_A, L=1 | Mean-SIR_A, L=3 | Mean-SIR_X, L=1 | Mean-SIR_X, L=3 | Time
                 | k=1    k=5     | k=1    k=5     | k=1    k=5     | k=1    k=5     | [s]
M-NMF (best)     | 21     22.1    | 42.6   37.3    | 26.6   27.3    | 44.7   40.7    | 1.9
M-NMF (mean)     | 13.1   13.8    | 26.7   23.1    | 14.7   15.2    | 28.9   27.6    |
M-NMF (worst)    | 5.5    5.7     | 5.3    6.3     | 5.8    6.5     | 5      5.5     |
OPL (best)       | 22.9   25.3    | 46.5   42      | 23.9   23.5    | 55.8   51      | 1.9
OPL (mean)       | 14.7   14      | 25.5   27.2    | 15.3   14.8    | 23.9   25.4    |
OPL (worst)      | 4.8    4.8     | 4.8    5.0     | 4.6    4.6     | 4.6    4.8     |
Lin-PG (best)    | 36.3   23.6    | 78.6   103.7   | 34.2   33.3    | 78.5   92.8    | 8.8
Lin-PG (mean)    | 19.7   18.3    | 40.9   61.2    | 18.5   18.2    | 38.4   55.4    |
Lin-PG (worst)   | 14.4   13.1    | 17.5   40.1    | 13.9   13.8    | 18.1   34.4    |
GPSR-BB (best)   | 18.2   22.7    | 7.3    113.8   | 22.8   54.3    | 9.4    108.1   | 2.4
GPSR-BB (mean)   | 11.2   20.2    | 7      53.1    | 11     20.5    | 5.1    53.1    |
GPSR-BB (worst)  | 7.4    17.3    | 6.8    24.9    | 4.6    14.7    | 2      23      |
PSESOP (best)    | 21.2   22.6    | 71.1   132.2   | 23.4   55.5    | 56.5   137.2   | 5.4
PSESOP (mean)    | 15.2   20      | 29.4   57.3    | 15.9   34.5    | 27.4   65.3    |
PSESOP (worst)   | 8.3    15.8    | 6.9    28.7    | 8.2    16.6    | 7.2    30.9    |
IPG (best)       | 20.6   22.2    | 52.1   84.3    | 35.7   28.6    | 54.2   81.4    | 2.7
IPG (mean)       | 20.1   18.2    | 35.3   44.1    | 19.7   19.1    | 33.8   36.7    |
IPG (worst)      | 10.5   13.4    | 9.4    21.2    | 10.2   13.5    | 8.9    15.5    |
IPN (best)       | 20.8   22.6    | 59.9   65.8    | 53.5   52.4    | 68.6   67.2    | 14.2
IPN (mean)       | 19.4   17.3    | 38.2   22.5    | 22.8   19.1    | 36.6   21      |
IPN (worst)      | 11.7   15.2    | 7.5    7.1     | 5.7    2       | 1.5    2       |
RMRNSD (best)    | 24.7   21.6    | 22.2   57.9    | 30.2   43.5    | 25.5   62.4    | 3.8
RMRNSD (mean)    | 14.3   19.2    | 8.3    33.8    | 17     21.5    | 8.4    33.4    |
RMRNSD (worst)   | 5.5    15.9    | 3.6    8.4     | 4.7    13.8    | 1      3.9     |
SCWA (best)      | 12.1   20.4    | 10.6   24.5    | 6.3    25.6    | 11.9   34.4    | 2.5
SCWA (mean)      | 11.2   16.3    | 9.3    20.9    | 5.3    18.6    | 9.4    21.7    |
SCWA (worst)     | 7.3    11.4    | 6.9    12.8    | 3.8    10      | 3.3    10.8    |

$L$ denotes the number of layers in the multilayer technique [53, 54]. The notation $L = 1$ means that the multilayer technique was not used. The elapsed time [in seconds] is measured in Matlab, and it gives some indication of the complexity of each algorithm.

For comparison, Table 1 also contains the results obtained for the standard multiplicative NMF algorithm (denoted as M-NMF) that minimizes the squared Euclidean distance. Additionally, the results of testing the PG algorithms proposed in [53] have also been included. The acronyms Lin-PG, IPG, and RMRNSD refer to the following algorithms: the projected gradient algorithm proposed by Lin [52], the interior-point gradient algorithm, and the regularized minimal residual norm steepest descent algorithm (the regularized version of the MRNSD algorithm proposed by Nagy and Strakos [74]). These NMF algorithms have been implemented and investigated in [53] in the context of their usefulness for BSS problems.

5. Conclusions

The performance of the proposed NMF algorithms can be inferred from the results given in Table 1. In particular, the results show how sensitive the algorithms are to initialization, or in other words, how easily they fall into local minima. The complexity of the algorithms can also be estimated from the elapsed time measured in Matlab.

It is easy to notice that our NMF-PSESOP algorithm gives the best estimation (the sample which has the highest best-SIR value), and it gives only slightly lower mean-SIR values than the Lin-PG algorithm. Considering the elapsed time, the OPL, GPSR-BB, SCWA, and IPG algorithms belong to the fastest


algorithms, while the Lin-PG and IPN algorithms are the slowest.

The multilayer technique generally improves the performance and consistency of all the tested algorithms if the number of observations is close to the number of nonnegative components. The highest improvement can be observed for the NMF-PSESOP algorithm, especially when the number of inner iterations is greater than one (typically, k = 5).

In summary, the best and most promising NMF PG algorithms are the NMF-PSESOP, GPSR-BB, and IPG algorithms. However, the final selection of the algorithm depends on the size of the problem to be solved. Nevertheless, the projected gradient NMF algorithms proved to be much better (in the sense of speed and performance) in our tests than the multiplicative algorithms, provided that we can use the squared Euclidean cost function, which is optimal for data with Gaussian noise.

References

[1] A. Cichocki, R. Zdunek, and S. Amari, “New algorithmsfor nonnegative matrix factorization in applications to blindsource separation,” in Proceedings of IEEE International Con-ference on Acoustics, Speech and Signal Processing (ICASSP ’06),vol. 5, pp. 621–624, Toulouse, France, May 2006.

[2] J. Piper, V. P. Pauca, R. J. Plemmons, and M. Giffin, “Objectcharacterization from spectral data using nonnegative matrixfactorization and information theory,” in Proceedings of theAMOS Technical Conference, pp. 1–12, Maui, Hawaii, USA,September 2004.

[3] M. N. Schmidt and M. Mørup, “Nonnegative matrix factor2-D deconvolution for blind single channel source separa-tion,” in Proceedings of the 6th International Conference onIndependent Component Analysis and Blind Signal Separation(ICA ’06), vol. 3889 of Lecture Notes in Computer Science, pp.700–707, Charleston, SC, USA, March 2006.

[4] A. Ziehe, P. Laskov, K. Pawelzik, and K.-R. Mueller, “Non-negative sparse coding for general blind source separation,”in Advances in Neural Information Processing Systems 16,Vancouver, Canada, 2003.

[5] W. Wang, Y. Luo, J. A. Chambers, and S. Sanei, “Nonnegativematrix factorization for note onset detection of audio signals,”in Proceedings of the 16th IEEE International Workshop onMachine Learning for Signal Processing (MLSP ’06), pp. 447–452, Maynooth, Ireland, September 2006.

[6] W. Wang, “Squared euclidean distance based convolutivenonnegative matrix factorization with multiplicative learningrules for audio pattern separation,” in Proceedings of the7th IEEE International Symposium on Signal Processing andInformation Technology (ISSPIT ’07), pp. 347–352, Cairo,Egypt, December 2007.

[7] H. Li, T. Adali, W. Wang, and D. Emge, “Nonnegative matrixfactorization with orthogonality constraints for chemicalagent detection in Raman spectra,” in Proceedings of the15th IEEE International Workshop on Machine Learning forSignal Processing (MLSP ’05), pp. 253–258, Mystic, Conn,USA, September 2005.

[8] P. Sajda, S. Du, T. Brown, L. Parra, and R. Stoyanova,“Recovery of constituent spectra in 3D chemical shift imagingusing nonnegative matrix factorization,” in Proceedings ofthe 4th International Symposium on Independent Component

Analysis and Blind (ICA ’03), pp. 71–76, Nara, Japan, April2003.

[9] P. Sajda, S. Du, T. R. Brown, et al., “Nonnegative matrixfactorization for rapid recovery of constituent spectra inmagnetic resonance chemical shift imaging of the brain,” IEEETransactions on Medical Imaging, vol. 23, no. 12, pp. 1453–1465, 2004.

[10] P. Sajda, S. Du, and L. C. Parra, “Recovery of constituentspectra using nonnegative matrix factorization,” in Wavelets:Applications in Signal and Image Processing X, vol. 5207 ofProceedings of SPIE, pp. 321–331, San Diego, Calif, USA,August 2003.

[11] D. D. Lee and H. S. Seung, “Learning the parts of objects bynonnegative matrix factorization,” Nature, vol. 401, no. 6755,pp. 788–791, 1999.

[12] W. Liu and N. Zheng, “Nonnegative matrix factorizationbased methods for object recognition,” Pattern RecognitionLetters, vol. 25, no. 8, pp. 893–897, 2004.

[13] M. W. Spratling, “Learning image components for objectrecognition,” Journal of Machine Learning Research, vol. 7, pp.793–815, 2006.

[14] Y. Wang, Y. Jia, C. Hu, and M. Turk, “Nonnegative matrixfactorization framework for face recognition,” InternationalJournal of Pattern Recognition and Artificial Intelligence, vol. 19,no. 4, pp. 495–511, 2005.

[15] P. Smaragdis, “Nonnegative matrix factor deconvolution;extraction of multiple sound sources from monophonicinputs,” in Proceedings of the 5th International Conference onIndependent Component Analysis and Blind Signal Separation(ICA ’04), vol. 3195 of Lecture Notes in Computer Science, pp.494–499, Granada, Spain, September 2004.

[16] P. Smaragdis, “Convolutive speech bases and their applicationto supervised speech separation,” IEEE Transactions on Audio,Speech and Language Processing, vol. 15, no. 1, pp. 1–12, 2007.

[17] J.-H. Ahn, S. Kim, J.-H. Oh, and S. Choi, “Multiplenonnegative-matrix factorization of dynamic PET images,” inProceedings of the 6th Asian Conference on Computer Vision(ACCV ’04), pp. 1009–1013, Jeju Island, Korea, January 2004.

[18] P. Carmona-Saez, R. D. Pascual-Marqui, F. Tirado, J. M.Carazo, and A. Pascual-Montano, “Biclustering of geneexpression data by non-smooth nonnegative matrix factoriza-tion,” BMC Bioinformatics, vol. 7, article 78, pp. 1–18, 2006.

[19] D. Guillamet, B. Schiele, and J. Vitria, “Analyzing nonnegativematrix factorization for image classification,” in Proceedingsof the 16th International Conference on Pattern Recognition(ICPR ’02), vol. 2, pp. 116–119, Quebec City, Canada, August2002.

[20] D. Guillamet and J. Vitria, “Nonnegative matrix factorizationfor face recognition,” in Proceedings of the 5th CatalanConference on Artificial Intelligence (CCIA ’02), pp. 336–344,Castello de la Plana, Spain, October 2002.

[21] D. Guillamet, J. Vitria, and B. Schiele, “Introducing a weightednonnegative matrix factorization for image classification,”Pattern Recognition Letters, vol. 24, no. 14, pp. 2447–2454,2003.

[22] O. G. Okun, “Nonnegative matrix factorization and classifiers:experimental study,” in Proceedings of the 4th IASTED Interna-tional Conference on Visualization, Imaging, and Image Process-ing (VIIP ’04), pp. 550–555, Marbella, Spain, September 2004.

[23] O. G. Okun and H. Priisalu, “Fast nonnegative matrixfactorization and its application for protein fold recognition,”EURASIP Journal on Applied Signal Processing, vol. 2006,Article ID 71817, 8 pages, 2006.


[24] A. Pascual-Montano, J. M. Carazo, K. Kochi, D. Lehmann,and R. D. Pascual-Marqui, “Non-smooth nonnegative matrixfactorization (nsNMF),” IEEE Transactions on Pattern Analysisand Machine Intelligence, vol. 28, no. 3, pp. 403–415, 2006.

[25] V. P. Pauca, F. Shahnaz, M. W. Berry, and R. J. Plemmons,“Text mining using nonnegative matrix factorizations,” inProceedings of the 4th SIAM International Conference on DataMining (SDM ’04), pp. 452–456, Lake Buena Vista, Fla, USA,April 2004.

[26] F. Shahnaz, M. W. Berry, V. P. Pauca, and R. J. Plemmons,“Document clustering using nonnegative matrix factoriza-tion,” Journal on Information Processing & Management, vol.42, no. 2, pp. 373–386, 2006.

[27] T. Li and C. Ding, “The relationships among various non-negative matrix factorization methods for clustering,” inProceedings of the 6th IEEE International Conference on DataMining (ICDM ’06), pp. 362–371, Hong Kong, December2006.

[28] C. Ding, T. Li, W. Peng, and H. Park, “Orthogonal nonnegativematrix tri-factorizations for clustering,” in Proceedings of the12th ACM SIGKDD International Conference on KnowledgeDiscovery and Data Mining (KDD ’06), pp. 126–135, Philadel-phia, Pa, USA, August 2006.

[29] R. Zass and A. Shashua, “A unifying approach to hardand probabilistic clustering,” in Proceedings of the 10th IEEEInternational Conference on Computer Vision (ICCV ’05), vol.1, pp. 294–301, Beijing, China, October 2005.

[30] A. Banerjee, S. Merugu, I. S. Dhillon, and J. Ghosh, “Clusteringwith Bregman divergences,” in Proceedings of the 4th SIAMInternational Conference on Data Mining (SDM ’04), pp. 234–245, Lake Buena Vista, Fla, USA, April 2004.

[31] H. Cho, I. S. Dhillon, Y. Guan, and S. Sra, “Minimum sum-squared residue co-clustering of gene expression data,” inProceedings of the 4th SIAM International Conference on DataMining (SDM ’04), pp. 114–125, Lake Buena Vista, Fla, USA,April 2004.

[32] S. Wild, Seeding nonnegative matrix factorization with thespherical k-means clustering, M.S. thesis, University of Col-orado, Boulder, Colo, USA, 2000.

[33] M. W. Berry, M. Browne, A. N. Langville, V. P. Pauca, and R.J. Plemmons, “Algorithms and applications for approximatenonnegative matrix factorization,” Computational Statisticsand Data Analysis, vol. 52, no. 1, pp. 155–173, 2007.

[34] Y.-C. Cho and S. Choi, “Nonnegative features of spectro-temporal sounds for classification,” Pattern Recognition Letters,vol. 26, no. 9, pp. 1327–1336, 2005.

[35] J.-P. Brunet, P. Tamayo, T. R. Golub, and J. P. Mesirov,“Metagenes and molecular pattern discovery using matrixfactorization,” Proceedings of the National Academy of Sciencesof the United States of America, vol. 101, no. 12, pp. 4164–4169,2004.

[36] N. Rao and S. J. Shepherd, “Extracting characteristic patternsfrom genome-wide expression data by nonnegative matrix fac-torization,” in Proceedings of the IEEE Computational SystemsBioinformatics Conference (CSB ’04), pp. 570–571, Stanford,Calif, USA, August 2004.

[37] A. Cichocki, R. Zdunek, and S. Amari, “Csiszar’s divergencesfor nonnegative matrix factorization: family of new algo-rithms,” in Independent Component Analysis and Blind SignalSeparation, vol. 3889 of Lecture Notes in Computer Science, pp.32–39, Springer, New York, NY, USA, 2006.

[38] A. Cichocki, R. Zdunek, and S. Amari, “Nonnegative matrixand tensor factorization,” IEEE Signal Processing Magazine,vol. 25, no. 1, pp. 142–145, 2008.

[39] D. Donoho and V. Stodden, “When does nonnegative matrixfactorization give a correct decomposition into parts?” inAdvances in Neural Information Processing Systems 16, Vancou-ver, Canada, 2003.

[40] A. M. Bruckstein, M. Elad, and M. Zibulevsky, “Sparsenonnegative solution of a linear system of equations isunique,” in Proceedings of the 3rd International Symposium onCommunications, Control and Signal Processing (ISCCSP ’08),St. Julians, Malta, March 2008.

[41] F. J. Theis, K. Stadlthanner, and T. Tanaka, “First results onuniqueness of sparse nonnegative matrix factorization,” inProceedings of the 13th European Signal Processing Conference(EUSIPCO ’05), Antalya, Turkey, September 2005.

[42] H. Laurberg, M. G. Christensen, M. D. Plumbley, L. K.Hansen, and S. H. Jensen, “Theorems on positive data:on the uniqueness of NMF,” Computational Intelligence andNeuroscience, vol. 2008, Article ID 764206, 9 pages, 2008.

[43] A. Cichocki and R. Zdunek, “NMFLAB for signal and imageprocessing,” Tech. Rep., Laboratory for Advanced Brain SignalProcessing, BSI, RIKEN, Saitama, Japan, 2006.

[44] A. Cichocki, S. Amari, R. Zdunek, R. Kompass, G. Hori, andZ. He, “Extended SMART algorithms for nonnegative matrixfactorization,” in Proceedings of the 8th International Confer-ence on Artificial Intelligence and Soft Computing (ICAISC ’06),vol. 4029 of Lecture Notes in Computer Science, pp. 548–562,Springer, Zakopane, Poland, June 2006.

[45] R. Zdunek and A. Cichocki, “Nonnegative matrix factor-ization with quasi-Newton optimization,” in Proceedings ofthe 8th International Conference on Artificial Intelligence andSoft Computing (ICAISC ’06), vol. 4029 of Lecture Notes inComputer Science, pp. 870–879, Zakopane, Poland, June 2006.

[46] P. Paatero, “Least-squares formulation of robust nonnegativefactor analysis,” Chemometrics and Intelligent Laboratory Sys-tems, vol. 37, no. 1, pp. 23–35, 1997.

[47] P. Paatero and U. Tapper, “Positive matrix factorization: anonnegative factor model with optimal utilization of errorestimates of data values,” Environmetrics, vol. 5, no. 2, pp. 111–126, 1994.

[48] D. D. Lee and H. S. Seung, “Algorithms for nonnegative matrixfactorization,” in Advances in Neural Information ProcessingSystems 13, pp. 556–562, Denver, Colo, USA, 2000.

[49] Ch.-J. Lin., “On the convergence of multiplicative updatealgorithms for nonnegative matrix factorization,” IEEE Trans-actions on Neural Networks, vol. 18, no. 6, pp. 1589–1596, 2007.

[50] M. T. Chu, F. Diele, R. Plemmons, and S. Ragni, “Optimal-ity, computation, and interpretation of nonnegative matrixfactorizations,” Tech. Rep., Departments of Mathematics andComputer Science, Wake Forest University, Winston-Salem,NC, USA, 2004.

[51] P. O. Hoyer, “Nonnegative matrix factorization with sparse-ness constraints,” Journal of Machine Learning Research, vol. 5,pp. 1457–1469, 2004.

[52] C.-J. Lin, “Projected gradient methods for nonnegative matrixfactorization,” Neural Computation, vol. 19, no. 10, pp. 2756–2779, 2007.

[53] A. Cichocki and R. Zdunek, “Multilayer nonnegative matrixfactorization using projected gradient approaches,” Interna-tional Journal of Neural Systems, vol. 17, no. 6, pp. 431–446,2007.

[54] A Cichocki and R. Zdunek, “Multilayer nonnegative matrixfactorization,” Electronics Letters, vol. 42, no. 16, pp. 947–948,2006.


[55] A. Cichocki and R. Zdunek, “Regularized alternating leastsquares algorithms for nonnegative matrix/tensor factoriza-tions,” in Proceedings of the 4th International Symposium onNeural Networks on Advances in Neural Networks (ISNN ’07),vol. 4493 of Lecture Notes in Computer Science, pp. 793–802,Springer, Nanjing, China, June 2007.

[56] R. Zdunek and A. Cichocki, “Nonnegative matrix factor-ization with constrained second-order optimization,” SignalProcessing, vol. 87, no. 8, pp. 1904–1916, 2007.

[57] B. Johansson, T. Elfving, V. Kozlov, Y. Censor, P.-E. Forssen,and G. Granlund, “The application of an oblique-projectedLandweber method to a model of supervised learning,”Mathematical and Computer Modelling, vol. 43, no. 7-8, pp.892–909, 2006.

[58] J. Barzilai and J. M. Borwein, “Two-point step size gradientmethods,” IMA Journal of Numerical Analysis, vol. 8, no. 1, pp.141–148, 1988.

[59] G. Narkiss and M. Zibulevsky, “Sequential subspace optimiza-tion method for large-scale unconstrained problems,” Tech.Rep. 559, Department of Electrical Engineering, Technion,Israel Institute of Technology, Haifa, Israel, October 2005.

[60] M. Elad, B. Matalon, and M. Zibulevsky, “Coordinate andsubspace optimization methods for linear least squares withnon-quadratic regularization,” Applied and ComputationalHarmonic Analysis, vol. 23, no. 3, pp. 346–367, 2007.

[61] S. Bellavia, M. Macconi, and B. Morini, “An interior pointNewton-like method for nonnegative least-squares problemswith degenerate solution,” Numerical Linear Algebra withApplications, vol. 13, no. 10, pp. 825–846, 2006.

[62] V. Franc, V. Hlavac, and M. Navara, “Sequential coordinate-wise algorithm for the nonnegative least squares problem,” inProceedings of the 11th International Conference on ComputerAnalysis of Images and Patterns (CAIP ’05), A. Gagalowicz andW. Philips, Eds., vol. 3691 of Lecture Notes in Computer Science,pp. 407–414, Springer, Versailles, France, September 2005.

[63] M. Bertero and P. Boccacci, Introduction to Inverse Problems inImaging, Institute of Physics, Bristol, UK, 1998.

[64] Y.-H. Dai and R. Fletcher, “Projected Barzilai-Borwein meth-ods for large-scale box-constrained quadratic programming,”Numerische Mathematik, vol. 100, no. 1, pp. 21–47, 2005.

[65] A. Nemirovski, “Orth-method for smooth convex optimiza-tion,” Izvestiia Akademii Nauk SSSR. Tekhnicheskaia Kiber-netika, vol. 2, 1982 (Russian).

[66] R. W. Freund and N. M. Nachtigal, “QMR: a quasi-minimal residual method for non-Hermitian linear systems,”Numerische Mathematik, vol. 60, no. 1, pp. 315–339, 1991.

[67] R. Fletcher, “Conjugate gradient methods for indefinite sys-tems,” in Proceedings of the Dundee Biennial Conference onNumerical Analysis, pp. 73–89, Springer, Dundee, Scotland,July 1975.

[68] C. Lanczos, “Solution of systems of linear equations byminimized iterations,” Journal of Research of the NationalBureau of Standards, vol. 49, no. 1, pp. 33–53, 1952.

[69] H. A. van der Vorst, “Bi-CGSTAB: a fast and smoothlyconverging variant of Bi-CG for the solution of nonsymmetriclinear systems,” SIAM Journal on Scientific and StatisticalComputing, vol. 13, no. 2, pp. 631–644, 1992.

[70] Y. Saad and M. H. Schultz, “GMRES: a generalized minimalresidual algorithm for solving nonsymmetric linear systems,”SIAM Journal on Scientific and Statistical Computing, vol. 7, no.3, pp. 856–869, 1986.

[71] M. R. Hestenes and E. Stiefel, “Method of conjugate gradientsfor solving linear systems,” Journal of Research of the NationalBureau of Standards, vol. 49, pp. 409–436, 1952.

[72] P. C. Hansen, Rank-Deficient and Discrete Ill-Posed Problems,SIAM, Philadelphia, Pa, USA, 1998.

[73] C. C. Paige and M. A. Saunders, “LSQR: an algorithmfor sparse linear equations and sparse least squares,” ACMTransactions on Mathematical Software, vol. 8, no. 1, pp. 43–71, 1982.

[74] J. G. Nagy and Z. Strakos, “Enforcing nonnegativity inimage reconstruction algorithms,” in Mathematical Modeling,Estimation, and Imaging, vol. 4121 of Proceedings of SPIE, pp.182–190, San Diego, Calif, USA, July 2000.


Hindawi Publishing Corporation, Computational Intelligence and Neuroscience, Volume 2008, Article ID 764206, 9 pages, doi:10.1155/2008/764206

Research Article

Theorems on Positive Data: On the Uniqueness of NMF

Hans Laurberg,1 Mads Græsbøll Christensen,1 Mark D. Plumbley,2

Lars Kai Hansen,3 and Søren Holdt Jensen1

1 Department of Electronic Systems, Aalborg University, Niels Jernes Vej 12, 9220 Aalborg, Denmark
2 Department of Electronic Engineering, Queen Mary, University of London, Mile End Road, London E1 4NS, UK
3 Department of Informatics and Mathematical Modeling, Technical University of Denmark, Richard Petersens Plads, Building 321, 2800 Lyngby, Denmark

Correspondence should be addressed to Hans Laurberg, [email protected]

Received 1 November 2007; Accepted 13 March 2008

Recommended by Wenwu Wang

We investigate the conditions for which nonnegative matrix factorization (NMF) is unique and introduce several theorems which can determine whether the decomposition is in fact unique or not. The theorems are illustrated by several examples showing the use of the theorems and their limitations. We have shown that corruption of a unique NMF matrix by additive noise leads to a noisy estimation of the noise-free unique solution. Finally, we use a stochastic view of NMF to analyze which characterization of the underlying model will result in an NMF with small estimation errors.

Copyright © 2008 Hans Laurberg et al. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

1. Introduction

Large quantities of positive data occur in research areas such as music analysis, text analysis, image analysis, and probability theory. Before deductive science is applied to large quantities of data, it is often appropriate to reduce the data by preprocessing, for example, by matrix rank reduction or by feature extraction. Principal component analysis is an example of such preprocessing. When the original data is nonnegative, it is often desirable to preserve this property in the preprocessing. For example, elements in a power spectrogram, probabilities, and pixel intensities should still be nonnegative after the processing to be meaningful. This has led to the construction of algorithms for rank reduction of matrices and feature extraction generating nonnegative output. Many of the algorithms are related to the nonnegative matrix factorization (NMF) algorithm proposed by Lee and Seung [1, 2]. NMF algorithms factorize a nonnegative matrix $V \in \mathbb{R}_+^{n\times m}$ or $R \in \mathbb{R}_+^{n\times m}$ into two nonnegative matrices $W \in \mathbb{R}_+^{n\times r}$ and $H \in \mathbb{R}_+^{r\times m}$:

$$V \approx R = WH. \qquad (1)$$

There are no closed-form solutions to the problem of finding $W$ and $H$ given a $V$, but Lee and Seung [1, 2] proposed two computationally efficient algorithms for minimizing the difference between $V$ and $WH$ for two different error functions. Later, numerous other algorithms have been proposed (see [3]).

An interesting question is whether the NMF of a particular matrix is unique. The importance of this question depends on the particular application of NMF. There can be two different viewpoints when using a model like NMF: either one can believe that the model describes nature and that the variables $W$ and $H$ have a physical meaning, or one can believe that the model can capture the part of interest even though there is not a one-to-one mapping between the parameters of the model and the physical system. When using NMF, one can wonder whether $V$ is a disturbed version of some underlying $WH$ or whether the data is constructed by another model, or, in other words, whether a ground truth $W$ and $H$ exists. These questions are important in evaluating whether or not it is a problem that there is another NMF solution, $W'H'$, to the same data, that is,

$$V \approx R = WH = W'H'. \qquad (2)$$

If NMF is used even though the data is not assumed to be generated by (1), it may not be a problem that there are several other solutions. On the other hand, if one assumes that a ground truth exists, it may be a problem if the model is not detectable, that is, if it is not possible to find $W$ and $H$ from the data matrix $V$.

The first articles on the subject were two correspondences between Berman and Thomas. In [4], Berman asked for what amounts to a simple characterization of the class of nonnegative matrices $R$ for which an NMF exists. As we shall see, the answer by Thomas [5] can be turned into an NMF uniqueness theorem.

The first article investigating the uniqueness of NMF is Donoho and Stodden [6]. They use convex duality to conclude that in some situations, where the column vectors of $W$ "describe parts," and for that reason are nonoverlapping and thereby orthogonal, the NMF solution is unique.

Simultaneously with the development of NMF, Plumbley [7] worked with nonnegative independent component analysis, where one of the problems is to estimate a rotation matrix $Q$ from observations of the form $Qs$, where $s$ is a nonnegative vector. In this setup, Plumbley investigates a property of a nonnegative independent and identically distributed (i.i.d.) vector $s$ such that $Q$ can be estimated. He shows that if the elements in $s$ are grounded and a sufficiently large set of observations is used, then $Q$ can be estimated. The uniqueness constraint in [7] is a statistical condition on $s$.

The result in [7] is highly relevant to NMF uniqueness due to the fact that in most cases new NMF solutions will have the forms $WQ$ and $Q^{-1}H$, as described in Section 3. By using Plumbley's result twice, a restricted uniqueness theorem for NMF can be constructed.

In this paper, we investigate the circumstances under which NMF of an observed nonnegative matrix is unique. We present novel necessary and sufficient conditions for the uniqueness. Several examples illustrating these conditions and their interpretations are given. Additionally, we show that NMF is robust to additive noise. More specifically, we show that it is possible to obtain accurate estimates of $W$ and $H$ from noisy data when the generating NMF is unique. Lastly, we consider the generating NMF as a stochastic process and show that particular classes of such processes almost surely result in unique NMFs.

This paper is structured as follows. Section 2 introduces the notation, some definitions, and basic results. A precise definition and two characterizations of a unique NMF are given in Section 3. The minimum constraints of $W$ and $H$ for a unique NMF are investigated in Section 4. Conditions and examples of a unique NMF are given in Section 5. In Section 6, it is shown that in situations where noise is added to a data matrix with a unique NMF, it is possible to bound the error of the estimates of $W$ and $H$. A probabilistic view of the uniqueness is considered in Section 7. The implication of the theorems is discussed in Section 8, and Section 9 concludes the paper.

2. Fundamentals

We will here introduce convex duality, which will be the framework of the paper, but first we shall define the notation to be used.

Figure 1: A three-dimensional space is scaled such that the vectors are in the hyperplane {p | [1 1 1]p = 1}. By the mapping to the hyperplane, a plane in R3 is mapped to a line and a simplicial cone is mapped to an area. In the figure, it can be observed that the dashed triangle (desired solution) is the only triangle (third-order simplicial cone) that contains the shaded area (positive span of W) while being within the solid border (the dual of H). The NMF can be concluded to be unique by Theorem 1.

Nonnegative real numbers are denoted R+, ‖·‖F denotes the Frobenius norm, and span(·) is the space spanned by a set of vectors. Each type of variable has its own font: a scalar is denoted x, a column vector x, a row vector x, a matrix X, a set X, and a random variable X. Moreover, x^j_i is the ith index of the vector x^j. When a condition for a set is used to describe a matrix, it refers to the set of column vectors in the matrix. The NMF is symmetric in WT and H, so the theorems for one matrix may also be used for the other.

In the paper, we make a geometric interpretation of the NMF similar to that used in both [5, 6]. For that, we need the following definitions.

Definition 1. The positive span is given by span+(b1, . . . , bd) = {v = ∑i ai bi | a ∈ Rd+}.

In some literature, the positive span is called the conical hull.

Definition 2. A set A is called a simplicial cone if there is a set B such that A = span+(B). The order of a simplicial cone A is the minimum number of elements in B.

Definition 3. The dual to a set A, denoted A∗, is given by A∗ = {v | vTa ≥ 0 for all a ∈ A}.

The following lemma is easy to prove and will be used subsequently. For a more general introduction to convex duality, see [8].

Lemma 1. (a) If X = span+(b1, . . . , bd), then y ∈ X∗ if and only if yTbn ≥ 0 for all n.

(b) If X = span+(BT) and BT = [b1, . . . , bd] is invertible, then X∗ = span+(B−1).

(c) If Y ⊆ X, then X∗ ⊆ Y∗.

(d) If Y and X are closed simplicial cones and Y ⊂ X, then X∗ ⊂ Y∗.
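Lemma 1(b) is easy to check numerically; the snippet below is a small illustration with an arbitrary invertible matrix (ours, not part of the original paper).

```python
import numpy as np

rng = np.random.default_rng(1)
d = 3

# Arbitrary invertible B^T; its columns b_1, ..., b_d generate X = span_+(b_1, ..., b_d).
B_T = rng.random((d, d)) + d * np.eye(d)   # diagonally dominant, hence invertible
B_inv = np.linalg.inv(B_T.T)

# Lemma 1(b): X* = span_+(B^{-1}).  Take y as a nonnegative combination of the columns
# of B^{-1} and verify y^T b_n >= 0 for every generator b_n (the test of Lemma 1(a)).
a = rng.random(d)            # nonnegative coefficients
y = B_inv @ a
print(np.all(B_T.T @ y >= -1e-12))   # True: y lies in the dual cone X*
```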

3. Dual Space and the NMF

In this section, our definition of unique NMF and some general conditions for unique NMF are given. As a starting point, let us assume that both W and H have full rank, that is, r = rank(R).

Let W′ and H′ be any matrices that fulfil R = WH = W′H′. Then span(W) = span(R) = span(W′). The column vectors of W and W′ are therefore both bases for the same space, and as a result there exists a basis shift matrix Q ∈ Rr×r such that W′ = WQ. It follows that H′ = Q−1H. Therefore, all NMF solutions where r = rank(R) are of the form R = WQQ−1H. In these situations, the ambiguity of the NMF is the Q matrix. Note that if r > rank(R), the above arguments are not valid because rank(W) can differ from rank(W′) and thereby span(W) ≠ span(W′).

Example 1. The following is an example of an R4×4+ matrix of rank 3 where there are two NMF solutions but no Q matrix to connect the solutions:

$$\begin{bmatrix} 1 & 1 & 0 & 0 \\ 1 & 0 & 1 & 0 \\ 0 & 1 & 0 & 1 \\ 0 & 0 & 1 & 1 \end{bmatrix} = R = \underbrace{R}_{W}\,\underbrace{I}_{H} = \underbrace{I}_{W'}\,\underbrace{R}_{H'}. \qquad (3)$$

We mention in passing that Thomas [5] uses this matrix to illustrate a related problem. This completes the example.
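The example is easy to verify numerically; the following check (ours, not from the paper) confirms that R has rank 3 and that both factorizations in (3) reproduce R exactly.

```python
import numpy as np

# The 4x4 rank-3 matrix from Example 1, equation (3).
R = np.array([[1, 1, 0, 0],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [0, 0, 1, 1]], dtype=float)
I = np.eye(4)

print(np.linalg.matrix_rank(R))                        # 3
print(np.allclose(R @ I, R), np.allclose(I @ R, R))    # True True: W=R,H=I and W=I,H=R
```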

Lemma 2 (Minc [9, Lemma 1.1]). The inverse of a nonnegative matrix is nonnegative if and only if it is a scaled permutation.

Lemma 2 shows that all solutions of the form WQ and Q−1H, where Q is a scaled permutation, are valid NMFs, and thereby that an NMF can only be unique up to a permutation and scaling. This leads to the following definition of unique NMF in this paper.

Definition 4. A matrix R has a unique NMF if the ambiguity is a permutation and a scaling of the columns in W and rows in H.

The scaling and permutation ambiguity in the uniqueness definition is a well-known ambiguity that occurs in many blind source separation problems. With this definition of unique NMF, it is possible to make the following two characterizations of the unique NMF.

Theorem 1. If r = rank(R), an NMF is unique if and only if the positive orthant is the only r-order simplicial cone Q such that span+(WT) ⊆ Q ⊆ span+(H)∗.

Proof. The proof follows the analysis of the Q matrix above in combination with Lemma 1(b). The theorem can also be proved by following the steps of the proof in [5].

Theorem 2 (see [6]). The NMF is unique if and only if there is only one r-order simplicial cone Q such that span+(R) ⊆ Q ⊆ P, where P is the positive orthant.

Proof. The proof follows directly from the definitions.

The first characterization is inspired by [5], and the second characterization is implicitly introduced in [6]. Note that the two characterizations of the unique NMF analyze the problem from two different viewpoints. Theorem 1 takes a known W and H pair as its starting point and looks at the solution from the "inside," that is, the r-dimensional space of row vectors in W and column vectors in H. Theorem 2 looks at the problem from the "outside," that is, the n-dimensional column space of R.

4. Matrix Conditions

If R = WH is unique, then both W and H have to be unique, respectively; that is, there is only one NMF of W and one of H, namely W = WI and H = IH. In this section, a necessary condition on W and H is given and a sufficient condition is shown.

The following definition will be shown to be a necessary condition for both the set of row vectors in W and the set of column vectors in H.

Definition 5. A set S of vectors in Rd+ is called boundary close if for all j ≠ i and k > 0 there is an element s ∈ S such that

$$s_j < k\, s_i. \qquad (4)$$

In the case of closed sets, the boundary close condition is that s_j = 0 and s_i ≠ 0. In this section, the sets will be finite (and therefore closed), but in Section 7 the general definition above is needed.

Theorem 3. The set of row vectors in W has to be boundary close for the corresponding NMF to be unique.

Proof. If the set of row vectors in W is not boundary close, there exist indexes j ≠ i and k > 0 such that the jth element is always more than k times larger than the ith element in the row vectors of W. Let Q = span+(q1, . . . , qr), where

$$q_n = \begin{cases} e_i + k\, e_j & \text{if } n = i, \\ e_n & \text{otherwise}, \end{cases} \qquad (5)$$

and en denotes the nth standard basis vector. This set fulfils the condition span+(WT) ⊆ Q ⊂ P, and we therefore, using Theorem 1, conclude that the NMF cannot be unique.

That not only the row vectors of W with small elements determine the uniqueness can be seen from the following example.

Example 2. The following is an example where W is not unique but the matrix W̄ obtained by appending the row [3 1 1] to W is.


Let

$$W = \begin{bmatrix} 0 & 1 & 1 \\ 1 & 0 & 1 \\ 1 & 1 & 0 \end{bmatrix}. \qquad (6)$$

Here W is boundary close but not unique since W = WI = IW. The uniqueness of W̄ (W with the appended row [3 1 1]) can be verified by plotting the matrix as shown in Figure 1 and observing that the conditions of Theorem 1 are fulfilled. This completes the example.

In three dimensions, as in Example 2, it is easy to investigate whether a boundary close W is unique: if W = W′H′, then H′ can only have two types of structure, either the trivial (desired) solution where H′ = I or a solution where only the diagonal of H′ is zero. In higher dimensions, the number of combinations of nontrivial solutions increases and it becomes more complicated to investigate all possible nontrivial structures. For example, if W is the matrix from Example 2, then the matrix

$$\begin{bmatrix} W & 0 \\ 0 & W \end{bmatrix} \qquad (7)$$

is boundary close and can be decomposed in several ways, for example,

$$\begin{bmatrix} W & 0 \\ 0 & W \end{bmatrix} = \begin{bmatrix} I & 0 \\ 0 & W \end{bmatrix}\begin{bmatrix} W & 0 \\ 0 & I \end{bmatrix} = \begin{bmatrix} W & 0 \\ 0 & I \end{bmatrix}\begin{bmatrix} I & 0 \\ 0 & W \end{bmatrix} = \begin{bmatrix} I & 0 \\ 0 & I \end{bmatrix}\begin{bmatrix} W & 0 \\ 0 & W \end{bmatrix}. \qquad (8)$$

Instead of seeking necessary and sufficient conditions for a unique W, a sufficient condition not much stronger than the necessary one is given. In this sufficient condition, we only focus on the row vectors of W with a zero (or very small) element.

Definition 6. A set of vectors S in Rd+ is called strongly boundary close if it is boundary close and there exists a z > 0 and a numbering of the elements in the vectors such that for all k > 0 and n ∈ {1, . . . , d − 1} there are d − n vectors from S, {s1, . . . , sd−n}, that fulfil the following:

(1) $s^j_n < k \sum_{i>n} s^j_i$ for all j; and

(2) $\kappa_2([b_1, \ldots, b_{d-n}]) \le z$, where κ2(·) is the condition number of a matrix, defined as the ratio between the largest and smallest singular values [10, page 81], $b^j = P_n s^j$, and $P_n \in \mathbb{R}^{(d-n)\times d}$ is a projection matrix that picks the last d − n elements of a vector in Rd.

Theorem 4. If span+(WT) is strongly boundary close, then W is unique.

The proof is quite technical and is therefore given in the appendix. The most important thing to notice is that the necessary condition in Theorem 3 and the sufficient condition in Theorem 4 are very similar. The first item in the strongly boundary close definition states that there have to be several vectors with small values. The second item ensures that the vectors with small values are linearly independent in the last elements.

5. Uniqueness of R

In this section, a condition for a unique V is analyzed. First, Example 3 is used to investigate when a strongly boundary close W and H pair is unique. The section ends with a constraint on W and H that results in a unique NMF.

Example 3. This is an investigation of the uniqueness of R when W and H are given as

$$H = \begin{bmatrix} \alpha & 1 & 1 & \alpha & 0 & 0 \\ 1 & \alpha & 0 & 0 & \alpha & 1 \\ 0 & 0 & \alpha & 1 & 1 & \alpha \end{bmatrix}, \qquad W = H^T, \qquad (9)$$

where 0 < α < 1. Both W and H are strongly boundary close, and the z parameter can be calculated as

$$z = \kappa_2\big([b_1, \ldots, b_{d-n}]\big) = \kappa_2\!\left(\begin{bmatrix} \alpha & 1 \\ 1 & \alpha \end{bmatrix}\right) = \frac{1+\alpha}{1-\alpha}. \qquad (10)$$

The equation above shows that a small α will result in a z close to one, and an α close to one results in a large z. In Figure 2, the matrix R = WH is plotted for α ∈ {0.1, 0.3, 0.5, 0.7}. The dashed line is the desired solution and is repeated in all figures. It is seen that the shaded area span+(WT) decreases when α increases, and that the solid border span+(H)∗ increases when α increases. For all α-values, both the shaded area and the solid border intersect with the dashed triangle. Therefore, it is not possible to get another solution by simply increasing/decreasing the desired solution. The figure shows that the NMF is unique for α ∈ {0.1, 0.3} and not unique for α ∈ {0.5, 0.7}, where the alternative solution is shown by a dotted line. That the NMF is not unique for α ∈ {0.5, 0.7} can also be verified by selecting Q to be the symmetric orthonormal matrix

$$Q = Q^T = Q^{-1} = \frac{1}{3}\begin{bmatrix} -1 & 2 & 2 \\ 2 & -1 & 2 \\ 2 & 2 & -1 \end{bmatrix}, \qquad (11)$$

and seeing that both WQ and Q−1H are nonnegative. If α = 0.3, then the matrix R is given by

$$R = \frac{1}{100}\begin{bmatrix} 109 & 60 & 30 & 9 & 30 & 100 \\ 60 & 109 & 100 & 30 & 9 & 30 \\ 30 & 100 & 109 & 60 & 30 & 9 \\ 9 & 30 & 60 & 109 & 100 & 30 \\ 30 & 9 & 30 & 100 & 109 & 60 \\ 100 & 30 & 9 & 30 & 60 & 109 \end{bmatrix}. \qquad (12)$$

This shows that R needs no zeros for the NMF to be unique. This completes the example.
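The uniqueness claim in Example 3 can be probed numerically with the orthonormal Q in (11); the helper below is our own illustrative check, not code from the paper.

```python
import numpy as np

def has_rotated_nmf(alpha):
    """Return True if WQ and Q^{-1}H (Q from equation (11)) form an alternative NMF of
    R = WH for the Example 3 factors, i.e. if both rotated factors are nonnegative."""
    H = np.array([[alpha, 1, 1, alpha, 0, 0],
                  [1, alpha, 0, 0, alpha, 1],
                  [0, 0, alpha, 1, 1, alpha]], dtype=float)
    W = H.T
    Q = np.array([[-1, 2, 2],
                  [2, -1, 2],
                  [2, 2, -1]]) / 3.0        # Q = Q^T = Q^{-1}
    return bool(np.all(W @ Q >= 0) and np.all(Q @ H >= 0))

print(has_rotated_nmf(0.3))   # False: this rotation leaves the nonnegative orthant
print(has_rotated_nmf(0.5))   # True: an alternative nonnegative factorization exists
```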


(a) α = 0.1 (b) α = 0.3

(c) α = 0.5 (d) α = 0.7

Figure 2: The figure shows data constructed as in Example 3 and plotted in the same manner as in Figure 1; that is, the dashed triangle is the desired solution, the solid line is the border of the dual of H, and the shaded area is the positive span of W. It can be seen that the NMF is unique when α equals 0.1 or 0.3 but not when α equals 0.5 or 0.7. In the cases where the NMF is not unique, an alternative solution is shown with a dotted line.

In the example above, W equals HT and thereby fulfils the same constraints. In many applications, the meanings of W and H differ, for example, in music analysis where the column vectors of W are spectra of notes and H is a note activity matrix [11].

Next, it is investigated how to make an asymmetric uniqueness constraint.

Definition 7. A set S of vectors in Rd+ is called sufficiently spread if for all j and k > 0 there is an element s ∈ S such that

$$s_j > k\sum_{i \ne j} s_i. \qquad (13)$$

Note that in the definition of a sufficiently spread set the jth element is larger than the sum, in contrast to the strongly boundary close definition where the jth element is smaller than the sum.

Lemma 3. The dual space of a sufficiently spread set is the positive orthant.

Proof. A sufficiently spread set is nonnegative, and the positive orthant is therefore part of the dual set of any sufficiently spread set. Let b be a vector with a negative value in the jth element and select

$$k = \frac{\sum_{i \ne j} |b_i|}{-b_j}. \qquad (14)$$

In any sufficiently spread set, an s exists such that $s_j > k\sum_{i\ne j} s_i$, and therefore

$$s^T b = s_j b_j + \sum_{i\ne j} s_i b_i \le s_j b_j + \Big(\sum_{i\ne j} s_i\Big)\Big(\sum_{i\ne j} |b_i|\Big) = -b_j\Big(-s_j + k\sum_{i\ne j} s_i\Big) < 0. \qquad (15)$$

The vector b is therefore not in the dual of any sufficiently spread set.

In the case of finite sets, the sufficiently spread condition is the same as the requirement that a scaled version of every standard basis vector be part of the set. It is easy to verify that a sufficiently spread set is also strongly boundary close and that the z parameter is one.
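For finite sets, both conditions reduce to simple element-wise tests. The helpers below are our own sketch of those finite-set checks (rows of the input array are the vectors of the set).

```python
import numpy as np

def is_boundary_close(S, tol=1e-12):
    """Finite (closed) set version of Definition 5: for every pair j != i there is a
    vector s in the set with s_j = 0 and s_i != 0."""
    d = S.shape[1]
    return all(np.any((S[:, j] <= tol) & (S[:, i] > tol))
               for j in range(d) for i in range(d) if i != j)

def is_sufficiently_spread(S, tol=1e-12):
    """Finite set version of Definition 7: a scaled copy of every standard basis vector
    e_j must be present in the set."""
    d = S.shape[1]
    return all(np.any((S[:, j] > tol) &
                      np.all(np.delete(S, j, axis=1) <= tol, axis=1))
               for j in range(d))

# Columns of H from Example 3 (alpha = 0.3), passed as rows:
H = np.array([[0.3, 1, 1, 0.3, 0, 0],
              [1, 0.3, 0, 0, 0.3, 1],
              [0, 0, 0.3, 1, 1, 0.3]])
print(is_boundary_close(H.T), is_sufficiently_spread(H.T))   # True False
```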

Theorem 5. If H is sufficiently spread and WT is strongly boundary close, then the NMF of R = WH is unique.

Proof. Lemma 3 states that the dual set of a sufficiently spread set is the positive orthant,

span+(H)∗ = P = span+(I)∗. (16)

Theorem 4 states that WI is unique, and by using (16) and Theorem 1 we conclude that R = WH is unique.

Theorem 5 is a stronger version of the result of Donoho and Stodden [6, Theorem 1]. Theorem 1 in [6] also assumes that H is sufficiently spread, but its condition on WT is stronger than the strongly boundary close assumption.

6. Perturbation Analysis

In the previous sections, we have analyzed situations with a unique solution. In this section, it is shown that in some situations the nonuniqueness can be seen as estimation noise on W and H. The error function that describes how close an estimated [W′, H′] pair is to the true [W, H] pair is

$$J_{(W,H)}(W', H') = \min_{P,D}\Big(\big\|W - W'(DP)\big\|_F + \big\|H - (DP)^{-1}H'\big\|_F\Big), \qquad (17)$$

where P is a permutation matrix and D is a diagonal matrix.
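There is no simple closed form for the joint minimization over D and P in (17). The sketch below (ours) enumerates all column permutations and, for each, uses a column-norm-matching diagonal scaling, so it returns an upper bound on J(W,H)(W′, H′) rather than the exact minimum; for small r this is usually adequate for experiments.

```python
import numpy as np
from itertools import permutations

def J_upper(W, H, Wp, Hp, eps=1e-12):
    """Upper bound on the error measure in (17).

    For each permutation of the r columns of W' (and rows of H'), a diagonal scaling is
    chosen so that the column norms of the permuted W' match those of W; the product
    W'H' is unchanged because H' is divided by the same factors.
    """
    r = W.shape[1]
    best = np.inf
    for perm in permutations(range(r)):
        idx = list(perm)
        Wp_p, Hp_p = Wp[:, idx], Hp[idx, :]
        d = np.linalg.norm(W, axis=0) / (np.linalg.norm(Wp_p, axis=0) + eps)
        err = (np.linalg.norm(W - Wp_p * d, 'fro')
               + np.linalg.norm(H - Hp_p / d[:, None], 'fro'))
        best = min(best, err)
    return best
```

For example, J_upper(W, H, 2 * W[:, ::-1], H[::-1, :] / 2) is zero up to numerical precision, reflecting that permutation and scaling are allowed ambiguities.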

Theorem 6. Let R = WH be a unique NMF. Given some ε > 0, there exists a δ > 0 such that any nonnegative V = R + N, where ‖N‖F < δ, fulfils

$$J_{(W,H)}(W', H') < \varepsilon, \qquad (18)$$

where

$$[W', H'] = \arg\min_{W'\in\mathbb{R}^{n\times r}_+,\, H'\in\mathbb{R}^{r\times m}_+} \big\|V - W'H'\big\|_F. \qquad (19)$$

The proof is given in the appendix. The theorem states that if the observation is corrupted by additive noise, the result is noisy estimates of W and H.


Figure 3: The three basis pictures: (a) a dog, (b) a man, and (c) the sun from Example 4, individually and summed in (d).

Moreover, Theorem 6 shows that if the noise is small, it will result in small estimation errors. In this section, the Frobenius norm is used in (17) and (19) to make Theorem 6 concrete. Theorem 6 is also valid, with the same proof, if any continuous metric is used instead of the Frobenius norm in those equations.

Example 4. This example investigates the connection between the additive noise in V and the estimation error on W and H. The column vectors in W are basis pictures of a man, a dog, and the sun, as shown in Figures 3(a), 3(b), and 3(c). In Figure 3(d), the sum of the three basis pictures is shown. The matrix H is the set of all combinations of the pictures, that is,

$$H = \begin{bmatrix} 1 & 0 & 0 & 0 & 1 & 1 & 1 \\ 0 & 1 & 0 & 1 & 0 & 1 & 1 \\ 0 & 0 & 1 & 1 & 1 & 0 & 1 \end{bmatrix}. \qquad (20)$$

Theorem 5 can be used to conclude that the NMF of R = WH is unique because both WT and H are sufficiently spread and thereby also strongly boundary close.

In the example, two different noise matrices, NN and NM, are used. The NN matrix models noisy observations and has elements that are random uniform i.i.d. The NM matrix contains elements that are minus one in the positions where R has elements that are two, and zero elsewhere; that is, NM is minus one in the positions where the dog and the man overlap. In this case, the error matrix NM simulates a model mismatch that occurs in the following two types of real-world data. If the data set is composed of pictures, the basis pictures will be overlapping, and a pixel in V will consist of one basis picture and not a mixture of the overlapping pictures. If the data is a set of amplitude spectra, the true model is an addition of complex values and not an addition of the amplitudes.

The estimation error of the factorization, J(W,H)(W′, H′), is plotted in Figure 4 when the norm of the error matrix is μ, that is, V = WH + (N/‖N‖F)μ. An estimate of the [W′, H′] pair is calculated by using the iterative algorithm for Frobenius norm minimization proposed by Lee and Seung [2]. The algorithm is run for 500 iterations and is started from 100 different positions. The decomposition that minimizes ‖V − W′H′‖F is chosen, and J(W,H)(W′, H′) is calculated numerically. Figure 4 shows that when the added error is small, it is possible to estimate the underlying parameters. When the norm of the added noise matrix increases, the behaviors of the two noise matrices, NN and NM, differ. For NN, the error of the estimate increases slowly with the norm of the added matrix, while the estimation error for NM increases dramatically when the norm is larger than 2.5. In the simulations, we have made the following observation that can explain the difference in performance between the two types of noise. When NN is used, the basis pictures remain noisy versions of the man, the dog, and the sun. When NM is used and the norm is larger than 2.5, the basis pictures are the man excluding the overlap, the dog excluding the overlap, and the overlap of man and dog. Another way to describe the difference is that the rank of NM is one and the disturbance is in one dimension, whereas NN is full rank and the disturbance is in many dimensions. This completes the example.

Corollary 1. Let R = WH be a unique NMF and V̄ = W̄H̄, where W̄ = W + NW and H̄ = H + NH. Given R and ε > 0, there exists a δ > 0 such that if the largest absolute value of both NW and NH is smaller than δ, then

$$J_{(\bar W, \bar H)}(W', H') < \varepsilon, \qquad (21)$$

where W′, H′ are any NMF of V̄.

Proof. This follows directly from Theorem 6.

The corollary can be used in situations where there are small elements in W and H but no (or not enough) zero elements, as in the following example.

Example 5. Let R = WH, where W and H are generated as in Example 3. Let all elements in both NW and NH be equal to η. In Figure 5, V̄ is plotted for α = 0.3 and η ∈ {0.01, 0.05, 0.10, 0.15}. In this example, neither the shaded area nor the solid border intersects with the desired solution. Therefore, it is possible to get other solutions by simply increasing/decreasing the desired solution. For η ∈ {0.01, 0.05}, the corners of the solutions are close to the corners of the desired solution. When η = 0.1, the corners can be placed mostly on the solid border and still form a triangle that contains the shaded area. When η = 0.15, the corners can be anywhere on the solid border. This completes the example.

7. Probability and Uniqueness

In this section, the row vectors of W and the column vectors of H are seen as results of two random variables. Characteristics of the sample space (the possible outcomes) of a random variable that lead to unique NMF will be investigated.


Figure 4: The graph shows the connection between the norm of the additive error ‖N‖F (horizontal axis, norm of error μ) and the estimation error of the underlying model J(W,H)(W′, H′) (vertical axis). The two noise matrices from Example 4, NN and NM (model mismatch and additive noise), are plotted. In this example, the curves are aligned for small errors, and for larger errors the model error NM results in much larger estimation errors.

(a) η = 0.01 (b) η = 0.05

(c) η = 0.10 (d) η = 0.15

Figure 5: Data constructed as in Example 5 and plotted in the same manner as in Figure 1; that is, the dashed triangle is the desired solution, the solid line is the border of the dual of H, and the shaded area is the positive span of W. In all the plots, α equals 0.3, and η equals 0.01, 0.05, 0.1, and 0.15.


Theorem 7. Let the row vectors of W be generated by the random variable XW, and let the column vectors of H be generated by a random variable XH. If the sample space of XW is strongly boundary close and the sample space of XH is sufficiently spread, then for all ε > 0 and k < 1 there exist Nε and Mε such that

$$p\Big(\min_{D,P}\big(\|DPQ - I\|_F\big) < \varepsilon\Big) > k, \qquad (22)$$

where Q is any matrix such that WQ and Q−1H are nonnegative, and the data size R ∈ Rn×m+ is such that n > Nε and m > Mε.

Proof. Scaling the data, D1RD2, does not change the nonuniqueness of the solutions when measured by the Q matrix. The proof is therefore done on normalized versions of W and H. Let YW and YH be the normalized versions of XW and XH. There exist finite sets W and H of vectors in the closure of YW and YH that are strongly boundary close and sufficiently spread. By Theorem 5, it is known that V = WH is unique. By increasing the number of vectors sampled from YW and YH, for any ε′ > 0 there will be two subsets of the vectors, W′ and H′, that with a probability larger than any k < 1 will fulfil

ε′ > ‖W − W′‖F + ‖H − H′‖F. (23)

It is possible to use Corollary 1 on this subset. The fact that limiting minD,P(‖DPQ − I‖F) is equivalent to limiting (21) when the vectors are normalized concludes the proof.

Example 6. Let all the elements in H be exponential i.i.d. and therefore generated with a sufficiently spread sample space. Additionally, let each row in W be exponential i.i.d. plus a random vector with the sample space {(0, 1, 1), (1, 0, 1), (1, 1, 0)}, and thereby strongly boundary close. In Figure 6, the above variables are shown for the following four matrix sizes: R ∈ {R10×10, R40×40, R100×100, R500×500}. This completes the example.

8. Discussion

The approach in this paper is to investigate when nonnegativity leads to uniqueness in connection with NMF, V ≈ R = WH. Nonnegativity is the only assumption in the theorems, and the theorems therefore cannot be used as an argument for an NMF being nonunique if there is additional information about W or H. An example with stronger uniqueness results is the sparse NMF algorithm of Hoyer [12], built on the assumption that the row vectors in H have known ratios between the L1 norm and the L2 norm. Theis et al. [13] have investigated uniqueness in this situation and shown strong uniqueness results. Another example is data matrices with an added constant on each row. For this situation, the affine NMF algorithm [14] can make the NMF unique even though the setup violates Theorem 3 in this paper.

As shown in Figure 4, the type of noise greatly influences the error curves. In applications where noise is introduced because the additive model does not hold, for example, when V contains pictures or spectra, it is possible to influence the noise by applying a nonlinear function to the elements of V.


(a) R[10×10] (b) R[40×40]

(c) R[100×100] (d) R[500×500]

Figure 6: The figure shows data constructed as in Example 6, plotted in the same manner as the previous figure, with the exception that each row vector of W is plotted instead of the positive span of the vectors. The size of R is shown under each plot.

Such a nonlinear function is introduced in [15], and experiments show that it improves the results. A theoretical framework for finding good nonlinear functions would be interesting to investigate.

The sufficiently spread condition defined in Section 5 plays an important role for unique NMF due to Lemma 3. The sufficiently spread assumption is seen indirectly in related areas where it also leads to unique solutions, for example, in [7], where the groundedness assumption leads to variables with a sufficiently spread sample space. If the matrix H is sufficiently spread, then the columns of W will occur (almost) alone as columns in V. Deville [16] uses this "occur alone" assumption, and thereby the sufficiently spread assumption, to make blind source separation possible.

9. Conclusion

We have investigated the uniqueness of NMF from three different viewpoints as follows:

(i) uniqueness in noise-free situations;

(ii) the estimation error of the underlying model when noise is added to a matrix with a unique NMF; and

(iii) the random processes that lead to matrices where the underlying model can be estimated with small errors.

By doing this, we have shown that it is possible to make many novel and useful characterizations that can serve as a theoretical underpinning for the use of the numerous NMF algorithms. Several open issues can be found in all three viewpoints that, if addressed, will give a better understanding of nonnegative matrix factorization.

APPENDIX

Proof of Theorem 4. The theorem states that W = WI is a unique NMF. To prove this, it is shown that the condition of Theorem 1 is fulfilled. The positive orthant is self-dual (I = I−1), and thereby Q ⊆ P, where Q is an r-order simplicial cone that contains span+(WT). Let the set of row vectors in W be denoted by W. An r-order simplicial cone like Q is a closed set, and it therefore needs to contain the closure of W, denoted by W̄. The two items in Definition 6 of strongly boundary close can be reformulated for W̄, which contains the border:

(1) s^j_n = 0 for all j;

(2) the vectors [b1, . . . , bd−n] are linearly independent.

The rest of the proof follows by induction. If r = 2, then W̄ = P and is therefore unique. Let therefore r > 2. Then r − 1 linearly independent vectors in W̄ have zero as the first element, and r − 1 of the basis vectors therefore need to have zero in the first element. In other words, there is only one basis vector with a nonzero first element. Let us call this vector b1. For all j > 1, there is a vector in W̄ which is nonnegative in the first element and zero in the jth element, so all the elements in b1 except the first have to be zero. The proof is completed by seeing that if the first element is removed from the vectors in W̄, the set is still strongly boundary close, and the problem is therefore the (r − 1)-dimensional problem.

Proof of Theorem 6. Let G be the open set of all W′, H′ pairs that are close to W and H,

$$G = \big\{[W', H'] \mid J_{(W,H)}(W', H') < \varepsilon\big\}. \qquad (A.1)$$

Let Ḡ be the set of all nonnegative W̄, H̄ pairs that are not in G and where max(W̄, H̄) ≤ √(1 + max(R)). The uniqueness of R ensures that

$$\|R - \bar W\bar H\|_F > 0 \qquad (A.2)$$

for all [W̄, H̄] ∈ Ḡ. The fact that the Frobenius norm is continuous, that Ḡ is a closed bounded set, and that the statement above is positive ensures that

$$\min_{[\bar W, \bar H]\in \bar G}\big(\|R - \bar W\bar H\|_F\big) = \delta' > 0, \qquad (A.3)$$

since a continuous function attains its limits on a closed bounded set [17, Theorem 4.28]. The W̄, H̄ pairs that are not in G and where max(W̄, H̄) > √(1 + max(R)) can either be transformed by a diagonal matrix into a matrix pair from Ḡ, [W̄D, D−1H̄] ∈ Ḡ, having the same product (W̄H̄), or they can be transformed into a pair where both W̄ and H̄ have large elements, that is,

$$\max(\bar W\bar H) > \sqrt{1 + \max(R)}^{\,2} = 1 + \max(R), \qquad (A.4)$$

and thereby ‖R − W̄H̄‖F > 1. Select δ to be δ = min(1, δ′)/2. The error of the desired solution R = WH can be bounded by ‖V − R‖F = ‖N‖F < δ.


Let V̄ be any matrix constructed from a nonnegative matrix pair not in G. Because of the way δ is selected, ‖V̄ − R‖F ≥ 2δ. By the triangle inequality, we get

$$\|\bar V - V\|_F + \|V - R\|_F \ge \|\bar V - R\|_F,$$
$$\|\bar V - V\|_F \ge \|\bar V - R\|_F - \|V - R\|_F > 2\delta - \delta = \delta > \|V - R\|_F. \qquad (A.5)$$

All solutions that are not in G therefore have a larger error than WH and will not be the minimizer of the error.

Acknowledgments

This research was supported by the Intelligent Sound project, Danish Technical Research Council Grant no. 26-02-0092. The work of M. G. Christensen is supported by the Parametric Audio Processing project, Danish Research Council for Technology and Production Sciences, Grant no. 274-06-0521. Part of this work was previously presented at a conference [18].

References

[1] D. D. Lee and H. S. Seung, "Learning the parts of objects by non-negative matrix factorization," Nature, vol. 401, no. 6755, pp. 788–791, 1999.

[2] D. D. Lee and H. S. Seung, "Algorithms for non-negative matrix factorization," in Advances in Neural Information Processing Systems 13, pp. 556–562, MIT Press, Cambridge, Mass, USA, 2000.

[3] M. W. Berry, M. Browne, A. N. Langville, V. P. Pauca, and R. J. Plemmons, "Algorithms and applications for approximate nonnegative matrix factorization," Computational Statistics & Data Analysis, vol. 52, no. 1, pp. 155–173, 2007.

[4] A. Berman, "Problem 73-14, rank factorization of nonnegative matrices," SIAM Review, vol. 15, no. 3, p. 655, 1973.

[5] L. Thomas, "Solution to problem 73-14, rank factorizations of nonnegative matrices," SIAM Review, vol. 16, no. 3, pp. 393–394, 1974.

[6] D. Donoho and V. Stodden, "When does non-negative matrix factorization give a correct decomposition into parts?" in Advances in Neural Information Processing Systems 16, pp. 1141–1148, MIT Press, Cambridge, Mass, USA, 2004.

[7] M. Plumbley, "Conditions for nonnegative independent component analysis," IEEE Signal Processing Letters, vol. 9, no. 6, pp. 177–180, 2002.

[8] R. T. Rockafellar, Convex Analysis, Princeton University Press, Princeton, NJ, USA, 1st edition, 1970.

[9] H. Minc, Nonnegative Matrices, John Wiley & Sons, New York, NY, USA, 1st edition, 1988.

[10] G. H. Golub and C. F. Van Loan, Matrix Computations, Johns Hopkins University Press, Baltimore, Md, USA, 3rd edition, 1996.

[11] P. Smaragdis and J. Brown, "Non-negative matrix factorization for polyphonic music transcription," in Proceedings of the IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA '03), pp. 177–180, New Paltz, NY, USA, October 2003.

[12] P. O. Hoyer, "Non-negative matrix factorization with sparseness constraints," The Journal of Machine Learning Research, vol. 5, pp. 1457–1469, 2004.

[13] F. Theis, K. Stadlthanner, and T. Tanaka, "First results on uniqueness of sparse non-negative matrix factorization," in Proceedings of the 13th European Signal Processing Conference (EUSIPCO '05), Antalya, Turkey, September 2005.

[14] H. Laurberg and L. K. Hansen, "On affine non-negative matrix factorization," in Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP '07), vol. 2, pp. 653–656, Honolulu, Hawaii, USA, April 2007.

[15] M. N. Schmidt, J. Larsen, and F.-T. Hsiao, "Wind noise reduction using non-negative sparse coding," in Proceedings of the IEEE Workshop on Machine Learning for Signal Processing (MLSP '07), pp. 431–436, Thessaloniki, Greece, August 2007.

[16] Y. Deville, "Temporal and time-frequency correlation-based blind source separation methods," in Proceedings of the 4th International Symposium on Independent Component Analysis and Blind Signal Separation (ICA '03), pp. 1059–1064, Nara, Japan, April 2003.

[17] T. M. Apostol, Mathematical Analysis, Addison-Wesley, Reading, Mass, USA, 2nd edition, 1974.

[18] H. Laurberg, "Uniqueness of non-negative matrix factorization," in Proceedings of the 14th IEEE/SP Workshop on Statistical Signal Processing (SSP '07), pp. 44–48, Madison, Wis, USA, August 2007.


Hindawi Publishing Corporation, Computational Intelligence and Neuroscience, Volume 2008, Article ID 361705, 10 pages, doi:10.1155/2008/361705

Research Article

Nonnegative Matrix Factorization with Gaussian Process Priors

Mikkel N. Schmidt1 and Hans Laurberg2

1 Department of Informatics and Mathematical Modelling, Technical University of Denmark, Richard Petersens Plads, DTU-Building 321, 2800 Lyngby, Denmark
2 Department of Electronic Systems, Aalborg University, Niels Jernes Vej 12, 9220 Aalborg Ø., Denmark

Correspondence should be addressed to Mikkel N. Schmidt, [email protected]

Received 31 October 2007; Revised 16 January 2008; Accepted 10 February 2008

Recommended by Wenwu Wang

We present a general method for including prior knowledge in a nonnegative matrix factorization (NMF), based on Gaussian process priors. We assume that the nonnegative factors in the NMF are linked by a strictly increasing function to an underlying Gaussian process specified by its covariance function. This allows us to find NMF decompositions that agree with our prior knowledge of the distribution of the factors, such as sparseness, smoothness, and symmetries. The method is demonstrated with an example from chemical shift brain imaging.

Copyright © 2008 M. N. Schmidt and H. Laurberg. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

1. Introduction

Nonnegative matrix factorization (NMF) [1, 2] is a recent method for factorizing a matrix as the product of two matrices, in which all elements are nonnegative. NMF has found widespread application in many different areas including pattern recognition [3], clustering [4], dimensionality reduction [5], and spectral analysis [6, 7]. Many physical signals, such as pixel intensities, amplitude spectra, and occurrence counts, are naturally represented by nonnegative numbers. In the analysis of mixtures of such data, nonnegativity of the individual components is a reasonable constraint. Recently, a very simple algorithm [8] for computing the NMF was introduced. This has initiated much research aimed at developing more robust and efficient algorithms.

Efforts have been made to enhance the quality of the NMF by adding further constraints to the decomposition, such as sparsity [9], spatial localization [10, 11], and smoothness [11, 12], or by extending the model to be convolutive [13, 14]. Many extended NMF methods are derived by adding appropriate constraints and penalty terms to a cost function. Alternatively, NMF methods can be derived in a probabilistic setting, based on the distribution of the data [6, 15–17]. These approaches have the advantage that the underlying assumptions in the model are made explicit.

In this paper, we present a general method for using prior knowledge to improve the quality of the solutions in NMF. The method is derived in a probabilistic setting, and it is based on defining prior probability distributions of the factors in the NMF model in a Gaussian process framework. We assume that the nonnegative factors in the NMF are linked by a strictly increasing function to an underlying Gaussian process, specified by its covariance function. By specifying the covariance of the underlying process, we can compute NMF decompositions that agree with our prior knowledge of the factors, such as sparseness, smoothness, and symmetries. We refer to the proposed method as nonnegative matrix factorization with Gaussian process priors, or GPP-NMF for short.

2. NMF with Gaussian Process Priors

In the following, we derive a method for including prior information in an NMF decomposition by assuming Gaussian process priors (for a general introduction to Gaussian processes, see, e.g., Rasmussen and Williams [18]). In our approach, the Gaussian process priors are linked to the nonnegative factors in the NMF by a suitable link function. To set up the notation, we start by deriving the standard NMF method as a maximum likelihood (ML) estimator and then move on to the maximum a posteriori (MAP) estimator. Then, we discuss Gaussian process priors and introduce a change of variable that gives better optimization properties. Finally, we discuss the selection of the link function.

2.1. Maximum Likelihood NMF

The NMF problem can be stated as

X = DH + N, (1)

where X ∈ RK×L is a data matrix that is factorized as the product of two element-wise nonnegative matrices, D ∈ RK×M+ and H ∈ RM×L+, where R+ denotes the nonnegative reals. The matrix N ∈ RK×L is the residual noise.

There exist a number of different algorithms [8, 15–17, 19–21] for computing this factorization, some of which can be viewed as maximum likelihood methods under certain assumptions about the distribution of the data. For example, least squares NMF corresponds to i.i.d. Gaussian noise [17], and Kullback-Leibler NMF corresponds to a Poisson process [6].

The ML estimate of D and H is given by

$$\{D_{\mathrm{ML}}, H_{\mathrm{ML}}\} = \arg\min_{D,H\ge 0}\, \mathcal{L}_{X|D,H}(D, H), \qquad (2)$$

where $\mathcal{L}_{X|D,H}(D, H)$ is the negative log likelihood of the factors.

Example 1 (least squares NMF). An example of a maximum likelihood NMF is the least squares estimate. If the noise is i.i.d. Gaussian with variance σ²N, the likelihood of the factors D and H can be written as

$$p^{\mathrm{LS}}_{X|D,H}(X \mid D, H) = \frac{1}{(\sqrt{2\pi}\,\sigma_N)^{KL}} \exp\!\left(-\frac{\|X - DH\|_F^2}{2\sigma_N^2}\right). \qquad (3)$$

The negative log likelihood, which serves as a cost function for optimization, is then

$$\mathcal{L}^{\mathrm{LS}}_{X|D,H}(D, H) \propto \frac{1}{2\sigma_N^2}\,\|X - DH\|_F^2, \qquad (4)$$

where we use the proportionality symbol to denote equality subject to an additive constant. To compute a maximum likelihood estimate of D and H, the gradient of the negative log likelihood is useful:

$$\nabla_H \mathcal{L}^{\mathrm{LS}}_{X|D,H}(D, H) = \frac{1}{\sigma_N^2}\, D^\top (DH - X), \qquad (5)$$

and the gradient with respect to D, which is easy to derive, is similar because of the symmetry of the NMF problem.

The ML estimate can be computed by multiplicative update rules based on the gradient [8], projected gradient descent [19], alternating least squares [20], Newton-type methods [21], or any other appropriate constrained optimization method.
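As a minimal illustration of the least squares ML estimate, the sketch below implements projected gradient descent using the gradient in (5); the fixed step size is a simplification of the line-search schemes in [19], and the code is ours, not from the cited works.

```python
import numpy as np

def ls_nmf_projected_gradient(X, M, n_iter=2000, step=1e-3, seed=0):
    """Least squares ML NMF by projected gradient descent.

    The gradient of (4) with respect to H is D^T(DH - X)/sigma_N^2 (equation (5));
    sigma_N^2 only rescales the step, so it is absorbed into `step`. After each gradient
    step the factors are projected back onto the nonnegative orthant.
    """
    rng = np.random.default_rng(seed)
    K, L = X.shape
    D = rng.random((K, M))
    H = rng.random((M, L))
    for _ in range(n_iter):
        H = np.maximum(0.0, H - step * D.T @ (D @ H - X))
        D = np.maximum(0.0, D - step * (D @ H - X) @ H.T)
    return D, H
```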

2.2. Maximum a Posteriori NMF

In this paper, we propose a method to build prior knowledge into the solution of the NMF problem. We choose a prior distribution pD,H(D, H) over the factors in the model that captures our prior beliefs and uncertainties about the solution we seek. We then compute the maximum a posteriori (MAP) estimate of the factors. Using Bayes' rule, the posterior is given by

$$p_{D,H|X}(D, H \mid X) = \frac{p_{X|D,H}(X \mid D, H)\, p_{D,H}(D, H)}{p_X(X)}. \qquad (6)$$

Since the denominator is constant, the negative log posterior is the sum of a likelihood term that penalizes model misfit and a prior term that penalizes solutions that are unlikely under the prior:

$$\mathcal{L}_{D,H|X}(D, H) \propto \mathcal{L}_{X|D,H}(D, H) + \mathcal{L}_{D,H}(D, H). \qquad (7)$$

The MAP estimate of D and H is

$$\{D_{\mathrm{MAP}}, H_{\mathrm{MAP}}\} = \arg\min_{D,H\ge 0}\, \mathcal{L}_{D,H|X}(D, H), \qquad (8)$$

and it can again be computed using any appropriate optimization algorithm.

Example 2 (nonnegative sparse coding). An example of a MAP NMF is nonnegative sparse coding (NNSC) [9, 22], where the prior on H is i.i.d. exponential and the prior on D is flat with each column constrained to have unit norm:

$$p^{\mathrm{NNSC}}_{D,H}(D, H) = \prod_{i,j} \lambda \exp\big(-\lambda H_{i,j}\big), \qquad \|D_k\| = 1 \ \ \forall k, \qquad (9)$$

where ‖Dk‖ is the Euclidean norm of the kth column of D. This corresponds to the following penalty term in the cost function:

$$\mathcal{L}^{\mathrm{NNSC}}_{D,H}(D, H) \propto \lambda \sum_{i,j} H_{i,j}. \qquad (10)$$

The gradient of the negative log prior with respect to H is then

$$\nabla_H \mathcal{L}^{\mathrm{NNSC}}_{D,H} = \lambda, \qquad (11)$$

and the gradient with respect to D is zero, with the further normalization constraint given in (9).

2.3. Gaussian Process Priors

In the following, we derive the MAP estimate under the assumption that the nonnegative matrices D and H are independently determined by a Gaussian process [18] connected by a link function. The Gaussian process framework provides a principled and practical approach to the specification of the prior probability distribution of the factors in the NMF model. The prior is specified in terms of two functions:


Figure 1: Toy example data matrix (upper left), underlying noise-free nonnegative data (upper right), and estimates using the four methods described in the text (LS-NMF, CNMF, GPP-NMF with incorrect prior, GPP-NMF with correct prior). The data has a fairly large amount of noise, and the underlying nonnegative factors are smooth in both directions. The LS-NMF and CNMF decompositions are nonsmooth since these methods do not model correlations in the factors. The GPP-NMF, which uses a smooth prior, finds a smooth solution. When using the correct prior, the solution is very close to the true underlying data.

(i) a covariance function that describes correlations in the factors, and (ii) a link function that transforms the Gaussian process prior into a desired distribution over the nonnegative reals.

We assume that D and H are independent, so that we may write

$$\mathcal{L}_{D,H}(D, H) = \mathcal{L}_D(D) + \mathcal{L}_H(H). \qquad (12)$$

In the following, we consider only the prior for H, since the treatment of D is equivalent due to the symmetry of the NMF problem. We assume that there is an underlying variable vector, h ∈ RLM, which is zero-mean multivariate Gaussian with covariance matrix Σh:

$$p_h(h) = (2\pi)^{-LM/2}\,|\Sigma_h|^{-1/2} \exp\!\Big(-\tfrac{1}{2}\, h^\top \Sigma_h^{-1} h\Big), \qquad (13)$$

and linked to H via a link function $f_h: \mathbb{R}_+ \to \mathbb{R}$ as

$$h = f_h\big(\mathrm{vec}(H)\big), \qquad (14)$$

which operates element-wise on its input. The vec(·) function in the expression stacks its matrix operand column by column. The link function should be strictly increasing, which ensures that the inverse exists. Later, we will further assume that the derivatives of $f_h$ and $f_h^{-1}$ exist. Under these assumptions, the prior over H is given by (using the change of variables theorem)

$$p_H(H) = p_h\big(f_h(\mathrm{vec}(H))\big)\,\big|J\big(f_h(\mathrm{vec}(H))\big)\big| \propto \exp\!\Big(-\tfrac{1}{2}\, f_h(\mathrm{vec}(H))^\top \Sigma_h^{-1} f_h(\mathrm{vec}(H))\Big) \prod_i \big|f_h'(\mathrm{vec}(H))\big|_i, \qquad (15)$$

where J(·) denotes the Jacobian determinant and $f_h'$ is the derivative of the link function. The negative log prior is then

$$\mathcal{L}_H(H) \propto \tfrac{1}{2}\, f_h(\mathrm{vec}(H))^\top \Sigma_h^{-1} f_h(\mathrm{vec}(H)) - \sum_i \log\big|f_h'(\mathrm{vec}(H))\big|_i. \qquad (16)$$

This expression can be combined with an appropriate likelihood function, such as the least squares likelihood in (4), and can be optimized to yield the MAP solution; however, in our experiments, we found that a simpler and more robust algorithm can be obtained by making a change of variable, as explained next.

2.4. Change of Optimization Variable

Instead of optimizing over the nonnegative factors D and H, we introduce the variables δ and η, which are related to D and H by


Figure 2: Underlying nonnegative factors in the toy example: (a) columns of D, (b) rows of H. The factors found by the LS-NMF and the CNMF algorithms are noisy, whereas the factors found by the GPP-NMF method are smooth. When using the correct prior, the factors found are very similar to the true factors.

$$D = g_d(\delta) = \mathrm{vec}^{-1}\big(f_d^{-1}(C_d^\top \delta)\big), \qquad H = g_h(\eta) = \mathrm{vec}^{-1}\big(f_h^{-1}(C_h^\top \eta)\big), \qquad (17)$$

where the vec⁻¹(·) function maps its vector input into a matrix of appropriate size. The matrices Cd and Ch are the matrix square roots (Cholesky decompositions) of the covariance matrices Σd and Σh, such that δ and η are standard i.i.d. Gaussian.

This change of variable serves two purposes. First of all, we found that optimizing over the transformed variables was faster, more robust, and less prone to getting stuck in local minima. Second, the transformed variables are not constrained to be nonnegative, which allows us to use existing unconstrained optimization methods to compute their MAP estimate.

The prior distribution of the transformed variable η is

$$p_\eta(\eta) = p_H\big(g_h(\eta)\big)\,\big|J\big(g_h(\eta)\big)\big| = \frac{1}{(2\pi)^{LM/2}}\exp\!\Big(-\tfrac{1}{2}\,\eta^\top\eta\Big), \qquad (18)$$

and the negative log prior is

$$\mathcal{L}_\eta(\eta) \propto \tfrac{1}{2}\,\eta^\top\eta. \qquad (19)$$

To compute the MAP estimate of the transformed variables, we must combine this expression for the prior (and a similar expression for the prior of δ) with a likelihood function in which the same change of variable is made:

$$\mathcal{L}_{\delta,\eta|X}(\delta,\eta) = \mathcal{L}_{X|D,H}\big(g_d(\delta), g_h(\eta)\big) + \tfrac{1}{2}\,\delta^\top\delta + \tfrac{1}{2}\,\eta^\top\eta. \qquad (20)$$


Then, the MAP solution can be found by optimizing over δ and η as

$$\{\delta_{\mathrm{MAP}}, \eta_{\mathrm{MAP}}\} = \arg\min_{\delta,\eta}\, \mathcal{L}_{\delta,\eta|X}(\delta,\eta), \qquad (21)$$

and, finally, estimates of D and H can be computed using (17).

Example 3 (least squares nonnegative matrix factorization with Gaussian process priors (GPP-NMF)). If we use the least squares likelihood in (4), the posterior distribution in (20) is given by

$$\mathcal{L}^{\mathrm{LS\text{-}GPP}}_{\delta,\eta|X}(\delta,\eta) = \frac{1}{2}\Big(\sigma_N^{-2}\,\big\|X - g_d(\delta)\,g_h(\eta)\big\|_F^2 + \delta^\top\delta + \eta^\top\eta\Big). \qquad (22)$$

The MAP estimate of δ and η is found by minimizing this expression, for which the derivative is useful:

$$\nabla_\eta \mathcal{L}^{\mathrm{LS\text{-}GPP}}_{\delta,\eta|X}(\delta,\eta) = \sigma_N^{-2}\Big(\mathrm{vec}\big(g_d(\delta)^\top\big(g_d(\delta)\,g_h(\eta) - X\big)\big) \odot \big(f_h^{-1}\big)'\big(C_h^\top\eta\big)\Big)^\top C_h + \eta, \qquad (23)$$

where ⊙ denotes the Hadamard (element-wise) product. The derivative with respect to δ is similar due to the symmetry of the NMF problem.
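As an illustration of how (22) and (23) can be evaluated, the sketch below computes the negative log posterior and its gradient with respect to η for a fixed D = g_d(δ); the inverse link f_h^{-1} and its derivative are passed in as functions, for example one of the links in Section 2.5. This is our own sketch, not the authors' Matlab implementation [23].

```python
import numpy as np

def gpp_neglogpost_and_grad_eta(X, D, eta, Ch, finv, dfinv, sigma2_N):
    """Negative log posterior (22) and its gradient w.r.t. eta (23), with D held fixed.

    finv, dfinv : inverse link f_h^{-1} and its derivative, applied element-wise.
    Ch          : matrix square root of Sigma_h with Ch^T Ch = Sigma_h.
    """
    M, L = D.shape[1], X.shape[1]
    h = Ch.T @ eta                               # underlying Gaussian variable
    H = finv(h).reshape((M, L), order='F')       # vec^{-1} with column-wise stacking, eq. (17)
    R = D @ H - X
    obj = 0.5 * (np.sum(R**2) / sigma2_N + eta @ eta)   # the delta^T delta term is omitted
    gH = (D.T @ R) / sigma2_N                    # likelihood gradient w.r.t. H, as in (5)
    grad = Ch @ (gH.flatten(order='F') * dfinv(h)) + eta
    return obj, grad
```

The gradient with respect to δ has the same form with the roles of the two factors exchanged, and the pair (obj, grad) can be handed to any unconstrained optimizer.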

2.5. Link Function

Any strictly increasing link function that maps the nonnegative reals to the real line can be used in the proposed framework; however, in order to have a better probabilistic interpretation of the prior distribution, we propose a simple principle for choosing the link function. We choose the link function such that $f_h^{-1}$ maps the marginal distribution of the elements of the underlying Gaussian process vector h into a specifically chosen marginal distribution of the elements of H. Such an inverse function can be found as $f_h^{-1}(h_i) = P_H^{-1}(P_h(h_i))$, where P(·) denotes the marginal cumulative distribution functions (CDFs).

Since the marginals of a Gaussian process are Gaussian, $P_h(h_i)$ is the Gaussian CDF, and, using (13), the inverse link function is given by

$$f_h^{-1}(h_i) = P_H^{-1}\!\left(\frac{1}{2} + \frac{1}{2}\,\Phi\!\left(\frac{h_i}{\sqrt{2}\,\sigma_i}\right)\right), \qquad (24)$$

where $\sigma_i^2$ is the ith diagonal element of $\Sigma_h$ and Φ(·) is the error function.

Example 4 (exponential-to-Gaussian link function). If we choose to have exponential marginals in H, as in NNSC described in Example 2, we select $P_H$ as the exponential CDF. The inverse link function is then

$$f_h^{-1}(h_i) = -\frac{1}{\lambda}\log\!\left(\frac{1}{2} - \frac{1}{2}\,\Phi\!\left(\frac{h_i}{\sqrt{2}\,\sigma_i}\right)\right), \qquad (25)$$

where λ is an inverse scale parameter. The derivative of the inverse link function, which is needed for the parameter estimation, is given by

$$\big(f_h^{-1}\big)'(h_i) = \frac{1}{\sqrt{2\pi}\,\sigma_i\,\lambda}\,\exp\!\left(\lambda f_h^{-1}(h_i) - \frac{h_i^2}{2\sigma_i^2}\right). \qquad (26)$$

Example 5 (rectified-Gaussian-to-Gaussian link function). Another interesting nonnegative distribution is the rectified Gaussian, given by

$$p(x) = \begin{cases} \dfrac{2}{\sqrt{2\pi}\,s}\,\exp\!\left(-\dfrac{x^2}{2s^2}\right), & x \ge 0, \\ 0, & x < 0. \end{cases} \qquad (27)$$

Using this pdf in (24), the inverse link function is

$$f_h^{-1}(h_i) = \sqrt{2}\,s\,\Phi^{-1}\!\left(\frac{1}{2} + \frac{1}{2}\,\Phi\!\left(\frac{h_i}{\sqrt{2}\,\sigma_i}\right)\right), \qquad (28)$$

and the derivative of the inverse link function is

$$\big(f_h^{-1}\big)'(h_i) = \frac{s}{2\sigma_i}\,\exp\!\left(\frac{f_h^{-1}(h_i)^2}{2s^2} - \frac{h_i^2}{2\sigma_i^2}\right). \qquad (29)$$

2.6. Summary of the GPP-NMF Method

The GPP-NMF method can be summarized in the following steps.

(i) Choose a suitable negative log-likelihood function $\mathcal{L}_{X|D,H}(D, H)$ based on knowledge of the distribution of the data or the residual.

(ii) For each of the nonnegative factors D and H, choose suitable link and covariance functions according to your prior beliefs. If necessary, draw samples from the prior distribution to examine its properties (a sampling sketch is given below).

(iii) Compute the MAP estimate of δ and η by (21), using any suitable unconstrained optimization algorithm.

(iv) Compute D and H using (17).

Our Matlab implementation of the GPP-NMF method is available online [23].
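Prior predictive sampling (step (ii) above, and the procedure used in Section 3.2) can be sketched as follows. The Kronecker construction of Σh, which correlates the samples within each row of H through the RBF covariance (30) and leaves different rows uncorrelated, is our illustrative choice; the jitter term is a standard numerical safeguard.

```python
import numpy as np

def rbf_covariance(n, beta2):
    """Gaussian RBF covariance (30) evaluated on the sample indices 0, ..., n-1."""
    idx = np.arange(n)
    return np.exp(-(idx[:, None] - idx[None, :])**2 / beta2)

def sample_H_from_prior(M, L, beta2, finv, seed=0, jitter=1e-8):
    """Draw eta ~ N(0, I), form h = Ch^T eta with Ch^T Ch = Sigma_h, and map h through
    the inverse link as in (17) to obtain a nonnegative sample of H."""
    rng = np.random.default_rng(seed)
    Sigma = np.kron(rbf_covariance(L, beta2), np.eye(M))        # covariance of vec(H)
    Ch = np.linalg.cholesky(Sigma + jitter * np.eye(M * L)).T   # upper factor: Ch^T Ch = Sigma
    eta = rng.standard_normal(M * L)
    h = Ch.T @ eta
    return finv(h).reshape((M, L), order='F')
```

For example, with the link code above, finv = lambda h: exp_link_inv(h, 1.0, 1.0)[0] gives exponential marginals (the RBF covariance has unit diagonal), and plotting a few draws for different β² values is a quick way to tune the prior, as was done for the brain imaging data.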

3. Experimental Results

We will demonstrate the proposed method on two examples: first a toy example, and second an example taken from the chemical shift brain imaging literature.

In our experiments, we use the least squares GPP-NMF described in Example 3 and the link functions described in Examples 4 and 5. The specific optimization method used to compute the GPP-NMF MAP estimate is not the topic of this paper, and any unconstrained optimization algorithm could in principle be used. In our experiments, we used a simple gradient descent with line search to perform a total of 1000 alternating updates of δ and η, after which the algorithm had converged. For details of the implementation, see the accompanying Matlab code [23].


3.1. Toy Example

We generated a 100 × 200 data matrix, Y, by taking a random sample from the GPP-NMF model with two factors. We chose the generating covariance function for both δ and η as a Gaussian radial basis function (RBF),

$$\phi(i, j) = \exp\!\left(-\frac{(i - j)^2}{\beta^2}\right), \qquad (30)$$

where i and j are two sample indices, and the length-scale parameter, which determines the smoothness of the factors, was β² = 100. We set the covariance between the two factors to zero, such that the factors were uncorrelated. For the matrix D, we used the rectified-Gaussian-to-Gaussian link function with s = 1, and for H, we used the exponential-to-Gaussian link function with λ = 1. Finally, we added independent Gaussian noise with variance σ_N² = 25, which resulted in a signal-to-noise ratio of approximately −7 dB. The generated data matrix is shown in Figure 1.

We then decomposed the generated data matrix using the following four different methods:

(i) LS-NMF: standard least squares NMF [8]. This algorithm does not allow negative data points, so these were set to zero in the experiment.

(ii) CNMF: constrained NMF [6, 7], which is a least squares NMF algorithm that allows negative observations.

(iii) GPP-NMF, correct prior: the proposed method with correct link functions, covariance matrix, and parameter values.

(iv) GPP-NMF, incorrect prior: to illustrate the sensitivity of the method to prior assumptions, we evaluated the proposed method with incorrect prior assumptions: we switched the link functions, such that for D we used the exponential-to-Gaussian link and for H the rectified-Gaussian-to-Gaussian link. We used an RBF covariance function with β² = 10 and β² = 1000 for D and H, respectively.

The results of the decompositions are shown as reconstructed data matrices in Figure 1. All four methods find solutions that visually appear to fit the underlying data. Both LS-NMF and CNMF find nonsmooth solutions, whereas the two GPP-NMF results are smooth, in accordance with the priors. In the GPP-NMF with the incorrect prior, the dark areas (high pixel intensities) appear too wide in the first axis direction and too narrow in the second axis direction, due to the incorrect setting of the covariance function. The GPP-NMF with the correct prior is visually almost equal to the true underlying data.

Plots of the estimated factors are shown in Figure 2. The factors estimated by the LS-NMF and the CNMF methods appear noisy and are nonsmooth, whereas the factors estimated by the GPP-NMF are smooth. The factors estimated by the LS-NMF method have a positive bias because of the truncation of negative data. The GPP-NMF with the incorrect prior has too many local extrema in the D factor and too few in the H factor, due to the incorrect covariance functions.

Figure 3: Toy example: root mean squared error (RMSE) with respect to the noisy data, the underlying noise-free data, and the true underlying nonnegative factors, for NMF, CNMF, GPP-NMF with incorrect prior, and GPP-NMF with correct prior. The CNMF solution fits the noisy data slightly better, but the GPP-NMF solution fits the underlying data much better.

There are only minor differences between the result of the GPP-NMF with the correct prior and the underlying factors.

Measures of the root mean squared error (RMSE) of the four decompositions are given in Figure 3. All four methods fit the noisy data almost equally well. (Note that, due to the additive noise with variance 25, a perfect fit to the underlying factors would result in an RMSE of 5 with respect to the noisy data.) The LS-NMF fits the data worst due to the truncation of negative data points, and the CNMF fits the data best, due to overfitting. With respect to the noise-free data and the underlying factors, the RMSE is worst for the LS-NMF and best for the GPP-NMF with the correct prior. The GPP-NMF with the incorrect prior is better than both LS-NMF and CNMF in this case. This shows that in this situation it is better to use a prior which is not perfectly correct than to use no prior, as in the LS-NMF and CNMF methods (which corresponds to a flat prior over the nonnegative reals and no correlations).

3.2. Chemical Shift Brain Imaging Example

Next, we demonstrate the GPP-NMF method on 1H decoupled 31P chemical shift imaging data of the human brain. We use the data set from Ochs et al. [24], which has also been analyzed by Sajda et al. [6, 7]. The data set, which is shown in Figure 4, consists of 512 spectra measured on an 8 × 8 × 8 grid in the brain.

Ochs et al. [24] use PCA to determine that the data set is adequately described by two sources (which correspond to brain and muscle tissue). They propose a bilinear Bayesian approach, in which they use a smooth prior over the constituent spectra and force to zero the amplitude of the spectral shape corresponding to muscle tissue at 12 positions deep inside the head. Their approach produces physically plausible results, but it is computationally very expensive and takes several hours to compute.


Figure 4: Brain imaging data matrix (top) along with the estimated decomposition and residual for the CNMF (middle) and GPP-NMF (bottom) methods; the columns show the data, the estimate, and the residual against chemical shift (ppm). In this view, the results of the two decompositions are very similar; the data appears to be modeled equally well, and the residuals are similar in magnitude.

Figure 5: Brain imaging data: random draw from the prior distribution with the parameters set as described in the text, shown against chemical shift (ppm). The prior distribution of the constituent spectra (left) is exponential and smooth, and the spatial distribution (right) in the brain is exponential, smooth, and has a left-to-right symmetry.


Sajda et al. [6, 7] propose an NMF approach that is reported also to produce physically plausible results. Their method is several orders of magnitude faster, taking less than a second to compute. The disadvantage of the method of Sajda et al. compared to the Bayesian approach of Ochs et al. is that it provides no mechanism for using prior knowledge to improve the solution.

The GPP-NMF approach we propose in this paper bridges the gap between the two previous approaches, in the sense that it is a relatively fast NMF approach in which priors over the factors can be specified. These priors are specified by the choice of the link and covariance functions.


Figure 6: CNMF decomposition result, shown against chemical shift (ppm). The recovered spectra are physically plausible, and the spatial distribution in the brain for the muscle (top) and brain (bottom) tissue is somewhat separated. Muscle tissue is primarily found near the edge of the skull, whereas brain tissue is primarily found at the inside of the head.

Figure 7: GPP-NMF decomposition result, shown against chemical shift (ppm). The recovered spectra are very similar to the spectra found by the CNMF method, but they are slightly smoother. The spatial distribution in the brain is highly separated between brain and muscle tissue, and it is more symmetric than the CNMF solution.

We used prior predictive sampling to find reasonable settings of the function parameters: we drew random samples from the prior distribution and examined properties of the factors and reconstructed data. We then manually adjusted the parameters of the prior to match our prior beliefs. An example of a random draw from the prior distribution is shown in Figure 5, with the parameters set as described below.

We assumed that the factors are uncorrelated, so the covariance between factors is zero. We used a Gaussian RBF covariance function for the constituent spectra, with a length scale of β = 0.3 parts per million (ppm), and we used the exponential-to-Gaussian link function with λd = 1. This gave a prior for the spectra that is sparse with narrow smooth peaks. For the amplitude at the 512 voxels in the head, we used a Gaussian RBF covariance function on the 3D voxel indices, with length scale β = 2. Furthermore, we centered the left-to-right coordinate axis in the middle of the brain and computed the RBF kernel on the absolute value of this coordinate, so that a left-to-right symmetry was introduced in the prior distribution. Again, we used the exponential-to-Gaussian link function, and we chose λh = 2 × 10^−4 to match the overall magnitude of the data. The noise variance was set to σN² = 10^8, which corresponds to the noise level in the data set. This gave a prior for the amplitude distribution that is sparse, smooth, and symmetric.
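
To make the prior predictive sampling step concrete, the sketch below draws one nonnegative factor from a Gaussian process prior with an RBF covariance and then maps it through an exponential link. It is only an illustration of the construction described above, assuming the exponential-to-Gaussian link is realised as the exponential inverse CDF composed with the standard Gaussian CDF; the function names and parameter values are illustrative and not taken from the authors' implementation.

```python
import numpy as np
from scipy.stats import norm

def rbf_covariance(x, length_scale):
    """Squared-exponential (Gaussian RBF) covariance matrix on 1D inputs."""
    d = x[:, None] - x[None, :]
    return np.exp(-0.5 * (d / length_scale) ** 2)

def draw_nonnegative_factor(x, length_scale, lam, jitter=1e-8, rng=None):
    """Draw one nonnegative factor: sample g ~ GP(0, K), then push each value
    through the exponential inverse CDF of its Gaussian CDF (assumed link)."""
    rng = np.random.default_rng() if rng is None else rng
    K = rbf_covariance(x, length_scale) + jitter * np.eye(len(x))
    g = rng.multivariate_normal(np.zeros(len(x)), K)
    u = norm.cdf(g)                    # Gaussian CDF -> values in (0, 1)
    return -np.log(1.0 - u) / lam      # exponential inverse CDF -> nonnegative, smooth, sparse

# Example: a smooth, sparse "spectrum" prior draw on a ppm axis
ppm = np.linspace(5.0, -20.0, 256)
spectrum = draw_nonnegative_factor(ppm, length_scale=0.3, lam=1.0)
```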

We then decomposed the data set using the proposed GPP-NMF algorithm and, for comparison, reproduced the results of Sajda et al. [7] using their CNMF method. The results of the experiments are shown in Figure 4. An example of a random draw from the prior distribution is shown in Figure 5. The results of the CNMF are shown in Figure 6, and the results of the GPP-NMF are shown in Figure 7. The figures show the constituent spectra and the fifth axial slice of the spatial distribution of the spectra. The 8 × 8 spatial distributions are smoothed in the illustration, similar to the way the results are visualized in the literature.

The results show that both methods give physically plausible results. The main difference is that the spatial distribution of the spectra corresponding to muscle and brain tissue is much more separated in the GPP-NMF result, which is due to the exponential, smooth, and symmetric prior distribution. By including prior information, we obtain a solution where the factor corresponding to muscle tissue is clearly located on the edge of the skull.

4. Conclusions

We have introduced a general method for exploiting prior knowledge in nonnegative matrix factorization, based on Gaussian process priors, linked to the nonnegative factors by a link function. The method can be combined with any existing NMF cost function that has a probabilistic interpretation, and any existing unconstrained optimization algorithm can be used to compute the maximum a posteriori estimate.

Experiments on toy data show that, with a suitable selection of the prior distribution of the nonnegative factors, the GPP-NMF method gives much better results in terms of estimating the true underlying factors, compared to both traditional NMF and CNMF.

Experiments on chemical shift brain imaging data show that the GPP-NMF method can be successfully used to include prior knowledge of the spectral and spatial distribution, resulting in better spatial separation between spectra corresponding to muscle and brain tissue.

Acknowledgments

We would like to thank Paul Sajda and Truman Brown for making the brain imaging data available to us. This research was supported by the Intelligent Sound project, Danish Technical Research Council Grant no. 26-02-0092, and partly supported also by the European Commission through the sixth framework IST Network of Excellence: Pattern Analysis, Statistical Modelling and Computational Learning (PASCAL), Contract no. 506778.

References

[1] P. Paatero and U. Tapper, "Positive matrix factorization: a nonnegative factor model with optimal utilization of error estimates of data values," Environmetrics, vol. 5, no. 2, pp. 111-126, 1994.

[2] D. D. Lee and H. S. Seung, "Learning the parts of objects by nonnegative matrix factorization," Nature, vol. 401, no. 6755, pp. 788-791, 1999.

[3] L. Weixiang, Z. Nanning, and Y. Qubo, "Nonnegative matrix factorization and its applications in pattern recognition," Chinese Science Bulletin, vol. 51, no. 1, pp. 7-18, 2006.

[4] C. Ding, X. He, and H. D. Simon, "On the equivalence of nonnegative matrix factorization and spectral clustering," in Proceedings of the 5th SIAM International Conference on Data Mining, pp. 606-610, Newport Beach, Calif, USA, April 2005.

[5] S. Tsuge, M. Shishibori, S. Kuroiwa, and K. Kita, "Dimensionality reduction using nonnegative matrix factorization for information retrieval," in Proceedings of IEEE International Conference on Systems, Man and Cybernetics, vol. 2, pp. 960-965, Tucson, Ariz, USA, October 2001.

[6] P. Sajda, S. Du, and L. C. Parra, "Recovery of constituent spectra using nonnegative matrix factorization," in Wavelets: Applications in Signal and Image Processing, vol. 5207 of Proceedings of SPIE, pp. 321-331, San Diego, Calif, USA, August 2003.

[7] P. Sajda, S. Du, T. R. Brown, R. Stoyanova, D. C. Shungu, X. Mao, and L. C. Parra, "Nonnegative matrix factorization for rapid recovery of constituent spectra in magnetic resonance chemical shift imaging of the brain," IEEE Transactions on Medical Imaging, vol. 23, no. 12, pp. 1453-1465, 2004.

[8] D. D. Lee and H. S. Seung, "Algorithms for nonnegative matrix factorization," in Proceedings of the 13th Annual Conference on Neural Information Processing Systems (NIPS '00), pp. 556-562, Denver, Colo, USA, November-December 2000.

[9] P. Hoyer, "Nonnegative sparse coding," in Proceedings of the 12th IEEE Workshop on Neural Networks for Signal Processing (NNSP '02), pp. 557-565, Martigny, Switzerland, September 2002.

[10] S. Z. Li, X. W. Hou, H. J. Zhang, and Q. S. Cheng, "Learning spatially localized, parts-based representation," in Proceedings of IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR '01), vol. 1, pp. 207-212, Kauai, Hawaii, USA, December 2001.

[11] Z. Chen and A. Cichocki, "Nonnegative matrix factorization with temporal smoothness and/or spatial decorrelation constraints," Technical Report, Laboratory for Advanced Brain Signal Processing, RIKEN, Tokyo, Japan, 2005.

[12] T. Virtanen, "Sound source separation using sparse coding with temporal continuity objective," in Proceedings of the International Computer Music Conference (ICMC '03), pp. 231-234, Singapore, September-October 2003.

[13] P. Smaragdis, "Nonnegative matrix factor deconvolution; extraction of multiple sound sources from monophonic inputs," in Proceedings of the 5th International Conference on Independent Component Analysis and Blind Signal Separation (ICA '04), vol. 3195 of Lecture Notes in Computer Science, pp. 494-499, Springer, Granada, Spain, September 2004.

[14] M. N. Schmidt and M. Mørup, "Nonnegative matrix factor 2-D deconvolution for blind single channel source separation," in Proceedings of the 6th International Conference on Independent Component Analysis and Blind Signal Separation (ICA '06), vol. 3889 of Lecture Notes in Computer Science, pp. 700-707, Springer, Charleston, SC, USA, March 2006.

[15] O. Winther and K. B. Petersen, "Bayesian independent component analysis: variational methods and nonnegative decompositions," Digital Signal Processing, vol. 17, no. 5, pp. 858-872, 2007.

[16] T. Hofmann, "Probabilistic latent semantic indexing," in Proceedings of the 22nd Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR '99), pp. 50-57, Berkeley, Calif, USA, August 1999.

[17] A. Cichocki, R. Zdunek, and S.-I. Amari, "Csiszar's divergences for nonnegative matrix factorization: family of new algorithms," in Proceedings of the 6th International Conference on Independent Component Analysis and Blind Signal Separation (ICA '06), vol. 3889 of Lecture Notes in Computer Science, pp. 32-39, Springer, Charleston, SC, USA, March 2006.

[18] C. E. Rasmussen and C. K. I. Williams, Gaussian Processes for Machine Learning, MIT Press, Cambridge, Mass, USA, 2006.

[19] C.-J. Lin, "Projected gradient methods for nonnegative matrix factorization," Neural Computation, vol. 19, no. 10, pp. 2756-2779, 2007.

[20] M. W. Berry, M. Browne, A. N. Langville, V. P. Pauca, and R. J. Plemmons, "Algorithms and applications for approximate nonnegative matrix factorization," Computational Statistics & Data Analysis, vol. 52, no. 1, pp. 155-173, 2007.

[21] D. Kim, S. Sra, and I. S. Dhillon, "Fast Newton-type methods for the least squares nonnegative matrix approximation problem," in Proceedings of the 7th SIAM International Conference on Data Mining, pp. 343-354, Minneapolis, Minn, USA, April 2007.

[22] J. Eggert and E. Korner, "Sparse coding and NMF," in Proceedings of IEEE International Joint Conference on Neural Networks (IJCNN '04), vol. 4, pp. 2529-2533, Budapest, Hungary, July 2004.

[23] M. N. Schmidt and H. Laurberg, "Nonnegative matrix factorization with Gaussian process priors," Computational Intelligence and Neuroscience. In press.

[24] M. F. Ochs, R. S. Stoyanova, F. Arias-Mendoza, and T. R. Brown, "A new method for spectral decomposition using a bilinear Bayesian approach," Journal of Magnetic Resonance, vol. 137, no. 1, pp. 161-176, 1999.

Hindawi Publishing Corporation, Computational Intelligence and Neuroscience, Volume 2008, Article ID 872425, 15 pages, doi:10.1155/2008/872425

Research Article

Extended Nonnegative Tensor Factorisation Models for Musical Sound Source Separation

Derry FitzGerald,1 Matt Cranitch,1 and Eugene Coyle2

1 Department of Electronic Engineering, Cork Institute of Technology, Cork, Ireland
2 School of Electrical Engineering Systems, Dublin Institute of Technology, Kevin Street, Dublin, Ireland

Correspondence should be addressed to Derry FitzGerald, [email protected]

Received 18 December 2007; Revised 3 March 2008; Accepted 17 April 2008

Recommended by Morten Mørup

Recently, shift-invariant tensor factorisation algorithms have been proposed for the purposes of sound source separation of pitched musical instruments. However, in practice, existing algorithms require the use of log-frequency spectrograms to allow shift invariance in frequency, which causes problems when attempting to resynthesise the separated sources. Further, it is difficult to impose harmonicity constraints on the recovered basis functions. This paper proposes a new additive synthesis-based approach which allows the use of linear-frequency spectrograms as well as imposing strict harmonic constraints, resulting in an improved model. Further, these additional constraints allow the addition of a source filter model to the factorisation framework, and an extended model which is capable of separating mixtures of pitched and percussive instruments simultaneously.

Copyright © 2008 Derry FitzGerald et al. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

1. Introduction

The use of factorisation-based approaches for the separation of musical sound sources dates back to the early 1980s, when Stautner used principal component analysis (PCA) to separate different tabla strokes [1]. However, it was not until the development of independent component analysis (ICA) [2] and techniques such as sparse coding [3, 4] and nonnegative matrix factorisation (NMF) [5, 6] that factorisation-based approaches received much attention for the analysis and separation of musical audio signals [7-11].

Factorisation-based approaches were initially applied to single channel separation of musical sources [7-10], where time-frequency analysis was performed on the input signal, yielding a spectrogram X of size n × m. This spectrogram was then factorised to yield a reduced rank approximation

X ≈ X̂ = AS,  (1)

where A is of size n × r and S is of size r × m, with r less than n and m. In this case, the columns of A contain frequency basis functions, while the corresponding rows of S contain amplitude basis functions which describe when the frequency basis functions are active. Typically this is done on a magnitude or power spectrogram, and this approach makes the assumption that the spectrograms generated by the basis function pairs sum together to generate the mixture spectrogram. This does not take into account the effects of phase when the spectrograms are added together; in the case of magnitude spectrograms this assumption is only true if the sources do not overlap in time and frequency, while it holds true on average for power spectrograms. Where the various techniques differ is in how this factorisation is achieved. Casey and Westner [7] used PCA to achieve dimensional reduction and then performed ICA on the retained principal components to achieve independent basis functions, while more recent work has focused on the use of nonnegativity constraints in conjunction with a suitable cost function [8, 9].

A commonly used cost function is the generalised Kullback-Leibler divergence proposed by Lee and Seung [5]:

D(X‖X̂) = Σij ( Xij log(Xij / X̂ij) − Xij + X̂ij ),  (2)

which is equivalent to assuming a Poisson noise model for the data [12]. This cost function has been widely used due to its ease of implementation, lack of parameters, and the fact that it has been found to give reasonable results in many cases [13, 14]. A sparseness constraint can also be added to this cost function, and multiplicative update equations which ensure nonnegativity can be derived for these cost functions [15]. Other cost functions have been developed for factorisation of audio spectrograms, such as that of Abdallah and Plumbley, which assumes multiplicative gamma-distributed noise in power spectrograms [16]. A similar cost function recently proposed by Parry and Essa attempts to incorporate phase into the factorisation by using a probabilistic phase model [17, 18]. Families of parameterised cost functions have been proposed, such as the Beta divergence [19] and Csiszar's divergences [20]. The use of the Beta divergence for the separation of speech signals has been investigated by O'Grady [21], who also proposed a perceptually-based noise-to-mask ratio as a cost function.
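
For reference, a minimal NumPy sketch of the multiplicative updates that minimise the generalised Kullback-Leibler divergence in (2) is given below. It follows the standard Lee and Seung update rules; the function name, initialisation, and iteration count are illustrative choices rather than part of any of the cited algorithms.

```python
import numpy as np

def nmf_kl(X, r, n_iter=200, eps=1e-9, rng=None):
    """Multiplicative-update NMF minimising the generalised KL divergence
    D(X || AS); returns frequency basis A (n x r) and activations S (r x m)."""
    rng = np.random.default_rng() if rng is None else rng
    n, m = X.shape
    A = rng.random((n, r)) + eps
    S = rng.random((r, m)) + eps
    ones = np.ones((n, m))
    for _ in range(n_iter):
        Xh = A @ S + eps
        A *= ((X / Xh) @ S.T) / (ones @ S.T + eps)   # update frequency basis functions
        Xh = A @ S + eps
        S *= (A.T @ (X / Xh)) / (A.T @ ones + eps)   # update amplitude basis functions
    return A, S
```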

Regardless of the cost function used, the resultant decomposition is linear, and as a result each basis function pair typically corresponds to a single note or chord played by a given pitched instrument. Therefore, in order to achieve sound source separation, some method is required to group the basis functions by source or instrument. Different grouping methods have been proposed in [7, 8], but in practice it is difficult to obtain the correct clustering for reasons discussed in [22].

1.1. Tensor Notation

When dealing with tensor notation, we use the conventions described by Bader and Kolda in [23]. Tensors are denoted using calligraphic uppercase letters, such as A. Rather than using subscripts to indicate indexing of elements within a tensor or matrix, such as Xi,j, indexing of elements is instead notated by X(i, j). When dealing with contracted product multiplication of two tensors, if W is a tensor of size I1 × ··· × IN × J1 × ··· × JM and Y is a tensor of size I1 × ··· × IN × K1 × ··· × KP, then contracted product multiplication of the two tensors along the first N modes is given by

⟨WY⟩{1:N,1:N}(j1, ..., jM, k1, ..., kP) = Σ_{i1=1}^{I1} ··· Σ_{iN=1}^{IN} W(i1, ..., iN, j1, ..., jM) × Y(i1, ..., iN, k1, ..., kP),  (3)

where the modes to be multiplied are specified in the subscripts that are contained in the angle brackets.

Elementwise multiplication and division are represented by ⊗ and ⊘, respectively, and outer product multiplication is denoted by ◦. Further, for simplicity of notation, unless otherwise stated, we use the convention that :k denotes the tensor slice associated with the kth source, with the singleton dimension included in the size of the slice.
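
As an illustration of this notation, the contracted product in (3) amounts to summing over the shared leading modes of the two tensors, which NumPy's tensordot performs directly; the sketch below is a minimal example with assumed tensor sizes and an illustrative function name.

```python
import numpy as np

def contracted_product(W, Y, n_modes):
    """Contract the first n_modes of W with the first n_modes of Y, as in (3):
    the result has the remaining modes of W followed by the remaining modes of Y."""
    axes = (tuple(range(n_modes)), tuple(range(n_modes)))
    return np.tensordot(W, Y, axes=axes)

# Example: W is I1 x I2 x J1 and Y is I1 x I2 x K1 x K2 -> result is J1 x K1 x K2
W = np.random.rand(4, 5, 3)
Y = np.random.rand(4, 5, 6, 2)
Z = contracted_product(W, Y, 2)
print(Z.shape)  # (3, 6, 2)
```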

1.2. Tensor Factorisation

Recently, the above matrix factorisation techniques have been extended to tensor factorisation models to deal with stereo or multichannel signals by FitzGerald et al. [24] and Parry and Essa [25]. The signal model can be expressed as

X ≈ X̂ = Σ_{b=1}^{B} G:b ◦ A:b ◦ S:b,  (4)

where X is an r × n × m tensor containing the spectrograms of the r channels, G is an r × B matrix containing the gains of the B basis functions in each channel, A is a matrix of size n × B containing a set of frequency basis functions, and S is a matrix of size m × B containing the amplitude basis functions. In this case, :b is used to denote the bth column of a given matrix.

As a first approximation, many commercial stereo recordings can be considered to have been created by obtaining single-channel recordings of each instrument individually and then summing and distributing these recordings across the two channels, with the result that, for any given instrument, the only difference between the two channels lies in the gain of the instrument [26]. The tensor factorisation model provides a good approximation to this case. The extension to tensor factorisation also provides another source of information which can be leveraged to cluster the basis functions, namely that basis functions belonging to the same source should have similar gains. However, as the number of basis functions increases, it becomes more difficult to obtain good clustering using this information, as basis functions become shared between sources.

2. Shift-Invariant Factorisation Algorithms

The concept of incorporating shift invariance in factorisation algorithms for sound source separation was introduced in the convolutive factorisation algorithms proposed by Smaragdis [27] and Virtanen [28]. This was done in order to address a particular shortcoming of the standard factorisation techniques, namely that a single frequency basis function is unable to successfully capture sounds where the frequency content evolves with time, such as spoken utterances and drum sounds. To overcome this limitation, the amplitude basis functions were allowed to shift in time, with each shift capturing a different frequency basis function. When these frequency basis functions were combined, the result was a spectrogram of a given source that captured the temporal evolution of the frequency characteristics of the sound source.

Shift invariance in the frequency basis functions was later developed as a means of overcoming the problem of grouping the frequency basis functions to sources, particularly in the case where different notes played by the same instrument occurred over the course of a spectrogram [14, 29]. This shortcoming had been addressed by Vincent and Rodet using a nonlinear ISA approach [30], but this technique required pretraining of source priors before separation.

When incorporating shift invariance in the frequency basis functions, it is assumed that all notes played by a single pitched instrument consist of translated versions of a single frequency basis function. This single instrument basis function is then assumed to represent the typical frequency characteristics of that instrument. This is a simplification of the real situation, where in practice the timbre of a given instrument does change with pitch [31]. Despite this, the assumption does represent a valid approximation over a limited pitch range, and this assumption has been used in many commercial music samplers and synthesisers, where a prerecorded note of a given pitch is used to generate other notes close in pitch to the original note. The principal advantage of using shift invariance in the frequency basis functions is that, instead of having basis functions which must be grouped to their respective sources before separation can occur, as in standard NMF, the frequency shift invariant model allows individual instruments or sources to be modelled explicitly, with each source having an individual slice of the tensors to be estimated.

Until now, the incorporation of shift invariance in the frequency basis functions has required the use of a spectrogram with log-frequency resolution, such as the constant Q transform (CQT) [32]. Alternatively, a log-frequency transform can be approximated by weighted summation of linear-frequency spectrogram bins, such as obtained from a short-time Fourier transform. This can be expressed as

X = RY, (5)

where Y is a linear-frequency spectrogram with f frequency bins and t time frames, R is a frequency weighting matrix of size cf × f which maps the f linear-frequency bins to cf log-frequency bins, with cf < f, and X is a log-frequency spectrogram of size cf × t. It can be seen that R is a rectangular matrix and so no true inverse exists, making any mapping back from log-frequency resolution to linear-frequency resolution only an approximate mapping.

If the frequency resolution of the log-frequency transform is set so that the center frequencies of the bands are given by fx = f0 β^(x−1), where fx denotes the center frequency of the xth band, β = 2^(1/12), and f0 is a reference frequency, then the spacing of the bands will match that of the equal-tempered scale used in western music. A shift up or down by one bin will then correspond to a pitch change of one semitone.

In the context of this paper, translation of basis functions is carried out by means of translation tensors, though other formulations, such as the shift operator method proposed by Smaragdis [27], can be used. To shift an n × 1 vector, an n × n translation matrix is required. This can be generated by permuting the columns of the identity matrix. For example, in the case of shifting a basis function up by one, the translation matrix can be obtained from I(:, [n, 1 : n − 1]), where the identity matrix is denoted by I and the ordering of the columns is contained in the square brackets, where [n, 1 : n − 1] indicates that n is the first element in the permutation, followed by the entries 1 : n − 1. For Z allowable translations, these translation matrices are then grouped into a translation tensor of size n × Z × n.
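
A minimal sketch of building such a translation tensor by permuting identity columns is shown below. The permutation follows the I(:, [n, 1 : n − 1]) construction described above; note that the permuted identity gives a circular shift, so in practice the wrapped-around bins would typically be zeroed for a non-circular shift, and the helper name is illustrative.

```python
import numpy as np

def translation_tensor(n, Z):
    """Build an n x Z x n tensor whose z-th slice is the identity matrix with
    its columns circularly permuted by z bins."""
    I = np.eye(n)
    T = np.zeros((n, Z, n))
    for z in range(Z):
        # columns [n-z+1..n, 1..n-z] in 1-based terms, i.e. a circular permutation;
        # z = 1 reproduces I(:, [n, 1:n-1])
        T[:, z, :] = I[:, np.r_[n - z:n, 0:n - z]]
    return T

T = translation_tensor(8, 3)
a = np.arange(8.0)
print(T[:, 1, :] @ a)   # the basis function circularly shifted by one bin
```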

Research has also been done on allowing more general forms of invariance, such as that of Eggert et al. on transformation invariant NMF [33], where all forms of transformation, such as translation and rotation, are dealt with by means of a transformation matrix. However, their model has only been demonstrated on translation or shift invariance. Further, while a transformation matrix could potentially be used to allow the use of linear frequency resolution through the use of a matrix that stretches the spectrum, it has been noted elsewhere that this stretching is difficult to perform using a discrete linear frequency representation [13].

2.1. Shifted 2D Nonnegative Tensor Factorisation

All of the algorithms incorporating shift invariance can be seen as special cases of a more general model, shifted 2D nonnegative tensor factorisation (SNTF), proposed by FitzGerald [34], and separately by [35]. The SNTF model can then be described as

X ≈ Σ_{k=1}^{K} ⟨G:k ⟨⟨T A:k⟩{3,1} ⟨S:kP⟩{3,1}⟩{2:4,1:3}⟩{2,2},  (6)

where X is a tensor of size r × n × m, containing the magnitude spectrograms of each channel of the signal. G is a tensor of size r × K, containing the gains of each of the K sources in each of the r channels. T is an n × z × n translation tensor, which translates the instrument basis functions in A up or down in frequency, where z is the number of translations in frequency, thereby approximating different notes played by a given source. A is a tensor of size n × K × p, where p is the number of translations across time. S is a tensor of size z × K × m containing the activations of the translations of A, which indicate when a given note played by a given instrument occurs, thereby generating a transcription of the signal. P is an m × p × m translation tensor which translates the time activation functions contained in S across time, thereby allowing time-varying source or instrument spectra. These tensors, their dimensions, and functions are summarised in Table 1 for ease of reference, as are all tensors used in subsequent models. If the number of channels is set to r = 1 and the allowable frequency translations z are also set to one, then the model collapses to that proposed by Virtanen in [28]. Similarly, setting p = 1 results in the model proposed in [36], while setting both r and p to one results in the model described in (4). In [34], the generalised Kullback-Leibler divergence is used as a cost function, and multiplicative update equations are derived for G, A, and S.

When using SNTF, a given pitched instrument is modelled by an instrument spectrogram which is translated up and down in frequency to give different notes played by the instrument. The gain parameters are then used to position the instrument in the correct position in the stereo field. A spectrogram of the kth separated source can then be estimated from (6) using only the tensor slices associated with the kth source. This spectrogram can then be inverted to a time-domain waveform by reusing the phase information of the original mixture signal, or by generating a set of phase information using the technique proposed by Slaney [37]. Alternatively, the recovered spectrogram can be used to generate a Wiener-type filter which can be applied to the original complex short-time Fourier transform.

As noted previously, the mapping from log-frequency to linear-frequency domains is an approximate mapping, and this can have an adverse effect on the sound quality of the resynthesis. Various methods for performing this mapping and obtaining an inverse CQT have been investigated [38, 39]. However, a simpler method of overcoming this problem is to incorporate the mapping into the model. This can be done by replacing T in (6) with ⟨RT⟩{2,1}, where R is an approximate map from log to linear domains. This mapping can simply be the transpose of R, the mapping used in (5). Shift invariance is still implemented in the log-frequency domain, but the cost function is now measured in the linear-frequency domain. This is similar to the method proposed by O'Grady when using noise-to-mask ratio as a cost function [21]. O'Grady included the mapping from linear to Bark domain in his algorithm, as the cost function needed to be measured in the Bark scale domain. It was noted that this resulted in energy spreading in the magnitude spectrogram domain. In the modified SNTF algorithm, the opposite case applies: we wish to measure the cost function in the linear magnitude spectrogram domain, as opposed to a log-frequency domain, and the incorporation of the mapping results in less energy spreading in the frequency basis functions in the constant Q domain. It also has the advantage of performing the optimisation in the domain from which the final inversion to the time domain will take place. Despite this, the use of an approximate mapping still has adverse effects on the resynthesis quality.

In order to overcome these resynthesis problems, Schmidt et al. proposed using the recovered spectrograms to create masks which are then used to refilter the original spectrogram [40]. Schmidt et al. used a binary masking approach, where bins were allocated to the source which had the highest power at that bin. In this paper, we use a refiltering method where the recovered source spectrogram is multiplied by the original mixture spectrogram, as it was found that this gave better results than the previously described method.
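
The two reconstruction options mentioned above can be sketched as follows: a Wiener-type mask applied to the original complex STFT, and the simpler refiltering used in this paper, where the recovered source magnitude spectrogram is multiplied elementwise by the mixture spectrogram. The exact normalisation used by the authors is not specified, so the mask construction below (source power divided by summed source power) is an assumption, and the function names are illustrative.

```python
import numpy as np

def wiener_mask_reconstruction(source_mags, mixture_stft, k, eps=1e-12):
    """Wiener-type filtering: scale the complex mixture STFT by the k-th
    source's share of the summed source power (assumed normalisation)."""
    power = np.stack([m ** 2 for m in source_mags])     # (K, n, m) source powers
    mask = power[k] / (power.sum(axis=0) + eps)
    return mask * mixture_stft                          # complex STFT of source k

def refilter_reconstruction(source_mag, mixture_mag):
    """Refiltering as described here: elementwise product of the recovered
    source magnitude spectrogram and the mixture magnitude spectrogram."""
    return source_mag * mixture_mag
```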

3. Sinusoidal Shifted 2D Nonnegative Tensor Factorisation

While SNTF has been shown to be capable of separating mixtures of harmonic pitched instruments [34], a potential problem with the method is that there is no guarantee that the basis functions will be harmonic. A form of harmonic constraint, whereby the basis functions are only allowed to have nonzero values at regions which correspond to a perfectly harmonic sound, has been proposed by Virtanen [13] and later by Raczynski et al. [11], who used it for the purposes of multipitch estimation. However, with this technique, there is no guarantee that the values returned in the harmonic regions of the basis functions will correspond to the actual shape that a sinusoid would have if present. It has also been noted by Raczynski that the structure returned when using this constraint may not always be purely harmonic.

Table 1: Summary of the tensors used, their dimensions, and function, in the various shift-invariant factorisation models included in this paper. Tensors that occur in multiple models are not repeated.

SNTF:
  X    r × n × m    Signal spectrograms
  X̂    r × n × m    Approximation of X
  G    r × K        Instrument gains
  T    n × z × n    Translation tensor (freq.)
  A    n × K × p    Instrument basis functions
  S    z × K × m    Note activations
  P    m × p × m    Translation tensor (time)

SSNTF:
  H    n × z × h    Harmonic dictionary
  W    h × K × p    Harmonic weights

SF-SSNTF:
  F    n × K × n    Formant filters

SF-SSNTF + N:
  M    r × L        Noise instrument gains
  B    n × L × q    Noise basis functions
  C    L × m        Noise activations
  Q    m × q × m    Noise translation tensor

It is also possible for the peaks to occur at points that are not at the centre of the harmonic regions.

An alternative approach to the problem of imposing harmonicity constraints on the basis functions is to note that the magnitude spectrum of a windowed sinusoid can be calculated directly in closed form as a shifted and scaled version of the window's frequency response [41]. For example, using a Hann window, the magnitude spectrum of a sinusoid of frequency f0 = 2πh/fs, where h is the frequency in Hz, fs is the sampling frequency in Hz, and N is the FFT size, is given by

X(x) = |0.5 D(g) + 0.25{D1(g) + D2(g)}|,  (7)

where g = fx − f0, with fx = 2πx/N being the centre frequency of the xth FFT bin, and where D is defined as

D(g) = sin(gN/2) / sin(g/2),  (8)

with D1(g) = D(g − 2π/N) and D2(g) = D(g + 2π/N). It is then proposed to use an additive synthesis type model, where each note is modelled as a sum of sinusoids at integer multiples of the fundamental frequency of the note, with the relative strengths of the sinusoids giving the timbre of the note played. This spectral domain approach has been used previously to perform additive synthesis, in particular the inverse FFT method of Freed et al. [42].

For a given pitch and a given number of harmonics, the magnitude spectra of the individual sinusoids can be stored in a matrix of size n × h, where n is the number of bins in the spectrum and h is the number of harmonics. This can be repeated for each of the allowed z notes, resulting in a tensor of size n × z × h. In effect, this tensor is a signal dictionary consisting of the magnitude spectra of individual sinusoids related to the partials of each allowable note. Again taking a Hann window as an example, the tensor can then be defined as

H(x, i, j) = |0.5 D(gxij) + 0.25{D1(gxij) + D2(gxij)}|,  (9)

where gxij = fx − fi,j, with fi,j = 2π h0 β^(i−1) j / fs, where h0 is the frequency in hertz of the lowest allowable note and β is as previously defined in Section 2. This assumes equal-tempered tuning, but other tuning systems can also be used.

It is also possible to take into account inharmonicity in the positioning of the partials through the use of inharmonicity factors. For example, in the case of instruments containing stretched strings, fi,j can be calculated as

fi,j = 2π h0 β^(i−1) j (1 + (j² − 1)α) / fs,  (10)

where α is the inharmonicity factor for the instrument in question [43]. In practice, the magnitude spectra will be close to zero except in the regions around fi,j, and so it is usually sufficient to calculate the values of H(x, i, j) for ten bins on either side of fi,j and to leave the remaining bins at zero. Further, the frequencies of the lowest partial of the lowest note and the highest partial of the highest note place limits on the region of the spectrogram which will be modelled, and so spectrogram frequency bins outside of these ranges can be discarded. If a small number of harmonics are required, this can considerably reduce the number of calculations required, thereby speeding up the algorithm.
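
A sketch of constructing the harmonic dictionary H from the Hann-window closed form of (7)-(9) is given below. It only illustrates the construction: it does not include the ten-bin truncation or the inharmonicity of (10), and the function names, loop structure, and handling of the g → 0 limit are implementation assumptions rather than the authors' code.

```python
import numpy as np

def hann_kernel(g, N):
    """Closed-form kernel D(g) = sin(gN/2)/sin(g/2), with the g -> 0 limit
    (which equals N) handled explicitly."""
    g = np.asarray(g, dtype=float)
    num, den = np.sin(g * N / 2.0), np.sin(g / 2.0)
    safe_den = np.where(np.abs(den) < 1e-12, 1.0, den)
    return np.where(np.abs(den) < 1e-12, float(N), num / safe_den)

def harmonic_dictionary(n_bins, n_notes, n_harm, h0_hz, fs, N, beta=2 ** (1 / 12)):
    """Dictionary H (n_bins x n_notes x n_harm): magnitude spectrum of the j-th
    harmonic of the i-th allowable note, using the Hann closed form of (7)-(9)."""
    H = np.zeros((n_bins, n_notes, n_harm))
    fx = 2 * np.pi * np.arange(n_bins) / N              # bin centre frequencies (rad/sample)
    for i in range(n_notes):
        for j in range(1, n_harm + 1):
            fij = 2 * np.pi * h0_hz * beta ** i * j / fs  # j-th partial of the i-th note
            g = fx - fij
            H[:, i, j - 1] = np.abs(0.5 * hann_kernel(g, N)
                                    + 0.25 * (hann_kernel(g - 2 * np.pi / N, N)
                                              + hann_kernel(g + 2 * np.pi / N, N)))
    return H
```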

H contains sets of harmonic partials, all of equal gain. In order to approximate the timbres of different musical instruments, these partials must be weighted in different proportions. These weights can be stored in a tensor of size h × K × p, where K is the number of instruments and p is the number of translations across time, thereby allowing the harmonic weights to vary with time. Labeling the weights tensor as W, the model can be described as

X̂ = Σ_{k=1}^{K} ⟨G:k ⟨⟨HW:k⟩{3,1} ⟨S:kP⟩{3,1}⟩{2:4,1:3}⟩{2,2}.  (11)

Using the generalised Kullback-Leibler divergence as a cost function, multiplicative update equations can be derived as

G:k = G:k ⊗ ⟨⟨⟨⟨DH⟩{2,1} W:k⟩{4,1} S:k⟩{3:4,1:2} P⟩{2:4,3:1} ⊘ ⟨⟨⟨⟨OH⟩{2,1} W:k⟩{4,1} S:k⟩{3:4,1:2} P⟩{2:4,3:1},

W:k = W:k ⊗ ⟨⟨(G:k ◦ H) D⟩{[1,3],1:2} ⟨S:kP⟩{3,1}⟩{[1,2,4],[1,2,4]} ⊘ ⟨⟨(G:k ◦ H) O⟩{[1,3],1:2} ⟨S:kP⟩{3,1}⟩{[1,2,4],[1,2,4]},

S:k = S:k ⊗ ⟨⟨⟨(G:k ◦ H) A:k⟩{[2,5],[2,1]} D⟩{1:2,1:2} P⟩{2:3,[2,1]} ⊘ ⟨⟨⟨(G:k ◦ H) A:k⟩{[2,5],[2,1]} O⟩{1:2,1:2} P⟩{2:3,[2,1]},  (12)

where D = X ⊘ X̂ and O is an all-ones tensor with the same dimensions as X, and all divisions are taken as elementwise.

These update equations are similar to those of SNTF, just replacing T and A with a sinusoidal signal dictionary H and a set of harmonic weights W, respectively. It is proposed to call this new algorithm sinusoidal shifted 2D nonnegative tensor factorisation (SSNTF), as it explicitly models the signal as the summation of weighted harmonically related sinusoids, in effect incorporating an additive synthesis model into the tensor factorisation framework. SSNTF can still be considered as shift invariant in frequency, as the harmonic weights are invariant to where in the frequency spectrum the notes occur.

An advantage of SSNTF is that the separation problem is now completely formulated in the linear-frequency domain, thereby eliminating the need to use an approximate mapping from log to linear frequency domains at any point in the algorithm, which removes the potential for resynthesis artifacts due to the mapping. Resynthesis of the separated time-domain waveforms can be carried out in a similar manner to that of SNTF, or alternatively, one can take advantage of the use of the additive synthesis model to reconstruct the separated signal using additive synthesis.

The SSNTF algorithm was implemented in Matlab using the Tensor Toolbox available from [44], as were all subsequent algorithms described in this paper. The cost function was always observed to decrease with each iteration. However, when running SSNTF, it was found that the best results were obtained when the algorithm was given an estimate of what frequency region each source was present in. This was typically done by giving an estimate of the pitch of the lowest note of each source. For score-assisted separation, such as that proposed by [45], this information will be readily available. The incorporation of this information has the added benefit of fixing the ordering of the sources in most cases. In cases where there is no score available, estimates can be obtained by running SNTF first and determining the pitch information from the recovered basis functions before running SSNTF. At present, research is being undertaken on devising alternate ways of overcoming this problem.

As an example of the improved reconstruction that SSNTF can provide, Figure 1 shows the frequency spectrum of a flute note separated from a single channel mixture of flute and piano. SNTF and SSNTF were performed on this example using 9 translations in frequency and 5 translations in time. All other parameters were set as described later in Section 6. The first spectrum is that of the flute note taken from the original unmixed flute waveform, the second spectrum is that of the recovered flute note using SNTF, with the mapping from log to linear domains included in the model, while the third spectrum is that returned by SSNTF. It can be appreciated that the spectrum returned by SSNTF is considerably closer to the original than that returned by SNTF. This demonstrates the utility of using an approach which is formulated in the linear frequency domain.

Figure 2 shows the original mixture spectrogram of piano and flute, while Figure 3(a) shows the unmixed flute spectrogram, with Figures 3(b), 3(c), and 3(d) showing the SNTF-separated flute spectrogram, the SNTF-separated flute spectrogram using refiltering, and the SSNTF-separated spectrogram, respectively.


Figure 1: Spectra of flute note: original, SNTF, and SSNTF, respectively (amplitude against frequency in Hz).

Figure 2: Spectrogram of piano and flute mixture.

Figure 4(a) shows the unmixed piano spectrogram, with Figures 4(b), 4(c), and 4(d) showing the SNTF-separated piano spectrogram, the SNTF-separated piano spectrogram obtained using refiltering, and the SSNTF-separated spectrogram, respectively. It can be seen that the spectrograms recovered using SSNTF are considerably closer to the original spectrograms than that recovered directly from SNTF, where the smearing due to the approximate mapping from log to linear domains is clearly evident. Considerably improved recovery of the sources was also noted on playback of the separated SSNTF signals in comparison to those obtained using SNTF directly. The spectrograms obtained using SNTF in conjunction with refiltering can also be seen to be considerably closer to the original spectrograms than any of the other methods. However, on listening, the sound quality is still less than that obtained using SSNTF. Further, as will be seen later, the SNTF-based methods are not as robust as SSNTF-based methods.

Figure 3: Spectrogram of flute signal: (a) original unmixed, (b) SNTF, (c) refiltered SNTF, (d) SSNTF, (e) source-filter SSNTF.

It should also be noted that the addition of harmonic constraints imposes restrictions on the solutions that can be returned by the factorisation algorithms. This is of considerable benefit when incorporating additional parameters into the models, as will be seen in the following sections.

4. Source-Filter Modelling

As noted previously in Section 2, the use of a single shifted instrument basis function to model different notes played by an instrument is a simplification. In practice, the timbre of notes played by a given instrument changes with pitch, and this restricts the usefulness of shifted factorisation models. Recently, Virtanen and Klapuri proposed the incorporation of a source-filter model approach in the factorisation method as a means of overcoming this problem [46]. In the source-filter framework for sound production, the source is typically a vibrating object, such as a violin string.


Figure 4: Spectrogram of piano signal: (a) original unmixed, (b) SNTF, (c) refiltered SNTF, (d) SSNTF, (e) source-filter SSNTF.

Figure 5: Filter returned for flute when using source-filter SSNTF.

The filter accounts for the resonant structure of the instrument, such as the violin body, which alters and filters the sound produced by the vibrating object. This approach had been used previously in both sound synthesis and speech coding [47, 48], but not in a factorisation framework.

When applied in the context of shifted instrument basis functions, the instrument basis function represents a harmonic excitation pattern which can be shifted up and down in frequency to generate different pitches. A single fixed filter is then applied to these translated excitation patterns, with the filter representing the instrument's resonant structure. This results in a system where the instrument timbre varies with pitch, resulting in a more realistic model. The instrument formant filters can be incorporated into the shifted tensor factorisation framework through a formant filter tensor F of size n × K × n. In this case, the kth slice of F is a diagonal matrix, with the instrument formant filter coefficients contained on the diagonal.
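
Because each slice of F is diagonal, applying the formant filter to the harmonic dictionary amounts to scaling every dictionary atom by a per-frequency-bin gain, as in the illustrative sketch below; the function name and the use of tensordot are assumptions rather than the authors' implementation.

```python
import numpy as np

def apply_formant_filter(filter_coeffs, H):
    """Form the k-th formant-filter slice as a diagonal matrix and apply it to
    the harmonic dictionary: result(x, i, j) = filter_coeffs[x] * H(x, i, j)."""
    Fk = np.diag(filter_coeffs)                    # n x n diagonal slice of F
    # contracting the last mode of Fk with the first mode of H is simply a
    # per-frequency-bin scaling of every note/harmonic atom
    return np.tensordot(Fk, H, axes=([1], [0]))    # shape n x z x h
```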

Unfortunately, attempts to incorporate the source-filter model into the SNTF framework were unsuccessful. The resultant algorithm had too many parameters to optimise, and it was difficult to obtain good separation results. However, the additional constraints imposed by SSNTF were found to make the problem tractable. The resultant model can then be described as

X ≈ X̂ = Σ_{k=1}^{K} ⟨G:k ⟨⟨R:kW:k⟩{[2,4],[2,1]} V:k⟩{2:4,[2,1,3]}⟩{2,2},  (13)

where R:k = ⟨F:kH⟩{3,1} and V:k = ⟨S:kP⟩{3,1}. Again using the generalised Kullback-Leibler divergence as a cost function, the following update equations were derived:

G:k = G:k ⊗ ⟨⟨D⟨R:kW:k⟩{[2,4],[2,1]}⟩{2,1} V:k⟩{2:5,[4,2,1,3]} ⊘ ⟨⟨O⟨R:kW:k⟩{[2,4],[2,1]}⟩{2,1} V:k⟩{2:5,[4,2,1,3]},

F:k = F:k ⊗ ⟨⟨G:kD⟩{1,1} ⟨⟨HW:k⟩{3,1} V:k⟩{2:4,1:3}⟩{[1,3],2:3} ⊘ ⟨⟨G:kO⟩{1,1} ⟨⟨HW:k⟩{3,1} V:k⟩{2:4,1:3}⟩{[1,3],2:3},

W:k = W:k ⊗ ⟨⟨⟨G:kR:k⟩{2,2} D⟩{[1,3],1:2} V:k⟩{[1,2,4],[2,1,4]} ⊘ ⟨⟨⟨G:kR:k⟩{2,2} O⟩{[1,3],1:2} V:k⟩{[1,2,4],[2,1,4]},

S:k = S:k ⊗ ⟨⟨G:k⟨R:kW:k⟩{[2,4],[2,1]}⟩{2,2} ⟨DP⟩{3,1}⟩{[1,3,5],1:3} ⊘ ⟨⟨G:k⟨R:kW:k⟩{[2,4],[2,1]}⟩{2,2} ⟨OP⟩{3,1}⟩{[1,3,5],1:3}.  (14)

Figure 5 shows the filter recovered for the flute from the example previously discussed in Section 3. It can be seen that the recovered filter consists of a series of peaks, as opposed to a smooth formant-like filter. This is due to a combination of two factors: firstly, the small number of different notes played in the original signal, and secondly, the harmonic constraints imposed by SSNTF. This results in a situation where large portions of the spectrum will have little or no energy, and accordingly the filter models these regions as having little or no energy.

On listening to the resynthesis, there was a marked improvement in the sound quality of the flute in comparison with SSNTF, with less high-frequency energy present. The resynthesis of the piano also improved, though less so than that of the flute.


Figure 6: Spectrograms for (a) original flute spectrogram, (b) spectrogram recovered using source-filter SSNTF, and (c) spectrogram recovered using SSNTF.

Figure 7: Filter returned for the solo flute example in Figure 6 when using source-filter SSNTF.

Figures 3(e) and 4(e) show the spectrograms recovered using source-filter SSNTF for the flute and piano, respectively. It can be observed that the flute spectrogram is closer to the original than either SNTF or SSNTF, with no smearing and a reduced presence of higher harmonics in comparison to SSNTF, which is in line with what was observed on listening to the resynthesis. In comparison to the SNTF and refiltering approach, source-filter SSNTF has retained more high-frequency information than the refiltered approach, and can be seen to be closer to the original spectrogram. In the case of the piano, the refiltered spectrogram contains more high-frequency information than the source-filter SSNTF approach, which is closer to the original piano spectrogram. On listening, the source-filter SSNTF approach also outperforms the refiltered SNTF approach.

As a further example of source-filter SSNTF, Figure 6(a) shows the spectrogram of a flute signal consisting of 16 notes, one semitone apart, played in ascending order, while Figures 6(b) and 6(c) show the spectrograms recovered using source-filter SSNTF and SSNTF, respectively. It can be seen that the source-filter method has returned a spectrogram closer to the original, with less high-frequency information than SSNTF. Figure 7 shows the source filter associated with Figure 6(b). It can be seen that in this case, where 16 successive notes are played, the source filter is smoother, as would be expected for a formant-like filter, but as the harmonics get further apart, evidence of peakiness similar to that in Figure 5 becomes more evident.

The above examples demonstrate the utility of using the source-filter approach as a means of improving the accuracy of the SSNTF model. This is borne out in the improved resynthesis of the separated sources.

5. Separation of Pitched and Nonpitched Instruments

Musical signals, especially popular music, typically contain unpitched instruments such as drum sounds in addition to pitched instruments. While allowing shift invariance in both frequency and time is suitable for separating mixtures of pitched instruments, it is not suitable for dealing with percussion instruments such as the snare and kick drums, or other forms of noise in general. These percussion instruments can be successfully captured by algorithms which allow shift invariance in time only, without the use of frequency shift invariance. In order to deal with musical signals which contain both pitched and percussive instruments, or which contain additional noise, it is necessary to have an algorithm which handles both these cases. This can be done by simply adding the two models together. This has previously been done by Virtanen in the context of matrix factorisation algorithms [13], who also noted that the resulting model was too complex to obtain good results without the addition of additional constraints. In particular, the use of a harmonicity constraint was required, though in this case it was based on zeroing instrument basis functions in areas where no harmonic activity was expected, as opposed to the additive synthesis-based technique proposed in this paper.

Extending the concept to the case of tensor factorisation techniques results in a generalised tensor factorisation model for the separation of pitched and percussive instruments, which still allows the use of a source-filter model for pitched instruments. The model can be described by

X ≈ X̂ = Σ_{k=1}^{K} ⟨G:k ⟨⟨R:kW:k⟩{[2,4],[2,1]} V:k⟩{2:4,[2,1,3]}⟩{2,2} + Σ_{l=1}^{L} ⟨M:l ⟨B:l ⟨C:lQ⟩{2,1}⟩{2:3,1:2}⟩{2,2},  (15)

where M is a tensor of size r × L, which contains the gains of each of the L percussive sources, B is a tensor of size n × L × q, where q is the number of allowable time shifts for the percussive sources, C is a tensor of size L × m, and Q is a translation tensor of size m × q × m. Multiplicative update equations, based on the generalised Kullback-Leibler divergence, can then be derived for these additional parameters, while the update equations for all other parameters are as given in Section 4. The additional update equations are given by

M:l = M:l ⊗ ⟨D⟨B⟨CQ⟩{2,1}⟩{2:3,1:2}⟩{2:3,[1,3]} ⊘ ⟨O⟨B⟨CQ⟩{2,1}⟩{2:3,1:2}⟩{2:3,[1,3]},

B:l = B:l ⊗ ⟨⟨M:lD⟩{1,1} ⟨CQ⟩{2,1}⟩{[1,3],[1,3]} ⊘ ⟨⟨M:lO⟩{1,1} ⟨CQ⟩{2,1}⟩{[1,3],[1,3]},

C:l = C:l ⊗ ⟨⟨M:lB⟩{2,2} ⟨DQ⟩{3,3}⟩{[1,3,4],[1,2,4]} ⊘ ⟨⟨M:lB⟩{2,2} ⟨OQ⟩{3,3}⟩{[1,3,4],[1,2,4]}.  (16)

The individual sources can be separated as before, but the algorithm can also be used to separate the pitched instruments from the unpitched percussive instruments, or vice versa, by resynthesising the relevant section of the model. It can also be used as a means of eliminating noise from mixtures of pitched instruments by acting as a type of "garbage collector," which can improve resynthesis quality in some cases. It can also be viewed as being analogous to the additive plus residual sinusoidal analysis techniques described by Serra [49], in that it allows the pitched or sinusoidal part of the signal to be resynthesised separately from the noise part of the signal.

As an example of the use of the combined model, Figure 8 shows the mixture spectrograms obtained from a stereo mixture containing three pitched instruments, piano, flute, and trumpet, and three percussion instruments, snare, hi-hats, and kick drum, while Figure 9 shows the original unmixed spectrograms for those sources, respectively. The piano, snare, and kick drum were all panned to the center, with the hi-hats and flute panned mid-left and the trumpet mid-right. Figure 10 shows the separated spectrograms obtained using the combined model. It can be seen that the sources have been recovered well, with each individual instrument identifiable, though traces of other sources can be seen in the spectrograms. This is most evident where traces of the hi-hats are visible in the snare spectrogram, but the snare clearly predominates. On listening to the results, traces of the flute can also be heard in the piano signal, and the timbres of the instruments have been altered, but are still recognisable as being the instrument in question. The example also highlights another advantage of tensor factorisation models in general, namely the ability to separate instruments which have the same position in the stereo field. This is in contrast to algorithms such as ADRess and DUET, which can only separate sources if they occupy different positions in the stereo field [26, 50].

6. Performance Evaluation

The performances of SNTF, SNTF using refiltering, SSNTF, source-filter SSNTF, and source-filter SSNTF with noise basis functions, in the context of modelling mixtures of pitched instruments, were compared using a set of 40 test mixtures.

Figure 8: Mixture spectrograms of piano, flute, trumpet, snare, hi-hats, and kick drum.

In the case of source-filter SSNTF with noise basis functions, two noise basis functions were learned in order to aid the elimination of noise and artifacts from the harmonic sources. The 40 test signals were of 4 seconds duration and contained mixtures of melodies played by different instruments, created using a large library of orchestral samples [51]. Samples from a total of 15 different orchestral instruments were used. A wide range of pitches was covered, from 87 Hz to 1.5 kHz, and the melodies played by the individual instruments in each test signal were in harmony. This was done to ensure that the test signals contained extensive overlapping of harmonics, as this occurs in most real-world musical signals. In many cases, the notes played by one instrument overlapped notes played by another instrument, to test if the algorithms were capable of discriminating notes of the same pitch played by different instruments.

The 40 test signals consisted of 20 single channel mixtures of 2 instruments and 20 stereo mixtures of 3 instruments, and these mixtures were created by linear mixing of individual single channel instrument signals. In the case of the single channel mixtures, the source signals were mixed with unity gain, and in the case of the stereo mixtures, mixing was done according to

x1(t) = 0.75 s1(t) + 0.5 s2(t) + 0.25 s3(t),
x2(t) = 0.25 s1(t) + 0.5 s2(t) + 0.75 s3(t),  (17)

where x1(t) and x2(t) are the left and right channels of the stereo mixture, and s1(t) represents the first single channel instrument signal, and so on.
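
For clarity, the stereo mixing of (17) can also be written as a small matrix product; the sketch below assumes the three source signals are equal-length sample vectors, and the function name is illustrative.

```python
import numpy as np

def mix_stereo(s1, s2, s3):
    """Create the two-channel test mixture of (17) from three mono sources."""
    M = np.array([[0.75, 0.5, 0.25],
                  [0.25, 0.5, 0.75]])
    S = np.vstack([s1, s2, s3])      # 3 x T matrix of source samples
    return M @ S                     # 2 x T: rows are the left and right channels
```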

Spectrograms were obtained for the mixtures using a short-time Fourier transform with a Hann window of 4096 samples and a hopsize of 1024 samples between frames. All variables were initialised randomly, with the exception of the frequency basis functions for SNTF-based separation, which were initialised with harmonic basis functions at the frequency of the lowest note played by each instrument in each example.


Figure 9: Original spectrograms of (a) piano, (b) flute, (c) trumpet, (d) snare, (e) hi-hats, and (f) kick drum.

This was done to put SNTF on an equal footing with the SSNTF-based algorithms, where the pitch of the lowest note of each source was provided. The number of allowable notes was set to the largest pitch range covered by an instrument in the test signal, and the number of harmonics used in SSNTF was set to 12. The algorithms were run for 300 iterations, and the separated source spectrograms were estimated by carrying out contracted tensor multiplication on the tensor slices associated with an individual source. The recovered source spectrograms were resynthesised using the phase information from the mixture spectrograms.

Figure 10: Separated spectrograms of (a) piano, (b) flute, (c) trumpet, (d) snare, (e) hi-hats, and (f) kick drum.

The phase of the channel where the source was strongest was used in the case of the stereo mixtures.

Using the original source signals as a reference, the performance of the different algorithms was evaluated using commonly used metrics, namely the signal-to-distortion ratio (SDR), which provides an overall measure of the sound quality of the source separation, the signal-to-interference ratio (SIR), which measures the presence of other sources in the separated sounds, and the signal-to-artifacts ratio (SAR), which measures the artifacts present in the recovered signal due to separation and resynthesis. Details of these metrics can be found in [52], and a Matlab toolbox to calculate these measures is available from [53].


Figure 11: Performance evaluation of SNTF (circle, solid), refiltered SNTF (diamond, solid), SSNTF (square, dash-dotted), source-filter SSNTF (triangle, solid), and source-filter SSNTF with noise basis functions (star, dashed) for various signal durations; panels (a)-(c) show SDR, SIR, and SAR in dB against signal duration in seconds.

As noted previously in Section 3, the provision of the lowest pitch note for each source was sufficient to determine the correct source ordering for all the SSNTF-based algorithms. In the case of the SNTF-based algorithms, the ordering of the sources was determined by associating a separated source with the original source which resulted in the best SIR score. This matching procedure was then checked manually to ensure no errors had occurred.

A number of different tests were run to determine the effect of signal duration on the performance of the algorithms and to determine the effect of using different numbers of allowable shifts in time. For the tests on signal duration, the mixture signals were truncated to lengths of 1, 2, 3, and 4 seconds, the number of time shifts was set to 5, and the performance of the algorithms was evaluated. A summary of the results obtained is shown in Figure 11. The results were obtained by averaging the metrics obtained for each separated source to give an overall score for each test mixture. The results for each mixture were then averaged to yield the data shown in the figure. It can be seen that the SSNTF-based algorithms all clearly outperform SNTF-based methods in all cases, though the use of refiltering does improve the performance of SNTF. It can also be seen that signal duration does not have much effect on the results obtained from SSNTF, with the results remaining relatively constant with signal duration, showing that SSNTF can capture harmonic sources even at relatively short signal durations.


Figure 12: Performance evaluation of SNTF (circle solid), refiltered SNTF (diamond solid), SSNTF (square dash-dotted), source-filter SSNTF (triangle solid), and source-filter SSNTF (star dashed) with noise basis functions for various allowable shifts in time.


In the case of the algorithms incorporating source filtering, performance improved with increased signal duration. This is particularly evident in the case of the SIR metric. This demonstrates that longer signal durations are required to properly capture filters for each instrument. This is to be expected, as increased numbers of notes played by each instrument provide more information on which to learn the filter, while the harmonic model, with fewer parameters, does not require as much information for training. It should be noted that this trend was less evident in the stereo mixtures than in the mono mixtures, suggesting that the spatial positioning of sources in the stereo field may affect the ability to learn the source filters. This could be tested by measuring the separation of the sources while varying the mixing coefficients and is an area for future investigation. Nonetheless, it can be seen that at longer durations the source-filter approaches outperform SSNTF, with the basic source-filter model performing better in terms of SDR and SAR, while the source-filter plus noise approach performs better in terms of SIR.

The results from testing the effect of the number of time shifts on the separation of the sources are shown in Figure 12. These were obtained using the same procedure used for the previous tests. The number of allowable shifts ranged from 1 to 10, which corresponds to a maximum shift in time of approximately 0.2 seconds.


Once again, the SSNTF-based algorithms clearly outperform SNTF-based approaches, regardless of the shift. However, for both SSNTF and the source-filter plus noise approach, performance is relatively constant with the number of allowable shifts: there is a small improvement in performance up until 7 shifts, and beyond this performance degrades slightly. In the case of source-filter SSNTF, there is a noticeable improvement when going from one to two shifts, but beyond this there is little or no variation in performance with increased numbers of shifts. On investigation, this was found to be mainly evident in the stereo mixtures, with the performance on the mono mixtures remaining relatively constant, again highlighting the need to investigate the performance of the algorithms under different mixing coefficients. Overall, the performance of the algorithms is in line with that observed when varying signal duration, with the source-filter plus noise approach performing best in terms of SIR, while source-filter SSNTF performs better in terms of SDR and SAR. Further, the results suggest that in many cases a single set of harmonic weights can be used to characterise pitched instruments without the need to incorporate timbral change with time.

On listening to the separated sources, the SSNTF-based approaches clearly outperform SNTF. It should be noted that in some cases SNTF using refiltering resulted in audio quality comparable to the SSNTF-based approaches; however, this was only in a small number of examples. In the majority of cases the addition of the source filter improves on the results obtained by SSNTF. On comparing the source-filter approach to the source-filter plus noise model, it was observed that the results varied from mixture to mixture, with a considerable improvement in resynthesis quality of some sources and a reduction of quality in other cases, while in a large number of tests no major differences could be heard in the results. This shows that, in many cases, for clean mixture signals of pitched instruments there is no need to incorporate noise basis functions. Nevertheless, the use of noise basis functions is still useful in the presence of noise or percussion instruments. It should also be noted that in half of the test mixtures SNTF did not manage to correctly separate the sources, which, in conjunction with the distortion due to the smearing of the frequency bins caused by the mapping from log to linear frequency, goes a long way towards explaining the negative SDR and SIR scores. While SNTF using refiltering resulted in improved resynthesis in the cases where the sources had been separated correctly, it also suffered from the reliability issues of the underlying SNTF technique, and this is reflected in the poor scores for all metrics. This indicates that the SSNTF-based techniques are considerably more robust than SNTF-based techniques.

The separated sources can also be resynthesised via an additive synthesis approach, and on listening, the results obtained were comparable to those obtained from the spectrogram-based resynthesis. However, as the additive synthesis approach uses different phase information than the spectrogram-based resynthesis, the results are not comparable using the metrics used in this paper. This highlights the need to develop a set of perceptually based metrics for sound source separation and is an area for future research.

Also investigated was the goodness of fit of the models to the original spectrogram data, as measured by the cost function. It was observed that the results obtained for SSNTF were on average 64% smaller than those for SNTF, despite the fact that SSNTF has a smaller number of free parameters, as the number of harmonics was considerably smaller than the number of frequency bins used in the constant Q spectrogram for SNTF. This highlights the benefits of using an approach solely formulated in the linear frequency domain. Using source-filter SSNTF, with an additional K × n parameters over SSNTF, resulted in an average reduction in the cost function of 76% in comparison to SNTF, and a reduction of 33% in comparison to SSNTF.

Overall, it can be seen that the methods proposed in this paper offer a considerable improvement over previous separation methods using SNTF. Large improvements can be seen in the performance metrics over the previous SNTF method, and it can also be seen that the proposed models result in an improved fit to the original data.

7. Conclusions

The use of shift-invariant tensor factorisations for the purposes of musical sound source separation, with a particular emphasis on pitched instruments, has been discussed, and problems with existing algorithms were highlighted. The problem of grouping notes to sources can be overcome by incorporating shift invariance in frequency into the factorisation framework, but this comes at the price of requiring the use of a log-frequency representation. This causes considerable problems when attempting to resynthesise the separated sources, as there is no exact mapping available from a log-frequency representation back to a linear-frequency representation, which results in considerable degradation in the sound quality of the separated sources. While refiltering can overcome this problem to some extent, there are still problems with resynthesis.

A further problem with existing techniques was also highlighted, in particular the lack of a strict harmonic constraint on the recovered frequency basis functions. Previous attempts to impose harmonicity used an ad hoc constraint where the basis functions were zeroed in regions where no harmonic activity was expected. While this does guarantee that there will be no activity in these regions, it does not guarantee that the basis functions recovered will have the shape that a sinusoid would have if present in these regions.

Sinusoidal shifted 2D nonnegative tensor factorisation was then proposed as a means of overcoming both of these problems simultaneously. It takes advantage of the fact that a closed-form solution exists for calculating the spectrum of a sinusoid of known frequency, and uses an additive-synthesis inspired approach for modelling pitched instruments, where each note played by an instrument is modelled as the sum of a fixed number of weighted sinusoids in harmonic relation to each other. These weights are considered to be invariant to changes in the pitch, and so each note is modelled using the same weights regardless of pitch. The frequency spectrum of the individual harmonics is calculated in the linear frequency domain, eliminating the need to use a log-frequency representation at any point in the algorithm, and harmonicity constraints are imposed explicitly by using a signal dictionary of harmonic sinusoid spectra. Results show that using this signal model results in a better fit to the original mixture spectrogram than algorithms involving the use of a log-frequency representation, thereby demonstrating the benefits of being able to perform the optimisation solely in the linear-frequency domain.
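As a rough illustration of this additive-synthesis style of modelling, the sketch below builds a nonnegative dictionary of harmonic spectra for a single note and combines it with a shared, pitch-invariant weight vector. It approximates each harmonic's spectrum by an FFT of a windowed sinusoid rather than the closed-form window transform used in the paper, and the note frequency, FFT size, and example weights are purely illustrative.

```python
import numpy as np

def harmonic_dictionary(f0, n_harmonics, n_fft, sr, window=np.hanning):
    """Magnitude spectra of the first n_harmonics of a note with fundamental f0.

    Returns an (n_fft // 2 + 1, n_harmonics) nonnegative matrix; a note's
    spectrum is then modelled as this matrix times a pitch-invariant weight vector.
    """
    w = window(n_fft)
    t = np.arange(n_fft) / sr
    D = np.zeros((n_fft // 2 + 1, n_harmonics))
    for h in range(n_harmonics):
        partial = np.sin(2 * np.pi * f0 * (h + 1) * t)   # h-th harmonic sinusoid
        D[:, h] = np.abs(np.fft.rfft(w * partial))       # its linear-frequency spectrum
    return D

# Example: a C4 note (261.6 Hz) with 12 harmonics and one shared weight vector
D = harmonic_dictionary(261.6, 12, 4096, 44100)
weights = 1.0 / np.arange(1, 13)        # illustrative pitch-invariant harmonic weights
note_spectrum = D @ weights
```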

However, it should be noted that the proposed model is not without drawbacks. In particular, best results were obtained if the pitch of the lowest note of each pitched instrument was provided to the algorithm. In most cases this information will not be readily available, and this necessitates the use of the standard shifted 2D nonnegative tensor factorisation algorithm to estimate these pitches before using the sinusoidal model. Research is currently ongoing on other methods to overcome this problem, but despite this, it is felt that the advantages of the new algorithm more than outweigh this drawback.

Using the same harmonic weights or instrument basis function regardless of pitch is only an approximation to the real-world situation, where the timbre of an instrument does change with pitch. To overcome this limitation, the incorporation of a source-filter model into the tensor factorisation framework had previously been proposed by others. Unfortunately, in the context of sound source separation, it was found that it was difficult to obtain good results using this approach as there were too many parameters to optimise. However, the addition of the strict harmonicity constraint proposed in this paper was found to restrict the range of solutions sufficiently to make the problem tractable.

It had previously been observed that the addition of harmonic constraints was required to create a system which could handle both pitched and percussive instrumentation simultaneously. However, previous attempts at such systems suffered due to the use of log-frequency representations and the lack of a strict harmonic constraint. The combined model presented here extends this earlier work from single-channel to multichannel signals, and overcomes these problems by use of sinusoidal constraints applied in the linear-frequency domain, as well as incorporating the source-filter model into the system, and so represents a more general model than those previously proposed.

In testing using common source separation performance metrics, the extended algorithms proposed were found to considerably outperform existing tensor factorisation algorithms, with considerably reduced signal distortion and artifacts in the resynthesis. The extended algorithms were also found to be more reliable than SNTF-based approaches.

In conclusion, it has been demonstrated that use of an additive-synthesis based approach for modelling instruments in a factorisation framework overcomes problems associated with previous approaches, as well as allowing extensions to existing models. Future work will concentrate on the improvement of the proposed models, both in terms of increased generality and in improved resynthesis of the separated sources, as well as investigating the effects of the mixing coefficients on the separations obtained. It is also proposed to investigate the use of frequency-domain performance metrics as a means of increasing the perceptual relevance of source separation metrics.

Acknowledgments

This research was part of the IMAAS project funded by Enterprise Ireland. The authors wish to thank Mikel Gainza, Matthew Hart, and Dan Barry for their helpful discussions and comments during the preparation of this paper. The authors also wish to thank the reviewers for their helpful comments which resulted in a much improved paper.

References

[1] J. P. Stautner, Analysis and synthesis of music using the auditory transform, M.S. thesis, MIT Electrical Engineering and Computer Science Department, Massachusetts Institute of Technology, Cambridge, Mass, USA, 1983.

[2] P. Comon, “Independent component analysis, a new concept?” Signal Processing, vol. 36, no. 3, pp. 287–314, 1994.

[3] M. S. Lewicki and T. J. Sejnowski, “Learning overcomplete representations,” Neural Computation, vol. 12, no. 2, pp. 337–365, 2000.

[4] B. A. Olshausen and D. J. Field, “Sparse coding of sensory inputs,” Current Opinion in Neurobiology, vol. 14, no. 4, pp. 481–487, 2004.

[5] D. Lee and H. Seung, “Learning the parts of objects by nonnegative matrix factorisation,” Nature, vol. 401, no. 6755, pp. 788–791, 1999.

[6] P. Paatero and U. Tapper, “Positive matrix factorization: a non-negative factor model with optimal utilization of error estimates of data values,” Environmetrics, vol. 5, no. 2, pp. 111–126, 1994.

[7] M. Casey and A. Westner, “Separation of mixed audio sources by independent subspace analysis,” in Proceedings of the International Computer Music Conference (ICMC ’00), pp. 154–161, Berlin, Germany, August-September 2000.

[8] T. Virtanen, “Sound source separation using sparse coding with temporal continuity objective,” in Proceedings of the International Computer Music Conference (ICMC ’03), pp. 231–234, Singapore, September 2003.

[9] P. Smaragdis and J. C. Brown, “Non-negative matrix factorization for polyphonic music transcription,” in Proceedings of the IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA ’03), pp. 177–180, New Paltz, NY, USA, October 2003.

[10] D. FitzGerald, B. Lawlor, and E. Coyle, “Sub-band independent subspace analysis for drum transcription,” in Proceedings of the 5th International Conference on Digital Audio Effects (DAFX ’02), pp. 65–69, Hamburg, Germany, September 2002.

[11] S. Raczynski, N. Ono, and S. Sagayama, “Multipitch analysis with harmonic nonnegative matrix approximation,” in Proceedings of the 8th International Conference on Music Information Retrieval (ISMIR ’07), pp. 381–386, Vienna, Austria, September 2007.

[12] P. Sajda, S. Du, and L. Parra, “Recovery of constituent spectra using non-negative matrix factorization,” in Wavelets: Applications in Signal and Image Processing X, vol. 5207 of Proceedings of SPIE, pp. 321–331, San Diego, Calif, USA, August 2003.


[13] T. Virtanen, Sound source separation in monaural music signals, Ph.D. thesis, Tampere University of Technology, Tampere, Finland, 2006.

[14] D. FitzGerald, M. Cranitch, and E. Coyle, “Shifted non-negative matrix factorisation for sound source separation,” in Proceedings of the 13th IEEE/SP Workshop on Statistical Signal Processing, pp. 1132–1137, Bordeaux, France, July 2005.

[15] M. Mørup, L. K. Hansen, and S. M. Arnfred, “Sparse higher order non-negative matrix factorization,” Technical Report IMM2007-04658, Technical University of Denmark.

[16] S. A. Abdallah and M. D. Plumbley, “Polyphonic transcription by non-negative sparse coding of power spectra,” in Proceedings of the 5th International Conference on Music Information Retrieval (ISMIR ’04), pp. 318–325, Barcelona, Spain, October 2004.

[17] R. M. Parry and I. Essa, “Incorporating phase information for source separation via spectrogram factorization,” in Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP ’07), vol. 2, pp. 661–664, Honolulu, Hawaii, USA, April 2007.

[18] R. M. Parry and I. Essa, “Phase-aware non-negative spectrogram factorization,” in Proceedings of the 7th International Conference on Independent Component Analysis and Signal Separation (ICA ’07), vol. 4666 of Lecture Notes in Computer Science, pp. 536–543, London, UK, September 2007.

[19] R. Kompass, “A generalized divergence measure for non-negative matrix factorization,” in Proceedings of the Neuroinformatics Workshop, Torun, Poland, September 2005.

[20] A. Cichocki, R. Zdunek, and S.-I. Amari, “Csiszar’s divergences for non-negative matrix factorization: family of new algorithms,” in Proceedings of the 6th International Conference on Independent Component Analysis and Blind Signal Separation (ICA ’06), vol. 3889 of Lecture Notes in Computer Science, pp. 32–39, Springer, Charleston, SC, USA, March 2006.

[21] P. D. O. Grady, Sparse separation of under-determined speech mixtures, Ph.D. thesis, National University of Ireland Maynooth, Kildare, Ireland, 2007.

[22] D. FitzGerald, Automatic drum transcription and source separation, Ph.D. thesis, Dublin Institute of Technology, Dublin, Ireland, 2004.

[23] B. W. Bader and T. G. Kolda, “Algorithm 862: MATLAB tensor classes for fast algorithm prototyping,” ACM Transactions on Mathematical Software, vol. 32, no. 4, pp. 635–653, 2006.

[24] D. FitzGerald, M. Cranitch, and E. Coyle, “Non-negative tensor factorisation for sound source separation,” in Proceedings of the Irish Signals and Systems Conference, pp. 8–12, Dublin, Ireland, September 2005.

[25] R. M. Parry and I. Essa, “Estimating the spatial position of spectral components in audio,” in Proceedings of the 6th International Conference on Independent Component Analysis and Blind Signal Separation (ICA ’06), vol. 3889 of Lecture Notes in Computer Science, pp. 666–673, Charleston, SC, USA, March 2006.

[26] D. Barry, B. Lawlor, and E. Coyle, “Sound source separation: azimuth discrimination and resynthesis,” in Proceedings of the 7th International Conference on Digital Audio Effects (DAFX ’04), Naples, Italy, October 2004.

[27] P. Smaragdis, “Non-negative matrix factor deconvolution; extraction of multiple sound sources from monophonic inputs,” in Proceedings of the 5th International Conference on Independent Component Analysis and Blind Signal Separation, vol. 3195 of Lecture Notes in Computer Science, pp. 494–499, Granada, Spain, September 2004.

[28] T. Virtanen, “Separation of sound sources by convolutive sparse coding,” in Proceedings of the ISCA Tutorial and Research Workshop on Statistical and Perceptual Audio Processing (SAPA ’04), Jeju, Korea, October 2004.

[29] M. N. Schmidt and M. Mørup, “Nonnegative matrix factor 2-D deconvolution for blind single channel source separation,” in Proceedings of the 6th International Conference on Independent Component Analysis and Blind Signal Separation (ICA ’06), vol. 3889 of Lecture Notes in Computer Science, pp. 700–707, Charleston, SC, USA, March 2006.

[30] E. Vincent and X. Rodet, “Music transcription with ISA and HMM,” in Proceedings of the 5th International Conference on Independent Component Analysis and Blind Signal Separation (ICA ’04), vol. 3195 of Lecture Notes in Computer Science, pp. 1197–1204, Granada, Spain, September 2004.

[31] A. B. Nielsen, S. Sigurdsson, L. K. Hansen, and J. Arenas-García, “On the relevance of spectral features for instrument classification,” in Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP ’07), vol. 2, pp. 485–488, Honolulu, Hawaii, USA, April 2007.

[32] J. C. Brown, “Calculation of a constant Q spectral transform,” Journal of the Acoustical Society of America, vol. 89, no. 1, pp. 425–434, 1991.

[33] J. Eggert, H. Wersing, and E. Korner, “Transformation-invariant representation and NMF,” in Proceedings of the IEEE International Joint Conference on Neural Networks (IJCNN ’04), vol. 4, pp. 2535–2539, Budapest, Hungary, July 2004.

[34] D. FitzGerald, M. Cranitch, and E. Coyle, “Shifted 2D non-negative tensor factorisation,” in Proceedings of the Irish Signals and Systems Conference, pp. 509–513, Dublin, Ireland, June 2006.

[35] M. Mørup and M. N. Schmidt, “Sparse non-negative tensor 2D deconvolution (SNTF2D) for multi channel time-frequency analysis,” Tech. Rep., Technical University of Denmark, Copenhagen, Denmark, 2006.

[36] D. FitzGerald, M. Cranitch, and E. Coyle, “Sound source separation using shifted non-negative tensor factorisation,” in Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP ’06), vol. 5, pp. 653–656, Toulouse, France, May 2006.

[37] M. Slaney, “Pattern playback in the 90s,” in Advances in Neural Information Processing Systems 7, MIT Press, Cambridge, Mass, USA, 1996.

[38] D. FitzGerald, M. Cranitch, and E. Coyle, “Resynthesis methods for sound source separation using non-negative factorisation methods,” in Proceedings of the Irish Signals and Systems Conference, Derry, Ireland, September 2007.

[39] D. FitzGerald, M. Cranitch, and M. Cychowski, “Towards an inverse constant Q transform,” in Proceedings of the 120th AES Convention, Paris, France, May 2006.

[40] M. N. Schmidt and M. Mørup, “Nonnegative matrix factor 2-D deconvolution for blind single channel source separation,” in Proceedings of the 6th International Conference on Independent Component Analysis and Blind Signal Separation (ICA ’06), vol. 3889 of Lecture Notes in Computer Science, pp. 700–707, Charleston, SC, USA, March 2006.

[41] D. DeFatta, J. Lucas, and W. Hodgkiss, Digital Signal Processing: A System Design Approach, John Wiley & Sons, New York, NY, USA, 1988.

[42] A. Freed, X. Rodet, and P. Depalle, “Performance, synthesis and control of additive synthesis on a desktop computer using FFT-1,” in Proceedings of the 19th International Computer


Music Conference (ICMC ’93), vol. 19, pp. 98–101, Waseda University Center for Scholarly Information, International Computer Music Association, Tokyo, Japan, September 1993.

[43] N. F. Fletcher and T. D. Rossing, The Physics of Musical Instruments, Springer, New York, NY, USA, 2nd edition, 1998.

[44] Tensor Toolbox for Matlab, http://csmr.ca.sandia.gov/∼tgkolda/TensorToolbox/.

[45] J. Woodruff, B. Pardo, and R. Dannenberg, “Remixing stereo music with score-informed source separation,” in Proceedings of the 7th International Symposium on Music Information Retrieval (ISMIR ’06), Victoria, Canada, October 2006.

[46] T. Virtanen and A. Klapuri, “Analysis of polyphonic audio using source-filter model and non-negative matrix factorization,” in Proceedings of the Advances in Models for Acoustic Processing, Neural Information Processing Systems Workshop, Whistler, Canada, December 2006.

[47] V. Valimaki, J. Pakarinen, C. Erkut, and M. Karjalainen, “Discrete-time modelling of musical instruments,” Reports on Progress in Physics, vol. 69, no. 1, pp. 1–78, 2006.

[48] M. R. Schroeder and B. S. Atal, “Code-excited linear prediction (CELP): high-quality speech at very low bit rates,” in Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP ’85), vol. 10, pp. 937–940, Tampa, Fla, USA, April 1985.

[49] X. Serra, “Musical sound modeling with sinusoids plus noise,” in Musical Signal Processing, G. D. Poli, A. Picialli, S. T. Pope, and C. Roads, Eds., Swets & Zeitlinger, Lisse, The Netherlands, 1997.

[50] O. Yilmaz and S. Rickard, “Blind separation of speech mixtures via time-frequency masking,” IEEE Transactions on Signal Processing, vol. 52, no. 7, pp. 1830–1847, 2004.

[51] P. Siedlaczek, Advanced Orchestra Library Set, 1997.

[52] E. Vincent, R. Gribonval, and C. Fevotte, “Performance measurement in blind audio source separation,” IEEE Transactions on Audio, Speech and Language Processing, vol. 14, no. 4, pp. 1462–1469, 2006.

[53] BSS Eval toolbox, http://bassdb.gforge.inria.fr/bss eval.


Hindawi Publishing Corporation
Computational Intelligence and Neuroscience
Volume 2008, Article ID 276535, 12 pages
doi:10.1155/2008/276535

Research Article

Gene Tree Labeling Using Nonnegative Matrix Factorization on Biomedical Literature

Kevin E. Heinrich,1 Michael W. Berry,1 and Ramin Homayouni2

1 Department of Electrical Engineering and Computer Science, University of Tennessee, Knoxville, TN 37996-3450, USA
2 Department of Biology, University of Memphis, Memphis, TN 38152-3150, USA

Correspondence should be addressed to Michael W. Berry, [email protected]

Received 23 October 2007; Accepted 4 February 2008

Recommended by Rafal Zdunek

Identifying functional groups of genes is a challenging problem for biological applications. Text mining approaches can be used to build hierarchical clusters or trees from the information in the biological literature. In particular, the nonnegative matrix factorization (NMF) is examined as one approach to label hierarchical trees. A generic labeling algorithm as well as an evaluation technique is proposed, and the effects of different NMF parameters with regard to convergence and labeling accuracy are discussed. The primary goals of this study are to provide a qualitative assessment of the NMF and its various parameters and initialization, to provide an automated way to classify biomedical data, and to provide a method for evaluating labeled data assuming a static input tree. As a byproduct, a method for generating gold standard trees is proposed.

Copyright © 2008 Kevin E. Heinrich et al. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

1. Introduction

High-throughput techniques in genomics, proteomics, and related biological fields generate large amounts of data that enable researchers to examine biological systems from a global perspective. Unfortunately, however, the sheer mass of information available is overwhelming, and data such as gene expression profiles from DNA microarray analysis can be difficult to understand fully even for domain experts. Additionally, performing these experiments in the lab can be expensive with respect to both time and money.

In recent years, biological literature repositories have become an alternative data source to examine phenotype. Many of the online literature sources are manually curated, so the annotations assigned to articles are subjectively assigned in an imperfect and error-prone manner. Given the time required to read and classify an article, automated methods may help increase the annotation rate as well as improve existing annotations.

A recently developed tool that may help improve annotation as well as identify functional groups of genes is the Semantic Gene Organizer (SGO). SGO is a software environment based upon latent semantic indexing (LSI) that enables researchers to view groups of genes in a global context as a hierarchical tree or dendrogram [1]. The low-rank approximation provided by LSI (for the original term-to-document associations) exposes latent relationships so that the resulting hierarchical tree is simply a visualization of those relationships that are reproducible and easily interpreted by biologists. Homayouni et al. [2] have shown that SGO can identify groups of related genes more accurately than term co-occurrence methods. LSI, however, is based upon the singular value decomposition (SVD) [3], and since the input data for SGO is a nonnegative matrix of weighted term frequencies, the negative values prevalent in the basis vectors of the SVD are not easily interpreted.

On the other hand, the decomposition produced by the recently popular nonnegative matrix factorization (NMF) can be readily interpreted. Paatero and Tapper [4] were among the first researchers to investigate this factorization, and Lee and Seung [5] demonstrated its use for both text mining and image analysis. NMF is generated by an iterative algorithm that preserves the nonnegativity of the original data; the factorization yields a low-rank, parts-based representation of the data. In effect, common themes present in the data can be identified simply by inspecting the factor matrices. Depending on the interpretation, the factorization can induce both clustering and classification. If NMF can accurately model the input data, it can be used to both classify data and perform pattern recognition tasks [6]. Within the context of SGO, this means that the groups of genes presented in the hierarchical trees can be assigned labels that identify common attributes of protein function.

The interpretability of NMF, however, comes at a price. Namely, convergence and stability are not guaranteed, and many variations have been proposed [5], requiring different parameter choices. The goals of this study are (1) to provide a qualitative assessment of the NMF and its various parameters, particularly as they apply to the biomedical context, (2) to provide an automated way to classify biomedical data, and (3) to provide a method for evaluating labeled data assuming a static input tree. As a byproduct, a method for generating “gold standard” trees is proposed.

2. Methods

As outlined in [7], hierarchical trees can be constructed for a given group of genes. Once those trees are formed, techniques that label the interior nodes of those trees can be examined.

2.1. Nonnegative Matrix Factorization

Given an m × n nonnegative matrix A = [a_{ij}], where each entry a_{ij} denotes the term weight of token i in gene document j, the rows of A represent term vectors that show how terms are distributed across the entire collection. Similarly, the columns of A show which terms are present within a gene document. Consider the 24 × 9 term-by-document matrix A in Table 1 derived from the sample document collection [7] in Table 2. Here, log-entropy term weighting [8] is used to define the relative importance of term i for document j. Specifically, a_{ij} = l_{ij} g_i, where

l_{ij} = \log_2\left(1 + f_{ij}\right), \qquad g_i = 1 + \frac{\sum_j p_{ij} \log_2 p_{ij}}{\log_2 n},   (1)

f_{ij} is the frequency of token i in document j, and p_{ij} = f_{ij} / \sum_j f_{ij} is the probability of token i occurring in document j. By design, tokens that appear less frequently across the collection but more frequently within a document will be given higher weight. That is, distinguishing tokens will tend to have higher weights assigned to them, while more common tokens will have weights closer to zero.
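For concreteness, a small NumPy routine implementing this weighting might look as follows; it is a direct transcription of (1), with the function name and the zero-handling guards being our own additions.

```python
import numpy as np

def log_entropy_weight(F):
    """Log-entropy weighting of an m x n term-by-document frequency matrix F."""
    m, n = F.shape
    local = np.log2(1.0 + F)                                   # l_ij
    row_sums = F.sum(axis=1, keepdims=True).astype(float)
    p = np.divide(F, row_sums, out=np.zeros((m, n)), where=row_sums > 0)
    with np.errstate(divide="ignore", invalid="ignore"):
        plogp = np.where(p > 0, p * np.log2(p), 0.0)           # p_ij log2 p_ij
    entropy = 1.0 + plogp.sum(axis=1) / np.log2(n)             # g_i
    return local * entropy[:, None]                            # a_ij = l_ij * g_i
```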

If NMF is applied to the sample term-document matrix in Table 1, one possible factorization is given in Tables 3 and 4; the approximation to the term-document matrix generated by multiplying W × H is given in Table 5. The top-weighted terms for each feature are presented in Table 6. By inspection, the sample collection has features that represent leukemia, alcoholism, anxiety, and autism. If each document and term is assigned to its most dominant feature, then the original term-document matrix can be reorganized around those features. The restructured matrix typically resembles a block diagonal matrix and is given in Table 7.

NMF of A is based on an iterative technique that attempts to find two nonnegative factor matrices, W and H, such that

A ≈ WH,   (2)

where W and H are m × k and k × n matrices, respectively. Typically, k is chosen so that k ≪ min(m, n). The optimal choice of k is problem-dependent [9]. This factorization minimizes the squared Euclidean distance objective function [10]

\|A - WH\|_F^2 = \sum_{ij} \left(A_{ij} - (WH)_{ij}\right)^2.   (3)

Minimizing the objective (or cost) function is convex in either W or H, but not both variables together. As such, finding global minima to the problem is unrealistic; however, finding several local minima is within reason. Also, for each solution, the matrices W and H are not unique. This property is evident when examining WDD^{-1}H for any nonnegative invertible matrix D [11].

The goal of NMF is to approximate the original term-by-gene document space as accurately as possible with the factor matrices W and H. As noted in [12], the singular value decomposition (SVD) produces the optimal rank-k approximation with respect to the Frobenius norm. Unfortunately, this optimality frequently comes at the cost of negative elements. The factor matrices of NMF, however, are strictly nonnegative, which may facilitate direct interpretability of the factorization. Thus, although an NMF approximation may not be optimal from a mathematical standpoint, it may be sufficient and yield better insight into the dataset than the SVD for certain applications.

Upon completion of NMF, the factor matrices W and H will, in theory, approximate the original matrix A and yet contain some valuable information about the dataset in question. As presented in [10], if the approximation is close to the original data, then the factor matrices can uncover some underlying structure within the data. To reinforce this, W is commonly referred to as the feature matrix containing feature vectors that describe the themes inherent within the data, while H can be called a coefficient matrix since its columns describe how each document spans each feature and to what degree.

Currently, many implementations of NMF rely on random nonnegative initialization. As NMF is sensitive to its initial seed, this obviously hinders the reproducibility of results generated. Boutsidis and Gallopoulos [13] propose the nonnegative double singular value decomposition (NNDSVD) scheme as a possible remedy to this concern. NNDSVD aims to exploit the SVD as the optimal rank-k approximation of A. The heuristic overcomes the negative elements of the SVD by enforcing nonnegativity whenever encountered and by iteratively approximating the outer product of each pair of singular vectors. As a result, some of the properties of the data are preserved in the initial starting matrices W and H.


Table 1: Term-document matrix for the sample collection in Table 2.

d1 d2 d3 d4 d5 d6 d7 d8 d9

Alcoholism — 0.4338 — — — 0.2737 — 0.2737 0.4338

Anxiety 0.4745 — — — 0.4745 — — — —

Attack — — — — 0.6931 — — — —

Autism — — — — — — 0.7520 — 0.7520

Birth — — — — — 0.4745 — — 0.4745

Blood — — — 0.3466 0.3466 0.3466 — — —

Bone — — 0.7520 0.7520 — — — — —

Cancer — 0.4745 0.4745 — — — — — —

Cells — — — 0.6931 — — — — —

Children — — — — — — 0.4745 — 0.4745

Cirrhosis — 0.7520 — — — — — 0.7520 —

Damage — — 0.6931 — — — — — —

Defects — — — — — 0.3466 0.3466 — 0.3466

Failure — 0.4745 — — — 0.4745 — — —

Hypertension — — — — — 0.6931 — — —

Kidney — 0.4745 — — — 0.4745 — — —

Leukemia — — 1.0986 — — — — — —

Liver — 0.4745 — — — — — 0.4745 —

Marrow — — 0.7520 0.7520 — — — — —

Pressure — — — — 0.7804 0.4923 — — —

Scarring — — — — — — — 0.6931 —

Speech — — — — — — 0.6931 — —

Stress 0.4923 — — — 0.7804 — — — —

Tuberculosis — — — 0.6931 — — — — —

Table 2: Sample collection with dictionary terms displayed in bold.

Document Text

d1 Work-related stress can be considered a factor contributing to anxiety.

d2 Liver cancer is most commonly associated with alcoholism and cirrhosis. It is well-known that alcoholism can cause cirrhosis and increase the risk of kidney failure.

d3 Bone marrow transplants are often needed for patients with leukemia and other types of cancer that damage bone marrow. Exposure to toxic chemicals is a risk factor for leukemia.

d4 Different types of blood cells exist in bone marrow. Bone marrow procedures can detect tuberculosis.

d5 Abnormal stress or pressure can cause an anxiety attack. Continued stress can elevate blood pressure.

d6 Alcoholism can cause high blood pressure (hypertension) and increase the risk of birth defects and kidney failure.

d7 The presence of speech defects in children is a sign of autism. As of yet, there is no consensus on what causes autism.

d8 Alcoholism, often triggered at an early age by factors such as environment and genetic predisposition, can lead to cirrhosis. Cirrhosis is the scarring of the liver.

d9 Autism affects approximately 0.5% of children in the US. The link between alcoholism and birth defects is well-known; researchers are currently studying the link between alcoholism and autism.

Once both matrices are initialized, they can be updated using the multiplicative rule [10]:

H_{cj} \leftarrow H_{cj} \frac{\left(W^T A\right)_{cj}}{\left(W^T W H\right)_{cj}}, \qquad W_{ic} \leftarrow W_{ic} \frac{\left(A H^T\right)_{ic}}{\left(W H H^T\right)_{ic}}.   (4)
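A minimal sketch of one pass of the update rule (4) is shown below; the small constant added to the denominators to avoid division by zero is not part of the rule itself, and the random starting point is only one of the initialization options discussed later.

```python
import numpy as np

def nmf_multiplicative_step(A, W, H, eps=1e-9):
    """One pass of the multiplicative updates (4): H is updated first, then W."""
    H *= (W.T @ A) / (W.T @ W @ H + eps)
    W *= (A @ H.T) / (W @ H @ H.T + eps)
    return W, H

# Illustrative use with a random nonnegative start and k = 4 features:
# rng = np.random.default_rng(0)
# W = rng.random((A.shape[0], 4))
# H = rng.random((4, A.shape[1]))
# for _ in range(100):
#     W, H = nmf_multiplicative_step(A, W, H)
```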

2.2. Labeling Algorithm

Latent semantic indexing (LSI), which is based on the SVD, can be used to create a global picture of the data automatically. In this particular context, hierarchical trees can be constructed from pairwise distances generated from the low-rank LSI space. Distance-based algorithms such as FastME can create hierarchies that accurately approximate distance matrices in O(n^2) time [14]. Once a tree is built,


Table 3: Feature matrix W for the sample collection.

f1 f2 f3 f4

Alcoholism 0.0006 0.3503 — —

Anxiety — — 0.4454 —

Attack — — 0.4913 —

Autism — 0.0030 — 0.8563

Birth — 0.1111 0.0651 0.2730

Blood 0.0917 0.0538 0.3143 —

Bone 0.5220 — 0.0064 —

Cancer 0.1974 0.1906 — —

Cells 0.1962 — 0.0188 —

Children — 0.0019 — 0.5409

Cirrhosis 0.0015 0.5328 — —

Damage 0.2846 — — —

Defects — 0.0662 — 0.4161

Failure 0.0013 0.2988 — —

Hypertension — 0.1454 0.1106 —

Kidney 0.0013 0.2988 — —

Leukemia 0.4513 — — —

Liver 0.0009 0.3366 — —

Marrow 0.5220 — 0.0064 —

Pressure — 0.066 0.6376 —

Scarring — 0.208 — —

Speech — — — 0.4238

Stress — — 0.6655 —

Tuberculosis 0.1962 — 0.0188 —

a labeling algorithm can be applied to identify branches of the tree. Finally, a “gold standard” tree and a standard performance measure that evaluates the quality of tree labels must be defined and applied.

Given a hierarchy, few well-established automated labeling methods exist. To apply labels to a hierarchy, one can associate a weighted list of terms with each taxon. Once these lists have been determined, labeling the hierarchy is simply a matter of recursively inheriting terms up the tree from each child node; adding weights of shared terms will ensure that more frequently used terms are more likely to have a larger weight at higher levels within the tree. Intuitively, these terms are often more general descriptors.

This algorithm is robust in that it can be slightly modified and applied to any tree where a ranked list can be applied to each taxon. For example, by querying the SVD-generated vector space for each document, a ranked list of terms can be created for each document and the tree labeled accordingly. As a result, assuming the initial ranking procedure is accurate, any ontological annotation can be enhanced with terms from the text it represents.

To create a ranked list of terms from NMF, the dominant coefficient H_{ij} in H is extracted for document j. The corresponding feature W_i is then scaled by H_{ij} and assigned to the taxon representing document j, and the top 100 terms are chosen to represent the taxon. This method can be expanded to incorporate branch length information, thresholds, or multiple features.
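A sketch of this labeling step, together with the recursive inheritance of terms described above, might look as follows; the tree interface (is_leaf, children, doc_index) and the dictionary-based label lists are hypothetical conveniences, not part of the SGO implementation.

```python
import numpy as np

def taxon_term_lists(W, H, vocab, top_n=100):
    """Ranked term list for each document's taxon from NMF factors W (m x k), H (k x n)."""
    lists = []
    for j in range(H.shape[1]):
        i = np.argmax(H[:, j])                 # dominant feature for document j
        scores = H[i, j] * W[:, i]             # scale the feature vector by its coefficient
        top = np.argsort(scores)[::-1][:top_n]
        lists.append({vocab[t]: float(scores[t]) for t in top})
    return lists

def inherit_labels(node, taxon_lists):
    """Label interior nodes by summing the weights of terms inherited from children."""
    if node.is_leaf():                          # hypothetical tree interface
        node.labels = dict(taxon_lists[node.doc_index])
    else:
        node.labels = {}
        for child in node.children:
            inherit_labels(child, taxon_lists)
            for term, w in child.labels.items():
                node.labels[term] = node.labels.get(term, 0.0) + w
    return node.labels
```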

2.3. Recall Measure

Once labelings are produced for a given hierarchical tree, a measure of “goodness” must be calculated to determine which labeling is the “best.” When dealing with simple return lists of documents that can be classified as either relevant or not relevant to a user’s needs, information retrieval (IR) methods typically default to using precision and recall to describe the performance of a given retrieval system. Precision is the ratio of relevant returned items to the total number of returned items, while recall is the percentage of relevant returned items with respect to the total number of relevant items. Once a group of words is chosen to label an entity, the order of the words carries little meaning, so precision has limited usefulness in this application. When comparing a generated labeling to a “correct” one, recall is an intuitive measure.

Unfortunately, in this context, one labelled hierarchy must be compared to another. Surprisingly, relatively little work has been done that addresses this problem. Kiritchenko in [15] proposed the hierarchical precision and recall measures, denoted as hP and hR, respectively. These measures take advantage of hierarchical consistency to compare two labelings with a single number. Unfortunately, condensing all the information held in a labeled tree into a single number loses some information. In the case of NMF, the effects of parameters on labeling accuracy with respect to node depth are of interest, so a different measure would be more informative. One such measure finds the average recall of all the nodes at a certain depth within the tree. To generate nonzero recall, however, common terms must exist between the labelings being compared. Unfortunately, many of the terms present in MeSH headings are not strongly represented in the text. As a result, the text vocabulary must be mapped to the MeSH vocabulary to produce significant recall.

2.4. Feature Vector Replacement

When working with gene documents, many cases exist where the terminology used in MeSH is not found within the gene documents themselves. Even though a healthy percentage of the exact MeSH terms may exist in the corpus, the term-document matrix is so heavily overdetermined (i.e., the number of terms is significantly larger than the number of documents) that expecting significant recall values at any level within the tree becomes unreasonable. This is not to imply that the terms produced by NMF are without value. On the contrary, the value in those terms is exactly that they may reveal what was previously unknown. For the purposes of validation, however, some method must be developed that enables a user to discriminate between labelings even though both have little or no recall with the MeSH-labeled hierarchy. In effect, the vocabulary used to label the tree must be controlled for the purposes of validation and evaluation.

To produce a labeling that is mapped into the MeSH vocabulary, the top r globally-weighted MeSH headings are chosen for each document; these MeSH headings can be extracted from the MeSH metacollection [7]. By inspection of H, the dominant feature associated with each document


Table 4: Coefficient matrix H for the sample collection.

d1 d2 d3 d4 d5 d6 d7 d8 d9

f1 — 0.0409 1.6477 1.1382 0.0001 0.0007 — — —

f2 — 1.3183 — — 0.0049 0.6955 0.0003 0.9728 0.2219

f3 0.3836 — — 0.0681 1.1933 0.3327 — — —

f4 — — — — — 0.1532 0.9214 — 0.799

Table 5: Approximation to sample term-document matrix given in Table 1.

d1 d2 d3 d4 d5 d6 d7 d8 d9

Alcoholism — 0.4618 0.0010 0.0007 0.0017 0.2436 0.0001 0.3408 0.0777

Anxiety 0.1708 — — 0.0303 0.5315 0.1482 — — —

Attack 0.1884 — — 0.0334 0.5863 0.1635 — — —

Autism — 0.0040 — — — 0.1333 0.7890 0.0029 0.6848

Birth 0.0250 0.1464 — 0.0044 0.0783 0.1407 0.2516 0.1080 0.2428

Blood 0.1206 0.0746 0.1511 0.1258 0.3754 0.1420 — 0.0523 0.0119

Bone 0.0025 0.0214 0.8602 0.5946 0.0077 0.0025 — — —

Cancer — 0.2593 0.3252 0.2247 0.001 0.1327 0.0001 0.1854 0.0423

Cells 0.0072 0.0080 0.3233 0.2246 0.0224 0.0064 — — —

Children — 0.0025 — — — 0.0842 0.4984 0.0019 0.4326

Cirrhosis — 0.7025 0.0024 0.0017 0.0026 0.3705 0.0002 0.5183 0.1183

Damage — 0.0116 0.4689 0.3239 — 0.0002 — — —

Defects — 0.0873 — — 0.0003 0.1098 0.3834 0.0644 0.3472

Failure — 0.3939 0.0022 0.0015 0.0015 0.2078 0.0001 0.2906 0.0663

Hypertension 0.0424 0.1916 — 0.0075 0.1327 0.1379 — 0.1414 0.0323

Kidney — 0.3939 0.0022 0.0015 0.0015 0.2078 0.0001 0.2906 0.0663

Leukemia — 0.0185 0.7437 0.5137 — 0.0003 — — —

Liver — 0.4437 0.0015 0.0011 0.0017 0.2341 0.0001 0.3274 0.0747

Marrow 0.0025 0.0214 0.8602 0.5946 0.0077 0.0025 — — —

Pressure 0.2445 0.0870 — 0.0434 0.7612 0.2580 — 0.0642 0.0147

Scarring — 0.2742 — — 0.0010 0.1446 0.0001 0.2023 0.0462

Speech — — — — — 0.0649 0.3905 — 0.3386

Stress 0.2553 — — 0.0453 0.7942 0.2214 — — —

Tuberculosis 0.0072 0.0080 0.3233 0.2246 0.0224 0.0064 — — —

Table 6: Top 5 words for each feature from the sample collection.

f1 f2 f3 f4

Bone Cirrhosis Stress Autism

Marrow Alcoholism Pressure Children

Leukemia Liver Attack Speech

Damage Kidney Anxiety Defects

Cancer Failure Blood Birth

is chosen and assigned to that document. The corresponding top r MeSH headings are then themselves parsed into tokens and assigned to a new MeSH feature vector appropriately scaled by the corresponding coefficient in H. The feature vector replacement algorithm is given in Algorithm 1. Note that m′ is distinguished from m since the dictionary of MeSH headings will likely differ in size and composition from the original corpus dictionary. The number of documents, however, remains constant.

Once full MeSH feature vectors have been constructed, the tree can be labeled via the procedure outlined in [7]. As a result of this replacement, better recall can be expected, and the specific word usage properties inherent in the MeSH (or any other) ontology can be exploited.

2.5. Alternative Labeling Method

An alternative method to label a tree is to vary the parameter k from (2) with node depth. In theory, more pertinent and accurate features will be preserved if the clusters inherent in the NMF coincide with those in the tree generated via the SVD space. For smaller clusters and more specific terms,


Table 7: Rearranged term-document matrix for the sample collection.

d3 d4 d2 d6 d8 d1 d5 d7 d9

Bone 0.7520 0.7520 — — — — — — —

Cancer 0.4745 — 0.4745 — — — — — —

Cells — 0.6931 — — — — — — —

Damage 0.6931 — — — — — — — —

Leukemia 1.0986 — — — — — — — —

Marrow 0.7520 0.7520 — — — — — — —

Tuberculosis — 0.6931 — — — — — — —

Alcoholism — — 0.4338 0.2737 0.2737 — — — 0.4338

Cirrhosis — — 0.7520 — 0.7520 — — — —

Failure — — 0.4745 0.4745 — — — — 0.4745

Hypertension — — — 0.6931 — — — — —

Kidney — — 0.4745 0.4745 — — — — 0.4745

Liver — — 0.4745 — 0.4745 — — — —

Scarring — — — — 0.6931 — — — —

Anxiety — — — — — 0.4745 0.4745 — —

Attack — — — — — — 0.6931 — —

Blood — 0.3466 — 0.3466 — — 0.3466 — —

Pressure — — — 0.4923 — — 0.7804 — —

Stress — — — — — 0.4923 0.7804 — —

Autism — — — — — — — 0.7520 0.7520

Birth — — — 0.4745 — — — — 0.4745

Children — — — — — — — 0.4745 0.4745

Defects — — — 0.3466 — — — 0.3466 0.3466

Speech — — — — — — — 0.6931 —

Input: MeSH term-by-document matrix A′ (m′ × n); factor matrices W (m × k) and H (k × n) of the original term-by-document matrix A (m × n); global weight vector g′; threshold r, the number of MeSH headings to represent each document
Output: MeSH feature matrix W′

for i = 1 : n do
    Choose the r top globally-weighted MeSH headings from the ith column of A′
    Determine j = arg max_{j<k} H_{ji}
    for h = 1 : r do
        Parse MeSH heading h into tokens
        Add each token t with index p to w′_j, the jth column of W′, i.e., W′_{pj} = W′_{pj} + g′_p × H_{ji}
    end for
end for

Algorithm 1: Feature vector replacement algorithm.
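A compact NumPy rendering of Algorithm 1 is sketched below, under the assumptions that A′ already holds globally weighted MeSH heading scores, that headings are split into tokens by whitespace, and that token_index maps each MeSH token to a row of the new feature matrix; these names and the tokenisation are illustrative rather than the authors' implementation.

```python
import numpy as np

def mesh_feature_matrix(A_mesh, H, g, headings, token_index, r=10):
    """Sketch of Algorithm 1: build a MeSH feature matrix W' from the coefficient matrix H.

    A_mesh      : MeSH heading-by-document matrix (rows follow `headings`)
    H           : k x n NMF coefficient matrix of the original collection
    g           : global weights of the MeSH tokens (indexed like `token_index` values)
    headings    : list of MeSH heading strings, one per row of A_mesh
    token_index : dict mapping a MeSH token to its row p in the new feature matrix
    """
    k, n = H.shape
    W_new = np.zeros((len(token_index), k))
    for i in range(n):
        top = np.argsort(A_mesh[:, i])[::-1][:r]   # r top-weighted headings for document i
        j = np.argmax(H[:, i])                     # dominant feature of document i
        for h in top:
            for token in headings[h].lower().split():
                p = token_index.get(token)
                if p is not None:
                    W_new[p, j] += g[p] * H[j, i]
    return W_new
```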

higher k should be necessary; conversely, the ancestor nodes should require smaller k and more general terms since they cover a larger set of genes spanning a larger set of topics. Inheritance of terms can be performed once again by inheriting common terms; however, an upper threshold of inheritance can be imposed. For example, for all the nodes in the subtree induced by a node p, high k can be used. If all the genes induced by p are clustered together by NMF, then all the nodes in the subtree induced by p will maintain the same labels. For the ancestor of p, a different value of k can be used. Although this method requires some manual curation, it can potentially produce more accurate labels.

3. Results

The evaluation of the factorization produced by NMF is nontrivial as there is no set standard for examining the quality of basis vectors produced. In several studies thus far, the results of NMF runs have been evaluated by domain experts. For example, Chagoyen et al. [16] performed several NMF runs and then independently asked domain experts to interpret the resulting feature vectors. This approach, however, limits the usefulness of NMF, particularly in discovery-based genomic studies for which domain experts are not readily available. Here, two different automated protocols are presented to evaluate NMF results. First, the mathematical properties of the NMF runs are examined, then the accuracy of the application of NMF to hierarchical trees is scrutinized.

3.1. Input Parameters

To test NMF, the 50TG collection presented in [2] was used. This collection was constructed manually by selecting genes known to be associated with at least one of the following categories: (1) development, (2) Alzheimer’s disease, and (3) cancer biology. Each gene document is simply a concatenation of all titles and abstracts of the MEDLINE citations cross-referenced in the mouse, rat, and human EntrezGene (formerly LocusLink) entries for each gene.

Two different NMF initialization strategies were used: the NNDSVD [17] and randomization. Five different random trials were conducted, while four were performed using the NNDSVD method. Although the NNDSVD produces a static starting matrix, different methods can be applied to remove zeros from the initial approximation to prevent them from getting “locked” throughout the update process. Initializations that maintained the original zero elements are denoted NNDSVDz, while NNDSVDa, NNDSVDe, and NNDSVDme substitute the average of all elements of A, ε, or ε_machine, respectively, for those zero elements; ε was set to 10^{-9} and was significantly smaller than the smallest observed value in either H or W (typically around 10^{-3}), while ε_machine was the machine epsilon (the smallest positive value the computer could represent) at approximately 10^{-324}. Both NNDSVDz and NNDSVDa were described previously in [13], whereas NNDSVDe and NNDSVDme are added in this study as natural extensions to NNDSVDz that would not suffer from the restrictions of locking zeros due to the multiplicative update. The parameter k was assigned the values of 2, 4, 6, 8, 10, 15, 20, 25, and 30.

Each of the NMF runs iterated until it reached 1,000 iterations or a stationary point in both W and H. That is, at iteration i, when ‖W_{i−1} − W_i‖_F < τ and ‖H_{i−1} − H_i‖_F < τ, convergence is assumed. The parameter τ was set to 0.01. Since convergence is not guaranteed under all constraints, if the objective function increased between iterations, the factorization was stopped and assumed not to converge. The log-entropy term-weighting scheme (see [8]) was used to generate the original token weights for each collection.
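The stopping logic can be summarised by a loop of the following form; this is a paraphrase of the rules above (τ = 0.01, at most 1,000 iterations, abort if the objective rises), not the exact code used in the study.

```python
import numpy as np

def run_nmf(A, W, H, tau=0.01, max_iter=1000, eps=1e-9):
    """Multiplicative updates with the stationarity and monotonicity checks described above."""
    prev_obj = np.linalg.norm(A - W @ H, "fro") ** 2
    for _ in range(max_iter):
        W_old, H_old = W.copy(), H.copy()
        H *= (W.T @ A) / (W.T @ W @ H + eps)      # same multiplicative updates as Eq. (4)
        W *= (A @ H.T) / (W @ H @ H.T + eps)
        obj = np.linalg.norm(A - W @ H, "fro") ** 2
        if obj > prev_obj:                        # objective increased: assume no convergence
            return W, H, False
        if (np.linalg.norm(W - W_old, "fro") < tau and
                np.linalg.norm(H - H_old, "fro") < tau):
            break                                 # stationary point in both W and H
        prev_obj = obj
    return W, H, True
```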

3.2. Relative Error and Convergence

Figure 1: Error measures for the SVD, best NMF run, and average NMF run for the 50TG collection.

The SVD produces the mathematically optimal low-rank approximation of any matrix with respect to the Frobenius norm, and for all other unitarily-invariant matrix norms. Whereas NMF can never produce a more accurate approximation than the SVD, its proximity to A relative to the SVD can be measured. Namely, the relative error, computed as

RE = \frac{\|A - WH\|_F - \|A - USV^T\|_F}{\|A - USV^T\|_F},   (5)

where both factorizations are truncated after k dimensions (or factors), can show how close the feature vectors produced by the NMF are to the optimal basis [18].
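Given factors W and H, the relative error (5) can be computed directly against a truncated SVD, for example as in the sketch below (the function name is ours).

```python
import numpy as np

def relative_error(A, W, H):
    """Relative error (5) of an NMF approximation against the rank-k truncated SVD."""
    k = W.shape[1]
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    A_svd = U[:, :k] * s[:k] @ Vt[:k, :]          # optimal rank-k approximation of A
    err_nmf = np.linalg.norm(A - W @ H, "fro")
    err_svd = np.linalg.norm(A - A_svd, "fro")
    return (err_nmf - err_svd) / err_svd
```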

Intuitively, as k increases, the NMF factorization should more closely approximate A. As shown in Figure 1, this is exactly the case. Surprisingly, however, the average of all converging NMF runs is under 10% relative error compared to the SVD, with that error tending to rise as k increases. The proximity of the NMF to the SVD implies that, for this small dataset, NMF can accurately approximate the data.

Next, several different initialization methods (discussed in Section 3.1) were examined. To study the effects on convergence, one set of NMF parameters must be chosen as the baseline against which to compare. By examining the NMF with no additional constraints, the NNDSVDa initialization method consistently produces the most accurate approximation when compared to NNDSVDe, NNDSVDme, NNDSVDz, and random initialization [7]. The relative error NNDSVDa generates is less than 1% for most tested values of k. Unfortunately, NNDSVDa requires several hundred iterations to converge.

NNDSVDe performs comparably to NNDSVDa with regard to relative error, often within a fraction of a percent. For smaller values of k, NNDSVDe takes significantly longer to converge than NNDSVDa, although the exact opposite is true for larger values of k. NNDSVDz, on the other hand, converges much faster for smaller values of k at the cost of accuracy, as the locked zero elements have an adverse effect on the best solution that can be converged upon. Not surprisingly, NNDSVDme performed comparably to NNDSVDz in many cases; however, it was able to achieve slightly more accurate approximations as the number of iterations increased. In fact, NNDSVDme was identical to NNDSVDz in most cases and will not be mentioned henceforth unless noteworthy behavior is observed. Random initialization performs comparably to NNDSVDa in terms of accuracy and favorably in terms of speed for small k, but as k increases, both speed and accuracy suffer. A graph illustrating the convergence rates when k = 25 is depicted in Figure 2.

In terms of actual elapsed time, the improved performance of the NNDSVD does not come without a cost. In the context of SGO, the time spent computing the initial SVD of A for the first step of the NNDSVD algorithm is assumed to be zero since the SVD is needed a priori for querying purposes. However, the initialization time required to complete the NNDSVD when k = 25 is nearly 21 seconds, while the cost for random initialization is relatively negligible. All runs were performed on a machine running Debian Linux 3.0 with an Intel Pentium III 1-GHz processor and 256-MB memory. Since the cost of each NMF iteration is nearly 0.015 seconds per k (when k = 25), the cost of performing the NNDSVD is (approximately) equivalent to 55 NMF iterations. Convergence taking into account this cost is shown in Figure 3.

3.3. Labeling Recall

Measuring recall is a quantitative way to validate “known” information within a hierarchy. Here, a method was developed to measure recall at various branch points in a hierarchical tree (described in Section 2.3). The gold standard used for measuring recall included the MeSH headings associated with gene abstracts. The mean average recall (MAR) denotes the value attained when the average recall at each level is averaged across all branches of the tree. Here, a hierarchy level refers to all nodes that share the same distance (number of edges) from the root. This section discusses the parameter settings that provided the best labelings, both in the local and global sense, for the tree generated in [2] with 47 interior nodes spread across 11 levels.
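Under one reading of this definition, recall is averaged over the interior nodes at each level and those per-level averages are then combined to give the MAR; the small sketch below follows that reading, with the node-by-level and label dictionaries as assumed inputs.

```python
def recall(predicted_terms, gold_terms):
    """Fraction of gold-standard terms recovered in the predicted label set."""
    gold = set(gold_terms)
    return len(gold & set(predicted_terms)) / len(gold) if gold else 0.0

def mean_average_recall(nodes_by_level, predicted, gold):
    """Average recall per tree level, then the mean of those per-level averages (MAR).

    nodes_by_level : dict level -> list of interior-node ids at that depth
    predicted, gold: dicts node id -> list of label terms
    """
    level_avgs = []
    for level, nodes in sorted(nodes_by_level.items()):
        vals = [recall(predicted[n], gold[n]) for n in nodes]
        level_avgs.append(sum(vals) / len(vals))
        print(f"level {level}: average recall = {level_avgs[-1]:.2f}")
    return sum(level_avgs) / len(level_avgs)
```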

After applying the labeling algorithm described in Section 2.2 to the factors produced by NMF, the MAR generated was very low (under 25%). Since the NMF-generated vocabulary did not overlap well with the MeSH dictionary, the NMF features were mapped into MeSH features via the procedure outlined in Algorithm 1, where the most dominant feature represented each document only if the corresponding weight in the H matrix was greater than 0.5. Also, the top 10 MeSH headings were chosen to represent each document, and the top 100 corresponding terms were extracted to formulate each new MeSH feature vector. Consequently, the resulting MeSH feature vectors produced labelings with greatly increased MAR.

With regard to the accuracy of the labelings, several trends exist. As k increases, the achieved MAR increases as

Figure 2: Convergence graph comparing the NNDSVDa, NNDSVDe, NNDSVDme, NNDSVDz, and best random NMF runs of the 50TG collection for k = 25 (vertical axis: ||A−WH||; horizontal axis: iteration).

well. This behavior could be predicted since increasing the number of features also increases the size of the effective labeling vocabulary, thus enabling a more robust labeling. When k = 25, the average MAR across all runs is approximately 68%.

Since the NNDSVDa initialization provided the best convergence properties, it will be used as the baseline against which to compare. If k is not specified, assume k = 25. In terms of MAR, NNDSVDa produced below-average results, with both NNDSVDe and NNDSVDz consistently outperforming NNDSVDa for most values of k; NNDSVDe and NNDSVDz attained similar MAR values, as depicted in Figure 4. The recall of the baseline case using NNDSVDa and k = 25, broken down by node level, is shown in Figure 6.

The 11 node levels of the 50TG hierarchical tree [2] shown in Figure 5 can be broken into thirds to analyze the accuracy of a labeling within a depth region of the tree. The MAR for NNDSVDa over the three thirds is approximately 58%, 63%, and 54%, respectively. With respect to the topmost third of the tree, any constraint applied to any NNDSVD initialization, other than smoothing W applied to NNDSVDa, provided an improvement over the 58% MAR; in all such cases, the resulting MAR was at least 75%. NNDSVDa performed slightly below average over the middle third at 63%. Overall, nearly any constraint improved or matched recall over the base case across all thirds, with the exception that enforcing sparsity on H underperformed NNDSVDa in the bottom third of the tree; all other constraints achieved at least 54% MAR for the bottom third.

With respect to different values of k, similar tendencies exist over all thirds. NNDSVDa is among the worst in terms



Figure 3: Convergence graph comparing the NNDSVDa, NNDSVDe, NNDSVDme, NNDSVDz, and best random NMF runs of the 50TG collection for k = 25, taking into account initialization time (vertical axis: ||A−WH||; horizontal axis: elapsed time in seconds).

Table 8: Genes comprising each leaf node of the tree shown in Figure 7.

A       B       C       D       E
a2m     apoe    dab1    atoh1   cdk5
apba1   app     lrp8    dll1    cdk5r
apbb1   psen1   reln    jag1    cdk5r2
aplp1   psen2   vldlr   notch1  fyn
aplp2   —       —       —       mapt
lrp1    —       —       —       —
shc1    —       —       —       —

of MAR, with the exception that it does well in the topmost third when k is either 2 or 4. There was no discernible advantage when comparing NNDSVD initialization to its random counterpart. Overall, the best NNDSVD (and hence reproducible) MAR was achieved using NNDSVDe and k = 30 (also shown in Figure 6).

3.4. Labeling Evaluation

Although relative error and recall are measures that can automatically evaluate a labeling, ultimately the final evaluation still requires some manual observation and interpretation. For example, assuming the tree given in Figure 7 with leaf nodes representing the gene clusters given in Table 8, one possible labeling using MeSH headings generated from Algorithm 1 is given in Table 9, and a sample NMF-generated labeling is given in Table 10.

Figure 4: MAR as a function of k under the various NNDSVD initialization schemes with no constraints for the 50TG collection.

As expected, many of the MeSH terms were too general and were also associated with many of the 5 gene clusters, for example, genetics, proteins, chemistry, and cell. However, some MeSH terms were indeed useful in describing the function of the gene clusters. For example, Cluster A MeSH labels are suggestive of the LDL and alpha-macroglobulin receptor protein family; Cluster B MeSH labels are associated with Alzheimer's disease and amyloid beta metabolism; Cluster C labels are associated with extracellular matrix and cell adhesion; Cluster D labels are associated with embryology and inhibitors; and Cluster E labels are associated with tau protein and lymphocytes.

In contrast to MeSH labeling, the text labeling by NMF was much more specific and functionally descriptive. In general, the first few terms (highest ranking terms) in each cluster defined either the gene name or an alias. Interestingly, each cluster also contained terms that were functionally significant. For example, rap (Cluster A) is known to be a ligand for the a2m and lrp1 receptors. In addition, the 4 genes in Cluster C are known to be part of a molecular signaling pathway involving Cajal-Retzius cells in the brain that control neuronal positioning during development. Lastly, the physiological effects of Notch1 (Cluster D) have been linked to activation of the intracellular transcription factors Hes1 and Hes5.

Importantly, the specific nature of text labeling by NMF allows identification of previously unknown functional connections between genes and clusters of genes. For example, the term PS1 appeared in both Cluster B and Cluster D. This finding is very interesting in that PS1 encodes a protein which is part of a protease complex called gamma secretases.



Table 9: Top 10 MeSH terms for the leaf nodes of the tree shown in Figure 7.

A               B                C              D             E
Metabolism      Protein          Genetics       Genetics      Metabolism
Genetics        Amyloid          Molecules      Proteins      Proteins
Protein         Beta             Neuronal       Metabolism    Genetics
Proteins        Genetics         Adhesion       Membrane      Tau
Receptor        Metabolism       Cell           Cell          Protein
Related         Precursor        Metabolism     Physiology    Lymphocyte
ldl             Chemistry        Proteins       Cytology      p56
Macroglobulins  Apolipoproteins  Extracellular  Embryology    Specific
Alpha           Disease          Matrix         Biosynthesis  lck
Chemistry       Alzheimer        Biosynthesis   Inhibitors    Tyrosine

Table 10: Top 10 terms for the leaf nodes of the tree shown in Figure 7.

A                 B                C               D        E
lrp               Apoe             reelin          Notch    fyn
Receptor-related  ps1              reeler          notch1   Tau
Lipoprotein       Amyloid          dab1            jagged1  cdk5
fe65              Abeta            vldlr           notch-1  lck
app               Presenilin       apoer2          hes5     sh3
Alpha             Epsilon          Positioning     Fringe   nmda
rap               Apolipoprotein   Cajal-retzius   hes-1    Ethanol
Abeta             Alzheimer        apoe            hes1     Phosphorylation
Beta-amyloid      ad               Apolipoprotein  hash1    Alcohol
Receptor          Gamma-secretase  Lipoprotein     ps1      tcr

In addition to cleaving the Alzheimer protein APP, gamma secretases have been shown to cleave the developmentally important Notch protein. Therefore, these results indicate that NMF labeling provides a useful tool for discovering new functional associations between genes in a cluster as well as across multiple gene clusters.

4. Discussion

When comparing NMF runs, several trends can be observed with respect to both mathematical properties and recall tendencies. First, and as expected, as k increases, the approximation achieved by the SVD with respect to A is more accurate; the NMF can provide a relatively close approximation to A in most cases, but the error also increases with k. Second, NNDSVDa provides the fastest convergence, in terms of number of iterations, to the closest approximations. Third, applying additional constraints such as smoothing and sparsity [7] has little noticeable effect on both convergence and recall, and in many cases greatly decreases the likelihood that a stationary point will be reached. Finally, to generate relatively “good” approximation error (within 5%), about 20–40 iterations are recommended using either NNDSVDa or NNDSVDe initialization with no additional constraints when k is reasonably large (about half the number of documents). For smaller k, performing

approximately 25 iterations under random initialization will usually accomplish 5% relative error, with the number of iterations required decreasing as k decreases.

While measuring error norms and convergence is useful to expose mathematical properties and structural tendencies of the NMF, the ultimate goal of this application is to provide a useful labeling of a hierarchical tree from the NMF. In many cases, the “best” labeling may be provided by a suboptimal run of NMF. Overall, more accurate labelings resulted from higher values of k because more feature vectors increased the vocabulary size of the labeling dictionary. Generally speaking, the NNDSVDe, NNDSVDme, and NNDSVDz schemes outperformed the NNDSVDa initialization. Overall, the accuracy of the labelings appeared to be more a function of k and the initial seed than of the constraints applied.

Much research is being performed concerning the NMF, and this work examines three methods based on the multiplicative update (see Section 2.1). Many other NMF variations exist and more are being developed, so their application to the biological realm should be studied. For example, [19] proposes a hybrid least squares approach called GD-CLS to solve NMF that overcomes the problem of “locking” zeroed elements encountered by MM, [20, 21] propose nonsmooth NMF as an alternative method to incorporate sparseness, and [22] proposes an NMF technique that generates three factor matrices and has shown promising clustering results. NMF has been applied to microarray data [23], but efforts need to



Figure 5: Hierarchical tree for a 50 test gene (50TG) collection described in [2] using updated MEDLINE abstracts.

be made to combine the text information with microarray data; some variation of tensor factorization could possibly show how relationships change over time [24].

With respect to labeling methods, MeSH heading labels were generally useful but provided few specific details about the functional relationships between the genes in a cluster. On the other hand, text labeling provided specific and detailed information regarding the function of the genes in a cluster. Importantly, term labels provided some specific connections between groups of genes that were not readily apparent. Thus, term labeling offers a distinct advantage for discovering new relationships between genes and can aid in the interpretation of high-throughput data.

Regardless of the techniques employed, one of the issues that will always be prevalent regarding biological data is that of quality versus quantity. Inherently related to this

Figure 6: Recall as a function of node level for the NNDSVD initialization on the 50TG collection. The achieved MAR for the baseline case is 58.95%, while the best achieved MAR for the NNDSVD initialization is 74.56%.

Figure 7: A hierarchical tree containing a set of genes related to Alzheimer's disease (leaf nodes A and B), brain development (leaf nodes C and D), or both Alzheimer's disease and brain development (leaf node E).

problem is the establishment of standards within the field, especially as they pertain to hierarchical data. Efforts such as the gene ontology (GO) are being built and refined [25], but standard datasets for comparing results and clearly defined (and accepted) evaluation measures could facilitate more meaningful comparisons between methods.

In the case of SGO, developing methods to derive “known” data is a major issue (even GO does not produce a “gold standard” hierarchy given a set of genes). Access to more data and to other hierarchies would help test the robustness of the method, but that remains one of the problems inherent in the field. In general, approximations that are more mathematically optimal do not always produce the “best” labeling. Often, factorizations provided by the NMF can be deemed “good enough,” and the final evaluation will remain subjective. In the end, if automated approaches can



approximate that subjectivity, then greater understanding of more data will result.

Acknowledgments

This work was supported by the Center for Information Technology Research and the Science Alliance Computational Sciences Initiative at the University of Tennessee and by the National Institutes of Health under Grant no. HD52472-01. The authors would like to thank the anonymous referees for their comments and suggestions for improving the manuscript.

References

[1] K. E. Heinrich, “Finding functional gene relationships using the semantic gene organizer (SGO),” M.S. thesis, Department of Computer Science, University of Tennessee, Knoxville, Tenn, USA, 2004.
[2] R. Homayouni, K. Heinrich, L. Wei, and M. W. Berry, “Gene clustering by latent semantic indexing of MEDLINE abstracts,” Bioinformatics, vol. 21, no. 1, pp. 104–115, 2005.
[3] G. Golub and C. Van Loan, Matrix Computations, Johns Hopkins University Press, Baltimore, Md, USA, 3rd edition, 1996.
[4] P. Paatero and U. Tapper, “Positive matrix factorization: a nonnegative factor model with optimal utilization of error estimates of data values,” Environmetrics, vol. 5, no. 2, pp. 111–126, 1994.
[5] D. D. Lee and H. S. Seung, “Learning the parts of objects by nonnegative matrix factorization,” Nature, vol. 401, no. 6755, pp. 788–791, 1999.
[6] L. Weixiang, Z. Nanning, and Y. Qubo, “Nonnegative matrix factorization and its applications in pattern recognition,” Chinese Science Bulletin, vol. 51, no. 1, pp. 7–18, 2006.
[7] K. E. Heinrich, “Automated gene classification using nonnegative matrix factorization on biomedical literature,” Ph.D. thesis, Department of Computer Science, University of Tennessee, Knoxville, Tenn, USA, 2007.
[8] M. W. Berry and M. Browne, Understanding Search Engines: Mathematical Modeling and Text Retrieval, SIAM, Philadelphia, Pa, USA, 1999.
[9] S. Wild, J. Curry, and A. Dougherty, “Motivating nonnegative matrix factorizations,” in Proceedings of the 8th SIAM Conference on Applied Linear Algebra (LA '03), Williamsburg, Va, USA, June 2003.
[10] D. D. Lee and H. S. Seung, “Algorithms for nonnegative matrix factorization,” in Advances in Neural Information Processing Systems, T. K. Leen, T. G. Dietterich, and V. Tresp, Eds., vol. 13, pp. 556–562, MIT Press, Cambridge, Mass, USA, 2001.
[11] M. W. Berry, M. Browne, A. N. Langville, V. P. Pauca, and R. J. Plemmons, “Algorithms and applications for approximate nonnegative matrix factorization,” Computational Statistics and Data Analysis, vol. 52, no. 1, pp. 155–173, 2007.
[12] C. Eckart and G. Young, “The approximation of one matrix by another of lower rank,” Psychometrika, vol. 1, no. 3, pp. 211–218, 1936.
[13] C. Boutsidis and E. Gallopoulos, “On SVD-based initialization for nonnegative matrix factorization,” Tech. Rep. HPCLAB-SCG-6/08-05, University of Patras, Patras, Greece, 2005.
[14] R. Desper and O. Gascuel, “Fast and accurate phylogeny reconstruction algorithms based on the minimum-evolution principle,” Journal of Computational Biology, vol. 9, no. 5, pp. 687–705, 2002.
[15] S. Kiritchenko, “Hierarchical text categorization and its applications to bioinformatics,” Ph.D. thesis, University of Ottawa, Ottawa, Canada, 2005.
[16] M. Chagoyen, P. Carmona-Saez, H. Shatkay, J. M. Carazo, and A. Pascual-Montano, “Discovering semantic features in the literature: a foundation for building functional associations,” BMC Bioinformatics, vol. 7, article 41, pp. 1–19, 2006.
[17] C. Boutsidis and E. Gallopoulos, “SVD based initialization: a head start for nonnegative matrix factorization,” Tech. Rep. HPCLAB-SCG-02/01-07, University of Patras, Patras, Greece, 2007.
[18] A. Langville, C. Meyer, and R. Albright, “Initializations for the nonnegative matrix factorization,” preprint, 2006.
[19] F. Shahnaz, M. W. Berry, V. P. Pauca, and R. J. Plemmons, “Document clustering using nonnegative matrix factorization,” Information Processing & Management, vol. 42, no. 2, pp. 373–386, 2006.
[20] A. Pascual-Montano, J. M. Carazo, K. Kochi, D. Lehmann, and R. D. Pascual-Marqui, “Nonsmooth nonnegative matrix factorization (nsNMF),” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 28, no. 3, pp. 403–415, 2006.
[21] P. Carmona-Saez, R. D. Pascual-Marqui, F. Tirado, J. M. Carazo, and A. Pascual-Montano, “Biclustering of gene expression data by non-smooth nonnegative matrix factorization,” BMC Bioinformatics, vol. 7, article 78, pp. 1–18, 2006.
[22] C. Ding, T. Li, W. Peng, and H. Park, “Orthogonal nonnegative matrix tri-factorizations for clustering,” in Proceedings of the 12th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 126–135, ACM Press, Philadelphia, Pa, USA, August 2006.
[23] J.-P. Brunet, P. Tamayo, T. R. Golub, and J. P. Mesirov, “Metagenes and molecular pattern discovery using matrix factorization,” Proceedings of the National Academy of Sciences of the United States of America, vol. 101, no. 12, pp. 4164–4169, 2004.
[24] A. Cichocki, R. Zdunek, S. Choi, R. Plemmons, and S.-I. Amari, “Novel multi-layer nonnegative tensor factorization with sparsity constraints,” in Proceedings of the 8th International Conference on Adaptive and Natural Computing Algorithms (ICANNGA '07), vol. 4432 of Lecture Notes in Computer Science, pp. 271–280, Warsaw, Poland, April 2007.
[25] M. Ashburner, C. A. Ball, J. A. Blake, et al., “Gene ontology: tool for the unification of biology,” Nature Genetics, vol. 25, no. 1, pp. 25–29, 2000.


Hindawi Publishing Corporation
Computational Intelligence and Neuroscience
Volume 2008, Article ID 642387, 10 pages
doi:10.1155/2008/642387

Research Article

Single-Trial Decoding of Bistable Perception Based on Sparse Nonnegative Tensor Decomposition

Zhisong Wang,1 Alexander Maier,2 Nikos K. Logothetis,3 and Hualou Liang1

1 School of Health Information Sciences, University of Texas Health Science Center at Houston, 7000 Fannin, Suite 600, Houston, TX 77030, USA

2 Unit on Cognitive Neurophysiology and Imaging, National Institutes of Health, Building 49, Room B2J-45, MSC-4400, 49 Convent Dr., Bethesda, MD 20892, USA

3 Max Planck Institut für biologische Kybernetik, Spemannstrasse 38, 72076 Tübingen, Germany

Correspondence should be addressed to Hualou Liang, [email protected]

Received 13 November 2007; Accepted 13 March 2008

Recommended by Paris Smaragdis

The study of the neuronal correlates of the spontaneous alternation in perception elicited by bistable visual stimuli is promising for understanding the mechanism of neural information processing and the neural basis of visual perception and perceptual decision-making. In this paper, we develop a sparse nonnegative tensor factorization (NTF)-based method to extract features from the local field potential (LFP), collected from the middle temporal (MT) visual cortex in a macaque monkey, for decoding its bistable structure-from-motion (SFM) perception. We apply the feature extraction approach to the multichannel time-frequency representation of the intracortical LFP data. The advantage of the sparse NTF-based feature extraction approach lies in its capability to yield components common across the space, time, and frequency domains yet discriminative across different conditions, without prior knowledge of the discriminating frequency bands and temporal windows for a specific subject. We employ a support vector machine (SVM) classifier based on the features of the NTF components for single-trial decoding of the reported perception. Our results suggest that although other bands also have certain discriminability, the gamma band feature carries the most discriminative information for bistable perception, and that imposing sparseness constraints on the nonnegative tensor factorization improves extraction of this feature.

Copyright © 2008 Zhisong Wang et al. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

1. Introduction

The question of how percepts arise from activity in the cortex is of central importance to many issues in cognitive neuroscience. To address this question, one important experimental paradigm is to dissociate percepts from the visual inputs using bistable stimuli. The study of bistable perception holds great promise for understanding the neural correlates of visual perception [1]. Spiking activity has been extensively studied in brain research to determine the relationship between neuronal responses and perceptual reports during ambiguous visual stimulation in the middle temporal area (MT) of macaque monkeys [2, 3]. However, spiking data as collected with standard neurophysiological techniques only provide information about the outputs of a small number of neurons within a given brain area. The local field potential (LFP) has recently attracted increasing attention in the analysis of neuronal population activity [4, 5]. The LFP is thought to

largely arise from the dendritic activity of local populations of neurons and is dominated by the excitatory synaptic inputs to a cortical area as well as intra-areal local processing. The investigation of the correlations between perceptual reports and LFP oscillations during physically identical but perceptually ambiguous conditions may shed new light on the mechanism of neural information processing and the neural basis of visual perception and perceptual decision-making.

One important research direction in the field of neuroscience is to study the rhythmic brain activity during different tasks. For example, it has been found that the beta and mu bands are associated with event-related desynchronization and the gamma band is associated with event-related synchronization for movement and motor imagery tasks [6, 7], and that the gamma band is also associated with memory and attention [4, 8]. The brain oscillations for bistable



perceptual discrimination, on the other hand, are not easy to distinguish, and it remains largely unknown which band is the most discriminative for bistable perception. In line with the recent literature, in this paper we find that the gamma oscillation is particularly discriminative for distinguishing different percepts. For neurobiological time series, the underlying processes are often nonstationary. To reveal the temporal structure of LFP, the LFP spectrum at a certain time and frequency is often analyzed. For example, the short-time Fourier transform (STFT) provides a means of joint time-frequency analysis by applying moving windows to the signal and Fourier transforming the signal within each window [9]. With technological advances, multichannel intracortical recordings have become available and provide new opportunities to study how populations of neurons interact to produce a certain perceptual outcome. However, different channels of LFP may record not only brain activity correlated with the percept but also background ongoing activity that is not percept-correlated. It is therefore of interest to decompose the multichannel time-varying LFP spectrum into multiple components with distinct modalities in the space, time, and frequency domains, and to identify among them the components that are common across different domains and at the same time discriminative across different conditions.

The conventional two-way decomposition approaches include principal component analysis (PCA), independent component analysis (ICA), and linear discriminant analysis (LDA), which extract features from two-way data (matrices) by decomposing them into different factors (modalities) based on orthogonality, independence, and discriminability, respectively. However, PCA, ICA, and LDA all represent data in a holistic way, with their factors combined both additively and subtractively. For two-way decomposition of nonnegative data matrices, it is intuitive to allow only nonnegative factors to achieve an easily interpretable parts-based representation of the data. Such an approach is called nonnegative matrix factorization (NMF) [10, 11]. In practical applications, multiway data (tensors) with three or more modalities often exist. If two-way decomposition approaches are to be used under these circumstances, tensors have to be first converted into matrices by unfolding several modalities. However, such unfolding may lose some information specific to the unfolded modalities and make it less easy to interpret the decomposed components. Therefore, to obtain a more natural representation of the original data structure, it is recommended to use tensor decomposition approaches to factorize multiway data. The PARAFAC and TUCKER models are typical models for tensor factorization [12–14]. Their difference lies in that the TUCKER model permits interactions within each modality while the PARAFAC model does not. The PARAFAC model is often used due to two advantages it possesses. First, it is the simplest and most parsimonious multiway model, and hence its parameter estimation is easier than for all the other multiway models. Second, it can achieve unique tensor decomposition up to trivial permutation, sign changes, and scaling as long as several weak conditions are satisfied [15, 16]. In neuroscientific applications, the PARAFAC model has been used to analyze the three-way space-time-frequency representation of EEG data [17, 18].

However, the original PARAFAC model does not impose nonnegativity constraints on its factors. As a result, in some cases the estimated PARAFAC model for nonnegative tensor data may be difficult to interpret. The nonnegative tensor factorization (NTF), as its name implies, enforces the nonnegativity constraint on each modality and is more appropriate for decomposing nonnegative tensor data. In fact, NTF has been widely used in diverse fields ranging from chemometrics, image analysis, and signal processing to neuroscience [19–25]. For example, in [23], the PARAFAC model with nonnegativity constraints was used to decompose the multiway intertrial phase coherence (ITPC) defined in [26], which is the average of the normalized space-time-frequency representation of the data across trials. For single-trial decoding, however, features have to be extracted from each single trial, and hence ITPC cannot be used. It is worth mentioning that there is a possible expense associated with the imposition of the nonnegativity constraints on the PARAFAC model, namely the loss of uniqueness in the decomposition [27]. Nevertheless, sparseness constraints can be enforced to improve the uniqueness of the nonnegatively constrained PARAFAC decomposition and, remarkably, sparseness constraints can enhance the parts-based representation of the data [28, 29].

In this paper, we develop a sparse NTF-based method to extract features from the LFP responses for decoding the bistable structure-from-motion (SFM) perception. We apply the feature extraction approach to the multichannel time-frequency representation of intracortical LFP data collected from the MT visual area of a macaque monkey performing an SFM task, aiming to identify components common across the space, time, and frequency domains and at the same time discriminative across different conditions. To determine the best LFP band for bistable perceptual discrimination, we first cluster the NTF components using the K-means clustering algorithm based on their frequency modality, which captures the spectral characteristics of each component, and then employ a support vector machine (SVM) classifier to decode the monkey's perception on a single-trial basis in order to determine the discriminability of each cluster. In doing so, we find that although other bands also have certain discriminability, the gamma band feature carries the most discriminative information for bistable perception, and that imposing sparseness constraints on the nonnegative tensor factorization improves extraction of this feature. The rest of the paper is organized as follows. In Section 2, we first present the experimental paradigm and then introduce the sparse NTF approach, the K-means clustering algorithm, and the SVM classifier. In Section 3, we explore the application of the NTF-based approach for decoding the bistable SFM perception. Finally, Section 4 contains the conclusions.

2. Materials and Methods

2.1. Subjects and Neurophysiological Recordings

Electrophysiological recordings were performed in a healthy adult male rhesus monkey. After behavioral training was



complete, occipital recording chambers were implanted and a craniotomy was made. Intracortical recordings were conducted with a multielectrode array while the monkey was viewing structure-from-motion (SFM) stimuli, which consisted of an orthographic projection of a transparent sphere that was covered with randomly distributed dots on its entire surface. Stimuli rotated for the entire period of presentation, giving the appearance of three-dimensional structure. The monkey was well trained and was required to indicate the choice of rotation direction (clockwise or counterclockwise) by pushing one of two levers. Correct responses for disparity-defined stimuli were acknowledged with application of a fluid reward. In the case of fully ambiguous (bistable) stimuli, where the stimuli can be perceived in one of two possible ways and no correct response can be externally defined, the monkey was rewarded by chance. Only the trials of data corresponding to bistable stimuli are analyzed in this paper. The recording site was the middle temporal area (MT) of the monkey's visual cortex, which is commonly associated with visual motion processing. The LFP was obtained by filtering the collected data between 1 and 100 Hz.

2.2. Sparse Nonnegative Tensor Factorization

In [11], two algorithms with multiplicative factor updates were proposed to solve the NMF problem. One algorithm is based on minimization of the squared error, while the other is based on minimization of the generalized Kullback-Leibler (KL) divergence. These algorithms were extended to the NTF problem using the PARAFAC model in [21]. Sparseness constraints originally proposed for NMF [28, 29] can also be incorporated in NTF to enhance the uniqueness of the nonnegatively constrained PARAFAC decomposition and improve the parts-based representation of the data. In this paper, we focus on a sparse NTF algorithm based on the nonnegatively and sparsely constrained PARAFAC model and minimization of the generalized KL divergence. The sparseness constraints imposed are similar to those of [28].

Let $X \in \mathbb{R}^{I_1 \times I_2 \times \cdots \times I_N}$ denote an $N$-way tensor with $N$ indices $(i_1 i_2 \cdots i_N)$. Let $X_{i_1 i_2 \cdots i_N}$ represent an element with $1 \leq i_n \leq I_n$. Assume that the PARAFAC model decomposes the tensor $X$ into $K$ components, each of which is the outer product of vectors that span different modalities,

$$X_{i_1 i_2 \cdots i_N} \approx \sum_{k=1}^{K} A^{(1)}_{i_1 k} A^{(2)}_{i_2 k} \cdots A^{(N)}_{i_N k}, \qquad (1)$$

where $A^{(n)} \in \mathbb{R}^{I_n \times K}$ is the matrix corresponding to the $n$th modality.

A tensor can be converted into a matrix. Let the matrix $X_{(n)} \in \mathbb{R}^{I_n \times I_1 \cdots I_{n-1} I_{n+1} \cdots I_N}$ denote the mode-$n$ matricization of $X$. Then it follows that

$$X_{(n)} \approx A^{(n)} Z_{(n)} \qquad (2)$$

with

$$Z_{(n)} = \bigl( A^{(N)} \odot \cdots \odot A^{(n+1)} \odot A^{(n-1)} \odot \cdots \odot A^{(1)} \bigr)^{T}, \qquad (3)$$

where $\odot$ denotes the Khatri-Rao product (column-wise Kronecker product) and $(\cdot)^{T}$ denotes the transpose.
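
A small numerical check of (1)–(3) for a three-way tensor can be written with NumPy and SciPy's Khatri-Rao product. The unfolding convention used below (columns of the mode-1 unfolding ordered so that the mode-2 index varies fastest) is one common choice and has to match the ordering of the Khatri-Rao product for the identity to hold; the sizes are arbitrary stand-ins.

```python
import numpy as np
from scipy.linalg import khatri_rao

rng = np.random.default_rng(0)
I, J, K, R = 4, 5, 6, 3                      # tensor dimensions and number of components

# Nonnegative PARAFAC factors A (mode 1), B (mode 2), C (mode 3).
A, B, C = rng.random((I, R)), rng.random((J, R)), rng.random((K, R))

# Build the 3-way tensor X_{ijk} = sum_r A_{ir} B_{jr} C_{kr}  (Eq. (1) with N = 3).
X = np.einsum('ir,jr,kr->ijk', A, B, C)

# Mode-1 unfolding with column index k*J + j (mode-2 index j varying fastest),
# which matches Z_(1) = (C (Khatri-Rao) B)^T as in Eqs. (2)-(3).
X1 = X.transpose(0, 2, 1).reshape(I, K * J)
Z1 = khatri_rao(C, B).T
print(np.allclose(X1, A @ Z1))               # True for this unfolding convention
```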

The cost function for the sparse NTF approach based on minimization of the generalized KL divergence can be written as

$$\sum_{ij} \left( \bigl(X_{(n)}\bigr)_{ij} \log \frac{\bigl(X_{(n)}\bigr)_{ij}}{\bigl(A^{(n)} Z_{(n)}\bigr)_{ij}} - \bigl(X_{(n)}\bigr)_{ij} + \bigl(A^{(n)} Z_{(n)}\bigr)_{ij} \right) + \lambda \sum_{ij} \bigl(A^{(n)}\bigr)_{ij}, \qquad (4)$$

where $\lambda$ is the regularization parameter for the sparseness constraints. Note that if $\lambda = 0$, this corresponds to the nonsparse NTF approach. The factor update for the sparse NTF approach is the same as that in [11] except for an extra regularization term:

$$A^{(n)} = A^{(n)} \circledast \Bigl( \bigl( X_{(n)} \oslash \bigl(A^{(n)} Z_{(n)}\bigr) \bigr) Z_{(n)}^{T} \Bigr) \oslash \Bigl( A^{(n)} Z_{(n)} Z_{(n)}^{T} + \lambda E \Bigr), \qquad (5)$$

where $E$ is a matrix of ones, and $\circledast$ and $\oslash$ denote element-wise multiplication and division, respectively. We can first randomly initialize $A^{(n)}$, $n = 1, 2, \ldots, N$, and then alternately update them in an iterative way until convergence. In [11], it was proved that such an iterative multiplicative update can be regarded as a special kind of gradient descent update using the optimal step size at each iteration, which is guaranteed to reach a locally optimal factorization.
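
As an illustration, a single mode-$n$ multiplicative update for the sparsity-penalized KL objective in (4) can be sketched in NumPy as below. This follows the standard Lee-Seung KL update with $\lambda$ added to the denominator; the exact normalization may differ from the update in (5), and the matrix sizes are hypothetical.

```python
import numpy as np

def sparse_kl_update(A, Z, X, lam=0.5, eps=1e-12):
    """One multiplicative update of the mode-n factor A (I_n x K) for the
    objective sum_ij [ X*log(X/(AZ)) - X + AZ ] + lam*sum_ij A  (cf. Eq. (4)).

    Z is the K x P matrix built from the other modes (Eq. (3)); X is the
    mode-n matricization X_(n) (I_n x P)."""
    AZ = A @ Z
    numer = (X / (AZ + eps)) @ Z.T          # (X ./ (A Z)) Z^T
    denom = np.ones_like(X) @ Z.T + lam     # E Z^T + lam (sparsity penalty)
    return A * numer / (denom + eps)

# Toy usage with random nonnegative data (hypothetical sizes).
rng = np.random.default_rng(1)
X = rng.random((8, 30))
A = rng.random((8, 4))
Z = rng.random((4, 30))
for _ in range(100):
    A = sparse_kl_update(A, Z, X)
```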

2.3. K-means Clustering

The K-means clustering algorithm partitions a data set into K clusters, with each cluster represented by its mean, such that the data within each cluster are similar while the data across distinct clusters are different [30]. Initially, the K-means clustering algorithm generates K random points as cluster means. It then iterates two steps, namely the assignment step and the update step, until convergence. In the assignment step, each data point is assigned to the cluster whose mean is closer to the data point than the means of all other clusters. In the update step, the means of all clusters are recomputed and updated based on the data points assigned to them. The convergence criterion can be that the cluster assignment does not change. The K-means clustering algorithm is simple and fast, but the clustering results depend on the initial random assignments. To overcome this problem, we can take the best clustering from multiple random starts.

We use the silhouette value to determine the number of clusters [31]. The silhouette value measures how similar a data point is to points in its own cluster compared to points in other clusters, and is defined as follows:

$$s(i) = \frac{\min_{l} b(i, l) - a(i)}{\max\bigl( a(i), \min_{l} b(i, l) \bigr)}, \qquad (6)$$

where a(i) is the average distance from the ith data point to the other points in its cluster, and b(i, l) is the average



distance from the ith point to points in another cluster l. The silhouette value ranges from −1 to +1, with 1 meaning that the data are separable and correctly clustered, 0 denoting poor clustering, and −1 meaning that the data are wrongly clustered.
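
A minimal NumPy sketch of Eq. (6), assuming a precomputed pairwise distance matrix and integer cluster labels (both hypothetical here), is given below.

```python
import numpy as np

def silhouette_values(D, labels):
    """Silhouette value s(i) of Eq. (6) for each point, given a pairwise
    distance matrix D and integer cluster labels (each cluster with >= 2 points)."""
    n = len(labels)
    s = np.zeros(n)
    for i in range(n):
        same = (labels == labels[i])
        same[i] = False
        a = D[i, same].mean()                               # a(i): mean distance within own cluster
        b = min(D[i, labels == c].mean()                    # min_l b(i, l): nearest other cluster
                for c in np.unique(labels) if c != labels[i])
        s[i] = (b - a) / max(a, b)
    return s

# Toy usage: two well-separated 1-D clusters.
x = np.array([0.0, 0.1, 0.2, 5.0, 5.1, 5.2])
labels = np.array([0, 0, 0, 1, 1, 1])
D = np.abs(x[:, None] - x[None, :])
print(silhouette_values(D, labels).mean())   # close to 1 for well-separated clusters
```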

2.4. Support Vector Machines Classifier

The support vector machine (SVM) is a popular classifier that minimizes the empirical classification error and at the same time maximizes the margin by determining a linear separating hyperplane to distinguish different classes of data [32, 33]. The SVM is robust to outliers and has good generalization ability. Consequently, it has been used in a wide range of applications.

Assume that $x_k$, $k = 1, \ldots, K$, are the $K$ training feature vectors for decoding and the class labels are $y_k \in \{-1, +1\}$; then the SVM solves the following optimization problem:

$$\min_{w, b, \xi} \; \|w\|^2 + C \sum_{k=1}^{K} \xi_k \quad \text{subject to} \quad y_k \bigl( w' x_k + b \bigr) \geq 1 - \xi_k, \quad \xi_k \geq 0, \qquad (7)$$

where $w$ is the weight vector, $C > 0$ is the penalty parameter of the error term chosen by cross-validation, $\xi_k$ is the slack variable, and $b$ is the bias term. It turns out that the margin between the two classes is inversely proportional to $\|w\|_2$. Therefore, the first term in the objective function of the SVM is used to maximize the margin. The second term in the objective function is the regularization term that allows for training errors in the inseparable case.

The Lagrange multiplier method can be used to find the optimal solution for $w$ and $b$ in the above optimization problem. Assume that $t$ is the testing feature vector. Then testing is done simply by determining on which side of the separating hyperplane $t$ lies; that is, if $w' t + b \geq 0$, the label of $t$ is classified as +1; otherwise, the label is classified as −1. The SVM can also be used as a kernel-based method when the feature vectors are mapped into a higher-dimensional space [32].
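
For illustration, a linear SVM of this form can be trained with scikit-learn, with the test label recovered from the sign of $w' t + b$; the data below are random stand-ins, and in practice C would be chosen by cross-validation as described above.

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(2)
# Hypothetical training features (K trials x d features) and labels in {-1, +1}.
X_train = np.vstack([rng.normal(0.0, 1.0, (20, 5)), rng.normal(1.5, 1.0, (20, 5))])
y_train = np.array([-1] * 20 + [+1] * 20)

clf = SVC(kernel='linear', C=1.0).fit(X_train, y_train)

t = rng.normal(1.5, 1.0, (1, 5))                            # a test feature vector
w, b = clf.coef_.ravel(), clf.intercept_[0]
print(int(np.sign(w @ t.ravel() + b)), clf.predict(t)[0])   # decision by sign of w't + b
```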

3. Experimental Results

In this section, we provide experimental examples to demonstrate the performance of the proposed feature extraction approach for predicting perceptual decisions from the neuronal data. Simultaneously collected 4-channel LFP data were used for demonstration. The Gabor transform (STFT with a Gaussian window) is used to obtain the time-frequency representation of the data. The number of trials is 96. The time window used is from stimulus onset to 1 second after that. We find that the performance does not change much if a different time window, for example, from stimulus onset to 800 milliseconds after that, is used. We use both nonsparse and sparse NTF approaches based on minimizing the generalized KL divergence and choose the number of

Figure 1: The silhouette value obtained by clustering the nonsparse NTF components using the K-means algorithm as a function of the number of clusters.

NTF components to be 20, with random initialization for all modalities. The regularization parameter λ for the sparse NTF approach is chosen to be 0.5, and the sparseness constraint is applied to each modality. We apply the nonsparse and sparse NTF approaches to the nonnegative four-way data (channel by frequency by time by trial) and use the modality corresponding to the trials as the features. We use K-means clustering to cluster the features with 50 random starts to find the best clustering, and adopt the correlation between the spectral modalities of the NTF components as the distance metric. The NTF and clustering are performed on all the data since they are unsupervised and do not require any label information. On the other hand, if a feature extraction method requires label information, it should be applied to the training data only. We employ the linear SVM classifier from the LIBSVM package [34] and use decoding accuracy as the performance measure, calculated via leave-one-out cross-validation (LOOCV). In particular, for a data set with N trials, we choose N − 1 trials for training and use the remaining trial for testing. This is repeated N times, with each trial serving for testing once. The decoding accuracy is obtained as the ratio of the number of correctly decoded trials to N. It is also possible to split the data into three disjoint sets: one for parameter estimation, one for model selection, and one for testing the end result. We considered this option but decided to use the LOOCV procedure due to the limited number of trials available.
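
A minimal sketch of the LOOCV decoding-accuracy computation described above, assuming a trials-by-features matrix (for example, the trial modality of the NTF components) and binary percept labels, might look as follows; the feature values and labels are random stand-ins, and the paper uses LIBSVM rather than scikit-learn.

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.model_selection import LeaveOneOut

def loocv_accuracy(features, labels, C=1.0):
    """Leave-one-out cross-validated decoding accuracy with a linear SVM."""
    correct = 0
    for train_idx, test_idx in LeaveOneOut().split(features):
        clf = SVC(kernel='linear', C=C)
        clf.fit(features[train_idx], labels[train_idx])
        correct += int(clf.predict(features[test_idx])[0] == labels[test_idx][0])
    return correct / len(labels)

# Toy usage: 96 trials x 20 NTF trial-modality features (random stand-ins).
rng = np.random.default_rng(3)
features = rng.random((96, 20))
labels = rng.integers(0, 2, 96)
print(loocv_accuracy(features, labels))      # near chance (~0.5) for random features
```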

Figure 1 shows the silhouette value obtained by clustering the nonsparse NTF components using the K-means algorithm as a function of the number of clusters. Note that the silhouette value increases with the number of clusters until the number of clusters is equal to four. Hence we choose the number of clusters to be four. Figure 2 shows the frequency modalities of the 20 nonsparse NTF components clustered by the K-means algorithm. The color of each curve denotes



Figure 2: Comparison of the frequency modalities of the 20 nonsparse NTF components clustered by the K-means algorithm. The color of each curve denotes to which cluster the component belongs. Blue, green, red, and black correspond to clusters 1–4, respectively.

Figure 3: The silhouette value obtained by clustering the sparse NTF components using the K-means algorithm as a function of the number of clusters.

to which cluster the component belongs. Blue, green, red, and black correspond to clusters 1–4, respectively. Figures 3 and 4 are the same as Figures 1 and 2, respectively, except that sparse NTF components are used. For comparison, we use the same range for the y-axis in Figure 3 as in Figure 1. Note that the silhouette values in Figure 3 follow a similar trend to those in Figure 1. Hence the number of clusters for the sparse NTF components is also chosen to be four. In addition, it is clear that, for a given number of clusters, the silhouette value in Figure 3 is always larger than that in Figure 1. This indicates that the clustering of the sparse NTF components is better than the clustering of the nonsparse NTF components, though the main purpose of these two figures is to show that with the NTF method, either sparse or nonsparse, the number of clusters converges to 4. It can be seen from Figures 2 and 4 that both the sparse and nonsparse NTF components are well clustered by the K-means algorithm and that different clusters may have different numbers of components. Furthermore, in both cases, the clusters generally fall into distinct spectral bands: the first cluster mainly in the high



Figure 4: Comparison of the frequency modalities of the 20 sparse NTF components clustered by the K-means algorithm. The color of each curve denotes to which cluster the component belongs. Blue, green, red, and black correspond to clusters 1–4, respectively.

Figure 5: The time-frequency plot for (a) the first nonsparse NTF component and (b) the second nonsparse NTF component of cluster 1. Red and blue represent strong and weak activity, respectively. Note that the first component has a localized time-frequency representation in the high gamma band, while the second component contains strong activity in both the high gamma band and other bands. In addition, these two components occupy different time windows.



Figure 6: The representative time-frequency plot for (a) cluster 1, (b) cluster 2, (c) cluster 3, and (d) cluster 4 of the sparse NTF components. Red and blue represent strong and weak activity, respectively. Note that the first cluster for the sparse NTF components contains only one component, in the high gamma band (50–60 Hz), with a well-localized time-frequency representation, and that clusters 2–4 have concentrated time-frequency distributions in the delta band (1–4 Hz), alpha band (10–20 Hz), and low gamma band (30–40 Hz), respectively.

gamma band (50–60 Hz), the second cluster in the delta band (1–4 Hz), the third cluster in the alpha band (10–20 Hz), and the fourth cluster mainly in the low gamma band (30–40 Hz).

To take a closer look at the NTF components, we construct the time-frequency representation of each component from the outer product of its frequency modality and time modality. Figures 5(a) and 5(b) show the time-frequency plots for the two nonsparse NTF components of cluster 1. Red and blue in the figures represent strong and weak activity, respectively. Note that the first nonsparse NTF component has a localized time-frequency representation in the high gamma band, while the second component contains strong activity in both the high gamma band and other bands. In addition, these two components cover different time windows, with the first component in both an early window and a late window and the second component mainly in an early window. Figures 6(a)–6(d) show the representative time-frequency plots for clusters 1–4, respectively, of the sparse NTF components. Red and blue represent strong and weak activity, respectively. Note the similarity between Figures 6(a) and 5(a). However, unlike the first cluster for the nonsparse NTF components, the first cluster for the sparse NTF components has only one component, with a well-localized time-frequency representation in the high gamma band (50–60 Hz). From Figures 6(b) to 6(d), we can observe concentrated time-frequency distributions for the second cluster in the delta band (1–4 Hz), the third cluster in the alpha band (10–20 Hz), and the fourth cluster mainly in the low gamma band (30–40 Hz).



Table 1: Comparison of the decoding accuracy based on the combination of all features from each of clusters 1–4 (denoted c1 (combined)–c4 (combined), resp.) and the single best feature from each of clusters 1–4 (denoted c1 (best)–c4 (best), resp.). Clusters 1–4 correspond to the high gamma band (50–60 Hz), delta band (1–4 Hz), alpha band (10–20 Hz), and low gamma band (30–40 Hz), respectively. The nonsparse NTF approach based on minimization of the generalized KL divergence is used.

Feature            c1 (combined)  c2 (combined)  c3 (combined)  c4 (combined)
Decoding accuracy  0.70           0.61           0.63           0.63

Feature            c1 (best)      c2 (best)      c3 (best)      c4 (best)
Decoding accuracy  0.72           0.61           0.61           0.61

Table 2: Comparison of the decoding accuracy based on the combination of all features from each of clusters 1–4 (denoted c1 (combined)–c4 (combined), resp.) and the single best feature from each of clusters 1–4 (denoted c1 (best)–c4 (best), resp.). Clusters 1–4 correspond to the high gamma band (50–60 Hz), delta band (1–4 Hz), alpha band (10–20 Hz), and low gamma band (30–40 Hz), respectively. The sparse NTF approach based on minimization of the generalized KL divergence is used.

Feature            c1 (combined)  c2 (combined)  c3 (combined)  c4 (combined)
Decoding accuracy  0.76           0.61           0.53           0.58

Feature            c1 (best)      c2 (best)      c3 (best)      c4 (best)
Decoding accuracy  0.76           0.61           0.61           0.61

We next compare the SVM decoding accuracy based on different features of the nonsparse and sparse NTF components in Tables 1 and 2, respectively. In particular, we compare the decoding accuracy based on the combination of all features from each of clusters 1–4 (denoted c1 (combined)–c4 (combined), resp.) and the single best feature from each of clusters 1–4 (denoted c1 (best)–c4 (best), resp.). It is clear that cluster 1 significantly outperforms clusters 2–4 in terms of decoding accuracy. Therefore, the high gamma band feature is more discriminative than the features in the other bands for bistable perception. Note that the combination of all features within one cluster sometimes results in lower decoding accuracy than the single best feature from that cluster. This is probably due to the redundancy of features within the same cluster. Comparing Tables 1 and 2, we can see that the high gamma band feature of the sparse NTF approach is better than that of the nonsparse NTF approach. The former has the best decoding accuracy of 0.76 (corresponding to the sparse NTF component in Figure 6(a)), while the latter has the best decoding accuracy of 0.72 (corresponding to the first nonsparse NTF component in Figure 5(a)). The decoding accuracy for the second nonsparse NTF component of cluster 1 (corresponding to Figure 5(b)) is only 0.61. The decoding performances reveal that although Figures 6(a) and 5(a) appear quite similar, the high gamma band features extracted by the sparse and nonsparse NTF approaches are different. This is because the sparseness constraints enhance the parts-based representation of the data and contribute to a better extraction of the high gamma band feature, leading to the improvement in decoding accuracy. We have performed statistical tests to compare the performances of the sparse and nonsparse NTF methods. Although in most cases there is no significant difference between them, the sparse NTF significantly outperforms the nonsparse NTF in the case of the combination of features for the high gamma frequency band. Furthermore, the results of both the sparse NTF and the nonsparse NTF show a significant difference between the high gamma frequency band and the other bands. As a benchmark, we have also calculated the SVM decoding accuracy based on the power of the bandpass-filtered LFP in frequency bands of commonly used ranges, namely the delta band (1–4 Hz), theta band (5–8 Hz), alpha band (9–14 Hz), beta band (15–30 Hz), and gamma band (30–80 Hz), and found that the maximum decoding accuracy over all of them is 0.61. Taken together, our results suggest that NTF is useful for LFP feature extraction, that although other bands also have certain discriminability, the gamma band feature carries the most discriminative information for bistable perception, and that imposing sparseness constraints on the nonnegative tensor factorization improves extraction of this feature.
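
A hedged sketch of the benchmark band-power feature mentioned above (bandpass filter the LFP and take the mean power per trial and channel) is shown below; the Butterworth filter, its order, and the array shapes are illustrative choices rather than details taken from the paper.

```python
import numpy as np
from scipy.signal import butter, filtfilt

def band_power(lfp, fs, low, high, order=4):
    """Mean power of the band-pass filtered signal along the last (time) axis.

    lfp: array of shape (..., n_samples); fs: sampling rate in Hz;
    low/high: band edges in Hz (e.g., 30-80 Hz for the gamma band)."""
    b, a = butter(order, [low / (fs / 2), high / (fs / 2)], btype='band')
    filtered = filtfilt(b, a, lfp, axis=-1)
    return (filtered ** 2).mean(axis=-1)

# Toy usage: 96 trials x 4 channels x 1 s of LFP at an assumed 1 kHz sampling rate.
rng = np.random.default_rng(4)
lfp = rng.normal(size=(96, 4, 1000))
gamma_features = band_power(lfp, fs=1000, low=30, high=80)   # shape (96, 4)
print(gamma_features.shape)
```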

4. Conclusions

In this paper, we have developed a sparse nonnegative tensor factorization (NTF)-based method to extract features from the local field potential (LFP) in the middle temporal area (MT) of a macaque monkey performing a bistable structure-from-motion (SFM) task. We have applied the feature extraction approach to the multichannel time-frequency representation of the LFP data to identify components common across the space, time, and frequency domains and at the same time discriminative across different conditions. To determine the most discriminative band of LFP for bistable perception, we have clustered the NTF components using the K-means clustering algorithm and employed a support vector machine (SVM) classifier to determine the discriminability of each cluster based on single-trial decoding of the monkey's perception. Using these techniques, we have demonstrated that although other bands



also have certain discriminability, the gamma band feature carries the most discriminative information for bistable perception, and that imposing sparseness constraints on the nonnegative tensor factorization improves extraction of this feature.

Acknowledgment

This work was supported by an NIH grant and the Max Planck Society.

References

[1] R. Blake and N. K. Logothetis, “Visual competition,” Nature Reviews Neuroscience, vol. 3, no. 1, pp. 13–23, 2002.
[2] K. H. Britten, W. T. Newsome, M. N. Shadlen, S. Celebrini, and J. A. Movshon, “A relationship between behavioral choice and the visual responses of neurons in macaque MT,” Visual Neuroscience, vol. 13, no. 1, pp. 87–100, 1996.
[3] J. V. Dodd, K. Krug, B. G. Cumming, and A. J. Parker, “Perceptually bistable three-dimensional figures evoke high choice probabilities in cortical area MT,” Journal of Neuroscience, vol. 21, no. 13, pp. 4809–4821, 2001.
[4] B. Pesaran, J. S. Pezaris, M. Sahani, P. P. Mitra, and R. A. Andersen, “Temporal structure in neuronal activity during working memory in macaque parietal cortex,” Nature Neuroscience, vol. 5, no. 8, pp. 805–811, 2002.
[5] G. Kreiman, C. Hung, A. Kraskov, R. Q. Quiroga, T. Poggio, and J. DiCarlo, “Object selectivity of local field potentials and spikes in the macaque inferior temporal cortex,” Neuron, vol. 49, no. 3, pp. 433–445, 2006.
[6] G. Pfurtscheller, B. Graimann, J. E. Huggins, S. P. Levine, and L. A. Schuh, “Spatiotemporal patterns of beta desynchronization and gamma synchronization in corticographic data during self-paced movement,” Clinical Neurophysiology, vol. 114, no. 7, pp. 1226–1236, 2003.
[7] G. Pfurtscheller, C. Brunner, A. Schlogl, and F. H. Lopes da Silva, “Mu rhythm (de)synchronization and EEG single-trial classification of different motor imagery tasks,” NeuroImage, vol. 31, no. 1, pp. 153–159, 2006.
[8] H. Liang, S. L. Bressler, E. A. Buffalo, R. Desimone, and P. Fries, “Empirical mode decomposition of field potentials from macaque V4 in visual spatial attention,” Biological Cybernetics, vol. 92, no. 6, pp. 380–392, 2005.
[9] L. Cohen, Time Frequency Analysis, Prentice Hall, Englewood Cliffs, NJ, USA, 1995.
[10] D. D. Lee and H. S. Seung, “Learning the parts of objects by nonnegative matrix factorization,” Nature, vol. 401, no. 6755, pp. 788–791, 1999.
[11] D. Lee and H. Seung, “Algorithms for nonnegative matrix factorization,” in Proceedings of the 13th Annual Conference on Advances in Neural Information Processing Systems (NIPS '00), vol. 13, pp. 556–562, Denver, Colo, USA, December 2000.
[12] R. A. Harshman, “Foundations of the PARAFAC procedure: models and conditions for an “explanatory” multi-modal factor analysis,” in UCLA Working Papers in Phonetics, vol. 16, pp. 1–84, University of California, Los Angeles, Calif, USA, 1970.
[13] J. D. Carroll and J.-J. Chang, “Analysis of individual differences in multidimensional scaling via an n-way generalization of “Eckart-Young” decomposition,” Psychometrika, vol. 35, no. 3, pp. 283–319, 1970.
[14] L. R. Tucker, “Some mathematical notes on three-mode factor analysis,” Psychometrika, vol. 31, no. 3, pp. 279–311, 1966.
[15] J. B. Kruskal, “Three-way arrays: rank and uniqueness of trilinear decompositions, with application to arithmetic complexity and statistics,” Linear Algebra and Its Applications, vol. 18, no. 2, pp. 95–138, 1977.
[16] N. D. Sidiropoulos and R. Bro, “On the uniqueness of multilinear decomposition of n-way arrays,” Journal of Chemometrics, vol. 14, no. 3, pp. 229–239, 2000.
[17] F. Miwakeichi, E. Martínez-Montes, P. A. Valdés-Sosa, N. Nishiyama, H. Mizuhara, and Y. Yamaguchi, “Decomposing EEG data into space-time-frequency components using parallel factor analysis,” NeuroImage, vol. 22, no. 3, pp. 1035–1045, 2004.
[18] M. Mørup, L. K. Hansen, C. S. Herrmann, J. Parnas, and S. M. Arnfred, “Parallel factor analysis as an exploratory tool for wavelet transformed event-related EEG,” NeuroImage, vol. 29, no. 3, pp. 938–947, 2006.
[19] R. Bro and S. de Jong, “A fast nonnegativity-constrained least squares algorithm,” Journal of Chemometrics, vol. 11, no. 5, pp. 393–401, 1997.
[20] A. Smilde, R. Bro, and P. Geladi, Multi-Way Analysis: Applications in the Chemical Sciences, John Wiley & Sons, New York, NY, USA, 2004.
[21] T. Hazan, S. Polak, and A. Shashua, “Sparse image coding using a 3D nonnegative tensor factorization,” in Proceedings of the 10th IEEE International Conference on Computer Vision (ICCV '05), vol. 1, pp. 50–57, Beijing, China, October 2005.
[22] T. G. Kolda, “Multilinear operators for higher-order decompositions,” Tech. Rep. SAND2006-2081, Sandia National Laboratories, Livermore, Calif, USA, 2006.
[23] M. Mørup, L. K. Hansen, J. Parnas, and S. M. Arnfred, “Decomposing the time-frequency representation of EEG using nonnegative matrix and multi-way factorization,” Tech. Rep. IMM2006-04144, Technical University of Denmark, Informatics and Mathematical Modelling, Copenhagen, Denmark, 2006.
[24] A. Cichocki, R. Zdunek, S. Choi, R. Plemmons, and S.-I. Amari, “Nonnegative tensor factorization using alpha and beta divergences,” in Proceedings of IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP '07), vol. 3, pp. 1393–1396, Honolulu, Hawaii, USA, April 2007.
[25] A. Cichocki, R. Zdunek, S. Choi, R. Plemmons, and S.-I. Amari, “Novel multi-layer nonnegative tensor factorization with sparsity constraints,” in Proceedings of the 8th International Conference on Adaptive and Natural Computing Algorithms (ICANNGA '07), vol. 4432 of Lecture Notes in Computer Science, pp. 271–280, Warsaw, Poland, April 2007.
[26] C. Tallon-Baudry, O. Bertrand, C. Delpuech, and J. Pernier, “Stimulus specificity of phase-locked and non-phase-locked 40 Hz visual responses in human,” Journal of Neuroscience, vol. 16, no. 13, pp. 4240–4249, 1996.
[27] L.-H. Lim and G. Golub, “Nonnegative decomposition and approximation of nonnegative matrices and tensors,” Tech. Rep. 06-01, Society of Critical Care Medicine (SCCM), Mount Prospect, Ill, USA, 2006.
[28] J. Eggert and E. Korner, “Sparse coding and NMF,” in Proceedings of IEEE International Conference on Neural Networks (IJCNN '04), vol. 4, pp. 2529–2533, Budapest, Hungary, July 2004.

Page 86: Advances in Nonnegative Matrix and Tensor Factorization

10 Computational Intelligence and Neuroscience

[29] P. O. Hoyer, “Nonnegative matrix factorization with sparse-ness constraints,” Journal of Machine Learning Research, vol. 5,pp. 1457–1469, 2004.

[30] J. B. MacQueen, “Some methods for classification and analysisof multivariate observations,” in Proceedings of the 5th BerkeleySymposium on Mathematical Statistics and Probability, vol. 1,pp. 281–297, Berkeley, Calif, USA, 1967.

[31] L. Kaufman and P. J. Rousseeuw, Finding Groups in Data: AnIntroduction to Cluster Analysis, John Wiley & Sons, New York,NY, USA, 1990.

[32] V. N. Vapnik, The Nature of Statisical Learning Theory,Springer, New York, NY, USA, 1995.

[33] C. Cortes and V. Vapnik, “Support-vector networks,” MachineLearning, vol. 20, no. 3, pp. 273–297, 1995.

[34] C.-C. Chang and C.-J. Lin, “LIBSVM: a library for sup-port vector machines,” 2001, http://www.csie.ntu.edu.tw/cjlin/libsvm/.

Page 87: Advances in Nonnegative Matrix and Tensor Factorization

Hindawi Publishing Corporation
Computational Intelligence and Neuroscience
Volume 2008, Article ID 168769, 10 pages
doi:10.1155/2008/168769

Research Article

Pattern Expression Nonnegative Matrix Factorization: Algorithm and Applications to Blind Source Separation

Junying Zhang,1 Le Wei,1 Xuerong Feng,2 Zhen Ma,1 and Yue Wang3

1 School of Computer Science and Engineering, Xidian University, Xi'an 710071, China
2 Department of Mathematics and Computer Science, Valdosta State University, Valdosta, GA 31698, USA
3 The Bradley Department of Electrical and Computer Engineering, Virginia Polytechnic Institute and State University, VA 24061, USA

Correspondence should be addressed to Junying Zhang, [email protected]

Received 1 November 2007; Accepted 18 April 2008

Recommended by Rafal Zdunek

Independent component analysis (ICA) is a widely applicable and effective approach in blind source separation (BSS), with the limitation that the sources must be statistically independent. A more common situation, however, is blind source separation for the nonnegative linear model (NNLM), where the observations are nonnegative linear combinations of nonnegative sources and the sources may be statistically dependent. We propose a pattern expression nonnegative matrix factorization (PE-NMF) approach from the viewpoint of using basis vectors to express patterns as effectively as possible. Two regularization or penalty terms are added to the original loss function of standard nonnegative matrix factorization (NMF) so that the basis vectors of the PE-NMF express the patterns effectively. A learning algorithm is presented, and the convergence of the algorithm is proved theoretically. Three illustrative examples on blind source separation, including heterogeneity correction for gene microarray data, indicate that the sources can be successfully recovered with the proposed PE-NMF when the two parameters are suitably chosen from prior knowledge of the problem.

Copyright © 2008 Junying Zhang et al. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

1. Introduction

Blind source separation (BSS) has recently been a very active topic in the signal processing and neural network fields [1, 2]. It is an approach to recover the sources from their combinations (observations) without any knowledge of how the sources are mixed. For a linear model, the observations are linear combinations of the sources, that is, X = AS, where S is an r × n matrix of r source signals, each in n-dimensional space, X is an m × n matrix of m observations in n-dimensional space, and A is an m × r mixing matrix. The BSS problem is therefore a matrix factorization, that is, to factorize the observation matrix X into the mixing matrix A and the source matrix S.

Independent component analysis (ICA) has been found very effective in BSS for cases where the sources are statistically independent. In fact, it factorizes the observation matrix X into the mixing matrix A and the source matrix S by searching for the most non-Gaussian directions in the scatter plot of the observations, and it estimates the recovered sources very well when the sources are statistically independent. This is based on the central limit theorem, that is, the distribution of a sum (the observations) of independent random variables (the sources) tends toward a Gaussian distribution under certain conditions. This induces two serious constraints on applying ICA to BSS: (1) the sources should be statistically independent of each other; (2) the sources should not follow a Gaussian distribution. The quality of the sources recovered by ICA depends on how well these two constraints are satisfied and decreases very rapidly when either of them is violated. In the real world, however, there are many blind source separation applications where the observations are nonnegative linear combinations of nonnegative sources and the sources are statistically dependent to some extent. This is the model referred to as the nonnegative linear model (NNLM), that is, X = AS with the elements of both A and S nonnegative, where the rows of S (the sources) may be statistically dependent to some extent. One application of this model is gene expression profiling, where each profile, which takes only nonnegative values, represents a composite of more than one distinct but partially dependent source [3], for example the profiles from normal tissue and from cancer tissue. What needs to be developed is an algorithm to recover dependent sources from the composite observations.

It is easy to recognize that BSS for the NNLM is a nonnegative matrix factorization, that is, factorizing X into nonnegative A and nonnegative S, for which the nonnegative matrix factorization (NMF) technique is applicable. Several approaches have been developed that apply NMF-based techniques to BSS for the NNLM. For example, we proposed a method for the decomposition of molecular signatures based on BSS of nonnegative dependent sources with direct use of standard NMF [3]; Cichocki and his colleagues proposed a new algorithm for nonnegative matrix factorization in applications to blind source separation [4] by adding two suitable regularization or penalty terms to the original objective function of NMF to increase the sparseness and/or smoothness of the estimated components. In addition, multilayer NMF was proposed by Cichocki and Zdunek for blind source separation [5], and nonsmooth nonnegative matrix factorization was proposed with the aim of finding localized, part-based representations of nonnegative multivariate data items [6]. Other research includes the work of Zdunek and Cichocki, who proposed to take advantage of the second-order terms of a cost function to overcome the disadvantages of gradient (multiplicative) NMF algorithms and to tackle the slow convergence of the standard NMF learning algorithms [7]; the work of Kopriva and his colleagues, who proposed a single-frame blind image deconvolution approach with nonnegative sparse matrix factorization [8]; and the work of Liu and Zheng, who proposed nonnegative matrix factorization-based methods for object recognition [9].

In this paper, we extend NMF to pattern expression NMF (PE-NMF) from the viewpoint that the basis vectors should express the data as efficiently as possible. Its successful application to blind source separation for the extended bar problem, a nonnegative signal recovery problem, and a heterogeneity correction problem for real gene microarray data indicates that it has great potential for blind separation of dependent sources under the NNLM model. The loss function of the PE-NMF proposed here is a special case of that proposed in [4]; here, not only is the learning algorithm for the proposed PE-NMF approach provided, but the convergence of the learning algorithm is also proved by introducing an auxiliary function. To speed up the learning procedure, a technique based on independent component analysis (ICA) is proposed and has been verified to be effective in making the learning algorithm converge to the desired solutions.

2. Pattern Expression NMF and BSS for the NNLM Model

The NMF problem is: given a nonnegative n × m matrix V, find nonnegative n × r and r × m matrix factors W and H such that the difference between V and WH is minimal according to some cost function, that is,
\[
V \approx WH. \tag{1}
\]

NMF is a method to obtain a representation of data using nonnegativity constraints. These constraints lead to a part-based representation because they allow only additive, not subtractive, combinations of the original data. For the i-th column of (1), that is, v_i = W h_i, where v_i and h_i are the i-th columns of V and H, the i-th datum (observation) is a nonnegative linear combination of the columns of W = (W_1, W_2, ..., W_r), with the combination coefficients being the elements of h_i. Therefore, the columns of W, that is, {W_1, W_2, ..., W_r}, can be viewed as a basis of the data V when V is optimally estimated by its factors.

2.1. Pattern Basis

Let W_1, W_2, ..., W_r be linearly independent n-dimensional vectors. We refer to the space spanned by arbitrary nonnegative linear combinations of these r vectors as the positive subspace spanned by W_1, W_2, ..., W_r. Then, W_1, W_2, ..., W_r is the pattern expression of the data in this subspace and is called the basis of the subspace. Evidently, the basis W_1, W_2, ..., W_r derived from NMF is a pattern expression of the observation data in the columns of V, but this expression may not be unique. Figure 1(a) shows an example of data V which have two pattern expressions, {W_1, W_2} and {W'_1, W'_2}. Hence, the question arises: which basis is more effective in expressing the pattern of the observations in V? In order for the basis to express the pattern in V effectively, in our opinion, the following three requirements should be satisfied:

(1) the angle between the vectors in the basis should be as large as possible, such that each datum in V is a nonnegative linear combination of the vectors;

(2) the angles between the vectors in the basis should be as small as possible, so that the vectors clamp the data as tightly as possible and no space is left for expressing what is not included in V;

(3) each vector in the basis should be maximally efficient in expressing the data in V, and equally efficient in this expression compared with any other vector in the basis.

The vectors defined by the above three requirements are what we call the pattern basis of the data, and the number of vectors in the basis, r, is called the pattern dimension of the data. Figures 1(a), 1(b), and 1(c) show, respectively, situations in which the between-angle is too large, the between-angles are too small, and the basis vectors are too unequally important, with {W'_1, W'_2} as the basis; the data in Figures 1(a) and 1(b) are assumed to be uniformly distributed in the gray area, while those in Figure 1(c) are assumed to be nonuniformly distributed (the data in the dark gray area are denser than those in the light gray area). For these three cases, {W_1, W_2} is the better basis to express the data.


Figure 1: The basis {W_1, W_2} / {W'_1, W'_2} which obeys/violates (a) the first point; (b) the second point; (c) the third point in the definition of the pattern basis.

Figure 2: Bar problem solution obtained from NMF: (a) source images, (b) mixed images, (c) recovered images from ICA, and (d) recovered images from NMF.

Notice that the second requirement in the definition of the pattern basis readily holds from the NMF constraint that the elements of H are nonnegative. The remaining two constraints can then be formulated as follows: (1) due to the requirement that the between-angle of each pair of vectors in the basis should be as large as possible, we have W_i^T W_j → min for i ≠ j, where W_i is the i-th column of the matrix W; (2) due to the requirement that each vector in the basis should be equally efficient in expressing the data in V, where the efficiency of a vector is measured by the sum of the projection coordinates of all the data in V onto this vector, that is, the efficiency of the vector W_i for expressing the samples v_j, j = 1, 2, ..., m, is the sum of h_{ij} over j, we have the minimization of the sum of all h_{ij}. Hence, we formulate the PE-NMF problem as minimizing the loss function E(W, H; α, β) in the following equation, subject to nonnegativity constraints.

PE-NMF problem

Given an n by m nonnegative observation matrix V, find an n by r and an r by m nonnegative matrix factors W and H such that

\[
\min_{W,H}\; E(W, H; \alpha, \beta) = \tfrac{1}{2}\| V - WH \|^2 + \alpha \sum_{i, j,\; j \neq i} W_i^T W_j + \beta \sum_{i, j} h_{ij}, \quad \text{s.t. } W \ge 0,\; H \ge 0,
\tag{2}
\]

where W ≥ 0, H ≥ 0 indicates that both W and H are nonnegative matrices, W_i is the i-th column of the matrix W, and h_{ij} is the element in the i-th row and j-th column of the matrix H.

This problem is a special case of the constrained optimization problem proposed in [4]:
\[
\min_{W,H}\; E(W, H; \alpha, \beta) = \tfrac{1}{2}\| V - WH \|^2 + \alpha J_W(W) + \beta J_H(H), \quad \text{s.t. } W \ge 0,\; H \ge 0.
\tag{3}
\]
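For illustration, the objective in (2) can be evaluated directly; the following NumPy sketch is ours and simply mirrors the formula above (the names V, W, H, alpha, beta follow the notation of this section and are not from the original implementation).

```python
import numpy as np

def pe_nmf_loss(V, W, H, alpha, beta):
    """Evaluate the PE-NMF objective of Eq. (2).

    V: (n, m) nonnegative data, W: (n, r) basis, H: (r, m) coefficients.
    """
    fit = 0.5 * np.linalg.norm(V - W @ H, 'fro') ** 2
    G = W.T @ W                    # Gram matrix of the basis vectors
    cross = G.sum() - np.trace(G)  # sum of W_i^T W_j over all i != j
    return fit + alpha * cross + beta * H.sum()
```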

2.2. PE-NMF Algorithm and its Convergence

For the derivation of the learning algorithms for W and H, we first present and prove the following lemma.

Lemma 1. For any r by r symmetric nonnegative matrix Q and for any r-dimensional nonnegative row vector w, the r by r matrix
\[
F = \delta_{ab}\,\frac{(Qw^T)_a}{w_a} - Q
\tag{4}
\]
is always positive semidefinite, where δ_{ab}((Qw^T)_a/w_a) denotes the diagonal matrix whose element in the a-th row and a-th column is (Qw^T)_a/w_a.

Proof. Since w_a and w_b are nonnegative, the positive semidefiniteness of the matrix F is equivalent to that of the matrix S whose element in the a-th row and b-th column is S_ab = w_a F_ab w_b (strictly, for positive entries of w). Hence, we prove the positive semidefiniteness of the matrix S in what follows.

For any r-dimensional vector V, we have
\[
V^T S V = \sum_{ab} V_a S_{ab} V_b = \sum_{ab} V_a w_a F_{ab} w_b V_b
= \sum_{ab}\Big[ V_a w_a\, \delta_{ab}\frac{(Qw^T)_a}{w_a}\, w_b V_b - V_a w_a Q_{ab} w_b V_b \Big]
= \sum_{ab} V_a w_a\, \delta_{ab}\frac{(Qw^T)_a}{w_a}\, w_b V_b - \sum_{ab} V_a w_a Q_{ab} w_b V_b
= A - B,
\tag{5}
\]

where A denotes the first term and B denotes the second term in the above formula. By noticing that Q is a symmetric matrix, we have (Qw^T)_a = (wQ)_a, and hence the first term A becomes
\[
A = \sum_{a} V_a w_a\,\frac{(wQ)_a}{w_a}\, w_a V_a
= \sum_{a} (wQ)_a\, w_a V_a^2
= \sum_{a}\Big(\sum_{b} w_b Q_{ba}\Big) w_a V_a^2
= \sum_{ab} w_a Q_{ab} w_b V_a^2.
\tag{6}
\]

We substitute the above A into formula (5) and obtain
\[
V^T S V = \sum_{ab}\big( w_a Q_{ab} w_b V_a^2 - w_a Q_{ab} w_b V_a V_b \big)
= \sum_{ab} w_a Q_{ab} w_b \big[ V_a^2 - V_a V_b \big]
= \sum_{ab} w_a Q_{ab} w_b \Big[ \tfrac{1}{2}V_a^2 + \tfrac{1}{2}V_b^2 - V_a V_b \Big]
= \tfrac{1}{2}\sum_{ab} w_a Q_{ab} w_b \big( V_a - V_b \big)^2.
\tag{7}
\]

Due to the fact that w is a nonnegative row vector and Q is a nonnegative matrix, for any r-dimensional vector V we have
\[
V^T S V = \tfrac{1}{2}\sum_{ab} w_a Q_{ab} w_b \big( V_a - V_b \big)^2 \ge 0.
\tag{8}
\]

Hence, the matrix S, and therefore F = δ_ab((Qw^T)_a/w_a) − Q, is a positive semidefinite matrix.

Now, in Theorem 1, we derive the learning algorithm and prove its convergence for updating each row w of W when H is fixed to a nonnegative matrix. The learning algorithm for updating each column h of H when W is fixed to a nonnegative matrix is given in Theorem 2 and can be proved similarly; the proof is skipped due to space limitations.

Theorem 1. For the quadratic optimization problem
\[
\min_{w}\; E(w; H, \alpha) = \tfrac{1}{2}\| v - wH \|^2 + \tfrac{1}{2}\alpha\, w M w^T, \quad \text{s.t. } w \ge 0,
\tag{9}
\]
where w is an r-dimensional row vector, v is a given m-dimensional nonnegative row vector, H is an r by m fixed nonnegative matrix, M is an r by r constant matrix with all elements being 1 except the diagonal elements, which are zero, and α is a fixed nonnegative parameter, the following update rule
\[
w_a^{t+1} = w_a^{t}\, \frac{(vH^T)_a}{(w^t H H^T + \alpha\, w^t M)_a}
\tag{10}
\]
converges to its optimal solution from any initialized nonnegative vector w^0.

Proof. The convergence proof is performed by introducing an appropriate auxiliary function F(w, w^t) that satisfies
\[
F(w^t, w^t) = E(w^t), \tag{11}
\]
\[
F(w, w^t) \ge E(w). \tag{12}
\]
If such a function can be found, then updating w by
\[
w^{t+1} = \arg\min_{w} F(w, w^t) \tag{13}
\]
yields
\[
E(w^{t+1}) \le F(w^{t+1}, w^t) \le F(w^t, w^t) = E(w^t), \tag{14}
\]
so the objective function E(w) decreases with every iteration, indicating that the algorithm converges under the updating formula (13).

We now construct the auxiliary function as
\[
F(w, w^t) = E(w^t) + (w - w^t)\nabla E(w^t) + \tfrac{1}{2}(w - w^t)\, J(w^t)\, (w - w^t)^T,
\tag{15}
\]
where J(w^t) is the diagonal matrix
\[
J(w^t) = \delta_{ab}\, \frac{(H H^T w^{tT} + \alpha M w^{tT})_a}{(w^{tT})_a}.
\tag{16}
\]
Obviously, F(w^t, w^t) = E(w^t), so formula (11) holds. The Taylor expansion of the loss function E(w) around w^t can be written as

\[
E(w) = E(w^t) + (w - w^t)\nabla E(w^t) + \tfrac{1}{2}(w - w^t)\big( H H^T + \alpha M \big)(w - w^t)^T.
\tag{17}
\]
Subtracting E(w) in (17) from F(w, w^t) in (15), we have

\[
F(w, w^t) - E(w)
= \tfrac{1}{2}(w - w^t)\Big[ \delta_{ab}\,\frac{(H H^T w^{tT} + \alpha M w^{tT})_a}{(w^{tT})_a} - H H^T - \alpha M \Big](w - w^t)^T
= \tfrac{1}{2}(w - w^t)\Big[ \delta_{ab}\,\frac{(Q w^{tT})_a}{(w^{tT})_a} - Q \Big](w - w^t)^T,
\tag{18}
\]
where Q = HH^T + αM. Since Q is a nonnegative symmetric matrix (H is the nonnegative factor of V and α is a nonnegative parameter) and w^t is a nonnegative vector, Lemma 1 implies that the matrix δ_ab((Qw^{tT})_a/(w^{tT})_a) − Q is positive semidefinite, and therefore F(w, w^t) − E(w) ≥ 0 always holds. Hence, updating w according to w^{t+1} = arg min_w F(w, w^t) always makes the iteration process converge.

We employ the steepest descent strategy to find the optimal w. For this purpose, we require w^{t+1} to satisfy ∂F(w, w^t)/∂w|_{w = w^{t+1}} = 0, from which we get ∇E(w^t) + J(w^t)(w^{t+1} − w^t)^T = 0, or equivalently
\[
w^{(t+1)T} = w^{tT} - J^{-1}(w^t)\,\nabla E(w^t).
\tag{19}
\]
By the definition of the loss function E(w), we have
\[
\nabla E(w^t) = H\big( H^T w^{tT} - v^T \big) + \alpha M w^{tT} = \big( H H^T + \alpha M \big) w^{tT} - H v^T.
\tag{20}
\]
Since J(w^t) is a diagonal matrix, J^{-1} only requires inverting each diagonal element of J. Hence, we obtain the following updating formula for the a-th element of w:
\[
w_a^{t+1} = w_a^t - \frac{(w^{tT})_a}{(H H^T w^{tT} + \alpha M w^{tT})_a}\,\big( H H^T w^{tT} + \alpha M w^{tT} - H v^T \big)_a
= w_a^t\,\frac{(H v^T)_a}{(H H^T w^{tT} + \alpha M w^{tT})_a}
= w_a^t\,\frac{(v H^T)_a}{(w^t H H^T + \alpha\, w^t M)_a}.
\tag{21}
\]

Theorem 2. For the quadratic optimization problem
\[
\min_{h}\; E(h; W, \beta) = \tfrac{1}{2}\| v - Wh \|^2 + \beta\, I^T h, \quad \text{s.t. } h \ge 0,
\tag{22}
\]
where h is an r-dimensional column vector, v is a given n-dimensional nonnegative column vector, W is an n by r fixed nonnegative matrix, I is the r-dimensional column vector with all elements being 1, and β is a fixed nonnegative parameter, the following update rule
\[
h_a^{t+1} = h_a^t\, \frac{(W^T v)_a}{(W^T W h^t + \beta I)_a}
\tag{23}
\]
converges to its optimal solution from any initialized nonnegative vector h^0.

Algorithm parameters: α, β.
Input: an n by m nonnegative observation matrix V.
Output: an n by r nonnegative matrix W and an r by m nonnegative matrix H.
Step 1: Set t = 0 and generate nonnegative matrices W^0 and H^0 at random.
Step 2: Update H and W by
\[
H^{t+1} = H^t \otimes \frac{(W^t)^T V}{(W^t)^T W^t H^t + \beta I}, \qquad
W^{t+1} = W^t \otimes \frac{V (H^t)^T}{W^t H^t (H^t)^T + \alpha\, W^t M},
\]
where ⊗ and the fraction bar denote elementwise multiplication and division, I is here an r by m matrix with all elements being 1, and M is an r by r matrix with all elements being 1 except the diagonal elements, which are zero.
Step 3: Increment t (t = t + 1) and go to Step 2 until H^{t+1} and W^{t+1} converge.

Algorithm 1: Learning algorithm.

This theorem can be proved in the same manner as Theorem 1.

By writing (10) and (23) in (elementwise) Hadamard-product form, one obtains the learning algorithm summarized in Algorithm 1 for updating both W and H in the PE-NMF optimization problem (2).

Theorem 3. For the optimization problem (2), Algorithm 1 converges to a locally optimal solution from any initialized nonnegative matrices W^0 and H^0.

It is evident that the portion of the objective function E(W, H; α, β) in (2) relating to a row w of W is exactly E(w; H, α) in (9), and that the portion relating to a column h of H is exactly E(h; W, β) in (22). Hence, alternately updating w and h by formulas (10) and (23) makes the learning process converge to a locally optimal solution of the objective function E(W, H; α, β), and the theorem follows directly from Theorems 1 and 2.

The updates of W and H can also be expressed with the MATLAB commands W = W.*(V*H')./(W*H*H' + alfa*W*M) and H = H.*(W'*V)./(W'*W*H + beta).
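The same updates can also be written as a short NumPy sketch; this is our transcription of the MATLAB expressions above (the small constant eps added to the denominators to avoid division by zero is our own addition, not part of the original algorithm).

```python
import numpy as np

def pe_nmf(V, r, alpha, beta, n_iter=500, eps=1e-9, seed=0):
    """Multiplicative PE-NMF updates (sketch of Algorithm 1)."""
    rng = np.random.default_rng(seed)
    n, m = V.shape
    W = rng.random((n, r))
    H = rng.random((r, m))
    M = np.ones((r, r)) - np.eye(r)          # all ones except a zero diagonal
    for _ in range(n_iter):
        # order follows the MATLAB expressions above: H first, then W
        H *= (W.T @ V) / (W.T @ W @ H + beta + eps)
        W *= (V @ H.T) / (W @ H @ H.T + alpha * (W @ M) + eps)
    return W, H
```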

2.3. Initialization of the Algorithm

To our knowledge, there seem to be two main reasons for NMF to converge to undesired solutions. One is that the basis of a space may theoretically not be unique, and therefore separate runs of NMF may lead to different results. The other may come from the algorithm itself: the loss function sometimes gets stuck in a local minimum during the iterations. Revisiting the loss function of the proposed PE-NMF, it is seen that, similarly to NMF, PE-NMF still sometimes gets stuck in a local minimum during its iterations, and/or the number of iterations required to obtain the desired solutions is very large. For these reasons, an ICA-based technique is proposed for initializing the source matrix instead of setting it to a random nonnegative matrix: we perform ICA on the observed signals and take the absolute values of the independent components obtained from ICA as the initialization of the source matrix. In fact, there are reasons why the independent components obtained from ICA are generally not the original sources. One reason is the nonnegativity of the original sources: the centering preprocessing of ICA makes each independent component contain both positive and negative elements, since the mean of each independent component is zero. Another reason is that the original sources are possibly dependent or only partially independent, which violates the independence requirement on sources in ICA. Hence, the independent components from ICA cannot be considered a recovery of the original sources. Even so, they still provide clues about the original sources: they can be considered very rough estimates of the original sources. From this perspective, and noticing that the initialization of the source matrix should be nonnegative, we set the absolute values of the independent components obtained from ICA as the initialization of the source matrix for the proposed PE-NMF algorithm. Our experiments indicate that such an initialization technique is very effective in speeding up the learning process toward the desired solutions.

Figure 3: Extended bar problem solution obtained from PE-NMF: (a) source images, (b) mixed images, (c) recovered images from PE-NMF.

Figure 4: Recovered images from (a) ICA and (b) NMF for the extended bar problem.
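A minimal sketch of the ICA-based initialization described in Section 2.3, using scikit-learn's FastICA in place of the original FastICA package (the choice of library, function, and variable names is ours, not the authors'):

```python
import numpy as np
from sklearn.decomposition import FastICA

def ica_init_H(X, r, random_state=0):
    """Nonnegative initialization H0 (r x m) from the magnitudes of ICA components.

    X: (n, m) matrix whose n rows are the observed mixtures of length m.
    """
    ica = FastICA(n_components=r, random_state=random_state)
    # scikit-learn expects samples in rows, so the m points act as samples here
    S = ica.fit_transform(X.T).T       # (r, m) estimated independent components
    return np.abs(S)                   # absolute values give a nonnegative H0
```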

3. Experiments and Results

The proposed PE-NMF algorithm has been extensively tested on many difficult benchmarks with signals and images of various statistical distributions. Three examples are given below to demonstrate the effectiveness of the proposed method compared with the standard NMF method and/or the ICA method. In the ICA approach used here, we reverse the centering of the recovered signals/images/microarrays to restore their nonnegativity, compensating for the centering preprocessing of ICA. The NMF algorithm is simply the one proposed in [10], and the ICA algorithm is the FastICA algorithm widely used in many applications [11]. The examples include blind source separation for the extended bar problem, for mixed signals, and for real microarray gene expression data in which a heterogeneity effect occurs.

3.1. Extended Bar Problem

The linear bar problem [12] is the blind separation of bars from their combinations. Eight nonnegative feature images (sources) of size 4 × 4, comprising 4 vertical and 4 horizontal thin bar images and shown in Figure 2(a), are randomly mixed to form 1000 observation images, the first 20 of which are shown in Figure 2(b). The solutions obtained from ICA and NMF with r = 8 are shown in Figures 2(c) and 2(d), respectively, indicating that NMF fulfills the task very well compared with ICA. However, when we extended this bar problem to one composed of two types of bars, thin and thick, NMF failed to estimate the original sources. For example, fourteen source images of size 4 × 4 with four thin vertical bars, four thin horizontal bars, three wide vertical bars, and three wide horizontal bars, shown in Figure 3(a), are nonnegative and evidently statistically dependent. These source images were randomly mixed with a mixing matrix whose elements were arbitrarily chosen in [0, 1] to form 1000 mixed images, the first 20 of which are shown in Figure 3(b). The PE-NMF with parameters α = 4 and β = 1 was applied to these mixed images with r = 14. The resulting images, shown in Figure 3(c), indicate that the sources were recovered successfully with the proposed PE-NMF. For comparison, we tried ICA and NMF on this problem many times in order to avoid local-minimum solutions, but they always failed to recover the original sources. Figures 4(a) and 4(b) show examples of the images recovered with these two approaches. Notice that the images recovered by both ICA and NMF are very far from the original sources; the number of sources estimated by ICA is even only 6, rather than 14. It is noticeable that the images recovered by PE-NMF with other parameter values, such as α = 4.2 and β = 0.1, are comparable to the ones shown in Figure 3(c), indicating that the proposed method is not very sensitive to the parameter selection for this example.

Figure 5: Blind signal separation example: (a) 5 original signals, (b) 9 observations, (c) recovered signals from NMF, and (d) recovered signals from PE-NMF.
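For readers who wish to reproduce a similar setup, the extended bar data described in Section 3.1 can be synthesized roughly as follows; the exact construction used by the authors is not given, so the image size, bar widths, and uniform [0, 1] mixing weights below simply follow the description above.

```python
import numpy as np

def make_extended_bar_data(n_images=1000, size=4, seed=0):
    """Thin (width 1) and thick (width 2) bar sources plus random mixtures."""
    rng = np.random.default_rng(seed)
    sources = []
    for i in range(size):                          # 4 thin vertical + 4 thin horizontal
        v = np.zeros((size, size)); v[:, i] = 1; sources.append(v)
        h = np.zeros((size, size)); h[i, :] = 1; sources.append(h)
    for i in range(size - 1):                      # 3 thick vertical + 3 thick horizontal
        v = np.zeros((size, size)); v[:, i:i + 2] = 1; sources.append(v)
        h = np.zeros((size, size)); h[i:i + 2, :] = 1; sources.append(h)
    S = np.array([s.ravel() for s in sources])     # (14, 16) source matrix
    A = rng.uniform(0.0, 1.0, (n_images, len(sources)))  # mixing weights in [0, 1]
    X = A @ S                                      # (n_images, 16) mixed images
    return S, A, X
```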

3.2. Recovery of Mixed Signals

We performed experiments on recovering 5 nonnegative signals from 9 mixtures of 5 nonnegative dependent source signals, the same setting as in [4]. The 9 observed mixture signals come from arbitrary nonnegative linear combinations of the 5 nonnegative source signals shown in Figure 5(a). The difficulty in recovering the sources lies in the very small number of observations compared with the number of sources. Both NMF and our proposed PE-NMF (with α and β taken to be 0.001 and 17.6, resp.) were employed to recover the sources. Comparing the resulting signals obtained by NMF, shown in Figure 5(c), with those obtained by PE-NMF, shown in Figure 5(d), it is evident that PE-NMF recovers the sources with higher accuracy. In fact, the signal-to-interference ratios (SIRs) of the sources recovered by NMF are only 22.17, 11.13, 10.98, 14.91, and 14.15, while those from PE-NMF increase to 47.10, 28.89, 26.67, 83.44, and 28.75 for the 5 source signals.
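The paper does not state how the SIR values are computed; assuming the common convention in which a recovered source is decomposed into its component along the true source plus an interference residual (expressed in dB), a sketch is:

```python
import numpy as np

def sir_db(s_true, s_est):
    """Signal-to-interference ratio (dB) of a recovered source against the truth.

    The estimate is split into its projection onto the unit-normalized true
    source and an interference residual; this is only one common choice, since
    the paper's exact definition is not given.
    """
    s = s_true / np.linalg.norm(s_true)
    target = np.dot(s_est, s) * s            # component of the estimate along s
    interference = s_est - target
    return 10.0 * np.log10(np.sum(target ** 2) / np.sum(interference ** 2))
```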


Figure 6: Heterogeneity correction result: (a) observations, (b) recovered sources from PE-NMF, and (c) real sources.

Figure 7: The scatter plots of the real sources (blue stars) and the recovered sources (red dots) from (a) PE-NMF and (b) NMF.

3.3. Heterogeneity Correction of Gene Microarrays

Gene expression microarrays promise powerful new tools for the large-scale analysis of gene expression. Using this technology, the relative mRNA expression levels derived from tissue samples can be assayed for thousands of genes simultaneously. Such global views are likely to reveal previously unrecognized patterns of gene regulation and generate new hypotheses warranting further study (e.g., new diagnostic or therapeutic biomarkers). However, as a common feature of microarray profiling, gene expression profiles represent a composite of more than one distinct but partially dependent source (i.e., the observed signal intensity consists of the weighted sum of the activities of the various sources). More specifically, in the case of solid tumors, the related issue is called the partial volume effect (PVE), that is, the heterogeneity within the tumor samples caused by stromal contamination. Blind application of microarray profiling could result in extracting signatures reflecting the proportion of stromal contamination in the sample, rather than the underlying tumor biology. Such "artifacts" would be real, reproducible, and potentially misleading, but would not be of biological or clinical interest, and they can severely decrease the sensitivity and specificity of measuring molecular signatures associated with different disease processes. Despite its critical importance to almost all follow-up analysis steps, this issue, called partial volume correction (PVC), is often less emphasized, or at least has not been rigorously addressed, compared to the overwhelming interest and effort in pheno/gene-clustering and class prediction.

The effectiveness of the proposed PE-NMF method was tested on a real-world microarray gene expression data set for PVC. The data set consists of 2308 effective gene expressions from two samples of neuroblastoma and non-Hodgkin lymphoma cell tumors [13]. The two observed microarrays, the microarrays recovered by PE-NMF, and the two pure source microarrays are shown in Figures 6(a), 6(b), and 6(c), respectively. Notice that the true sources are determined, in our present case, by separately profiling the pure cell lines, which provides the ground truth of the gene expression profiles of each cell population. In our clinical case, we use the laser-capture microdissection (LCM) technique to separate cell populations from real biopsy samples. Comparing Figures 6(b) and 6(c), the blind source separation by the PE-NMF method recovered the pure microarrays successfully. Figures 7(a) and 7(b) show the scatter plots of the microarrays recovered by PE-NMF and by NMF compared with those of the pure microarrays. These scatter plots, and the SIRs of 56.79 and 31.73 for the PE-NMF approach versus only 21.20 and 32.81 for the NMF approach, also indicate that the proposed PE-NMF is effective in recovering the sources. Many other independent trials using other gene sets reached similar results.

4. Conclusions

This paper proposes a pattern expression nonnegative matrix factorization (PE-NMF) approach for efficient pattern expression and applies it to blind source separation for the nonnegative linear model (NNLM). Its successful application to blind source separation for the extended bar problem, a nonnegative signal recovery problem, and a heterogeneity correction problem for real microarray gene data indicates that it has great potential for blind source separation under the NNLM model. The loss function of the PE-NMF proposed here is in fact an extension of the multiplicative update algorithm proposed in [10], with two additional terms weighted by the parameters α and β, respectively: β enters the update rule for the matrix H, similarly to some sparse NMF algorithms [14], and α enters as a regularization term added to HH^T in the update rule for the matrix W. The loss function of the PE-NMF is a special case of that proposed in [4]. In this approach, however, not only is the learning algorithm motivated by expressing patterns more effectively and efficiently and demonstrated successfully in a wide range of applications, but the convergence of the learning algorithm is also proved by introducing an auxiliary function. In addition, a technique based on independent component analysis (ICA) is proposed for speeding up the learning procedure and has been verified to be effective in making the learning algorithm converge to the desired solutions.

As mentioned in [4], the optimal choice of the PE-NMF parameters depends on the distribution of the data and on a priori knowledge about the hidden (latent) components. However, our experimental results on the extended bar problem indicate that the parameter choice is not very sensitive for some problems.

Acknowledgment

This work was supported by the National Science Fund of China under Grants nos. 60574039 and 60371044 and a Sino-Italian joint cooperation fund, and by the US National Institutes of Health under Grants EB000830 and CA109872.

References

[1] A. Hyvarinen, J. Karhunen, and E. Oja, Independent Component Analysis, John Wiley & Sons, New York, NY, USA, 2001.

[2] P. O. Hoyer and A. Hyvarinen, "Independent component analysis applied to feature extraction from colour and stereo images," Network: Computation in Neural Systems, vol. 11, no. 3, pp. 191–210, 2000.

[3] J. Zhang, L. Wei, and Y. Wang, "Computational decomposition of molecular signatures based on blind source separation of non-negative dependent sources with NMF," in Proceedings of the 13th IEEE Workshop on Neural Networks for Signal Processing (NNSP '03), pp. 409–418, Toulouse, France, September 2003.

[4] A. Cichocki, R. Zdunek, and S. Amari, "New algorithms for non-negative matrix factorization in applications to blind source separation," in Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP '06), vol. 5, pp. 621–624, Toulouse, France, May 2006.

[5] A. Cichocki and R. Zdunek, "Multilayer nonnegative matrix factorization," Electronics Letters, vol. 42, no. 6, pp. 947–948, 2006.

[6] A. Pascual-Montano, J. M. Carazo, K. Kochi, D. Lehmann, and R. D. Pascual-Marqui, "Nonsmooth nonnegative matrix factorization (nsNMF)," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 28, no. 3, pp. 403–415, 2006.

[7] R. Zdunek and A. Cichocki, "Nonnegative matrix factorization with constrained second-order optimization," Signal Processing, vol. 87, no. 8, pp. 1904–1916, 2007.

[8] I. Kopriva, D. J. Garrood, and V. Borjanovic, "Single frame blind image deconvolution by non-negative sparse matrix factorization," Optics Communications, vol. 266, no. 2, pp. 456–464, 2006.

[9] W. Liu and N. Zheng, "Non-negative matrix factorization based methods for object recognition," Pattern Recognition Letters, vol. 25, no. 8, pp. 893–897, 2004.

[10] D. D. Lee and H. S. Seung, "Learning the parts of objects by non-negative matrix factorization," Nature, vol. 401, no. 6755, pp. 788–791, 1999.

[11] http://www.cis.hut.fi/projects/ica/fastica/.

[12] P. Foldiak, "Forming sparse representations by local anti-Hebbian learning," Biological Cybernetics, vol. 64, no. 2, pp. 165–170, 1990.

[13] J. Khan, J. S. Wei, M. Ringner, et al., "Classification and diagnostic prediction of cancers using gene expression profiling and artificial neural networks," Nature Medicine, vol. 7, no. 6, pp. 673–679, 2001.

[14] M. Mørup, L. K. Hansen, and S. M. Arnfred, "Algorithms for sparse non-negative TUCKER (also named HONMF)," Tech. Rep., Technical University of Denmark, Lyngby, Denmark, 2007, http://www2.imm.dtu.dk/pubdb/views/edoc_download.php/4658/pdf/imm4658.pdf.


Hindawi Publishing Corporation
Computational Intelligence and Neuroscience
Volume 2008, Article ID 857453, 14 pages
doi:10.1155/2008/857453

Research Article

Robust Object Recognition under Partial Occlusions Using NMF

Daniel Soukup and Ivan Bajla

Smart systems division, ARC Seibersdorf research GmbH, 2444 Seibersdorf, Austria

Correspondence should be addressed to Daniel Soukup, [email protected]

Received 2 October 2007; Revised 18 December 2007; Accepted 10 March 2008

Recommended by Morten Mørup

In recent years, nonnegative matrix factorization (NMF) methods for reduced image data representation have attracted the attention of the computer vision community. These methods are considered a convenient part-based representation of image data for recognition tasks with occluded objects. A novel modification of the NMF recognition task is proposed which utilizes the matrix sparseness control introduced by Hoyer. We have analyzed the influence of sparseness on recognition rates (RRs) for various dimensions of subspaces generated for two image databases, the ORL face database and the USPS handwritten digit database. We have studied the behavior of four types of distances between a projected unknown image object and the feature vectors in NMF subspaces generated for training data. One of these metrics is also a novelty we propose. In the recognition phase, partial occlusions in the test images have been modeled by placing two randomly sized, randomly positioned black rectangles in each test image.

Copyright © 2008 D. Soukup and I. Bajla. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

1. Introduction

Subspace methods represent a separate branch of high-dimensional data analysis in areas such as computer vision and pattern recognition. In particular, these methods have found efficient applications in the fields of face identification and recognition of digits and characters. In general, they are characterized by learning a set of basis vectors from a set of suitable image templates. The subspace spanned by this vector basis captures the essential structure of the input data. Having found the subspace (offline phase), the classification of a new image (online phase) is accomplished by projecting it onto the subspace in some way and by finding the nearest neighbor among the templates projected onto this subspace.

In 1999, Lee and Seung [1] showed for the first time that, for a collection of face images, an approximative representation by basis vectors encoding the mouth, nose, and eyes can be obtained using nonnegative matrix factorization (NMF). NMF is a method for generating a linear representation of data using nonnegativity constraints on the basis vector components and the coefficients. It can formally be described as follows:

V ≈ W·H , (1)

where V ∈ R^{n×m} is a positive image data matrix with n pixels and m image sample templates (template images are usually represented in lexicographic order of pixels as column vectors), W ∈ R^{n×r} contains the r reduced basis column vectors of an NMF subspace, and H ∈ R^{r×m} contains the coefficients of the linear combinations of the basis vectors needed to reconstruct the original data. Usually, r is chosen by the user so that (n + m)r < nm. Each column of the matrix W then represents a basis vector of the generated (NMF) subspace. Each column of H represents the weights needed to approximate the corresponding column in V (image template) by means of the vector basis W. Various error functions have been proposed for NMF, such as in the papers of Lee and Seung [2] or Paatero and Tapper [3].

The main idea of applying NMF to visual object recognition is that the NMF algorithm identifies localized parts describing the structure of the object type. These localized parts can be added in a purely additive way, with varying combination coefficients, to form the individual objects. The original algorithm of Lee and Seung could not achieve this locality of essential object parts in a proper way. Thus, other authors investigated possibilities to control the sparseness of the basis images (columns of W) and the coefficients (matrix H). The first attempts consisted of altering the norm that measures the approximation accuracy, as in LNMF [4, 5]. Hoyer introduced a method for steering the sparsenesses of both factor matrices, W and H, with two sparseness parameters [6, 7]. In their work, Pascual-Montano et al. briefly summarized and described all the NMF algorithms used in this topic [8]. Their approach also led to a sparseness control parameter, but only one for both matrices. The optimization algorithm remained equal to the one already introduced by Lee and Seung.

One important problem in using NMF for recognition tasks is how to obtain NMF subspace projections for new image data that are comparable with the feature vectors determined by NMF and coded in the matrix H. Guillamet and Vitria [9] propose one method in their work, which consists of rerunning the NMF algorithm for the new image data while keeping W constant. In the conventional method, however, training images and new images are orthogonally projected onto the determined subspace. Both methods have advantages and drawbacks. We will discuss them in more detail and propose a modification of the NMF task that combines the advantages of both methods.

An important aspect of measuring distances in NMF subspaces, which is necessary in recognition tasks, is the metric used. NMF subspace basis vectors do not form an orthogonal system. Due to this fact, it is not convenient to apply the natural Euclidean metric. Guillamet and Vitria [9] experimented with several alternative metrics: L1, L2, cos, and EMD. They pointed out that only EMD takes the positive aspects of NMF into account. As this metric is computationally demanding, Ling and Okada [10] proposed a new dissimilarity measure, the diffusion concept, which is as accurate as EMD but computationally much more efficient. Liu et al. [11, 12] proposed replacing the Euclidean distance in NMF recognition tasks by a weighted Euclidean distance (a version of a Riemannian distance). These authors also experimented with orthogonalized bases. However, as the authors comment, these modified NMF bases are not part-based anymore.

In our research, we focus on studying the influence of the matrix sparseness parameters, the subspace dimension, and the choice of distance measure on the recognition rates, in particular for partially occluded objects. We use Hoyer's algorithms to achieve sparseness control. Additionally, we propose a modification of the entire NMF task similar to the methods of Yuan and Oja [13] and Ding et al. [14]. The implementation of our modification additionally comprises Hoyer's sparseness control mechanisms. In the study of proper distance measures, we propose a new metric.

In Section 2, we briefly review Hoyer's method (Section 2.1). Section 2.2 contains the motivation for and a detailed description of our modification of the NMF task. Section 2.3 is about distance measurement in NMF subspaces; we present the metrics used in our experiments and propose a new distance measure. We then present the setup and results of our experiments in Section 3. Section 4 contains conclusions and a future outlook.

2. NMF with Sparseness Constraints

The aim of the work of Hoyer [7] is to constrain NMF to find a solution with prescribed degrees of sparseness of the matrices W and H. The author claims that the balance of the sparseness between these two matrices depends on the specific application and that no general recommendation can be given. The modified NMF problem and its solution are given by Hoyer as follows.

2.1. Hoyer's Method: Nmfsc

2.1.1. Problem Definition

Given a nonnegative data matrix V of size n × m, find nonnegative matrices W and H of sizes n × r and r × m, respectively, such that
\[
E(W, H) = \| V - WH \|^2 \tag{2}
\]
is minimized, under the optional constraints
\[
s(w_i) = s_W, \qquad s(h_i) = s_H, \qquad \forall i,\; i = 1, \ldots, r,
\tag{3}
\]

where w_i is the i-th column of W and h_i is the i-th row of H. Here, r denotes the dimensionality of the NMF subspace spanned by the column vectors of the matrix W, and s_W and s_H are their desired sparseness values. The sparseness criteria proposed by Hoyer [7] use a measure based on the relationship between the L1 and L2 norms of the given vectors w_i or h_i. In general, for a given n-dimensional vector x with components x_i, its sparseness measure s(x) is defined by the formula
\[
s(x) := \frac{\sqrt{n} - L_1/L_2}{\sqrt{n} - 1}
= \frac{\sqrt{n} - \sum |x_i| \big/ \sqrt{\sum x_i^2}}{\sqrt{n} - 1}.
\tag{4}
\]

This measure quantifies how much of the energy of the vector is packed into a few components. The function evaluates to 1 if and only if the given vector contains a single nonzero component, and its value is 0 if and only if all components are equal. It should be noted that the scales of the vectors w_i or h_i have not been constrained yet. However, since w_i·h_i = (w_i λ)·(h_i/λ), we are free to arbitrarily fix any norm of either one. In Hoyer's algorithm, the L2 norm of h_i is fixed to unity.
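Written out in code, the measure in (4) is a one-liner; the sketch below is ours and assumes a nonzero input vector.

```python
import numpy as np

def sparseness(x):
    """Hoyer's sparseness of a vector, Eq. (4): 1 for a single nonzero entry,
    0 when all entries have equal magnitude."""
    x = np.asarray(x, dtype=float).ravel()
    n = x.size
    l1 = np.abs(x).sum()
    l2 = np.sqrt(np.sum(x ** 2))
    return (np.sqrt(n) - l1 / l2) / (np.sqrt(n) - 1.0)
```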

2.1.2. Factorization Algorithm

The projected gradient descent algorithm for NMF with sparseness constraints essentially takes a step in the direction of the negative gradient and subsequently projects onto the constraint space, making sure that the step taken is small enough that the objective function is reduced at every step. The main muscle of the algorithm is the projection operator proposed by Hoyer [7], which enforces the required degree of sparseness.
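For completeness, a sketch of that projection operator (finding the closest nonnegative vector with prescribed L1 and L2 norms, following the construction in Hoyer's paper); this is our transcription and it omits edge-case handling.

```python
import numpy as np

def project_l1_l2(x, target_l1, target_l2=1.0):
    """Project x onto {s >= 0, ||s||_1 = target_l1, ||s||_2 = target_l2}."""
    n = x.size
    s = x + (target_l1 - x.sum()) / n              # move onto the L1 hyperplane
    zeroed = np.zeros(n, dtype=bool)
    while True:
        # midpoint of the current face of the L1 hyperplane
        m = np.where(zeroed, 0.0, target_l1 / (n - zeroed.sum()))
        w = s - m
        # pick alpha >= 0 so that ||m + alpha * w||_2 = target_l2
        a = np.sum(w ** 2)
        b = 2.0 * np.sum(w * m)
        c = np.sum(m ** 2) - target_l2 ** 2
        alpha = (-b + np.sqrt(max(b * b - 4.0 * a * c, 0.0))) / (2.0 * a)
        s = m + alpha * w
        if np.all(s >= 0):
            return s
        # clip negatives, fix them at zero, and redistribute the L1 mass
        zeroed |= s < 0
        s[s < 0] = 0.0
        shift = (s.sum() - target_l1) / (n - zeroed.sum())
        s[~zeroed] -= shift
```

For a vector normalized to unit L2 norm, a desired sparseness value s translates into the L1 target as target_l1 = sqrt(n) − s·(sqrt(n) − 1), by inverting (4).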


2.2. Modified NMF Concept: modNMF

In the papers mentioned so far, attention was concentrated on methodological aspects of NMF as a part-based representation of image data, as well as on numerical properties of the developed optimization algorithms applied to the matrix factorization problem. It turned out that the notion of matrix sparseness involved in NMF plays the central role in part-based representation. However, little effort has been devoted to a systematic analysis of the behavior of NMF algorithms in actual pattern recognition problems, especially for partially occluded data.

For a particular recognition task of objects represented by a set of training images (V), we need (i) to calculate in advance (in an offline mode) the projection vectors of the training images onto the obtained vector basis (W), the so-called feature vectors, and then (ii) to calculate (in an online mode) a projection vector onto the obtained vector basis (W) for each unknown input vector y. Guillamet and Vitria [9] propose to use the feature vectors determined in the NMF run, that is, the columns of the matrix H. The problem of determining projected vectors for new input vectors in a way that makes them comparable with the feature vectors is solved by the authors by rerunning the NMF algorithm. In this second run, they keep the basis matrix W constant, and the matrix V_test contains the new input vectors instead of the training image vectors. The results of the second run are the sought projected vectors in the matrix H_test. However, this method has some drawbacks. We investigated the behavior of NMF exemplarily on 3D point data instead of high-dimensional images. These points were divided into two classes based on point proximity. The two classes are called A and B and are illustrated in Figure 1. We ran NMF to obtain a two-dimensional subspace, visualized as the yellow grid in Figure 1, spanned by the two vectors w1 and w2, which together build the matrix W. Additionally, we show the feature vectors of the input point sets (HA and HB in Figure 1) and connect each input point with its corresponding feature vector in the subspace plane (projection rays). Especially for point set A, it can be observed that the projection rays are all nonorthogonal with respect to the plane and that their mutual angles differ significantly (even for feature vectors belonging to the same class). Thus, the feature vectors of set A and set B are no longer separated clusters. We doubt that a reliable classification based on the proximity of feature vectors is achievable in this case. A second possibility to determine proper feature vectors for an NMF subspace, which is conventionally used (e.g., mentioned by Buciu [15]), is to recompute the training feature vectors for the classification phase entirely anew by orthogonally projecting the training points (images) onto the NMF subspace. Unknown input data to be classified are similarly orthogonally projected onto the subspace. This method is also visualized in Figure 1: from each input point, an orthogonal dotted line is drawn to the orthogonal projection of the point onto the subspace plane. It can be noticed that the feature vectors determined in this way preserve a separation of the feature vector clusters corresponding to the cluster separation in the original data space (point sets W†A and W†B). In view of these observations, we propose to favor the orthogonal projection method.

Figure 1: Visualization of the Nmfsc results for a low-dimension example (3D data sets A and B as training points). The plane spanned by w1 and w2 represents the NMF subspace due to this training set. HA and HB are the training set projections onto the subspace implicitly given by the matrix H in the NMF algorithm. W†A and W†B are the orthogonal projections of the training sets onto the NMF subspace.

Nonetheless, both methods have their disadvantages. The method of Guillamet and Vitria operates with nonorthogonally projected feature vectors that stem directly from the NMF algorithm and do not reflect the data cluster separation in the subspace. On the other hand, the conventional method does not accommodate the optimal data approximation result determined by NMF, because one of the two optimal factor matrices is substituted by a different one in the classification phase. Our intention was to combine the benefits of both methods into one, that is, the benefits of orthogonal projections of input data and the preservation of the optimal training data approximation of NMF. We achieve this by changing the NMF task itself. Before we present this modification, we recall in more detail how the orthogonal projections of the input data are computed.

As the basis matrix W is rectangular, its matrix inverse is not defined. Therefore, one has to use a pseudoinverse of W to multiply it from the left onto V (cf. [15]). Orthogonal projections of data points y onto a subspace defined by a basis vector matrix W are realized by solving the following overdetermined equation system:

W b = y (5)

for the coefficient vector b. This can, for instance, be achieved via the Moore-Penrose (M-P) pseudoinverse W† (this may not be the numerically most stable way, but in our investigations we could not observe differences from other, usually more appropriate, methods), giving the result for the projection as

b = W† y. (6)

Similarly, for the NMF feature vectors (in the offline mode) we determine H_LS = W†V, where H_LS are the projection coefficients obtained in the least-squares (LS) manner. These coefficients can differ severely from the NMF feature vectors implicitly given by H (see Figure 1). It is important to state that the entries of H_LS are not nonnegative anymore; H_LS also contains negative values.
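In code, these projections amount to a single least-squares solve; the sketch below is ours (np.linalg.lstsq is used instead of forming the pseudoinverse explicitly, which is our choice rather than the authors').

```python
import numpy as np

def project_onto_subspace(W, Y):
    """Orthogonal-projection coefficients of the columns of Y onto span(W).

    Returns B with W @ B ~= Y in the least-squares sense, i.e. B = pinv(W) @ Y.
    Entries of B may be negative even though W and Y are nonnegative.
    """
    B, *_ = np.linalg.lstsq(W, Y, rcond=None)
    return B

# offline: H_LS = project_onto_subspace(W, V)        # training feature vectors
# online:  b    = project_onto_subspace(W, y_test)   # feature vector of a new image
```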

If one has decided to use the orthogonal projections of the input data onto the subspace as feature vectors, the fact that the matrix H is not used anymore in the classification phase, and that the substitute used for H, namely H_LS, is no longer nonnegative, gives rise to the questions of whether the matrix H is necessary at all in NMF and whether the corresponding coefficient coding necessarily has to be nonnegative. Moreover, using the orthogonal projection method, we do not make use of the optimal factorization achieved by NMF, as the coefficient matrix is altered for classification. Consequently, we propose the following modification of the NMF task itself:

given the training matrix V, we search for a matrix W such that

V ≈ W(W†V). (7)

Within this novel concept (modNMF), W is updated in the same way as in common NMF algorithms. Even the sparseness of W can be controlled by the standard mechanisms, for example, those of Hoyer's method. Only the coding matrix H is substituted by the matrix W†V to determine the current approximation error. Thus, this new concept can be applied to all existing NMF algorithms. In our research, we implemented and analyzed modNMF comprising the sparseness control mechanisms of Hoyer.
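A minimal sketch of one modNMF iteration under this description; the plain projected-gradient step on W shown here is our simplification (the authors additionally apply Hoyer's sparseness control, which is omitted), and the helper names are ours.

```python
import numpy as np

def modnmf_error(V, W):
    """Approximation error of the modNMF model V ~= W (W^+ V), Eq. (7)."""
    H = np.linalg.pinv(W) @ V
    return np.linalg.norm(V - W @ H, 'fro')

def modnmf_step(V, W, step=1e-3):
    """One simplified modNMF update of W with the coding matrix fixed to W^+ V."""
    H = np.linalg.pinv(W) @ V                  # coding matrix W^+ V (may be signed)
    grad = (W @ H - V) @ H.T                   # gradient of 0.5*||V - W H||_F^2 in W
    return np.maximum(W - step * grad, 0.0)    # gradient step + nonnegativity clip
```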

There are two existing methods that are related to modNMF in two complementary ways. In the projective NMF (pNMF) of Yuan and Oja [13], the independent factor matrix H is given up and, similarly to modNMF, substituted by a coding matrix derived from W and V, namely W^T V. Ding et al. [14] realize the second change incorporated in modNMF. They keep two independent factor matrices in their semi-NMF method but give up the nonnegativity restriction for one of them. Unlike modNMF, the nonnegativity constraint is kept for the coding matrix H, while the signs in the subspace basis W are not restricted. Following the notion of Ding et al., modNMF is also only semi-nonnegative. The resulting subspaces of semi-NMF and modNMF are not classical NMF subspaces anymore. However, in the traditional NMF methods, as we have outlined above, the training images have to be orthogonally projected onto the determined subspace in preparation for the object recognition phase, and this also results in mixed-sign subspace vectors. (Because, for traditional NMF methods as well as for modNMF, the subspace features actually used in the recognition phase are not purely nonnegative, and because the determined subspace bases are in general all nonorthogonal, we face the same problems in the recognition phase for modNMF and for classical NMF subspaces. To simplify matters in this paper, we subsume all these subspaces under the notion of NMF subspaces in the further discussions, which address the issues related to recognition experiments in such subspaces.) The difference in the case of modNMF is that the orthogonal projections of the training images onto the subspace that are used in the recognition task (W†V) are also those for which the factorization that optimally approximates the training data V is achieved. In semi-NMF, this is not guaranteed; that is, an extra orthogonal projection of the training images onto the subspace has to be done to prepare the object recognition phase. These extra projections do not comprise the structure of the optimally approximating factor matrix determined in the factorization run, just as in classical NMF methods. Similarly to modNMF, pNMF ensures that the subspace features are the orthogonal projections of the training images onto the subspace, while these very subspace features simultaneously constitute the optimal factorization matrix in the sense of approximating V. Actually, pNMF is in some sense a special case of modNMF. Both try to optimize W with the goal of approximating an identity matrix as closely as possible in the factor in front of V: modNMF in the case of WW† and pNMF for WW^T. Thus, although the orthogonality of W may not be explicitly demanded in pNMF, within the factorization process W has to approximate an orthogonal matrix more and more as the approximation improves. Thanks to the fact that the more general modNMF model does not contain such structural restrictions on W (except nonnegativity), there are more degrees of freedom in modNMF to approximate V accurately. Moreover, the sparseness of W can be controlled in modNMF via the sparseness parameter.

2.3. Distances in NMF Subspaces

Having solved the NMF task for the given training images (matrix V), the vector basis of an NMF subspace (of the original data space) is generated as the columns of the matrix W. Depending on the sparseness of W and H controlled in the algorithms, the basis vectors in W exhibit different mutual angles, that is, the basis is not orthogonal. With increasing sparseness of W or decreasing sparseness of H, the mutual angles tend to be closer to orthogonality. If both sparseness parameters are adjusted, the dependence on them is not so obvious and straightforward.

As outlined by the various authors mentioned in Section 1, suitable metrics for measuring the distances of NMF subspace points have to be defined, due to the nonorthogonality of NMF subspace bases. In our work, we compared four metrics: Euclidean, diffusion, Riemannian, and ARC-distance.

For comparison reasons, we also included the Euclidean metric, d^2(x1, x2) = (x1 − x2)^T (x1 − x2), which is commonly supposed not to be suitable in vector spaces with a nonorthogonal basis. The diffusion distance is derived from the EMD metric, for which Guillamet and Vitria [9] argued that it is well suited to the positive aspects of NMF. The complete derivation of the diffusion distance can be found in the work of Ling and Okada [10], who developed this dissimilarity concept to achieve a computationally more efficient algorithm.

The third metric, the Riemannian distance, will be described in more detail, as it is the basis of our proposal, the ARC-distance. Liu and Zheng [11] defined the Riemannian distance as a weighted Euclidean distance, d^2_G(x1, x2) = (x1 − x2)^T G (x1 − x2), where G is a similarity matrix defined as G = W^T W. They claimed that adopting this Riemannian metric is more suitable than the Euclidean distance for classification when using nearest neighbor classifiers.


Figure 2: An example of face images of one person selected from the ORL face database (two top lines). An example of different randomly occluded faces (the bottom line).

Figure 3: An example of handwritten digit images selected from the USPS database (two top lines). An example of different randomly occluded digits (the bottom line).

For the standard Euclidean metric d^2 and the Riemannian metric d^2_G of two vectors x, y from a subspace, the following relation can be drawn: d^2_G(x, y) = (x − y)^T W^T W (x − y) = (W(x − y))^T W(x − y) = d^2(Wx, Wy). This proves that the Riemannian distance measures the Euclidean distance of the back-projected subspace vectors, that is, of the subspace points represented in the orthogonal bases of the image super space. Thus the Riemannian distance takes the angle structure of the NMF subspace bases into account.
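This identity is easy to check numerically; a small sketch with assumed toy dimensions (basis size and subspace dimension chosen arbitrarily):

```python
import numpy as np

rng = np.random.default_rng(1)
W = rng.random((2668, 50))                 # toy nonnegative basis (image space x subspace)
x, y = rng.random(50), rng.random(50)      # two feature vectors in the subspace

G = W.T @ W                                # similarity matrix of the Riemannian metric
d2_riemann = (x - y) @ G @ (x - y)         # (x - y)^T G (x - y)
d2_backprojected = np.sum((W @ (x - y)) ** 2)   # Euclidean distance of back-projections

assert np.isclose(d2_riemann, d2_backprojected)  # the two values coincide
```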

To be able to deal with partial occlusions, a properly chosen distance measure should also be able to discriminate between two specific cases: (i) a case in which the Riemannian distance of two vectors is large because of great deviations in all components of these vectors, and (ii) a case in which only a few components contribute to the large value of the Riemannian distance, that is, when the recognition error is sparsely distributed over the feature vector components. Therefore, to define a modified Riemannian distance (the "ARC-distance" for short), we introduce a sparseness term into the Riemannian metric formula: d^2_G(x, y) = (x − y)^T G (x − y) (1 − s(|x − y|)), where s measures the sparseness (compare Section 2) of the absolute difference of the feature vectors. Note that the sparseness should be measured in the feature space, as each component in this representation is optimized to reflect one essential part of the training image objects.
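A direct transcription of this definition could look as follows; this is a sketch of our reading of the formula, assuming Hoyer's sparseness measure from [7] for s, and the function names are ours.

```python
import numpy as np

def hoyer_sparseness(v):
    """Hoyer's sparseness measure [7]: 1 for a vector with a single nonzero
    component, 0 for a vector whose components are all equal."""
    n = v.size
    l2 = np.linalg.norm(v)
    if l2 == 0.0:
        return 0.0
    return (np.sqrt(n) - np.abs(v).sum() / l2) / (np.sqrt(n) - 1.0)

def arc_distance_sq(x, y, G):
    """Squared ARC-distance: the Riemannian distance damped by the sparseness
    of the absolute difference of the feature vectors."""
    d = x - y
    return (d @ G @ d) * (1.0 - hoyer_sparseness(np.abs(d)))
```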

3. Results of Computer Experiments

The goal of our study was to investigate the influence of sparseness control parameters and subspace metrics on recognition rates of unoccluded and occluded images. In extensive computer experiments, we varied the dimension of the NMF subspaces from r = 25 up to r = 250, similarly to the papers of Guillamet and Vitria [9] and Liu and Zheng [11]. The method of nearest neighbor classification has been used for object recognition.
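In outline, each experiment reduces to a nearest-neighbor search over subspace feature vectors. The sketch below is only our schematic reading of the procedure; the projection of the test images by the pseudoinverse of W and the function names are assumptions.

```python
import numpy as np

def recognition_rate(W, H_train, labels_train, V_test, labels_test, dist_sq):
    """Nearest-neighbor classification in an NMF subspace (illustrative sketch).
    Test images (columns of V_test) are projected onto the subspace, and each
    projection receives the label of the closest training feature vector under
    the supplied squared-distance function."""
    H_test = np.linalg.pinv(W) @ V_test                     # test feature vectors
    correct = 0
    for j in range(H_test.shape[1]):
        d = [dist_sq(H_test[:, j], H_train[:, i]) for i in range(H_train.shape[1])]
        correct += labels_train[int(np.argmin(d))] == labels_test[j]
    return correct / H_test.shape[1]

# e.g. recognition_rate(W, H, y_train, V_test, y_test,
#                       lambda a, b: np.sum((a - b) ** 2))   # Euclidean case
```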

For our experiments, we chose three widely used image databases: (i) the Cambridge ORL face database (cited in the paper of Li et al. [5]; grey-level images with resolution 92 × 112, which were downsampled for our experiments to the size 46 × 58 = 2668 pixels), (ii) the USPS handwritten digit database (cited in the paper of Liu and Zheng [11]; grey-level images with resolution 16 × 16 = 256 pixels), and (iii) the CBCL image database available at the web address http://cbcl.mit.edu/cbcl/software-datasets/FaceData2.html


Figure 4: An example of face images of two persons selected from the CBCL face database (two top lines). An example of different randomly occluded faces (the bottom line).

(cited in the paper of Hoyer [7]), which contains grey-level face images with resolution 19 × 19 = 361 pixels. We simulated object occlusions in test images as two rectangles of random (but limited) sizes superimposed at random positions on an original image (see Figures 2, 3, and 4).
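The occlusion simulation can be sketched as follows; the bound on rectangle size and the fill intensity are our assumptions, since the paper only states that the sizes are random but limited.

```python
import numpy as np

def occlude(img, rng, max_frac=0.4, fill=0.0):
    """Superimpose two rectangles of random (bounded) size at random positions
    on a copy of a grey-level image. Illustrative sketch only."""
    out = img.copy()
    h, w = out.shape
    for _ in range(2):                                       # two rectangles per image
        rh = rng.integers(1, max(2, int(max_frac * h)) + 1)  # random height
        rw = rng.integers(1, max(2, int(max_frac * w)) + 1)  # random width
        r0 = rng.integers(0, h - rh + 1)                     # random top-left corner
        c0 = rng.integers(0, w - rw + 1)
        out[r0:r0 + rh, c0:c0 + rw] = fill
    return out

# e.g. occluded = occlude(face_image, np.random.default_rng(42))
```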

In the case of the ORL database, the number of training images was 222 and the number of testing images was 151. These two sets of images were chosen as disjoint sets. For the experiments with the USPS database, we chose 2000 training images and 1000 testing images (again different from the training ones). (In the USPS recognition rate plots (Figures 6, 8), data points for r = 175 are missing. This is due to a Matlab problem that could not be solved. For some reason, all subspace files containing the matrix H with dimension 175 × 2007 were corrupted and could not be opened anymore. We were able to reproduce the error in simplified configurations; however, we were not able to solve it. As the recognition curves do not oscillate much, we found it justified to simply interpolate between the two points neighboring r = 175.) For the case of the CBCL image database, we used 1620 training images and 809 testing images.

3.1. Nmfsc: Unoccluded versus Occluded Test Images

The results of the first set of our experiments, accomplished for all three image bases, and for unoccluded as well as occluded images, are displayed in Figures 5, 6, and 7. The acronym "Nmfsc" stands here for Hoyer's NMF method with coded sparseness (sW, sH). In this set of tests, Hoyer's Nmfsc algorithm was applied consecutively to ORL face images, USPS digits, and CBCL face images. The algorithms have been trained for various combinations of sparseness parameter values. The resulting NMF subspaces, calculated for different dimensions r = 25, 50, ..., 250, were used for recognition experiments. We used four types of distances to measure the distance of each projected test image to the nearest feature vector (of the templates) in the given subspace. For each NMF subspace, a recognition rate (RR)

over all test images was calculated. The plots show RR versus subspace dimension r (unoccluded: (a), (c), (e); occluded: (b), (d), (f)). The plots with the best recognition results have been chosen.

For unoccluded images, all three data sets show similar RR behavior in the cases of the Riemannian-like metrics (Riemannian and ARC-distance); only the CBCL RR are slightly smaller. The Euclidean and diffusion curves for the ORL and CBCL data are almost as high as for the Riemannian-like measures, but, as one would expect, slightly lower. Their behavior for the USPS data fulfills these expectations even more, as they are much smaller than the Riemannian-like RR curves and, moreover, decrease with increasing dimension. This behavior is to be expected, as more (nonorthogonal) basis vectors introduce more error components into the distance computation. This happens because the Euclidean and diffusion distances do not take into account the mutual basis vector angles. The dimension reduction for all datasets is very high, as for the Riemannian-like metrics all three achieve the maximal RR at about r = 50. It is remarkable that the ARC-distance does not differ from the Riemannian distance. It can be seen (Figures 7(a), 7(c), 7(e)) that the RR values for all types of distances are lower (below 0.9) than those achieved for ORL faces. There are only small differences in RR between the cases corresponding to the application of different distances, but in general, the Riemannian distance yields the maximum values.

The RR behavior for occluded data differs severely between the ORL and USPS data. First, RR maxima for USPS data are higher than for ORL data: below 0.7 in the ORL case versus about 0.75 for USPS data. Second, for ORL data the RR curves of the metrics do not behave in the expected way: Euclidean and diffusion distance generate much better results than the Riemannian-like ones. For USPS, the RR behave qualitatively in the same way as in the unoccluded case, only the RR values are smaller. Finally, RR maxima are achieved at higher dimension values in the ORL case, that is, a much smaller dimension reduction. In the case of the CBCL image database, the situation changes dramatically in comparison to that of the ORL face images: on average the RR are 50% smaller, reaching approximately the value of 0.3 (compared to a 0.7 maximum for ORL). For two value combinations of the sparseness parameters in Hoyer's method (Figures 7(b), 7(f)), the Euclidean distance yields higher RR, though it is not strictly monotone; however, for the case (d), the Riemannian distance outperforms the Euclidean and diffusion ones. The differences between RR values for the Riemannian distances on one side and the Euclidean and diffusion distances on the other side are apparent but not as large as in the case of the ORL face images.

3.2. Occluded Test Images: Nmfsc versus modNMF

In the second part of our study, we were interested in a comparison of the RR of Nmfsc and modNMF, the latter being implemented with Hoyer's sparseness control mechanisms. Of course, since the NMF methodology is intended mainly to generate part-based subspace representations of


Figure 5: Classification results for ORL training image data using Hoyer's method (recognition rate versus subspace dimension r; curves for the Riemannian, diffusion, Euclidean, and ARC-Riemann distances). (a), (c), (e): unoccluded test images for sW = 0.5 and sH = 0.1, 0.5, 0.9. (b), (d), (f): occluded test images for the identical values of the sparseness parameters.


Figure 6: Classification results for USPS training image data using Hoyer's method (recognition rate versus subspace dimension r; curves for the Riemannian, diffusion, Euclidean, and ARC-Riemann distances). (a), (c), (e): unoccluded test images for sW = 0.5 and sH = 0.1, 0.5, 0.9. (b), (d), (f): occluded test images for the identical values of the sparseness parameters.


Figure 7: Classification results for CBCL training image data using Hoyer's method (recognition rate versus subspace dimension r; curves for the Riemannian, diffusion, Euclidean, and ARC-Riemann distances). (a), (c), (e): unoccluded test images for sW = 0.5 and sH = 0.1, 0.5, 0.9. (b), (d), (f): occluded test images for the identical values of the sparseness parameters.


Figure 8: Classification results for ORL training image data (recognition rate versus subspace dimension r; curves for the Riemannian, diffusion, Euclidean, and ARC-Riemann distances). (a), (c), (e): Hoyer's Nmfsc algorithm applied to occluded test images for sW = 0.1, 0.5, 0.9, sH = [ ]. (b), (d), (f): our modified modNMF algorithm applied to occluded test images for the identical values of the sparseness parameters.


Figure 9: Classification results for USPS training image data (recognition rate versus subspace dimension r; curves for the Riemannian, diffusion, Euclidean, and ARC-Riemann distances). (a), (c), (e): Hoyer's Nmfsc algorithm applied to occluded test images for sW = 0.1, 0.5, 0.9, sH = [ ]. (b), (d), (f): our modified modNMF algorithm applied to occluded test images for the identical values of the sparseness parameters.


Figure 10: Classification results for CBCL training image data (recognition rate versus subspace dimension r; curves for the Riemannian, diffusion, Euclidean, and ARC-Riemann distances). (a), (c), (e): Hoyer's Nmfsc algorithm applied to occluded test images for sW = 0.1, 0.5, 0.9, sH = [ ]. (b), (d), (f): our modified modNMF algorithm applied to occluded test images for the identical values of the sparseness parameters.


template images, our further interest was concentrated only on occluded images. These results, obtained for optimum values of the sparseness parameter sW, are displayed in Figures 8, 9, and 10. The plots again show RR versus subspace dimension r, but the columns now distinguish the algorithms used (Nmfsc: (a), (c), (e); modNMF: (b), (d), (f)). The plots with the best recognition results have been chosen.

The qualitative behavior of the RR curves of the ORL faces with respect to the distance measures is the same as described in Section 3.1: Euclidean and diffusion distances unexpectedly dominate the Riemannian-like metrics. Except for a drop in RR values for the Euclidean and diffusion distances in the case of Nmfsc with sW = 0.1, both algorithms, Nmfsc and modNMF, achieve approximately the same results (Figure 8). The qualitatively more expected and quantitatively better results (with respect to RR maxima) are obtained in the case of the USPS data. For Nmfsc with only the sW parameter set, the Riemannian-like RR curves dominate the Euclidean and diffusion distances, whereas, as expected, the latter decrease with increasing dimension and decreasing sparseness sW (Section 2.3). It is remarkable that the novel modNMF algorithm increases and stabilizes the performance of the Euclidean and diffusion distances: the plots show that the curves of these two metrics are close to the Riemannian-like ones. The CBCL image data comprise face images that have a significantly lower spatial resolution than the face data in the ORL image base, while the structure of their parts is similarly complex. These characteristics are reflected in an apparent decrease of recognition rates for occluded images for both methods being compared. In general, the behavior of the recognition rates manifests in this case very low sensitivity to the choice of the sparseness parameters. None of the distances applied exhibits a unique prevalence.

4. Conclusions

In this paper, we have analyzed the influence of the matrix sparseness, controlled in NMF tasks via Hoyer's algorithm [7], from the viewpoint of object recognition efficiency. Special interest was devoted to partially occluded images, since images without occlusions can be handled similarly well by all NMF methods. Besides Hoyer's algorithm, we introduced a modified version of the NMF concept, modNMF, using a term containing the Moore-Penrose pseudoinverse of the basis matrix W instead of the coefficient matrix H. Among the discussed important theoretical advantages, this method provides the computational benefit that the subspace projections of the training images do not have to be calculated in an additional step after subspace generation. The novel concept was implemented with the sparseness modification mechanism of Nmfsc. A further goal of the paper was to analyze and compare the RR achieved for four different metrics used in the recognition tasks. As NMF subspace bases are nonorthogonal, distance measuring is a crucial aspect. The computer experiments were accomplished for three different image databases: ORL, USPS, and CBCL. In the classification tasks, we used the nearest neighbor method. In the unoccluded cases, the Riemannian-like distances dominate RR quality in maxima and stability over

all subspace dimensions and all parameter settings. ORL and USPS only differ slightly in the behavior of the Euclidean and diffusion distances. In the case of CBCL, small differences of RR are manifested between the cases using different distances. The conclusions related to the results for the occluded test images can be summarized as follows.

(1) The ability of NMF methods to solve recognition tasks depends on the kind of images used and on the databases as a whole. Independently of the method, the RR for USPS data are higher than those for ORL face data. This finding could be ascribed to the simpler structure of the digits (almost binary data, lower resolution, objects sparsely cover the image area). Moreover, USPS contains much larger classes (USPS: 2000 training images for only 10 classes; ORL: 222 images with only 5 training images per class), so that the interclass variations in USPS can be covered better. In general, the RR obtained for faces from the CBCL database are significantly worse than in comparable cases with ORL face images. We attribute these results to the poor resolution of the structured face image data.

(2) Contrary to the overall expectation, the Euclidean and diffusion distances showed better recognition performance for occluded test images in the case of the ORL data. As these distances do not take the subspace basis angles into account, this is a surprise. USPS data treated with Hoyer's Nmfsc method behave as expected: with increasing dimension and decreasing sW (i.e., increasing orthogonality, see Section 2.3), the RR measured with the Euclidean and diffusion distances decrease (almost) monotonically. On the other hand, using our modNMF method, the Euclidean and diffusion distances perform almost as well as the Riemannian-like metrics over all dimensions and sparseness values. This gives a hint that the relatively bad performance of these two metrics for the Nmfsc method cannot be ascribed entirely to the nonorthogonality of the bases, but rather to the use of the orthogonal projections of the training images (H_LS) instead of the well-approximating factor matrix H (V ≈ W·H) in the classification phase. Since we observed no differences between the RR for the original Riemannian distance and the ARC-distance, the proposed formula will need further exploration, likely introducing some kind of numerical emphasis of the added sparseness term, for example, an exponential one.

(3) Massive recognition experiments using the Nmfsc and modNMF algorithms, reported in our preliminary study [16], showed a minor influence of the sparseness parameter sH on recognition rates in cases of unoccluded as well as occluded images selected from the three mentioned image databases. Therefore, in the recognition experiments with occluded images included in this study, the sparseness parameter sH has not been controlled, and we experimented exclusively with the sparseness value sW of the NMF basis matrix. Namely, we used three representative values: sW = 0.1, 0.5, 0.9. As mentioned above, we applied two NMF methods, the conventional Nmfsc and our modified modNMF algorithm. Based on the analysis of the plots of RR for these methods and for images from the three image databases, given in Figures 8, 9, and 10, the following conclusions on the influence of the sparseness sW on RR can be drawn:


(i) ORL face images: Nmfsc method: the maximum RR have been achieved for sW = 0.5, the minimum RR for sW = 0.9; modNMF method: the maximum RR have been obtained for sW = 0.1, although the values of RR for sW = 0.5 were close to the maxima; the minimum values of the RR have been obtained for sW = 0.9;

(ii) CBCL face images: Nmfsc method: the maximum RR have been achieved for sW = 0.5, the minimum RR for sW = 0.1; modNMF method: the maximum RR have been obtained for sW = 0.1, although the values of RR for sW = 0.5 were, similarly to the case of ORL, also close to the maxima; the minimum values of the RR have been obtained for sW = 0.9;

(iii) USPS digit images: for both NMF methods compared, no significant influence of the sparseness parameter sW on RR was observed.

USPS performed better and followed the overall expectations better than ORL and CBCL. We basically ascribe this fact to the different training data situations: as mentioned in the first point above, the inter-class variations were covered much better for the USPS dataset than for the face images. The novel modNMF algorithm even improved the results achieved for the already well-performing USPS data set. The ARC-distance in its current form did not fulfill the expectations in the experiments. The significantly lower spatial resolution of the CBCL face data compared to the face data in the ORL image base is reflected in an apparent decrease of recognition rates for occluded images for both methods being compared. The various distances used for the CBCL database manifested little influence on RR.

Spratling [17] analyzed the methodological situation related to the concept of "part-based" representation of image data by NMF subspaces and pointed out the weaknesses of the application of this concept in the NMF framework. Inspired by Spratling's results, we have analyzed possibilities for further research on improving the NMF methodology using a revisited version of this concept that could be more attractive for object recognition tasks with occlusions. The research into this NMF version is in progress.

References

[1] D. D. Lee and H. S. Seung, "Learning the parts of objects by non-negative matrix factorization," Nature, vol. 401, no. 6755, pp. 788–791, 1999.

[2] D. D. Lee and H. S. Seung, "Algorithms for non-negative matrix factorization," in Advances in Neural Information Processing Systems 13, MIT Press, Cambridge, Mass, USA, 2001.

[3] P. Paatero and U. Tapper, "Positive matrix factorization: a non-negative factor model with optimal utilization of error estimates of data values," Environmetrics, vol. 5, no. 2, pp. 111–126, 1994.

[4] T. Feng, S. Z. Li, H.-Y. Shum, and H. Zhang, "Local non-negative matrix factorization as a visual representation," in Proceedings of the 2nd International Conference on Development and Learning (ICDL '02), pp. 178–183, Cambridge, Mass, USA, June 2002.

[5] S. Z. Li, X. W. Hou, H. J. Zhang, and Q. S. Cheng, "Learning spatially localized, parts-based representation," in Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR '01), vol. 1, pp. 207–212, Kauai, Hawaii, USA, December 2001.

[6] P. O. Hoyer, "Non-negative sparse coding," in Proceedings of the 12th IEEE Workshop on Neural Networks for Signal Processing (NNSP '02), pp. 557–565, Martigny, Switzerland, September 2002.

[7] P. O. Hoyer, "Nonnegative matrix factorization with sparseness constraints," Journal of Machine Learning Research, vol. 5, pp. 1457–1469, 2004.

[8] A. Pascual-Montano, J. M. Carazo, K. Kochi, D. Lehman, and R. D. Pascual-Marqui, "Nonsmooth non-negative matrix factorization (nsNMF)," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 28, no. 3, pp. 403–415, 2006.

[9] D. Guillamet and J. Vitria, "Evaluation of distance metrics for recognition based on non-negative matrix factorization," Pattern Recognition Letters, vol. 24, no. 9-10, pp. 1599–1605, 2003.

[10] H. Ling and K. Okada, "Diffusion distance for histogram comparison," in Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR '06), vol. 1, pp. 246–253, New York, NY, USA, June 2006.

[11] W. Liu and N. Zheng, "Non-negative matrix factorization based methods for object recognition," Pattern Recognition Letters, vol. 25, no. 8, pp. 893–897, 2004.

[12] W. Liu, N. Zheng, and X. Lu, "Nonnegative matrix factorization for visual coding," in Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP '03), vol. 3, pp. 293–296, Hong Kong, April 2003.

[13] Z. Yuan and E. Oja, "Projective nonnegative matrix factorization for image compression and feature extraction," in Proceedings of the 14th Scandinavian Conference on Image Analysis (SCIA '05), pp. 333–342, Joensuu, Finland, June 2005.

[14] C. Ding, T. Li, and M. I. Jordan, "Convex and semi-nonnegative matrix factorizations," Tech. Rep., Lawrence Berkeley National Laboratory, Berkeley, Calif, USA, 2006.

[15] I. Buciu, "Learning sparse non-negative features for object recognition," in Proceedings of the 3rd IEEE International Conference on Intelligent Computer Communication and Processing (ICCP '07), pp. 73–79, Cluj-Napoca, Romania, September 2007.

[16] I. Bajla and D. Soukup, "Non-negative matrix factorization: a study on influence of matrix sparseness and subspace distance metrics on image object recognition," in 8th International Conference on Quality Control by Artificial Vision, D. Fofi and F. Meriaudeau, Eds., vol. 6356 of Proceedings of SPIE, pp. 1–12, Le Creusot, France, May 2007.

[17] M. W. Spratling, "Learning image components for object recognition," Journal of Machine Learning Research, vol. 7, pp. 793–815, 2006.