2007 IEEE/WIC/ACM International Conference on Web Intelligence
Pairwise Constraints-Guided Non-negative Matrix Factorization for Document Clustering∗
Yu-Jiu Yang
Bao-Gang Hu
National Laboratory of Pattern Recognition
Institute of Automation, Chinese Academy of Sciences
Beijing Graduate School, Chinese Academy of Sciences
P.O. Box 2728, Beijing, 100080 China
{yjyang,hubg}@nlpr.ia.ac.cn
Abstract
Nonnegative Matrix Factorization (NMF) has proven effective in text mining. However, because NMF is an unsupervised component analysis technique, the existing method cannot exploit prior constraints that are beneficial to clustering or classification tasks. In this paper, we address the text clustering problem with a novel strategy called Pairwise Constraints-guided Non-negative Matrix Factorization (PCNMF for short). Unlike traditional NMF, the proposed method captures the pairwise constraints that are abundantly available in the original space and that are valuable for clustering and information retrieval, thereby enhancing the discriminative capability of the reduced space. With an appropriate transformation, PCNMF is cast as a new optimization problem that can be solved efficiently by an iterative approach, and the cluster membership of each document can be determined as easily as in standard NMF. Empirical studies on benchmark document corpora demonstrate appealing results.
1 Introduction
Fast-growing Internet data poses a big challenge for text mining. Among the many practical problems, topic extraction and document clustering are core issues in web information mining. For example, clustering search results by content saves users time in finding the relevant information they are interested in, and topic extraction plays an important role in improving search quality.
∗ This work is partially supported by Natural Science Foundation of
China under grant No.60073007 and No.60121302.
To represent the content of a document, one can use the Vector Space Model. In this model, each document d is considered a vector in the term space, and the weight of each word is computed by some weighting strategy (e.g., TF-IDF, TF, and so on). To mitigate polysemy, synonym noise, and the curse of dimensionality, a low-dimensional representation is desirable for both computational efficiency and storage. There are usually two types of methods to achieve this goal: feature extraction and feature selection. In this paper, we focus on the former, which carries more semantic information. Matrix factorization is a popular technique for holistic feature extraction in information retrieval and text categorization [10] [12] [13] [14] [17], with Latent Semantic Analysis (LSA) and Non-negative Matrix Factorization (NMF) as representative examples.
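For concreteness, the sketch below builds such a TF-IDF-weighted term-by-document matrix for a toy corpus; it only illustrates the weighting step described above (scikit-learn's TfidfVectorizer is assumed, and the documents are invented).

```python
# A minimal sketch of the Vector Space Model with TF-IDF weighting.
# The toy corpus is invented for illustration only.
from sklearn.feature_extraction.text import TfidfVectorizer

docs = [
    "web mining extracts topics from web documents",
    "matrix factorization helps document clustering",
    "clustering groups similar documents by topic",
]

vectorizer = TfidfVectorizer()          # TF-IDF weight for each term
D = vectorizer.fit_transform(docs)      # documents-by-terms sparse matrix
X = D.T.toarray()                       # term-by-document matrix, as used by NMF
print(X.shape)                          # (num_terms, num_documents)
```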
Compared with LSA, which directly transforms the term-occurrence matrix into the essential relationships between latent concepts and the terms or documents in a low-dimensional semantic space by means of linear algebra, NMF has proven effective and, moreover, offers a more reasonable interpretation of the latent topics [9] [10] [11] [12] [14] [17].
Non-negative matrix factorization (NMF) (X ≈ W H, W ≥ 0, H ≥ 0), which pursues a sparse representation in terms of non-negative bases, turns out to be an effective approach for discovering latent semantic factors. It imposes non-negativity constraints when learning the projection vectors: the elements of the projection vectors (the bases), together with the low-dimensional representations, are all non-negative. This ensures that the basis vectors are combined to reconstruct the term-by-document matrix in a purely non-subtractive manner.
However, the existing NMF has some limitations: (a) it is an unsupervised projection method, so for classification or clustering tasks the decomposition may miss useful discriminant information; (b) it is not capable of incorporating additional prior knowledge. It is therefore desirable to have a new NMF technique that provides a flexible framework for capturing semantically more meaningful latent factors and for enhancing discriminant power in classification tasks.
On the other hand, a great deal of pairwise information about document content is available on the Internet, such as user click records, webpage hyperlinks, and manual labels. Generally, these pairwise constraints are easy to obtain in practice and are beneficial to clustering or classification. To address the above problems, we propose in this paper a novel technique for pairwise constraints-guided nonnegative factorization.
The rest of this paper is organized as follows. Related work is introduced in Section 2. The main contribution is presented in Section 3: Section 3.1 describes the motivation of this work and defines the problem setting, and Section 3.2 elaborates our approach to non-negative subspace analysis under pairwise constraints. Section 4 presents some theoretical aspects of the algorithm and analyzes the convergence of PCNMF. Detailed experimental results on both classification and clustering tasks are presented in Section 5. Section 6 concludes with future work.
2 Related work

Document clustering is one of the fundamental issues in topic detection and tracking, document summarization, and content filtering.

Lee and Seung [13] first applied NMF to text documents and demonstrated its ability to tackle semantic issues such as synonymy. As a low-rank approximation method, NMF is a natural tool for clustering. Xu et al. [22] described clustering experiments with NMF, in which they claimed that NMF surpasses latent semantic indexing and spectral clustering methods in both accuracy and simplicity. Ding et al. [10] proposed a variant of NMF, called Orthogonal Nonnegative Matrix Tri-factorization, to achieve better performance for document clustering. An application to email surveillance was discussed in [3]. Cao et al. [5] applied NMF to topic tracking and document clustering in an on-line manner. Other related work on the algorithmic aspects of NMF and its application to text analysis is presented in [4]. An interesting discussion of the relationship between NMF and PLSI was published in [9], where the authors establish an equivalence under mild conditions.

Following Lee and Seung's work [13], many researchers have developed various extensions of NMF in different contexts. One direction is the pursuit of sparse and smooth representations in favor of semantic interpretation [12] [17] [16]; speed-up and approximation algorithms are another hot topic, and several alternative iteration methods have been proposed [3] [4]. In this paper, we focus on an NMF method with prior pairwise constraints for improving document clustering performance and the interpretability of the factors.

3 The Proposed Method

Throughout this paper, we use lowercase boldface letters to denote vectors and uppercase boldface letters to denote matrices, unless stated otherwise. A^+ indicates the Moore-Penrose pseudo-inverse of a matrix A, and Tr(A) denotes the trace of A.

3.1 Problem statement

In general, the traditional NMF problem can be formally expressed as follows [4]:

Problem 3.1 (The NMF problem). Given a nonnegative matrix X ∈ R^{m×n}_+ and a positive integer k < min{m, n}, find nonnegative matrices W ∈ R^{m×k}_+ and H ∈ R^{k×n}_+ that minimize the objective function

    F_1(W, H) := min_{W,H} (1/2) ||X − W H||²_F,   s.t.  W ≥ 0, H ≥ 0        (1)

where ||·||_F denotes the Frobenius norm.
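As a concrete illustration of Problem 3.1 (not part of the original paper), the following sketch factorizes a small random nonnegative matrix with scikit-learn's NMF solver, which minimizes the same Frobenius objective; the matrix sizes, the rank k, and the solver settings are arbitrary assumptions.

```python
# A minimal sketch of Problem 3.1: X ~ W H with W, H >= 0 under the Frobenius loss.
import numpy as np
from sklearn.decomposition import NMF

rng = np.random.default_rng(0)
X = rng.random((200, 50))               # toy term-by-document matrix (m x n)

k = 5                                    # number of latent topics, k < min(m, n)
model = NMF(n_components=k, init="nndsvda", max_iter=500, random_state=0)
W = model.fit_transform(X)               # m x k basis (topic) matrix
H = model.components_                    # k x n mixing matrix

err = 0.5 * np.linalg.norm(X - W @ H, "fro") ** 2   # objective F_1 of Eq. (1)
print(W.shape, H.shape, round(err, 3))
```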
Conceptually, W is often thought of as a basis matrix, and H as a mixing (factor) matrix associated with the data in X. Note that traditional NMF does not provide a unique solution: for any full-rank square matrix R with W R ≥ 0 and R⁻¹H ≥ 0, we have X = W R R⁻¹ H. A possible R is a rotation matrix, i.e., an orthogonal matrix with R^T R = I, where I denotes the identity matrix.
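A small numerical illustration of this non-uniqueness: taking R to be a permutation matrix (an orthogonal matrix that also preserves non-negativity), the pair (WR, R⁻¹H) factorizes the same X.

```python
# Non-uniqueness of NMF: for an orthogonal, nonnegativity-preserving R
# (here a permutation matrix), (W R, R^{-1} H) factorizes the same X.
import numpy as np

rng = np.random.default_rng(1)
W = rng.random((6, 3))
H = rng.random((3, 4))
X = W @ H

R = np.eye(3)[:, [2, 0, 1]]                 # permutation matrix, R^T R = I
W2, H2 = W @ R, R.T @ H                     # R^{-1} = R^T for an orthogonal R

print(np.allclose(X, W2 @ H2))              # True: same product
print((W2 >= 0).all() and (H2 >= 0).all())  # True: still nonnegative
```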
As mentioned in the previous section, many data analysis tasks benefit from domain knowledge such as class labels, pairwise constraints, or other forms of side information. In this paper, we consider only the pairwise constraints setting. Pairwise constraints specify whether a pair of data instances is similar (i.e., likely from the same class) or not, and they are commonly called must-link and cannot-link constraints. They arise naturally in many practical problems. For simplicity of presentation, let M be the set of unordered must-link pairs such that (x_i, x_j) ∈ M implies x_i and x_j should be assigned to the same class, and let C be the set of unordered cannot-link pairs such that (x_i, x_j) ∈ C implies x_i and x_j should be assigned to different classes.
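One simple way to form M and C from partially labeled data, along the lines described above, is sketched below; the helper name and the fraction of labeled documents are illustrative.

```python
# Build must-link (M) and cannot-link (C) pair sets from partially labeled data.
import numpy as np
from itertools import combinations

def build_constraints(labels, labeled_idx):
    """labels[i] is the class of document i; labeled_idx lists the known ones."""
    M, C = set(), set()
    for i, j in combinations(sorted(labeled_idx), 2):
        if labels[i] == labels[j]:
            M.add((i, j))                # same class -> must-link
        else:
            C.add((i, j))                # different class -> cannot-link
    return M, C

rng = np.random.default_rng(0)
labels = rng.integers(0, 3, size=20)                  # toy labels for 20 documents
labeled_idx = rng.choice(20, size=4, replace=False)   # e.g. 20% labeled
M, C = build_constraints(labels, labeled_idx)
print(len(M), len(C))
```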
Recall that the low-rank representation is given by H in Problem 3.1. It can also be viewed as the result of the projection P : X → H, where P = W^+ = (W^T W)⁻¹ W^T. For the must-link set M and cannot-link set C derived from the original space, we hope that the new representation in the transformed space preserves all pairwise constraint relationships, that is,

    h_i ∼ h_j  for (x_i, x_j) ∈ M,    h_i ≁ h_j  for (x_i, x_j) ∈ C        (2)

where ∼ denotes that the pair of points remains close, while ≁ denotes that the corresponding points remain separable.
Mathematically, we can define the pairwise constraints problem as follows:

Problem 3.2 (Pairwise Constraints Problem). Suppose the data matrix X = [x_1 | x_2 | ... | x_n], x_i ∈ X, has a low-dimensional representation H = [h_1 | h_2 | ... | h_n], h_i ∈ H. Given the must-link set M and cannot-link set C in the original space R^m, let d_M and d_C be two metrics that quantify the cost of violating must-link and cannot-link constraints, respectively. We aim to find an optimal representation in the transformed space H that minimizes the objective function

    F_2 = min  α d_M + (1 − α) d_C        (3)

where α, 0 ≤ α ≤ 1, is a constant that trades off the dissimilarity constraints against the similarity constraints. Under the Euclidean metric, d_M and d_C can be expressed as

    d_M = Σ_{(i,j)∈M} ||h_i − h_j||²_F,    d_C = − Σ_{(k,l)∈C} ||h_k − h_l||²_F        (4)
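A minimal sketch of the violation costs in Eq. 4, evaluated on the columns of a toy representation H; all names and values are illustrative.

```python
# Euclidean violation costs of Eq. (4): d_M over must-links, d_C over cannot-links.
import numpy as np

def pairwise_costs(H, M, C):
    """H is k x n; column i is the low-dimensional representation h_i."""
    d_M = sum(np.sum((H[:, i] - H[:, j]) ** 2) for (i, j) in M)
    d_C = -sum(np.sum((H[:, k] - H[:, l]) ** 2) for (k, l) in C)
    return d_M, d_C

rng = np.random.default_rng(0)
H = rng.random((3, 6))                       # toy representation of 6 documents
M, C = {(0, 1), (2, 3)}, {(0, 4), (1, 5)}
d_M, d_C = pairwise_costs(H, M, C)
print(round(d_M, 3), round(d_C, 3))          # Eq. (3) combines them as alpha*d_M + (1-alpha)*d_C
```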
The motivation behind the proposed approach is quite straightforward. Recall that the solution of Problem 3.1 is generally not unique. Given a solution set S of Problem 3.1, we attempt to find an optimal pair (W*, H*) ∈ S that satisfies the constraints in Eq. 3. The limitations of previous models lead us to consider a different framework for handling pairwise constraints within NMF. We evaluate the proposed approach on high-dimensional text clustering problems, and the experimental results are promising.
3.2 Pairwise Constraints-guided Nonnegative Matrix Factorization Model
In this section, we combine the above two problems in a unified framework. In fact, the idea of pairwise constraints is not new in the document clustering field; examples include the spherical k-means algorithm [7], semi-supervised k-means [19], and information-theoretic co-clustering [6]. Semi-supervised clustering [1] treats such prior information as supervision or side information [21]. However, how to incorporate pairwise constraints into the NMF framework is an interesting issue. Given a set of pairwise constraints, we aim to project the original data into a low-dimensional nonnegative space in which must-link instance pairs are close and cannot-link pairs are far apart.
Similar to local NMF [15] and Fisher NMF [20], we propose a formulation that leads to a particularly efficient factorization algorithm for a subsequent classification/clustering stage. To incorporate discriminant constraints into the NMF decomposition, we substitute the pairwise constraints for the locality constraints of LNMF. In this way, a modified objective derived from the original Problem 3.1 can be constructed, and the new constrained NMF problem can be expressed as:

    min_{W,H} (1/2) ||X − W H||²_F + λ (α d_M + (1 − α) d_C),   s.t.  W ≥ 0, H ≥ 0        (5)
where d_M and d_C are defined as in Eq. 4. From a parts-based perspective, the pairwise constraints strongly imply that instances (images/documents) of the same class share more components than instances of different classes. The intuition behind Eq. 5 is to make the total Euclidean distance between cannot-link pairs in the transformed low-dimensional space as large as possible, while keeping the same measure over must-link pairs as small as possible. This guarantees discriminant power in the transformed space.

Remark 3.1. The loss function of NMF can adopt a variety of divergences (e.g., KL divergence, φ-divergence, and so on). Without loss of generality, we only consider the Euclidean distance metric for the approximation error in this paper.
To attack the problem in Eq. 5, we concentrate on the properties of its second term F_2. For simplicity, we only consider the Euclidean measure for d_M and d_C. Note that F_2 acts as a regularization term over W or H and can be rewritten in a concise matrix form:
    F_2(H) = (α/2) Σ_{(i,j)∈M} ||h_i − h_j||²_F − ((1−α)/2) Σ_{(k,l)∈C} ||h_k − h_l||²_F
           = (1/2) Σ_{ij} ||h_i − h_j||² A_{ij}        (6)

or, equivalently,

    F_2(W) = (1/2) Σ_{ij} ||W^+ x_i − W^+ x_j||² A_{ij}        (7)

where

    A_{ij} = α if (x_i, x_j) ∈ M;  −(1 − α) if (x_i, x_j) ∈ C;  0 otherwise.

Note that Eq. 7 is equivalent to Eq. 6 under the assumption X ≈ W H; hence we only consider the former representation here. With simple algebraic manipulation, the objective function in Eq. 6 can be reduced to

    F_2 = (1/2) Σ_{ij} ||h_i − h_j||² A_{ij}
        = Σ_i h_i^T D_{ii} h_i − Σ_{ij} h_i^T A_{ij} h_j
        = Tr(H D H^T) − Tr(H A H^T)
        = Tr(H (D − A) H^T)
        = Tr(H L H^T)

where D is a diagonal matrix whose entries are the column (or row, since A is symmetric) sums of A, i.e., D_{ii} = Σ_j A_{ij}, and L = D − A is called the Laplacian matrix in spectral graph theory. Interestingly, the above problem is closely related to spectral clustering, a recently popular technique.

Remark 3.2. Notice that minimizing F_2(H) simultaneously minimizes the total distance among must-link pairs and maximizes the corresponding distance among cannot-link pairs. Although the Laplacian matrix L is symmetric, it need not be positive semidefinite, which differs from its definition in Laplacian Eigenmaps [2].

Differing from standard NMF, the constraints in the proposed approach strongly encourage data points (images/texts) of the same class to share more components than points of different classes, which is consistent with our intuition. We also notice that the second term F_2 indeed acts as a regularization term: the penalty prohibits wild decompositions and makes the solution more robust for classification tasks.

So far, the objective function of Eq. 5 can easily be rewritten as

    F = min_{W,H} Tr(X^T X) − 2 Tr(H^T (W^T X)) + Tr(H^T (W^T W H)) + λ Tr(H L H^T),
        s.t.  W ≥ 0, H ≥ 0        (8)

In the next section, we discuss how to solve this optimization problem.

4 Algorithm for PCNMF

In this section, we present a multiplicative algorithm for solving Eq. 8. Most NMF factorizations are computed by iterative algorithms; see [8] [14] for details.

4.1 Summary of the algorithm

For convenience of presentation, we first summarize the whole iteration framework of PCNMF as follows:

Algorithm 1 An Algorithm Framework for PCNMF
1. Initialize W^0 and/or H^0 with non-negative values, scale the columns of W to unit norm, and set t ← 0.
2. Fix W^t and solve the constrained problem min_{h_j} ||X_j − W H_j||²_2 + λ Tr(H L H^T); that is, update H according to Eq. 13, i.e., find H^{t+1} such that F(W^t, H^{t+1}) ≤ F(W^t, H^t).
3. Fix H^{t+1} and update W according to Eq. 14, finding W^{t+1} such that F(W^{t+1}, H^{t+1}) ≤ F(W^t, H^{t+1}).
4. Rescale the columns of W to unit norm.
5. Let t ← t + 1 and repeat Steps 2-4 until the convergence criteria are satisfied.

4.2 Justification of PCNMF algorithm

Firstly, we introduce the KKT conditions of the optimization problem Eq. 8 in the following lemma (we omit the proof details here; the interested reader may refer to a nonlinear programming textbook).

Lemma 4.1. Necessary conditions for (W, H) ∈ R^{m×k}_+ × R^{k×n}_+ to solve the nonnegative matrix factorization problem Eq. 8 are

    W ⊗ ((X − W H) H^T) = 0 ∈ R^{m×k}        (9)
    H ⊗ (W^T (X − W H) + λ H L) = 0 ∈ R^{k×n}        (10)
    (X − W H) H^T ≥ 0        (11)
    W^T (X − W H) + λ H L ≥ 0        (12)

where ⊗ denotes the Hadamard product.
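To make the constraint matrix A and the Laplacian L = D − A concrete, the sketch below builds them from toy constraint sets and checks numerically that (1/2) Σ_ij ||h_i − h_j||² A_ij equals Tr(H L H^T), the identity used in Eqs. 6-8; all sizes and names are illustrative.

```python
# Constraint matrix A, Laplacian L = D - A, and the identity F2 = Tr(H L H^T).
import numpy as np

def constraint_laplacian(n, M, C, alpha):
    A = np.zeros((n, n))
    for i, j in M:
        A[i, j] = A[j, i] = alpha            # must-link weight
    for i, j in C:
        A[i, j] = A[j, i] = -(1.0 - alpha)   # cannot-link weight
    D = np.diag(A.sum(axis=1))
    return A, D - A                          # L = D - A (not necessarily PSD)

rng = np.random.default_rng(0)
H = rng.random((3, 6))                       # k x n representation
M, C, alpha = {(0, 1), (2, 3)}, {(0, 4), (1, 5)}, 0.95
A, L = constraint_laplacian(6, M, C, alpha)

f2_pairs = 0.5 * sum(A[i, j] * np.sum((H[:, i] - H[:, j]) ** 2)
                     for i in range(6) for j in range(6))
f2_trace = np.trace(H @ L @ H.T)
print(np.isclose(f2_pairs, f2_trace))        # True
```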
Next, we summarize the detailed update rules in Theorem 4.2 under the KKT conditions:

Theorem 4.2. Under the condition that all entries of the matrix W^T W H + λ H L are kept nonnegative during the iterative procedure, where L is a symmetric square matrix, the regularized cost function Eq. 8 is monotonically non-increasing under the learning rules

    h_{jt} ← h_{jt} (W^T X)_{jt} / (W^T W H + λ H L)_{jt}        (13)

    w_{ij} ← w_{ij} (X H^T)_{ij} / (W H H^T)_{ij}        (14)

Recall that the predefined criterion is usually non-convex, which leads to many local optima, so the final solution of the iterative procedure depends heavily on the initialization and on the algorithm. To make the solution robust, we adopt the strategy proposed by Xu et al. [22] and normalize W, that is, rescale each column vector of W to unit norm. The cost function is invariant under these updates if and only if W and H are at a stationary point of the distance metric.

The proof can be obtained by constructing an auxiliary function in the EM-like setup [14]; because of space limitations, we omit it here.

Generally, we use a fixed number of iterations as a secondary stopping criterion; however, the choice of the maximum number of iterations is problem-dependent.
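The following sketch implements Algorithm 1 with the multiplicative rules of Eqs. 13 and 14. It is an illustrative re-implementation, not the authors' code: the small epsilon in the denominators, the clamping of the H-denominator to nonnegative values (reflecting the condition of Theorem 4.2), and the compensating rescaling of H in Step 4 (so that the product W H is unchanged) are our own assumptions.

```python
# Sketch of Algorithm 1 (PCNMF) with the multiplicative updates of Eqs. (13)-(14).
import numpy as np

def pcnmf(X, L, k, lam=0.1, n_iter=200, eps=1e-9, seed=0):
    """X: m x n nonnegative data, L: n x n constraint Laplacian, k: target rank."""
    rng = np.random.default_rng(seed)
    m, n = X.shape
    W = rng.random((m, k))
    H = rng.random((k, n))
    W /= np.linalg.norm(W, axis=0, keepdims=True) + eps   # Step 1: unit-norm columns

    for _ in range(n_iter):
        # Eq. (13): H update; denominator clamped to stay nonnegative (Theorem 4.2)
        denom_H = np.maximum(W.T @ W @ H + lam * (H @ L), 0.0) + eps
        H *= (W.T @ X) / denom_H
        # Eq. (14): W update
        W *= (X @ H.T) / (W @ H @ H.T + eps)
        # Step 4: rescale columns of W to unit norm; H is scaled inversely so that
        # the product W H is unchanged (our assumption, following common practice)
        scale = np.linalg.norm(W, axis=0, keepdims=True) + eps
        W /= scale
        H *= scale.T
    return W, H

rng = np.random.default_rng(1)
X = rng.random((100, 30))
L = np.zeros((30, 30))        # no constraints: the rules reduce to plain NMF updates
W, H = pcnmf(X, L, k=4)
print(round(float(np.linalg.norm(X - W @ H, "fro")), 3))
```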
4.3 Convergence of the PCNMF algorithm

The update rules are derived by a flip-flop scheme, that is, fixing one variable and alternately updating the other. The convergence proof is based on the auxiliary function technique [14] and is only sketched here. G(θ, θ^t) is an auxiliary function for the cost function F(θ) (to be minimized) if G(θ, θ^t) ≥ F(θ) and G(θ, θ) = F(θ) hold. Updating the parameter by θ^{t+1} = arg min_θ G(θ, θ^t) then yields monotonic non-increase, since F(θ^{t+1}) ≤ G(θ^{t+1}, θ^t) ≤ G(θ^t, θ^t) = F(θ^t).

Figure 1 shows that the final quadratic cost converges in a practical iterative procedure; of course, the convergence rate depends on the particular iterative algorithm.

[Figure 1. Convergence curves (objective value versus iteration number) for (a) NMF and (b) PCNMF.]
4.4 Advantages of PCNMF

With plain NMF, an unstable solution may lead to wild factors that are unsuitable for discriminating document categories. In contrast, PCNMF encourages documents of the same class to share more topical similarity while keeping different classes separable in the projected basis space. Although PCNMF does not offer a unique solution either, it boosts the discriminative ability of the reduced space, which makes it more suitable for document clustering/classification than NMF.

In addition, pairwise constraints are a more ubiquitous type of knowledge than class labels: constraints can always be generated from labeled data, but not the other way around. Moreover, when there is not enough labeled data to apply supervised learning methods, a clustering approach supervised by constraints derived from incomplete class information is more useful.
Again, pairwise constraints are more natural in some scenarios and easier to collect than class labels. In some application domains the class information changes dynamically. For example, in a dynamic topic-tracking system where topics change over time, we will certainly encounter new topics never seen before, yet we can still collect pairwise constraints on a newly emerged category. It is difficult for existing classification models to detect a new topic, but PCNMF helps in this situation.
5 Experiments

In this section, we present extensive text clustering experiments to evaluate the effectiveness of our algorithm and compare it with state-of-the-art NMF algorithms on a variety of datasets.
5.1 Text Corpora
All datasets were obtained from the Internet, and most of them are frequently used as information retrieval benchmarks. The characteristics of all datasets are summarized in Table 1. Some of them are described in more detail below:
20 Newsgroups¹ The 20 Newsgroups dataset contains 18,828 documents belonging to 20 newsgroup classes. The news headers have been removed from the data, making classification harder. Eliminating the minor categories that contain fewer than 1000 documents, we select the top 20 categories to work with. For features, we choose the top 100 words by information gain with respect to the class labels.

¹Available at http://people.casil.mit.edu/∼jrennie/20Newsgroups/

Tr11/12/23/31/41/45² The datasets Tr11, Tr12, Tr23, Tr31, Tr41 and Tr45 are derived from the TREC-5, TREC-6 and TREC-7 collections (http://trec.nist.gov). These corpora have been considered ideal test sets for document clustering. Detailed statistics of the datasets are given in Table 1.

²Available at http://www.cse.fau.edu/∼zhong/pubs.htm

Table 1. Statistics of the datasets used to evaluate the clustering algorithms. n̄_c denotes the average number of samples per class.

Dataset   #Data   #Feature   #Class   n̄_c
NG20      1000    1000       20       50
Tr11      414     6429       9        46
Tr12      313     5799       8        39
Tr23      204     5832       6        34
Tr31      927     10127      7        132
Tr41      878     7454       10       88
Tr45      690     8261       10       69
5.2 Experimental Setting
First, we describe how we generate pairwise constraints and how we process them prior to the clustering task. Since external class labels are available in our benchmark datasets, we randomly select pairs of distinct instances and create must-link or cannot-link constraints depending on whether the class labels of the two instances are the same or different. For simplicity, we set the labeled ratio to 20% and derive all constraints from the labeled data.

We then perform clustering experiments on each benchmark dataset, including all pairwise constraint relationships. We compare our PCNMF algorithm in the transformed space with the traditional NMF and with LSI in the corresponding space, the latter serving as the baseline. For PCNMF and NMF, instead of applying a traditional clustering method such as K-means to H, we follow Xu's strategy [22] and use the matrix H to determine the cluster label of each data point: document d_i is assigned to cluster c if c = arg max_j H_{ji}. For LSI, we still use K-means for clustering. In all experiments, α defaults to 0.95, and 10 trials are performed in each test run.
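The cluster-assignment rule above amounts to an arg max over the rows of H for each document column; a minimal sketch with toy values:

```python
# Xu's strategy: assign document i to the row (latent topic) with the largest
# entry in column i of the k x n factor matrix H.
import numpy as np

rng = np.random.default_rng(0)
H = rng.random((5, 12))                  # toy k x n factor matrix from (PC)NMF

cluster_labels = H.argmax(axis=0)        # one cluster index per document (column)
print(cluster_labels)                    # length-12 array with values in 0..4
```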
5.3 Performance Measures

To evaluate our algorithms, we use the Adjusted Rand Index (ARI) [9] and Normalized Mutual Information (NMI) [18] as performance metrics. Both are objective indices for cluster analysis.
Adjusted Rand Index The Rand Index is defined as the number of pairs of objects that are either in the same cluster and the same class, or in different clusters and different classes, divided by the total number of pairs. The Adjusted Rand Index rescales the Rand Index to correct for chance agreement. The higher the Adjusted Rand Index, the greater the consistency between the clustering result and the labels.
Normalized Mutual Information (NMI) [18] is another index often used to estimate the quality of clusters. For two random variables X and Y, the NMI is defined as

    NMI(X, Y) = I(X, Y) / sqrt(H(X) H(Y))        (15)

where I(X, Y) is the mutual information between X and Y, while H(X) and H(Y) are the entropies of X and Y, respectively. One can see that NMI(X, X) = 1, which is the maximal possible value of NMI. Given a clustering result, the NMI in Eq. 15 is estimated as

    NMI = [ Σ_{l=1}^{c} Σ_{i=1}^{c} n_{l,i} log( n·n_{l,i} / (n_l n̂_i) ) ] / sqrt( ( Σ_{l=1}^{c} n_l log(n_l/n) ) ( Σ_{i=1}^{c} n̂_i log(n̂_i/n) ) )        (16)

where n_l denotes the number of data points in cluster C_l (1 ≤ l ≤ c), n̂_i is the number of data points belonging to the i-th class (1 ≤ i ≤ c), and n_{l,i} denotes the number of data points in the intersection of cluster C_l and the i-th class. The larger this value, the better the performance.
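Both metrics can be computed directly from predicted and true label vectors; the sketch below uses scikit-learn's implementations with the geometric normalization of Eq. 15, on toy labels.

```python
# Clustering quality: NMI (geometric normalization, as in Eq. 15) and ARI.
from sklearn.metrics import adjusted_rand_score, normalized_mutual_info_score

true_labels = [0, 0, 1, 1, 2, 2, 2]          # toy ground-truth classes
pred_labels = [1, 1, 0, 0, 2, 2, 0]          # toy clustering result

nmi = normalized_mutual_info_score(true_labels, pred_labels,
                                   average_method="geometric")
ari = adjusted_rand_score(true_labels, pred_labels)
print(round(nmi, 3), round(ari, 3))
```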
5.4 Experimental results

Table 2 shows the Adjusted Rand Index and NMI measurements on the different datasets for the proposed approach, the traditional NMF, and K-means+LSI. From Table 2, we observe that the proposed PCNMF significantly improves on the performance of standard NMF; boldface indicates the best performance in each comparison.

We also observe that the performance ranking is almost always PCNMF, NMF, K-means+LSI in descending order, regardless of the document corpus. K-means+LSI is generally more stable because the LSI solution is unique.

In summary, the comparison shows that PCNMF is viable and competitive for document clustering, especially considering that PCNMF performs document clustering and topic extraction simultaneously, while K-means performs document clustering only.
Table 2. Performance comparison of the clustering algorithms. Each entry is the performance of the corresponding algorithm on the row dataset; the labeled-data percentage is 20%.

                          NMI                                          ARI
Dataset   K-means+LSI    NMF            PCNMF          K-means+LSI    NMF            PCNMF
NG20      .123 ± .003    .518 ± .047    .589 ± .036    .124 ± .002    .279 ± .036    .316 ± .028
Tr11      .097 ± .004    .408 ± .050    .434 ± .025    .006 ± .000    .249 ± .056    .287 ± .044
Tr12      .079 ± .005    .205 ± .032    .227 ± .021    .007 ± .002    .109 ± .023    .151 ± .021
Tr23      .069 ± .011    .137 ± .025    .148 ± .040    .038 ± .001    .069 ± .018    .087 ± .031
Tr31      .125 ± .002    .210 ± .037    .220 ± .036    .039 ± .001    .131 ± .005    .135 ± .054
Tr41      .275 ± .008    .355 ± .041    .379 ± .049    .147 ± .012    .199 ± .052    .207 ± .062
Tr45      .121 ± .003    .256 ± .022    .259 ± .027    .003 ± .001    .128 ± .032    .132 ± .038
6 Conclusion and future work

In this paper, we systematically investigated the properties of PCNMF, whose advantage is reflected in the intrinsic factors of the data. The presented approach provides a flexible framework for incorporating useful constraint information and a straightforward way to exploit locality structure in the transformed space. The low-dimensional representation aims to avoid the overlap and ambiguity of the original space and to remain feasible for large-scale (time-varying) datasets.

In future work, we plan to extend PCNMF with sparsity constraints and a kernelized version for nonlinear transformations. In addition, tensors are a generalization of matrices to higher-dimensional arrays, and they could be analyzed with multilinear algebra in a PCNMF setting. We also note that guaranteeing the uniqueness of the solution for the NMF family of problems remains an open problem.
References
[1] S. Basu, A. Banerjee, and R. J. Mooney. Semi-supervised
clustering by seeding. In ICML, pages 27–34, 2002.
[2] M. Belkin and P. Niyogi. Laplacian eigenmaps for dimensionality reduction and data representation. Neural Computation, 15:1373–1396, 2003.
[3] M. W. Berry and M. Browne. Email surveillance using nonnegative matrix factorization. Comput. Math. Organ. Theory, 11(3):249–264, 2005.
[4] M. W. Berry, M. Browne, A. N. Langville, P. V. Pauca, and
R. J. Plemmons. Algorithms and applications for approximate nonnegative matrix factorization. 2006 (To appear ).
[5] B. Cao, D. Shen, J.-T. Sun, X. Wang, Q. Yang, and Z. Chen.
Detect and track latent factors with online nonnegative matrix factorization. In IJCAI, pages 2689–2694, 2007.
[6] I. S. Dhillon, S. Mallela, and D. S. Modha. Information-theoretic co-clustering. In KDD, pages 89–98, 2003.
[7] I. S. Dhillon and D. S. Modha. Concept decompositions for
large sparse text data using clustering. Machine Learning,
42(1/2):143–175, 2001.
[8] I. S. Dhillon and S. Sra. Generalized nonnegative matrix approximations with Bregman divergences. In NIPS, 2005.
[9] C. Ding, T. Li, and W. Peng. Nonnegative matrix factorization and probabilistic latent semantic indexing: equivalence, chi-square statistic, and a hybrid method. In AAAI, 2006.
[10] C. Ding, T. Li, W. Peng, and H. Park. Orthogonal nonnegative matrix tri-factorizations for clustering. In KDD, pages
126–135, 2006.
[11] M. Heiler and C. Schnörr. Learning non-negative sparse image codes by convex programming. In ICCV, 2005.
[12] P. O. Hoyer. Non-negative matrix factorization with sparseness constraints. Journal of Machine Learning Research,
5:1457–1469, 2004.
[13] D. D. Lee and H. S. Seung. Learning the parts of objects by non-negative matrix factorization. Nature, 401:788–791, October 1999.
[14] D. D. Lee and H. S. Seung. Algorithms for non-negative
matrix factorization. In NIPS, pages 556–562, 2000.
[15] S. Z. Li, X. Hou, H. Zhang, and Q. Cheng. Learning spatially localized, parts-based representation. In CVPR (1),
pages 207–212, 2001.
[16] C.-J. Lin. Projected gradient methods for non-negative matrix factorization. Neural Computation, 2007. To appear.
[17] A. D. Pascual-Montano, J. M. Carazo, K. Kochi, D. Lehmann, and R. D. Pascual-Marqui. Nonsmooth nonnegative matrix factorization (nsNMF). IEEE Trans. on PAMI, 28(3):403–415, 2006.
[18] A. Strehl and J. Ghosh. Cluster ensembles — a knowledge
reuse framework for combining multiple partitions. Journal
of Machine Learning Research, 3:583–617, 2002.
[19] K. Wagstaff, C. Cardie, S. Rogers, and S. Schrödl. Constrained k-means clustering with background knowledge. In
ICML, pages 577–584, 2001.
[20] Y. Wang, Y. Jia, C. Hu, and M. Turk. Fisher non-negative
matrix factorization for face recognition. In Asian Conf. on
Comp. Vision, 2004.
[21] E. P. Xing, A. Y. Ng, M. I. Jordan, and S. J. Russell. Distance metric learning with application to clustering with
side-information. In NIPS, pages 505–512, 2002.
[22] W. Xu, X. Liu, and Y. Gong. Document clustering based on
non-negative matrix factorization. In SIGIR, pages 267–273,
2003.