
Spectral Clustering with Links and Attributes

Jennifer Neville
Department of Computer Science
University of Massachusetts
Amherst, MA 01003
[email protected]

Micah Adler
Department of Computer Science
University of Massachusetts
Amherst, MA 01003
[email protected]

David Jensen
Department of Computer Science
University of Massachusetts
Amherst, MA 01003
[email protected]

ABSTRACT
If relational data contain communities—groups of inter-related items with similar attribute values—a clustering technique that considers attribute information and the structure of relations simultaneously should produce more meaningful clusters than those produced by considering attributes alone. We investigate this hypothesis in the context of a spectral graph partitioning technique, considering a number of hybrid similarity metrics that combine both sources of information. Through simulation, we find that two of the hybrid metrics achieve superior performance over a wide range of data characteristics. We analyze the spectral decomposition algorithm from a statistical perspective and show that the successful hybrid metrics exaggerate the separation between cluster similarity values, at the expense of increased variance. We cluster several relational datasets using the best hybrid metric and show that the resulting clusters exhibit significant community structure, and that they significantly improve performance in a related classification task.

Categories and Subject Descriptors
I.5.3 [Clustering]: Pattern Recognition

Keywords
Clustering, Relational Learning, Spectral Analysis

1. INTRODUCTION
Spectral clustering techniques, which partition data into disjoint clusters using the eigenstructure of a similarity matrix, have been successfully applied in a number of domains, including image segmentation [19] and document clustering [5]. Finding an optimal partition is in general NP-complete, but the eigenvectors of the matrix provide some information that can be used to guide an approximate solution. Experimental evidence has shown this heuristic approach often works well in practice and has prompted further investigation into the properties of spectral clustering. Recent findings—facilitated by a long history of work in spectral graph theory (e.g., [2])—include a connection to random walks [13] and preliminary performance analysis [10, 16]. In this paper, we investigate methods of adapting spectral clustering techniques to relational domains.

The goal of this work is to find communities in relational data represented as an attributed graph G = (V, E, X), where the nodes V represent objects in the data (e.g., genes), the edges E represent relations among the objects (e.g., interactions), and the attributes X record data about each object (e.g., localization). Community clusters identify groups of objects that have similar attributes and are also highly inter-related. For example, in genomic data, a group of genes with similar attributes and many common interactions may all be involved in a similar function in the cell. The underlying assumption is that there is a latent cluster variable that influences both the attribute values intrinsic to objects and the relationships among objects. In particular, objects are more likely to link to other objects in the same cluster than to objects in other clusters, and pairs of objects within a cluster are more likely to have similar attribute values than pairs spanning different clusters. A clustering algorithm that examines both link structure and attributes simultaneously should be more robust to noise than methods examining attribute or link information in isolation.

There has been little work applying spectral techniques to relational domains with a combination of link and attribute information. Existing techniques use either: (1) a complete graph, where attribute similarity is calculated for all n × n pairs of objects (e.g., [16]), or (2) a nearest neighbor graph, where attribute similarity is calculated for n × d pairs of objects—each object is connected to a fixed number (d) of other objects determined by spatial locality (e.g., [19]). Our work differs in that we incorporate the heterogeneous relational structure into the similarity metric.

The similarity metric, used to populate the similarity matrix, provides a means to extend spectral techniques to new domains. However, the success of spectral clustering techniques depends heavily on the choice of metric. There has been some research into learning the correct similarity function from labeled data (e.g., [1]), but for domains where the correct clustering is unknown, metric design has been approached in a relatively ad hoc manner. This leaves us with little guidance as to how to incorporate link and attribute information into a metric for relational domains. This work investigates the design of similarity metrics that incorporate multiple sources of information and identifies the characteristics that underlie successful metrics.

Specifically, we analyze the normalized cut (NCut) spectral partitioning algorithm [19] from a statistical perspective. For the special case of bi-partitioning, we show that as cluster size → ∞, the spectral decomposition will include an eigenvector that is piecewise constant, with respect to the clusters, for any similarity metric where the average intra-cluster similarity differs from the average inter-cluster similarity. If the eigenvector associated with the 2nd smallest eigenvalue of the similarity matrix is piecewise constant, the spectral partitioning will be exact [19]. Next, we empirically evaluate the effect of finite cluster sizes using synthetic data. We show that: (1) decreasing the variance of cluster similarities, and increasing the separation of similarities, both improve the ordering of the eigenvector with respect to the clusters, and (2) increasing the separation of cluster similarities has a greater impact on algorithm performance when the NCut objective function is used. This indicates that a metric that increases variance in order to better separate the cluster similarities will perform better over a wider range of conditions. Based on these results, we propose a hybrid similarity metric for relational data that incorporates link and attribute information, and we evaluate performance on several relational datasets. We show that the resulting clusters exhibit significant community structure and demonstrate significant performance gains when using the resulting clusters in a related classification task.

2. SPECTRAL CLUSTERING
Spectral clustering originated with graph partitioning techniques that exploit the connection between eigenvectors and algebraic properties of a graph (e.g., [6, 7]). Recently, Shi and Malik [19] presented a new clustering algorithm that uses spectral partitioning to optimize the NCut objective function. We investigate the application of this algorithm to relational domains through the use of similarity metrics that incorporate link and attribute information.

The NCut algorithm of [19] clusters a dataset through eigenvalue decomposition of a similarity matrix. It is a divisive, hierarchical clustering algorithm that takes a graph G = (V, E), a set of k attributes X = {X^1, ..., X^k}, where X^k = {x^k_i : v_i ∈ V}, and a similarity function S, where S(i, j) defines the similarity between v_i, v_j ∈ V, and recursively partitions the graph as follows:

Let W_{N×N} = [S(i, j)] be the similarity matrix and let D be an N × N diagonal matrix with d_i = Σ_{j∈V} S(i, j). Solve the eigensystem (D − W)x = λDx for the eigenvector x1 associated with the 2nd smallest eigenvalue λ1. Consider m uniformly spaced values between the minimum and maximum value in x1. For each such value t: bipartition the nodes into (A, B) such that A ∩ B = ∅, A ∪ B = V, and x1_a < t for all v_a ∈ A, and calculate the NCut value for the partition,

    NCut(A, B) = Σ_{i∈A, j∈B} S(i, j) / Σ_{i∈A} d_i + Σ_{i∈A, j∈B} S(i, j) / Σ_{j∈B} d_j.

Partition the graph into the (A, B) with minimum NCut. If stability(A, B) ≤ c, recursively repartition A and B.¹

¹We use the stability threshold proposed in [19], where the stability value is the ratio of the minimum and maximum bin sizes after the values of x1 are binned by value into m bins. All the experiments in this paper used the settings m = ⌈log2(N) + 1⌉ and c = 0.06. Sensitivity analysis on synthetic data shows c = 0.06 to be a conservative threshold, returning clusters with high precision but low recall.

It takes O(n^3) operations to solve for all eigenvalues of an arbitrary eigensystem. However, O(|E|) approximate algorithms exist [10], and if the weight matrix is sparse, O(n^1.4) Lanczos algorithms can be used to compute the solution [18]—for this reason, similarity metrics that produce sparse matrices are preferable.
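As a concrete illustration, the following is a minimal sketch (ours, not the authors' implementation) of one level of the bipartitioning step described above, using dense NumPy/SciPy routines. It assumes a symmetric similarity matrix with strictly positive row sums, and it omits the stability check and the recursion.

```python
import numpy as np
from scipy.linalg import eigh

def ncut_value(W, d, mask):
    """NCut(A, B) = cut(A, B)/assoc(A, V) + cut(A, B)/assoc(B, V)."""
    cut = W[np.ix_(mask, ~mask)].sum()
    return cut / d[mask].sum() + cut / d[~mask].sum()

def ncut_bipartition(W, m):
    """One level of NCut bipartitioning.

    W : (N, N) symmetric similarity matrix with positive row sums.
    m : number of uniformly spaced thresholds to try on x1.
    Returns a boolean mask (True = side A) and the best NCut value.
    """
    d = W.sum(axis=1)
    D = np.diag(d)
    # Solve (D - W) x = lambda D x; eigenvalues are returned in ascending order.
    vals, vecs = eigh(D - W, D)
    x1 = vecs[:, 1]                      # eigenvector of the 2nd smallest eigenvalue
    best_mask, best_ncut = None, np.inf
    for t in np.linspace(x1.min(), x1.max(), m):
        mask = x1 < t
        if mask.all() or (~mask).all():  # skip degenerate splits
            continue
        nc = ncut_value(W, d, mask)
        if nc < best_ncut:
            best_mask, best_ncut = mask, nc
    return best_mask, best_ncut
```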

Our hybrid metrics calculate the similarity between objects i and j through a weighted combination of attribute and link information: S(i, j) = α · (1/k) Σ_k s_k(i, j) + (1 − α) · l, where s_k(i, j) = 1 if x^k_i = x^k_j and 0 otherwise, and l = 1 if e_ij ∈ E or e_ji ∈ E, and 0 otherwise.

When α = 1, we refer to the metric as AttrOnly. When α = 0, we refer to the metric as LinkOnly. These metrics are included as baselines—one for data clustering techniques that ignore link information, and the other for graph partitioning techniques that ignore attribute information. AttrOnly calculates similarity by counting the number of attribute values objects i and j have in common (scaled by k so the maximum similarity is 1). LinkOnly uses the relational structure as a measure of similarity.

When α = k/(k+1), we refer to the metric as LinkAsAttr. This approach is an obvious way to include relational information—links are incorporated as a match on the (k+1)th attribute. With no prior domain knowledge, we have no reason to expect that link structure contains more information than attribute values. However, link structure is often central in relational domains—for example, in a graph of hyperlinked web documents, we expect a link to confer more information about topic clustering than a match on a single word for two pages. To better exploit the relational information, we set α = 1/2. This metric, referred to as WtLinkAttr1, combines the link and attribute information uniformly—high similarity indicates that two objects are related or have a number of attribute values in common.

In sparse relational graphs, the expected intra-cluster link similarity will be less than one, even if the links are perfectly correlated with cluster membership. In this case, if the link and attribute information are combined uniformly (e.g., WtLinkAttr1), or if the attributes are given proportionally more weight (e.g., LinkAsAttr), noise in the attributes can drown out a strong link signal. An approach that gives the link information proportionally more weight (e.g., α < 1/2) may achieve better performance. In practice we will not know how to scale the link information to combine the two sources of information equally. However, for the synthetic experiments discussed in the next section, we know the maximum edge probability is 0.2, so setting α = 1/6 equalizes the attribute and link signals. When α = 1/6, we refer to the metric as WtLinkAttr2. Although we will not know the scaling factor in practice, we include this metric to test the conjecture that the poor performance of WtLinkAttr1 is due to the relatively weak link signal being combined uniformly with the attribute signal.

When α = l, we refer to the metric as LinkAsFilter. It calculates similarity by weighting the existing edges of G with the AttrOnly metric. Objects that are not directly related have a similarity of 0 regardless of their attribute values. A high similarity score indicates that two objects are related and have a number of attribute values in common. This approach incorporates both sources of information while maintaining the sparsity of the relational data graph, so the algorithm can use efficient eigensolver techniques.
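For illustration, the sketch below (our code; the helper names are our own) builds the similarity matrices described in this section from a binary adjacency matrix A and a matrix X of discrete attribute values. Setting alpha to 1, 0, k/(k+1), 1/2, or 1/6 in hybrid_similarity yields AttrOnly, LinkOnly, LinkAsAttr, WtLinkAttr1, and WtLinkAttr2 respectively, while link_as_filter implements the edge-filtered variant.

```python
import numpy as np

def attribute_similarity(X):
    """Fraction of the k attributes on which each pair of objects agrees."""
    n, k = X.shape
    S = np.zeros((n, n))
    for a in range(k):
        S += (X[:, a][:, None] == X[:, a][None, :])
    return S / k

def hybrid_similarity(X, A, alpha):
    """S(i, j) = alpha * attr_sim(i, j) + (1 - alpha) * l(i, j)."""
    attr = attribute_similarity(X)
    link = ((A + A.T) > 0).astype(float)   # l = 1 if an edge exists in either direction
    return alpha * attr + (1.0 - alpha) * link

def link_as_filter(X, A):
    """LinkAsFilter: attribute similarity restricted to existing edges (alpha = l)."""
    link = ((A + A.T) > 0).astype(float)
    return link * attribute_similarity(X)
```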

3. ALGORITHM ANALYSIS
The recursive nature of the algorithm complicates analysis of higher-order partitioning, so we restrict our attention to the (simpler) case of a single bipartitioning of the graph. Finding an optimal partition, which minimizes the NCut criterion, is an NP-hard problem [19]. However, [19] shows that when there is a partition (A, B) of V such that the 2nd smallest eigenvector x1 of the eigensystem (D − W)x = λDx is piecewise constant with respect to (A, B)—that is, x1_i = α for i ∈ A and x1_i = β for i ∈ B, with β ≠ α—then (A, B) is the optimal partition: it minimizes the NCut criterion and λ1 = NCut.

Recent analysis has focused on achieving a more thorough understanding of the conditions under which x1 will be piecewise constant. Meila and Shi [13] outline a set of conditions under which the spectral algorithm will return an exact partitioning, showing that the spectral problem formulated for NCut is equivalent to the eigenvectors/values of the stochastic matrix P = D⁻¹W. The authors connect spectral clustering to Markov random walks, showing that P will have an eigenvector that is piecewise constant w.r.t. a partition (A1, A2) iff P is block-stochastic w.r.t. (A1, A2). Here, block-stochastic means that the underlying Markov random walk can be viewed as a Markov chain with state space ∆ = (A1, A2) and transition probability matrix R = [P_ss'] for s, s' = 1, 2, where for each s, s' the quantity Σ_{j∈A_s'} P_ij is constant over all i ∈ A_s, and P_ss' = Σ_{j∈A_s'} P_ij for any i ∈ A_s. This shows that spectral clustering groups nodes based on the similarity of their transition probabilities to subsets of the graph.

There has been little analysis of the impact of non-constant transition probabilities on algorithm performance. Empirical evidence indicates that the algorithm finds good partitions even when the transition probabilities are far from constant. Ideally, we would like to characterize the conditions necessary for optimal performance and bound algorithm performance otherwise. As a first step, we analyze asymptotic performance for non-constant intra- and inter-cluster transition probabilities.

If we assume a generative model of the data where a latent cluster variable (A1, A2) determines the attribute values intrinsic to the objects and the relationships among objects, we can analyze the similarity metric S(i, j), and each entry in W, as a random variable. Consider the entries of row i. The entries W_ij, W_ik are not independent because the similarity values are both based on node i. However, conditioned on the state of i (e.g., the attribute values of i), the entries are independent random variables, since the state of j is independent of the state of k. As a result, the entries of row i can be viewed as independent random variables. With this model we can show that any similarity metric will produce piecewise constant eigenvectors in the limit.

Theorem: Let ∆ = (A1, A2) be a partition of V. Let the function S(i, j) define the similarity measure between v_i, v_j ∈ V. If, for all i, j, k, S(i, j) is conditionally independent of S(i, k) given node i, and E[P11]E[P22] ≠ E[P12]E[P21], then P has an eigenvector that converges to piecewise constant w.r.t. ∆ as |A1|, |A2| → ∞.

We provide the intuition for the proof here and refer the reader to Appendix A for details. If we view the entries of W as random variables, the normalized values in P are also random variables (i.e., the entries in W divided by a row sum of random variables). The total intra- and inter-cluster transition probabilities in P (e.g., Σ_{j∈A_s'} P_ij) then correspond to the ratio of two sums of random variables. Since the transition probabilities are composed of sums of independent random variables, as cluster size → ∞, the intra- and inter-cluster transition probabilities will converge to the same value for all nodes in each cluster. Therefore an eigenvector of the similarity matrix will converge to piecewise constant w.r.t. (A1, A2), provided the intra- and inter-cluster means (e.g., E[P11], E[P12]) are distinguishable.

This analysis indicates that all metrics will perform equally well in the limit. We expect, however, that finite-sample performance will vary based on the characteristics of the metrics. In particular, we expect that performance will be influenced by the mean and variance of the intra- and inter-cluster transition probabilities. We demonstrate the impact of the transition probability distributions below, using synthetic data experiments.

4. SYNTHETIC DATA EXPERIMENTS
In order to identify the situations where we can expect each of the similarity metrics to perform well, we evaluate algorithm performance on synthetic data sets for which the correct clustering is known. This facilitates analysis over a wide range of conditions.

4.1 Synthetic Data
Our synthetic data sets are undirected, connected graphs G = (V, E) where nodes correspond to objects and edges correspond to relations among objects. Unless otherwise indicated, |V| = 200. A binary label, C = {+, −}, is used to represent cluster membership; labels are assigned randomly to each object with P(+) = 0.5. Each object has five binary attributes, where the attribute values are assigned randomly given the object's cluster label. Edges are added to the graph by considering each pair of objects in V independently, and adding edges randomly given the cluster labels of the two objects.

The experiments record algorithm performance while varying both attribute and link association. Within each level of correlation, all five attributes were generated with the same probability: P+ = P(A = 1 | C = +) ∈ {0.50, 0.55, ..., 0.95, 1.0} and P− = P(A = 1 | C = −) = 1.0 − P+. The symmetry in attribute parameters simplifies the analysis, but it is not necessary for algorithm correctness. Intra-cluster and inter-cluster links were generated with the following range of probabilities: P^l_in = P(e_ij | C_i = C_j) ∈ {0.10, 0.12, ..., 0.18, 0.20} and P^l_out = P(e_ij | C_i ≠ C_j) = 0.2 − P^l_in. This range of probabilities, and its symmetry, was chosen to produce a graph with approximately 10% of the n(n − 1)/2 possible edges. This level of linkage is comparable to the levels of sparsity we have observed in real-world relational data sets.
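The generator described above can be sketched as follows (our code, following the stated parameters; the example defaults are one point in the parameter ranges, and unlike the data used in the experiments this sketch does not enforce that the sampled graph is connected).

```python
import numpy as np

def generate_synthetic(n=200, k=5, p_plus=0.8, p_link_in=0.16, rng=None):
    """Generate an attributed graph with two latent clusters.

    p_plus    : P(attribute = 1 | cluster = +); P(attribute = 1 | -) = 1 - p_plus.
    p_link_in : P(edge | same cluster); P(edge | different clusters) = 0.2 - p_link_in.
    Returns cluster labels C, attribute matrix X (n x k), and adjacency matrix A.
    """
    rng = np.random.default_rng(rng)
    C = rng.random(n) < 0.5                        # cluster membership, P(+) = 0.5
    p_attr = np.where(C, p_plus, 1.0 - p_plus)     # per-object attribute probability
    X = (rng.random((n, k)) < p_attr[:, None]).astype(int)
    same = C[:, None] == C[None, :]
    p_edge = np.where(same, p_link_in, 0.2 - p_link_in)
    A = rng.random((n, n)) < p_edge
    A = np.triu(A, 1)                              # undirected: sample each pair once
    A = (A | A.T).astype(int)
    return C.astype(int), X, A
```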

4.2 Metric Performance
We measured the accuracy of the six metrics across the range of attribute and link probabilities described above. Figure 1 reports the accuracy of the clusterings returned by the similarity metrics, averaged over 100 trials at each setting. Note that the bottom, foremost corner of each plot represents completely random link and attribute information, where no metric should do better than 0.5.

LinkOnly and AttrOnly performance is as expected—they perform well when the link (or, respectively, attribute) signal is moderate to high, but poorly otherwise. The LinkAsAttr and WtLinkAttr1 results are comparable to AttrOnly. However, the LinkAsFilter and WtLinkAttr2 metrics achieve perfect accuracy over a wide range of conditions, with LinkAsFilter covering more of the space than WtLinkAttr2. These metrics should yield good results in datasets where either the links or the attributes are moderately correlated with the clusters. However, they do not always perform as well as LinkOnly and AttrOnly. Consider the LinkOnly results when link correlation is moderate and attribute correlation is low—both hybrid metrics achieve significantly lower accuracy than would be achieved considering links in isolation. Similar behavior is apparent for the AttrOnly metric, but notice that the effect is more pronounced in this situation. This indicates that the two hybrid metrics rely more heavily on link information, and it illustrates the tradeoff for utilizing both sources of information—the additional information increases variance, which will impair performance in some situations, in exchange for better coverage of the space.

4.3 Performance Analysis
LinkAsFilter and WtLinkAttr2 achieve superior performance over a wide range of data characteristics, but what is the mechanism by which this occurs? Following our analysis in section 3, we hypothesize that metric performance is influenced by the intra- and inter-cluster transition probabilities. We conjecture that the algorithm will be able to distinguish clusters if the distributions of intra- and inter-cluster transition probabilities are separable, where separation depends on the mean and variance of the transition probabilities.

Given our data generation parameters, we can calculate the intra- and inter-cluster mean transition probabilities analytically. Recall that our data generation process produces the same distribution for each cluster, and that the transition probabilities in P are normalized to sum to one. This means we can examine a single quantity, µ_Pin = E[P_in], since the inter-cluster mean is simply µ_Pout = 1 − µ_Pin. When µ_Pin = 1.0 there is maximal separation between the two clusters; µ_Pin = 0.5 corresponds to no separation.
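To make µ_Pin concrete, the following sketch (ours) forms the row-stochastic matrix P = D⁻¹W and averages each node's total intra-cluster transition probability.

```python
import numpy as np

def intra_cluster_mean(W, labels):
    """Empirical mu_Pin: mean total intra-cluster transition probability.

    W      : (N, N) similarity matrix with positive row sums.
    labels : length-N array of cluster ids.
    """
    P = W / W.sum(axis=1, keepdims=True)   # row-stochastic P = D^{-1} W
    same = labels[:, None] == labels[None, :]
    p_in = (P * same).sum(axis=1)          # per-node intra-cluster transition mass
    return p_in.mean()
```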

Figure 2 graphs µ_Pin vs. the attribute/link correlations. The shapes of the graphs are quite similar to the accuracy graphs in figure 1, indicating a strong relationship between mean separation and algorithm performance. However, the areas where we observe perfect performance (i.e., accuracy = 1.0) do not necessarily correspond to maximum mean separation (i.e., µ_Pin may remain below 1.0). This illustrates a difference between the LinkAsFilter and WtLinkAttr2 metrics—µ_Pin is significantly higher on average for the LinkAsFilter metric.

To examine the effect of µ_Pin on algorithm performance, we analyzed the data from all metrics concurrently. Figure 3a graphs µ_Pin vs. accuracy for the experiments reported above, combining results from all the metrics in the same graph. There is a clear relationship between µ_Pin and accuracy (corr = 0.849, p ≪ 0.05)—accuracy is consistently high for µ_Pin > 0.675 and consistently low otherwise. We looked at the association between µ_Pin and the eigenvector values in x1 using a number of different measures of eigenvector stability. Only one measure showed a clear relationship to µ_Pin—a measure of the quality of the ordering in the (sorted) eigenvector, which records the maximum accuracy achievable from the set of m possible partition values considered by the algorithm. The linear search for an optimal partition (in the NCut algorithm) should not be adversely affected by degradation of piecewise constancy unless the degradation also affects the ordering of objects' eigenvector values. If the maximum accuracy is low, this indicates disorder in the eigenvector. The evector ordering measure is graphed against µ_Pin in figure 3b. It shows that decreasing µ_Pin results in a disordering of the eigenvector values. These results explain the high accuracy results—for µ_Pin > 0.675 there is little disorder in the eigenvector.
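Under our reading of the text, the evector ordering measure can be computed as follows: sweep the m candidate thresholds over the eigenvector x1 and record the best clustering accuracy any of those splits could achieve against the true labels (a sketch with hypothetical function and argument names).

```python
import numpy as np

def evector_ordering(x1, labels, m):
    """Maximum accuracy attainable by thresholding the eigenvector x1.

    labels : length-N array of 0/1 cluster labels.
    """
    best = 0.5
    for t in np.linspace(x1.min(), x1.max(), m):
        pred = (x1 < t).astype(int)
        acc = (pred == labels).mean()
        best = max(best, acc, 1.0 - acc)   # the label assigned to each side is arbitrary
    return best
```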

Figure 3c graphs evector ordering vs. accuracy. There is a strong correlation between evector ordering and accuracy, but there are also a significant number of trials with very little disorder that achieve only low accuracy. This effect is explained by figure 3d, where we graph the precision of the smallest cluster returned by the algorithm. This shows that when the eigenvector is ordered correctly but the algorithm only achieves low accuracy, it is because the algorithm prefers to separate a small, but pure, cluster from the rest of the graph. Why does the algorithm break off small, high-precision clusters even when the eigenvector ordering is correct? This is not a spurious effect due to consideration of only a small number of thresholds (e.g., m values); it remains consistent even when we set m = N. We discuss reasons for this effect below.

We have shown that mean separation affects algorithm performance through the ordering of the objects' eigenvector values, but how does variance interact with mean separation to degrade performance? Figures 4a-b graph the same variables as figure 3a, but for sets of experiments with |V| = 500 and |V| = 50. This illustrates the impact of decreased, and increased, variance in the transition probabilities—increasing variance impairs performance for all µ_Pin, but decreasing variance only improves performance for µ_Pin > 0.675. This is contrary to our expectation that decreased variance would improve performance by increasing the separation between cluster transition probabilities. However, this effect is due to the NCut optimization, not the ordering of the eigenvector values. Figure 4c shows a box plot of evector ordering as a function of sample size, for the set of trials with µ_Pin < 0.675. Except for the smallest sample size, where we see higher accuracy due to chance alone, the mean ordering value is monotonically increasing with sample size.

Figure 1: Cluster accuracy of metrics on synthetic data: (a) AttrOnly, (b) LinkOnly, (c) LinkAsAttr, (d) WtLinkAttr1, (e) WtLinkAttr2, and (f) LinkAsFilter.

Figure 2: Intra-cluster means of metrics for synthetic data: (a) AttrOnly, (b) LinkOnly, (c) LinkAsAttr, (d) WtLinkAttr1, (e) WtLinkAttr2, and (f) LinkAsFilter.

Figure 3: Analysis of intra-cluster mean on algorithm performance: (a) µ_Pin vs. accuracy (200 objects), (b) µ_Pin vs. ordering, (c) ordering vs. accuracy, and (d) ordering vs. precision.

Figure 4d graphs accuracy results for the same sample, showing that the algorithm converges to low accuracies as sample size increases. Maximizing the NCut criterion causes the algorithm to consistently prefer high precision over high accuracy when the separation between intra- and inter-cluster transition probabilities is low (i.e., µ_Pin < 0.675). This indicates that metrics with low µ_Pin should not be combined with the NCut criterion.

Figure 4: Analysis of intra-cluster variance on algorithm performance: (a) 500 objects, (b) 50 objects, (c) ordering and (d) accuracy for settings with µ_Pin < 0.675.

It is now clear that the WtLinkAttr2 and LinkAsFilter metrics achieve their good performance due to high µ_Pin, but what do they trade off for this increased separation? Figure 5a graphs a box plot of µ_Pin for each metric individually. This is a one-dimensional summary of the data in figure 2, which again illustrates that µ_Pin is significantly higher for the LinkAsFilter metric on average. Figure 5b graphs a box plot of the variance of P_in for each metric. This shows that LinkAsFilter trades off higher variance for increased mean separation. Figures 5c-d graph the performance of WtLinkAttr2 and LinkAsFilter for |V| = 50. Compare this to figure 1 to see that performance degradation is not uniform across metrics. The LinkAsFilter metric is adversely affected over a wider range of data conditions. This illustrates the primary distinction between LinkAsFilter and WtLinkAttr2. The LinkAsFilter metric reduces the amount of information it uses in order to increase the mean separation between the clusters. Because it filters the attribute information through the existing edges of the graph, it throws away both useful and noisy data and increases the variance of the transition probabilities. If the sample size is large enough to withstand this increase in variance, then the metric will produce superior clusterings. However, when the sample size is low, the filter can do more harm than good. For example, filtering through the existing edges may disconnect a previously connected cluster. In these situations, it may be best to use the WtLinkAttr2 metric, which suffers less from increased variance and still performs well over a wide range of data characteristics. However, since we do not know how to set α for WtLinkAttr2 in practice, and because LinkAsFilter offers the opportunity to use efficient eigensolver techniques, we focus on LinkAsFilter for our empirical data experiments.

5. EMPIRICAL DATA EXPERIMENTS
The experiments reported below are intended to evaluate two assertions. The first claim is that the LinkAsFilter clustering approach can be used to find groups of items with similar attribute values and high inter-connectedness. We evaluate this claim by comparing the clusters produced by the LinkAsFilter metric to randomly generated clusters of the same size, evaluating intra-cluster attribute similarity and intra-cluster linkage.

The second claim is that the LinkAsFilter clustering approach finds meaningful clusters. Evaluating clusterings of datasets for which there is no right answer is a difficult task. One approach is to present the resulting clusters for user examination. For this type of subjective evaluation, we include example cluster members from two real-world datasets. Another, more objective, approach is to examine cluster utility by evaluating the cluster labels' ability to improve a related classification task. We evaluate three approaches (LinkOnly, AttrOnly, and LinkAsFilter) on a third real-world dataset in this manner, and show that the LinkAsFilter clusters achieve a significant improvement in classification accuracy.

Figure 5: (a) Intra-cluster mean by metric, (b) intra-cluster variance by metric, (c) accuracy of WtLinkAttr2 and (d) LinkAsFilter for 50 objects.

5.1 Datasets
We clustered three real-world datasets in which attributes exhibit correlation among linked objects and the link structure exhibits clustering. These are the characteristics we expect to find in datasets that contain communities, and it is in these situations that we expect our clustering algorithm to perform well.

The first data set is drawn from Cora, a database of computer science research papers extracted automatically from the web using machine learning techniques [12]. We selected the largest connected component from the set of machine-learning papers published after 1993. The resulting graph contains 1,042 papers and 2,546 citation links. We clustered the undirected version of this graph. The similarity metric considered two topic attributes at different levels of granularity (e.g., {Machine Learning, Neural Networks} and {Planning, Rule Learning}).

The second data set consists of a set of web pages from four computer science departments, collected by the WebKB Project [4]. The web pages have been manually classified into the categories: course, faculty, staff, student, research project, or other. The category "other" denotes a page that is not a home page (e.g., a curriculum vitae linked from a faculty page or a homework description linked from a course page). The collection contains approximately 4,000 web pages and 8,000 hyperlinks among those pages. We clustered the largest connected component in these data—a graph of 1,236 pages and 3,673 hyperlinks. Again, we used the undirected version of the graph. The similarity metric considered two attributes: page category and department. However, the entire component is from a single department (Wisconsin), so the department attribute adds no additional information.

The third data set is a relational data set containing information about the yeast genome at the gene and the protein level (www.cs.wisc.edu/~dpage/kddcup2001/).

Table 1: Cora cluster examples

Cluster 9: Belief revision: A critique; Plausibility measures and default reasoning; Modeling belief in dynamic systems. Part I: foundations; Knowledge-Based Framework for Belief Change, Part II: Revision and Update; Iterated revision and minimal revision of conditional beliefs; An event-based abductive model of update; On the logic of iterated belief revision; A unified model of qualitative belief change: A dynamical systems perspective; Generalized update: Belief change in dynamic settings

Cluster 14: In defense of C4.5: Notes on learning one-level decision trees; Exploring the decision forest: An empirical investigation of Occam's razor in decision tree induction; Algorithmic stability and sanity-check bounds for leave-one-out cross-validation; Bias and the quantification of stability; Characterizing the generalization performance of model selection strategies; A new metric-based approach to model selection; Preventing overfitting of Cross-Validation data; Further experimental evidence against the utility of Occam's razor

Cluster 19: An empirical evaluation of bagging and boosting; On-line portfolio selection using multiplicative updates; Heterogeneous uncertainty sampling for supervised learning; Improved boosting algorithms using confidence-rated predictions; On-line algorithms in machine learning; Training algorithms for hidden Markov models using entropy based distance functions; A system for multiclass multi-label text categorization; Coevolutionary Search Among Adversaries

Cluster 24: Refinement of Bayesian networks by combining connectionist and symbolic techniques; DistAl: An inter-pattern distance-based constructive learning algorithm; An Anytime Approach to Connectionist Theory Refinement: Refining the Topologies of Knowledge-Based Neural Networks; Creating advice-taking reinforcement learners; Learning controllers for industrial robots; Generating accurate and diverse members of a neural-network ensemble; A Neural Architecture for a High-Speed Database Query System; Comparing methods for refining certainty-factor rule-bases

The data set contains information about 1,243 genes and 1,734 interactions. We clustered the largest connected component, which consisted of 814 genes and 1,475 interactions. The similarity metric considered 13 boolean function attributes. Each gene may have multiple functions. We evaluated the resulting cluster labels' ability to predict gene localization. We applied a relational Bayesian classifier [15] to the entire dataset, using the cluster labels as an additional attribute, and measured performance.

5.2 Results
Clustering the sample of Cora papers produced 71 clusters varying in size from 1 to 202 papers, with an average size of 15. We report statistics for the 28 clusters with more than six papers. Table 1 includes randomly selected titles from four clusters for subjective evaluation. Although we did not use title words in the similarity metrics, the clusters show a surprising uniformity among the titles. This indicates that research papers can be clustered into meaningful groups using the citation structure and topic attributes alone.

To evaluate intra-cluster attribute similarity, we averaged the attribute similarity across all pairs of papers within each cluster. As a baseline measure we calculated the average attribute similarity in ten random clusterings. Figure 6a plots the intra-cluster attribute similarity (dark bars) compared to the expected averages given random clusterings (light bars), with the clusters listed in ascending order by size.

Figure 6: Evaluation of hybrid clusters in Cora: (a) mean attribute similarity per cluster, (b) proportion of intra-cluster links per cluster.

Table 2: WebKB cluster examples

Cluster 5: http://www.cs.wisc.edu/Dienst/UI/2.0/Describe/ncstrl.uwmadison/CS-TR-89-890; http://www.cs.wisc.edu/Dienst/UI/2.0/Describe/ncstrl.uwmadison/CS-TR-90-947; http://www.cs.wisc.edu/Dienst/UI/2.0/Describe/ncstrl.uwmadison/CS-TR-95-1283; http://www.cs.wisc.edu/Dienst/UI/2.0/Describe/ncstrl.uwmadison/CS-TR-91-1037; http://www.cs.wisc.edu/Dienst/UI/2.0/Describe/ncstrl.uwmadison/CS-TR-90-962; http://www.cs.wisc.edu/Dienst/UI/2.0/Describe/ncstrl.uwmadison/CS-TR-89-900; http://www.cs.wisc.edu/~reps/reps.html; http://www.cs.wisc.edu/Dienst/UI/2.0/Describe/ncstrl.uwmadison/CS-TR-91-1038

Cluster 9: http://www.cs.wisc.edu/~bart/537/quizzes/quiz6.html; http://www.cs.wisc.edu/~bart/cs537.html; http://www.cs.wisc.edu/~bart/537/quizzes/quiz3.html; http://www.cs.wisc.edu/~bart/537/quizzes/quiz10.html; http://www.cs.wisc.edu/~bart/537/quizzes/quiz2.html; http://www.cs.wisc.edu/~bart/537/programs/program2.html; http://www.cs.wisc.edu/~bart/537/lecturenotes/titlepage.html; http://www.cs.wisc.edu/~bart/537/quizzes/quiz9.html

Cluster 11: http://www.cs.wisc.edu/~cs354-2/cs354/lec.notes/numbers.html; http://www.cs.wisc.edu/~cs354-2/cs354/lec.notes/data.structures.html; http://www.cs.wisc.edu/~cs354-2/cs354/solutions/Q2.j.html; http://www.cs.wisc.edu/~cs354-2/cs354/lec.notes/arch.features.html; http://www.cs.wisc.edu/~cs354-2/cs354/lec.notes/interrupts.html; http://www.cs.wisc.edu/~cs354-2/cs354/lec.notes/case.studies.html; http://www.cs.wisc.edu/~cs354-2/cs354/lec.notes/arith.int.html; http://www.cs.wisc.edu/~cs354-2/cs354/lec.notes/MAL.html

Cluster 14: http://www.cs.wisc.edu/condor/research.html; http://www.cs.wisc.edu/~bart/cs638.html; http://www.cs.wisc.edu/coral/coral.people.html; http://www.cs.wisc.edu/~brad/brad.html; http://www.cs.wisc.edu/~sastry/spring96.html; http://www.cs.wisc.edu/~ashraf/ashraf.html; http://maf.wisc.edu/distributed/condor/index.html; http://www.cs.wisc.edu/~ssl/resume.html

Figure 7: Evaluation of hybrid clusters in WebKB: (a) mean attribute similarity per cluster, (b) proportion of intra-cluster links per cluster.

Attribute similarity is significantly higher than expected.2

Note that the largest cluster (#28) does not exhibit high linkage or attribute similarity. This cluster may contain the set of papers that could not be partitioned into smaller clusters (i.e., the papers with no coherent community structure).

Figure 6b shows the actual and expected proportion of intra-cluster citations. To assess the connectivity of the clusters, we compared the proportion of intra-cluster linkage (per cluster) to the expected proportions, given ten random clusterings. Again, the proportion of intra-cluster citations is significantly higher than the expected values. This indicates that the clustering technique is finding groups of highly inter-connected research papers.
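A sketch of this evaluation protocol (our code, not the authors'): compute, for each cluster, the mean pairwise attribute agreement and the proportion of intra-cluster links, then compare against the same statistics for random clusterings of identical sizes obtained by permuting the cluster assignment.

```python
import numpy as np

def intra_cluster_stats(X, A, assignment):
    """Per-cluster mean attribute similarity and intra-cluster link proportion.

    assignment : length-n numpy array of cluster ids.
    """
    stats = {}
    for c in np.unique(assignment):
        idx = np.where(assignment == c)[0]
        if len(idx) < 2:
            continue
        # mean pairwise attribute agreement within the cluster
        sims = [(X[i] == X[j]).mean() for i in idx for j in idx if i < j]
        # fraction of the cluster's incident edge endpoints that stay inside it
        inside = A[np.ix_(idx, idx)].sum()
        total = A[idx, :].sum()
        stats[c] = (np.mean(sims), inside / total if total else 0.0)
    return stats

def random_baseline(X, A, assignment, trials=10, rng=None):
    """Same statistics averaged over random clusterings with identical sizes."""
    rng = np.random.default_rng(rng)
    results = []
    for _ in range(trials):
        perm = rng.permutation(len(assignment))
        results.append(intra_cluster_stats(X, A, assignment[perm]))
    return results
```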

Clustering the sample of WebKB pages produced 55 clusters varying in size from 1 to 649 pages, with an average size of 22. We report statistics for the 15 clusters with more than six pages, listed in ascending order by size. Table 2 includes randomly selected URLs from four clusters for subjective evaluation. Recall that the component graph only contains pages from the University of Wisconsin. The selected clusters appear to group by function—for example, tech reports, course pages, or research group pages.

Figure 7a plots the intra-cluster attribute similarity compared to the expected averages given random clusterings. Figure 7b shows the actual and expected proportion of intra-cluster hyperlinks. The proportion of intra-cluster linkage is significantly higher than expected, but notice that the largest cluster's (#15) expected linkage is quite high by random chance. This may indicate that the largest cluster contains a set of pages that are too tightly connected to partition. This clustering does not exhibit significantly higher than expected attribute similarity. However, we note that the algorithm is still able to cluster pages into groups that are highly inter-connected. This indicates that the LinkAsFilter metric may be robust to irrelevant attribute values.

Clustering the sample of genes produced 88 clusters varying in size from 1 to 140 genes, with an average size of 8. We report statistics for the 14 clusters with more than six genes. Intra-cluster attribute similarity (figure 8a) and intra-cluster linkage (figure 8b) are both significantly higher than expected. These results show that the LinkAsFilter metric can be used to find groups of genes with similar functions and many common interactions.

The structure of genomic data offers an opportunity for an objective evaluation of the clustering results. Clusters of inter-connected genes with similar associated functions may indicate a group of genes that are interacting to perform a particular function in the cell. If this is the case, the cluster labels should be helpful in predicting gene localization in the cell. To test this hypothesis, we used the cluster labels to predict gene localization. We applied a relational Bayesian classifier (RBC) [15] to the gene data, using the cluster labels as an additional attribute, and measured the change in accuracy. Figure 8d reports average 10-fold cross-validation accuracies for RBC models learned using the cluster labels from the LinkOnly, AttrOnly, and LinkAsFilter metrics.

2We assessed significance using two-tailed t-tests, p < 0.05.

Figure 8: Evaluation of hybrid clusters in Gene: (a) mean attribute similarity per cluster, (b) proportion of intra-cluster links per cluster, (c) cluster size, and (d) classification accuracy for AttrOnly, LinkOnly, LinkAsFilter, and without clusters (cluster labels only vs. other attributes plus cluster labels).

The baseline RBC model used twelve attributes for prediction, including gene phenotype and motif, and achieved an average accuracy of 66.3%. The RBC model that included cluster labels from AttrOnly did not significantly improve accuracy.³ The model that included cluster labels from LinkOnly achieved a significant improvement in accuracy, with an average of 68.4%, indicating that gene interactions alone are helpful for predicting localization. However, the model that included cluster labels from LinkAsFilter achieved an average accuracy of 70.2%. This is a significant improvement over both LinkOnly and the baseline RBC model without cluster labels, which demonstrates the utility of clustering for communities using both attribute and link information.

6. DISCUSSION
This paper presents a hybrid metric for spectral clustering algorithms that exploits both attribute information and link structure to improve the discovery of communities in relational data. There has been relatively little work investigating clustering techniques for relational domains. The work in this area has focused on either complex generative models with latent variables [11, 20, 3], or augmented clustering techniques that use ad hoc similarity metrics to incorporate both link and attribute information [14, 9]. Due to the complexity of probabilistic relational models with latent variables, and the sparsity of relational graphs that enables the use of efficient eigensolver techniques, we chose to explore extensions to spectral clustering for relational domains.

The most closely related prior work is that of He, Ding, Zha, and Simon [9], which uses a spectral graph-partitioning algorithm to automatically identify topics in sets of retrieved web pages. This approach uses a similarity measure specifically designed for high-dimensional text domains with weighted co-citation links. We differ from this work, and other research on hybrid spectral algorithms, in our exploration of the characteristics that underlie successful similarity metrics.

³ Again, significance was assessed using two-tailed t-tests, p < 0.05.

We have set up a framework to evaluate different similarity metrics quantitatively over a wide range of relational datasets. Our experiments show that increasing the separation between total intra-cluster and inter-cluster transition probabilities results in superior performance over a wide range of data characteristics. One way to increase the separation between cluster transition probabilities is to drop potentially noisy information from consideration. Using this approach, we expect the LinkAsFilter metric will successfully recover groupings over a wide range of data characteristics.

There are two primary advantages to using the LinkAsFilter metric. The first advantage is algorithm efficiency: there are O(E) approximate eigensolver algorithms, and there are O(n^1.4) exact eigensolver algorithms for sparse matrices that can exploit the sparse matrix structure produced by the metric. The second advantage is the choice of α = 1, which is independent of data characteristics. We expect the metric will work well in any dataset exhibiting community structure, provided there is enough data to withstand the associated increase in variance. In small datasets, where the size of the data cannot offset the increase in variance, the application of balanced metrics (e.g., WtLinkAttr2) may produce superior clusterings. In practice, however, this approach is limited by the need to set α to balance the link and attribute information.
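To illustrate the efficiency point (not the LinkAsFilter metric itself, whose exact definition is given earlier in the paper), the sketch below builds a similarity matrix that is nonzero only for linked pairs and feeds it to a sparse eigensolver; the generic dot-product similarity and function names are placeholders assumed only for illustration.

```python
# Hedged sketch: a link-sparse similarity matrix keeps the eigendecomposition
# cheap, since Lanczos-style solvers touch only the O(E) nonzero entries.
import numpy as np
import scipy.sparse as sp
from scipy.sparse.linalg import eigsh

def sparse_spectral_cut(edges, attrs, n, sim=lambda a, b: float(np.dot(a, b))):
    """edges: iterable of (i, j) pairs; attrs: (n, d) attribute matrix."""
    rows, cols, vals = [], [], []
    for i, j in edges:                       # attribute similarity only on links
        s = sim(attrs[i], attrs[j])
        rows += [i, j]; cols += [j, i]; vals += [s, s]
    W = sp.csr_matrix((vals, (rows, cols)), shape=(n, n))
    d = np.asarray(W.sum(axis=1)).ravel()
    d[d == 0] = 1.0
    D_inv_sqrt = sp.diags(1.0 / np.sqrt(d))
    L = sp.identity(n, format="csr") - D_inv_sqrt @ W @ D_inv_sqrt  # normalized Laplacian
    _, vecs = eigsh(L, k=2, which="SM")      # two smallest eigenpairs
    fiedler = vecs[:, 1]
    return fiedler > np.median(fiedler)      # two-way split of the nodes
```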

With a way to evaluate each setting, an algorithm could search for the best α. Our analysis indicates that the "best" settings will maximize the separation between the intra-cluster and inter-cluster transition probabilities. We conjecture that the eigenvector information, more specifically the separation between the means of the distributions of eigenvector values on either side of the cut, can be used to approximate this information. We report preliminary findings in support of this conjecture.

Figure 9a graphs the correlation between algorithm performance and the separation of eigenvector-value distributions. We clustered over the space of synthetic datasets described in section 4.1 using 20 different values of α, chosen uniformly in the range [0, 1]. We recorded (1) the accuracy of the clustering, and (2) the distance between the means of the eigenvector-value distributions on either side of the chosen cut (after the values were normalized to unit range). Figure 9b shows performance when we set α by maximizing the separation between the means of the eigenvector-value distributions. Comparing this graph to figure 1, we can see that this technique approaches the performance of the LinkAsFilter metric. This is a promising direction to explore for applications with little data, where the variance will be too high to apply LinkAsFilter successfully.
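A minimal sketch of this α-selection heuristic follows. The similarity-construction function and the grid of 20 values are placeholders standing in for whatever hybrid metric is being tuned, and the two-way cut at the eigenvector median is a simplification of the partitioning step.

```python
# Hedged sketch: pick the alpha whose second eigenvector shows the widest gap
# between the mean values on the two sides of the proposed cut.
import numpy as np

def separation(v, split):
    v = (v - v.min()) / (v.max() - v.min() + 1e-12)   # rescale to unit range
    return abs(v[split].mean() - v[~split].mean())

def choose_alpha(build_similarity, alphas=np.linspace(0.0, 1.0, 20)):
    """build_similarity(alpha) -> dense (n, n) hybrid similarity matrix."""
    best_alpha, best_gap = None, -np.inf
    for a in alphas:
        W = build_similarity(a)
        d = W.sum(axis=1)
        d[d == 0] = 1.0
        P = W / d[:, None]                            # row-normalized transitions
        evals, evecs = np.linalg.eig(P)
        v = np.real(evecs[:, np.argsort(-np.real(evals))[1]])  # second eigenvector
        split = v > np.median(v)                      # proposed two-way cut
        if split.all() or not split.any():            # degenerate cut: skip
            continue
        gap = separation(v, split)
        if gap > best_gap:
            best_alpha, best_gap = a, gap
    return best_alpha
```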

[Figure 9: Searching for α to use in the metric: (a) correlation between separation of eigenvector values and accuracy (corr = 0.71), and (b) cluster accuracy using the α that maximizes separation.]

7. CONCLUSIONS AND FUTURE WORK

We have analyzed the spectral decomposition algorithm from a statistical perspective and shown that the successful hybrid metrics use the link and attribute information to increase the separation between noisy clusters. We have shown an empirical connection between the distribution of transition probabilities and algorithm performance, connecting both mean and variance to cluster accuracy. Future work will compare this approach to latent-variable relational models and explore complexity/efficiency tradeoffs between the two techniques. Furthermore, we will attempt to derive theoretical bounds on finite-sample performance, and explore alternative optimization criteria for data with low mean separation, where the NCut criterion prefers high-precision/low-recall groupings.

In addition, the WebKB results suggest an alternative clustering task: clustering data that exhibit role-equivalence structure, rather than community structure. Objects that play the same roles in a graph have similar attributes and similar link patterns but may not actually link to each other. For example, faculty pages rarely link to each other, but they consistently link to student and course pages. Current methods for grouping data in this manner focus primarily on link information (e.g., [17]). Extending this work to incorporate attribute information seems an exciting direction to explore.

8. ACKNOWLEDGMENTS

The authors acknowledge helpful comments and discussion from Alicia Wolfe. This research is supported under an AT&T Graduate Research Fellowship and by DARPA and AFRL under contract numbers F30602-00-2-0597 and F30602-01-2-0566.

9. REFERENCES

[1] F. Bach and M. Jordan. Learning spectral clustering. In Proceedings of NIPS 16, 2003.

[2] F. Chung. Spectral Graph Theory. The American Mathematical Society, 1997.

[3] D. Cohn and T. Hofmann. The missing link - a probabilistic model of document content and hypertext connectivity. Advances in Neural Information Processing Systems, 10, 2001.

[4] M. Craven, D. DiPasquo, D. Freitag, A. McCallum, T. Mitchell, K. Nigam, and S. Slattery. Learning to extract symbolic knowledge from the world wide web. In Proceedings of the 15th National Conference on Artificial Intelligence, 1998.

[5] I. Dhillon. Co-clustering documents and words using bipartite spectral graph partitioning. In Proceedings of the 7th ACM International Conference on Knowledge Discovery and Data Mining, 2001.

[6] W. Donath and A. Hoffman. Lower bounds for the partitioning of graphs. IBM Journal of Research and Development, 17:420–425, 1973.

[7] M. Fiedler. Algebraic connectivity of graphs. Czechoslovak Mathematical Journal, 23(98):298–305, 1973.

[8] G. Golub and C. Van Loan. Matrix Computations. Johns Hopkins University Press, 1983.

[9] X. He, C. Ding, H. Zha, and H. Simon. Automatic topic identification using webpages clustering. In Proceedings of the 1st IEEE International Conference on Data Mining, 2001.

[10] R. Kannan, S. Vempala, and A. Vetta. On clusterings: Good, bad and spectral. In Proceedings of the 41st Symposium on the Foundations of Computer Science, 2000.

[11] J. Kubica, A. Moore, J. Schneider, and Y. Yang. Stochastic link and group detection. In Proceedings of the 18th National Conference on Artificial Intelligence, 2002.

[12] A. McCallum, K. Nigam, J. Rennie, and K. Seymore. A machine learning approach to building domain-specific search engines. In Proceedings of the 16th International Joint Conference on Artificial Intelligence, 1999.

[13] M. Meila and J. Shi. A random walks view of spectral segmentation. In Proceedings of the 8th International Workshop on Artificial Intelligence and Statistics, 2001.

[14] D. Modha and W. Spangler. Clustering hypertext with applications to web searching. In Proceedings of the 11th ACM Conference on Hypertext and Hypermedia, 2000.

[15] J. Neville, D. Jensen, and B. Gallagher. Simple estimators for relational Bayesian classifiers. In Proceedings of the 3rd IEEE International Conference on Data Mining, 2003.

[16] A. Ng, M. Jordan, and Y. Weiss. On spectral clustering: Analysis and an algorithm. In NIPS 2001, 2001.

[17] K. Nowicki and T. Snijders. Estimation and prediction for stochastic blockstructures. Journal of the American Statistical Association, 96:1077–1087, 2001.

[18] B. Parlett. The Symmetric Eigenvalue Problem. Prentice-Hall, Inc., 1980.

[19] J. Shi and J. Malik. Normalized cuts and image segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 22(8):888–905, 2000.

[20] B. Taskar, E. Segal, and D. Koller. Probabilistic clustering in relational data. In Proceedings of the 17th International Joint Conference on Artificial Intelligence, 2001.


APPENDIX

A. PROOF OF THEOREM

Theorem: Let ∆ = (A_1, A_2) be a partition of V. Let the function S(i, j) define the similarity measure between v_i, v_j ∈ V. If, for all i, j, k, S(i, j) is conditionally independent of S(i, k) given node i, and E[P_11]E[P_22] ≠ E[P_12]E[P_21], then P has an eigenvector that will converge to piecewise constant w.r.t. ∆ as |A_1|, |A_2| → ∞.

Proof. In order to simplify the calculations below, we assume that the two clusters share the same distribution of intra- and inter-cluster similarity values. The symmetry in attribute parameters simplifies the analysis but is not necessary for correctness. Let μ_in be the mean intra-cluster similarity for nodes i, j ∈ A_1 or i, j ∈ A_2. Similarly, let μ_out be the mean inter-cluster similarity for nodes i ∈ A_1 and j ∈ A_2.

We can represent each entry in W as a random variable. Consider the entries of row i. The entries W_ij, W_ik are not independent because the similarity values are both based on node i. However, conditioned on the state of i (e.g., the attribute values of i), the entries can be viewed as independent random variables if the state of j is independent of the state of k. This assumption corresponds to a generative model in which the objects and links in the graph are conditionally independent given the object cluster memberships.

We will calculate the expected intra- and inter-cluster transition probabilities in P as a ratio of sums of random variables. Let T^i_in be the total intra-cluster transition probability for node i, where i ∈ A_k, k ∈ {1, 2}, and let |A_k| = n_k. Similarly, let T^i_out be the total inter-cluster transition probability, and T^i_all be the total transition probability. Then P^i_in is the ratio of T^i_in and T^i_all, and P^i_out is the ratio of T^i_out and T^i_all.

The normalized transition probabilities in P then correspond to the ratio of two random variables (e.g., T^i_in / T^i_all), which can be approximated using a truncated Taylor series expansion. The expectations for the intra- and inter-cluster normalized transition probabilities are below. (Analytical derivations, including the variances, are included in Section A.1.)

E[P^i_{in}] = E[T^i_{in}/T^i_{all}] \approx \frac{\mu_{T_{in}}}{\mu_{T_{all}}}\left[ 1 + \left(\frac{\sigma_{T_{all}}}{\mu_{T_{all}}}\right)^2 - \frac{\sigma_{T_{in}T_{all}}}{\mu_{T_{in}}\,\mu_{T_{all}}} \right]

E[P^i_{out}] = E[T^i_{out}/T^i_{all}] \approx \frac{\mu_{T_{out}}}{\mu_{T_{all}}}\left[ 1 + \left(\frac{\sigma_{T_{all}}}{\mu_{T_{all}}}\right)^2 - \frac{\sigma_{T_{out}T_{all}}}{\mu_{T_{out}}\,\mu_{T_{all}}} \right]

where σ_XY is the covariance of X and Y.

As n_1, n_2 → ∞, it follows directly from the Law of Large Numbers that the value of T^i_in / T^j_in → 1 for i, j ∈ A_k, since T_in is a sum of independent random variables with finite mean and variance. A similar argument holds for T_out and T_all. Now consider the normalized transition probabilities for P. If, in the limit, the sums T^i_in (and T^i_out, T^i_all) converge to the same value for all i ∈ A_k, then the normalized sums P^i_in will converge to the same value P_in for all i ∈ A_k. A similar argument holds for P^i_out.

As n_1, n_2 → ∞, we can decompose the matrix P into P = P′ + εE, where P′ is a matrix with constant transition probabilities P_in and P_out, and E is a perturbation matrix with ||E||_2 = 1. Then by matrix perturbation theory [8]:

(P' + \varepsilon E)\, x_i(\varepsilon) = \lambda_i(\varepsilon)\, x_i(\varepsilon)

where

x_i(\varepsilon) = x_i + \varepsilon \sum_{j=1,\, j \neq i}^{n} \frac{y_j^T E\, x_i}{(\lambda_i - \lambda_j)\, y_j^T x_j}\, x_j + O(\varepsilon^2), \qquad \lambda_i(\varepsilon) = \lambda_i \pm \frac{\varepsilon}{|y_i^T x_i|}.

Here x_i, y_i, and λ_i are the right and left eigenvectors, and the eigenvalues, of P′. As n_1, n_2 → ∞, ε → 0 and the eigenvectors of P will converge to the eigenvectors of P′. Therefore the graph will converge to a Markov chain with state space ∆ = (A_1, A_2) and constant transition probabilities R_11 = R_22 = E[P^i_in] and R_12 = R_21 = E[P^i_out]. If R_11 ≠ R_12, then R will be non-singular, and by proposition 2 in [13], P will have a piecewise constant eigenvector w.r.t. ∆.
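As a quick numerical sanity check of this limiting picture (not part of the proof), one can verify that a row-normalized matrix with block-constant entries has a second eigenvector that is exactly piecewise constant over the two blocks; the block sizes and similarity values below are arbitrary.

```python
# Hedged check: a block-constant P' has a piecewise-constant second eigenvector.
import numpy as np

n1, n2 = 60, 40
p_in, p_out = 0.9, 0.1                                  # within/between similarities
W = np.block([[np.full((n1, n1), p_in), np.full((n1, n2), p_out)],
              [np.full((n2, n1), p_out), np.full((n2, n2), p_in)]])
P = W / W.sum(axis=1, keepdims=True)                    # row-stochastic P'
evals, evecs = np.linalg.eig(P)
v = np.real(evecs[:, np.argsort(-np.real(evals))[1]])   # second eigenvector
print(v[:n1].max() - v[:n1].min(), v[n1:].max() - v[n1:].min())  # ~0 within blocks
print(abs(v[:n1].mean() - v[n1:].mean()))               # > 0: the blocks differ
```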

A.1 Analytic Derivations

When S(i, j) is conditionally independent of S(i, k) given the state of node i, the cluster transition probabilities are simply sums of independent random variables. Using conditional expectation (E[h(X, Y)] = E_X{E[h(X, Y)|X]}), we can calculate the expectation for T^i_in based on the state of i, which we refer to as i_S:

E[T^i_{in}] = E\Big[\sum_{j \in A_k} S(i, j)\Big]
            = \sum_{i_S} p(i_S) \cdot E\Big[\sum_{j \in A_k} S(i_S, j)\Big]
            = \sum_{i_S} p(i_S) \cdot n_k \cdot E[S(i_S, j) \mid j \in A_k]
            = n_k \cdot \sum_{i_S} p(i_S) \sum_{j_S} p(j_S)\, S(i_S, j_S)
            = n_k \cdot \sum_{i_S} \sum_{j_S} p(i_S)\, p(j_S)\, S(i_S, j_S)
            = n_k \cdot E[S_{in}]
            = n_k \cdot \mu_{in}

Total inter-cluster and overall means are calculated in a similar fashion: E[T^i_out] = n_k′ · μ_out, and E[T^i_all] = (n_k · μ_in) + (n_k′ · μ_out), where n_k′ denotes the size of the other cluster.
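A small Monte Carlo check of this expectation, under an assumed two-state toy model (binary node states with a hand-picked similarity table, chosen only for illustration), is sketched below.

```python
# Hedged check: with neighbor states drawn independently given the node's state,
# the sample mean of T_in approaches n_k * mu_in.
import numpy as np

rng = np.random.default_rng(0)
p_state = np.array([0.5, 0.5])                 # distribution over node states
S = np.array([[1.0, 0.2],                      # similarity indexed by (state_i, state_j)
              [0.2, 1.0]])
n_k = 50
mu_in = p_state @ S @ p_state                  # E[S(i, j)] for an intra-cluster pair
T_in = [S[rng.choice(2, p=p_state), rng.choice(2, size=n_k, p=p_state)].sum()
        for _ in range(20000)]
print(np.mean(T_in), n_k * mu_in)              # the two values should agree closely
```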

The variance of the total intra-cluster similarity is calculated as follows:⁴

Var[T^i_{in}] = Var\Big[\sum_{j \in A_k} S(i, j)\Big]
              = E_{i_S}\Big\{ Var\Big[\sum_{j \in A_k} S(i_S, j)\Big] \Big\}
              = \sum_{i_S} p(i_S) \cdot Var\Big[\sum_{j \in A_k} S(i_S, j)\Big]
              = \sum_{i_S} p(i_S) \cdot n_k \cdot Var[S(i_S, j) \mid j \in A_k]
              = n_k \cdot \sum_{i_S} \sum_{j_S} p(i_S)\, p(j_S)\, \{ S(i_S, j_S) - E_{i_S}[S(i_S, j_S)] \}^2

Total inter-cluster and overall variance are calculated in a similar fashion:

Var[T^i_{out}] = n_{k'} \cdot \sum_{i_S} p(i_S) \cdot Var[S(i_S, j) \mid j \in A_{k'}]

Var[T^i_{all}] = \sum_{i_S} p(i_S)\, \{ n_{k'} \cdot Var[S(i_S, j) \mid j \in A_{k'}] + n_k \cdot Var[S(i_S, j) \mid j \in A_k] \}

⁴ The derivation uses the following equivalence: Var(h(X, Y)) = E[h(X, Y)^2] − E[h(X, Y)]^2 = E_X{E[h(X, Y)^2 | X]} − E_X{E[h(X, Y) | X]^2} = E_X{Var(h(X, Y) | X)}.

From these we can calculate the expected transition probabilities of P using the ratio of two random variables (e.g., T_in / T_all). These calculations use an approximation of the ratio of two random variables, based on a truncated Taylor series expansion:

E[X/Y] \approx \frac{\mu_X}{\mu_Y}\left[ 1 + \left(\frac{\sigma_Y}{\mu_Y}\right)^2 - \frac{\sigma_{XY}}{\mu_X \mu_Y} \right]

Var(X/Y) \approx \left(\frac{\mu_X}{\mu_Y}\right)^2 \left[ \left(\frac{\sigma_X}{\mu_X}\right)^2 + \left(\frac{\sigma_Y}{\mu_Y}\right)^2 - \frac{2\,\sigma_{XY}}{\mu_X \mu_Y} \right]
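For readers who want the intermediate step, the expectation formula follows from expanding X/Y to second order around (μ_X, μ_Y) and taking expectations; the variance formula follows from the first-order (linear) terms of the same expansion. A sketch of the expectation step:

```latex
% Second-order Taylor expansion of X/Y about (mu_X, mu_Y), then expectations.
\begin{align*}
\frac{X}{Y} &\approx \frac{\mu_X}{\mu_Y}
  + \frac{X-\mu_X}{\mu_Y}
  - \frac{\mu_X (Y-\mu_Y)}{\mu_Y^{2}}
  - \frac{(X-\mu_X)(Y-\mu_Y)}{\mu_Y^{2}}
  + \frac{\mu_X (Y-\mu_Y)^{2}}{\mu_Y^{3}} \\[4pt]
E\!\left[\tfrac{X}{Y}\right] &\approx \frac{\mu_X}{\mu_Y}
  - \frac{\sigma_{XY}}{\mu_Y^{2}}
  + \frac{\mu_X \sigma_Y^{2}}{\mu_Y^{3}}
  = \frac{\mu_X}{\mu_Y}\left[\,1+\left(\frac{\sigma_Y}{\mu_Y}\right)^{2}
    -\frac{\sigma_{XY}}{\mu_X \mu_Y}\right]
\end{align*}
```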

The expectation and variance for the intra- and inter-cluster normalized transition probabilities are as follows:

E[P^i_{in}] = E[T^i_{in}/T^i_{all}] \approx \frac{\mu_{T_{in}}}{\mu_{T_{all}}}\left[ 1 + \left(\frac{\sigma_{T_{all}}}{\mu_{T_{all}}}\right)^2 - \frac{\sigma_{T_{in}T_{all}}}{\mu_{T_{in}}\,\mu_{T_{all}}} \right]

Var[P^i_{in}] = Var[T^i_{in}/T^i_{all}] \approx \left(\frac{\mu_{T_{in}}}{\mu_{T_{all}}}\right)^2 \left[ \left(\frac{\sigma_{T_{in}}}{\mu_{T_{in}}}\right)^2 + \left(\frac{\sigma_{T_{all}}}{\mu_{T_{all}}}\right)^2 - \frac{2\,\sigma_{T_{in}T_{all}}}{\mu_{T_{in}}\,\mu_{T_{all}}} \right]

E[P^i_{out}] = E[T^i_{out}/T^i_{all}] \approx \frac{\mu_{T_{out}}}{\mu_{T_{all}}}\left[ 1 + \left(\frac{\sigma_{T_{all}}}{\mu_{T_{all}}}\right)^2 - \frac{\sigma_{T_{out}T_{all}}}{\mu_{T_{out}}\,\mu_{T_{all}}} \right]

Var[P^i_{out}] = Var[T^i_{out}/T^i_{all}] \approx \left(\frac{\mu_{T_{out}}}{\mu_{T_{all}}}\right)^2 \left[ \left(\frac{\sigma_{T_{out}}}{\mu_{T_{out}}}\right)^2 + \left(\frac{\sigma_{T_{all}}}{\mu_{T_{all}}}\right)^2 - \frac{2\,\sigma_{T_{out}T_{all}}}{\mu_{T_{out}}\,\mu_{T_{all}}} \right]

where σ_XY is the covariance of X and Y. For the equations above, the covariance of T_in and T_all reduces to the variance of T_in, using conditional expectation to eliminate the covariance of T_in and T_out:

\sigma_{T_{in}T_{all}} = E[T_{in} T_{all}] - E[T_{in}] \cdot E[T_{all}]
 = E[T_{in}(T_{in} + T_{out})] - E[T_{in}] \cdot E[T_{in} + T_{out}]
 = E[T_{in}^2 + T_{in} T_{out}] - E[T_{in}]^2 - E[T_{in}] \cdot E[T_{out}]
 = E[T_{in}^2] + E[T_{in} T_{out}] - E[T_{in}]^2 - E[T_{in}] \cdot E[T_{out}]
 = E[T_{in}^2] - E[T_{in}]^2 + E[T_{in} T_{out}] - E[T_{in}] \cdot E[T_{out}]
 = Var(T_{in}) + \sum_{i_S} p(i_S)\, \{ E[T_{in} T_{out} \mid i_S] - E[T_{in} \mid i_S] \cdot E[T_{out} \mid i_S] \}
 = Var(T_{in}) + \sum_{i_S} p(i_S) \cdot 0
 = Var(T_{in})

A similar derivation applies to the covariance of T_out and T_all.