Community Extracting Using Intersection Graph and Content Analysis in Complex Network


Toshiya Kuramochi, Naoki Okada, Kyohei Tanikawa, Yoshinori Hijikata, Shogo Nishida

Graduate School of Engineering Science, Osaka University, Japan

The 2012 IEEE/WIC/ACM International Conference on Web Intelligence

Overview

1. Background and Problems of Community Detection

2. Our Proposed Method

3. Experimentation in Real SNS Networks

4. Results and Discussions

5. Conclusion

Background

Many researchers have studied complex networks and have found the “community structure”.

Community structure
• connections within groups are dense
• connections among groups are sparse

Example: communities in the WWW are sets of web pages related to a certain topic (e.g., business, science, sport).

Community structure is a key characteristic of complex networks.

Problem (1) – overlap of communities

Some nodes belong to several communities in real networks.

Example (communities in the WWW): the community of sports pages and the community of business pages overlap (e.g., economic effect of the Olympic Games).

Most ordinary clustering methods allocate each node to only one cluster. They CANNOT represent the overlap of communities.

A community detection method should be able to allocate a node to several clusters.

Problem (2) – edge inhomogeneity

Edges are not homogeneous in real networks.

Example (edges in an SNS network): same hobby, family, same university, workplace.

Many community detection methods assume that all edges are the same. They CANNOT represent the edge inhomogeneity.

Weights of edges should be set individually.

Problem (3) – appropriate number of communities

The number of real communities is often unknown

How many communities are in this network? 2? 3? 4?

Most hierarchical clustering methods require manual input of the appropriate number of communities.

The number of communities should be determined automatically.

Purpose of this work

We solve these three problems by proposing a new community detection method.

• A node may belong to several communities: using the idea of the intersection graph [Everett & Borgatti, 1998]

• Weights of edges are set individually: content information analysis

• The number of communities is automatically determined: clustering based on modularity [Newman, 2003]


Summary of our proposed method

Input: graph and content information

• Step 1: Enumeration of dense subgraphs

• Step 2: Conversion to the intersection graph

• Step 3: Calculation of the weights of edges

• Step 4: Clustering based on modularity

Output: clusters (communities)

Step 1: Enumeration of dense subgraphs

dense subgraph: clique, n-clique, n-clan, etc. (examples of dense subgraphs)

clique: complete graph

clique threshold: threshold of size of clique

(Figure: input graph with nodes A, B, C, D, E, F, G, H, I, J, K)

example of clique enumeration (clique thresholds shown: 3, 4, 5)

• { B, C, D, J, K } (size = 5)

• { D, E, I, J } (size = 4)

• { E, G, H } (size = 3)
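The clique enumeration of Step 1 can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes that maximal cliques are enumerated (here with the Bron-Kerbosch algorithm with pivoting) and that a clique is kept when its size is at least the clique threshold. The edge list is hypothetical: it contains the edges implied by the three cliques on the slide, plus two assumed edges (A-B and F-G) so that nodes A and F appear in the graph.

```python
from itertools import combinations

def build_adjacency(edges):
    """Build an adjacency map {node: set of neighbours} from undirected edges."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    return adj

def maximal_cliques(adj):
    """Yield every maximal clique (Bron-Kerbosch with pivoting)."""
    def expand(r, p, x):
        if not p and not x:
            yield frozenset(r)  # r cannot be extended, so it is maximal
            return
        pivot = max(p | x, key=lambda v: len(adj[v] & p))
        for v in list(p - adj[pivot]):
            yield from expand(r | {v}, p & adj[v], x & adj[v])
            p = p - {v}
            x = x | {v}
    yield from expand(set(), set(adj), set())

def enumerate_cliques(adj, clique_threshold):
    """Keep maximal cliques whose size is at least clique_threshold (assumed rule)."""
    kept = [c for c in maximal_cliques(adj) if len(c) >= clique_threshold]
    return sorted(kept, key=lambda c: (-len(c), sorted(c)))

# Edges implied by the cliques on the slide.
groups = ["BCDJK", "DEIJ", "EGH"]
edges = {frozenset(pair) for g in groups for pair in combinations(g, 2)}
# Assumed edges for nodes A and F (their links are not recoverable from the slide).
edges |= {frozenset("AB"), frozenset("FG")}

adj = build_adjacency(edges)
for threshold in (3, 4, 5):
    found = ["".join(sorted(c)) for c in enumerate_cliques(adj, threshold)]
    print(threshold, found)
```

Raising the clique threshold keeps only the larger cliques, which is why the method is later evaluated under several threshold settings.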

Step 2: Conversion to the intersection graph

intersection graph: each node represents a dense subgraph in the original graph; an edge connects two dense subgraphs that have common members

cliques in input graph (nodes A to K):

• X = { B, C, D, J, K }

• Y = { D, E, I, J }

• Z = { E, G, H }

intersection graph (common members):

• X and Y: { D, J }

• Y and Z: { E }

• X and Z: ∅

overlap threshold: threshold of number of common members
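The conversion of Step 2 can be sketched in a few lines. The sketch assumes that two cliques are linked when their number of common members is at least the overlap threshold (whether the comparison is inclusive is not stated on the slide). The function name `intersection_graph` and the dictionary layout are illustrative choices.

```python
from itertools import combinations

def intersection_graph(cliques, overlap_threshold):
    """Return {(name_a, name_b): common members} for clique pairs whose
    overlap is at least overlap_threshold (assumed to be inclusive)."""
    edges = {}
    for a, b in combinations(sorted(cliques), 2):
        common = cliques[a] & cliques[b]
        if len(common) >= overlap_threshold:
            edges[(a, b)] = common
    return edges

# Cliques from the slide example.
cliques = {"X": set("BCDJK"), "Y": set("DEIJ"), "Z": set("EGH")}

print(intersection_graph(cliques, 1))  # X-Y share D and J, Y-Z share E
print(intersection_graph(cliques, 2))  # only X-Y remains
```

With an overlap threshold of 2, the Y-Z edge disappears because the two cliques share only node E.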

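Step 3: Calculation of weights of edges

(Figure: cliques X = { B, C, D, J, K } and Y = { D, E, I, J }, with common members D and J)

• degree of overlap

$d(X, Y) = \dfrac{|X \cap Y|}{|X \cup Y|}$ (Jaccard coefficient)

• similarity of content information

1. each set ($X$ and $Y$) as one vector ($\vec{x}$ and $\vec{y}$) (tf-idf score)

2. $sim(X, Y) = \dfrac{\vec{x} \cdot \vec{y}}{|\vec{x}|\,|\vec{y}|}$ (cosine similarity)

• weight of the edge

$w(X, Y) = \dfrac{d(X, Y)}{1 + \epsilon - sim(X, Y)} \quad (0 < \epsilon < 1)$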
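The weight computation of Step 3 can be sketched as below. Several details are assumptions rather than the authors' choices: the token lists in `docs` are invented, the content of a clique is treated as one bag of tokens, the tf-idf variant uses a smoothed idf, and `eps=0.5` is an arbitrary value inside the stated range.

```python
import math
from collections import Counter

def jaccard(x, y):
    """Degree of overlap d(X, Y): size of intersection over size of union."""
    return len(x & y) / len(x | y)

def tfidf_vectors(docs):
    """Map each clique name to a sparse tf-idf vector {term: score}.
    Uses a smoothed idf, log((1 + n) / (1 + df)) + 1 (an assumed variant)."""
    n = len(docs)
    df = Counter(term for tokens in docs.values() for term in set(tokens))
    vectors = {}
    for name, tokens in docs.items():
        tf = Counter(tokens)
        vectors[name] = {t: c * (math.log((1 + n) / (1 + df[t])) + 1)
                         for t, c in tf.items()}
    return vectors

def cosine(u, v):
    """Cosine similarity of two sparse vectors; 0.0 if either is empty."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm = (math.sqrt(sum(w * w for w in u.values()))
            * math.sqrt(sum(w * w for w in v.values())))
    return dot / norm if norm else 0.0

def edge_weight(d, sim, eps=0.5):
    """w(X, Y) = d(X, Y) / (1 + eps - sim(X, Y)), with 0 < eps < 1."""
    return d / (1 + eps - sim)

X, Y = set("BCDJK"), set("DEIJ")
# Hypothetical content information (e.g., words from friend introductions).
docs = {"X": ["tennis", "club", "osaka", "tennis"],
        "Y": ["tennis", "osaka", "lab"]}

vec = tfidf_vectors(docs)
d = jaccard(X, Y)                    # 2 common members out of 7 in total
sim = cosine(vec["X"], vec["Y"])
w = edge_weight(d, sim)
print(d, sim, w)
```

Since `sim` is at most 1 and `eps` is positive, the denominator never reaches zero, and a higher content similarity yields a larger weight for the same degree of overlap.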

Step 4: Clustering based on modularity

Modularity Q

• Modularity Q is an indicator for evaluating a division of a network

• The clustering method based on modularity optimizes Q directly

• The division with the highest Q is the best division

automatic detection of the best number of clusters
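The idea of Step 4 can be illustrated with a small sketch. It assumes the standard weighted form of Newman's modularity and a naive greedy agglomeration that merges the pair of clusters giving the highest modularity and stops when no merge improves it. The authors' exact optimization procedure is not given on the slide, and the toy graph (two triangles joined by one edge) is invented for illustration.

```python
from itertools import combinations

def modularity(edges, clusters):
    """Weighted modularity Q of a division.
    edges: {(u, v): weight} for an undirected graph; clusters: list of sets."""
    m = sum(edges.values())  # total edge weight
    if m == 0:
        return 0.0
    strength = {}
    for (u, v), w in edges.items():
        strength[u] = strength.get(u, 0.0) + w
        strength[v] = strength.get(v, 0.0) + w
    q = 0.0
    for c in clusters:
        inside = sum(w for (u, v), w in edges.items() if u in c and v in c)
        total = sum(strength.get(n, 0.0) for n in c)
        q += inside / m - (total / (2 * m)) ** 2
    return q

def greedy_clustering(nodes, edges):
    """Merge clusters greedily while modularity increases (naive sketch)."""
    clusters = [{n} for n in nodes]
    best_q = modularity(edges, clusters)
    while len(clusters) > 1:
        candidates = []
        for a, b in combinations(range(len(clusters)), 2):
            merged = [c for i, c in enumerate(clusters) if i not in (a, b)]
            merged.append(clusters[a] | clusters[b])
            candidates.append((modularity(edges, merged), merged))
        q, merged = max(candidates, key=lambda item: item[0])
        if q <= best_q:
            break  # no merge improves Q: the number of clusters is fixed here
        best_q, clusters = q, merged
    return clusters, best_q

# Toy graph: two triangles joined by the edge (3, 4), all weights 1.
toy_edges = {(1, 2): 1.0, (1, 3): 1.0, (2, 3): 1.0,
             (4, 5): 1.0, (4, 6): 1.0, (5, 6): 1.0, (3, 4): 1.0}
clusters, q = greedy_clustering([1, 2, 3, 4, 5, 6], toy_edges)
print(sorted(sorted(c) for c in clusters), q)
```

The number of clusters is never passed in: it falls out of the stopping rule, which is the property the slide refers to as automatic detection of the best number of clusters.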

Summary of our proposed method

• Step 1: Enumeration of cliques

• Step 2: Conversion to the intersection graph

• Step 3: Calculation of weights of edges

• Step 4: Clustering based on modularity

(Figure: input graph with nodes A to K; cliques X, Y, Z; intersection graph with edge weights w(X, Y), w(Y, Z) and w(X, Z) = 0; X, Y, Z grouped into cluster 1 and cluster 2)


Dataset

mixi: one of the most popular SNSs in Japan
• test subjects: 20 mixi users
• link structure: nodes within a radius of two from each test subject
• content information: self-introduction, friend introduction, attributes (gender, address, birthday, etc.)

ground truth: relation names between a test subject and each person in the dataset, enumerated by the test subject (e.g., ‘same university’, ‘hobby friend’, ‘coworker’)

evaluation: we evaluate communities that include the test subject
• numerical evaluation
  • precision
  • recall
  • F-measure
• visual evaluation
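The three numerical measures can be sketched for a single detected community and a single ground-truth group. How the authors matched detected communities to relation names is not stated on the slide, so the set-based comparison below and the member names are assumptions made for illustration only.

```python
def precision_recall_f(detected, truth):
    """Compare one detected community with one ground-truth group (both sets).
    Returns (precision, recall, F-measure); empty inputs give 0.0."""
    detected, truth = set(detected), set(truth)
    hits = len(detected & truth)
    precision = hits / len(detected) if detected else 0.0
    recall = hits / len(truth) if truth else 0.0
    total = precision + recall
    f_measure = 2 * precision * recall / total if total else 0.0
    return precision, recall, f_measure

# Hypothetical example: users a-e, ground-truth group 'same university'.
detected = {"a", "b", "c", "d"}
same_university = {"b", "c", "e"}
p, r, f = precision_recall_f(detected, same_university)
print(p, r, f)
```

Here two of the four detected members are correct (precision 0.5) and two of the three true members are found (recall 2/3), so the F-measure, their harmonic mean, is 4/7.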

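Implementation

parameter setting: (clique threshold, overlap threshold) = (3, 2), (4, 2), (4, 3), (5, 2), (5, 3) and (5, 4)

implementation:

• WithCA
  content analysis: friend introduction analysis
  clustering method: clustering based on modularity
  # output clusters: automatically determined

• NonCA
  content analysis: none
  clustering method: clustering based on modularity
  # output clusters: automatically determined

• Everett’s method (conventional method)
  content analysis: none
  clustering method: simple hierarchical clustering
  # output clusters: correct data (the number of relation names enumerated by the test subject)

• Everett’s method* (conventional method)
  content analysis: none
  clustering method: simple hierarchical clustering
  # output clusters: equal to NonCA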


Numerical evaluation (F-measure)

(Figure: bar chart of F-measure, axis from 0 to 0.5, for Everett's method, Everett's method*, NonCA and WithCA at each (clique threshold, overlap threshold) setting: (3, 2), (4, 2), (4, 3), (5, 2), (5, 3), (5, 4))

Pairwise comparison (row method compared with column method; "-" marks a cell with no entry):

            Everett   Everett*  NonCA   WithCA
Everett     -         bad       bad     bad
Everett*    good      -         bad     bad
NonCA       good      good      -       -
WithCA      good      good      -       -

• contribution of clustering based on modularity

• superiority of our method

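Numerical evaluation (precision)

(Figure: bar chart of precision, axis from 0 to 0.8, for Everett's method, Everett's method*, NonCA and WithCA at each (clique threshold, overlap threshold) setting: (3, 2), (4, 2), (4, 3), (5, 2), (5, 3), (5, 4))

Pairwise comparison (row method compared with column method; "-" marks a cell with no entry):

            Everett   Everett*  NonCA   WithCA
Everett     -         -         -       bad
Everett*    -         -         -       bad
NonCA       -         -         -       bad
WithCA      good      good      good    -

• the precision becomes better with content analysis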

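Numerical evaluation (recall)

(Figure: bar chart of recall, axis from 0 to 0.6, for Everett's method, Everett's method*, NonCA and WithCA at each (clique threshold, overlap threshold) setting: (3, 2), (4, 2), (4, 3), (5, 2), (5, 3), (5, 4))

Pairwise comparison (row method compared with column method; "-" marks a cell with no entry):

            Everett   Everett*  NonCA   WithCA
Everett     -         bad       bad     bad
Everett*    good      -         bad     bad
NonCA       good      good      -       good
WithCA      good      good      bad     -

• the recall becomes better by using the clustering method based on modularity

• our methods outperform the ordinary method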

Visual evaluation (Everett’s method vs. WithCA)

(Figure panels: Everett’s method, WithCA)

Visual evaluation (NonCA vs. WithCA)

(Figure panels: NonCA, WithCA)


Conclusion

• Features of our proposed method
  – Our method can allocate a node to several clusters
  – Our method can represent edge inhomogeneity
  – Our method can automatically detect the number of clusters

• Evaluation on real SNS networks
  – Our method outperforms the conventional method in F-measure
  – The recall becomes better by using the clustering method based on modularity
  – The precision becomes better with content analysis
