Community Extracting using Intersection Graph and Content Analysis in Complex Network
Toshiya Kuramochi, Naoki Okada, Kyohei Tanikawa, Yoshinori Hijikata, Shogo Nishida
Graduate School of Engineering Science, Osaka University, Japan
The 2012 IEEE/WIC/ACM International Conference on Web Intelligence
Overview
1. Background and Problems of Community Detection
2. Our Proposed Method
3. Experimentation in Real SNS Networks
4. Results and Discussions
5. Conclusion
Background
Many researchers have studied complex networks and have found the "community structure".
Community structure
• connections within groups are dense
• connections among groups are sparse
Example: communities in the WWW are sets of web pages related to a certain topic (e.g., business, science, sport).
Community structure is a key characteristic of complex networks.
Problem (1) – overlap of communities
Some nodes belong to several communities in real networks
[Figure: communities in the WWW. The community of sports pages and the community of business pages overlap (e.g., economic effect of the Olympic Games).]
Most ordinary clustering methods allocate each node to exactly one cluster, so they CANNOT represent the overlap of communities.
A community detection method should be able to allocate a node to several clusters.
Problem (2) – edge inhomogeneity
Edges are not homogeneous in real networks
[Figure: edges in an SNS network carry different relations: same hobby, family, same university, work place.]
Many community detection methods assume that all edges are the same, so they CANNOT represent the edge inhomogeneity.
Weights of edges should be set individually.
Problem (3) – appropriate number of communities
The number of real communities is often unknown
[Figure: How many communities are in this network? 2? 3? 4?]
Most hierarchical clustering methods require manual input of the appropriate number of communities.
The number of communities should be determined automatically.
Purpose of this work
• A node may belong to several communities: using the idea of the intersection graph [Everett & Borgatti, 1998]
• Weights of edges are set individually: content information analysis
• The number of communities is determined automatically: clustering based on modularity [Newman, 2003]
We solve these three problems by proposing a new community detection method.
Overview
1. Background and Problems of Community Detection
2. Our Proposed Method
3. Experimentation in Real SNS Networks
4. Results and Discussions
5. Conclusion
Summary of our proposed method
Input: graph and content information
• Step 1: Enumeration of dense subgraphs
• Step 2: Conversion to the intersection graph
• Step 3: Calculation of the weights of edges
• Step 4: Clustering based on modularity
Output: clusters (communities)
Step 1: Enumeration of dense subgraphs
dense subgraph: clique, n-clique, n-clan, etc.
clique (example of dense subgraph): complete graph
clique threshold: threshold on the size of a clique
[Figure: example of clique enumeration on an input graph with nodes A to K; cliques of size 5, 4, and 3 are enumerated against the threshold.]
{ B, C, D, J, K } (size = 5)
{ D, E, I, J } (size = 4)
{ E, G, H } (size = 3)
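The clique enumeration above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: it uses the standard Bron-Kerbosch algorithm with pivoting. The input graph is rebuilt only from the three cliques listed on the slide, which is an assumption; nodes A and F are left out because their edges cannot be recovered from the transcript. The names `build_graph`, `maximal_cliques`, and `enumerate_cliques` are hypothetical.

```python
from itertools import combinations


def build_graph(groups):
    """Return an adjacency dict in which every listed group is fully connected."""
    adj = {}
    for group in groups:
        for u in group:
            adj.setdefault(u, set())
        for u, v in combinations(group, 2):
            adj[u].add(v)
            adj[v].add(u)
    return adj


def maximal_cliques(adj):
    """Yield every maximal clique (Bron-Kerbosch with pivoting)."""
    def expand(r, p, x):
        if not p and not x:
            yield frozenset(r)
            return
        # Pivot on the node with the most neighbours among the candidates.
        pivot = max(p | x, key=lambda n: len(adj[n] & p))
        for v in list(p - adj[pivot]):
            yield from expand(r | {v}, p & adj[v], x & adj[v])
            p = p - {v}
            x = x | {v}

    yield from expand(set(), set(adj), set())


def enumerate_cliques(adj, clique_threshold=3):
    """Keep only maximal cliques whose size reaches the clique threshold."""
    kept = [c for c in maximal_cliques(adj) if len(c) >= clique_threshold]
    return sorted(kept, key=len, reverse=True)


# Assumed toy graph: only the edges implied by the three cliques on the slide.
adj = build_graph([
    {"B", "C", "D", "J", "K"},
    {"D", "E", "I", "J"},
    {"E", "G", "H"},
])
cliques = enumerate_cliques(adj, clique_threshold=3)
for c in cliques:
    print(sorted(c), "size =", len(c))
```

Raising `clique_threshold` to 4 would drop { E, G, H }, which is how the threshold controls the granularity of the dense subgraphs passed to Step 2.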
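Step 2: Conversion to the intersection graph
intersection graph
• node: a dense subgraph in the original graph
• edge: the two dense subgraphs have common members
[Figure: the cliques in the input graph (nodes A to K) and the resulting intersection graph on X, Y, Z.]
X = { B, C, D, J, K }, Y = { D, E, I, J }, Z = { E, G, H }
X ∩ Y = { D, J } (common members)
Y ∩ Z = { E }
X ∩ Z = ∅
overlap threshold: threshold on the number of common members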
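As a hedged sketch of the conversion, the snippet below links two dense subgraphs whenever they share at least `overlap_threshold` members. The function name `intersection_graph` and the dictionary layout are assumptions made for illustration; only the sets X, Y, Z and the idea of an overlap threshold come from the slide.

```python
from itertools import combinations


def intersection_graph(subgraphs, overlap_threshold=1):
    """Map each linked pair of subgraph names to its set of common members.

    subgraphs: dict of name -> set of member nodes.
    A pair is linked only if it shares at least `overlap_threshold` members.
    """
    edges = {}
    for a, b in combinations(sorted(subgraphs), 2):
        common = subgraphs[a] & subgraphs[b]
        if len(common) >= overlap_threshold:
            edges[(a, b)] = common
    return edges


subgraphs = {
    "X": {"B", "C", "D", "J", "K"},
    "Y": {"D", "E", "I", "J"},
    "Z": {"E", "G", "H"},
}
edges = intersection_graph(subgraphs, overlap_threshold=1)
for pair, common in edges.items():
    print(pair, sorted(common))
```

Because the original nodes D, J, and E each sit in two subgraphs, they can end up in more than one community after clustering, which is how the intersection graph addresses the overlap problem.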
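Step 3: Calculation of weights of edges
[Figure: cliques X = { B, C, D, J, K } and Y = { D, E, I, J } overlapping in { D, J }.]
• degree of overlap: $d(X, Y) = \frac{|X \cap Y|}{|X \cup Y|}$ (Jaccard coefficient)
• similarity of content information
1. represent each set ($X$ and $Y$) as one vector ($\mathbf{x}$ and $\mathbf{y}$) (tf-idf score)
2. $sim(X, Y) = \frac{\mathbf{x} \cdot \mathbf{y}}{\|\mathbf{x}\| \, \|\mathbf{y}\|}$ (cosine similarity)
• weight of the edge: $w(X, Y) = \frac{d(X, Y)}{1 + \epsilon - sim(X, Y)}$ $(0 < \epsilon < 1)$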
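The weight computation can be illustrated with a short sketch, assuming the weight is the Jaccard overlap divided by 1 + ε − sim. The tf-idf vectors are taken as already computed and are represented as term-to-score dictionaries; the terms and scores below are made up for illustration, and the function names are hypothetical.

```python
import math


def jaccard(a, b):
    """Degree of overlap d(X, Y) = |X ∩ Y| / |X ∪ Y|."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def cosine(u, v):
    """Cosine similarity between two sparse vectors (dict of term -> score)."""
    dot = sum(score * v.get(term, 0.0) for term, score in u.items())
    norm_u = math.sqrt(sum(s * s for s in u.values()))
    norm_v = math.sqrt(sum(s * s for s in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def edge_weight(members_a, members_b, vec_a, vec_b, eps=0.5):
    """w = d / (1 + eps - sim); eps in (0, 1) keeps the denominator positive."""
    if not 0 < eps < 1:
        raise ValueError("eps must satisfy 0 < eps < 1")
    return jaccard(members_a, members_b) / (1 + eps - cosine(vec_a, vec_b))


X = {"B", "C", "D", "J", "K"}
Y = {"D", "E", "I", "J"}
# Hypothetical tf-idf vectors for the text attached to each clique.
vec_x = {"tennis": 0.8, "osaka": 0.3}
vec_y = {"tennis": 0.6, "lab": 0.5}
print(jaccard(X, Y), cosine(vec_x, vec_y), edge_weight(X, Y, vec_x, vec_y))
```

With this form, two cliques with similar content get a smaller denominator and therefore a heavier edge, while cliques with unrelated content keep a weight below their plain Jaccard overlap.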
Step 4: Clustering based on modularity
Modularity Q
• Modularity Q is an indicator for evaluating a division of a network
• A clustering method based on modularity optimizes Q directly
• The division with the highest Q is the best division
Result: automatic detection of the best number of clusters
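To make the role of modularity concrete, here is a minimal sketch that computes weighted modularity and runs a naive greedy agglomeration, keeping the division with the highest Q. The greedy merge is a simple stand-in chosen for illustration and is not claimed to be the exact algorithm used in the work; the function names and the toy graph (two triangles joined by one edge) are assumptions.

```python
from itertools import combinations


def modularity(weights, communities):
    """Weighted modularity Q = sum_c [ L_c / m - (d_c / (2m))^2 ].

    weights: dict of (u, v) -> edge weight, one entry per undirected edge.
    L_c is the edge weight inside community c, d_c its total node strength.
    """
    m = sum(weights.values())
    if m == 0:
        return 0.0
    strength = {}
    for (u, v), w in weights.items():
        strength[u] = strength.get(u, 0.0) + w
        strength[v] = strength.get(v, 0.0) + w
    q = 0.0
    for com in communities:
        inside = sum(w for (u, v), w in weights.items() if u in com and v in com)
        total = sum(strength.get(n, 0.0) for n in com)
        q += inside / m - (total / (2 * m)) ** 2
    return q


def greedy_modularity(nodes, weights):
    """Merge communities greedily and return (best Q, best division)."""
    communities = [{n} for n in nodes]
    best = (modularity(weights, communities), [set(c) for c in communities])
    while len(communities) > 1:
        candidates = []
        for i, j in combinations(range(len(communities)), 2):
            merged = [c for k, c in enumerate(communities) if k not in (i, j)]
            merged.append(communities[i] | communities[j])
            candidates.append((modularity(weights, merged), merged))
        q, communities = max(candidates, key=lambda t: t[0])
        if q > best[0]:
            best = (q, [set(c) for c in communities])
    return best


# Assumed toy graph: two triangles joined by a single edge, unit weights.
nodes = [1, 2, 3, 4, 5, 6]
weights = {(1, 2): 1.0, (1, 3): 1.0, (2, 3): 1.0,
           (4, 5): 1.0, (4, 6): 1.0, (5, 6): 1.0, (3, 4): 1.0}
best_q, best_division = greedy_modularity(nodes, weights)
print(best_q, best_division)
```

The number of clusters is never passed in: it falls out of whichever division scores the highest Q. In the proposed method the nodes would be the dense subgraphs of the intersection graph and the weights those computed in Step 3.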
Summary of our proposed method
• Step 1: Enumeration of cliques
• Step 2: Conversion to the intersection graph
• Step 3: Calculation of weights of edges
• Step 4: Clustering based on modularity
[Figure: the input graph (nodes A to K) is converted to the intersection graph on X, Y, Z with edge weights w(X, Y) and w(Y, Z), and w(X, Z) = 0; clustering then divides X, Y, Z into cluster 1 and cluster 2.]
Overview
1. Background and Problems of Community Detection
2. Our Proposed Method
3. Experimentation in Real SNS Networks
4. Results and Discussions
5. Conclusion
Dataset
mixi: one of the most popular SNSs in Japan
• test subjects: 20 mixi users
• link structure: the network within radius two (two hops) of each test subject
• content information: self-introduction, friend introductions, attributes (gender, address, birthday, etc.)
ground truth: relation names between a test subject and each person in the dataset, as enumerated by the test subject (e.g., 'same university', 'hobby friend', 'coworker')
evaluation: we evaluate the communities that include the test subject
• numerical evaluation
– The recall improves when the clustering method based on modularity is used
– Our methods outperform the ordinary method
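For readers unfamiliar with the numerical measures, the sketch below shows how precision, recall, and F-measure could be computed for one extracted community against one ground-truth group. The transcript does not say how extracted communities were matched to relation names, so the one-to-one comparison, the function name, and the member names here are all assumptions.

```python
def precision_recall_f(extracted, truth):
    """Compare one extracted community with one ground-truth group (both sets)."""
    hits = len(extracted & truth)
    precision = hits / len(extracted) if extracted else 0.0
    recall = hits / len(truth) if truth else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f_measure = 2 * precision * recall / (precision + recall)
    return precision, recall, f_measure


# Hypothetical members: an extracted community vs. a 'same university' group.
extracted = {"u1", "u2", "u3", "u4"}
truth = {"u2", "u3", "u4", "u5", "u6"}
p, r, f = precision_recall_f(extracted, truth)
print(p, r, f)
```

Precision drops when a community absorbs people outside the relation, and recall drops when a relation is split across communities, so the two measures respond to different steps of the method.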
Visual evaluation (Everett's method vs. WithCA)
[Figure: extracted communities, with one panel for Everett's method and one for WithCA.]
Visual evaluation (NonCA vs. WithCA)
[Figure: extracted communities, with one panel for NonCA and one for WithCA.]
Overview
1. Background and Problems of Community Detection
2. Our Proposed Method
3. Experimentation in Real SNS Networks
4. Results and Discussions
5. Conclusion
Conclusion
• Features of our proposed method
– Our method can allocate a node to several clusters
– Our method can represent edge inhomogeneity
– Our method can automatically detect the number of clusters
• Evaluation on real SNS networks
– Our method outperforms the conventional method in F-measure
– The recall improves when the clustering method based on modularity is used
– The precision improves with content analysis