Matnet Tutorial Part III: Analysis of Attributed Networks
Rushed Kanawati, Martin Atzmueller, Christine Largeron
A3, Université Sorbonne Paris Cité, France
CSLab, Cognitive Science and Artificial Intelligence, Tilburg University, Netherlands
Laboratoire Hubert Curien, Université Jean Monnet, Université de Lyon, France
WWW 2018, Lyon, 2018-04-24
Agenda
• Attributed Networks/Graphs
• Subgroups & Communities
• Community Detection
Attributed Network
Context
Attributed network
Definition 1 [Zhou2009]: Network represented by a graph $G = (V, E)$ where each node $v \in V$ is associated with a vector of attributes $v_j$, $j \in \{1, \dots, p\}$.
CHRISTINE LARGERON (LaHC) Ecole EGC 2018 8 / 61
Attributed Network
Context
Attributed network
Definition 2 [Yin2010], [Gong2011]: Network represented by
• a graph $G = (V, E)$ describing the relationships between the entities
• a bipartite graph $G_a = (V \cup V_a, E_a)$ describing the relationships between
• Integration of heterogeneous data (networks + vectors)
• Enables simultaneous analysis of relational + attribute data
Subgroups & Cohesive subgroups
• Subgroup
  • Subset of actors (and all their ties)
• Define subgroups using specific criteria (homogeneity among members)
  • Compositional – actor attributes
  • Structural – using tie structures
• Detection of cohesive subgroups & communities → structural aspects
• Subgroup discovery → actor attributes
• … attributed graph → can combine both
[Wasserman & Faust 1994]
Compositional Subgroups
• Detect subgroups according to specific compositional criteria
  • Focus on actor attributes
  • Describe actor subset using attributes
• Often hypothesis-driven approaches: Test specific attribute combinations
• In contrast: Subgroup discovery
  • Hypothesis-generating approach
  • Exploratory data mining method
  • Local exceptionality detection
[Atzmueller 2015]
Community Detection/Attributed Networks
Community detection in attributed networks
Community detection in attributed network: definition
Given an attributed graph $G = (V, E)$, the task consists in building a partition $P = \{C_1, \dots, C_r\}$ of $V$ into $r$ communities such that:
• vertices in the same community are densely connected and similar in terms of attributes
• vertices from distinct communities are loosely connected and different in terms of attributes
[Combe et al. IDA 2015]
Combining Structure and Attributes
• Data sources
• Structural variables (ties, links)
• Compositional variables
• Actor attributes
• Represented as attribute vectors
• Edge attributes
• Each edge has an assigned label
• Multiplex graphs → Multiple edges (labels) between nodes
Communities/Edge-Attributed Graphs
• Clustering edge-attributed graphs
• Reduce/flatten to weighted graph [Bothorel et al. 2015]
  • Derive weights according to number of edges where nodes are directly connected [Berlingerio et al. 2011]
  • Standard graph clustering approaches can then be directly applied
• Frequent-itemset based [Berlingerio et al. 2013]
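The flattening step can be illustrated with a minimal sketch. It assumes the multiplex graph is given as a dictionary from edge label to edge list, and it takes the weight of a node pair to be the number of labels (layers) under which the two nodes are directly connected. The function name `flatten_multiplex` and the layer names are hypothetical, chosen only for this illustration.

```python
from collections import Counter


def flatten_multiplex(layers):
    """Collapse a multiplex graph into one weighted, undirected graph.

    layers: dict mapping an edge label to an iterable of edges (u, v).
    Returns a Counter mapping frozenset({u, v}) to the number of layers
    in which u and v are directly connected.
    """
    weights = Counter()
    for edges in layers.values():
        # De-duplicate inside a layer so each layer counts at most once per pair;
        # self-loops are ignored.
        pairs = {frozenset(e) for e in edges if e[0] != e[1]}
        weights.update(pairs)
    return weights


# Hypothetical toy data: three edge labels over four nodes.
layers = {
    "coauthor": [("a", "b"), ("b", "c")],
    "citation": [("b", "a"), ("c", "d")],
    "same_venue": [("a", "b")],
}
flat = flatten_multiplex(layers)
print(flat[frozenset(("a", "b"))])  # a and b are linked in all three layers
```

Any standard algorithm for weighted graphs could then be run on `flat`; which one is appropriate is left open here.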
• Weight modification (edges) according to nodal attributes [Ge et al. 2008, Dang & Viennet 2012, Ruan et al. 2013, Zhou et al. 2009, Steinhaeuser & Chawla 2008]
• Abstraction into similarities between nodes → Edge weights → Apply standard community detection algorithm
  • Specifically, distance-based community detection methods
• Entropy-oriented methods [Zhu et al. 2011, Smith et al. 2014, Cruz et al. 2011]
• Model-based approaches [Xu et al. 2012, Yang et al. 2013, Akoglu et al. 2012]
Weight modification
• Use attribute-based distance measure
• Community detection: Group nodes according to a threshold in (0, 1), i.e., place any pair of nodes whose edge weight exceeds the threshold into the same community
• Evaluate final partitioning using Modularity
[Steinhaeuser & Chawla 2008]
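A minimal sketch of the thresholding idea, not the authors' implementation. It assumes a simple matching similarity between attribute vectors as the edge weight (the slide does not specify the measure) and groups nodes with a union-find structure, so that any pair joined by an edge above the threshold ends up in the same community. The names `attribute_similarity` and `threshold_communities` are hypothetical.

```python
def attribute_similarity(x, y):
    """Fraction of matching attribute values (an assumed, simple measure)."""
    return sum(a == b for a, b in zip(x, y)) / len(x)


def threshold_communities(edges, attrs, t):
    """Group nodes joined by an edge whose attribute-based weight exceeds t."""
    parent = {v: v for v in attrs}

    def find(v):
        while parent[v] != v:
            parent[v] = parent[parent[v]]  # path halving
            v = parent[v]
        return v

    for u, v in edges:
        if attribute_similarity(attrs[u], attrs[v]) > t:
            parent[find(u)] = find(v)

    groups = {}
    for v in attrs:
        groups.setdefault(find(v), set()).add(v)
    return sorted(groups.values(), key=sorted)


# Hypothetical toy data: a path a-b-c-d-e with three categorical attributes.
attrs = {
    "a": (1, 1, 0), "b": (1, 1, 0), "c": (1, 0, 0),
    "d": (0, 0, 1), "e": (0, 0, 1),
}
edges = [("a", "b"), ("b", "c"), ("c", "d"), ("d", "e")]
communities = threshold_communities(edges, attrs, t=0.5)
print(communities)  # two groups: a, b, c and d, e
```

The edge c-d has weight 1/3 and is the only one below the threshold, which is why the path is cut there. The resulting partition would then be scored with modularity, as the slide states.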
Entropy Minimization
• For a partition, optimize entropy using Monte-Carlo
• Data compression of connectivity & attribute matrices (PICS algorithm)
  • Lossless compression → MDL cost-function
• Resulting node groups
  • Homogeneous both in node & attribute matrix
  • Nodes - similar connectivity & high attribute coherence
[Akoglu et al. 2013]
Descriptive Community Patterns
• Community mining scenario
  • Discover "densely connected groups of nodes"
  • Communities should have explicit description
  • Community (evaluation) space: network/graph
• Goal:
  • Often: Discover top-k communities
  • Maximize some community quality function
Finding Explicit Descriptions
• Cluster transformed node-attribute similarity graph & extract pure clusters for communities
• Redescription generation: Induce description for each community, and reshape if necessary
• Heuristic approach, due to large search space
[Pool et al. 2014]
• Starts with candidate communities
  • Domain knowledge
  • Partial communities
  • Start with single vertices (later being extended using hill-climbing approach)
• ReMine algorithm for deriving patterns for communities [Zimmermann et al. 2010]
[Pool et al. 2013]
Description-Oriented Community Detection
• Basic Idea: Pattern Mining for Community Characterization
  • Mine patterns in description space (tags/topics) → Subgroups of users described by tags/topics
  • Optimize quality measure in community space → Network/graph of users
  • Improve understandability of communities (explanation)
[Atzmueller et al., Information Sciences, 2016]
Direct Descriptive Community Mining
• Goal: Identification/description of communities with a high quality (exceptional model mining)
  • Input: Network/Graph + node properties (e.g., tags)
  • Output: k-best community patterns
  • Description language: conjunctive expressions
• COMODO algorithm: Top-k pattern mining, based on SD-Map* algorithm for subgroup discovery
  • Discover k-best patterns
  • Search space: Conjunctions/tags
  • Apply standard community quality functions, e.g., Modularity [Newman 2004]
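To make the input/output contract concrete, here is a brute-force sketch, assuming a small list of candidate patterns is already given. It is not COMODO: it does no SD-Map*-style search or pruning, it only scores each conjunctive tag pattern by the modularity contribution of the node group it covers and returns the k best. All function names and the toy data are hypothetical.

```python
def pattern_cover(tags_by_node, pattern):
    """Nodes whose tag set contains every tag of the conjunctive pattern."""
    return {v for v, tags in tags_by_node.items() if pattern <= tags}


def local_modularity(edges, members):
    """Modularity contribution of one group: m_C / m - (d_C / (2m))^2."""
    m = len(edges)
    inside = sum(1 for u, v in edges if u in members and v in members)
    degree_sum = sum((u in members) + (v in members) for u, v in edges)
    return inside / m - (degree_sum / (2 * m)) ** 2


def top_k_patterns(edges, tags_by_node, candidates, k):
    """Score every candidate pattern and keep the k best (brute force)."""
    scored = [
        (local_modularity(edges, pattern_cover(tags_by_node, p)), sorted(p))
        for p in candidates
    ]
    return sorted(scored, reverse=True)[:k]


# Hypothetical toy network: two triangles joined by the edge (3, 4).
edges = [(1, 2), (2, 3), (1, 3), (3, 4), (4, 5), (5, 6), (4, 6)]
tags_by_node = {
    1: {"ml", "graphs"}, 2: {"ml", "graphs"}, 3: {"ml"},
    4: {"web"}, 5: {"web", "graphs"}, 6: {"web"},
}
candidates = [
    frozenset({"ml"}), frozenset({"web"}),
    frozenset({"graphs"}), frozenset({"ml", "graphs"}),
]
best = top_k_patterns(edges, tags_by_node, candidates, k=2)
for score, pattern in best:
    print(pattern, round(score, 3))
```

On this toy data the patterns "ml" and "web" each describe one triangle and score highest, while "graphs" covers nodes scattered across both triangles and scores below zero.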
Community Detection on Attributed Graphs
• Goal: Mine patterns describing such groups
• Merge networks + descriptive features, e.g., characteristics of users
Part 3: Community detection methods for attributed graphs
Modularity
Measure based on a null model [Newman and Girvan, 2004]
Comparison of the degree distribution in each group with the expected distribution in the configuration model
Given a network with $n$ nodes and $m$ edges, the expected number of edges between two nodes $i$ and $i'$ with degrees $k_i$ and $k_{i'}$ is $\frac{k_i k_{i'}}{2m}$
The expected number of edges between nodes 1 and 2 is $\frac{3 \times 2}{2 \times 14} \approx 0.21$
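The null-model term and the resulting score can be checked with a short sketch. It assumes an undirected, unweighted graph given as an edge list and uses the standard per-community form of Newman-Girvan modularity, $Q = \sum_C \left[ \frac{m_C}{m} - \left( \frac{d_C}{2m} \right)^2 \right]$, where $m_C$ is the number of edges inside community $C$ and $d_C$ the sum of the degrees of its nodes. The function names and the toy graph are illustrative only.

```python
from collections import defaultdict


def expected_edges(k_i, k_j, m):
    """Expected number of edges between two nodes in the configuration model."""
    return k_i * k_j / (2 * m)


def modularity(edges, community_of):
    """Newman-Girvan modularity of a partition (undirected, unweighted graph)."""
    m = len(edges)
    degree = defaultdict(int)
    internal = defaultdict(int)  # number of edges inside each community
    for u, v in edges:
        degree[u] += 1
        degree[v] += 1
        if community_of[u] == community_of[v]:
            internal[community_of[u]] += 1
    degree_sum = defaultdict(int)  # total degree of each community
    for v, k in degree.items():
        degree_sum[community_of[v]] += k
    return sum(
        internal[c] / m - (degree_sum[c] / (2 * m)) ** 2 for c in degree_sum
    )


# The slide's arithmetic: degrees 3 and 2 in a network with m = 14 edges.
print(round(expected_edges(3, 2, 14), 2))

# Hypothetical toy graph: two triangles joined by one edge.
edges = [(1, 2), (2, 3), (1, 3), (3, 4), (4, 5), (5, 6), (4, 6)]
community_of = {1: "A", 2: "A", 3: "A", 4: "B", 5: "B", 6: "B"}
q = modularity(edges, community_of)
print(round(q, 3))
```

For the two-triangle graph each community has 3 internal edges and a total degree of 7, so $Q = 2 \left( \frac{3}{7} - \frac{1}{4} \right) = \frac{5}{14}$.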
Inertia based modularity: properties
Qinertia(P):
• Takes its values between -1 and 1, as modularity does
• Insensitive to linear transformations applied to all the vectors
• Insensitive to the number of clusters in the partition
Evaluation of I-Louvain method on a real network
Results obtained on a real dataset built using the databases
• DBLP (06/18/2014)
• Microsoft Academic Search (02/03/2014)
DBLP makes it possible to generate a co-publication graph $G = (V, E)$ with
• $|V| = 2515$ authors
• $|E| = 5313$ links: there is a link between two authors if they have co-published a paper in computer science in DBLP
• 23 attributes: number of publications in 23 research fields
• Ground truth: major area of publication in Microsoft Academic Search
Table: Evaluation according to the normalized mutual information (NMI)
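For readers unfamiliar with the score, the following sketch computes NMI between a detected partition and a ground truth, both given as label lists over the same nodes. It assumes the normalization $\mathrm{NMI} = \frac{2\, I(A; B)}{H(A) + H(B)}$; other normalizations exist, and the slides do not say which one was used. The label lists are toy data, not the DBLP results.

```python
from collections import Counter
from math import log


def nmi(labels_a, labels_b):
    """Normalized mutual information, 2 * I(A;B) / (H(A) + H(B))."""
    n = len(labels_a)
    count_a = Counter(labels_a)
    count_b = Counter(labels_b)
    joint = Counter(zip(labels_a, labels_b))
    mutual = sum(
        c / n * log(c * n / (count_a[a] * count_b[b]))
        for (a, b), c in joint.items()
    )
    h_a = -sum(c / n * log(c / n) for c in count_a.values())
    h_b = -sum(c / n * log(c / n) for c in count_b.values())
    if h_a == 0 and h_b == 0:
        return 1.0  # both partitions consist of a single cluster
    return 2 * mutual / (h_a + h_b)


truth = [0, 0, 1, 1]
same_up_to_renaming = [1, 1, 0, 0]
unrelated = [0, 1, 0, 1]
print(nmi(truth, same_up_to_renaming))  # identical partitions, labels renamed
print(nmi(truth, unrelated))            # statistically independent partitions
```

The score depends only on how nodes are grouped, not on the label names, which is why a renamed copy of the ground truth still reaches the maximum.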
I-Louvain algorithm
Theoretical and experimental results
• Interest of using both attributes and relationships with missing data.
• The calculation of the variation of Qinertia induced by moving an element from one class to another relies only on local information [Combe 2013].
Consensus clustering [Goder and Filkov, 2008, Topchy et al. 2005]
• Combinatorial optimization problem
• Greedy approach [Strehl and Ghosh, 2002]
1. Apply the clustering algorithm A on G $n_P$ times, yielding $n_P$ partitions
2. Compute the consensus matrix $D = [D_{ij}]$, where $D_{ij}$ is the number of partitions in which vertices $i$ and $j$ of G are assigned to the same community, divided by $n_P$
3. All entries of D below a chosen threshold are set to zero
4. Apply A on D $n_P$ times, yielding $n_P$ partitions
5. If the partitions are all equal, stop. Otherwise go back to 2.
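Steps 2 and 3 above can be sketched as follows, assuming each partition is a dictionary from node to community label and that the matrix is stored sparsely, so an entry set to zero is simply absent. The base algorithm A and the outer loop of steps 4 and 5 are left out. The function name `consensus_matrix` is hypothetical.

```python
from itertools import combinations


def consensus_matrix(partitions, threshold):
    """Sparse consensus matrix of a list of partitions.

    partitions: list of dicts mapping node -> community label.
    Returns {(i, j): D_ij} for i < j, where D_ij is the fraction of
    partitions placing i and j in the same community; entries below
    the threshold are dropped (i.e. treated as zero).
    """
    n_p = len(partitions)
    nodes = sorted(partitions[0])
    d = {}
    for i, j in combinations(nodes, 2):
        together = sum(1 for p in partitions if p[i] == p[j])
        value = together / n_p
        if value >= threshold:
            d[(i, j)] = value
    return d


# Hypothetical toy input: three runs of some algorithm on four nodes.
partitions = [
    {"a": 0, "b": 0, "c": 1, "d": 1},
    {"a": 0, "b": 0, "c": 0, "d": 1},
    {"a": 0, "b": 1, "c": 1, "d": 1},
]
d = consensus_matrix(partitions, threshold=0.5)
print(sorted(d))
```

The dictionary `d` can be read as a weighted graph on which the base algorithm would be run again in step 4.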
Measure the intrinsic quality of the proposed partition.
• Based on a function Q which associates to a partition P a real value Q(P)
• Comparison of partitions: High score means "good" partition
• Allows comparing different partitions and identifying the best one
• Additivity property: Q(P) =
State of the art papers on partition quality
• Jure Leskovec, Kevin J. Lang, Michael W. Mahoney, Empirical Comparison of Algorithms for Network Community Detection, WWW 2010, p. 631-640
• H. Almeida, D. Guedes, W. Meira, M. Zaki, Is There a Best Quality Metric for Graph Clusters?, ECML PKDD'11, p. 44-59
• J. Yang, J. Leskovec, Defining and Evaluating Network Communities based on Ground-truth, MDS '12: Proceedings of the ACM SIGKDD Workshop on Mining Data Semantics, Article No. 3
Difficulties
• Interpretation of obtained values for internal criteria
• Access to a ground truth for external criteria
• Solution: benchmarks and generators
Benchmarks
• UCI Irvine: https://networkdata.ics.uci.edu/resources.php
• UCINET data collection
• Stanford Large Network Dataset Collection (SNAP)
• KONECT (the Koblenz Network Collection) (261 networks)
• The Colorado Index of Complex Networks (ICON) (608 networks)
but ... metadata does not necessarily fit the community structure [Hric, 2014; Peel, 2017]
Generators
Parametric software / model to generate networks
Types of generators
• Graphs without attribute and community [Erdos-Renyi1960, Watts-Strogatz1998, Barabasi-Albert1999]
• Attributed graphs [Gong et al. 2012, Le Tran2015], Rtg [Akoglu-Faloutsos, 2009]
• Graph with community structure: Kleinberg model [Kleinberg1999], GN benchmark [Girvan-Newman2002], Forest Fire [Leskovec et al. 2005], LFR [Lancichinetti-Fortunato-2009], Block Two-level Erdos-Renyi [Kolda et al, 2013]
• Grow-shrink benchmark: dynamic graph with community structure derived from stochastic block model [Granell et al. 2015]
• Attributed graphs with community structure [Dang2012, Largeron2015, Largeron2016]
DANCer: Dynamic Attributed Network with Community Structure Generator
• freely available under the terms of the GNU Public Licence at: http://perso.univ-st-etienne.fr/largeron/DANCer_Generator/
• easy to use with the interface
Largeron et al., Plos one 2015, ECML-PKDD 2016, KAIS 2017
Generation of a dynamic network, i.e. a sequence of graphs
• Given a set of parameters, generation of a first graph
• Apply micro and/or macro operations to obtain the next graphs
• Micro operations:
  • Add / remove vertices
  • Add / remove between-community and/or within-community edges
  • Update attribute values
• Macro operations:
  • Merge two communities into a single one
  • Split one community into two
  • Migrate vertices from a community to either a new or an existing community
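The three macro operations can be pictured as transformations of a community assignment between two timestamps. The sketch below is not DANCer's code: it only tracks community membership in a dictionary and leaves edges and attribute values untouched. The function names and labels are hypothetical.

```python
def merge(community_of, c1, c2):
    """Merge community c2 into community c1."""
    return {v: (c1 if c == c2 else c) for v, c in community_of.items()}


def split(community_of, c, new_label, moved):
    """Split community c: its vertices listed in `moved` get new_label."""
    return {
        v: (new_label if lab == c and v in moved else lab)
        for v, lab in community_of.items()
    }


def migrate(community_of, vertices, target):
    """Move the given vertices to a new or an existing community `target`."""
    return {v: (target if v in vertices else lab) for v, lab in community_of.items()}


# Hypothetical sequence of assignments, one per timestamp.
g0 = {1: "A", 2: "A", 3: "B", 4: "B", 5: "C"}
g1 = merge(g0, "A", "B")             # A and B become a single community
g2 = split(g1, "A", "D", {3, 4})     # vertices 3 and 4 leave A and form D
g3 = migrate(g2, {5}, "D")           # vertex 5 joins the existing community D
print(g3)
```

Each function returns a new dictionary, so the whole sequence g0, g1, g2, g3 stays available, mirroring the idea of a dynamic network as a sequence of graphs.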
Parameters file
Network file with the description of the graphs obtained for the timestamps
Measures used to evaluate the properties:
• Graph characteristics: Nb. edges between and within, Nb. edges, Nb. vertices, Nb. of connected components, Number of communities and number of elements per community, Degree distribution
• Attribute measures: Within inertia rate, Observed vs expected homophily
• Structural measures: Modularity, Average clustering coefficient vs Random clustering coefficient, Average degree, Average shortest path length, Diameter
Great number of parameters, but
• They make it possible to build a network with a well-defined structure for the links and the attributes (good communities / classes)
• Link-based structure and/or attribute-based structure can then be weakened by changing the parameters, to evaluate the performance of algorithms on noisy networks
• Seed parameter: makes it possible to reproduce exactly the same network