StructMatrix: large-scale visualization of graphs by means of structure detection and dense matrices
Hugo Gualdron, Robson L. F. Cordeiro, Jose F Rodrigues-Jr
University of Sao Paulo
In collaboration with Carnegie Mellon University (Prof. Christos Faloutsos and PhD Danai Koutra)
Funded by the research agency FAPESP (2013/03906-0, 2014/07879-0, 2015/18335)
In: The Fifth IEEE ICDM Workshop on Data Mining in Networks, Atlantic City, NJ, USA, November 2015
http://www.icmc.usp.br/pessoas/junio
Jose F Rodrigues-Jr (University of Sao Paulo) 1 / 20
Introduction
Motivation
Big Data!!!
A lot of information, much of it in the form of relationships;
Large-scale graphs: graphs generated by applications in which users or entities are distributed over large geographical areas, even the entire planet;
Social networks, recommendation networks, road networks, e-commerce, computer networks, client-product logs, and many others.
Data analysis is a differentiator in industrial competition.
General Electric & Accenture.
Problem
Such graphs are too big:
node-link visualizations cannot handle even thousand-vertex graphs;
adjacency matrices are limited by the number of pixels on the screen;
in any case, the sheer number of nodes prevents reasoning about the graph;
non-visual analytical techniques may produce too many patterns for human cognition to absorb.
Still, we want to characterize the structure of graphs for:
understanding the overall structure, beyond distribution-based analyses;
spotting outliers and trends that are not dominant;
requesting details on demand about subregions of the graph topology.
Layouts: node-link and adjacency matrix
Layout             Scalability
Node-link          Hundred nodes
Adjacency matrix   Thousand nodes
Methodology overview
Assumptions:
graphs are made of recurrent simple structures (cliques, bipartite cores, stars, and chains);
such structures are more meaningful than single nodes;
even at lower resolutions, a visualization preserves the main properties of the graph.
Hypothesis: we reach more scalable and meaningful graph visualizations with:
graph summarization by detecting recurrent structures of the graph;
dense adjacency matrices.
Methodology
Proposed method: StructMatrix
Our method has two parts:
1 An algorithm to detect substructures;
2 A dense adjacency matrix of the structures that were detected.
1.Structure detection
We designed a graph partitioning algorithm based on the observation that real-world graphs tend to follow power-law degree distributions;
In such graphs, a few nodes have very high degree and the majority of nodes have low degree;
Kang and Faloutsos [1] demonstrated that removing the highest-degree nodes in order strips the hubs from the giant connected component (CC), creating much smaller satellite connected components;
This ordered removal lends itself to a structural scan of the graph.
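The effect of hub removal can be illustrated on a toy graph. The sketch below is only an illustration under stated assumptions: the helper names `connected_components` and `remove_top_hubs` and the example graph are hypothetical, and the graph is stored as a plain dictionary of neighbor sets rather than any structure used by StructMatrix.

```python
from collections import defaultdict

def connected_components(adj):
    """Return the connected components of an undirected graph as a list of sets."""
    seen, comps = set(), []
    for start in adj:
        if start in seen:
            continue
        comp, stack = set(), [start]
        seen.add(start)
        while stack:
            u = stack.pop()
            comp.add(u)
            for v in adj[u]:
                if v not in seen:
                    seen.add(v)
                    stack.append(v)
        comps.append(comp)
    return comps

def remove_top_hubs(adj, k):
    """Return a copy of adj in which the k highest-degree nodes lose all their edges."""
    hubs = set(sorted(adj, key=lambda u: len(adj[u]), reverse=True)[:k])
    return {u: (set() if u in hubs else adj[u] - hubs) for u in adj}

# Toy graph (assumption): hub 0 links four small stars centred at 1, 2, 3 and 4.
edges = [(0, 1), (0, 2), (0, 3), (0, 4),
         (1, 5), (1, 6), (2, 7), (2, 8),
         (3, 9), (3, 10), (4, 11), (4, 12)]
adj = defaultdict(set)
for u, v in edges:
    adj[u].add(v)
    adj[v].add(u)

before = connected_components(adj)                       # one giant component
after = connected_components(remove_top_hubs(adj, k=1))  # hub isolated, four satellites
```

Removing the single hub leaves it isolated and splits the rest of the graph into four three-node satellites, which is the shattering behavior the slide describes.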
1.Structure detection–Structure vocabulary
StructMatrix Vocabulary ψ
1.Structure detection–Algorithm
1 If the queue holds connected components, StructMatrix takes the first element for processing.
2 StructMatrix selects the vertices with the highest degree (up to 1% of the vertices) and removes their edges.
2 We get a set of smaller connected subcomponents.
3 We classify the subcomponents according to the vocabulary.
1.Structure detection–Structure classification
$\alpha = \frac{n^2}{4} \qquad \beta = \frac{n(n-1)}{2} \qquad \varepsilon = 0.2$
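A minimal sketch of how edge-count thresholds of this kind could drive classification. It assumes that β = n(n−1)/2 is the edge count of a full clique on n nodes and that ε = 0.2 is the tolerance for "near" structures; the slide does not spell out the exact rules, so everything below (the function name `classify`, the degree tests for stars and chains, the omission of false stars and bipartite cores, and therefore of α) is an illustrative assumption rather than the StructMatrix rule set.

```python
def classify(nodes, edges, eps=0.2):
    """Classify a connected subcomponent with simple edge-count and degree rules.

    nodes: iterable of node ids; edges: iterable of (u, v) pairs, no duplicates.
    Returns 'fc' (full clique), 'nc' (near clique), 'st' (star), 'ch' (chain),
    or None when no rule matches. Assumes the subcomponent is connected.
    """
    nodes = list(nodes)
    n = len(nodes)
    if n < 3:
        return None  # too small to carry a structure (assumption)
    degree = {u: 0 for u in nodes}
    m = 0
    for u, v in edges:
        degree[u] += 1
        degree[v] += 1
        m += 1
    beta = n * (n - 1) / 2          # edges of a full clique on n nodes
    top = max(degree.values())
    if m == beta:
        return "fc"
    if m == n - 1:                  # a connected graph with n - 1 edges is a tree
        if top == n - 1:
            return "st"             # one centre linked to every other node
        if top <= 2:
            return "ch"             # a simple path
    if m >= (1 - eps) * beta:
        return "nc"                 # within eps of a full clique
    return None


k5_minus_one = [(u, v) for u in range(5) for v in range(u + 1, 5) if (u, v) != (0, 1)]
label = classify(range(5), k5_minus_one)  # 9 of 10 possible edges
```

With five nodes, β is 10, so a component with 9 edges clears the (1 − ε)β = 8 threshold and is labeled a near clique.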
1.Structure detection–Algorithm
4 We store the classified subcomponents; those that were not identified go back to the queue for a new round of shattering.
5 We proceed to the next element in the queue.
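The steps above can be sketched as a queue-driven loop. This is a hedged illustration, not the authors' implementation: `split_components`, `detect_structures`, and `toy_classify` are hypothetical names, the classifier recognizes only stars, chains, and full cliques, at least one hub is removed per round, and isolated nodes left after shattering are discarded (all assumptions).

```python
from collections import deque
import math

def split_components(nodes, edges):
    """Split (nodes, edges) into connected components, each a (node_set, edge_list) pair."""
    adj = {u: set() for u in nodes}
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)
    seen, parts = set(), []
    for start in adj:
        if start in seen:
            continue
        comp, stack = {start}, [start]
        seen.add(start)
        while stack:
            u = stack.pop()
            for v in adj[u]:
                if v not in seen:
                    seen.add(v)
                    comp.add(v)
                    stack.append(v)
        parts.append((comp, [(u, v) for u, v in edges if u in comp]))
    return parts

def toy_classify(nodes, edges):
    """Hypothetical stand-in classifier: star, chain, full clique, else None."""
    n, m = len(nodes), len(edges)
    if n < 3:
        return None
    degree = {u: 0 for u in nodes}
    for u, v in edges:
        degree[u] += 1
        degree[v] += 1
    top = max(degree.values())
    if m == n * (n - 1) // 2:
        return "fc"
    if m == n - 1 and top == n - 1:
        return "st"
    if m == n - 1 and top <= 2:
        return "ch"
    return None

def detect_structures(nodes, edges, classify=toy_classify, top_fraction=0.01):
    """Shatter components by hub removal until every piece is classified or dissolved."""
    found = []
    queue = deque(split_components(nodes, edges))
    while queue:
        comp_nodes, comp_edges = queue.popleft()                # step 1: next component
        degree = {u: 0 for u in comp_nodes}
        for u, v in comp_edges:
            degree[u] += 1
            degree[v] += 1
        k = max(1, math.ceil(top_fraction * len(comp_nodes)))  # at least one hub (assumption)
        hubs = set(sorted(comp_nodes, key=degree.get, reverse=True)[:k])
        kept = [(u, v) for u, v in comp_edges                   # step 2: drop hub edges
                if u not in hubs and v not in hubs]
        for sub_nodes, sub_edges in split_components(comp_nodes, kept):
            if len(sub_nodes) < 2:
                continue                                        # isolated node: discarded (assumption)
            label = classify(sub_nodes, sub_edges)              # step 3: classify
            if label is not None:
                found.append((label, sub_nodes))                # step 4: store
            else:
                queue.append((sub_nodes, sub_edges))            # step 4: back to the queue
    return found

# Toy input (assumption): hub 0 attached to a star, a chain, and a 4-clique.
toy_edges = [(0, 1), (0, 5), (0, 6), (0, 7), (0, 8), (0, 9),
             (1, 2), (1, 3), (1, 4),
             (5, 6), (6, 7), (7, 8),
             (9, 10), (9, 11), (9, 12), (10, 11), (10, 12), (11, 12)]
structures = detect_structures(range(13), toy_edges)
```

Each requeued subcomponent loses at least one edge when it is processed again, so the loop terminates.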
1.Structure detection–Structure detection results
Graph            # Structures    fs    st    ch    nc    fc    nb    fb
DBLP                  160,885   76%    5%    2%    2%   15%   <1%     -
WWW-barabasi           15,652   32%   52%    5%    3%    2%    4%    2%
cit-HepPh              14,479   79%   13%    6%    1%   <1%   <1%   <1%
Wikipedia-vote          1,706   65%   33%    2%     -     -   <1%     -
Epinions                8,774   52%   31%   14%   <1%   <1%    2%   <1%
Roadnet PA             51,175   23%   45%   27%     -     -    5%     -
Roadnet CA             88,993   27%   39%   29%     -     -    4%     -
Roadnet TX             62,614   25%   43%   28%     -     -    4%     -
(fs: false star, st: star, ch: chain, nc: near clique, fc: full clique, nb: near bipartite core, fb: full bipartite core)
1.Structure detection–Runtime
We compare with the VoG algorithm (Koutra et al. [2]): better performance and a bigger vocabulary.
2.Visualization–Projection
After structure detection, we build a structure-to-structure adjacency matrix whose edge weights indicate the number of edges between the nodes of each pair of structures;
Although smaller than the original matrix, for million-scale graphs the struct matrix is still too large to fit on the screen;
For this reason, we create a dense matrix according to a straight proportion $(x, y) \to (\rho_x, \rho_y)$ with:
$$\rho_x = \left\lceil (\mathrm{Res}_x - 1)\,\frac{x - x_{min}}{x_{max} - x_{min}} + \frac{1}{2} \right\rceil \qquad \rho_y = \left\lceil (\mathrm{Res}_y - 1)\,\frac{y - y_{min}}{y_{max} - y_{min}} + \frac{1}{2} \right\rceil \qquad (1)$$
where $(x, y)$ are points of the original matrix and $\mathrm{Res}_x$, $\mathrm{Res}_y$ are the target resolutions; the higher the resolution, the more details are presented. These parameters allow for interactive exploration of details.
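A small sketch of the projection in Equation (1), assuming the 1/2 term sits inside the ceiling so that pixel coordinates run from 1 to the target resolution. The function names `project` and `dense_matrix` are hypothetical, and summing the weights of cells that land on the same pixel is an assumption; the slides do not say how collisions are aggregated.

```python
import math

def project(value, v_min, v_max, res):
    """Map a matrix coordinate to a 1-based pixel coordinate, following Equation (1)."""
    span = (v_max - v_min) or 1  # guard against a degenerate single-row or single-column matrix
    return math.ceil((res - 1) * (value - v_min) / span + 0.5)

def dense_matrix(cells, res_x, res_y):
    """Project sparse cells {(x, y): weight} onto a res_y-by-res_x grid of summed weights."""
    xs = [x for x, _ in cells]
    ys = [y for _, y in cells]
    x_min, x_max, y_min, y_max = min(xs), max(xs), min(ys), max(ys)
    grid = [[0] * res_x for _ in range(res_y)]
    for (x, y), weight in cells.items():
        px = project(x, x_min, x_max, res_x)
        py = project(y, y_min, y_max, res_y)
        grid[py - 1][px - 1] += weight  # shift 1-based pixels to 0-based list indices
    return grid

# Toy cells (assumption): a 10 x 10 structure matrix shown at 4 x 4 resolution.
cells = {(0, 0): 1, (1, 0): 2, (9, 9): 4}
grid = dense_matrix(cells, res_x=4, res_y=4)
```

Here the cells at (0, 0) and (1, 0) fall on the same pixel and their weights are summed, while (9, 9) maps to the opposite corner. Raising `res_x` and `res_y` separates such cells again, which is how the resolution parameters expose more detail.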
2.Visualization–Layout
We organize the matrix by structure type and by number of edges; the size of each structure (number of nodes) is encoded by color.
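The layout rule, together with the structure-to-structure weights of the projection step, can be sketched as follows. Several details are assumptions made for illustration: the names `structure_matrix`, `layout_order`, and `TYPE_ORDER`; the type order (taken from the column order of the results table); reading "number of edges" as edges to other structures, sorted in decreasing order; and ignoring edges that touch nodes outside every structure, such as removed hubs.

```python
from collections import Counter

# Assumed type order, following the columns of the structure-detection results table.
TYPE_ORDER = ["fs", "st", "ch", "nc", "fc", "nb", "fb"]

def structure_matrix(structures, edges):
    """Count edges between structures.

    structures: list of (label, node_set) with disjoint node sets;
    edges: (u, v) pairs of the original graph.
    Returns a Counter {(i, j): edge count} with i <= j; i == j holds internal edges.
    """
    owner = {u: i for i, (_, members) in enumerate(structures) for u in members}
    weights = Counter()
    for u, v in edges:
        if u in owner and v in owner:  # skip edges touching unassigned nodes
            i, j = sorted((owner[u], owner[v]))
            weights[(i, j)] += 1
    return weights

def layout_order(structures, weights):
    """Order structure indices by type, then by decreasing number of edges to other structures."""
    external = Counter()
    for (i, j), w in weights.items():
        if i != j:
            external[i] += w
            external[j] += w
    return sorted(range(len(structures)),
                  key=lambda i: (TYPE_ORDER.index(structures[i][0]), -external[i]))

# Toy input (assumption): one chain and two stars; node 0 is a removed hub.
structures = [("ch", {1, 2, 3}), ("st", {4, 5, 6, 7}), ("st", {8, 9, 10})]
edges = [(1, 2), (2, 3), (4, 5), (4, 6), (4, 7), (8, 9), (8, 10),
         (3, 4), (7, 8), (6, 9), (0, 1)]
weights = structure_matrix(structures, edges)
order = layout_order(structures, weights)
```

In this toy case the two stars come first, the better-connected one leading, followed by the chain. The resulting order would give the row and column positions of the struct matrix before projection.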
Experiments
Experiments–Real datasets–WWW-barabasi
WWW-barabasi: webpages and links between them.
Stars (st and fs) correspond to webpages with many outgoing links.
Most webpages have fewer than one thousand connections; however, a few have an unusually high number, in the thousands.
Experiments–Real datasets–Road nets
[Figure panels: Pennsylvania, California, Texas]
The three road graphs have a similar structure, as all are U.S. road networks;
There is a hierarchical connectivity: bigger to smaller cities;
Surprising grid-like structure (due to symmetry): intersections correspond to hub cities, and lines correspond to inter-city paths.
Experiments–Real datasets–Road nets
Comparison: Structure-to-structure vs Node-to-node.
[Figure panels: California (structure-to-structure), California (node-to-node)]
Main differences:
1 The partitioning according to structures;
2 The ordering by number of edges to other structures;
3 There is a hierarchical connectivity: bigger to smaller cities;
4 Surprising grid-like structure: intersections correspond to hub cities, and lines correspond to inter-city paths.
Experiments–Real datasets–DBLP
[Figure panels: Overall, FC-FC zoom]
DBLP is mainly characterized by false stars, possibly because advisors have students, and students connect to one another;
By zooming into FC-FC, one can see outliers, for instance k3 = “The Biomolecular Interaction Network Database and related tools 2005 update”, with 75 authors.
Conclusions
Contributions
Visualization technique: we introduce a processing and visualization methodology that combines algorithmic techniques and design to reach large-scale visualizations;
Analytical scalability: our technique extends the most scalable technique found in the literature; in addition, it is engineered to plot millions of edges in a matter of seconds;
Practical analysis: we show that large-scale graphs have well-defined behaviors concerning the distribution of structures, their sizes, and how they relate to one another; finally, using a standard laptop, our techniques allowed us to experiment on real, large-scale graphs from high-impact domains, namely WWW, Wikipedia, Roadnet, and DBLP.
References
[1] U. Kang and C. Faloutsos, “Beyond ‘caveman communities’: Hubs and spokes for graph compression and mining,” in ICDM, 2011, pp. 300–309.
[2] D. Koutra, U. Kang, J. Vreeken, and C. Faloutsos, “VoG: Summarizing and understanding large graphs,” in SDM, 2014.