Feb 26, 2019
Social Network Analysis
Challenges in Computer Science – April 1, 2014
Frank Takes ([email protected])
LIACS, Leiden University
Overview
Context
Social Network Analysis
Online Social Networks
Friendship Graph
Centrality Measures
Example
Conclusions
Data Science
Data Science is the study of the generalizable extraction of knowledge from data, yet the key word is science (Wikipedia)
Builds on techniques and theories from many fields, including machine learning, computer programming, statistics, data engineering, pattern recognition and learning, visualization, …
Goal is extracting meaning from data and creating data products
Data science is a buzzword, often used interchangeably with analytics or big data…
Big Data?
Online Social Network
Friendship Graph
8 million users
1 billion friendships between users
10GB on disk
Is this Big Data?
No.
From Data to Networks
Unstructured data: numeric measurements from a temperature sensor, textual contents of a news article
Structured data: data organized according to a model or data structure, for example:
Database: tables with rows and columns
Graph/Network: nodes and edges
Graphs
[Figure: example graph with nodes A, B, C, D and E]
Vertex/Node (Dutch: knoop)
Relationship/Edge/Link (Dutch: tak)
Distance (Dutch: afstand): d(C, E) = d(E, C) = 2
n = 5 nodes, m = 6 edges
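A minimal sketch of how such a graph could be stored and how a distance could be computed. The edge list is an assumption: the figure's actual edges are not recoverable, so six hypothetical edges were chosen that give n = 5, m = 6 and d(C, E) = 2. Distance is found with breadth-first search, which is sufficient for unweighted graphs.

```python
from collections import deque

# Hypothetical edge list (assumption): chosen only so that
# n = 5, m = 6 and d(C, E) = 2, matching the numbers on the slide.
EDGES = [("A", "B"), ("A", "C"), ("B", "C"), ("B", "D"), ("C", "D"), ("D", "E")]


def build_adjacency(edges):
    """Return an undirected adjacency mapping: node -> set of neighbours."""
    adjacency = {}
    for u, v in edges:
        adjacency.setdefault(u, set()).add(v)
        adjacency.setdefault(v, set()).add(u)
    return adjacency


def distance(adjacency, source, target):
    """Number of edges on a shortest path, found by breadth-first search."""
    seen = {source: 0}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        if node == target:
            return seen[node]
        for neighbour in adjacency[node]:
            if neighbour not in seen:
                seen[neighbour] = seen[node] + 1
                queue.append(neighbour)
    return None  # target is unreachable from source


graph = build_adjacency(EDGES)
print(len(graph), "nodes,", len(EDGES), "edges")
print("d(C, E) =", distance(graph, "C", "E"))
```

Because the graph is undirected, distance(graph, "C", "E") and distance(graph, "E", "C") give the same value.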
Social Network Analysis
Social Network Analysis (SNA): the study of social networks to understand their structure and behavior.
Social Network: a social structure of people, related (directly or indirectly) to each other through a common relation or interest.
Social Networks != Social Media
Social Network Analysis
Social Network Analysis (SNA):
Sociology
Algorithms
Data Mining
Social Networks:
Real-life (explicit)
Online (explicit)
Derived (implicit): e-mail networks, citation networks, co-author networks, terrorist collaboration networks
History
1997: SixDegrees.com
2002: Friendster
2003: LinkedIn & MySpace
2004: Hyves
2004: Facebook
2006: Twitter
…
2010: The Social Network (movie)
Online Social Networks
User (node) has a profile
Profiles have attributes (labels/annotations)
Explicit links (edges):
Social links / friendship links
User groups
Implicit links:
Social messaging
Common attributes
Directed vs. undirected links
Example: Facebook
More than 1 billion active users
Average user has 130 friends
Estimated 100 billion social links
Over 600 million interactive objects (pages, groups and events)
More than 45 billion pieces of content (web links, news stories, blog posts, notes, photo albums, etc.) shared each month
OSNA Research Topics
User behavior
Privacy & Anonymity
Trust & Authorities
Diffusion of information
Sampling & Crawling
Community Detection
Friendship Graph
Friendship Graph
Static analysis:
Who is the most important person in a network?
Can we distinguish between groups of people in the network?
What is the average distance between two people in the network?
Dynamic analysis:
Who is likely to become friends next?
How does the social network evolve over time?
Centrality measures
Degree centrality
Betweenness centrality
Closeness centrality
Graph centrality (eccentricity centrality)
Eigenvector centrality
Random walk centrality
Hyperlink Induced Topic Search (HITS)
PageRank
Centrality
Who has a central position in this graph?
[Figure: example graph with nodes A through H]
Degree centrality: C has the highest degree
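Degree centrality can be sketched in a few lines: count how many edges touch each node. The edge list below is hypothetical (the figure's edges are not recoverable); it was chosen so that, as on the slide, C ends up with the highest degree.

```python
from collections import Counter

# Hypothetical edges over nodes A..H (assumption, not the slide's graph).
EDGES = [
    ("A", "B"), ("A", "C"), ("A", "D"), ("B", "C"), ("B", "D"), ("C", "D"),
    ("C", "E"), ("E", "F"), ("E", "G"), ("G", "H"),
]


def degree_centrality(edges):
    """Map each node to its degree in an undirected graph."""
    degree = Counter()
    for u, v in edges:
        degree[u] += 1
        degree[v] += 1
    return dict(degree)


degrees = degree_centrality(EDGES)
most_central = max(degrees, key=degrees.get)
print(most_central, degrees[most_central])
```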
Centrality
Who has a central position in this graph?
[Figure: example graph with nodes A through H]
Betweenness centrality: E is part of the largest number of shortest paths
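A sketch of betweenness centrality for an unweighted, undirected graph, following the accumulation scheme of Brandes' algorithm: run a breadth-first search from every node, count shortest paths, then walk back and credit the intermediate nodes. The edge list is hypothetical (the figure's edges are not recoverable) and was chosen so that E lies on the most shortest paths, as on the slide.

```python
from collections import deque

# Hypothetical edges over nodes A..H (assumption, not the slide's graph).
EDGES = [
    ("A", "B"), ("A", "C"), ("A", "D"), ("B", "C"), ("B", "D"), ("C", "D"),
    ("C", "E"), ("E", "F"), ("E", "G"), ("G", "H"),
]


def build_adjacency(edges):
    """Return an undirected adjacency mapping: node -> list of neighbours."""
    adjacency = {}
    for u, v in edges:
        adjacency.setdefault(u, []).append(v)
        adjacency.setdefault(v, []).append(u)
    return adjacency


def betweenness_centrality(adjacency):
    """For each node, sum over node pairs the fraction of shortest paths through it."""
    score = {node: 0.0 for node in adjacency}
    for source in adjacency:
        # BFS from source, counting shortest paths (sigma) to every node.
        order = []
        predecessors = {node: [] for node in adjacency}
        sigma = {node: 0 for node in adjacency}
        sigma[source] = 1
        dist = {source: 0}
        queue = deque([source])
        while queue:
            node = queue.popleft()
            order.append(node)
            for neighbour in adjacency[node]:
                if neighbour not in dist:
                    dist[neighbour] = dist[node] + 1
                    queue.append(neighbour)
                if dist[neighbour] == dist[node] + 1:
                    sigma[neighbour] += sigma[node]
                    predecessors[neighbour].append(node)
        # Walk back from the farthest nodes, accumulating dependencies.
        delta = {node: 0.0 for node in adjacency}
        for node in reversed(order):
            for pred in predecessors[node]:
                delta[pred] += sigma[pred] / sigma[node] * (1 + delta[node])
            if node != source:
                score[node] += delta[node]
    # Each unordered pair was counted once from each endpoint.
    return {node: value / 2 for node, value in score.items()}


betweenness = betweenness_centrality(build_adjacency(EDGES))
most_central = max(betweenness, key=betweenness.get)
print(most_central, betweenness[most_central])
```

In this hypothetical graph C still has the highest degree, while E has the highest betweenness, which illustrates that the two measures can disagree.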
Google PageRank
4 webpages A, B, C and D (N = 4)
Initially: PR(A) = PR(B) = PR(C) = PR(D) = 1/N = 0.25
L(A) is the outdegree of page A
Now if B, C and D each link to A, the simple PageRank PR(A) of page A is equal to:
$$PR(A) = \frac{PR(B)}{L(B)} + \frac{PR(C)}{L(C)} + \frac{PR(D)}{L(D)}$$
Google PageRank
PageRank as suggested by Larry Page in 1999
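N = number of pages; p_i and p_j are pages
M(p_i) is the set of pages linking to p_i
L(p_j) is the outdegree of p_j
d = 0.85: 85% chance to follow a link, 15% chance to jump to a random page (random surfer)
At t = 0:
$$PR(p_i; 0) = \frac{1}{N}$$
At each step t = t + 1:
$$PR(p_i; t+1) = \frac{1-d}{N} + d \sum_{p_j \in M(p_i)} \frac{PR(p_j; t)}{L(p_j)}$$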
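The iteration can be sketched as follows. This is a minimal illustration, not a production implementation: it assumes every page has at least one outgoing link (no dangling pages) and every link target is itself a key of the mapping. The link structure is hypothetical; it only follows the slide in that B, C and D each link to A.

```python
def pagerank(links, d=0.85, iterations=100):
    """Iterate the PageRank update; links maps each page to the pages it links to."""
    pages = list(links)
    n = len(pages)
    rank = {page: 1.0 / n for page in pages}  # t = 0: uniform start

    # M(p): the set of pages linking to p.
    incoming = {page: [] for page in pages}
    for source, targets in links.items():
        for target in targets:
            incoming[target].append(source)

    for _ in range(iterations):  # t = t + 1
        rank = {
            page: (1 - d) / n
            + d * sum(rank[q] / len(links[q]) for q in incoming[page])
            for page in pages
        }
    return rank


# Hypothetical links (assumption): B, C and D link to A, as on the slide.
LINKS = {"A": ["B"], "B": ["A", "C"], "C": ["A"], "D": ["A", "B", "C"]}

ranks = pagerank(LINKS)
for page in sorted(ranks, key=ranks.get, reverse=True):
    print(page, round(ranks[page], 4))
```

Page D has no incoming links, so its rank settles at (1 - d)/N, the share contributed by random jumps alone.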
Frequent Subgraphs
What pattern occurs frequently in this graph?
[Figure: graph with nodes labeled A, B and C]
Frequent subgraph: A-B-C
Six Degrees of Separation
"I read somewhere that everybody on this planet is separated by only six other people. Six degrees of separation. Between us and everybody else on this planet. The president of the United States. A gondolier in Venice. Fill in the names. I find that A) tremendously comforting that we're so close and B) like Chinese water torture that we're so close. Because you have to find the right six people to make the connection. It's not just big names. It's anyone. A native in a rain forest. A Tierra del Fuegan. An Eskimo. I am bound to everyone on this planet by a trail of six people. It's a profound thought."
John Guare, 1990
Six Degrees of Separation
Stanley Milgram, 1969
300 letters from Omaha to Boston
Addressed to 300 random people, with the request to forward the letter toward the final addressee.
After 5.5 steps on average, the letter reached the addressee.
Six Degrees of Separation
Testing on an online social network
Dataset
8 million users
900 million friendships between users
9GB as text (data file), 4GB in memory
Comparing all distances: 8M x 8M = 64 x 10^12 pairs
Sampling: determine the distance between pairs of 1000 random users using Dijkstra's shortest-path algorithm
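A sketch of the sampling idea. Two points are assumptions: it reads the sampling as 1000 randomly drawn pairs of users, and because friendship links carry no weights it uses breadth-first search, to which Dijkstra's algorithm reduces when all edges have equal weight. The function names are illustrative only.

```python
import random
from collections import deque


def bfs_distances(adjacency, source):
    """Shortest-path distance (in edges) from source to every reachable node."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for neighbour in adjacency[node]:
            if neighbour not in dist:
                dist[neighbour] = dist[node] + 1
                queue.append(neighbour)
    return dist


def sampled_average_distance(adjacency, samples=1000, seed=0):
    """Estimate the average distance from randomly sampled node pairs."""
    rng = random.Random(seed)
    nodes = list(adjacency)
    total, counted = 0, 0
    for _ in range(samples):
        source, target = rng.sample(nodes, 2)  # two distinct nodes
        dist = bfs_distances(adjacency, source)
        if target in dist:  # skip pairs in different components
            total += dist[target]
            counted += 1
    return total / counted if counted else float("nan")


# Toy example (assumption): a ring of 10 users, each linked to two neighbours.
ring = {i: [(i - 1) % 10, (i + 1) % 10] for i in range(10)}
print(sampled_average_distance(ring))
```

On a graph with millions of nodes one would stop each search as soon as the target is found, or reuse one search for several targets; the sketch keeps the simplest form.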
Average distances
For datasets of social networks (1000 samples, chosen at random):
Network        Users        Friendships    Average distance
Flickr         1,800,000    22,600,000     5.67
Hyves          8,000,000    900,000,000    4.75
LiveJournal    5,300,000    77,400,000     5.88
Orkut          3,100,000    223,500,000    4.25
YouTube        1,160,000    4,950,000      5.10
Data storage in memory
n nodes, m links, k links per node on average
1 < k < n < m < n^2
                  Adjacency matrix                        Sorted adjacency list
Size              8M x 8M x 1 bit = 64 Tbit = 8 Tbyte     900M x 8 bytes (INT pairs) = 7.2 Gbyte
Space             O(n^2)                                  O(m)
Link existence    O(1) time                               O(log k) time
Link addition     O(1) time                               O(k log k) time
Link deletion     O(1) time                               O(1) time
Neighborhood      O(n) time                               O(1) time
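A sketch of a sorted adjacency list, with the size arithmetic from the table checked in code. The class and method names are illustrative assumptions. Link existence uses binary search over a node's sorted neighbour list, which gives the O(log k) lookup; the insertion cost of this particular sketch (binary search plus shifting in a Python list) is not necessarily the cost listed in the table.

```python
from bisect import bisect_left

# Size arithmetic from the table.
MATRIX_BYTES = 8_000_000 * 8_000_000 // 8  # 1 bit per node pair
LIST_BYTES = 900_000_000 * 8               # two 4-byte integers per link


class SortedAdjacencyList:
    """Undirected graph storing each node's neighbours as a sorted list."""

    def __init__(self):
        self.neighbours = {}

    def add_link(self, u, v):
        """Insert the link in both directions, keeping the lists sorted."""
        for a, b in ((u, v), (v, u)):
            row = self.neighbours.setdefault(a, [])
            i = bisect_left(row, b)
            if i == len(row) or row[i] != b:  # avoid duplicate entries
                row.insert(i, b)

    def has_link(self, u, v):
        """Binary search in u's neighbour list: O(log k) comparisons."""
        row = self.neighbours.get(u, [])
        i = bisect_left(row, v)
        return i < len(row) and row[i] == v

    def neighbourhood(self, u):
        """Return u's neighbours without scanning any other node."""
        return self.neighbours.get(u, [])


g = SortedAdjacencyList()
for u, v in [(3, 1), (1, 2), (1, 7), (2, 7)]:
    g.add_link(u, v)
print(g.neighbourhood(1), g.has_link(2, 7), g.has_link(3, 7))
print(MATRIX_BYTES / 1e12, "Tbyte vs", LIST_BYTES / 1e9, "Gbyte")
```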
Friendship Graph Analysis
Static analysis
Densely connected core
Fringe of low-degree nodes
Few isolated communities & singletons
Static properties
Node degree distribution, average distance, diameter
Edge/node ratio, level of symmetry
Number of cliques, k-cliques, etc.
Small-world phenomenon
Small World Networks
Class of networks with certain properties:
Sparse graphs
Highly connected
Short average node-to-node distance: d ~ log(n)
Fat-tailed power-law node degree distribution
Densely connected core with many (near-)cliques
Existence of hubs: nodes with a very high degree
Fringe of low(er)-degree nodes
Small World Networks
Other examples of small world networks
Web graphs
Gene networks
E-mail networks
Telephone call graphs
Information networks
Internet topology networks
Scientific co-authorship networks
Corporate networks (interlocks or ownerships)
Friendship Graph Analysis
Static analysis
Dynamic analysis
Network evolution
Network modelling
Network growth
Link Prediction
Triadic Closure
Preferential Attachment
Preferential Attachment
Nodes with a large degree acquire new links at a faster rate.
[Figure: example graph with nodes A, B, C, D, E, F, G and J]
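A small simulation sketch of preferential attachment, under simplifying assumptions that are not from the slides: the network starts with a single link, and every new node adds exactly one link to an existing node chosen with probability proportional to its degree.

```python
import random
from collections import Counter


def preferential_attachment(n_nodes, seed=0):
    """Grow a network where new nodes prefer to link to high-degree nodes."""
    rng = random.Random(seed)
    edges = [(0, 1)]
    # Each node appears in `endpoints` once per incident link, so a uniform
    # draw from this list picks a node with probability proportional to degree.
    endpoints = [0, 1]
    for new_node in range(2, n_nodes):
        target = rng.choice(endpoints)
        edges.append((new_node, target))
        endpoints.extend((new_node, target))
    return edges


def degree_counts(edges):
    """Map each node to its degree."""
    degree = Counter()
    for u, v in edges:
        degree[u] += 1
        degree[v] += 1
    return degree


edges = preferential_attachment(200)
degree = degree_counts(edges)
print("highest degrees:", degree.most_common(5))
```

Running it with different seeds typically shows a few early nodes collecting many links while most nodes keep a degree of one or two.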
Conclusions
When you hear “big data”, it is almost never really big data.
Online social networks are an excellent domain of study for data (or graph-) miners.
Social Network Analysis is important for many areas of research, not only computer science.
(Small world) networks are everywhere.
Try this at home
Graph visualization: http://www.gephi.org and http://nodexl.codeplex.com
Network datasets: http://snap.stanford.edu and http://konect.uni-koblenz.de/networks