Mining and Analyzing Online Social Graph Data Drew Conway New York University, Dept. of Politics May 13, 2010
May 11, 2015
Agenda
Network basics
- Unit of analysis
- Data representation
- Analysis & Visualization
Thoughts on research design
- Edge contexts
- A bit on social graph data and ethics
Digging into the data
- Where to get it
- Our first scrape and build
Real-time social graph analysis
- Build the network of Twitter users in the room
- Hold our breath
Live demo setup
Last week Bruno asked how many of you use Twitter.
- 15 (54%) members said yes, which may be enough to do some interesting stuff
Please tweet the following hashtag:
#analyticsnyc
Some example tweets:
- Many thanks to the @GiltGroupe for hosting tonight's #analyticsnyc meetup
- Hanging out with @DKALab and @brunocm at #analyticsnyc
- I cannot wait to watch @drewconway crash and burn with his #analyticsnyc live demo
Networks and the study of relationships
Network theory uses the language of graph theory:
G = {V, E}
where V is a set of vertices (nodes) and E is a set of edges (relationships) between them.
In the abstract this is very powerful: we have a general-purpose way to represent any number of relations.
- G = {Routers, Packet Traffic}: edges have non-binary values
- G = {US Airports, Commercial Routes}: edges can represent distance, cost, frequency, etc.
- G = {New York City nerds, Co-membership in Meetups}: edges are implied
While both nodes and edges are needed to have a network of any substance, it is the edge (relationship) that will always be the primary focus of our analyses.
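As a minimal sketch of the G = {V, E} idea in plain Python, a graph can be held as a set of nodes plus a dictionary of weighted edges. The airport codes, the weights, and the helper names `neighbors` and `strength` are made up for illustration; they are not data or code from the talk.

```python
# Hypothetical example: airports as nodes, commercial routes as weighted edges.
V = {"JFK", "LAX", "ORD"}
E = {
    ("JFK", "LAX"): 12,  # made-up weight, e.g. daily flights
    ("JFK", "ORD"): 8,
    ("ORD", "LAX"): 5,
}

def neighbors(node, edges):
    """Return the nodes that share an (undirected) edge with `node`."""
    out = set()
    for (u, v) in edges:
        if u == node:
            out.add(v)
        elif v == node:
            out.add(u)
    return out

def strength(node, edges):
    """Sum of the weights on all edges touching `node`."""
    return sum(w for (u, v), w in edges.items() if node in (u, v))
```

Everything asked of the graph here (who is adjacent to whom, how heavy the ties are) is answered from E alone, which is the sense in which the edge is the primary unit.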
Why focus on the edge?
Consider a very simple example of two people meeting for the first time.
If we focus on the nodes, we might consider that this meeting simply creates a dyad.
We know, however, that people do not exist as isolates, and this assumption ignores all of the exogenous social structure these individuals bring with them.
In reality, the meeting may reveal a large degree of shared structure.
By expressing this meeting in terms of the nodes as a function of their edges, we may gain a much richer understanding of the structural dynamics.
Representing network data: the canonical
Perhaps the most natural way to represent the relationships between N actors is with an N x N matrix, often referred to as a "sociomatrix".
       X1   X2   ...   XN
X1      0    1   ...    0
X2      1    0   ...    1
...
XN      0    1   ...    0
- Very intuitive representation
- Can support directional and weighted edges
- Allows matrix-algebraic operations for analysis
As an introduction to representing relationships, a matrix provides insight into a network by itself. In practice, however, this representation has many limitations:
- Unwieldy as network sizes scale up
- Most networks are sparse: too many zeroes
- Difficult to publish and share
Fortunately, there are many other options for representing network data.
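To make the sparsity point concrete, here is a small sketch that builds a sociomatrix from a list of ties and measures how much of it is zeros. The function names and the five-actor example are illustrative assumptions, not part of the talk.

```python
def sociomatrix(n, edges, directed=False):
    """Build an n x n 0/1 matrix (list of lists) from 0-based (source, target) pairs."""
    m = [[0] * n for _ in range(n)]
    for s, t in edges:
        m[s][t] = 1
        if not directed:
            m[t][s] = 1
    return m

def zero_fraction(m):
    """Share of cells that are zero, a rough measure of sparsity."""
    cells = [x for row in m for x in row]
    return cells.count(0) / len(cells)

# Illustrative 5-actor network with 4 undirected ties (not data from the talk).
edges = [(0, 1), (1, 2), (2, 3), (3, 4)]
X = sociomatrix(5, edges)
```

Here 17 of the 25 cells are zero. The same arithmetic for the 194-node, 273-edge Hartford network used in this talk, assuming 273 distinct undirected ties, gives 546 non-zero cells out of 37,636, so roughly 98.5% zeros.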
Representing network data: the practical
Remember, it is all about the edges; therefore, more efficient representations will be limited to edge data.
Edge list
A text file with two columns (usually delimited by a space or tab), where the first column is the source and the second is the target.
1 2
1 3
2 1
6 7
6 8
10 12
10 13
Adjacency list
Also a text file; here, however, the first column is the source and all subsequent entries are "adjacent" nodes.
1 2 3
2 1
6 7 8
10 12 13
While these are the most universal data formats, network data representation is a bit of a cottage industry:
- Pajek (.net)
- GML (.gml) / GraphML (.graphml)
- GraphViz (.dot)
- and domain-specific formats (eek!)
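The two plain-text formats carry the same information, so converting between them is mechanical. The sketch below uses hypothetical helper names and only the standard library. It parses the edge list shown on this slide and writes it back out as an adjacency list.

```python
from collections import defaultdict

def edge_list_to_adjacency(text):
    """Parse whitespace-delimited 'source target' lines into {source: [targets]}."""
    adj = defaultdict(list)
    for line in text.splitlines():
        parts = line.split()
        if len(parts) != 2:
            continue  # skip blank or malformed lines
        source, target = parts
        adj[source].append(target)
    return dict(adj)

def adjacency_to_text(adj):
    """Write one line per source: the source followed by its adjacent nodes."""
    return "\n".join(" ".join([s] + ts) for s, ts in adj.items())

edge_list = """1 2
1 3
2 1
6 7
6 8
10 12
10 13"""

adj = edge_list_to_adjacency(edge_list)
print(adjacency_to_text(adj))
```

Node IDs are kept as strings so that non-numeric labels (user names, URLs) work unchanged.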
A bit on tools
The number of software suites and packages available for conducting social network analysis has exploded over the past ten years.
- In general, this software can be categorized in two ways:
  - Type: many SNA tools are developed to be standalone applications, while others are language-specific packages
  - Intent: consumers and producers of SNA come from a wide range of technical expertise and/or need; therefore, there exist simple tools for data collection and basic analysis, as well as complex suites for advanced research

           Standalone Apps               Modules & Packages
Basic      ORA (Windows)                 libSNA (Python)
           Analyst Notebook (Windows)    UrlNet (Python)
           KrackPlot (Windows)           NodeXL (MS Excel)
Advanced   UCINet (Windows)              NetworkX (Python)
           Pajek (Multi)                 JUNG (Java)
           Network Workbench (Multi)     igraph (Python, R & Ruby)

Many of the above tools have visualization components, but several tools are designed specifically for visualization: Graphviz, NetDraw, Tom Sawyer, Gephi, etc.
What I use: highlighted on the original slide.
Comparing two network metrics to find key actors
Often, network analysis is used to identify key actors within a social group. To identify these actors, various centrality metrics can be computed based on a network's structure:
- Degree (number of connections)
- Betweenness (number of shortest paths an actor is on)
- Closeness (relative distance to all other actors)
- Eigenvector centrality (leading eigenvector of the sociomatrix)
One method for using these metrics to identify key actors is to plot actors' scores for eigenvector centrality versus betweenness. Theoretically, these metrics should be approximately linearly related; therefore, any outliers from that linear relationship will be of note.
- An actor with very high betweenness but low EC may be a critical gatekeeper to a central actor
- Likewise, an actor with low betweenness but high EC may have unique access to central actors
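For readers who want to see what two of these metrics actually count, here is a standard-library sketch of degree and of betweenness via Brandes' algorithm for unweighted, undirected graphs. It is an illustration only: the talk itself relies on igraph, and the toy path graph and function names below are assumptions.

```python
from collections import deque

def degree(adj):
    """Number of connections per node."""
    return {v: len(nbrs) for v, nbrs in adj.items()}

def betweenness(adj):
    """Brandes' algorithm for an unweighted, undirected graph.

    Returns, for each node, the (fractional) number of shortest paths
    between other pairs of nodes that pass through it.
    """
    bc = {v: 0.0 for v in adj}
    for s in adj:
        # Breadth-first search from s, counting shortest paths (sigma).
        order = []
        preds = {v: [] for v in adj}
        sigma = {v: 0 for v in adj}
        dist = {v: -1 for v in adj}
        sigma[s], dist[s] = 1, 0
        queue = deque([s])
        while queue:
            v = queue.popleft()
            order.append(v)
            for w in adj[v]:
                if dist[w] < 0:
                    dist[w] = dist[v] + 1
                    queue.append(w)
                if dist[w] == dist[v] + 1:
                    sigma[w] += sigma[v]
                    preds[w].append(v)
        # Walk back from the farthest nodes, accumulating dependencies.
        delta = {v: 0.0 for v in adj}
        for w in reversed(order):
            for v in preds[w]:
                delta[v] += sigma[v] / sigma[w] * (1 + delta[w])
            if w != s:
                bc[w] += delta[w]
    # Each undirected pair was counted from both endpoints.
    return {v: b / 2 for v, b in bc.items()}

# Toy path graph a - b - c - d (illustrative only).
adj = {"a": ["b"], "b": ["a", "c"], "c": ["b", "d"], "d": ["c"]}
```

On the path graph, the two interior nodes each sit on two shortest paths between other pairs, while the endpoints sit on none, even though every node has a degree of only 1 or 2.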
First, visualize the data
Visualization can be the best first step in the analytical process.
- This will give you a good feel for what is going on with your relationships
- For large networks this is often not possible
- For this example we will use the main component of the social network collected on drug users in Hartford, CT. The network has 194 nodes and 273 edges
- Here I am using the GUESS visualizer in NWB with a Kamada-Kawai layout
Finding Key Actors with R
The first steps are to load the data into memory and perform some basic centrality analysis. [1]
Load the data into igraph
    library(igraph)
    G <- read.graph("drug_main.txt", format="edgelist")
    # By default igraph inputs edgelist data as a directed graph. In this step
    # we undo this and assume that all relationships are reciprocal.
    G <- as.undirected(G)
Store metrics in new data frame
    # evcent returns lots of data associated with the EC, but we only need the
    # leading eigenvector
    cent <- data.frame(bet=betweenness(G), eig=evcent(G)$vector)
    # We will use the residuals in the next step
    res <- lm(eig~bet, data=cent)$residuals
    cent <- transform(cent, res=res)
[1] Weeks et al. (2002), http://dx.doi.org/10.1023/A:1015457400897
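The residual step is just a one-variable least-squares fit. As a rough Python sketch of what `lm(eig~bet)$residuals` computes (the scores below are made up, not the Hartford data, and `ols_residuals` is a hypothetical helper):

```python
def ols_residuals(x, y):
    """Residuals from the least-squares line y = a + b*x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    b = sxy / sxx
    a = mean_y - b * mean_x
    return [yi - (a + b * xi) for xi, yi in zip(x, y)]

# Made-up scores for five actors (not the Hartford data).
bet = [0.0, 10.0, 20.0, 30.0, 40.0]
eig = [0.1, 0.2, 0.9, 0.4, 0.5]
res = ols_residuals(bet, eig)

# The actor with the largest absolute residual departs most from the linear trend.
outlier = max(range(len(res)), key=lambda i: abs(res[i]))
```

A large positive residual means more eigenvector centrality than betweenness would predict; a large negative one means the reverse.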
Finding Key Actors with R
Plot the data
    # We use ggplot2 to make things a bit prettier
    library(ggplot2)
    # We use the residuals to color and size the points of our plot, making it
    # easier to spot outliers
    p <- ggplot(cent, aes(x=bet, y=eig, label=rownames(cent), colour=res,
                          size=abs(res))) +
         xlab("Betweenness Centrality") + ylab("Eigenvector Centrality")
    # We use the geom_text function to plot the actors' IDs rather than points,
    # so we know who is who
    # (opts() was removed from later ggplot2 releases; ggtitle() is the
    # current equivalent)
    p + geom_text() + opts(title="Key Actor Analysis for Hartford Drug Users")
[Figure: "Key Actor Analysis for Hartford Drug Users". Actor IDs (1 to 194) plotted by Betweenness Centrality (x-axis, 0 to 6000) against Eigenvector Centrality (y-axis, 0.0 to 1.0); label colour shows res (about -0.2 to 0.6) and label size shows abs(res) (0.1 to 0.7).]
Key Actor Plot
[Figure: network plot of the Hartford network with only the key actors labeled: 20, 28, 44, 47, 50, 53, 58, 79, 102, 141, 155.]
Network plot
    # Create positions for all of the nodes w/ force directed
    l <- layout.fruchterman.reingold(G, niter=500)
    # Set the nodes' size relative to their residual value
    V(G)$size <- abs(res)*10
    # Only display the labels of key players
    # (the +1 shifts the 0-based vertex IDs igraph used at the time)
    nodes <- as.vector(V(G)+1)
    # Key players defined as having a residual value > .25
    nodes[which(abs(res)<.25)] <- NA
    # Save plot as PDF
    pdf('actor_plot.pdf', pointsize=7)
    plot(G, layout=l, vertex.label=nodes, vertex.label.dist=0.25,
         vertex.label.color='red', edge.width=1)
    dev.off()
Not all edges are created equally
I have spent a lot of time tonight describing network analysis as a way to understand relationships.
- Depending on the source and context of the data, these relationships can be interpreted as many different things
- Must consider the data generation process by which the edge was created
- With respect to online social graph data, ask: how do people use this service?
Twitter
Recent study shows Twitter is used as a news aggregation service: me → interests → you
- Geography and history less important
- Networks may cluster around communities of interest
Facebook
Ties here are driven by personal contact: me → offline → you
- Offline relationships are driving "friending"
- Considerable amount of meta-data already in FB; what does this add?
Google SocialGraph
Connecting all of Google's social sites together: me → anything → you
- The combining of multiple platforms into a single "network" makes analysis and interpretation difficult
Ethics and network data
Why should we be discussing ethics at a data analytics seminar?
"People have really gotten comfortable not only sharing more information and different kinds, but more openly and with more people. That social norm is just something that has evolved over time."
Mark Zuckerberg, CEO, Facebook
"Money is a terrible master but an excellent servant."
"There's a sucker born every minute."
P.T. Barnum
In isolation, the data we provide online about our relationships and preferences are fairly innocuous. Their summation, however, can illuminate aspects of our lives that can be exploited in any number of ways.
- Social networking services are moving previously private data to the public
- Simply because data is available does not mean that individuals want, or realize, that it can be used to make inferences about who they are or their lifestyles
- Analysts are caught in the middle. There is no IRB on the Internet.
Examples of ethically questionable network analyses
MIT study predicts sexual orientation from Facebook friends
- Two undergraduate students start a project, 'Gaydar'
- Claim they can predict which men are homosexual simply based on their friend structure
- Problem: extremely private information, questionable methods
PleaseRobMe.com
- Combined information from foursquare.com and Twitter to publicize when people were clearly not at home
- Used as a proof of concept to show the danger of publicizing location information
- Problem: useful for raising awareness, hurtful to anyone who was actually robbed
Project Grey Goose
- Large group of former and current DoD/IC analysts collaborated to study the identity of hackers
- I was personally involved in this project
- Using public web forum data, built up user profiles and social networks to attempt to identify hackers affiliated with the Russian government
- Problem: while no names were published, bordering on vigilantism
Where to get social graph data
Recently there has been an explosion of resources for scraping social graph data.

Service              Data                                        API Docs
Twitter              Following(ers), @-replies, date/time/geo    http://apiwiki.twitter.com
Facebook             Friends, Wall Posts, date/time              http://developers.facebook.com/docs/api
Google               All SocialGraph relationships               http://code.google.com/apis/socialgraph
Foursquare           Friends, Check-ins                          http://foursquare.com/developers
Hunch                "Taste graph", recommendations              http://hunch.com/developers
The New York Times   Congressional votes, campaign finance       http://developer.nytimes.com/docs

There is clearly no shortage of data:
- Each service provides a different relational context
- Data formats are generally JSON, Atom, XML, or some combination
- For a more extensive list of API resources, see the HackNY wiki of local startups
Building the social network among LiveJournal users
Using a "seed" user, we will build out a network.
- In Python, use NetworkX, cjson, and other standard scientific libraries to parse the SocialGraph data
- Through a process called "k-snowball searching": seed → friend → ... → friend_k
- Seed: imichaeldotorg.livejournal.com
- k = 3
- Note the low value of k
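The snowball idea can be sketched independently of any API as a breadth-first expansion that stops after k rounds. In the sketch below the `friends` dictionary stands in for the remote lookups, and all user names are placeholders; it illustrates the search pattern, not the code used in the talk.

```python
def snowball(friends, seed, k):
    """Collect every node reachable from `seed` in at most k hops.

    `friends` maps each user to the users they link to; in a real scrape
    each lookup would be an API call rather than a dictionary access.
    """
    seen = {seed}
    frontier = [seed]
    edges = set()
    for _ in range(k):
        next_frontier = []
        for user in frontier:
            for friend in friends.get(user, []):
                edges.add((user, friend))
                if friend not in seen:
                    seen.add(friend)
                    next_frontier.append(friend)
        frontier = next_frontier
    return seen, edges

# Hypothetical friend lists (user names are placeholders).
friends = {
    "seed": ["a", "b"],
    "a": ["c"],
    "b": ["c", "d"],
    "c": ["e"],
    "d": [],
    "e": ["f"],
}
nodes, edges = snowball(friends, "seed", 2)
```

Because each round can multiply the frontier by the typical number of friends, the number of lookups grows very quickly with k, which is why a low value such as k = 3 is already a large scrape.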
The code, part 1
Loading the libraries and setting things up
    # Python 2 code, as presented in 2010
    from cjson import *
    from urllib import *
    from networkx import *
    from time import *
    from scipy import array,unique

    if __name__ == "__main__":
        seed_url="http://imichaeldotorg.livejournal.com"
        sg=get_sg(seed_url)
        net,newnodes=create_egonet(sg)
        info(net)

Output:
    Name: ['http://imichaeldotorg.livejournal.com']
    Type: DiGraph
    Number of nodes: 5
    Number of edges: 5
    Average in degree: 1.0
    Average out degree: 1.0

Get the JSON from SocialGraph
    def get_sg(seed_url):
        sgapi_url="http://socialgraph.apis.google.com/lookup?q="+seed_url+"&edo=1&edi=1&fme=1&pretty=0"
        try:
            furl=urlopen(sgapi_url)
            fr=furl.read()
            furl.close()
            return fr
        except IOError:
            print "Could not connect to website"
            print sgapi_url
            return
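As a hedged aside, the same request URL can be assembled in current Python 3 with `urllib.parse.urlencode`, which also percent-encodes the seed URL (the original code concatenates it raw). The endpoint and parameter names are the ones on the slide; the SocialGraph API has since been retired by Google, so this only illustrates URL construction. `build_lookup_url` is a hypothetical helper.

```python
from urllib.parse import urlencode

# Endpoint taken from the slide; the SocialGraph API has since been retired,
# so this only illustrates how the request URL is put together.
SGAPI_LOOKUP = "http://socialgraph.apis.google.com/lookup"

def build_lookup_url(seed_url):
    """Return the lookup URL for `seed_url`, percent-encoding the query value."""
    params = {
        "q": seed_url,
        "edo": 1,     # parameter names and values as on the slide
        "edi": 1,
        "fme": 1,
        "pretty": 0,
    }
    return SGAPI_LOOKUP + "?" + urlencode(params)

url = build_lookup_url("http://imichaeldotorg.livejournal.com")
```

Keeping URL construction in its own function makes it easy to test without touching the network.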
Build egonet and snowball
Creating the egonet
    def create_egonet(s):
        try:
            raw=decode(s)
            G=DiGraph()
            pendants=[]
            n=raw['nodes']
            nk=n.keys()
            G.name=str(nk)
            # Link every pair of the seed's own nodes to each other
            for a in range(0,len(nk)):
                for b in range(0,len(nk)):
                    if a!=b:
                        G.add_edge(nk[a],nk[b])
            for k in nk:
                ego=n[k]
                # Outgoing ties: ego -> referenced node
                ego_out=ego['nodes_referenced']
                for o in ego_out:
                    G.add_edge(k,o)
                    pendants.append(o)
                # Incoming ties: referencing node -> ego
                ego_in=ego['nodes_referenced_by']
                for i in ego_in:
                    G.add_edge(i,k)
                    pendants.append(i)
            # De-duplicate the newly found nodes for the next snowball round
            pendants=array(pendants,dtype=str)
            pendants.flatten()
            pendants=unique(pendants)
            return G,pendants
        except DecodeError:
            # (handler body not shown on the slide)
        except KeyError:
            # (handler body not shown on the slide)
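To show the shape of the data this function walks, here is a standard-library sketch that extracts directed edges from a response-like JSON string. The sample payload is invented: only the key names `nodes`, `nodes_referenced`, and `nodes_referenced_by` come from the code above, and the example URLs are placeholders. It also skips the step that links the seed's own nodes to one another.

```python
import json

# A made-up response shaped like the fields the slide's code reads
# ('nodes', 'nodes_referenced', 'nodes_referenced_by'); URLs are placeholders.
sample = json.dumps({
    "nodes": {
        "http://seed.example.com": {
            "nodes_referenced": {"http://friend1.example.com": {}},
            "nodes_referenced_by": {"http://friend2.example.com": {}},
        }
    }
})

def egonet_edges(raw_json):
    """Return (directed edge set, sorted pendant nodes) for one response."""
    data = json.loads(raw_json)
    edges, pendants = set(), set()
    for ego, info in data.get("nodes", {}).items():
        for out in info.get("nodes_referenced", {}):
            edges.add((ego, out))   # ego links to `out`
            pendants.add(out)
        for inc in info.get("nodes_referenced_by", {}):
            edges.add((inc, ego))   # `inc` links to ego
            pendants.add(inc)
    return edges, sorted(pendants)

edges, pendants = egonet_edges(sample)
```

The pendants are the newly discovered nodes, which become the seeds of the next snowball round.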
Rolling the snowball
    def snowball_round(G,seeds,myspace=False):
        t0=time()
        if myspace:
            seeds=get_myspace_url(seeds)
        sb_data=[]
        for s in range(0,len(seeds)):
            s_sg=get_sg(seeds[s])
            new_ego,pen=create_egonet(s_sg)
            for p in pen:
                sb_data.append(p)
            # Merge each new egonet into the growing snowball network
            if s<1:
                sb_net=compose(G,new_ego)
            else:
                sb_net=compose(new_ego,sb_net)
            del new_ego
            if s==round(len(seeds)*0.2):
                sb_net.name='20% complete'
                sb_net.info()
                print 'AT '+strftime('%m/%d/%Y %H:%M:%S', gmtime())
                print ''
            # More time keeping, probably a MUCH better way to do this
        sb_data=array(sb_data)
        sb_data.flatten()
        sb_data=unique(sb_data)
        sb_net.info()
        return sb_net,sb_data
Build the whole network
Step     Nodes   Edges   Mean Degree   Density
Seed         5       5           2.0   0.25
k = 2       75     115           3.0   0.02
k = 3    4,938   8,659           3.5   3.6 x 10^-4
- Our seed is abnormally isolated, with only four neighbors
- Large jump after first snowball
- Massive structural leap at k = 3
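The summary columns can be recomputed from the node and edge counts alone. The formulas below are my assumption about how the table was produced (mean degree as 2m/n, density as the share of possible directed edges); they reproduce the seed row exactly and agree with the other rows to roughly the precision shown.

```python
def mean_degree(nodes, edges):
    """Average number of edge endpoints per node: 2m / n."""
    return 2 * edges / nodes

def directed_density(nodes, edges):
    """Share of possible directed edges present: m / (n * (n - 1))."""
    return edges / (nodes * (nodes - 1))

# Node and edge counts from the table above.
steps = {"Seed": (5, 5), "k = 2": (75, 115), "k = 3": (4938, 8659)}
for name, (n, m) in steps.items():
    print(name, round(mean_degree(n, m), 1), directed_density(n, m))
```

Density falls by orders of magnitude as the snowball grows because the number of possible edges rises with the square of the node count, while each user only adds a handful of ties.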
The full network
To get a feeling for the size of the full network
Live demonstration time
Drew Conway Mining and Analyzing Online Social Graph Data
Agenda
Network basics
I Unit of analysis
I Data representation
I Analysis amp Visualization
Thoughts on research design
I Edge contexts
I A bit on social graph data and ethics
Digging into the data
I Where to get it
I Our first scrape and build
Real-time social graph analysis
I Build the network of Twitter users in the room
I Hold our breath
Live demo setup
Last week Bruno asked how many of you use Twitter
I 15 (54) members said yes which may be enough to do some interestingstuff
Please tweet the following hash-tag
analyticsnyc
Some example tweets
I Many thanks to the GiltGroupe for hosting tonightrsquos analyticsnycmeetup
I Hanging out with DKALab and brunocm at analyticsnyc
I I cannot wait to watch drewconway crash and burn with hisanalyticsnyc live demo
Network basicsResearch design
Digging into dataLive demonstration
The study of relationshipsData representationAnalysis amp Visualization
Networks and the study of relationships
Network theory use the language of graph theory
G = V Euarr
In the abstract this is very powerful we have a general purpose way torepresent any number of relations
I G=RoutersPacket TrafficEdges have non-binary values
I G=US Airports Commercial RoutesEdges can represent distance cost frequency etc
I G=New York City nerds Co-membership in MeetupsEdges are implied
While both nodes and edges are needed to have a network of any substance itis the edge (relationship) that will always be the primary focus of our analyses
Drew Conway Mining and Analyzing Online Social Graph Data
Network basicsResearch design
Digging into dataLive demonstration
The study of relationshipsData representationAnalysis amp Visualization
Why focus on the edge
Consider a very simple example of two people meeting for the first time
If we focus on the nodes we mightconsider this meeting creates adyad
We know however that people donot exist as isolates and thisassumption is ignoring all ofexogenous social structure theseindividuals brings with them
In reality the meeting may reveal a largedegree of shared structure
By expressing this meeting in terms of thenodes as a function of their edges we maygain a much richer understanding of thestructural dynamics
Drew Conway Mining and Analyzing Online Social Graph Data
Network basicsResearch design
Digging into dataLive demonstration
The study of relationshipsData representationAnalysis amp Visualization
Representing network data the canonical
Perhaps the most natural way to represent the relationships between N actorsis with an NxN matrix often referred to as a ldquosociomatrixrdquo
X1 X2 XN
X1 0 1 0X2 1 0 1
XN 0 1 0
I Very intuitive representation
I Can support directional and weighted edges
I Application of matrix algebraic operation foranalysis
As an introduction to representing relationship a matrix provides insight to anetwork by itself In practice however this representation has many limitations
I Unwieldy as network sizes scale up
I Most networks are sparse too many zeroes
I Difficult to publish and share
Fortunately there are many other options for representing network data
Drew Conway Mining and Analyzing Online Social Graph Data
Network basicsResearch design
Digging into dataLive demonstration
The study of relationshipsData representationAnalysis amp Visualization
Representing network data the practical
Remember it is all about the edges therefore more efficient representationswill be limited to edge data
Edge list
A text file with two columns (usuallydelimited by a space or tab) where firstcolumn in source and second is target
1 2
1 3
2 1
6 7
6 8
10 12
10 13
Adjacency list
Also a text file however here the firstcolumn is the source and allsubsequent entries are ldquoadjacentrdquonodes
1 2 3
2 1
6 7 8
10 12 13
While these are the most universal data formats network data representation isa bit of a cottage industry
I Pajek (net)I GraphML (gml)I GraphViz (dot)I and domain specific formats (eek)
Drew Conway Mining and Analyzing Online Social Graph Data
Network basicsResearch design
Digging into dataLive demonstration
The study of relationshipsData representationAnalysis amp Visualization
A bit on tools
The number of software suites and packages available for conducting socialnetwork analysis has exploded over the past ten years
I In general this software can be categorized in two waysI Type - many SNA tools are developed to be standalone applications while
others are language specific packagesI Intent - consumers and producer of SNA come from a wide range of
technical expertise andor need therefore there exist simple tools for datacollection and basic analysis as well as complex suites for advanced research
Standalone Apps Modules amp Packages
Basic- ORA (Windows) - libSNA (Python)- Analyst Notebook (Windows) - UrlNet (Python)- KrakPlot (Windows) - NodeXL (MS Excel)
Advanced- UCINet (Windows) - NetworkX (Python)- Pajek (Multi) - JUNG (Java)- Network Workbench (Multi) - igraph (Python R amp Ruby)
Many of the above tools have visualization components but several tools aredesigned specifically for visualization Graphviz NetDraw Tom Sawyer GephietcWhat I use
Drew Conway Mining and Analyzing Online Social Graph Data
Network basicsResearch design
Digging into dataLive demonstration
The study of relationshipsData representationAnalysis amp Visualization
Comparing two network metrics to find key actors
Often network analysis is used to identify key actors within a social group Toidentify these actors various centrality metrics can be computed based on anetworkrsquos structure
I Degree (number of connections)
I Betweenness (number of shortest paths an actor is on)
I Closeness (relative distance to all other actors)
I Eigenvector centrality (leading eigenvector of sociomatrix)
One method for using these metrics to identify key actors is to plot actorsrsquoscores for Eigenvector centrality versus Betweenness Theoretically thesemetrics should be approximately linear therefore any non-linear outliers will beof note
I An actor with very high betweenness but low EC may be a criticalgatekeeper to a central actor
I Likewise an actor with low betweenness but high EC may have uniqueaccess to central actors
Drew Conway Mining and Analyzing Online Social Graph Data
Network basicsResearch design
Digging into dataLive demonstration
The study of relationshipsData representationAnalysis amp Visualization
First visualize the data
Visualization can be the best first step in the analytical process
I This will give you a goodfeel for what is going onwith your relationships
I For large networks this isoften not possible
I For this example we willuse the main componentof the social networkcollected on drug users inHartford CT The networkhas 194 nodes and 273edges
I Here I am using theGUESS visualizer in NWBwith a Kamada-Kawailayout
Drew Conway Mining and Analyzing Online Social Graph Data
Network basicsResearch design
Digging into dataLive demonstration
The study of relationshipsData representationAnalysis amp Visualization
Finding Key Actors with R
The first steps are to load the data into memory and perform some basiccentrality analysis1
Load the data into igraph
library(igraph)Glt-readgraph(drug_maintxtformat=edgelist)Glt-asundirected(G) By default igraph inputs edgelist data as a directed graph In this step we undo this and assume that all relationships are reciprocal
Store metrics in new data frame
centlt-dataframe(bet=betweenness(G)eig=evcent(G)$vector) evcent returns lots of data associated with the EC but we only need the leading eigenvectorreslt-lm(eig~betdata=cent)$residualscentlt-transform(centres=res) We will use the residuals in the next step
1Weeks et al (2002) httpdxdoiorg101023A1015457400897
Drew Conway Mining and Analyzing Online Social Graph Data
Network basicsResearch design
Digging into dataLive demonstration
The study of relationshipsData representationAnalysis amp Visualization
Finding Key Actors with R
Plot the data
library(ggplot2)
We use ggplot2 to make things a
bit prettier
plt-ggplot(centaes(x=bety=eig
label=rownames(cent)colour=res
size=abs(res)))+xlab(Betweenness
Centrality)+ylab(Eigenvector
Centrality)
We use the residuals to color and
shape the points of our plot
making it easier to spot outliers
p+geom_text()+opts(title=Key Actor
Analysis for Hartford Drug Users)
We use the geom_text function to plot
the actorsrsquo IDrsquos rather than points
so we know who is who
Key Actor Analysis for Hartford Drug Users
Betweenness Centrality
Eig
enve
ctor
C
entr
ality
1 2 3 45
6
7 8
9
1011
12
13 1415
16
17
1819
20
21
22
23
24
2526
27
28
29303132
33
34 35
36
37 3839 40
41
42
43
44
4546
47
48
49
50
51
52
53
5455
56
57
58
59 6061 62
63
64
65
66
67
6869
7071 72 7374
75
76
77
78 798081 8283
8485
86
87
88
89
90
91
92
93
94
9596 9798 99
100
101
102103
104
105
106107108109
110
111
112113 114 115116 117118119120121
122 123124
125
126
127128129
130
131132133134135
136
137138
139
140
141
142143 144145146
147
148149
150151152153 154
155
156 157158 159 160161
162
163164
165
166167
168
169
170
171
172
173
174
175176
177
178179180
181
182
183
184
185
186187188189 190191 19219319400
02
04
06
08
10
0 1000 2000 3000 4000 5000 6000
res
minus02
0
02
04
06
abs(res)
01
02
03
04
05
06
07
Drew Conway Mining and Analyzing Online Social Graph Data
Network basicsResearch design
Digging into dataLive demonstration
The study of relationshipsData representationAnalysis amp Visualization
Key Actor Plot
20
284447
50
53
58
79
102
141155 Network plot
Create positions for all of
the nodes w force directed
llt-layoutfruchtermanreingold(G
niter=500)
Set the nodesrsquo size relative to
their residual value
V(G)$sizelt-abs(res)10
Only display the labels of key
players
nodeslt-asvector(V(G)+1)
Key players defined as have a
residual value gt25
nodes[which(abs(res)lt25)]lt-NA
Save plot as PDF
pdf(lsquoactor_plotpdfrsquopointsize=7)
plot(Glayout=lvertexlabel=nodes
vertexlabeldist=025
vertexlabelcolor=lsquoredrsquoedgewidth=1)
devoff()
Drew Conway Mining and Analyzing Online Social Graph Data
Network basicsResearch design
Digging into dataLive demonstration
Edge contextNetwork data and ethics
Not all edges are created equally
I have spent a lot of time tonight describing network analysis as a way tounderstand relationships
I Depending on the source and context of the data these relationships can be interpreted as many differentthings
I Must consider the data generation process by which the edge was created
I With respect to online social graph data ask how do people use this service
Recent study showsTwitter is used as newsaggregation serviceme rarr interests rarr you
I Geography andhistory less important
I Networks may clusteraround communitiesof interest
Ties here are driven bypersonal contactme rarr offline rarr you
I Offline relationshipsare drivingldquofriendingrdquo
I Considerable amountof meta-data alreadyin FB what does thisadd
Google SocialGraph
Connecting all of Googlersquossocial sites togetherme rarr anything rarr you
I The combining ofmultiple platformsinto a singleldquonetworkrdquo makesanalysis andinterpretationdifficult
Drew Conway Mining and Analyzing Online Social Graph Data
Network basicsResearch design
Digging into dataLive demonstration
Edge contextNetwork data and ethics
Ethics and network data
Why should we be discusses ethics at a data analytics seminar
ldquoPeople have really gotten comfortablenot only sharing more information anddifferent kinds but more openly andwith more people That social norm isjust something that has evolved overtimerdquo
Mark Zuckerburg CEO Facebook
ldquoMoney is a terrible master but anexcellent servantrdquo
ldquoTherersquos a sucker born every minuterdquo
PT Barnum
In isolation the data we provide online about our relationships and preferencesare fairly innocuous Their summation however can illuminate aspects ofour lives that can be exploited for any number of ways
I Social networking services are moving previously private data to the public
I Simply because data is available does not mean that individuals want orrealize that it can be used to make inferences about who they are or theirlifestyles
I Analysts are caught in the middle There is no IRB on the Internet
Drew Conway Mining and Analyzing Online Social Graph Data
Network basicsResearch design
Digging into dataLive demonstration
Edge contextNetwork data and ethics
Examples of ethically questionable network analyses
MIT study predicts sexual orientation from Facebook friendsI Two undergraduate students start a project lsquoGaydarrsquoI Claim they can predict which men are homosexual simply based on their
friend structureI Problem Extremely private information questionable methods
PleaserobmecomI Combined information from fousquarecom and Twitter to publicize when
people were clearly not at homeI Used as a proof of concept to show the danger of publicizing localization
informationI Problem Useful for raising awareness hurtful to anyone who was actually
robbed
Project Grey GooseI Large group of former and current DoDIC analysts collaborated to study
the identity of hackersI I was personally involved in this projectI Using public web forum data built up user profiles and social networks to
attempt to identify hackers affiliated with Russian governmentI Problem While no names were published bordering on vigilantism
Drew Conway Mining and Analyzing Online Social Graph Data
Network basicsResearch design
Digging into dataLive demonstration
Getting dataFirst scrape and build
Where to get social graph data
Recently there has been an explosion of resources for scraping social graph data

Service          Data                                        API Docs
Twitter          Following(ers), @-replies, date/time/geo    http://apiwiki.twitter.com
Facebook         Friends, Wall Posts, date/time              http://developers.facebook.com/docs/api
Google           All SocialGraph relationships               http://code.google.com/apis/socialgraph
Foursquare       Friends, Check-ins                          http://foursquare.com/developers
Hunch            “Taste graph” recommendations               http://hunch.com/developers
New York Times   Congressional votes, campaign finance       http://developer.nytimes.com/docs

There is clearly no shortage of data
- Each service provides different relational context
- Data formats are generally JSON, Atom, XML, or some combination
- For a more extensive list of API resources, see the HackNY wiki of local startups
Building the social network among LiveJournal users
Using a “seed” user, we will build out a network
- In Python, use NetworkX, cjson, and other standard scientific libraries to parse the SocialGraph data
- Through a process called “k-snowball searching”: seed → friend → ··· → friend_k
  - Seed: imichaeldotorg.livejournal.com
  - k = 3
- Note the low value of k
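The k-snowball process described above can be sketched in a few lines. This is a minimal Python 3 illustration and not the code from the talk: the FRIENDS dictionary is made-up data standing in for API responses, and the names k_snowball and lookup are hypothetical. Each round looks up the friends of every node found in the previous round, so the network grows quickly even for small k.

```python
import networkx as nx

# Hypothetical friend lists standing in for API responses (not real data).
FRIENDS = {
    "seed": ["a", "b"],
    "a": ["seed", "c"],
    "b": ["d"],
    "c": ["e"],
    "d": [],
    "e": ["f"],
}


def lookup(user):
    """Return the users that `user` links to (hypothetical stand-in for an API call)."""
    return FRIENDS.get(user, [])


def k_snowball(seed, k, get_friends):
    """Build a directed network by expanding outward from `seed` for k rounds."""
    G = nx.DiGraph()
    G.add_node(seed)
    frontier = [seed]   # nodes whose friends are looked up in the current round
    visited = set()     # nodes that have already been looked up
    for _ in range(k):
        next_frontier = []
        for user in frontier:
            if user in visited:
                continue
            visited.add(user)
            for friend in get_friends(user):
                G.add_edge(user, friend)
                if friend not in visited:
                    next_frontier.append(friend)
        frontier = next_frontier
    return G


net = k_snowball("seed", 2, lookup)
print(net.number_of_nodes(), net.number_of_edges())
```

Keeping a visited set avoids looking up the same user twice, which matters when each lookup is a network request.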
The code, part 1
Loading the libraries and setting things up

    # Python 2 code, as presented in 2010 (python-cjson and urllib.urlopen)
    from cjson import *
    from urllib import *
    from networkx import *
    from time import *
    from scipy import array,unique

    if __name__ == "__main__":
        seed_url="http://imichaeldotorg.livejournal.com"
        sg=get_sg(seed_url)                # raw JSON for the seed
        net,newnodes=create_egonet(sg)     # egonet plus its neighbors
        info(net)

    Name: ['http://imichaeldotorg.livejournal.com']
    Type: DiGraph
    Number of nodes: 5
    Number of edges: 5
    Average in degree: 1.0
    Average out degree: 1.0
Get the JSON from SocialGraph

    def get_sg(seed_url):
        # Ask the SocialGraph lookup service for edges out (edo) and in (edi)
        sgapi_url="http://socialgraph.apis.google.com/lookup?q="+seed_url+"&edo=1&edi=1&fme=1&pretty=0"
        try:
            furl=urlopen(sgapi_url)
            fr=furl.read()
            furl.close()
            return fr
        except IOError:
            print "Could not connect to website"
            print sgapi_url
            return
Build egonet and snowball
Creating the egonet

    def create_egonet(s):
        try:
            raw=decode(s)
            G=DiGraph()
            n=raw['nodes']
            nk=n.keys()
            G.name=str(nk)
            pendants=[]
            # Connect every pair of distinct top-level nodes
            for a in range(0,len(nk)):
                for b in range(0,len(nk)):
                    if a!=b:
                        G.add_edge(nk[a],nk[b])
            # Add each ego's outgoing and incoming edges
            for k in nk:
                ego=n[k]
                ego_out=ego['nodes_referenced']
                for o in ego_out:
                    G.add_edge(k,o)
                    pendants.append(o)
                ego_in=ego['nodes_referenced_by']
                for i in ego_in:
                    G.add_edge(i,k)
                    pendants.append(i)
            pendants=array(pendants,dtype=str)
            pendants.flatten()
            pendants=unique(pendants)
            return G,pendants
        except DecodeError:
            raise   # handler body not shown on the slide; re-raising is a placeholder
        except KeyError:
            raise   # handler body not shown on the slide; re-raising is a placeholder
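Since cjson is a Python 2 library, the same idea as create_egonet above can be sketched in Python 3 with the standard json module. This is an illustrative sketch and not the author's code: the SAMPLE payload and its example.org URLs are made up, build_egonet is a hypothetical name, and only the key names nodes, nodes_referenced, and nodes_referenced_by are taken from the slide. It also skips the step that links top-level nodes to one another.

```python
import json

import networkx as nx

# Hypothetical payload; only the key names come from the slide code.
SAMPLE = json.dumps({
    "nodes": {
        "http://seed.example.org/": {
            "nodes_referenced": {
                "http://a.example.org/": {},
                "http://b.example.org/": {},
            },
            "nodes_referenced_by": {
                "http://c.example.org/": {},
            },
        }
    }
})


def build_egonet(payload):
    """Return (DiGraph, sorted list of neighbor nodes) for one lookup response."""
    raw = json.loads(payload)
    G = nx.DiGraph()
    pendants = set()
    for ego, record in raw["nodes"].items():
        for target in record.get("nodes_referenced", {}):
            G.add_edge(ego, target)    # ego links out to target
            pendants.add(target)
        for source in record.get("nodes_referenced_by", {}):
            G.add_edge(source, ego)    # source links in to ego
            pendants.add(source)
    return G, sorted(pendants)


G, pendants = build_egonet(SAMPLE)
print(G.number_of_nodes(), G.number_of_edges(), pendants)
```

Collecting the neighbors in a set plays the same role as the array and unique calls in the original: the result is the de-duplicated list of nodes to look up in the next snowball round.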
Rolling the snowball

    def snowball_round(G,seeds,myspace=False):
        t0=time()
        if myspace:
            seeds=get_myspace_url(seeds)
        sb_data=[]
        for s in range(0,len(seeds)):
            s_sg=get_sg(seeds[s])
            new_ego,pen=create_egonet(s_sg)
            for p in pen:
                sb_data.append(p)
            # Merge each new egonet into the growing network
            if s<1:
                sb_net=compose(G,new_ego)
            else:
                sb_net=compose(new_ego,sb_net)
            del new_ego
            # Progress report one fifth of the way through the seeds
            if s==round(len(seeds)*0.2):
                sb_net.name='20% complete'
                sb_net.info()
                print 'AT '+strftime('%m/%d/%Y %H:%M:%S', gmtime())
                print ''
            # More time keeping; probably a MUCH better way to do this
        sb_data=array(sb_data)
        sb_data.flatten()
        sb_data=unique(sb_data)
        sb_net.info()
        return sb_net,sb_data
Build the whole network

Step     Nodes   Edges   Mean Degree   Density
Seed         5       5           2.0   0.25
k = 2       75     115           3.0   0.02
k = 3     4938    8659           3.5   3.6 × 10^-4

- Our seed is abnormally isolated, with only four neighbors
- Large jump after first snowball
- Massive structural leap at k = 3
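The Mean Degree and Density columns can be checked from the node and edge counts alone. The slide does not state its formulas, so the two below are an assumption: mean total degree 2m/n and directed density m/(n(n-1)). They reproduce the Seed row exactly and the other rows approximately. The function names are hypothetical.

```python
def mean_degree(n_nodes, n_edges):
    """Mean total degree (in + out): every edge adds 2 to the degree sum."""
    return 2.0 * n_edges / n_nodes


def directed_density(n_nodes, n_edges):
    """Share of the n(n-1) possible directed edges that are present."""
    return n_edges / (n_nodes * (n_nodes - 1))


# Node and edge counts taken from the table above
steps = [("Seed", 5, 5), ("k = 2", 75, 115), ("k = 3", 4938, 8659)]
for step, n, m in steps:
    print(step, round(mean_degree(n, m), 2), directed_density(n, m))
```

Density falls by orders of magnitude while mean degree barely moves, because the number of possible edges grows with the square of the node count.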
The full network
To get a feeling for the size of the full network
Live demo
Live demonstration time
Network basicsResearch design
Digging into dataLive demonstration
The study of relationshipsData representationAnalysis amp Visualization
Networks and the study of relationships
Network theory use the language of graph theory
G = V Euarr
In the abstract this is very powerful we have a general purpose way torepresent any number of relations
I G=RoutersPacket TrafficEdges have non-binary values
I G=US Airports Commercial RoutesEdges can represent distance cost frequency etc
I G=New York City nerds Co-membership in MeetupsEdges are implied
While both nodes and edges are needed to have a network of any substance itis the edge (relationship) that will always be the primary focus of our analyses
Drew Conway Mining and Analyzing Online Social Graph Data
Network basicsResearch design
Digging into dataLive demonstration
The study of relationshipsData representationAnalysis amp Visualization
Why focus on the edge
Consider a very simple example of two people meeting for the first time
If we focus on the nodes we mightconsider this meeting creates adyad
We know however that people donot exist as isolates and thisassumption is ignoring all ofexogenous social structure theseindividuals brings with them
In reality the meeting may reveal a largedegree of shared structure
By expressing this meeting in terms of thenodes as a function of their edges we maygain a much richer understanding of thestructural dynamics
Drew Conway Mining and Analyzing Online Social Graph Data
Network basicsResearch design
Digging into dataLive demonstration
The study of relationshipsData representationAnalysis amp Visualization
Representing network data the canonical
Perhaps the most natural way to represent the relationships between N actorsis with an NxN matrix often referred to as a ldquosociomatrixrdquo
X1 X2 XN
X1 0 1 0X2 1 0 1
XN 0 1 0
I Very intuitive representation
I Can support directional and weighted edges
I Application of matrix algebraic operation foranalysis
As an introduction to representing relationship a matrix provides insight to anetwork by itself In practice however this representation has many limitations
I Unwieldy as network sizes scale up
I Most networks are sparse too many zeroes
I Difficult to publish and share
Fortunately there are many other options for representing network data
Drew Conway Mining and Analyzing Online Social Graph Data
Network basicsResearch design
Digging into dataLive demonstration
The study of relationshipsData representationAnalysis amp Visualization
Representing network data the practical
Remember it is all about the edges therefore more efficient representationswill be limited to edge data
Edge list
A text file with two columns (usuallydelimited by a space or tab) where firstcolumn in source and second is target
1 2
1 3
2 1
6 7
6 8
10 12
10 13
Adjacency list
Also a text file however here the firstcolumn is the source and allsubsequent entries are ldquoadjacentrdquonodes
1 2 3
2 1
6 7 8
10 12 13
While these are the most universal data formats network data representation isa bit of a cottage industry
I Pajek (net)I GraphML (gml)I GraphViz (dot)I and domain specific formats (eek)
Drew Conway Mining and Analyzing Online Social Graph Data
Network basicsResearch design
Digging into dataLive demonstration
The study of relationshipsData representationAnalysis amp Visualization
A bit on tools
The number of software suites and packages available for conducting socialnetwork analysis has exploded over the past ten years
I In general this software can be categorized in two waysI Type - many SNA tools are developed to be standalone applications while
others are language specific packagesI Intent - consumers and producer of SNA come from a wide range of
technical expertise andor need therefore there exist simple tools for datacollection and basic analysis as well as complex suites for advanced research
Standalone Apps Modules amp Packages
Basic- ORA (Windows) - libSNA (Python)- Analyst Notebook (Windows) - UrlNet (Python)- KrakPlot (Windows) - NodeXL (MS Excel)
Advanced- UCINet (Windows) - NetworkX (Python)- Pajek (Multi) - JUNG (Java)- Network Workbench (Multi) - igraph (Python R amp Ruby)
Many of the above tools have visualization components but several tools aredesigned specifically for visualization Graphviz NetDraw Tom Sawyer GephietcWhat I use
Drew Conway Mining and Analyzing Online Social Graph Data
Network basicsResearch design
Digging into dataLive demonstration
The study of relationshipsData representationAnalysis amp Visualization
Comparing two network metrics to find key actors
Often network analysis is used to identify key actors within a social group Toidentify these actors various centrality metrics can be computed based on anetworkrsquos structure
I Degree (number of connections)
I Betweenness (number of shortest paths an actor is on)
I Closeness (relative distance to all other actors)
I Eigenvector centrality (leading eigenvector of sociomatrix)
One method for using these metrics to identify key actors is to plot actorsrsquoscores for Eigenvector centrality versus Betweenness Theoretically thesemetrics should be approximately linear therefore any non-linear outliers will beof note
I An actor with very high betweenness but low EC may be a criticalgatekeeper to a central actor
I Likewise an actor with low betweenness but high EC may have uniqueaccess to central actors
Drew Conway Mining and Analyzing Online Social Graph Data
Network basicsResearch design
Digging into dataLive demonstration
The study of relationshipsData representationAnalysis amp Visualization
First visualize the data
Visualization can be the best first step in the analytical process
I This will give you a goodfeel for what is going onwith your relationships
I For large networks this isoften not possible
I For this example we willuse the main componentof the social networkcollected on drug users inHartford CT The networkhas 194 nodes and 273edges
I Here I am using theGUESS visualizer in NWBwith a Kamada-Kawailayout
Drew Conway Mining and Analyzing Online Social Graph Data
Network basicsResearch design
Digging into dataLive demonstration
The study of relationshipsData representationAnalysis amp Visualization
Finding Key Actors with R
The first steps are to load the data into memory and perform some basiccentrality analysis1
Load the data into igraph
library(igraph)Glt-readgraph(drug_maintxtformat=edgelist)Glt-asundirected(G) By default igraph inputs edgelist data as a directed graph In this step we undo this and assume that all relationships are reciprocal
Store metrics in new data frame
centlt-dataframe(bet=betweenness(G)eig=evcent(G)$vector) evcent returns lots of data associated with the EC but we only need the leading eigenvectorreslt-lm(eig~betdata=cent)$residualscentlt-transform(centres=res) We will use the residuals in the next step
1Weeks et al (2002) httpdxdoiorg101023A1015457400897
Drew Conway Mining and Analyzing Online Social Graph Data
Network basicsResearch design
Digging into dataLive demonstration
The study of relationshipsData representationAnalysis amp Visualization
Finding Key Actors with R
Plot the data
library(ggplot2)
We use ggplot2 to make things a
bit prettier
plt-ggplot(centaes(x=bety=eig
label=rownames(cent)colour=res
size=abs(res)))+xlab(Betweenness
Centrality)+ylab(Eigenvector
Centrality)
We use the residuals to color and
shape the points of our plot
making it easier to spot outliers
p+geom_text()+opts(title=Key Actor
Analysis for Hartford Drug Users)
We use the geom_text function to plot
the actorsrsquo IDrsquos rather than points
so we know who is who
Key Actor Analysis for Hartford Drug Users
Betweenness Centrality
Eig
enve
ctor
C
entr
ality
1 2 3 45
6
7 8
9
1011
12
13 1415
16
17
1819
20
21
22
23
24
2526
27
28
29303132
33
34 35
36
37 3839 40
41
42
43
44
4546
47
48
49
50
51
52
53
5455
56
57
58
59 6061 62
63
64
65
66
67
6869
7071 72 7374
75
76
77
78 798081 8283
8485
86
87
88
89
90
91
92
93
94
9596 9798 99
100
101
102103
104
105
106107108109
110
111
112113 114 115116 117118119120121
122 123124
125
126
127128129
130
131132133134135
136
137138
139
140
141
142143 144145146
147
148149
150151152153 154
155
156 157158 159 160161
162
163164
165
166167
168
169
170
171
172
173
174
175176
177
178179180
181
182
183
184
185
186187188189 190191 19219319400
02
04
06
08
10
0 1000 2000 3000 4000 5000 6000
res
minus02
0
02
04
06
abs(res)
01
02
03
04
05
06
07
Drew Conway Mining and Analyzing Online Social Graph Data
Network basicsResearch design
Digging into dataLive demonstration
The study of relationshipsData representationAnalysis amp Visualization
Key Actor Plot
20
284447
50
53
58
79
102
141155 Network plot
Create positions for all of
the nodes w force directed
llt-layoutfruchtermanreingold(G
niter=500)
Set the nodesrsquo size relative to
their residual value
V(G)$sizelt-abs(res)10
Only display the labels of key
players
nodeslt-asvector(V(G)+1)
Key players defined as have a
residual value gt25
nodes[which(abs(res)lt25)]lt-NA
Save plot as PDF
pdf(lsquoactor_plotpdfrsquopointsize=7)
plot(Glayout=lvertexlabel=nodes
vertexlabeldist=025
vertexlabelcolor=lsquoredrsquoedgewidth=1)
devoff()
Drew Conway Mining and Analyzing Online Social Graph Data
Network basicsResearch design
Digging into dataLive demonstration
Edge contextNetwork data and ethics
Not all edges are created equally
I have spent a lot of time tonight describing network analysis as a way tounderstand relationships
I Depending on the source and context of the data these relationships can be interpreted as many differentthings
I Must consider the data generation process by which the edge was created
I With respect to online social graph data ask how do people use this service
Recent study showsTwitter is used as newsaggregation serviceme rarr interests rarr you
I Geography andhistory less important
I Networks may clusteraround communitiesof interest
Ties here are driven bypersonal contactme rarr offline rarr you
I Offline relationshipsare drivingldquofriendingrdquo
I Considerable amountof meta-data alreadyin FB what does thisadd
Google SocialGraph
Connecting all of Googlersquossocial sites togetherme rarr anything rarr you
I The combining ofmultiple platformsinto a singleldquonetworkrdquo makesanalysis andinterpretationdifficult
Drew Conway Mining and Analyzing Online Social Graph Data
Network basicsResearch design
Digging into dataLive demonstration
Edge contextNetwork data and ethics
Ethics and network data
Why should we be discussing ethics at a data analytics seminar?

"People have really gotten comfortable not only sharing more information and different kinds, but more openly and with more people. That social norm is just something that has evolved over time."
Mark Zuckerberg, CEO, Facebook

"Money is a terrible master but an excellent servant."
"There's a sucker born every minute."
P.T. Barnum

In isolation, the data we provide online about our relationships and preferences are fairly innocuous. Their summation, however, can illuminate aspects of our lives that can be exploited in any number of ways.
- Social networking services are moving previously private data to the public
- Simply because data is available does not mean that individuals want, or realize, that it can be used to make inferences about who they are or their lifestyles
- Analysts are caught in the middle. There is no IRB on the Internet.
Examples of ethically questionable network analyses
MIT study predicts sexual orientation from Facebook friends
- Two undergraduate students start a project, 'Gaydar'
- Claim they can predict which men are homosexual simply based on their friend structure
- Problem: Extremely private information, questionable methods

PleaseRobMe.com
- Combined information from foursquare.com and Twitter to publicize when people were clearly not at home
- Used as a proof of concept to show the danger of publicizing localization information
- Problem: Useful for raising awareness, hurtful to anyone who was actually robbed

Project Grey Goose
- Large group of former and current DoD/IC analysts collaborated to study the identity of hackers
- I was personally involved in this project
- Using public web forum data, built up user profiles and social networks to attempt to identify hackers affiliated with the Russian government
- Problem: While no names were published, bordering on vigilantism
Where to get social graph data
Recently there has been an explosion of resources for scraping social graph data

Service          Data                                       API Docs
Twitter          Following(ers), @-replies, date/time/geo   http://apiwiki.twitter.com
Facebook         Friends, Wall Posts, date/time             http://developers.facebook.com/docs/api
Google           All SocialGraph relationships              http://code.google.com/apis/socialgraph
Foursquare       Friends, Check-ins                         http://foursquare.com/developers
Hunch            "Taste graph" recommendations              http://hunch.com/developers
New York Times   Congressional votes, campaign finance      http://developer.nytimes.com/docs

There is clearly no shortage of data
- Each service provides different relational context
- Data formats are generally JSON, Atom, XML, or some combination
- For a more extensive list of API resources, see the HackNY wiki of local startups
Building the social network among LiveJournal users
Using a "seed" user, we will build out a network
- In Python, use NetworkX, cjson, and other standard scientific libraries to parse the SocialGraph data
- Through a process called "k-snowball searching": seed → friend → ··· → friend_k
- Seed: imichaeldotorg.livejournal.com
- k = 3
- Note the low value of k
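The k-snowball idea can be tried without any network access. The sketch below is only an illustration, not the scraper used in the talk: the FRIENDS dictionary is made-up data, neighbors() is a hypothetical stand-in for an API lookup, and snowball() is a name chosen here.

```python
from collections import deque

# Toy stand-in for an API lookup: user -> users they link to (made-up data).
FRIENDS = {
    "seed": ["a", "b"],
    "a": ["c"],
    "b": ["c", "d"],
    "c": ["e"],
    "d": [],
    "e": ["f"],
    "f": [],
}

def neighbors(user):
    """Hypothetical replacement for an API request; returns outgoing ties."""
    return FRIENDS.get(user, [])

def snowball(seed, k):
    """Collect nodes and directed edges found within k rounds of expansion."""
    edges = set()
    seen = {seed}
    frontier = deque([(seed, 0)])
    while frontier:
        user, depth = frontier.popleft()
        if depth == k:
            continue  # users discovered in the last round are not expanded
        for friend in neighbors(user):
            edges.add((user, friend))
            if friend not in seen:
                seen.add(friend)
                frontier.append((friend, depth + 1))
    return seen, edges

nodes, edges = snowball("seed", k=2)
print(sorted(nodes))
print(len(edges))
```

Every extra round expands all users found in the previous one, so the number of lookups can grow very quickly with k.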
The code, part 1
Loading the libraries and setting things up
# Python 2: cjson supplies decode() and DecodeError, urllib supplies urlopen()
from cjson import *
from urllib import *
from networkx import *
from time import *
from scipy import array,unique

if __name__ == "__main__":
    seed_url="http://imichaeldotorg.livejournal.com"
    sg=get_sg(seed_url)
    net,newnodes=create_egonet(sg)
    info(net)

Output of info(net):
Name: ['http://imichaeldotorg.livejournal.com']
Type: DiGraph
Number of nodes: 5
Number of edges: 5
Average in degree: 1.0
Average out degree: 1.0

Get the JSON from SocialGraph
def get_sg(seed_url):
    # Build the lookup URL: edo/edi request outgoing and incoming edges
    sgapi_url="http://socialgraph.apis.google.com/lookup?q="+seed_url+"&edo=1&edi=1&fme=1&pretty=0"
    try:
        furl=urlopen(sgapi_url)
        fr=furl.read()
        furl.close()
        return fr
    except IOError:
        print "Could not connect to website"
        print sgapi_url
        return
Build egonet and snowball
Creating the egonet
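def create_egonet(s):
    try:
        raw=decode(s)
        G=DiGraph()
        pendants=[]
        n=raw['nodes']
        nk=n.keys()
        G.name=str(nk)
        # Link every pair of distinct top-level nodes returned for the seed
        for a in range(0,len(nk)):
            for b in range(0,len(nk)):
                if a!=b:
                    G.add_edge(nk[a],nk[b])
        for k in nk:
            ego=n[k]
            # Outgoing ties: ego -> referenced node
            ego_out=ego['nodes_referenced']
            for o in ego_out:
                G.add_edge(k,o)
                pendants.append(o)
            # Incoming ties: referencing node -> ego
            ego_in=ego['nodes_referenced_by']
            for i in ego_in:
                G.add_edge(i,k)
                pendants.append(i)
        # De-duplicate the pendant nodes; they seed the next snowball round
        pendants=array(pendants,dtype=str)
        pendants.flatten()
        pendants=unique(pendants)
        return G,pendants
    except DecodeError:
        # Handler body is not shown on the slide; re-raise so the failure is visible
        raise
    except KeyError:
        # Handler body is not shown on the slide; re-raise so the failure is visible
        raise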
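For readers without cjson or the old NetworkX, the same edge-building logic can be tried offline with the standard-library json module. This is a sketch under stated assumptions: the payload is invented, only the key names 'nodes', 'nodes_referenced' and 'nodes_referenced_by' come from the code above, egonet_edges is a hypothetical name, and the pairwise linking of top-level nodes is left out for brevity.

```python
import json

# Invented payload; only the key names mirror those used in create_egonet.
payload = json.dumps({
    "nodes": {
        "http://ego.example/": {
            "nodes_referenced": {"http://friend1.example/": {}},
            "nodes_referenced_by": {"http://friend2.example/": {}},
        }
    }
})

def egonet_edges(raw_json):
    """Return (edges, pendants) for one decoded lookup response."""
    nodes = json.loads(raw_json)["nodes"]
    edges, pendants = set(), set()
    for ego, info in nodes.items():
        for out in info.get("nodes_referenced", {}):
            edges.add((ego, out))   # ego points at this node
            pendants.add(out)
        for inc in info.get("nodes_referenced_by", {}):
            edges.add((inc, ego))   # this node points at ego
            pendants.add(inc)
    return edges, pendants

edges, pendants = egonet_edges(payload)
print(sorted(edges))
print(sorted(pendants))
```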
Rolling the snowball
def snowball_round(G,seeds,myspace=False):
    t0=time()
    if myspace:
        # get_myspace_url is defined elsewhere (not shown on the slide)
        seeds=get_myspace_url(seeds)
    sb_data=[]
    for s in range(0,len(seeds)):
        s_sg=get_sg(seeds[s])
        new_ego,pen=create_egonet(s_sg)
        for p in pen:
            sb_data.append(p)
        # Merge each new egonet into the growing snowball network
        if s<1:
            sb_net=compose(G,new_ego)
        else:
            sb_net=compose(new_ego,sb_net)
        del new_ego
        # Progress report once 20% of the seeds have been processed
        if s==round(len(seeds)*0.2):
            sb_net.name='20% complete'
            sb_net.info()
            print 'AT: '+strftime('%m/%d/%Y %H:%M:%S', gmtime())
            print ''
        # More time keeping; probably a MUCH better way to do this
    sb_data=array(sb_data)
    sb_data.flatten()
    sb_data=unique(sb_data)
    sb_net.info()
    return sb_net,sb_data
Build the whole network
Step    Nodes   Edges   Mean Degree   Density
Seed        5       5           2.0   0.25
k = 2      75     115           3.0   0.02
k = 3    4938    8659           3.5   3.6 × 10^-4

- Our seed is abnormally isolated, with only four neighbors
- Large jump after first snowball
- Massive structural leap at k = 3
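The last two columns of the table can be approximately recomputed from the node and edge counts. The sketch below assumes that mean degree is 2E/N (each edge adds one in-degree and one out-degree) and that density is E/(N(N-1)) for a directed graph. Both formulas are inferred from the numbers in the table rather than stated on the slide, and the function names are hypothetical.

```python
def mean_degree(n_nodes, n_edges):
    """Average total degree: every edge adds one out-degree and one in-degree."""
    return 2.0 * n_edges / n_nodes

def directed_density(n_nodes, n_edges):
    """Share of the N*(N-1) possible directed edges that are present."""
    return n_edges / float(n_nodes * (n_nodes - 1))

steps = [("Seed", 5, 5), ("k = 2", 75, 115), ("k = 3", 4938, 8659)]
for label, n, e in steps:
    print(label, round(mean_degree(n, e), 1), directed_density(n, e))
```

The density falls by orders of magnitude as the network grows, because the number of possible edges grows with the square of the node count while the observed edges grow far more slowly.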
The full network
To get a feeling for the size of the full network
Live demonstration time
Why focus on the edge?
Consider a very simple example of two people meeting for the first time.
If we focus on the nodes, we might consider that this meeting creates a dyad.
We know, however, that people do not exist as isolates, and this assumption ignores all of the exogenous social structure these individuals bring with them.
In reality, the meeting may reveal a large degree of shared structure.
By expressing this meeting in terms of the nodes as a function of their edges, we may gain a much richer understanding of the structural dynamics.
Representing network data: the canonical
Perhaps the most natural way to represent the relationships between N actors is with an N×N matrix, often referred to as a "sociomatrix"

       X1   X2   ...   XN
X1      0    1   ...    0
X2      1    0   ...    1
...
XN      0    1   ...    0

- Very intuitive representation
- Can support directional and weighted edges
- Application of matrix algebraic operations for analysis

As an introduction to representing relationships, a matrix provides insight into a network by itself. In practice, however, this representation has many limitations:
- Unwieldy as network sizes scale up
- Most networks are sparse: too many zeroes
- Difficult to publish and share

Fortunately, there are many other options for representing network data
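To make the sparsity point concrete, a sociomatrix can be built from a handful of ties in plain Python. This is a sketch on invented data; sociomatrix is a hypothetical helper name, not part of any library mentioned in the talk.

```python
def sociomatrix(actors, ties):
    """Build an N x N 0/1 matrix; entry [i][j] is 1 when actor i is tied to actor j."""
    index = {actor: pos for pos, actor in enumerate(actors)}
    matrix = [[0] * len(actors) for _ in actors]
    for source, target in ties:
        matrix[index[source]][index[target]] = 1
    return matrix

actors = ["X1", "X2", "X3", "X4"]
ties = [("X1", "X2"), ("X2", "X1"), ("X2", "X3"), ("X3", "X2")]  # made-up ties
m = sociomatrix(actors, ties)
for row in m:
    print(row)

# Count how many cells carry no information at all
zeros = sum(row.count(0) for row in m)
print("zero cells:", zeros, "of", len(actors) ** 2)
```

Even in this tiny case most cells are zero, and the share of empty cells tends to grow with network size, since the matrix has N² cells while real networks usually have far fewer ties.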
Representing network data: the practical
Remember, it is all about the edges; therefore, more efficient representations will be limited to edge data

Edge list
A text file with two columns (usually delimited by a space or tab), where the first column is the source and the second is the target
1 2
1 3
2 1
6 7
6 8
10 12
10 13

Adjacency list
Also a text file; however, here the first column is the source and all subsequent entries are "adjacent" nodes
1 2 3
2 1
6 7 8
10 12 13

While these are the most universal data formats, network data representation is a bit of a cottage industry:
- Pajek (.net)
- GraphML (.gml)
- GraphViz (.dot)
- and domain specific formats (eek!)
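The two text formats carry the same information, which a short conversion makes visible. The sketch below uses the edge list from the slide as input; the function names are hypothetical and only the standard library is used.

```python
from collections import OrderedDict

edge_list_text = """1 2
1 3
2 1
6 7
6 8
10 12
10 13"""

def edge_list_to_adjacency(text):
    """Group targets by source, keeping sources in first-seen order."""
    adjacency = OrderedDict()
    for line in text.splitlines():
        source, target = line.split()
        adjacency.setdefault(source, []).append(target)
    return adjacency

def adjacency_to_lines(adjacency):
    """Render one line per source: the source followed by its adjacent nodes."""
    return [" ".join([source] + targets) for source, targets in adjacency.items()]

adj = edge_list_to_adjacency(edge_list_text)
lines = adjacency_to_lines(adj)
print("\n".join(lines))
```

The adjacency form saves space by writing each source once, while the edge list is easier to extend with extra per-edge columns such as weights.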
A bit on tools
The number of software suites and packages available for conducting social network analysis has exploded over the past ten years
- In general, this software can be categorized in two ways:
  - Type - many SNA tools are developed to be standalone applications, while others are language-specific packages
  - Intent - consumers and producers of SNA come from a wide range of technical expertise and/or need; therefore, there exist simple tools for data collection and basic analysis, as well as complex suites for advanced research

           Standalone Apps              Modules & Packages
Basic      ORA (Windows)                libSNA (Python)
           Analyst Notebook (Windows)   UrlNet (Python)
           KrakPlot (Windows)           NodeXL (MS Excel)
Advanced   UCINet (Windows)             NetworkX (Python)
           Pajek (Multi)                JUNG (Java)
           Network Workbench (Multi)    igraph (Python, R & Ruby)

Many of the above tools have visualization components, but several tools are designed specifically for visualization: Graphviz, NetDraw, Tom Sawyer, Gephi, etc.
Comparing two network metrics to find key actors
Often, network analysis is used to identify key actors within a social group. To identify these actors, various centrality metrics can be computed based on a network's structure:
- Degree (number of connections)
- Betweenness (number of shortest paths an actor is on)
- Closeness (relative distance to all other actors)
- Eigenvector centrality (leading eigenvector of sociomatrix)

One method for using these metrics to identify key actors is to plot actors' scores for eigenvector centrality (EC) versus betweenness. Theoretically, these metrics should be approximately linearly related; therefore, any actors that fall far from that line will be of note
- An actor with very high betweenness but low EC may be a critical gatekeeper to a central actor
- Likewise, an actor with low betweenness but high EC may have unique access to central actors
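Two of these metrics can be sketched in plain Python on a small invented graph. Betweenness here follows Brandes' accumulation scheme and eigenvector centrality uses power iteration; neither implementation comes from the talk, which relies on igraph, and all names and data below are assumptions made for illustration.

```python
from collections import deque

# Invented undirected graph: two triangles joined through the bridge node "d".
GRAPH = {
    "a": ["b", "c"], "b": ["a", "c"], "c": ["a", "b", "d"],
    "d": ["c", "e"],
    "e": ["d", "f", "g"], "f": ["e", "g"], "g": ["e", "f"],
}

def betweenness(graph):
    """Brandes' algorithm for an unweighted, undirected graph."""
    score = dict.fromkeys(graph, 0.0)
    for source in graph:
        stack, preds = [], {v: [] for v in graph}
        sigma = dict.fromkeys(graph, 0.0)   # number of shortest paths from source
        dist = dict.fromkeys(graph, -1)
        sigma[source], dist[source] = 1.0, 0
        queue = deque([source])
        while queue:                        # breadth-first search from source
            v = queue.popleft()
            stack.append(v)
            for w in graph[v]:
                if dist[w] < 0:
                    dist[w] = dist[v] + 1
                    queue.append(w)
                if dist[w] == dist[v] + 1:
                    sigma[w] += sigma[v]
                    preds[w].append(v)
        delta = dict.fromkeys(graph, 0.0)
        while stack:                        # accumulate dependencies, farthest first
            w = stack.pop()
            for v in preds[w]:
                delta[v] += sigma[v] / sigma[w] * (1.0 + delta[w])
            if w != source:
                score[w] += delta[w]
    # Each unordered pair was counted from both endpoints
    return {v: s / 2.0 for v, s in score.items()}

def eigenvector(graph, rounds=200):
    """Power iteration on A + I (same eigenvectors as A), scaled so the maximum is 1."""
    x = dict.fromkeys(graph, 1.0)
    for _ in range(rounds):
        nxt = {v: x[v] + sum(x[w] for w in graph[v]) for v in graph}
        top = max(nxt.values())
        x = {v: val / top for v, val in nxt.items()}
    return x

bet = betweenness(GRAPH)
eig = eigenvector(GRAPH)
for node in sorted(GRAPH):
    print(node, bet[node], round(eig[node], 3))
```

In this toy graph the bridge node "d" has the highest betweenness but not the highest eigenvector score, which is the kind of mismatch the key actor comparison looks for.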
First, visualize the data
Visualization can be the best first step in the analytical process
- This will give you a good feel for what is going on with your relationships
- For large networks this is often not possible
- For this example we will use the main component of the social network collected on drug users in Hartford, CT. The network has 194 nodes and 273 edges
- Here I am using the GUESS visualizer in NWB with a Kamada-Kawai layout
Finding Key Actors with R
The first steps are to load the data into memory and perform some basic centrality analysis [1]

# Load the data into igraph
library(igraph)
G<-read.graph("drug_main.txt",format="edgelist")
# By default igraph inputs edgelist data as a directed graph. In this step we undo this and assume that all relationships are reciprocal
G<-as.undirected(G)

# Store metrics in new data frame
# evcent returns lots of data associated with the EC, but we only need the leading eigenvector
cent<-data.frame(bet=betweenness(G),eig=evcent(G)$vector)
res<-lm(eig~bet,data=cent)$residuals
# We will use the residuals in the next step
cent<-transform(cent,res=res)

[1] Weeks et al. (2002), http://dx.doi.org/10.1023/A:1015457400897
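The residual step may be easier to follow outside R. The sketch below fits eig = intercept + slope × bet by ordinary least squares in plain Python and flags actors whose residual is large in absolute value. The five score pairs are invented rather than taken from the Hartford data, the 0.25 cutoff is only illustrative, and ols_residuals is a hypothetical name.

```python
def ols_residuals(x, y):
    """Residuals of the least-squares line y = intercept + slope * x."""
    n = float(len(x))
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return [yi - (intercept + slope * xi) for xi, yi in zip(x, y)]

# Invented scores for five actors (betweenness, eigenvector centrality).
bet = [0.0, 10.0, 20.0, 30.0, 40.0]
eig = [0.10, 0.20, 0.90, 0.40, 0.50]

res = ols_residuals(bet, eig)
# Actors far from the fitted line are the candidates for a closer look
flagged = [i for i, r in enumerate(res) if abs(r) > 0.25]
print([round(r, 3) for r in res])
print(flagged)
```

A positive residual means more eigenvector centrality than betweenness alone would predict, and a negative residual means the opposite.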
Finding Key Actors with R
# Plot the data
library(ggplot2)
# We use ggplot2 to make things a bit prettier
p<-ggplot(cent,aes(x=bet,y=eig,label=rownames(cent),colour=res,
    size=abs(res)))+xlab("Betweenness Centrality")+ylab("Eigenvector Centrality")
# We use the residuals to color and shape the points of our plot, making it easier to spot outliers
p+geom_text()+opts(title="Key Actor Analysis for Hartford Drug Users")
# We use the geom_text function to plot the actors' IDs rather than points, so we know who is who
[Figure: "Key Actor Analysis for Hartford Drug Users". Eigenvector Centrality (y-axis, 0.0 to 1.0) plotted against Betweenness Centrality (x-axis, 0 to 6000). Each actor is drawn as its ID (1 to 194), colored by res and sized by abs(res).]
Key Actor Plot
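[Figure: network plot of the Hartford drug users network with only the key actors labeled: 20, 28, 44, 47, 50, 53, 58, 79, 102, 141, 155]
Network plot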
# Create positions for all of the nodes w/ force directed
l<-layout.fruchterman.reingold(G,niter=500)
# Set the nodes' size relative to their residual value
V(G)$size<-abs(res)*10
# Only display the labels of key players
nodes<-as.vector(V(G)+1)
# Key players defined as having a residual value >.25
nodes[which(abs(res)<.25)]<-NA
# Save plot as PDF
pdf('actor_plot.pdf',pointsize=7)
plot(G,layout=l,vertex.label=nodes,
    vertex.label.dist=0.25,
    vertex.label.color='red',edge.width=1)
dev.off()
Drew Conway Mining and Analyzing Online Social Graph Data
Network basicsResearch design
Digging into dataLive demonstration
Edge contextNetwork data and ethics
Not all edges are created equally
I have spent a lot of time tonight describing network analysis as a way tounderstand relationships
I Depending on the source and context of the data these relationships can be interpreted as many differentthings
I Must consider the data generation process by which the edge was created
I With respect to online social graph data ask how do people use this service
Recent study showsTwitter is used as newsaggregation serviceme rarr interests rarr you
I Geography andhistory less important
I Networks may clusteraround communitiesof interest
Ties here are driven bypersonal contactme rarr offline rarr you
I Offline relationshipsare drivingldquofriendingrdquo
I Considerable amountof meta-data alreadyin FB what does thisadd
Google SocialGraph
Connecting all of Googlersquossocial sites togetherme rarr anything rarr you
I The combining ofmultiple platformsinto a singleldquonetworkrdquo makesanalysis andinterpretationdifficult
Drew Conway Mining and Analyzing Online Social Graph Data
Network basicsResearch design
Digging into dataLive demonstration
Edge contextNetwork data and ethics
Ethics and network data
Why should we be discusses ethics at a data analytics seminar
ldquoPeople have really gotten comfortablenot only sharing more information anddifferent kinds but more openly andwith more people That social norm isjust something that has evolved overtimerdquo
Mark Zuckerburg CEO Facebook
ldquoMoney is a terrible master but anexcellent servantrdquo
ldquoTherersquos a sucker born every minuterdquo
PT Barnum
In isolation the data we provide online about our relationships and preferencesare fairly innocuous Their summation however can illuminate aspects ofour lives that can be exploited for any number of ways
I Social networking services are moving previously private data to the public
I Simply because data is available does not mean that individuals want orrealize that it can be used to make inferences about who they are or theirlifestyles
I Analysts are caught in the middle There is no IRB on the Internet
Examples of ethically questionable network analyses
MIT study predicts sexual orientation from Facebook friends
I Two undergraduate students start a project, 'Gaydar'
I Claim they can predict which men are homosexual simply based on their friend structure
I Problem: Extremely private information, questionable methods
Pleaserobme.com
I Combined information from foursquare.com and Twitter to publicize when people were clearly not at home
I Used as a proof of concept to show the danger of publicizing location information
I Problem: Useful for raising awareness, hurtful to anyone who was actually robbed
Project Grey Goose
I Large group of former and current DoD/IC analysts collaborated to study the identity of hackers
I I was personally involved in this project
I Using public web forum data, built up user profiles and social networks to attempt to identify hackers affiliated with the Russian government
I Problem: While no names were published, bordering on vigilantism
Where to get social graph data
Recently there has been an explosion of resources for scraping social graph data
Service          Data                                       API Docs
Twitter          Following(ers), @-replies, date/time/geo   http://apiwiki.twitter.com
Facebook         Friends, Wall Posts, date/time             http://developers.facebook.com/docs/api
Google           All SocialGraph relationships              http://code.google.com/apis/socialgraph
Foursquare       Friends, Check-ins                         http://foursquare.com/developers
Hunch            "Taste graph", recommendations             http://hunch.com/developers
New York Times   Congressional votes, campaign finance      http://developer.nytimes.com/docs
There is clearly no shortage of data
I Each service provides different relational context
I Data formats are generally JSON, Atom, XML, or some combination
I For a more extensive list of API resources, see the HackNY wiki of local startups
Building the social network among LiveJournal users
Using a "seed" user, we will build out a network
I In Python, use NetworkX, cjson, and other standard scientific libraries to parse the SocialGraph data
I Through a process called "k-snowball searching": seed → friend → ··· → friend_k
I Seed: imichaeldotorg.livejournal.com
I k = 3
I Note the low value of k
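The snowball idea can be sketched without any network access. The block below is a minimal illustration, not the code used in the talk: `k_snowball`, `toy_neighbors`, and the `TOY` friend lists are hypothetical names and data standing in for a real API lookup, and it assumes that each round simply follows every outgoing tie of the nodes found in the previous round.

```python
def k_snowball(seed, neighbors_of, k):
    """Expand outward from seed for k rounds, collecting nodes and directed edges."""
    edges = set()
    seen = {seed}
    frontier = [seed]
    for _ in range(k):
        next_frontier = []
        for node in frontier:
            for friend in neighbors_of(node):
                edges.add((node, friend))
                if friend not in seen:
                    # Newly discovered nodes become the seeds of the next round
                    seen.add(friend)
                    next_frontier.append(friend)
        frontier = next_frontier
    return seen, edges


# Hypothetical friend lists standing in for an API lookup
TOY = {
    "seed": ["a", "b"],
    "a": ["c"],
    "b": ["c", "d"],
    "c": ["e"],
    "d": [],
    "e": ["f"],
}


def toy_neighbors(node):
    return TOY.get(node, [])


nodes, edges = k_snowball("seed", toy_neighbors, 2)
```

Because every round multiplies the number of lookups by the typical number of friends, even small values of k can produce very large networks, which is why k is kept low here.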
The code, part 1
Loading the libraries and setting things up:
    from cjson import *      # decode() and DecodeError
    from urllib import *     # urlopen()
    from networkx import *   # DiGraph, compose(), info()
    from time import *       # time(), strftime(), gmtime()
    from scipy import array,unique

    if __name__ == "__main__":
        seed_url="http://imichaeldotorg.livejournal.com"
        sg=get_sg(seed_url)
        net,newnodes=create_egonet(sg)
        info(net)
Output:
    Name: ['http://imichaeldotorg.livejournal.com']
    Type: DiGraph
    Number of nodes: 5
    Number of edges: 5
    Average in degree: 1.0
    Average out degree: 1.0
Get the JSON from SocialGraph:
    def get_sg(seed_url):
        # Request edges out (edo), edges in (edi) and "me" links (fme) for the seed
        sgapi_url="http://socialgraph.apis.google.com/lookup?q="+seed_url+"&edo=1&edi=1&fme=1&pretty=0"
        try:
            furl=urlopen(sgapi_url)
            fr=furl.read()
            furl.close()
            return fr
        except IOError:
            print("Could not connect to website")
            print(sgapi_url)
            return None
Build egonet and snowball
Creating the egonet:
    def create_egonet(s):
        try:
            raw=decode(s)
            G=DiGraph()
            pendants=[]
            n=raw['nodes']
            nk=list(n.keys())
            G.name=str(nk)
            # Link every pair of distinct nodes returned for the seed
            for a in range(0,len(nk)):
                for b in range(0,len(nk)):
                    if a!=b:
                        G.add_edge(nk[a],nk[b])
            for k in nk:
                ego=n[k]
                # Outgoing ties: nodes this ego references
                ego_out=ego['nodes_referenced']
                for o in ego_out:
                    G.add_edge(k,o)
                    pendants.append(o)
                # Incoming ties: nodes that reference this ego
                ego_in=ego['nodes_referenced_by']
                for i in ego_in:
                    G.add_edge(i,k)
                    pendants.append(i)
            # De-duplicate the neighbors; they seed the next snowball round
            pendants=unique(array(pendants,dtype=str))
            return G,pendants
        except (DecodeError,KeyError):
            # The handler bodies are not shown on the slide; re-raise so failures stay visible
            raise
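To make the data structure concrete, here is a small standard-library sketch of the same walk over a response. It is an illustration under stated assumptions: the `SAMPLE` payload and the example.org URLs are hypothetical, shaped only after the three keys the slide code reads (`nodes`, `nodes_referenced`, `nodes_referenced_by`), and `egonet_edges` is a made-up helper that skips the pairwise linking of top-level nodes.

```python
import json

# Hypothetical response, shaped after the keys read by the slide code
SAMPLE = json.dumps({
    "nodes": {
        "http://example.org/alice": {
            "nodes_referenced": {"http://example.org/bob": {}},
            "nodes_referenced_by": {"http://example.org/carol": {}},
        }
    }
})


def egonet_edges(payload):
    """Return the directed edges around each ego and the sorted neighbor list."""
    raw = json.loads(payload)
    edges = set()
    pendants = set()
    for ego, info in raw["nodes"].items():
        # Outgoing ties: ego -> referenced node
        for target in info.get("nodes_referenced", {}):
            edges.add((ego, target))
            pendants.add(target)
        # Incoming ties: referencing node -> ego
        for source in info.get("nodes_referenced_by", {}):
            edges.add((source, ego))
            pendants.add(source)
    return edges, sorted(pendants)


edges, pendants = egonet_edges(SAMPLE)
```

The sorted neighbor list plays the role of `pendants` in the slide code: it is the set of new seeds for the next snowball round.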
Rolling the snowball:
    def snowball_round(G,seeds,myspace=False):
        t0=time()
        if myspace:
            # get_myspace_url() is not shown on the slide
            seeds=get_myspace_url(seeds)
        sb_data=[]
        for s in range(0,len(seeds)):
            s_sg=get_sg(seeds[s])
            new_ego,pen=create_egonet(s_sg)
            for p in pen:
                sb_data.append(p)
            # Merge each new egonet into the growing snowball network
            if s<1:
                sb_net=compose(G,new_ego)
            else:
                sb_net=compose(new_ego,sb_net)
            del new_ego
            # Crude progress report once 20% of the seeds are done
            if s==round(len(seeds)*0.2):
                sb_net.name='20% complete'
                sb_net.info()
                print('AT '+strftime('%m/%d/%Y %H:%M:%S',gmtime()))
                print('')
            # More time keeping; probably a MUCH better way to do this
        # De-duplicate the neighbors collected across all seeds
        sb_data=unique(array(sb_data))
        sb_net.info()
        return sb_net,sb_data
Build the whole network
Step     Nodes   Edges   Mean Degree   Density
Seed         5       5           2.0   0.25
k = 2       75     115           3.0   0.02
k = 3     4938    8659           3.5   3.6 × 10^-4
I Our seed is abnormally isolated, with only four neighbors
I Large jump after first snowball
I Massive structural leap at k = 3
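The summary columns can be approximately recomputed from the node and edge counts alone. The sketch below assumes, and the slide does not state, that mean degree is 2E/N and that density is the directed-graph form E/(N(N-1)); the function names are illustrative.

```python
def mean_degree(nodes, edges):
    """Average total degree: every edge adds one to two nodes' degrees."""
    return 2.0 * edges / nodes


def directed_density(nodes, edges):
    """Share of the N*(N-1) possible directed ties that are present."""
    return edges / float(nodes * (nodes - 1))


# (step, nodes, edges) as reported in the table
steps = [("Seed", 5, 5), ("k = 2", 75, 115), ("k = 3", 4938, 8659)]

summary = {}
for name, n, m in steps:
    summary[name] = (mean_degree(n, m), directed_density(n, m))
    print("%-6s mean degree = %.2f  density = %.2e" % ((name,) + summary[name]))
```

Under these assumptions the seed row reproduces exactly (2.0 and 0.25), and the later rows land close to the reported values. Density falls by roughly two orders of magnitude between k = 2 and k = 3 even though mean degree barely moves, because the number of possible ties grows with the square of the node count.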
The full network
To get a feeling for the size of the full network
Live demonstration time
Representing network data: the practical
Remember, it is all about the edges; therefore, more efficient representations will be limited to edge data
Edge list
A text file with two columns (usually delimited by a space or tab), where the first column is the source and the second is the target
1 2
1 3
2 1
6 7
6 8
10 12
10 13
Adjacency list
Also a text file; however, here the first column is the source and all subsequent entries are "adjacent" nodes
1 2 3
2 1
6 7 8
10 12 13
While these are the most universal data formats, network data representation is a bit of a cottage industry
I Pajek (.net)
I GraphML (.graphml)
I GraphViz (.dot)
I and domain-specific formats (eek!)
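The two text formats carry the same information, which a few lines of Python can show by converting one into the other. This is a minimal sketch using the edge list from the slide; `parse_edge_list` and `to_adjacency_lines` are illustrative names, not functions from any library.

```python
EDGE_LIST = """1 2
1 3
2 1
6 7
6 8
10 12
10 13"""


def parse_edge_list(text):
    """Read whitespace-delimited 'source target' pairs, skipping malformed lines."""
    edges = []
    for line in text.splitlines():
        parts = line.split()
        if len(parts) == 2:
            edges.append((parts[0], parts[1]))
    return edges


def to_adjacency_lines(edges):
    """Group targets under their source, keeping first-seen source order."""
    adjacency = {}
    order = []
    for source, target in edges:
        if source not in adjacency:
            adjacency[source] = []
            order.append(source)
        adjacency[source].append(target)
    return [" ".join([source] + adjacency[source]) for source in order]


adjacency_lines = to_adjacency_lines(parse_edge_list(EDGE_LIST))
print("\n".join(adjacency_lines))
```

The adjacency list is the more compact of the two when sources have many targets, while the edge list is easier to extend with per-edge columns such as weights.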
A bit on tools
The number of software suites and packages available for conducting social network analysis has exploded over the past ten years
I In general, this software can be categorized in two ways
  I Type - many SNA tools are developed to be standalone applications, while others are language-specific packages
  I Intent - consumers and producers of SNA come from a wide range of technical expertise and/or need; therefore, there exist simple tools for data collection and basic analysis as well as complex suites for advanced research
           Standalone Apps              Modules & Packages
Basic      ORA (Windows)                libSNA (Python)
           Analyst Notebook (Windows)   UrlNet (Python)
           KrakPlot (Windows)           NodeXL (MS Excel)
Advanced   UCINet (Windows)             NetworkX (Python)
           Pajek (Multi)                JUNG (Java)
           Network Workbench (Multi)    igraph (Python, R & Ruby)
Many of the above tools have visualization components, but several tools are designed specifically for visualization: Graphviz, NetDraw, Tom Sawyer, Gephi, etc.
Slide legend: "What I use"
Comparing two network metrics to find key actors
Often, network analysis is used to identify key actors within a social group. To identify these actors, various centrality metrics can be computed based on a network's structure
I Degree (number of connections)
I Betweenness (number of shortest paths an actor is on)
I Closeness (relative distance to all other actors)
I Eigenvector centrality (leading eigenvector of sociomatrix)
One method for using these metrics to identify key actors is to plot actors' scores for eigenvector centrality (EC) versus betweenness. Theoretically, these metrics should be approximately linearly related; therefore, any outliers from that linear trend will be of note
I An actor with very high betweenness but low EC may be a critical gatekeeper to a central actor
I Likewise, an actor with low betweenness but high EC may have unique access to central actors
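The comparison can be sketched end to end in plain Python. The block below is a standard-library stand-in, not the method used in the talk: it computes betweenness with Brandes' algorithm, eigenvector centrality by power iteration (scaled so the largest score is 1), and the residuals of a least-squares line of EC on betweenness. The toy graph of two triangles joined through a bridge node "d" is hypothetical, as are all function names.

```python
from collections import deque


def betweenness(adj):
    """Brandes' algorithm for an unweighted, undirected graph."""
    score = dict.fromkeys(adj, 0.0)
    for s in adj:
        stack, preds = [], {v: [] for v in adj}
        sigma = dict.fromkeys(adj, 0.0)
        sigma[s] = 1.0
        dist = dict.fromkeys(adj, -1)
        dist[s] = 0
        queue = deque([s])
        while queue:  # breadth-first search counting shortest paths
            v = queue.popleft()
            stack.append(v)
            for w in adj[v]:
                if dist[w] < 0:
                    dist[w] = dist[v] + 1
                    queue.append(w)
                if dist[w] == dist[v] + 1:
                    sigma[w] += sigma[v]
                    preds[w].append(v)
        delta = dict.fromkeys(adj, 0.0)
        while stack:  # accumulate dependencies back toward the source
            w = stack.pop()
            for v in preds[w]:
                delta[v] += sigma[v] / sigma[w] * (1.0 + delta[w])
            if w != s:
                score[w] += delta[w]
    # Each undirected pair was counted once from each endpoint
    return {v: b / 2.0 for v, b in score.items()}


def eigenvector_centrality(adj, iterations=200):
    """Power iteration on A + I, rescaled so the top score is 1."""
    x = dict.fromkeys(adj, 1.0)
    for _ in range(iterations):
        nxt = {v: x[v] + sum(x[w] for w in adj[v]) for v in adj}
        top = max(nxt.values())
        x = {v: val / top for v, val in nxt.items()}
    return x


def linear_residuals(xs, ys):
    """Residuals of the least-squares line y = a + b*x."""
    n = float(len(xs))
    mx, my = sum(xs) / n, sum(ys) / n
    slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sum((x - mx) ** 2 for x in xs)
    intercept = my - slope * mx
    return [y - (intercept + slope * x) for x, y in zip(xs, ys)]


# Hypothetical toy network: triangles a-b-c and e-f-g joined through d
ADJ = {
    "a": ["b", "c"], "b": ["a", "c"], "c": ["a", "b", "d"],
    "d": ["c", "e"],
    "e": ["d", "f", "g"], "f": ["e", "g"], "g": ["e", "f"],
}

actors = sorted(ADJ)
bet = betweenness(ADJ)
eig = eigenvector_centrality(ADJ)
res = dict(zip(actors, linear_residuals([bet[v] for v in actors], [eig[v] for v in actors])))
key_actor = max(actors, key=lambda v: abs(res[v]))
```

In this toy network the bridge node has the largest absolute residual, and its sign is negative: more betweenness than its eigenvector centrality would predict, which is the gatekeeper pattern described above.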
First, visualize the data
Visualization can be the best first step in the analytical process
I This will give you a good feel for what is going on with your relationships
I For large networks, this is often not possible
I For this example, we will use the main component of the social network collected on drug users in Hartford, CT. The network has 194 nodes and 273 edges
I Here I am using the GUESS visualizer in NWB with a Kamada-Kawai layout
Finding Key Actors with R
The first steps are to load the data into memory and perform some basic centrality analysis [1]
Load the data into igraph:
    library(igraph)
    G<-read.graph("drug_main.txt",format="edgelist")
    # By default igraph inputs edgelist data as a directed graph. In this step we undo this and assume that all relationships are reciprocal
    G<-as.undirected(G)
Store metrics in new data frame:
    cent<-data.frame(bet=betweenness(G),eig=evcent(G)$vector)
    # evcent returns lots of data associated with the EC, but we only need the leading eigenvector
    res<-lm(eig~bet,data=cent)$residuals
    cent<-transform(cent,res=res)
    # We will use the residuals in the next step
[1] Weeks et al. (2002), http://dx.doi.org/10.1023/A:1015457400897
Finding Key Actors with R
Plot the data:
    library(ggplot2)
    # We use ggplot2 to make things a bit prettier
    p<-ggplot(cent,aes(x=bet,y=eig,label=rownames(cent),colour=res,size=abs(res)))+
        xlab("Betweenness Centrality")+ylab("Eigenvector Centrality")
    # We use the residuals to color and size the points of our plot, making it easier to spot outliers
    # ggtitle() stands in for the opts(title=...) call on the slide, which later ggplot2 releases removed
    p+geom_text()+ggtitle("Key Actor Analysis for Hartford Drug Users")
    # We use the geom_text function to plot the actors' IDs rather than points so we know who is who
[Figure: "Key Actor Analysis for Hartford Drug Users". Eigenvector Centrality (y-axis, 0.0 to 1.0) versus Betweenness Centrality (x-axis, 0 to 6000); actor IDs 1 to 194 plotted as text labels, colored by res and sized by abs(res).]
Key Actor Plot
[Figure: network plot with key actors labeled by ID: 20, 28, 44, 47, 50, 53, 58, 79, 102, 141, 155]
Network plot
    # Create positions for all of the nodes w/ force directed
    l<-layout.fruchterman.reingold(G,niter=500)
    # Set the nodes' size relative to their residual value
    V(G)$size<-abs(res)*10
    # Only display the labels of key players
    # (igraph releases before 0.6 numbered vertices from 0, hence the +1)
    nodes<-as.vector(V(G)+1)
    # Key players defined as having a residual value >.25
    nodes[which(abs(res)<.25)]<-NA
    # Save plot as PDF
    pdf('actor_plot.pdf',pointsize=7)
    plot(G,layout=l,vertex.label=nodes,vertex.label.dist=0.25,vertex.label.color='red',edge.width=1)
    dev.off()
Network basicsResearch design
Digging into dataLive demonstration
The study of relationshipsData representationAnalysis amp Visualization
Comparing two network metrics to find key actors
Often network analysis is used to identify key actors within a social group. To identify these actors, various centrality metrics can be computed based on a network's structure.
- Degree (number of connections)
- Betweenness (number of shortest paths an actor is on)
- Closeness (relative distance to all other actors)
- Eigenvector centrality (leading eigenvector of sociomatrix)
One method for using these metrics to identify key actors is to plot actors' scores for Eigenvector centrality (EC) versus Betweenness. Theoretically, these metrics should be approximately linearly related; therefore, any outliers from that linear trend will be of note.
- An actor with very high betweenness but low EC may be a critical gatekeeper to a central actor
- Likewise, an actor with low betweenness but high EC may have unique access to central actors
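As a rough illustration of three of the metrics listed above, here is a minimal pure-Python sketch on a made-up five-node undirected graph. The graph, the node names, and the helper functions (`degree`, `closeness`, `betweenness`) are hypothetical and do not come from the talk; in practice a library such as NetworkX or igraph would do this work. Eigenvector centrality is left out to keep the sketch short.

```python
from collections import deque

# Hypothetical toy graph (undirected), stored as an adjacency dict.
# Shape: a - b - c - d, plus e attached to c.
GRAPH = {
    "a": {"b"},
    "b": {"a", "c"},
    "c": {"b", "d", "e"},
    "d": {"c"},
    "e": {"c"},
}


def degree(graph):
    """Number of connections per node."""
    return {v: len(nbrs) for v, nbrs in graph.items()}


def bfs_distances(graph, source):
    """Shortest-path length (in hops) from source to every reachable node."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        v = queue.popleft()
        for w in graph[v]:
            if w not in dist:
                dist[w] = dist[v] + 1
                queue.append(w)
    return dist


def closeness(graph):
    """(n - 1) divided by the sum of distances to all other nodes.

    Assumes the graph is connected.
    """
    n = len(graph)
    return {v: (n - 1) / sum(bfs_distances(graph, v).values()) for v in graph}


def betweenness(graph):
    """Unnormalized betweenness via Brandes' algorithm (unweighted, undirected)."""
    score = {v: 0.0 for v in graph}
    for s in graph:
        stack, preds = [], {v: [] for v in graph}
        sigma = {v: 0 for v in graph}  # number of shortest paths from s
        sigma[s] = 1
        dist = {s: 0}
        queue = deque([s])
        while queue:
            v = queue.popleft()
            stack.append(v)
            for w in graph[v]:
                if w not in dist:
                    dist[w] = dist[v] + 1
                    queue.append(w)
                if dist[w] == dist[v] + 1:
                    sigma[w] += sigma[v]
                    preds[w].append(v)
        delta = {v: 0.0 for v in graph}
        while stack:
            w = stack.pop()
            for v in preds[w]:
                delta[v] += sigma[v] / sigma[w] * (1 + delta[w])
            if w != s:
                score[w] += delta[w]
    # Each undirected pair was counted once from each endpoint.
    return {v: b / 2 for v, b in score.items()}
```

In this toy graph node c has the most connections and sits on the most shortest paths, while b has few connections but still bridges a to everyone else.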
First visualize the data
Visualization can be the best first step in the analytical process
- This will give you a good feel for what is going on with your relationships
- For large networks this is often not possible
- For this example we will use the main component of the social network collected on drug users in Hartford, CT. The network has 194 nodes and 273 edges
- Here I am using the GUESS visualizer in NWB with a Kamada-Kawai layout
Finding Key Actors with R
The first steps are to load the data into memory and perform some basic centrality analysis [1]
Load the data into igraph
    library(igraph)
    G <- read.graph("drug_main.txt", format="edgelist")
    # By default igraph inputs edgelist data as a directed graph. In this step
    # we undo this and assume that all relationships are reciprocal
    G <- as.undirected(G)
Store metrics in new data frame
    # evcent returns lots of data associated with the EC, but we only need
    # the leading eigenvector
    cent <- data.frame(bet=betweenness(G), eig=evcent(G)$vector)
    # We will use the residuals in the next step
    res <- lm(eig~bet, data=cent)$residuals
    cent <- transform(cent, res=res)
[1] Weeks et al. (2002), http://dx.doi.org/10.1023/A:1015457400897
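The residual step can also be sketched without R. The following is a minimal pure-Python version of a simple least-squares fit of eigenvector centrality on betweenness; the function name `ols_residuals` and the five scores are hypothetical and are not the Hartford data.

```python
def ols_residuals(x, y):
    """Residuals of the least-squares line y = intercept + slope * x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return [yi - (intercept + slope * xi) for xi, yi in zip(x, y)]


# Hypothetical scores for five actors (made up for illustration).
bet = [0.0, 10.0, 20.0, 30.0, 40.0]
eig = [0.1, 0.2, 0.9, 0.4, 0.5]

res = ols_residuals(bet, eig)
# Index of the actor whose EC sits furthest from the fitted line.
outlier = max(range(len(res)), key=lambda i: abs(res[i]))
```

Here the third actor has a much higher eigenvector centrality than its betweenness would suggest, so it carries the largest absolute residual.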
Finding Key Actors with R
Plot the data
    library(ggplot2)
    # We use ggplot2 to make things a bit prettier
    p <- ggplot(cent, aes(x=bet, y=eig, label=rownames(cent), colour=res,
        size=abs(res))) + xlab("Betweenness Centrality") +
        ylab("Eigenvector Centrality")
    # We use the residuals to color and size the points of our plot,
    # making it easier to spot outliers
    # (the slide used opts(title=...), which newer ggplot2 releases
    # replace with ggtitle())
    p + geom_text() + ggtitle("Key Actor Analysis for Hartford Drug Users")
    # We use the geom_text function to plot the actors' IDs rather than
    # points so we know who is who
[Figure: "Key Actor Analysis for Hartford Drug Users". Actor IDs 1 to 194 plotted by Betweenness Centrality (x-axis, 0 to 6000) against Eigenvector Centrality (y-axis, 0.0 to 1.0), colored by res and sized by abs(res).]
Key Actor Plot
[Figure: network plot with the key actors labeled: 20, 28, 44, 47, 50, 53, 58, 79, 102, 141, 155]
Network plot
    # Create positions for all of the nodes w/ force directed
    l <- layout.fruchterman.reingold(G, niter=500)
    # Set the nodes' size relative to their residual value
    V(G)$size <- abs(res)*10
    # Only display the labels of key players
    # (+1 assumes 0-based vertex ids, as in igraph at the time)
    nodes <- as.vector(V(G)+1)
    # Key players defined as having a residual value > .25
    nodes[which(abs(res)<.25)] <- NA
    # Save plot as PDF
    pdf('actor_plot.pdf', pointsize=7)
    plot(G, layout=l, vertex.label=nodes, vertex.label.dist=0.25,
        vertex.label.color='red', edge.width=1)
    dev.off()
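The labeling rule in the plot code (show an ID only when the absolute residual reaches .25, and scale node size by the absolute residual) can be sketched in a few lines of Python. The function names and the residual values below are hypothetical, chosen only to show the thresholding.

```python
def key_actor_labels(residuals, threshold=0.25):
    """1-based actor ID where |residual| >= threshold, otherwise None."""
    return [i + 1 if abs(r) >= threshold else None
            for i, r in enumerate(residuals)]


def node_sizes(residuals, scale=10):
    """Node size proportional to the absolute residual."""
    return [abs(r) * scale for r in residuals]


# Hypothetical residuals for five actors.
res = [-0.12, 0.31, 0.02, -0.27, 0.48]
labels = key_actor_labels(res)
sizes = node_sizes(res)
```

Only actors 2, 4, and 5 would be labeled here; the rest are drawn without a label.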
Not all edges are created equally
I have spent a lot of time tonight describing network analysis as a way to understand relationships
- Depending on the source and context of the data, these relationships can be interpreted as many different things
- Must consider the data generation process by which the edge was created
- With respect to online social graph data, ask: how do people use this service?
Twitter
Recent study shows Twitter is used as a news aggregation service: me → interests → you
- Geography and history less important
- Networks may cluster around communities of interest
Facebook
Ties here are driven by personal contact: me → offline → you
- Offline relationships are driving "friending"
- Considerable amount of meta-data already in FB; what does this add?
Google SocialGraph
Connecting all of Google's social sites together: me → anything → you
- The combining of multiple platforms into a single "network" makes analysis and interpretation difficult
Ethics and network data
Why should we be discussing ethics at a data analytics seminar?
"People have really gotten comfortable not only sharing more information and different kinds, but more openly and with more people. That social norm is just something that has evolved over time."
Mark Zuckerberg, CEO, Facebook
"Money is a terrible master but an excellent servant"
"There's a sucker born every minute"
P.T. Barnum
In isolation, the data we provide online about our relationships and preferences are fairly innocuous. Their summation, however, can illuminate aspects of our lives that can be exploited in any number of ways
- Social networking services are moving previously private data to the public
- Simply because data is available does not mean that individuals want, or realize, that it can be used to make inferences about who they are or their lifestyles
- Analysts are caught in the middle. There is no IRB on the Internet
Examples of ethically questionable network analyses
MIT study predicts sexual orientation from Facebook friends
- Two undergraduate students start a project, 'Gaydar'
- Claim they can predict which men are homosexual simply based on their friend structure
- Problem: Extremely private information, questionable methods
PleaseRobMe.com
- Combined information from foursquare.com and Twitter to publicize when people were clearly not at home
- Used as a proof of concept to show the danger of publicizing location information
- Problem: Useful for raising awareness, hurtful to anyone who was actually robbed
Project Grey Goose
- Large group of former and current DoD/IC analysts collaborated to study the identity of hackers
- I was personally involved in this project
- Using public web forum data, built up user profiles and social networks to attempt to identify hackers affiliated with the Russian government
- Problem: While no names were published, bordering on vigilantism
Where to get social graph data
Recently there has been an explosion of resources for scraping social graph data
Service          Data                                        API Docs
Twitter          Following(ers), @-replies, date/time/geo    http://apiwiki.twitter.com
Facebook         Friends, Wall Posts, date/time              http://developers.facebook.com/docs/api
Google           All SocialGraph relationships               http://code.google.com/apis/socialgraph
Foursquare       Friends, Check-ins                          http://foursquare.com/developers
Hunch            "Taste graph", recommendations              http://hunch.com/developers
New York Times   Congressional votes, campaign finance       http://developer.nytimes.com/docs
There is clearly no shortage of data
- Each service provides different relational context
- Data formats are generally JSON, Atom, XML, or some combination
- For a more extensive list of API resources, see the HackNY wiki of local startups
Building the social network among LiveJournal users
Using a "seed" user we will build out a network
- In Python, use NetworkX, cjson, and other standard scientific libraries to parse the SocialGraph data
- Through a process called "k-snowball searching": seed → friend → ··· → friend_k
- Seed: imichaeldotorg.livejournal.com
- k = 3
- Note the low value of k
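The snowball idea above can be sketched independently of any API: start from the seed, look up its friends, then look up their friends, for k rounds. This is a minimal sketch under stated assumptions; the `snowball` function, the `neighbors` callback, and the `FRIENDS` data are all hypothetical stand-ins for real lookups.

```python
def snowball(neighbors, seed, k):
    """Collect every node within k hops of seed, one round at a time.

    neighbors: callable returning the nodes a given node points to.
    Returns (nodes, edges) discovered after k rounds.
    """
    nodes = {seed}
    edges = set()
    frontier = [seed]
    for _ in range(k):
        next_frontier = []
        for node in frontier:
            for friend in neighbors(node):
                edges.add((node, friend))
                if friend not in nodes:
                    nodes.add(friend)
                    next_frontier.append(friend)
        frontier = next_frontier
    return nodes, edges


# Hypothetical friendship data standing in for API lookups.
FRIENDS = {
    "seed": ["ann", "bob"],
    "ann": ["cat"],
    "bob": ["cat", "dan"],
    "cat": ["eve"],
    "dan": [],
    "eve": ["fay"],
    "fay": [],
}

nodes, edges = snowball(lambda n: FRIENDS.get(n, []), "seed", k=2)
```

With k = 2 the search reaches cat and dan but stops before eve. Each extra round multiplies the number of lookups, which is one reason to keep k low.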
The code, part 1
Loading the libraries and setting things up
    # Written for Python 2 (print statements, cjson, urllib.urlopen)
    from cjson import *
    from urllib import *
    from networkx import *
    from time import *
    from scipy import array, unique

    if __name__ == "__main__":
        seed_url = "http://imichaeldotorg.livejournal.com"
        sg = get_sg(seed_url)
        net, newnodes = create_egonet(sg)
        info(net)

    # Output of info(net):
    # Name: ['http://imichaeldotorg.livejournal.com']
    # Type: DiGraph
    # Number of nodes: 5
    # Number of edges: 5
    # Average in degree: 1.0
    # Average out degree: 1.0
Get the JSON from SocialGraph
    def get_sg(seed_url):
        # Build the lookup URL for the seed and return the raw response body
        sgapi_url = ("http://socialgraph.apis.google.com/lookup?q=" + seed_url +
                     "&edo=1&edi=1&fme=1&pretty=0")
        try:
            furl = urlopen(sgapi_url)
            fr = furl.read()
            furl.close()
            return fr
        except IOError:
            print "Could not connect to website"
            print sgapi_url
            return
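One detail worth illustrating is the query string: the seed is itself a URL, so it is safer to percent-encode it than to concatenate it. The sketch below uses the Python 3 standard library rather than the Python 2 modules on the slide, and it only builds the URL without sending a request (to my knowledge the SocialGraph API has since been retired). The name `build_lookup_url` is hypothetical; the endpoint and parameters are the ones shown in the talk.

```python
from urllib.parse import parse_qs, urlencode, urlsplit

# Endpoint as it appears in the talk; assumed here to be no longer reachable.
SG_LOOKUP = "http://socialgraph.apis.google.com/lookup"


def build_lookup_url(seed_url):
    """Return the lookup URL for seed_url with a percent-encoded query string."""
    params = {"q": seed_url, "edo": 1, "edi": 1, "fme": 1, "pretty": 0}
    return SG_LOOKUP + "?" + urlencode(params)


url = build_lookup_url("http://imichaeldotorg.livejournal.com")
# Round-trip the query string to confirm the seed survives encoding.
query = parse_qs(urlsplit(url).query)
```

Encoding matters most when a seed URL contains characters such as `&` or `?`, which would otherwise be read as part of the lookup's own query.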
Build egonet and snowball
Creating the egonet
    def create_egonet(s):
        try:
            raw = decode(s)
            G = DiGraph()
            n = raw['nodes']
            nk = n.keys()
            G.name = str(nk)
            pendants = []
            # Link every pair of distinct keys returned for the seed
            for a in range(0, len(nk)):
                for b in range(0, len(nk)):
                    if a != b:
                        G.add_edge(nk[a], nk[b])
            for k in nk:
                ego = n[k]
                # Outbound ties: ego -> referenced node
                ego_out = ego['nodes_referenced']
                for o in ego_out:
                    G.add_edge(k, o)
                    pendants.append(o)
                # Inbound ties: referencing node -> ego
                ego_in = ego['nodes_referenced_by']
                for i in ego_in:
                    G.add_edge(i, k)
                    pendants.append(i)
            # De-duplicate the nodes to visit in the next round
            pendants = array(pendants, dtype=str)
            pendants = pendants.flatten()
            pendants = unique(pendants)
            return G, pendants
        except DecodeError:
            pass  # placeholder: the handler body is not shown on the slide
        except KeyError:
            pass  # placeholder: the handler body is not shown on the slide
Rolling the snowball
    def snowball_round(G, seeds, myspace=False):
        t0 = time()
        if myspace:
            seeds = get_myspace_url(seeds)
        sb_data = []
        for s in range(0, len(seeds)):
            s_sg = get_sg(seeds[s])
            new_ego, pen = create_egonet(s_sg)
            for p in pen:
                sb_data.append(p)
            # Merge each new egonet into the growing network
            if s < 1:
                sb_net = compose(G, new_ego)
            else:
                sb_net = compose(new_ego, sb_net)
            del new_ego
            if s == round(len(seeds)*0.2):
                sb_net.name = '20% complete'
                sb_net.info()
                print 'AT ' + strftime('%m/%d/%Y %H:%M:%S', gmtime())
                print ''
            # More time keeping... probably a MUCH better way to do this
        sb_data = array(sb_data)
        sb_data = sb_data.flatten()
        sb_data = unique(sb_data)
        sb_net.info()
        return sb_net, sb_data
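To make the parsing step concrete without cjson or NetworkX, here is a minimal sketch that turns a SocialGraph-style JSON string into a directed edge list using only the standard `json` module. The sample payload, the example.com URLs, and the function name `egonet_edges` are hypothetical; only the keys `nodes`, `nodes_referenced`, and `nodes_referenced_by` come from the code above. Unlike `create_egonet`, it does not link multiple keys returned for the same seed.

```python
import json

# Hypothetical response using the keys that create_egonet reads.
SAMPLE = json.dumps({
    "nodes": {
        "http://seed.example.com/": {
            "nodes_referenced": {"http://friend1.example.com/": {}},
            "nodes_referenced_by": {"http://friend2.example.com/": {}},
        }
    }
})


def egonet_edges(payload):
    """Return (edges, pendants) for a SocialGraph-style JSON string."""
    raw = json.loads(payload)
    edges = set()
    pendants = set()
    for ego, data in raw.get("nodes", {}).items():
        # Outbound ties: ego -> referenced node
        for out in data.get("nodes_referenced", {}):
            edges.add((ego, out))
            pendants.add(out)
        # Inbound ties: referencing node -> ego
        for inc in data.get("nodes_referenced_by", {}):
            edges.add((inc, ego))
            pendants.add(inc)
    return sorted(edges), sorted(pendants)


edges, pendants = egonet_edges(SAMPLE)
```

The pendants are the nodes discovered in this round, which is what the next snowball round would use as its seeds.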
Build the whole network
Step     Nodes   Edges   Mean Degree   Density
Seed     5       5       2.0           0.25
k = 2    75      115     3.0           0.02
k = 3    4938    8659    3.5           3.6 × 10^-4
- Our seed is abnormally isolated, with only four neighbors
- Large jump after first snowball
- Massive structural leap at k = 3
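The two summary columns can be checked from the node and edge counts alone, assuming the table uses the usual directed-graph definitions: mean degree 2m / n (in plus out) and density m / (n(n - 1)). That assumption is mine, not stated on the slide. The formulas reproduce the density column and the Seed and k = 3 mean degrees after rounding, while the k = 2 mean degree comes out near 3.07 against the table's 3.0.

```python
def mean_degree(nodes, edges):
    """Average total degree (in + out) in a directed graph: 2m / n."""
    return 2 * edges / nodes


def density(nodes, edges):
    """Share of possible directed edges that are present: m / (n * (n - 1))."""
    return edges / (nodes * (nodes - 1))


# (nodes, edges) per snowball step, copied from the table above.
ROUNDS = {"Seed": (5, 5), "k = 2": (75, 115), "k = 3": (4938, 8659)}

summary = {
    step: (mean_degree(n, m), density(n, m)) for step, (n, m) in ROUNDS.items()
}
```

The pattern to notice is that mean degree barely moves between rounds while density collapses, because the number of possible edges grows with the square of the node count.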
The full network
To get a feeling for the size of the full network
Live demo
Live demonstration time
Drew Conway Mining and Analyzing Online Social Graph Data
Network basicsResearch design
Digging into dataLive demonstration
The study of relationshipsData representationAnalysis amp Visualization
Finding Key Actors with R
The first steps are to load the data into memory and perform some basiccentrality analysis1
Load the data into igraph
library(igraph)Glt-readgraph(drug_maintxtformat=edgelist)Glt-asundirected(G) By default igraph inputs edgelist data as a directed graph In this step we undo this and assume that all relationships are reciprocal
Store metrics in new data frame
centlt-dataframe(bet=betweenness(G)eig=evcent(G)$vector) evcent returns lots of data associated with the EC but we only need the leading eigenvectorreslt-lm(eig~betdata=cent)$residualscentlt-transform(centres=res) We will use the residuals in the next step
1Weeks et al (2002) httpdxdoiorg101023A1015457400897
Drew Conway Mining and Analyzing Online Social Graph Data
Network basicsResearch design
Digging into dataLive demonstration
The study of relationshipsData representationAnalysis amp Visualization
Finding Key Actors with R
Plot the data
library(ggplot2)
We use ggplot2 to make things a
bit prettier
plt-ggplot(centaes(x=bety=eig
label=rownames(cent)colour=res
size=abs(res)))+xlab(Betweenness
Centrality)+ylab(Eigenvector
Centrality)
We use the residuals to color and
shape the points of our plot
making it easier to spot outliers
p+geom_text()+opts(title=Key Actor
Analysis for Hartford Drug Users)
We use the geom_text function to plot
the actorsrsquo IDrsquos rather than points
so we know who is who
Key Actor Analysis for Hartford Drug Users
Betweenness Centrality
Eig
enve
ctor
C
entr
ality
1 2 3 45
6
7 8
9
1011
12
13 1415
16
17
1819
20
21
22
23
24
2526
27
28
29303132
33
34 35
36
37 3839 40
41
42
43
44
4546
47
48
49
50
51
52
53
5455
56
57
58
59 6061 62
63
64
65
66
67
6869
7071 72 7374
75
76
77
78 798081 8283
8485
86
87
88
89
90
91
92
93
94
9596 9798 99
100
101
102103
104
105
106107108109
110
111
112113 114 115116 117118119120121
122 123124
125
126
127128129
130
131132133134135
136
137138
139
140
141
142143 144145146
147
148149
150151152153 154
155
156 157158 159 160161
162
163164
165
166167
168
169
170
171
172
173
174
175176
177
178179180
181
182
183
184
185
186187188189 190191 19219319400
02
04
06
08
10
0 1000 2000 3000 4000 5000 6000
res
minus02
0
02
04
06
abs(res)
01
02
03
04
05
06
07
Drew Conway Mining and Analyzing Online Social Graph Data
Network basicsResearch design
Digging into dataLive demonstration
The study of relationshipsData representationAnalysis amp Visualization
Key Actor Plot
20
284447
50
53
58
79
102
141155 Network plot
Create positions for all of
the nodes w force directed
llt-layoutfruchtermanreingold(G
niter=500)
Set the nodesrsquo size relative to
their residual value
V(G)$sizelt-abs(res)10
Only display the labels of key
players
nodeslt-asvector(V(G)+1)
Key players defined as have a
residual value gt25
nodes[which(abs(res)lt25)]lt-NA
Save plot as PDF
pdf(lsquoactor_plotpdfrsquopointsize=7)
plot(Glayout=lvertexlabel=nodes
vertexlabeldist=025
vertexlabelcolor=lsquoredrsquoedgewidth=1)
devoff()
Drew Conway Mining and Analyzing Online Social Graph Data
Network basicsResearch design
Digging into dataLive demonstration
Edge contextNetwork data and ethics
Not all edges are created equally
I have spent a lot of time tonight describing network analysis as a way tounderstand relationships
I Depending on the source and context of the data these relationships can be interpreted as many differentthings
I Must consider the data generation process by which the edge was created
I With respect to online social graph data ask how do people use this service
Recent study showsTwitter is used as newsaggregation serviceme rarr interests rarr you
I Geography andhistory less important
I Networks may clusteraround communitiesof interest
Ties here are driven bypersonal contactme rarr offline rarr you
I Offline relationshipsare drivingldquofriendingrdquo
I Considerable amountof meta-data alreadyin FB what does thisadd
Google SocialGraph
Connecting all of Googlersquossocial sites togetherme rarr anything rarr you
I The combining ofmultiple platformsinto a singleldquonetworkrdquo makesanalysis andinterpretationdifficult
Drew Conway Mining and Analyzing Online Social Graph Data
Network basicsResearch design
Digging into dataLive demonstration
Edge contextNetwork data and ethics
Ethics and network data
Why should we be discusses ethics at a data analytics seminar
ldquoPeople have really gotten comfortablenot only sharing more information anddifferent kinds but more openly andwith more people That social norm isjust something that has evolved overtimerdquo
Mark Zuckerburg CEO Facebook
ldquoMoney is a terrible master but anexcellent servantrdquo
ldquoTherersquos a sucker born every minuterdquo
PT Barnum
In isolation the data we provide online about our relationships and preferencesare fairly innocuous Their summation however can illuminate aspects ofour lives that can be exploited for any number of ways
I Social networking services are moving previously private data to the public
I Simply because data is available does not mean that individuals want orrealize that it can be used to make inferences about who they are or theirlifestyles
I Analysts are caught in the middle There is no IRB on the Internet
Drew Conway Mining and Analyzing Online Social Graph Data
Network basicsResearch design
Digging into dataLive demonstration
Edge contextNetwork data and ethics
Examples of ethically questionable network analyses
MIT study predicts sexual orientation from Facebook friends
- Two undergraduate students start a project, 'Gaydar'
- Claim they can predict which men are homosexual simply based on their friend structure
- Problem: Extremely private information, questionable methods
Pleaserobme.com
- Combined information from foursquare.com and Twitter to publicize when people were clearly not at home
- Used as a proof of concept to show the danger of publicizing location information
- Problem: Useful for raising awareness, hurtful to anyone who was actually robbed
Project Grey Goose
- Large group of former and current DoD/IC analysts collaborated to study the identity of hackers
- I was personally involved in this project
- Using public web forum data, built up user profiles and social networks to attempt to identify hackers affiliated with the Russian government
- Problem: While no names were published, bordering on vigilantism
Where to get social graph data
Recently there has been an explosion of resources for scraping social graph data
Service             Data                                        API Docs
Twitter             Following(ers), @-replies, date/time/geo    http://apiwiki.twitter.com/
Facebook            Friends, Wall Posts, date/time              http://developers.facebook.com/docs/api
Google              All SocialGraph relationships               http://code.google.com/apis/socialgraph/
Foursquare          Friends, Check-ins                          http://foursquare.com/developers/
Hunch               "Taste graph" recommendations               http://hunch.com/developers/
The New York Times  Congressional votes, campaign finance       http://developer.nytimes.com/docs
There is clearly no shortage of data
- Each service provides different relational context
- Data formats are generally JSON, Atom, XML, or some combination
- For a more extensive list of API resources, see the HackNY wiki of local startups
Building the social network among LiveJournal users
Using a "seed" user, we will build out a network
- In Python, use NetworkX, cjson, and other standard scientific libraries to parse the SocialGraph data
- Through a process called "k-snowball searching": seed → friend → ··· → friend_k
- Seed: imichaeldotorg.livejournal.com
- k = 3
- Note the low value of k
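The snowball idea above can be sketched without any network access. The following is a minimal Python 3 illustration, not the code used in the talk: it assumes the friend lists are already available in a plain dict (standing in for live API lookups), and the function name k_snowball and the toy graph are hypothetical.

```python
from collections import deque


def k_snowball(neighbors, seed, k):
    """Collect every node within k hops of `seed`.

    `neighbors` maps a node to the nodes it links to. It is a plain dict
    here, an assumption standing in for a live API lookup.
    Returns a dict mapping each discovered node to its hop distance.
    """
    depth = {seed: 0}
    queue = deque([seed])
    while queue:
        node = queue.popleft()
        if depth[node] == k:
            continue  # nodes on the outermost layer are not expanded
        for friend in neighbors.get(node, ()):
            if friend not in depth:
                depth[friend] = depth[node] + 1
                queue.append(friend)
    return depth


# Hypothetical toy graph; the node names are made up for illustration.
toy = {
    "seed": ["a", "b"],
    "a": ["c"],
    "b": ["c", "d"],
    "c": ["e"],
    "e": ["f"],
}

if __name__ == "__main__":
    for k in (1, 2, 3):
        print(k, sorted(k_snowball(toy, "seed", k)))
```

Each extra round expands only the nodes found in the previous round, which is why the node count can grow so quickly even for small k.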
The code, part 1
Loading the libraries and setting things up
    # Python 2, as presented in 2010
    from cjson import *
    from urllib import *
    from networkx import *
    from time import *
    from scipy import array, unique

    if __name__ == "__main__":
        seed_url = "http://imichaeldotorg.livejournal.com"
        sg = get_sg(seed_url)
        net, newnodes = create_egonet(sg)
        info(net)
Output of info(net):
    Name: ['http://imichaeldotorg.livejournal.com']
    Type: DiGraph
    Number of nodes: 5
    Number of edges: 5
    Average in degree: 1.0
    Average out degree: 1.0
Get the JSON from SocialGraph
    def get_sg(seed_url):
        # Build the lookup query for the seed URL
        sgapi_url = "http://socialgraph.apis.google.com/lookup?q=" + seed_url + "&edo=1&edi=1&fme=1&pretty=0"
        try:
            furl = urlopen(sgapi_url)
            fr = furl.read()
            furl.close()
            return fr
        except IOError:
            print "Could not connect to website"
            print sgapi_url
            return
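The query string in get_sg can also be assembled with the standard library, so that the seed URL is percent-encoded rather than concatenated raw. This is a minimal Python 3 sketch, assuming the same endpoint and the edo, edi, fme and pretty parameters that appear on the slide; the helper name build_lookup_url is hypothetical, and no request is sent.

```python
from urllib.parse import urlencode

# Endpoint as it appears in get_sg on the slide.
SGAPI_LOOKUP = "http://socialgraph.apis.google.com/lookup"


def build_lookup_url(seed_url):
    """Return the lookup URL for seed_url (hypothetical helper, makes no request)."""
    params = [
        ("q", seed_url),  # the node to look up
        ("edo", 1),       # flag values copied from the slide's query string
        ("edi", 1),
        ("fme", 1),
        ("pretty", 0),
    ]
    return SGAPI_LOOKUP + "?" + urlencode(params)


if __name__ == "__main__":
    print(build_lookup_url("http://imichaeldotorg.livejournal.com"))
```

Keeping URL construction separate from the fetch makes it possible to check the query by eye before any scraping starts.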
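Build egonet and snowball
Creating the egonet
    def create_egonet(s):
        try:
            raw = decode(s)
            G = DiGraph()
            n = raw['nodes']
            nk = n.keys()
            G.name = str(nk)
            pendants = []
            # Connect every pair of top-level nodes returned by the lookup
            for a in range(0, len(nk)):
                for b in range(0, len(nk)):
                    if a != b:
                        G.add_edge(nk[a], nk[b])
            for k in nk:
                ego = n[k]
                # Outbound edges: ego -> referenced node
                ego_out = ego['nodes_referenced']
                for o in ego_out:
                    G.add_edge(k, o)
                    pendants.append(o)
                # Inbound edges: referencing node -> ego
                ego_in = ego['nodes_referenced_by']
                for i in ego_in:
                    G.add_edge(i, k)
                    pendants.append(i)
            # De-duplicated neighbors become the seeds for the next round
            pendants = unique(array(pendants, dtype=str))
            return G, pendants
        except (DecodeError, KeyError):
            # The handler bodies are cut off on the slide; returning an empty
            # graph and no pendants is an assumption made so the function runs.
            return DiGraph(), array([], dtype=str)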
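To make the edge directions in create_egonet concrete, here is a minimal Python 3 sketch that turns a lookup response into a directed edge list using only the standard library. The payload is a made-up example shaped like the fields the slide code reads (nodes, nodes_referenced, nodes_referenced_by); the example URLs and the helper name egonet_edges are hypothetical, and the sketch leaves out the step that links top-level nodes to each other.

```python
import json


def egonet_edges(payload):
    """Return the set of directed (source, target) edges in a lookup response."""
    raw = json.loads(payload)
    edges = set()
    for ego, info in raw["nodes"].items():
        # Outbound references: ego -> other
        for other in info.get("nodes_referenced", {}):
            edges.add((ego, other))
        # Inbound references: other -> ego
        for other in info.get("nodes_referenced_by", {}):
            edges.add((other, ego))
    return edges


# Made-up response for illustration only.
sample = json.dumps({
    "nodes": {
        "http://seed.example/": {
            "nodes_referenced": {"http://a.example/": {}, "http://b.example/": {}},
            "nodes_referenced_by": {"http://a.example/": {}},
        }
    }
})

if __name__ == "__main__":
    for source, target in sorted(egonet_edges(sample)):
        print(source, "->", target)
```

Collecting edges in a set means a reference that appears more than once in the response still yields a single edge.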
Rolling the snowball
    def snowball_round(G, seeds, myspace=False):
        t0 = time()
        if myspace:
            seeds = get_myspace_url(seeds)
        sb_data = []
        for s in range(0, len(seeds)):
            s_sg = get_sg(seeds[s])
            new_ego, pen = create_egonet(s_sg)
            for p in pen:
                sb_data.append(p)
            # Merge each new egonet into the growing network
            if s < 1:
                sb_net = compose(G, new_ego)
            else:
                sb_net = compose(new_ego, sb_net)
            del new_ego
            # Progress report once 20% of the seeds have been processed
            if s == round(len(seeds) * 0.2):
                sb_net.name = '20% complete'
                sb_net.info()
                print 'AT ' + strftime('%m/%d/%Y %H:%M:%S', gmtime())
                print ''
            # More time keeping; probably a MUCH better way to do this
        # De-duplicated pendants are the seeds for the next round
        sb_data = unique(array(sb_data))
        sb_net.info()
        return sb_net, sb_data
Build the whole network
Step    Nodes   Edges   Mean Degree   Density
Seed    5       5       2.0           0.25
k = 2   75      115     3.0           0.02
k = 3   4938    8659    3.5           3.6 × 10^-4
- Our seed is abnormally isolated, with only four neighbors
- Large jump after first snowball
- Massive structural leap at k = 3
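The summary columns in the table can be approximated from the node and edge counts alone. The sketch below is in Python 3 and rests on an assumption about how the figures were computed: mean degree as 2 × edges / nodes, and density as edges divided by the nodes × (nodes − 1) possible directed edges. Under that assumption the seed row is reproduced exactly and the other rows come out close to the slide's rounded values (for k = 2 the formula gives a mean degree of about 3.07, where the slide shows 3.0).

```python
def mean_degree(nodes, edges):
    """Average total degree per node: every edge adds 2 to the degree sum."""
    return 2.0 * edges / nodes


def directed_density(nodes, edges):
    """Share of the nodes * (nodes - 1) possible directed edges that are present."""
    return edges / (nodes * (nodes - 1))


# Node and edge counts taken from the table above.
steps = {"Seed": (5, 5), "k = 2": (75, 115), "k = 3": (4938, 8659)}

if __name__ == "__main__":
    for step, (n, m) in steps.items():
        print(step, mean_degree(n, m), directed_density(n, m))
```

The density falls by orders of magnitude while mean degree barely moves, which is the usual signature of a network that grows mostly by adding sparsely connected nodes at the edge.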
The full network
To get a feeling for the size of the full network
Live demo
Live demonstration time
Finding Key Actors with R
Plot the data
    library(ggplot2)
    # We use ggplot2 to make things a bit prettier
    p <- ggplot(cent, aes(x=bet, y=eig, label=rownames(cent), colour=res,
                          size=abs(res))) +
         xlab("Betweenness Centrality") + ylab("Eigenvector Centrality")
    # We use the residuals to color and shape the points of our plot,
    # making it easier to spot outliers
    # (opts() is the ggplot2 call of the time; current releases use ggtitle())
    p + geom_text() + opts(title="Key Actor Analysis for Hartford Drug Users")
    # We use the geom_text function to plot the actors' IDs rather than points,
    # so we know who is who
[Figure: Key Actor Analysis for Hartford Drug Users. Eigenvector centrality (0.0 to 1.0) plotted against betweenness centrality (0 to 6000); each point is an actor ID (1 to 194), colored by res and sized by abs(res).]
Key Actor Plot
[Figure: network plot with the key actors labeled: 20, 28, 44, 47, 50, 53, 58, 79, 102, 141, 155]
Network plot
    # Create positions for all of the nodes w/ force directed
    l <- layout.fruchterman.reingold(G, niter=500)
    # Set the nodes' size relative to their residual value
    V(G)$size <- abs(res)*10
    # Only display the labels of key players
    nodes <- as.vector(V(G)+1)
    # Key players defined as having a residual value > .25
    nodes[which(abs(res) < .25)] <- NA
    # Save plot as PDF
    pdf('actor_plot.pdf', pointsize=7)
    plot(G, layout=l, vertex.label=nodes,
         vertex.label.dist=0.25,
         vertex.label.color='red', edge.width=1)
    dev.off()
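The key-player rule used above (keep only actors whose absolute residual exceeds .25) can be sketched in Python 3 with the standard library alone. This section does not show how res is computed, so the sketch assumes it holds the residuals of a simple least-squares fit of eigenvector centrality on betweenness centrality. The function names and the five data points are made up for illustration.

```python
def linear_residuals(x, y):
    """Residuals of an ordinary least-squares fit y = a + b * x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return [yi - (intercept + slope * xi) for xi, yi in zip(x, y)]


def key_actors(ids, bet, eig, threshold=0.25):
    """IDs of actors whose absolute residual is above the threshold."""
    res = linear_residuals(bet, eig)
    return [i for i, r in zip(ids, res) if abs(r) > threshold]


# Made-up centrality scores for five hypothetical actors.
ids = [1, 2, 3, 4, 5]
bet = [0.0, 1.0, 2.0, 3.0, 4.0]
eig = [0.0, 0.1, 1.0, 0.3, 0.4]

if __name__ == "__main__":
    print(key_actors(ids, bet, eig))
```

An actor with a large positive residual has more eigenvector centrality than its betweenness would predict, and a large negative residual means the reverse; both kinds stand out under an absolute-value threshold.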
Drew Conway Mining and Analyzing Online Social Graph Data
Network basicsResearch design
Digging into dataLive demonstration
Edge contextNetwork data and ethics
Ethics and network data
Why should we be discussing ethics at a data analytics seminar?
"People have really gotten comfortable not only sharing more information and different kinds, but more openly and with more people. That social norm is just something that has evolved over time."
Mark Zuckerberg, CEO, Facebook
"Money is a terrible master but an excellent servant."
"There's a sucker born every minute."
P.T. Barnum
In isolation, the data we provide online about our relationships and preferences are fairly innocuous. Their summation, however, can illuminate aspects of our lives that can be exploited in any number of ways.
- Social networking services are moving previously private data to the public
- Simply because data is available does not mean that individuals want, or realize, that it can be used to make inferences about who they are or their lifestyles
- Analysts are caught in the middle. There is no IRB on the Internet.
Examples of ethically questionable network analyses
MIT study predicts sexual orientation from Facebook friends
- Two undergraduate students start a project, 'Gaydar'
- Claim they can predict which men are homosexual simply based on their friend structure
- Problem: Extremely private information, questionable methods
PleaseRobMe.com
- Combined information from foursquare.com and Twitter to publicize when people were clearly not at home
- Used as a proof of concept to show the danger of publicizing location information
- Problem: Useful for raising awareness, hurtful to anyone who was actually robbed
Project Grey Goose
- Large group of former and current DoD/IC analysts collaborated to study the identity of hackers
- I was personally involved in this project
- Using public web forum data, built up user profiles and social networks to attempt to identify hackers affiliated with the Russian government
- Problem: While no names were published, bordering on vigilantism
Where to get social graph data
Recently there has been an explosion of resources for scraping social graph data

Service          Data                                       API Docs
Twitter          Following(ers), @-replies, date/time/geo   http://apiwiki.twitter.com
Facebook         Friends, Wall Posts, date/time             http://developers.facebook.com/docs/api
Google           All SocialGraph relationships              http://code.google.com/apis/socialgraph
Foursquare       Friends, Check-ins                         http://foursquare.com/developers
Hunch            "Taste graph" recommendations              http://hunch.com/developers
New York Times   Congressional votes, campaign finance      http://developer.nytimes.com/docs

There is clearly no shortage of data
- Each service provides a different relational context
- Data formats are generally JSON, Atom, XML, or some combination
- For a more extensive list of API resources, see the HackNY wiki of local startups
Building the social network among LiveJournal users
Using a "seed" user, we will build out a network
- In Python, use NetworkX, cjson, and other standard scientific libraries to parse the SocialGraph data
- Through a process called "k-snowball searching": seed → friend → ··· → friend_k
- Seed: imichaeldotorg.livejournal.com
- k = 3
- Note the low value of k
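
The snowball idea above can be sketched without any network access. The following is a minimal illustration, not the talk's code: `k_snowball` and the `toy_friends` dictionary are hypothetical, and k here simply counts expansion rounds outward from the seed, which is an assumption about the convention.

```python
def k_snowball(friends, seed, k):
    """Expand outward from `seed` for k rounds.

    `friends` maps each node to the nodes it links to.
    Returns a list of sets: element 0 is {seed}, element i holds the
    nodes first reached in round i.
    """
    visited = {seed}
    frontier = {seed}
    rounds = [set(frontier)]
    for _ in range(k):
        next_frontier = set()
        for node in frontier:
            for friend in friends.get(node, ()):
                if friend not in visited:
                    visited.add(friend)
                    next_frontier.add(friend)
        rounds.append(next_frontier)
        frontier = next_frontier
    return rounds


# Hypothetical toy data standing in for scraped friend lists.
toy_friends = {
    "seed": ["a", "b"],
    "a": ["c", "seed"],
    "b": ["c", "d"],
    "c": ["e"],
    "d": [],
    "e": ["f"],
}

rounds = k_snowball(toy_friends, "seed", 3)
for depth, found in enumerate(rounds):
    print(depth, sorted(found))
```

Each round only visits nodes discovered in the previous one, so the number of lookups grows with the size of the frontier. That is why even a small k can produce a large network.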
The code, part 1
Loading the libraries and setting things up

    # Python 2 code, as presented on the slides
    from cjson import *
    from urllib import *
    from networkx import *
    from time import *
    from scipy import array, unique

    if __name__ == "__main__":
        seed_url = "http://imichaeldotorg.livejournal.com"
        sg = get_sg(seed_url)                 # raw JSON for the seed
        net, newnodes = create_egonet(sg)     # ego network plus its neighbors
        info(net)

Output:

    Name: ['http://imichaeldotorg.livejournal.com']
    Type: DiGraph
    Number of nodes: 5
    Number of edges: 5
    Average in degree: 1.0
    Average out degree: 1.0

Get the JSON from SocialGraph

    def get_sg(seed_url):
        # Build the SocialGraph lookup URL for the seed
        sgapi_url = "http://socialgraph.apis.google.com/lookup?q=" + seed_url + "&edo=1&edi=1&fme=1&pretty=0"
        try:
            furl = urlopen(sgapi_url)
            fr = furl.read()
            furl.close()
            return fr
        except IOError:
            print "Could not connect to website"
            print sgapi_url
            return
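
The lookup URL above is assembled by string concatenation, which leaves the seed URL unescaped. As a hedged Python 3 sketch (not the talk's code), the same query string can be built with the standard library's `urllib.parse.urlencode`. The helper name `build_lookup_url` is hypothetical. No request is sent, since the service may no longer be available.

```python
from urllib.parse import urlencode, urlsplit, parse_qs

# Endpoint as it appears in the slide's get_sg function.
SG_LOOKUP = "http://socialgraph.apis.google.com/lookup"


def build_lookup_url(seed_url):
    """Return the lookup URL for `seed_url` with the slide's query flags."""
    params = {"q": seed_url, "edo": 1, "edi": 1, "fme": 1, "pretty": 0}
    # urlencode percent-escapes the seed URL so it survives as one value.
    return SG_LOOKUP + "?" + urlencode(params)


url = build_lookup_url("http://imichaeldotorg.livejournal.com")
print(url)

# Round-trip check: the seed comes back unchanged from the query string.
query = parse_qs(urlsplit(url).query)
print(query["q"][0])
```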
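Build egonet and snowball
Creating the egonet

    def create_egonet(s):
        try:
            raw = decode(s)                # cjson: JSON string -> dict
            G = DiGraph()
            pendants = []
            n = raw['nodes']
            nk = n.keys()
            G.name = str(nk)
            pendants = []
            # Link every pair of top-level nodes in both directions
            for a in range(0, len(nk)):
                for b in range(0, len(nk)):
                    if a != b:
                        G.add_edge(nk[a], nk[b])
            for k in nk:
                ego = n[k]
                # Outbound links: ego -> referenced node
                ego_out = ego['nodes_referenced']
                for o in ego_out:
                    G.add_edge(k, o)
                    pendants.append(o)
                # Inbound links: referencing node -> ego
                ego_in = ego['nodes_referenced_by']
                for i in ego_in:
                    G.add_edge(i, k)
                    pendants.append(i)
            # De-duplicate the neighbors; they seed the next snowball round
            pendants = array(pendants, dtype=str)
            pendants.flatten()
            pendants = unique(pendants)
            return G, pendants
        except DecodeError:
            raise  # handler body is not shown on the slide
        except KeyError:
            raise  # handler body is not shown on the slide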
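
To make the data flow of the egonet step concrete, here is a minimal Python 3 sketch that uses only the standard library. It swaps in `json` for cjson and a plain set of edge tuples for the DiGraph. The function name `egonet_edges` and the `sample` payload are hypothetical. The payload shape is an assumption modeled on the keys the slide's code reads (`nodes`, `nodes_referenced`, `nodes_referenced_by`), not a captured API response.

```python
import json


def egonet_edges(payload):
    """Return (edges, pendants) from a SocialGraph-style JSON string.

    edges    -- set of (source, target) tuples
    pendants -- sorted list of unique neighbor URLs
    """
    nodes = json.loads(payload)["nodes"]
    edges = set()
    pendants = set()
    keys = list(nodes)
    # Mirror the slide's loop: link every pair of top-level nodes.
    for a in keys:
        for b in keys:
            if a != b:
                edges.add((a, b))
    for k, ego in nodes.items():
        for o in ego.get("nodes_referenced", {}):
            edges.add((k, o))      # ego points at o
            pendants.add(o)
        for i in ego.get("nodes_referenced_by", {}):
            edges.add((i, k))      # i points at ego
            pendants.add(i)
    return edges, sorted(pendants)


# Hypothetical payload for illustration only.
sample = json.dumps({
    "nodes": {
        "http://seed.example/": {
            "nodes_referenced": {"http://a.example/": {}, "http://b.example/": {}},
            "nodes_referenced_by": {"http://a.example/": {}},
        }
    }
})

edges, pendants = egonet_edges(sample)
print(len(edges), pendants)
```

The edge set can be handed to NetworkX afterwards with `DiGraph.add_edges_from(edges)` if graph algorithms are needed.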
Rolling the snowball

    def snowball_round(G, seeds, myspace=False):
        t0 = time()
        if myspace:
            seeds = get_myspace_url(seeds)   # helper not shown on the slides
        sb_data = []
        for s in range(0, len(seeds)):
            s_sg = get_sg(seeds[s])
            new_ego, pen = create_egonet(s_sg)
            for p in pen:
                sb_data.append(p)
            # Merge each new ego network into the running snowball network
            if s < 1:
                sb_net = compose(G, new_ego)
            else:
                sb_net = compose(new_ego, sb_net)
            del new_ego
            if s == round(len(seeds) * 0.2):
                sb_net.name = '20% complete'
                sb_net.info()
                print 'AT ' + strftime('%m/%d/%Y %H:%M:%S', gmtime())
                print ''
            # More time keeping; probably a MUCH better way to do this
        # Unique neighbors found this round become the next round's seeds
        sb_data = array(sb_data)
        sb_data.flatten()
        sb_data = unique(sb_data)
        sb_net.info()
        return sb_net, sb_data
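
The time-keeping comment in the code above invites a tidier approach. One possible sketch in Python 3, offered as an assumption rather than the author's method, precomputes the loop indices at which a progress line should be printed. The helper name `progress_marks` is hypothetical.

```python
def progress_marks(total, step=0.2):
    """Map loop index -> percent label for evenly spaced progress reports.

    With total=50 and step=0.2 the marks fall at 20%, 40%, 60% and 80%.
    """
    marks = {}
    n_steps = int(round(1 / step))
    for j in range(1, n_steps):
        idx = int(round(total * step * j))
        # setdefault keeps the first label if two marks share an index
        marks.setdefault(idx, int(round(100 * step * j)))
    return marks


marks = progress_marks(50)
for s in range(50):
    # ... fetch and merge the ego network for seed number s here ...
    if s in marks:
        print("%d%% complete" % marks[s])
```

A single dictionary lookup inside the loop replaces one hand-written `if` block per checkpoint.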
Build the whole network

Step    Nodes   Edges   Mean Degree   Density
Seed    5       5       2.0           0.25
k = 2   75      115     3.0           0.02
k = 3   4938    8659    3.5           3.6 × 10^-4

- Our seed is abnormally isolated, with only four neighbors
- Large jump after first snowball
- Massive structural leap at k = 3
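
The summary statistics in the table can be checked from the node and edge counts alone. The sketch below assumes the usual definitions for a directed graph: mean degree 2m/n (in-degree plus out-degree) and density m/(n(n-1)). These definitions are an assumption. They match the Seed row exactly and appear consistent with the other rows up to rounding.

```python
def mean_degree(n_nodes, n_edges):
    """Average total degree (in + out) per node."""
    return 2.0 * n_edges / n_nodes


def directed_density(n_nodes, n_edges):
    """Share of the n(n-1) possible directed edges that are present."""
    return n_edges / float(n_nodes * (n_nodes - 1))


# Node and edge counts taken from the table.
steps = [("Seed", 5, 5), ("k = 2", 75, 115), ("k = 3", 4938, 8659)]
for label, n, m in steps:
    print(label, "%.2f" % mean_degree(n, m), "%.2e" % directed_density(n, m))
```

Density collapses as the network grows because the number of possible edges rises with the square of the node count, while the observed edges grow far more slowly.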
The full network
To get a feeling for the size of the full network
Live demo
Live demonstration time