Classifying Online Social Network Users Through the Social Graph
Cristina Pérez Solà and Jordi Herrera Joancomartí
Departament d’Enginyeria de la Informació i les Comunicacions
Universitat Autònoma de Barcelona
October 25th, 2012
1 Introduction
2 Classifier proposal
3 The experiments
4 Conclusions and further work
About the title
Classifying...
Definition
Classification is the problem of identifying to which of a set of categories a new observation belongs. The decision is made on the basis of a training set of data containing observations whose category membership is already known.
About the title
... Online Social Network Users...
About the title
...Through the Social Graph
Definition
A social graph is a graph where nodes represent users in a social network and edges represent relationships between these users.
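As an illustration of this definition, a social graph can be stored as two adjacency maps. This is a minimal sketch, not the authors' data structure: it assumes directed "follows" edges, and the user names and the `build_graph` helper are hypothetical.

```python
def build_graph(edges):
    """Build successor and predecessor maps from (source, target) pairs."""
    succ, pred = {}, {}
    for u, v in edges:
        # Register both endpoints so isolated directions still get an entry.
        for node in (u, v):
            succ.setdefault(node, set())
            pred.setdefault(node, set())
        succ[u].add(v)   # u points to v
        pred[v].add(u)   # v is pointed to by u
    return succ, pred

# Hypothetical users; an edge (u, v) is read as "u follows v".
edges = [("alice", "bob"), ("bob", "alice"), ("carol", "alice")]
succ, pred = build_graph(edges)
```

Here `len(succ[n])` gives the outdegree of `n` and `len(pred[n])` its indegree.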
What do we want to do?
Goals
Design a user (node) classifier that uses the graph structure alone (no semantic information is needed).
Apply the previously designed classifier to label OSN users.
Demonstrate that OSN user classification is possible with naively anonymized graphs.
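The slides do not define naive anonymization. A common reading, assumed here, is that user identifiers are replaced by random pseudonyms while the edges are left untouched. The sketch below follows that reading; `naive_anonymize` and `out_degrees` are hypothetical helper names.

```python
import random

def naive_anonymize(edges, seed=0):
    """Replace every user identifier with a random integer pseudonym."""
    nodes = sorted({n for edge in edges for n in edge})
    pseudonyms = list(range(len(nodes)))
    random.Random(seed).shuffle(pseudonyms)
    mapping = dict(zip(nodes, pseudonyms))
    anon_edges = [(mapping[u], mapping[v]) for u, v in edges]
    return anon_edges, mapping

def out_degrees(edge_list):
    """Count outgoing edges per node."""
    counts = {}
    for u, _ in edge_list:
        counts[u] = counts.get(u, 0) + 1
    return counts

edges = [("alice", "bob"), ("bob", "alice"), ("carol", "alice")]
anon_edges, mapping = naive_anonymize(edges)

# Names are gone, but structural quantities such as degrees are unchanged.
same_structure = (sorted(out_degrees(edges).values())
                  == sorted(out_degrees(anon_edges).values()))
```

Because the structure survives intact, any classifier that relies only on the graph structure sees the same input before and after this transformation.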
Why is it interesting?
Motivation
User classification as a privacy attack
User classification allows an attacker to infer (private) attributes of the user.
Attributes may be sensitive by themselves.
Attribute disclosure may have undesirable consequences for the user.
In any case, users can no longer control the disclosure of information about themselves.
1 Introduction
2 Classifier proposal
    Architecture overview
    Classifier modules
    Specific design details
3 The experiments
4 Conclusions and further work
Architecture overview
Classifier Architecture
The proposed classifier is implemented with a five-module architecture that includes two different classifiers: an initial classifier and a relational classifier.
[Figure: block diagram of the classifier architecture. Blocks: data preprocessing (twice), initial classifier, neighborhood analysis, relational classifier. Labeled signals: clustering coefficient and degrees, class labels, new class labels.]
Classifier modules
Initial classifier
The initial classifier analyzes the graph structure and maps each node to a 2-dimensional sample: degree and clustering coefficient. The output is an initial assignment of nodes to categories.
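The 2-dimensional sample can be computed directly from an adjacency map. The sketch below is a simplification rather than the authors' code: it assumes an undirected graph and the usual local clustering coefficient (links among a node's neighbors divided by the number of possible links). The `structural_features` name and the toy graph are hypothetical.

```python
from itertools import combinations

def structural_features(adj):
    """Map each node to (degree, local clustering coefficient)."""
    features = {}
    for node, neigh in adj.items():
        k = len(neigh)
        if k < 2:
            # With fewer than two neighbors no triangle is possible.
            features[node] = (k, 0.0)
            continue
        # Count pairs of neighbors that are themselves connected.
        links = sum(1 for a, b in combinations(neigh, 2) if b in adj[a])
        features[node] = (k, 2.0 * links / (k * (k - 1)))
    return features

# Toy undirected graph: a triangle a-b-c plus a pendant node d attached to a.
adj = {
    "a": {"b", "c", "d"},
    "b": {"a", "c"},
    "c": {"a", "b"},
    "d": {"a"},
}
features = structural_features(adj)
```

In this toy graph node `a` has degree 3 and one linked pair among three possible pairs of neighbors, so its coefficient is 1/3.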
Classifier modules
Neighborhood analysis
The neighborhood analysis module reports which kinds of nodes each node is connected to, using the labels assigned by the initial classifier.
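One way to picture this module is as a per-node count of neighbors in each category. The slides do not say whether counts, fractions or another summary is used, so the raw counts below are an assumption, as are the `neighborhood_profile` name and the toy labels.

```python
def neighborhood_profile(adj, labels, categories):
    """Count, for every node, how many neighbors carry each label."""
    profile = {}
    for node, neigh in adj.items():
        counts = [0] * len(categories)
        for other in neigh:
            counts[categories.index(labels[other])] += 1
        profile[node] = counts
    return profile

categories = ["individual", "company"]
# Toy star graph with labels as an initial classifier might have assigned them.
adj = {"a": {"b", "c", "d"}, "b": {"a"}, "c": {"a"}, "d": {"a"}}
labels = {"a": "company", "b": "individual", "c": "individual", "d": "company"}
profile = neighborhood_profile(adj, labels, categories)
```

Each profile has one dimension per category, which is consistent with the later remark that more categories simply add more dimensions.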
Classifier modules
Relational classifier
The relational classifier maps users to n-dimensional samples, combining degree and clustering coefficient with the neighborhood information. The output is a new assignment of nodes to categories, which can differ from the initial classification.
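A plausible way to build the n-dimensional sample is to concatenate the structural features with the neighborhood information. The exact layout of the vector is not given in the slides, so the ordering below is an assumption, and `relational_sample` and the toy values are hypothetical.

```python
def relational_sample(structural, neighborhood):
    """Concatenate structural features and neighborhood counts per node."""
    return {
        node: list(structural[node]) + list(neighborhood[node])
        for node in structural
    }

# Toy inputs: (degree, clustering coefficient) and per-category neighbor counts.
structural = {"a": (3, 0.0), "b": (1, 0.0)}
neighborhood = {"a": [2, 1], "b": [0, 1]}
samples = relational_sample(structural, neighborhood)
```

With two structural features and two categories, each sample here has n = 4 dimensions.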
Specific design details
Some details about the classifier
The graph is directed, so we distinguish between indegree and outdegree (instead of having just degree).
This distinction increases by 2 the number of dimensions in the neighborhood analysis.
We can have as many categories as we want: we just have to add more dimensions!
Classifiers are instantiated with Support Vector Machines with soft margins.
The relational classifier is applied iteratively.
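The iterative application can be sketched as a loop that alternates neighborhood analysis and relabeling. This is an illustration under several assumptions, not the authors' implementation: a nearest-centroid rule stands in for the SVM, only neighbor counts are used as features, training labels are kept fixed, and the loop stops early when labels no longer change. All function names and the toy graph are hypothetical.

```python
def neighbor_counts(adj, labels, categories):
    """Neighborhood analysis: per-node count of neighbors in each category."""
    return {
        node: [sum(labels[m] == c for m in neigh) for c in categories]
        for node, neigh in adj.items()
    }

def fit_centroids(samples, labels, nodes):
    """Stand-in learner (NOT an SVM): mean sample of each category."""
    sums, sizes = {}, {}
    for node in nodes:
        c = labels[node]
        acc = sums.setdefault(c, [0.0] * len(samples[node]))
        for i, x in enumerate(samples[node]):
            acc[i] += x
        sizes[c] = sizes.get(c, 0) + 1
    return {c: [x / sizes[c] for x in acc] for c, acc in sums.items()}

def predict(centroids, sample):
    """Assign the category whose centroid is closest (squared distance)."""
    return min(
        centroids,
        key=lambda c: sum((x - y) ** 2 for x, y in zip(sample, centroids[c])),
    )

def iterate_relational(adj, initial_labels, train_nodes, categories, rounds=10):
    """Repeat: neighborhood analysis, retrain, relabel the non-training nodes."""
    labels = dict(initial_labels)
    for _ in range(rounds):
        samples = neighbor_counts(adj, labels, categories)
        centroids = fit_centroids(samples, labels, train_nodes)
        new_labels = {
            node: labels[node] if node in train_nodes
            else predict(centroids, samples[node])
            for node in adj
        }
        if new_labels == labels:  # assumption: stop once labels are stable
            break
        labels = new_labels
    return labels

categories = ["individual", "company"]
# Two triangles; i3 starts with a wrong initial label.
adj = {
    "i1": {"i2", "i3"}, "i2": {"i1", "i3"}, "i3": {"i1", "i2"},
    "c1": {"c2", "c3"}, "c2": {"c1", "c3"}, "c3": {"c1", "c2"},
}
initial = {"i1": "individual", "i2": "individual", "i3": "company",
           "c1": "company", "c2": "company", "c3": "company"}
final = iterate_relational(adj, initial, {"i1", "c1"}, categories)
```

In this toy run the mislabeled node `i3` is corrected in the first round, because its neighborhood looks like that of the known individual `i1` rather than that of the known company `c1`.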
1 Introduction
2 Classifier proposal
3 The experiments
    Experiment design
    Experiment results
4 Conclusions and further work
Experiment design
The main goal
Research question
Is an attacker able to recover attributes of OSN users knowing just the social graph structure and the attributes of a small subset of the nodes in the graph?
We are facing a within-network classification problem, where nodes whose labels are unknown are linked to nodes whose labels are known.
Experiment design
Data used in the experiments
We collected data from 936,423 Twitter users, who were all the neighbors of a subset of 300 nodes.
We constructed two disjoint graphs G1 = (V1, E1) and G2 = (V2, E2) with users and their relationships.
We labeled the nodes of the graphs to obtain the ground truth:
Binary classification: individual or company.
Multiclass classification: normal user, blogger, celebrity, media and organization.
Experiment design
An experiment
Each experiment consisted of:
Randomly selecting a subset of nodes (Vtrain) to be used as training samples: 65%, 50%, 35% and 20% of nodes.
Training the classifiers with those samples.
Classifying the rest of the nodes (Vtest = V \ Vtrain).
Evaluating the overall performance using the ground truth.
We performed 100 experiments for each of the training set sizes and for both classification problems.
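The procedure above can be sketched as a small evaluation harness. This is a hedged illustration rather than the authors' code: the `classify` callback, the toy ground truth and both toy classifiers are hypothetical, and the correct rate is taken to be the fraction of test nodes labeled correctly.

```python
import random

def run_experiment(nodes, truth, classify, train_fraction, rng):
    """One trial: random train/test split, classify, return the correct rate."""
    nodes = list(nodes)
    rng.shuffle(nodes)
    cut = int(round(train_fraction * len(nodes)))
    train, test = set(nodes[:cut]), nodes[cut:]   # Vtest = V \ Vtrain
    predicted = classify(train, test)
    hits = sum(predicted[n] == truth[n] for n in test)
    return hits / len(test)

def mean_correct_rate(nodes, truth, classify, train_fraction, trials=100, seed=0):
    """Average the correct rate over repeated random splits."""
    rng = random.Random(seed)
    rates = [run_experiment(nodes, truth, classify, train_fraction, rng)
             for _ in range(trials)]
    return sum(rates) / len(rates)

# Toy ground truth: 20 nodes, every fourth one a company.
truth = {i: ("company" if i % 4 == 0 else "individual") for i in range(20)}

def always_individual(train, test):
    """Hypothetical baseline that ignores the training nodes."""
    return {n: "individual" for n in test}

def oracle(train, test):
    """Hypothetical perfect classifier, used only to check the harness."""
    return {n: truth[n] for n in test}

baseline_rates = {
    fraction: mean_correct_rate(truth, truth, always_individual, fraction)
    for fraction in (0.65, 0.50, 0.35, 0.20)
}
oracle_rate = mean_correct_rate(truth, truth, oracle, 0.50)
```

A real run would replace the toy callbacks with the trained initial and relational classifiers and record one correct rate per iteration of the relational step.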
Experiment results
Binary Classification Results
[Figure: correct rate (vertical axis, 0.5 to 0.75) against iteration (horizontal axis, 0 to 10). Series: D1 and D2, each with 65%, 50%, 35% and 20% of nodes used for training.]
Experiment results
Multiclass Classification Results
[Figure: correct rate (vertical axis, 0.3 to 0.6) against iteration (horizontal axis, 0 to 10). Series: Cata with 65%, 50%, 35% and 20% of nodes used for training.]
Conclusions
Information found in the social graph is enough to perform classification.
It is possible to classify OSN users using a naively anonymized copy of a social graph.
Naive anonymization does not protect OSN users from attribute disclosure.
Success rate varies depending on the training set size.
Further work
Integrate both structural and semantic information to improve classification.
Study the impact of different graph anonymization techniques (other than the naive anonymization) on the classification.
Analyze the performance of other classification techniques for relational data.
Linear SVM
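The content of this backup slide is not recoverable. As a reminder only, the textbook soft-margin formulation of a linear SVM is given below; the symbols ($w$, $b$, $\xi_i$, $C$, and training pairs $(x_i, y_i)$ with $y_i \in \{-1, +1\}$) are the standard ones and are not taken from the slides.

```latex
\begin{aligned}
\min_{w,\,b,\,\xi}\quad & \frac{1}{2}\lVert w \rVert^{2} + C \sum_{i=1}^{m} \xi_i \\
\text{subject to}\quad & y_i \left( w^{\top} x_i + b \right) \ge 1 - \xi_i, \qquad i = 1, \dots, m, \\
& \xi_i \ge 0, \qquad i = 1, \dots, m.
\end{aligned}
```

The slack variables $\xi_i$ allow some training samples to violate the margin, and $C$ sets the trade-off between a wide margin and few violations.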
Non-linear SVM