Harnessing the Social Web: The Science of Identity Disambiguation


Matthew Rowe and Fabio Ciravegna
Organisations, Information and Knowledge Group

University of Sheffield, UK

Web Science 2010


Outline

• Problem
  – Dissemination of personal information across the Web
• Motivation
  – Need for automation
• Harnessing the Social Web
• Identity Disambiguation
  – Inference Rules
  – Self-training
• Evaluation
  – Results
• Conclusions


Problem

• A large amount of the information now residing on the World Wide Web is personal information
  – Disseminated voluntarily: homepages, profile pages
  – Or involuntarily: telephone directories, electoral registers
• The sensitive nature of this information has led to:
  – Identity Theft: the act of stealing a person’s identity and reusing it
    • Currently costs the UK economy £1.2 billion (http://www.identitytheft.org.uk/faqs.asp)
  – Lateral Surveillance: the act of watching someone without their knowledge
    • Often performed by employers vetting potential employees
    • And by socialites vetting prospective dates
    • Could affect reputation if detrimental content exists


Motivation

• To avoid such practices, web users must manually collect the web resources which may cite them and then decide which ones do
  – The latter stage of this process is referred to as disambiguation
    • Decides which web resources are references and produces a unary set of identity web references for a given person
  – However, this practice
    • Is time consuming
    • Is expensive
    • Must be repeated often as more and more data is published on the Web
• Automated disambiguation techniques can replace this manual processing
  – To function effectively, however, seed data (background knowledge about a person) is required:
    • Expensive to produce (e.g. filling in an extensive form)
    • Must contain sufficient features describing a person’s identity


Harnessing the Social Web

• Overcome the problem of producing seed data manually by harnessing the Social Web
  – Social Web platforms such as Facebook, Twitter and MySpace allow web users to build an online persona/identity visible to others
• Sociological studies have argued for the similarity between online and offline identities
  – (Hart et al, 2008) states that online social networks are merely extensions of offline lives
  – (Ellison et al, 2007) states that Social Web platforms are used to reinforce established offline relationships
• A user study was conducted to assess the relationship between digital identities constructed on Social Web platforms and their real world equivalents, using 50 participants from the University of Sheffield (25 male, 25 female) with a wide age range (18 – 45)
• Study consisted of three stages
  1. Participants listed their real world social network
  2. Digital social network was extracted from Facebook for each participant
  3. Digital and real world networks were compared
    • Relevance: proportion of the digital social network containing strong-tied relationships
    • Coverage: extent to which the real world network is replicated online
• Results from the user study show
  – Coverage range of 0.5 to 1 with an average of 0.77
    • Indicating that, on average, 77% of a person’s real world social network is replicated online
  – Average relevance of 0.23
    • Indicating that, on average, 23% of a person’s digital social network consists of strong-tied relationships
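As a rough illustration of the two measures used in the user study, the sketch below treats each network as a set of names. It assumes (the slides do not say this) that everyone a participant lists in their real world network counts as a strong tie, so relevance becomes the share of the digital network that also appears in the real world list. The function names and the example people are hypothetical.

```python
def coverage(real_world, digital):
    """Fraction of the real world network that also appears online."""
    real_world, digital = set(real_world), set(digital)
    if not real_world:
        return 0.0
    return len(real_world & digital) / len(real_world)


def relevance(real_world, digital):
    """Fraction of the digital network made up of strong ties.

    Assumption: every person in the real world list is a strong tie.
    """
    real_world, digital = set(real_world), set(digital)
    if not digital:
        return 0.0
    return len(real_world & digital) / len(digital)


# Hypothetical participant: 4 real world contacts, 10 online friends.
real = {"alice", "bob", "carol", "dave"}
online = {"alice", "bob", "carol", "eve", "frank",
          "grace", "heidi", "ivan", "judy", "mallory"}

print(coverage(real, online))   # 3 of the 4 real world contacts are online
print(relevance(real, online))  # 3 of the 10 online friends are strong ties
```

With these made-up networks most of the real world network is present online while only a minority of online friends are strong ties, which is the same pattern the study reports.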


Collecting Seed Data from the Social Web


<http://www.dcs.shef.ac.uk/~mrowe/foaf.rdf#me>
    rdf:type foaf:Person ;
    foaf:name "Matthew Rowe" ;
    foaf:homepage <http://www.dcs.shef.ac.uk/~mrowe> ;
    foaf:mbox <[email protected]> ;
    foaf:based_near <http://www.geonames.org/2638077> ;
    foaf:knows <http://www.dcs.shef.ac.uk/~mrowe/foaf.rdf#fabio> ;
    foaf:knows <http://www.dcs.shef.ac.uk/~mrowe/foaf.rdf#sam> .

<http://www.dcs.shef.ac.uk/~mrowe/foaf.rdf#fabio>
    rdf:type foaf:Person ;
    foaf:name "Fabio Ciravegna" ;
    foaf:mbox <[email protected]> ;
    foaf:homepage <http://www.dcs.shef.ac.uk/~fabio> .

<http://www.dcs.shef.ac.uk/~mrowe/foaf.rdf#sam>
    rdf:type foaf:Person ;
    foaf:name "Sam Chapman" ;
    foaf:mbox <[email protected]> ;
    foaf:homepage <http://www.dcs.shef.ac.uk/~sam> .

<http://www.geonames.org/2638077>
    rdf:type geo:Feature ;
    geo:name "Sheffield" ;
    geo:inCountry "UK" .


Identity Disambiguation: Inference Rules

• Rules provide a means to logically infer conclusions based on the presence of information
• In the context of identity disambiguation, rules replicate the cognitive process by which a human decides if a web resource refers to a given entity
  – Using background knowledge known about the entity
  – Uses a supervised approach by only using the provided seed data to make decisions
• Rules are built from the seed data as follows:
  – RDF instances are extracted from the seed data (e.g. an instance of a person or location)
  – A rule is constructed from the information in each instance description
  – Rules are then added to the rule base, which is then applied to a collected set of web resources to disambiguate identity web references
  – If the triple pattern in the antecedent (the if part) of the rule matches the knowledge structure of a web resource then a web reference is inferred
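The rule-building steps above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the authors' rule language: triples are plain tuples, a rule's antecedent is taken to be a person's name plus one other property from the seed instance, and a rule fires when a single subject in the web resource carries every pattern in the antecedent. The helper names (`build_rules`, `matches`, `is_web_reference`) are hypothetical.

```python
from collections import defaultdict

NAME = "foaf:name"


def group_by_subject(triples):
    """Collect the (predicate, object) pairs described for each subject."""
    by_subject = defaultdict(set)
    for s, p, o in triples:
        by_subject[s].add((p, o))
    return by_subject


def build_rules(seed_triples):
    """Build one rule per (name, other property) pair of each seed instance.

    Assumption: an antecedent is a set of (predicate, object) patterns
    that must all hold for one subject in a web resource.
    """
    rules = []
    for props in group_by_subject(seed_triples).values():
        names = {(p, o) for p, o in props if p == NAME}
        for name in names:
            for other in props - names:
                rules.append(frozenset({name, other}))
    return rules


def matches(rule, resource_triples):
    """True if some subject in the resource satisfies the whole antecedent."""
    return any(rule <= props
               for props in group_by_subject(resource_triples).values())


def is_web_reference(rules, resource_triples):
    """A web reference is inferred as soon as any rule matches."""
    return any(matches(rule, resource_triples) for rule in rules)


seed = [
    ("#me", "foaf:name", "Matthew Rowe"),
    ("#me", "foaf:homepage", "http://www.dcs.shef.ac.uk/~mrowe"),
]
rules = build_rules(seed)

page_a = [("_:p1", "foaf:name", "Matthew Rowe"),
          ("_:p1", "foaf:homepage", "http://www.dcs.shef.ac.uk/~mrowe")]
page_b = [("_:p2", "foaf:name", "Matthew Rowe")]  # the name alone is ambiguous

print(is_web_reference(rules, page_a))
print(is_web_reference(rules, page_b))
```

Requiring the full antecedent to match is what makes this style of rule precise but brittle: a page that mentions the right person with details absent from the seed data is never matched.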



Identity Disambiguation: Self-training

• Self-training provides a semi-supervised approach to disambiguation:
  1. Seed data collected from the Social Web provides the positive training data
  2. Possible web citations provide the unlabelled data
  3. Negative training data is generated using Rocchio classification over the unlabelled data
  4. Positive and negative training data is then used to train an initial classifier
  5. Classifier is applied to the unlabelled data and labels each example
  6. Training sets (positive and negative) are enlarged with the examples from the unlabelled data which exhibit the strongest classification confidences
  7. Examples are removed from the unlabelled data, reducing its size
  8. Steps 4-7 are repeated until all unlabelled data has been classified
• Tested 3 different machine learning classifiers: Perceptron, Support Vector Machines and Naïve Bayes
• RDF models (for both the seed data and the web resources) are converted into machine learning instances
  – RDF instances from the models are used as features for the machine learning instances
  – This permits the variation of distinct feature similarity measures between 3 different RDF graph matching techniques:
    • RDF Entailment: does one graph subsume that of another?
    • Inverse Functional Property Matching: do property values match in distinct graphs where the property is inverse functional?
    • Jaccard similarity (strict graph equivalence): are the graphs identical?
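The three graph matching techniques can be illustrated with a small Python sketch. It makes several simplifying assumptions: each RDF instance is reduced to a set of (predicate, object) pairs, entailment is approximated by a plain subset test (full RDF entailment also has to deal with blank nodes and schema inference), and the set of inverse functional properties is hard-coded to two FOAF properties. All function names are hypothetical.

```python
# FOAF properties treated here as inverse functional:
# a given value identifies at most one subject.
IFPS = {"foaf:mbox", "foaf:homepage"}


def entails(a, b):
    """Simplified entailment: a subsumes b if every statement of b is in a."""
    return set(b) <= set(a)


def ifp_match(a, b, ifps=IFPS):
    """True if both instances share a value for an inverse functional property."""
    return any(p in ifps for p, _ in set(a) & set(b))


def jaccard(a, b):
    """Jaccard similarity; 1.0 means the two instances are identical."""
    a, b = set(a), set(b)
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


seed = {
    ("foaf:name", "Matthew Rowe"),
    ("foaf:homepage", "http://www.dcs.shef.ac.uk/~mrowe"),
    ("foaf:based_near", "http://www.geonames.org/2638077"),
}
found = {
    ("foaf:name", "Matthew Rowe"),
    ("foaf:homepage", "http://www.dcs.shef.ac.uk/~mrowe"),
}

print(entails(seed, found))    # the seed instance subsumes the found one
print(ifp_match(seed, found))  # both share the same foaf:homepage
print(jaccard(seed, found))    # below 1.0, so not strictly equivalent
```

In this example the found instance would count as the same feature under entailment and IFP matching but not under strict Jaccard equivalence, which hints at why the strict measure generalises less readily.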


Identity Disambiguation: Self-training

• Intuition is that as the classifier learns from unlabelled data it will learn from previously unknown features
  – Seed data only covers a portion of a person’s identity
  – Will lead to the detection of more web references
• This is similar to the cognitive process by which humans identify web citations
  – Only a portion of background knowledge is known at the start
  – As more web references are found, the knowledge of the person is expanded
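The self-training procedure can be sketched as the loop below. It is a toy under loud assumptions: examples are sets of string features, the Rocchio step is reduced to comparing cosine similarity against a positive and an unlabelled prototype, and a nearest-prototype scorer stands in for the Perceptron, SVM and Naïve Bayes classifiers that were actually tested. The `batch` parameter and all feature names are hypothetical.

```python
import math
from collections import Counter


def centroid(examples):
    """Average binary feature vector (prototype) of a list of feature sets."""
    counts = Counter(f for ex in examples for f in ex)
    return {f: c / len(examples) for f, c in counts.items()}


def cosine(example, proto):
    """Cosine similarity between a binary feature set and a prototype."""
    dot = sum(proto.get(f, 0.0) for f in example)
    norm = math.sqrt(len(example)) * math.sqrt(sum(w * w for w in proto.values()))
    return dot / norm if norm else 0.0


def rocchio_negatives(positive, unlabelled):
    """Step 3 (simplified): examples closer to the unlabelled prototype
    than to the positive prototype become initial negatives."""
    pos_c, unl_c = centroid(positive), centroid(unlabelled)
    return [x for x in unlabelled if cosine(x, unl_c) > cosine(x, pos_c)]


def self_train(seed, unlabelled, batch=1):
    positive = list(seed)
    negative = rocchio_negatives(positive, unlabelled)
    remaining = list(unlabelled)
    labels = {}
    while remaining:                          # step 8: repeat until exhausted
        pos_c = centroid(positive)            # step 4: "train" the stand-in
        neg_c = centroid(negative)            # classifier (class prototypes)
        scored = [(cosine(x, pos_c) - cosine(x, neg_c), x) for x in remaining]
        scored.sort(key=lambda t: abs(t[0]), reverse=True)   # step 5
        for margin, x in scored[:batch]:      # step 6: most confident first
            labels[x] = margin > 0
            target = positive if margin > 0 else negative
            if x not in target:
                target.append(x)
            remaining.remove(x)               # step 7
    return labels


seed = [frozenset({"org:Sheffield", "knows:Fabio", "knows:Sam", "topic:semweb"})]
u1 = frozenset({"org:Sheffield", "knows:Fabio", "topic:semweb"})
u2 = frozenset({"knows:Sam", "topic:semweb", "venue:WebSci"})
u3 = frozenset({"org:Acme", "topic:plumbing", "city:Denver"})
u4 = frozenset({"topic:plumbing", "city:Denver", "hobby:golf"})
u5 = frozenset({"org:Acme", "hobby:golf", "city:Leeds"})

labels = self_train(seed, [u1, u2, u3, u4, u5])
for name, u in [("u1", u1), ("u2", u2), ("u3", u3), ("u4", u4), ("u5", u5)]:
    print(name, labels[u])
```

Once an example such as `u2` joins the positive set, a feature that was absent from the seed data (`venue:WebSci` here) becomes part of the positive prototype, which is the mechanism behind the intuition on this slide.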


Evaluation

• Dataset
  – 50 members of the Semantic Web and Web 2.0 communities
  – Collected seed data from Facebook and Twitter
  – Collected possible web citations from searching WWW and the Semantic Web for each participant
    • Converted each returned resource into an RDF model representation
    • ~346 web resources to be analysed for each participant
• Evaluation Measures
  – Information retrieval metrics: precision, recall and f-measure
  – Web presence level: proportion of web resources that refer to each participant (e.g. if 50 of 350 web resources refer to a given person, then web presence is 14%)
• Baseline Measure: Human Processing
  – Group of 12 raters manually processed a portion of the dataset for each participant
  – 3 raters performed disambiguation for each participant, then used interrater agreement (Hripcsak & Rothschild, 2005) to calculate IR metrics
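The evaluation measures can be written out as a short Python sketch. It assumes the f-measure is the balanced (F1) harmonic mean of precision and recall, which the slides do not state explicitly; the counts in the second example are made up for illustration.

```python
def precision(tp, fp):
    """Share of resources labelled as references that really are references."""
    return tp / (tp + fp) if tp + fp else 0.0


def recall(tp, fn):
    """Share of true references that were found."""
    return tp / (tp + fn) if tp + fn else 0.0


def f_measure(p, r):
    """Balanced F-measure (assumed F1): harmonic mean of precision and recall."""
    return 2 * p * r / (p + r) if p + r else 0.0


def web_presence(num_references, num_resources):
    """Proportion of collected web resources that refer to the person."""
    return num_references / num_resources


# Example from the slide: 50 of 350 resources refer to the person.
print(round(web_presence(50, 350) * 100))  # 14

# Hypothetical run: 40 references found, 10 false alarms, 10 missed.
p, r = precision(40, 10), recall(40, 10)
print(p, r, f_measure(p, r))
```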


Evaluation: Inference Rules

• Yields high levels of precision, but poor recall scores
  – Specific nature of rules leads to poor application to new instances
• Consistently outperforms humans in terms of precision for all web presence levels
• At low levels of web presence, where web references are sparse, humans perform poorly
  – This is characterised by a “Needle in a Haystack” problem
  – Inhibited by the lack of web references to learn from

          Precision  Recall  F-Measure
Rules     0.955      0.436   0.553
Humans    0.765      0.725   0.719


Evaluation: Self-training

• Perceptron and SVM combined with Entailment outperform humans
  – Due to the large levels of recall achieved by these permutations
  – Entailment leads to a reduction in overfitting to the training data
• Precision is lowered, but recall is improved significantly
  – Performance also remains consistent for all web presence levels
• Jaccard uses strict matching between RDF instances, leading to high precision levels
  – Poor recall levels due to overfitting to training data: unable to generalise to new instances

Classifier    Similarity   Precision  Recall  F-Measure
Perceptron    Entailment   0.629      0.905   0.728
Perceptron    IFP          0.630      0.878   0.715
Perceptron    Jaccard      0.651      0.820   0.700
SVM           Entailment   0.613      0.910   0.731
SVM           IFP          0.628      0.864   0.711
SVM           Jaccard      0.755      0.695   0.691
Naïve Bayes   Entailment   0.629      0.629   0.628
Naïve Bayes   IFP          0.649      0.652   0.649
Naïve Bayes   Jaccard      0.713      0.619   0.633
Humans                     0.765      0.725   0.719


Conclusions

• Social Web platforms provide a useful source of identity information
  – Significant similarity between real world and digital social networks
  – This can in turn be used to support automated disambiguation techniques
• Inference Rules, using a supervised strategy, yield high precision levels yet fail to detect a large portion of identity web references
• Self-training overcomes the limitations of supervised techniques by learning from disambiguation decisions
  – High recall levels demonstrate the effectiveness of such methods in detecting a large portion of web references
• Future work will look to combine these two methods
  – Enlarging the positive training data using Inference Rules, given their high precision levels
  – Then applying Self-training to increase recall levels


Questions?

Twitter: @mattroweshow

Web: http://www.dcs.shef.ac.uk/~mrowe

Email: [email protected]

(Hart et al, 2008) - J. Hart, C. Ridley, F. Taher, C. Sas, and A. Dix. Exploring the Facebook experience: a new approach to usability. In NordiCHI ’08: Proceedings of the 5th Nordic Conference on Human-Computer Interaction, pages 471–474, New York, NY, USA, 2008. ACM.

(Ellison et al, 2007) - N. B. Ellison, C. Steinfield, and C. Lampe. The benefits of Facebook friends: Social capital and college students’ use of online social network sites. Journal of Computer-Mediated Communication, 12:1143–1168, 2007.

(Hripcsak & Rothschild, 2005) - G. Hripcsak and A. S. Rothschild. Agreement, the f-measure, and reliability in information retrieval. Journal of the American Medical Informatics Association, 12(3):296–298, 2005.
