Page 1: Computational Verification Challenges in Social Media

Challenges of Computational Verification in Social Media

Christina Boididou1, Symeon Papadopoulos1, Yiannis Kompatsiaris1, Steve Schifferes2, Nic Newman2

1Centre for Research and Technology Hellas (CERTH) – Information Technologies Institute (ITI)

2City University London – Journalism Department

WWW’14, April 8, Seoul, Korea

Page 2: Computational Verification Challenges in Social Media

How trustworthy is Web multimedia?

#2

Real photo captured in April 2011 by the WSJ, but heavily tweeted during Hurricane Sandy (29 Oct 2012)

Tweeted by multiple sources & retweeted multiple times

Original online at: http://blogs.wsj.com/metropolis/2011/04/28/weather-journal-clouds-gathered-but-no-tornado-damage/

Page 3: Computational Verification Challenges in Social Media

Disseminating (real?) content on Twitter

• Twitter is the platform for sharing newsworthy content in real-time.

• The pressure to air stories quickly leaves very little room for verification.

• Very often, even well-reputed news providers fall for fake news content.

• Here, we examine the feasibility and challenges of conducting verification of shared media content with the help of a machine learning framework.

#3

Page 4: Computational Verification Challenges in Social Media

Related: Web & OSN Spam

• Web spam is a relatively old problem, wherein the spammer tries to “trick” search engines into thinking that a webpage is high-quality when it is not (Gyongyi & Garcia-Molina, 2005).

• Spam has been revived in the age of social media: for instance, spammers try to promote irrelevant links using popular hashtags (Benevenuto et al., 2010; Stringhini et al., 2010).

These works mainly focus on characterizing/detecting the sources of spam (websites, Twitter accounts) rather than the spam content itself.

#4

Z. Gyongyi and H. Garcia-Molina. Web spam taxonomy. In First International Workshop on Adversarial Information Retrieval on the Web (AIRWeb), 2005.
F. Benevenuto, G. Magno, T. Rodrigues, and V. Almeida. Detecting spammers on Twitter. In Collaboration, Electronic Messaging, Anti-Abuse and Spam Conference (CEAS), volume 6, 2010.
G. Stringhini, C. Kruegel, and G. Vigna. Detecting spammers on social networks. In Proceedings of the 26th Annual Computer Security Applications Conference, pages 1–9. ACM, 2010.

Page 5: Computational Verification Challenges in Social Media

Related: Diffusion of Spam

• In many cases, the propagation patterns of real and fake content differ, e.g. in the case of the 2010 Chile earthquake (Mendoza et al., 2010).

• Using a few nodes of the network as “monitors”, one could try to identify sources of fake rumours (Seo and Mohapatra, 2012).

Still, such methods are very hard to use in real-time settings or very soon after an event starts.

#5

M. Mendoza, B. Poblete, and C. Castillo. Twitter under crisis: Can we trust what we RT? In Proceedings of the First Workshop on Social Media Analytics, pages 71–79. ACM, 2010.
E. Seo, P. Mohapatra, and T. Abdelzaher. Identifying rumors and their sources in social networks. In SPIE Defense, Security, and Sensing, 2012.

Page 6: Computational Verification Challenges in Social Media

Related: Assessing Content Credibility

• Four types of features are considered: message, user, topic and propagation (Castillo et al., 2011).

• Tweets with images are classified as fake or real using a machine learning approach (Gupta et al., 2013). The reported accuracy of ~97% is a gross over-estimation of the accuracy to be expected in the real world.

#6

C. Castillo, M. Mendoza, and B. Poblete. Information credibility on Twitter. In Proceedings of the 20th International Conference on World Wide Web, pages 675–684. ACM, 2011.
A. Gupta, H. Lamba, P. Kumaraguru, and A. Joshi. Faking Sandy: characterizing and identifying fake images on Twitter during Hurricane Sandy. In Proceedings of the 22nd International Conference on World Wide Web Companion, pages 729–736, 2013.

Page 7: Computational Verification Challenges in Social Media

Goals/Contributions

• Distinguish between fake and real content shared on Twitter using a supervised approach

• Provide closer to reality estimates of automatic verification performance

• Explore methodological issues with respect to evaluating classifier performance

• Create reusable resources
– Fake (and real) tweets (incl. images) corpus
– Open-source implementation

#7

Page 8: Computational Verification Challenges in Social Media

Methodology

• Corpus Creation
– Topsy API
– Near-duplicate image detection

• Feature Extraction
– Content-based features
– User-based features

• Classifier Building & Evaluation
– Cross-validation
– Independent photo sets
– Cross-dataset training

#8

Page 9: Computational Verification Challenges in Social Media

Corpus Creation

• Define a set of keywords K around an event of interest.
• Use the Topsy API (keyword-based search) and keep only the tweets T that contain images.
• Using independent online sources, define a set of fake images IF and a set of real ones IR.
• Select the subset TC ⊂ T of tweets that contain any of the images in IF or IR.
• Use near-duplicate visual search (VLAD+SURF) to extend TC with tweets that contain near-duplicate images.
• Manually check that the returned near-duplicates indeed correspond to the images of IF or IR.
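The selection step above can be sketched as follows. This is a simplified illustration with made-up data: images are matched by exact URL here, whereas the actual pipeline uses near-duplicate visual search (VLAD+SURF).

```python
def select_candidate_tweets(tweets, fake_images, real_images):
    """Return (tweet_id, label) pairs for tweets whose image belongs to
    the reference fake set IF or the reference real set IR."""
    labelled = []
    for tweet in tweets:
        url = tweet["image_url"]
        if url in fake_images:
            labelled.append((tweet["id"], "fake"))
        elif url in real_images:
            labelled.append((tweet["id"], "real"))
    return labelled

# Hypothetical data for illustration only.
tweets = [
    {"id": 1, "image_url": "http://example.com/shark.jpg"},
    {"id": 2, "image_url": "http://example.com/liberty.jpg"},
    {"id": 3, "image_url": "http://example.com/unrelated.jpg"},
]
fake_images = {"http://example.com/shark.jpg"}
real_images = {"http://example.com/liberty.jpg"}

labelled = select_candidate_tweets(tweets, fake_images, real_images)
# labelled == [(1, "fake"), (2, "real")]
```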

#9

Page 10: Computational Verification Challenges in Social Media

Features

#10

# User Feature

1 Username

2 Number of friends

3 Number of followers

4 Number of followers/number of friends

5 Number of times the user was listed

6 If the user’s status contains URL

7 If the user is verified or not

# Content Feature

1 Length of the tweet

2 Number of words

3 Number of exclamation marks

4 Number of quotation marks

5 Contains emoticon (happy/sad)

6 Number of uppercase characters

7 Number of hashtags

8 Number of mentions

9 Number of pronouns

10 Number of URLs

11 Number of sentiment words

12 Number of retweets
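As a minimal sketch, several of the content features listed above can be computed directly from the tweet text. This covers only an illustrative subset: sentiment words and pronouns would require lexicons, and the retweet count comes from API metadata.

```python
import re

def content_features(text):
    """Compute a subset of the content-based features from raw tweet text."""
    return {
        "length": len(text),                                  # feature 1
        "num_words": len(text.split()),                       # feature 2
        "num_exclamations": text.count("!"),                  # feature 3
        "num_quotes": text.count('"'),                        # feature 4
        "num_uppercase": sum(c.isupper() for c in text),      # feature 6
        "num_hashtags": len(re.findall(r"#\w+", text)),       # feature 7
        "num_mentions": len(re.findall(r"@\w+", text)),       # feature 8
        "num_urls": len(re.findall(r"https?://\S+", text)),   # feature 10
    }

feats = content_features(
    "Sharks in people's front yard #hurricane #sandy http://t.co/PVewUIE1"
)
# e.g. feats["num_hashtags"] == 2 and feats["num_urls"] == 1
```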

Page 11: Computational Verification Challenges in Social Media

Training and Testing the Classifier

• Care should be taken to make sure that no knowledge from the training set enters the test set.

• This is NOT the case when using standard cross-validation.

#11

Page 12: Computational Verification Challenges in Social Media

The Problem with Cross-Validation

#12

Training/test tweets are randomly selected.

One of the reference fake images; multiple tweets per reference image.

Page 13: Computational Verification Challenges in Social Media

Independence of Training-Test Set

#13

Training/test tweets are constrained to correspond to different reference images.
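A minimal sketch of such an image-independent split (with hypothetical data): all tweets sharing the same reference image must land on the same side of the split, otherwise near-duplicate tweets leak from training into testing and inflate accuracy estimates.

```python
def split_by_reference_image(samples, test_images):
    """samples: (feature_vector, label, ref_image) triples.
    test_images: set of reference images reserved for the test set."""
    train, test = [], []
    for feats, label, ref_image in samples:
        side = test if ref_image in test_images else train
        side.append((feats, label, ref_image))
    return train, test

# Toy samples: two near-duplicate tweets share the "shark.jpg" reference image.
samples = [
    ([10, 1], "fake", "shark.jpg"),
    ([11, 1], "fake", "shark.jpg"),
    ([2, 0], "real", "liberty.jpg"),
    ([3, 0], "real", "liberty.jpg"),
]
train, test = split_by_reference_image(samples, test_images={"liberty.jpg"})
# No reference image appears on both sides of the split.
```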

Page 14: Computational Verification Challenges in Social Media

Cross-dataset Training-Testing

• In the most unfavourable case, the dataset used for training should refer to a different event than the one used for testing.

• Simulates real-world scenario of a breaking story, where no prior information is available to news professionals.

• Variants:
– Different event, same domain
– Different event, different domain (very challenging!)
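The cross-dataset protocol can be sketched as follows. Toy data and a minimal nearest-centroid classifier stand in for the real features and classifiers; this illustrates the protocol, not the paper's actual implementation.

```python
def centroid(vectors):
    """Element-wise mean of a list of equal-length feature vectors."""
    n = len(vectors)
    return [sum(v[i] for v in vectors) / n for i in range(len(vectors[0]))]

def train(samples):
    """samples: (feature_vector, label) pairs. Returns per-class centroids."""
    by_label = {}
    for vec, label in samples:
        by_label.setdefault(label, []).append(vec)
    return {label: centroid(vecs) for label, vecs in by_label.items()}

def predict(model, vec):
    """Assign the label of the nearest class centroid (squared distance)."""
    def dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    return min(model, key=lambda label: dist(model[label], vec))

def accuracy(model, samples):
    correct = sum(predict(model, vec) == label for vec, label in samples)
    return correct / len(samples)

# Cross-dataset run: one event for training, a different event for testing.
sandy = [([10, 1], "fake"), ([12, 1], "fake"), ([2, 0], "real"), ([3, 0], "real")]
boston = [([11, 1], "fake"), ([1, 0], "real")]
model = train(sandy)
acc = accuracy(model, boston)
# acc == 1.0 on this toy data
```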

#14

Page 15: Computational Verification Challenges in Social Media

Evaluation

• Datasets
– Hurricane Sandy
– Boston Marathon bombings

• Evaluation of two sets of features (content/user)

• Evaluation of different classifier settings

#15

Page 16: Computational Verification Challenges in Social Media

Dataset – Hurricane Sandy

#16

A natural disaster that struck the USA from October 22nd to 31st, 2012. Fake images and content, such as sharks in the streets of New York and a flooded Statue of Liberty, went viral.

Hashtags

Hurricane Sandy #hurricaneSandy

Hurricane #hurricane

Sandy #Sandy

Page 17: Computational Verification Challenges in Social Media

Dataset – Boston Marathon Bombings

#17

The bombings occurred on 15 April, 2013 during the Boston Marathon when two pressure cooker bombs exploded at 2:49 pm EDT, killing three people and injuring an estimated 264 others.

Hashtags

Boston Marathon #bostonMarathon

Boston bombings #bostonbombings

Boston suspect #bostonSuspect

manhunt #manhunt

watertown #watertown

Tsarnaev #Tsarnaev

4chan #4chan

Sunil Tripathi #prayForBoston

Page 18: Computational Verification Challenges in Social Media

Dataset Statistics

#18

                              Hurricane Sandy   Boston Marathon
Tweets with other image URLs  343,939           112,449
Tweets with fake images       10,758            281
Tweets with real images       3,540             460

Page 19: Computational Verification Challenges in Social Media

Prediction accuracy (1)

#19

• 10-fold cross-validation results using different classifiers: accuracy ~80%

Page 20: Computational Verification Challenges in Social Media

Prediction accuracy (2)

• Results using different training and testing sets from the Hurricane Sandy dataset: ~75%

• Results using Hurricane Sandy for training and the Boston Marathon for testing: ~58%

#20

Page 21: Computational Verification Challenges in Social Media

Sample Results

#21

• Real tweet: My friend's sister's Trampolene in Long Island. #HurricaneSandy
Classified as real

• Real tweet: 23rd street repost from @wendybarton #hurricanesandy #nyc
Classified as fake

• Fake tweet: Sharks in people's front yard #hurricane #sandy #bringing #sharks #newyork #crazy http://t.co/PVewUIE1
Classified as fake

• Fake tweet: Statue of Liberty + crushing waves. http://t.co/7F93HuHV #hurricaneparty #sandy
Classified as real

Page 22: Computational Verification Challenges in Social Media

Conclusion

• Challenges
– Data Collection: (a) fake content is often removed (either by the user or by OSN administrators), (b) API limitations make collection very difficult after an event has taken place
– Classifier accuracy: purely content-based classification can only be of limited use, especially when applied in the context of a different event. However, separate classifiers might be built for certain types of incidents, cf. the use of AIDR for the recent Chile earthquake

• Future Work
– Extend features: (a) geographic location of the user (wrt. location of the incident), (b) time the tweet was posted
– Extend dataset: more events, more fake examples

#22

Page 23: Computational Verification Challenges in Social Media

Thank you!

• Resources:
Slides: http://www.slideshare.net/sympapadopoulos/computational-verification-challenges-in-social-media
Code: https://github.com/socialsensor/computational-verification
Dataset: https://github.com/MKLab-ITI/image-verification-corpus

Help us make it bigger!

• Get in touch:
@sympapadopoulos / [email protected]
@CMpoi / [email protected]

#23

Page 24: Computational Verification Challenges in Social Media

#24

Sample fake and real images in Sandy

• Fake pictures shared on social media

• Real pictures shared on social media