Searching for Quality Microblog Posts: Filtering and Ranking based on Content Analysis and Implicit Links Jan Vosecky, Kenneth Wai-Ting Leung, and Wilfred Ng Department of Computer Science and Engineering HKUST Hong Kong DASFAA’12

Searching for Quality Microblog Posts: Filtering and Ranking based on Content Analysis and Implicit Links

Jan 27, 2015



Jan Vosecky

How to automatically distinguish between high-quality and low-quality content in Twitter?

Twitter is a rapidly growing microblogging platform that offers a large volume of content of great diversity and varying quality. To surface higher-quality content (e.g. posts mentioning news, events, useful facts or well-formed opinions) when a user searches for tweets on Twitter, we propose a new method to filter and rank tweets according to their quality. To model the quality of tweets, we devise a new set of link-based features in addition to content-based features. We examine the implicit links between tweets, URLs, hashtags and users, and then propose novel metrics to reflect the popularity as well as the quality-based reputation of websites, hashtags and users. We then evaluate both the content-based and link-based features in terms of classification effectiveness and identify an optimal feature subset that achieves the best classification accuracy.

Presentation given at the DASFAA 2012 conference (15-18 April 2012, Busan, South Korea).
Authors: Jan Vosecky, Kenneth Wai-Ting Leung, and Wilfred Ng
Full paper: http://www.cse.ust.hk/~wilfred/paper/dasfaa12.pdf
Transcript
Page 1: Searching for Quality Microblog Posts: Filtering and Ranking based on Content Analysis and Implicit Links

Searching for Quality Microblog Posts: Filtering and Ranking based on Content Analysis and Implicit Links

Jan Vosecky, Kenneth Wai-Ting Leung, and Wilfred Ng

Department of Computer Science and Engineering

HKUST

Hong Kong

DASFAA’12

Page 2: Searching for Quality Microblog Posts: Filtering and Ranking based on Content Analysis and Implicit Links

Agenda

Introduction
Proposed method
Quality features of tweets
Experiments
Conclusions

Page 3: Searching for Quality Microblog Posts: Filtering and Ranking based on Content Analysis and Implicit Links

Introduction

Page 4: Searching for Quality Microblog Posts: Filtering and Ranking based on Content Analysis and Implicit Links

Microblogs

Both social network and social media
  Links between users (follow, mention, re-tweet)
  Users post updates (tweets)

[Figure: two example tweets (Tweet 1, Tweet 2) annotated with user, timestamp, URL link, hashtag and mentioned user]

Page 5: Searching for Quality Microblog Posts: Filtering and Ranking based on Content Analysis and Implicit Links

Searching for “ipad” on Twitter

Around 50 tweets mentioning “iPad” posted within a 1-minute period

Page 6: Searching for Quality Microblog Posts: Filtering and Ranking based on Content Analysis and Implicit Links

Research challenge

Twitter: user-generated content
  Short messages, often comments or opinions
  High volume
  Varying quality
    “Most tweets are not of general interest (57%)” (Alonso et al.’10)
Information overload
Research questions:
  How to distinguish content worth reading from useless or less important messages?
  How to promote ‘high quality’ content?

Page 7: Searching for Quality Microblog Posts: Filtering and Ranking based on Content Analysis and Implicit Links

Defining ‘quality’

General (global) definition for assessing tweet quality

3 criteria:
  Well-formedness
    + Well-written, grammatically correct, understandable
    - Heavy slang, misspellings, excessive punctuation
  Factuality
    + News, events, announcements
    - Unclear message, private conversations, generic personal feelings
  Navigational quality (URL links)
    + Reputable external resources (e.g. news articles)
    - Personal information sharing (e.g. photo sharing websites)

Page 8: Searching for Quality Microblog Posts: Filtering and Ranking based on Content Analysis and Implicit Links

Quality-based tweet filtering

[Figure: five example tweets labelled +, -, -, +, -]

Page 9: Searching for Quality Microblog Posts: Filtering and Ranking based on Content Analysis and Implicit Links

Quality-based tweet ranking

[Figure: five example tweets listed with quality scores 5, 4, 3, 1, 1]

Page 10: Searching for Quality Microblog Posts: Filtering and Ranking based on Content Analysis and Implicit Links

Research goals

Quality-based tweet filtering
  Filtering out low-quality tweets
    In Twitter feeds
    In search results
Quality-based tweet ranking
  Re-ranking Twitter search results
    For a given time period

Page 11: Searching for Quality Microblog Posts: Filtering and Ranking based on Content Analysis and Implicit Links

Proposed Method

Page 12: Searching for Quality Microblog Posts: Filtering and Ranking based on Content Analysis and Implicit Links

Representation of tweets

Vector-space model: not sufficient
  Short tweet length, terms often malformed
  Ignores special features in Twitter
Feature-vector representation
  Extract features from tweet
  Traditional features: e.g. length, spelling
  Twitter-specific features:
    Exploiting hashtags, URL links, mentioned usernames

Page 13: Searching for Quality Microblog Posts: Filtering and Ranking based on Content Analysis and Implicit Links

Quality Features of Tweets

Page 14: Searching for Quality Microblog Posts: Filtering and Ranking based on Content Analysis and Implicit Links

Feature categories

1. Punctuation and spelling
  Number of exclamation marks
  Number of question marks
  Max. no. of repeated letters
  % of correctly spelled words
  No. of capitalized words
  Max. no. of consecutive capitalized words

2. Syntactic and semantic complexity
  Max. & avg. word length
  Length of tweet
  Percentage of stopwords
  Contains numbers
  Contains a measure
  Contains emoticons
  Uniqueness score

3. Grammaticality
  Has first-person part-of-speech
  Formality score
  Number of proper names
  Max. no. of consecutive proper names
  Number of named entities

4. Link-based
  Contains link
  Is reply-tweet
  Is re-tweet
  No. of mentions of users
  Number of hashtags
  URL domain reputation score
  RT source reputation score
  Hashtag reputation score

5. Timestamp
  Day of the week of posting
  Hour of the day of posting

Page 15: Searching for Quality Microblog Posts: Filtering and Ranking based on Content Analysis and Implicit Links

1. Punctuation and spelling

Excessive punctuation
  Number of exclamation marks
  Number of question marks
  Max. number of consecutive dots
Capitalization
  Presence of all-capitalized words
  Largest number of consecutive words in capital letters
Spellchecking
  Number of correctly spelled words
  Percentage of words found in a dictionary

Example tweet:
  RT @_ChocolateCoco: WHO IS CHUCK NORRIS??!!!?? lls. He's only the greatest guy next to jesus lmao
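The punctuation and capitalization counts on this slide could be computed roughly as follows. This is a minimal sketch and not the authors' implementation: the function name, the regular expressions and the rule that an "all-capitalized word" has at least two letters are assumptions made for illustration. The spellchecking features are left out because they need a dictionary.

```python
import re

def punctuation_features(text):
    """Count punctuation and capitalization signals in one tweet (illustrative only)."""
    words = re.findall(r"[A-Za-z']+", text)
    # Assumption: a word counts as all-capitalized if it is upper case and longer than one letter.
    caps_flags = [w.isupper() and len(w) > 1 for w in words]

    longest_run = run = 0
    for is_caps in caps_flags:
        run = run + 1 if is_caps else 0
        longest_run = max(longest_run, run)

    dot_runs = re.findall(r"\.+", text)
    return {
        "exclamation_marks": text.count("!"),
        "question_marks": text.count("?"),
        "max_consecutive_dots": max((len(r) for r in dot_runs), default=0),
        "capitalized_words": sum(caps_flags),
        "max_consecutive_capitalized": longest_run,
    }

example = ("RT @_ChocolateCoco: WHO IS CHUCK NORRIS??!!!?? lls. "
           "He's only the greatest guy next to jesus lmao")
features = punctuation_features(example)
```

On the example tweet from this slide, the sketch counts 3 exclamation marks, 4 question marks and a longest run of 4 all-capital words (WHO IS CHUCK NORRIS). Note that "RT" is also counted as a capitalized word; a real pipeline might strip the re-tweet prefix first.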

Page 16: Searching for Quality Microblog Posts: Filtering and Ranking based on Content Analysis and Implicit Links

2. Syntactic and semantic complexity

Syntactic complexity
  Tweet length
  Max. & avg. word length
  Percentage of stopwords
  Presence of emoticons and other sentiment indicators
  Presence of measure symbols ($, %)
  Numbers – number of digits
Tweet uniqueness
  Uniqueness of the tweet relative to other tweets by the author

[Equation: tweet uniqueness score]
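A rough sketch of the syntactic complexity features listed on this slide. The stopword list and the emoticon pattern are tiny hypothetical stand-ins, and the function name is an assumption. The uniqueness score is not included because its formula is not given in the transcript.

```python
import re

# Hypothetical, deliberately tiny stopword list; a real system would use a full one.
STOPWORDS = {"a", "an", "the", "is", "to", "of", "and", "in", "on", "for"}
# Assumed pattern covering a few common Western emoticons such as :) ;-( =D
EMOTICON_RE = re.compile(r"[:;=][-']?[)(DPp]")

def syntactic_features(text):
    """Compute simple length, stopword and symbol features for one tweet."""
    words = re.findall(r"[A-Za-z']+", text)
    lengths = [len(w) for w in words]
    stop_count = sum(w.lower() in STOPWORDS for w in words)
    return {
        "tweet_length": len(text),
        "max_word_length": max(lengths, default=0),
        "avg_word_length": sum(lengths) / len(lengths) if lengths else 0.0,
        "stopword_pct": 100.0 * stop_count / len(words) if words else 0.0,
        "contains_emoticon": bool(EMOTICON_RE.search(text)),
        "contains_measure": bool(re.search(r"[$%]", text)),
        "num_digits": sum(ch.isdigit() for ch in text),
    }

features = syntactic_features("Sales of the iPad rose 15% :)")
```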

Page 17: Searching for Quality Microblog Posts: Filtering and Ranking based on Content Analysis and Implicit Links

3. Grammaticality

Parts-of-speech labelling
  Presence of first person parts-of-speech
  Formality score [Heylighen’02]
    F = (noun frequency + adjective freq. + preposition freq. + article freq. − pronoun freq. − verb freq. − adverb freq. − interjection freq. + 100)/2
Names
  Number of ‘proper names’ as words with a single initial capital letter
  Number of consecutive ‘proper names’
  Number of named entities

F. Heylighen and J.-M. Dewaele. Variation in the contextuality of language: An empirical measure. Context in Context. Special issue Foundations of Science, 7(3):293–340, 2002.
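The formality score F above can be evaluated directly once part-of-speech counts are available. The sketch below assumes the counts come from some POS tagger (not shown), that each frequency is the percentage of all words in the tweet, and that the tag names in the dictionary are hypothetical labels chosen for this example.

```python
POSITIVE_TAGS = ("noun", "adjective", "preposition", "article")
NEGATIVE_TAGS = ("pronoun", "verb", "adverb", "interjection")

def formality_score(pos_counts):
    """Formality score F from a dict of part-of-speech counts for one text."""
    total = sum(pos_counts.values())
    if total == 0:
        # No words: all frequencies are 0, so F = (0 + 100) / 2.
        return 50.0

    def pct(tag):
        return 100.0 * pos_counts.get(tag, 0) / total

    positive = sum(pct(tag) for tag in POSITIVE_TAGS)
    negative = sum(pct(tag) for tag in NEGATIVE_TAGS)
    return (positive - negative + 100.0) / 2.0

# Hypothetical counts for a 10-word tweet.
score = formality_score(
    {"noun": 4, "adjective": 1, "preposition": 1, "article": 1, "pronoun": 1, "verb": 2}
)
```

With these counts the positive categories make up 70% of the words and the negative ones 30%, so F = (70 − 30 + 100) / 2 = 70. The score ranges from 0 (only pronouns, verbs, adverbs or interjections) to 100 (only nouns, adjectives, prepositions or articles).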

Page 18: Searching for Quality Microblog Posts: Filtering and Ranking based on Content Analysis and Implicit Links

4. Link-based features

Links to other items
  Re-tweet (RT), reply tweet, mention of other users
  Presence of a URL link
  Number of hashtags as indicated by the “#” sign
Link target’s quality
  Reputation metrics to reflect the quality of tweets which relate to
    a URL domain
    a hashtag
    a user
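The "links to other items" features can be read off the tweet text with a few patterns. This is a sketch under simplifying assumptions: re-tweets start with "RT", replies start with "@", and only links with an explicit http(s) scheme are detected, so a bare short link such as bit.ly/e2jt9F would be missed. The function and pattern names are hypothetical.

```python
import re

URL_RE = re.compile(r"https?://\S+")
MENTION_RE = re.compile(r"@\w+")
HASHTAG_RE = re.compile(r"#\w+")

def link_features(text):
    """Extract simple link-related flags and counts from one tweet."""
    stripped = text.lstrip()
    return {
        "contains_link": bool(URL_RE.search(text)),
        "is_reply": stripped.startswith("@"),
        "is_retweet": bool(re.match(r"RT\b", stripped)),
        "num_mentions": len(MENTION_RE.findall(text)),
        "num_hashtags": len(HASHTAG_RE.findall(text)),
    }

# Hypothetical tweet used only for illustration.
features = link_features("RT @alice: Keynote slides http://example.org/a #dasfaa #db")
```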

Page 19: Searching for Quality Microblog Posts: Filtering and Ranking based on Content Analysis and Implicit Links

URL domain reputation

Observation:
  Tweets which link to news articles are usually of better quality than tweets which link to photo sharing websites
Questions:
  What does the quality of tweets linking to a website say about its quality?
  Can we predict the quality of future tweets linking to that website?

[Figure: two example domains with the quality labels (Q) of the tweets linking to them. Tweetpic.com: Tweets 1-3 with Q = 1, 3, 2. NYtimes.com: Tweets 4-6 with Q = 5, 4, 5.]

Page 20: Searching for Quality Microblog Posts: Filtering and Ranking based on Content Analysis and Implicit Links

URL domain reputation

Step 1: URL translation
  Short link to original link
  bit.ly/e2jt9F → http://www.reuters.com/4151120
Step 2: Summarize tweets linking to a URL domain
  Accumulate “quality reputation” over time
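Step 2 can be sketched as grouping tweet quality labels by the domain of the (already expanded) URL. Step 1, resolving short links, needs a network lookup and is assumed to have happened beforehand. The helper names, the +1/-1 labels and every URL other than the Reuters one from the slide are hypothetical.

```python
from collections import defaultdict
from urllib.parse import urlparse

def url_domain(url):
    """Return the host of an expanded URL, without a leading 'www.'."""
    host = urlparse(url).netloc.lower()
    return host[4:] if host.startswith("www.") else host

def group_by_domain(labelled_links):
    """Collect quality labels per domain from (expanded_url, quality_label) pairs."""
    by_domain = defaultdict(list)
    for url, quality in labelled_links:
        by_domain[url_domain(url)].append(quality)
    return dict(by_domain)

labels_per_domain = group_by_domain([
    ("http://www.reuters.com/4151120", 1),
    ("http://twitpic.com/abc", -1),      # hypothetical URL
    ("http://reuters.com/x", 1),         # hypothetical URL
])
```

New labelled tweets can be appended to the same structure over time, which is one simple way to accumulate the per-domain evidence that the reputation metrics summarize.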

Page 21: Searching for Quality Microblog Posts: Filtering and Ranking based on Content Analysis and Implicit Links

URL domain reputation

Average URL domain quality

  $AvgQ(d) = \frac{1}{|T_d|} \sum_{t \in T_d} q_t$

  $T_d$ = set of tweets linking to domain d
  $q_t$ = quality label of tweet t

Weakness:
  Does not reflect the number of inlink tweets in the score
  Favours domains with few inlink tweets
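A small sketch of the average domain quality and of the weakness noted above. It assumes quality labels of +1 (high quality) and -1 (low quality), in line with the [-1, +1] range given for AvgQ(d) later in the deck. The function name and the example label lists are hypothetical.

```python
def avg_quality(labels):
    """Mean quality label of the tweets linking to one domain."""
    return sum(labels) / len(labels)

# A domain seen once with a good tweet scores higher than a domain
# with 99 good tweets out of 100, which is the weakness noted on the slide.
single_inlink = avg_quality([1])
many_inlinks = avg_quality([1] * 99 + [-1])
```

Here the single-inlink domain gets AvgQ = 1.0 and the well-attested domain gets 0.98, although far more evidence supports the second one.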

Page 22: Searching for Quality Microblog Posts: Filtering and Ranking based on Content Analysis and Implicit Links

URL domain reputation

Domain reputation score

  $DRS(d) = AvgQ(d) \cdot \log_{10} |T_d|$

  where AvgQ(d) is in the range [-1, +1]

“Collecting evidence” behaviour:
  Score gets higher with more good-quality inlink tweets

[Figure: DRS plotted against the number of inlink tweets |T_d| (logarithmic axis, 1 to 1000) for AvgQ values of -1, -0.5, 0, 0.5 and 1; the DRS axis runs from -4 to 4]
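The "collecting evidence" behaviour can be illustrated with a score of the form DRS(d) = AvgQ(d) · log10(|T_d|). This form is an assumption here; it was chosen because it reproduces the values in the domain table of this deck (for example gallup.com, hrw.org and reuters.com). The function name is hypothetical.

```python
import math

def domain_reputation_score(avg_q, num_inlinks):
    """Scale the average quality by the (log) amount of evidence for a domain."""
    return avg_q * math.log10(num_inlinks)

# (AvgQ, inlinks) pairs taken from the table of high-DRS domains.
gallup = domain_reputation_score(0.96, 99)
hrw = domain_reputation_score(0.86, 57)
reuters = domain_reputation_score(1.00, 6)
```

A domain with a single inlink tweet gets a score of 0 regardless of its average quality, and the magnitude of the score grows as more consistently good (or consistently bad) tweets are observed.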

Page 23: Searching for Quality Microblog Posts: Filtering and Ranking based on Content Analysis and Implicit Links

URL domain reputation

10 domains with a high DRS (mainly news-oriented sites):

Domain         AvgQ   Inlinks   DRS
gallup.com     0.96   99        1.92
mashable.com   0.79   97        1.58
hrw.org        0.86   57        1.51
foxnews.com    0.68   38        1.08
good.is        0.68   31        1.01
intuit.com     0.57   60        1.01
forbes.com     0.68   19        0.87
reuters.com    1.00   6         0.78
cnn.com        0.36   85        0.70

10 domains with a low DRS (mainly image and location sharing sites):

Domain           AvgQ    Inlinks   DRS
tweetphoto.com   -0.77   106       -1.57
twitpic.com      -0.75   113       -1.54
twitlonger.com   -0.85   66        -1.54
myloc.me         -0.85   54        -1.48
instagr.am       -0.62   52        -1.06
formspring.me    -0.78   18        -0.98
yfrog.com        -0.55   53        -0.94
lockerz.com      -0.63   16        -0.75
qik.com          -0.75   8         -0.68

Page 24: Searching for Quality Microblog Posts: Filtering and Ranking based on Content Analysis and Implicit Links

Reputation of hashtag & user

Hashtag reputation
  #DASFAA vs. #justforfun
Re-tweet source user reputation
  @barackobama vs. @wysz22212

[Figure: two example hashtags with the quality labels (Q) of the tweets using them. #justforfun: Tweets 1-3 with Q = 1, 3, 2. #DASFAA: Tweets 4-6 with Q = 5, 4, 5.]

Page 25: Searching for Quality Microblog Posts: Filtering and Ranking based on Content Analysis and Implicit Links

Experiments

Page 26: Searching for Quality Microblog Posts: Filtering and Ranking based on Content Analysis and Implicit Links

Dataset

10,000 tweets
  100 users, 100 recent tweets per user
Users:
  50 random users
  50 influential users
    Selected from listorious.com
    5 categories: technology, business, politics, celebrities, activism
    10 users per category

Page 27: Searching for Quality Microblog Posts: Filtering and Ranking based on Content Analysis and Implicit Links

Labelling

Crowdsourcing
  Amazon Mechanical Turk
  3 labels per tweet from different reviewers
  Possible labels: 1 to 5
    1 = low quality, 5 = high quality
  Random order of tweets

Page 28: Searching for Quality Microblog Posts: Filtering and Ranking based on Content Analysis and Implicit Links

Labelling results

[Figure: tweet quality distribution]

[Equation: quality score]

Page 29: Searching for Quality Microblog Posts: Filtering and Ranking based on Content Analysis and Implicit Links

Feature analysis

Total 29 features
Top 5 features based on Information Gain:
  0.374  Domain reputation
  0.287  Contains link
  0.130  Formality score
  0.127  Num. proper names
  0.113  Max. proper names
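For reference, information gain measures how much knowing a feature value reduces the entropy of the class label. The sketch below handles discrete feature values only; a continuous feature such as the domain reputation score would first have to be discretized, and how that was done for the numbers above is not stated in the slides. Function names and toy data are hypothetical.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total) for n in Counter(labels).values())

def information_gain(feature_values, labels):
    """Entropy of the labels minus the expected entropy after splitting on the feature."""
    total = len(labels)
    remainder = 0.0
    for value in set(feature_values):
        subset = [y for x, y in zip(feature_values, labels) if x == value]
        remainder += len(subset) / total * entropy(subset)
    return entropy(labels) - remainder

# Toy data: "contains link" perfectly separates the classes, "is reply" tells nothing.
labels = ["high", "high", "low", "low"]
gain_link = information_gain([1, 1, 0, 0], labels)
gain_reply = information_gain([1, 0, 1, 0], labels)
```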

Page 30: Searching for Quality Microblog Posts: Filtering and Ranking based on Content Analysis and Implicit Links

Feature selection

Greedy attribute selection
15 selected features:
  Domain reputation
  RT source reputation
  Formality
  Tweet uniqueness
  No. named entities
  % correctly spelled words
  Max. no. repeated letters
  No. hashtags
  Contains numbers
  No. capitalized words
  Is reply-tweet
  Is re-tweet
  Avg. word length
  Contains first-person
  No. exclamation marks
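One common form of greedy attribute selection is forward selection: start from an empty set and repeatedly add the feature that improves an evaluation score the most, stopping when nothing helps. The slides do not say which variant or scoring function was used, so the sketch below is an assumption; `score` stands for any user-supplied evaluation such as cross-validated classifier accuracy, and the toy scoring function is purely illustrative.

```python
def greedy_forward_selection(features, score):
    """Add one feature at a time while the score of the selected subset improves."""
    selected = []
    best = score(selected)
    remaining = list(features)
    while remaining:
        candidates = [(score(selected + [f]), f) for f in remaining]
        top_score, top_feature = max(candidates)
        if top_score <= best:
            break  # no remaining feature improves the score
        selected.append(top_feature)
        remaining.remove(top_feature)
        best = top_score
    return selected, best

# Toy scoring function: each feature adds a fixed (hypothetical) contribution.
CONTRIBUTION = {"domain_reputation": 0.3, "formality": 0.2, "hour_of_day": -0.1}

def toy_score(subset):
    return 0.5 + sum(CONTRIBUTION[f] for f in subset)

chosen, final_score = greedy_forward_selection(CONTRIBUTION, toy_score)
```

In the toy run the two helpful features are selected in order of their contribution and the harmful one is left out.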

Page 31: Searching for Quality Microblog Posts: Filtering and Ranking based on Content Analysis and Implicit Links

Classification and Ranking Method

Classification:
  SVM, binary classification (high-quality, low-quality)
  50/50 split for training/testing
Ranking:
  Learning-to-rank (Rank SVM)
  30 queries from 5 topic categories
  Process:
    1. Retrieve tweets matching a query
    2. Extract features from the tweets
    3. ‘Query-tweet vector’ pairs + quality scores of the tweets passed as input to Rank SVM
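Step 3 of the ranking process can be pictured as writing one training line per query-tweet pair. The sketch below assumes the SVMlight-style text format used by common Rank SVM tools ("target qid:N index:value ..."); the slides do not name the tool or input format, so this is illustrative only. The feature values in the example are hypothetical.

```python
def to_ranking_line(quality, query_id, features):
    """Format one query-tweet pair as '<quality> qid:<id> 1:<v1> 2:<v2> ...'."""
    pairs = " ".join(f"{index}:{value:g}" for index, value in enumerate(features, start=1))
    return f"{quality} qid:{query_id} {pairs}"

# Two tweets retrieved for the same query (qid 1), with hypothetical feature vectors
# [domain reputation, contains link, number of hashtags].
lines = [
    to_ranking_line(5, 1, [1.92, 1.0, 0.0]),
    to_ranking_line(1, 1, [-1.54, 1.0, 2.0]),
]
```

The learner only compares tweets that share a query id, so quality scores are treated as relative preferences within each query.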

Page 32: Searching for Quality Microblog Posts: Filtering and Ranking based on Content Analysis and Implicit Links

Classification results

Features                     #attributes   High-quality P   High-quality R   Low-quality P   Low-quality R   Overall AUC
Link only                    1             0.798            0.702            0.894           0.934           0.818
TF-IDF                       3322          0.862            0.665            0.885           0.96            0.813
Subset.Reputation            3             0.812            0.746            0.909           0.936           0.841
Subset.SVM (“greedy”)        15            0.715            0.758            0.912           0.936           0.847
All quality features         29            0.815            0.66             0.882           0.944           0.802
All quality ftr’s + TF-IDF   3351          0.739            0.775            0.915           0.899           0.837

Optimal feature set (15 attrs.) outperforms TF-IDF (3322 attrs.)
Link-based “reputation” features (3 attrs.) achieve the 2nd best result
Combining quality features + TF-IDF does not improve result
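As a reminder of how the per-class P (precision) and R (recall) columns are defined, here is a minimal sketch. The function name and the toy predictions are hypothetical and unrelated to the reported numbers.

```python
def precision_recall(y_true, y_pred, positive):
    """Precision and recall for one class, treating `positive` as the target class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

y_true = ["high", "high", "low", "low", "low"]
y_pred = ["high", "low", "low", "low", "high"]
high_p, high_r = precision_recall(y_true, y_pred, "high")
low_p, low_r = precision_recall(y_true, y_pred, "low")
```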

Page 33: Searching for Quality Microblog Posts: Filtering and Ranking based on Content Analysis and Implicit Links

Classification results

Features                     #attributes   AUC
Link only                    1             0.818
TF-IDF                       3322          0.813
Subset.Reputation            3             0.841
Subset.SVM (“greedy”)        15            0.847
All quality features         29            0.802
All quality ftr’s + TF-IDF   3351          0.837

[Figure: storage cost and training time per feature set]

Optimal feature set achieves reduced training time and storage cost

Page 34: Searching for Quality Microblog Posts: Filtering and Ranking based on Content Analysis and Implicit Links

Ranking results

Features                #attributes   NDCG@1   NDCG@2   NDCG@5   NDCG@10   MAP
Link only               1             0.067    0.111    0.22     0.324     0.398
Subset.Reputation       3             0.822    0.777    0.777    0.764     0.661
Subset.SVM (“greedy”)   15            0.867    0.767    0.778    0.769     0.653
All quality features    29            0.733    0.733    0.763    0.753     0.637

[Equation: definition of NDCG@N]

Optimal feature set (15 attrs.) achieves the best result
Link-based “reputation” features (3 attrs.) achieve the 2nd best result
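NDCG@N compares the discounted gain of a ranking with that of the ideal ordering. The exact formula used in the talk is not preserved in the transcript; the sketch below uses one widely used formulation (gain 2^rel − 1, discount log2(rank + 1)) and should be read as an assumption. Function names are hypothetical.

```python
import math

def dcg_at_n(relevances, n):
    """Discounted cumulative gain of the first n items of a ranked list."""
    return sum((2 ** rel - 1) / math.log2(rank + 1)
               for rank, rel in enumerate(relevances[:n], start=1))

def ndcg_at_n(relevances, n):
    """DCG@n normalized by the DCG@n of the ideal (descending) ordering."""
    ideal = dcg_at_n(sorted(relevances, reverse=True), n)
    return dcg_at_n(relevances, n) / ideal if ideal else 0.0

# Quality scores of tweets in ranked order.
perfect = ndcg_at_n([5, 4, 3, 1, 1], 5)   # already in ideal order
poor_top = ndcg_at_n([1, 5], 1)           # low-quality tweet ranked first
```

A ranking that is already sorted by quality gets NDCG = 1, while placing a quality-1 tweet above a quality-5 tweet gives NDCG@1 = 1/31 under this gain function.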

Page 35: Searching for Quality Microblog Posts: Filtering and Ranking based on Content Analysis and Implicit Links

Conclusions

Page 36: Searching for Quality Microblog Posts: Filtering and Ranking based on Content Analysis and Implicit Links

Summary

Method for quality-based classification and ranking of tweets
Proposed and evaluated a set of tweet features to capture tweet quality
Link-based features lead to the best performance

Page 37: Searching for Quality Microblog Posts: Filtering and Ranking based on Content Analysis and Implicit Links

Future work

Consider different types of queries in Twitter
  E.g. searching for hot topics, movie reviews, facts, opinions, etc.
  Different features may be important in different scenarios
Incorporating recent hot topics
Personalized re-ranking

Page 38: Searching for Quality Microblog Posts: Filtering and Ranking based on Content Analysis and Implicit Links

Q / A

Page 39: Searching for Quality Microblog Posts: Filtering and Ranking based on Content Analysis and Implicit Links

Thank You

Page 40: Searching for Quality Microblog Posts: Filtering and Ranking based on Content Analysis and Implicit Links

Related work

Spam detection
  Bag-of-words, keyword-based
  Feature-based approaches
  Combinations
Social networks
  Finding quality answers in Q-A systems
    E.g. Yahoo Answers
    Feature-based
Web search
  Quality-based ranking of web documents
    Feature-based quality score (WSDM’11)

Page 41: Searching for Quality Microblog Posts: Filtering and Ranking based on Content Analysis and Implicit Links


ROC Curve

Area under the ROC curve: probability that a classifier will rank a randomly chosen positive instance higher than a randomly chosen negative one
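The probabilistic reading of AUC above can be computed directly by comparing every positive-negative pair of classifier scores. This is a small quadratic-time sketch for illustration; the function name, the convention that ties count as half a win, and the toy scores are assumptions.

```python
def auc(scores, labels):
    """Fraction of (positive, negative) pairs in which the positive scores higher."""
    positives = [s for s, y in zip(scores, labels) if y == 1]
    negatives = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in positives for n in negatives)
    return wins / (len(positives) * len(negatives))

# Toy classifier scores: label 1 = high quality, label 0 = low quality.
value = auc([0.9, 0.8, 0.3, 0.1], [1, 0, 1, 0])
```

In the toy example three of the four positive-negative pairs are ordered correctly, giving an AUC of 0.75. A perfect ranking gives 1.0 and a random one about 0.5.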