Searching for Quality Microblog Posts: Filtering and Ranking based on Content Analysis and Implicit Links

Jan Vosecky, Kenneth Wai-Ting Leung, and Wilfred Ng

Department of Computer Science and Engineering

HKUST

Hong Kong

DASFAA’12

Agenda

Introduction
Proposed method
Quality features of tweets
Experiments
Conclusions

Introduction


Microblogs

Both a social network and social media
Links between users (follow, mention, re-tweet)
Users post updates (tweets)

[Figure: two example tweets (Tweet 1, Tweet 2) annotated with user, timestamp, URL link, hashtag, and mentioned user]


Searching for “ipad” on Twitter

Around 50 tweets mentioning “iPad” posted within a 1-minute period


Research challenge

Twitter: user-generated content
  Short messages, often comments or opinions
  High volume
  Varying quality
    “Most tweets are not of general interest (57%)” (Alonso et al.’10)
  Information overload

Research questions:
  How to distinguish content worth reading from useless or less important messages?
  How to promote ‘high quality’ content?


Defining ‘quality’

General (global) definition for assessing tweet quality

3 criteria:
  Well-formedness
    + Well-written, grammatically correct, understandable
    - Heavy slang, misspellings, excessive punctuation
  Factuality
    + News, events, announcements
    - Unclear message, private conversations, generic personal feelings
  Navigational quality (URL links)
    + Reputable external resources (e.g. news articles)
    - Personal information sharing (e.g. photo sharing websites)


Quality-based tweet filtering

[Figure: example tweets marked as high quality (+) or low quality (-): +, -, -, +, -]


Quality-based tweet ranking

[Figure: example tweets ranked by quality score: 5, 4, 3, 1, 1]


Research goals

Quality-based tweet filtering
  Filtering out low-quality tweets
    In Twitter feeds
    In search results
Quality-based tweet ranking
  Re-ranking Twitter search results
    For a given time period


Proposed Method


Representation of tweets

Vector-space model: not sufficient
  Short tweet length, terms often malformed
  Ignores special features in Twitter
Feature-vector representation
  Extract features from tweet
  Traditional features: e.g. length, spelling
  Twitter-specific features:
    Exploiting hashtags, URL links, mentioned usernames
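
The feature-vector idea can be illustrated with a minimal Python sketch. The regular expressions, the function name `twitter_features` and the feature names are illustrative assumptions, not the extraction rules used in the study; real tweets would likely need more careful tokenization.

```python
import re

# Illustrative patterns (assumptions): simple approximations of tweet entities.
URL_RE = re.compile(r"https?://\S+")
HASHTAG_RE = re.compile(r"#\w+")
MENTION_RE = re.compile(r"@\w+")


def twitter_features(text):
    """Map a raw tweet to a small dict of Twitter-specific features."""
    urls = URL_RE.findall(text)
    # Remove URLs first so that '#fragment' parts are not counted as hashtags.
    without_urls = URL_RE.sub(" ", text)
    stripped = text.lstrip()
    return {
        "length": len(text),
        "contains_link": bool(urls),
        "num_hashtags": len(HASHTAG_RE.findall(without_urls)),
        "num_mentions": len(MENTION_RE.findall(without_urls)),
        "is_retweet": stripped.startswith("RT @"),
        "is_reply": stripped.startswith("@"),
    }


example = twitter_features("Reading http://example.com/a#frag #news with @bob")
```

Here `example` reports one link, one hashtag and one mention, which a bag-of-words model would treat as ordinary (and mostly unseen) terms.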


Quality Features of Tweets


Feature categories

1. Punctuation and spelling
  Number of exclamation marks
  Number of question marks
  Max. no. of repeated letters
  % of correctly spelled words
  No. of capitalized words
  Max. no. of consecutive capitalized words

2. Syntactic and semantic complexity
  Max. & avg. word length
  Length of tweet
  Percentage of stopwords
  Contains numbers
  Contains a measure
  Contains emoticons
  Uniqueness score

3. Grammaticality
  Has first-person part-of-speech
  Formality score
  Number of proper names
  Max. no. of consecutive proper names
  Number of named entities

4. Link-based
  Contains link
  Is reply-tweet
  Is re-tweet
  No. of mentions of users
  Number of hashtags
  URL domain reputation score
  RT source reputation score
  Hashtag reputation score

5. Timestamp
  Day of the week of posting
  Hour of the day of posting

1. Punctuation and spelling

Excessive punctuation
  Number of exclamation marks
  Number of question marks
  Max. number of consecutive dots
Capitalization
  Presence of all-capitalized words
  Largest number of consecutive words in capital letters
Spellchecking
  Number of correctly spelled words
  Percentage of words found in a dictionary

RT @_ChocolateCoco: WHO IS CHUCK NORRIS??!!!?? lls. He's only the greatest guy next to jesus lmao
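
As a rough illustration, the punctuation, capitalization and spelling features could be computed as in the Python sketch below. The tokenization regex, the rule that an all-capitalized word needs at least two letters, and the caller-supplied `dictionary` set are assumptions made for this example only.

```python
import re

WORD_RE = re.compile(r"[A-Za-z']+")  # assumption: very simple word tokenizer


def punctuation_features(text):
    """Count punctuation and capitalization signals in a tweet."""
    words = WORD_RE.findall(text)
    # Assumption: an "all-capitalized word" has 2+ characters, all upper case.
    caps = [len(w) > 1 and w.isupper() for w in words]
    longest = run = 0
    for is_cap in caps:
        run = run + 1 if is_cap else 0
        longest = max(longest, run)
    dot_runs = re.findall(r"\.+", text)
    return {
        "num_exclamation": text.count("!"),
        "num_question": text.count("?"),
        "max_consecutive_dots": max((len(r) for r in dot_runs), default=0),
        "num_capitalized_words": sum(caps),
        "max_consecutive_capitalized": longest,
    }


def correctly_spelled_ratio(text, dictionary):
    """Fraction of words found in a caller-supplied set of known words."""
    words = [w.lower() for w in WORD_RE.findall(text)]
    if not words:
        return 0.0
    return sum(w in dictionary for w in words) / len(words)


tweet = ("RT @_ChocolateCoco: WHO IS CHUCK NORRIS??!!!?? lls. "
         "He's only the greatest guy next to jesus lmao")
features = punctuation_features(tweet)
```

For the example tweet above, this sketch counts 3 exclamation marks, 4 question marks, and a longest run of 4 consecutive capitalized words ("WHO IS CHUCK NORRIS").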


2. Syntactic and semantic complexity

Syntactic complexity
  Tweet length
  Max. & avg. word length
  Percentage of stopwords
  Presence of emoticons and other sentiment indicators
  Presence of measure symbols ($, %)
  Numbers: number of digits
Tweet uniqueness
  Uniqueness of the tweet relative to other tweets by the author
  [Equation: uniqueness score; not preserved in this text]
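
The syntactic-complexity features lend themselves to a short Python sketch. The stopword list, the emoticon pattern and the feature names below are small illustrative assumptions; the uniqueness score is left out because its definition is not given in this text.

```python
import re

# Assumptions: a tiny stopword list and a rough emoticon pattern, for illustration.
STOPWORDS = {"a", "an", "the", "is", "to", "of", "and", "in", "on", "for"}
EMOTICON_RE = re.compile(r"[:;=][-']?[()DPp]")
WORD_RE = re.compile(r"[A-Za-z']+")


def complexity_features(text):
    """Compute simple length, stopword, emoticon, measure and digit features."""
    words = WORD_RE.findall(text)
    lengths = [len(w) for w in words]
    n = len(words)
    n_stop = sum(w.lower() in STOPWORDS for w in words)
    return {
        "tweet_length": len(text),
        "max_word_length": max(lengths, default=0),
        "avg_word_length": sum(lengths) / n if n else 0.0,
        "pct_stopwords": 100.0 * n_stop / n if n else 0.0,
        "contains_emoticon": bool(EMOTICON_RE.search(text)),
        "contains_measure": any(ch in text for ch in "$%"),
        "num_digits": sum(ch.isdigit() for ch in text),
    }


features = complexity_features("Sales up 15% to $20 :)")
```

The emoticon pattern is deliberately crude and will produce false positives on strings such as "title:Dog"; a production feature extractor would use a curated emoticon list.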

3. Grammaticality

Parts-of-speech labelling
  Presence of first-person parts-of-speech
  Formality score [Heylighen’02]

    F = (noun freq. + adjective freq. + preposition freq. + article freq. − pronoun freq. − verb freq. − adverb freq. − interjection freq. + 100) / 2

Names
  Number of ‘proper names’, defined as words with a single initial capital letter
  Number of consecutive ‘proper names’
  Number of named entities

F. Heylighen and J.-M. Dewaele. Variation in the contextuality of language: An empirical measure. Context in Context. Special issue Foundations of Science, 7(3):293–340, 2002.
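
The formality score is easy to compute once part-of-speech counts are available. The Python sketch below assumes that each frequency is expressed as a percentage of all words in the tweet and that the tagger output has already been mapped to the eight category names used here; both the category names and the function name are hypothetical.

```python
POSITIVE = ("noun", "adjective", "preposition", "article")
NEGATIVE = ("pronoun", "verb", "adverb", "interjection")


def formality_score(pos_counts, total_words):
    """F-score from part-of-speech counts.

    pos_counts maps a category name to its count; total_words is the number
    of words in the text. Frequencies are taken as percentages (assumption).
    """
    if total_words <= 0:
        raise ValueError("total_words must be positive")

    def freq(tag):
        return 100.0 * pos_counts.get(tag, 0) / total_words

    plus = sum(freq(tag) for tag in POSITIVE)
    minus = sum(freq(tag) for tag in NEGATIVE)
    return (plus - minus + 100.0) / 2.0


# 10 words: 3 nouns, 2 articles, 1 verb, 1 pronoun (the rest uncounted).
score = formality_score({"noun": 3, "article": 2, "verb": 1, "pronoun": 1}, 10)
```

With these counts the score is (50 − 20 + 100) / 2 = 65, while a text with no counted categories sits at the neutral value of 50.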


4. Link-based features

Links to other items
  Re-tweet (RT), reply tweet, mention of other users
  Presence of a URL link
  Number of hashtags, as indicated by the “#” sign
Link target’s quality
  Reputation metrics to reflect the quality of tweets which relate to:
    A URL domain
    A hashtag
    A user


URL domain reputation

Observation:
  Tweets which link to news articles are usually of better quality than tweets which link to photo sharing websites

Questions:
  What does the quality of tweets linking to a website say about its quality?
  Can we predict the quality of future tweets linking to that website?

[Figure: tweets linking to two domains, with quality labels. Tweetpic.com: Tweet 1 (Q = 1), Tweet 2 (Q = 3), Tweet 3 (Q = 2). NYtimes.com: Tweet 4 (Q = 5), Tweet 5 (Q = 4), Tweet 6 (Q = 5).]



Step 1: URL translation
  Short link to original link
  bit.ly/e2jt9F → http://www.reuters.com/4151120

Step 2: Summarize tweets linking to a URL domain
  Accumulate “quality reputation” over time
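
The two steps can be sketched in Python as follows. Resolving a short link normally requires an HTTP request to the shortening service; to keep the example self-contained, a hypothetical lookup table `SHORT_LINKS` stands in for that step. Stripping a leading "www." to obtain the domain is also an assumption of this sketch.

```python
from urllib.parse import urlparse

# Hypothetical lookup table standing in for real short-link resolution.
SHORT_LINKS = {"bit.ly/e2jt9F": "http://www.reuters.com/4151120"}


def expand(url, table=SHORT_LINKS):
    """Step 1: translate a short link to its original link, if known."""
    key = url.split("://", 1)[-1]  # drop the scheme, if any
    return table.get(key, url)


def domain_of(url):
    """Step 2 (grouping key): extract the URL domain, without 'www.'."""
    if "://" not in url:
        url = "http://" + url  # urlparse needs a scheme to find the host
    host = (urlparse(url).hostname or "").lower()
    return host[4:] if host.startswith("www.") else host


domain = domain_of(expand("bit.ly/e2jt9F"))
```

Tweets can then be grouped by `domain` so that quality labels accumulate per website rather than per individual link.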



Average URL domain quality

$$ \mathrm{AvgQ}(d) = \frac{1}{|T_d|} \sum_{t \in T_d} q_t $$

where
  T_d = set of tweets linking to domain d
  q_t = quality label of tweet t

Weakness:
  Does not reflect the number of inlink tweets in the score
  Favours domains with few inlink tweets



Domain reputation score

$$ \mathrm{DRS}(d) = \mathrm{AvgQ}(d) \cdot \log_{10} |T_d| $$

where AvgQ(d) lies in [-1, +1]

“Collecting evidence” behaviour:
  The score increases with more good-quality inlink tweets

[Figure: DRS (from -4 to 4) plotted against |T_d| (log scale, 1 to 1000) and AvgQ (from -1 to 1)]
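
A minimal Python sketch of the two scores, assuming AvgQ(d) is the mean quality label of the tweets linking to domain d and DRS(d) = AvgQ(d) · log10(|T_d|). The logarithmic form is inferred from the AvgQ, inlink and RS values listed for the example domains (for instance, 0.96 with 99 inlinks gives about 1.92); function names are illustrative.

```python
import math
from collections import defaultdict


def avg_quality(labels):
    """Mean quality label of the tweets linking to one domain."""
    return sum(labels) / len(labels)


def domain_reputation(avg_q, n_inlinks):
    """Assumed form: AvgQ(d) * log10(|T_d|)."""
    if n_inlinks < 1:
        raise ValueError("a domain needs at least one inlink tweet")
    return avg_q * math.log10(n_inlinks)


def reputation_table(labelled_tweets):
    """labelled_tweets: iterable of (domain, quality) with quality in [-1, +1]."""
    by_domain = defaultdict(list)
    for domain, quality in labelled_tweets:
        by_domain[domain].append(quality)
    return {
        d: domain_reputation(avg_quality(qs), len(qs))
        for d, qs in by_domain.items()
    }


drs = round(domain_reputation(0.96, 99), 2)
```

Under this form a domain with a single inlink tweet scores 0 regardless of its label, which addresses the weakness of the plain average: a score only moves away from zero as evidence accumulates.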



10 domains with a high DRS (mainly news-oriented sites):

Domain           AvgQ    Inlinks   RS
gallup.com       0.96    99        1.92
mashable.com     0.79    97        1.58
hrw.org          0.86    57        1.51
foxnews.com      0.68    38        1.08
good.is          0.68    31        1.01
intuit.com       0.57    60        1.01
forbes.com       0.68    19        0.87
reuters.com      1.00    6         0.78
cnn.com          0.36    85        0.70

10 domains with a low DRS (mainly image and location sharing sites):

Domain           AvgQ    Inlinks   RS
tweetphoto.com   -0.77   106       -1.57
twitpic.com      -0.75   113       -1.54
twitlonger.com   -0.85   66        -1.54
myloc.me         -0.85   54        -1.48
instagr.am       -0.62   52        -1.06
formspring.me    -0.78   18        -0.98
yfrog.com        -0.55   53        -0.94
lockerz.com      -0.63   16        -0.75
qik.com          -0.75   8         -0.68


Reputation of hashtag & user

Hashtag reputation
  #DASFAA vs. #justforfun

Re-tweet source user reputation
  @barackobama vs. @wysz22212

[Figure: tweets grouped by hashtag, with quality labels. #justforfun: Tweet 1 (Q = 1), Tweet 2 (Q = 3), Tweet 3 (Q = 2). #DASFAA: Tweet 4 (Q = 5), Tweet 5 (Q = 4), Tweet 6 (Q = 5).]

Experiments


Dataset

10,000 tweets
  100 users, 100 recent tweets per user
Users:
  50 random users
  50 influential users
    Selected from listorious.com
    5 categories: technology, business, politics, celebrities, activism
    10 users per category


Labelling

Crowdsourcing
  Amazon Mechanical Turk
  3 labels per tweet from different reviewers
  Possible labels: 1 to 5
    1 = low quality, 5 = high quality
  Random order of tweets


Labelling results

Tweet quality distribution

Quality score:


Feature analysis

Total 29 features
Top 5 features based on Information Gain:

Information Gain   Feature
0.374              Domain reputation
0.287              Contains link
0.130              Formality score
0.127              Num. proper names
0.113              Max. proper names
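
For readers unfamiliar with the measure, information gain is the reduction in label entropy obtained by splitting the data on a feature. The Python sketch below handles discrete feature values only; continuous features such as the reputation scores would first have to be discretized, and how that was done here is not stated. Function names are illustrative.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a non-empty list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())


def information_gain(feature_values, labels):
    """Entropy of the labels minus the weighted entropy after splitting."""
    n = len(labels)
    groups = {}
    for value, label in zip(feature_values, labels):
        groups.setdefault(value, []).append(label)
    remainder = sum(len(g) / n * entropy(g) for g in groups.values())
    return entropy(labels) - remainder


labels = ["high", "high", "low", "low"]
perfect = information_gain([1, 1, 0, 0], labels)   # feature separates classes
useless = information_gain([1, 0, 1, 0], labels)   # feature carries no signal
```

A feature that separates the two classes perfectly gains the full 1 bit of label entropy, while a feature independent of the label gains nothing.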


Feature selection

Greedy attribute selection
15 selected features:
  Domain reputation
  RT source reputation
  Formality
  Tweet uniqueness
  No. of named entities
  % of correctly spelled words
  Max. no. of repeated letters
  No. of hashtags
  Contains numbers
  No. of capitalized words
  Is reply-tweet
  Is re-tweet
  Avg. word length
  Contains first-person
  No. of exclamation marks
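
One common form of greedy attribute selection is forward selection: start from an empty set and repeatedly add the feature that most improves an evaluation score, stopping when no feature helps. The Python sketch below assumes that variant; the slides do not say which greedy strategy or scoring function was used, and `score` is a hypothetical callback (for example, cross-validated classifier accuracy).

```python
def greedy_forward_selection(features, score, max_features=None):
    """Greedily grow a feature subset while the score keeps improving.

    features: candidate feature names.
    score: hypothetical callback mapping a list of names to a number.
    """
    selected = []
    best = score(selected)
    remaining = list(features)
    while remaining and (max_features is None or len(selected) < max_features):
        candidate, candidate_score = None, best
        for name in remaining:
            s = score(selected + [name])
            if s > candidate_score:
                candidate, candidate_score = name, s
        if candidate is None:  # no remaining feature improves the score
            break
        selected.append(candidate)
        remaining.remove(candidate)
        best = candidate_score
    return selected, best


# Toy additive score: each feature contributes a fixed (made-up) gain.
gains = {"domain_reputation": 0.3, "formality": 0.2, "hour_of_day": -0.1}
subset, value = greedy_forward_selection(gains, lambda fs: sum(gains[f] for f in fs))
```

With the toy gains above, the procedure keeps the two helpful features and stops before adding the harmful one.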


Classification and Ranking Method

Classification:
  SVM, binary classification (high-quality, low-quality)
  50/50 split for training/testing
Ranking:
  Learning-to-rank (Rank SVM)
  30 queries from 5 topic categories
Process:
  1. Retrieve tweets matching a query
  2. Extract features from the tweets
  3. ‘Query-tweet vector’ pairs + quality scores of the tweets passed as input to Rank SVM
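
Rank SVM learns from preference pairs within each query. A minimal Python sketch of that pairwise transformation is shown below, under the assumption that the standard formulation is used: for two tweets retrieved by the same query with different quality scores, the difference of their feature vectors becomes a training example labelled +1 or -1. The function name and data layout are illustrative.

```python
def pairwise_examples(query_groups):
    """Build pairwise training examples for a linear ranking model.

    query_groups: {query_id: [(feature_vector, quality_score), ...]}
    Returns a list of (difference_vector, label) with label in {+1, -1}.
    Pairs are only formed between tweets retrieved for the same query.
    """
    examples = []
    for items in query_groups.values():
        for i, (xi, qi) in enumerate(items):
            for xj, qj in items[i + 1:]:
                if qi == qj:
                    continue  # ties carry no preference information
                diff = [a - b for a, b in zip(xi, xj)]
                examples.append((diff, 1 if qi > qj else -1))
    return examples


pairs = pairwise_examples({
    "ipad": [([1.0, 0.0], 5), ([0.0, 1.0], 1), ([0.5, 0.5], 5)],
})
```

A binary linear classifier trained on these difference vectors yields a weight vector whose dot product with a tweet's features can be used as its ranking score.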


Classification results

Features                      #attributes   High-Quality     Low-Quality      Overall
                                            P       R        P       R        AUC
Link only                     1             0.798   0.702    0.894   0.934    0.818
TF-IDF                        3322          0.862   0.665    0.885   0.96     0.813
Subset.Reputation             3             0.812   0.746    0.909   0.936    0.841
Subset.SVM (“greedy”)         15            0.715   0.758    0.912   0.936    0.847
All quality features          29            0.815   0.66     0.882   0.944    0.802
All quality ftr’s + TF-IDF    3351          0.739   0.775    0.915   0.899    0.837

Optimal feature set (15 attrs.) outperforms TF-IDF (3322 attrs.)
Link-based “reputation” features (3 attrs.) achieve the 2nd best result
Combining quality features + TF-IDF does not improve result

Classification results

Features                      #attributes   AUC
Link only                     1             0.818
TF-IDF                        3322          0.813
Subset.Reputation             3             0.841
Subset.SVM (“greedy”)         15            0.847
All quality features          29            0.802
All quality ftr’s + TF-IDF    3351          0.837

[Figure: storage cost and training time for each feature set]

Optimal feature set achieves reduced training time and storage cost

Ranking results

Features                  #attributes   NDCG@1   NDCG@2   NDCG@5   NDCG@10   MAP
Link only                 1             0.067    0.111    0.22     0.324     0.398
Subset.Reputation         3             0.822    0.777    0.777    0.764     0.661
Subset.SVM (“greedy”)     15            0.867    0.767    0.778    0.769     0.653
All quality features      29            0.733    0.733    0.763    0.753     0.637

Optimal feature set (15 attrs.) achieves the best result
Link-based “reputation” features (3 attrs.) achieve the 2nd best result
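
For reference, NDCG@N and MAP can be computed as in the Python sketch below. Several NDCG variants exist; this one assumes the exponential gain 2^rel − 1 with a log2 position discount, and the average precision here normalizes by the number of relevant items present in the ranked list. Which exact variants were used for the table is not stated, so treat these definitions as assumptions.

```python
import math


def dcg_at(rels, n):
    """Discounted cumulative gain over the top-n graded relevance values."""
    return sum((2 ** r - 1) / math.log2(i + 2) for i, r in enumerate(rels[:n]))


def ndcg_at(rels, n):
    """DCG normalized by the DCG of the ideal (descending) ordering."""
    ideal = dcg_at(sorted(rels, reverse=True), n)
    return dcg_at(rels, n) / ideal if ideal > 0 else 0.0


def average_precision(relevant_flags):
    """Mean of precision values at the rank of each relevant item."""
    hits, total = 0, 0.0
    for rank, is_relevant in enumerate(relevant_flags, start=1):
        if is_relevant:
            hits += 1
            total += hits / rank
    return total / hits if hits else 0.0


def mean_average_precision(runs):
    """MAP: average precision averaged over queries."""
    return sum(average_precision(r) for r in runs) / len(runs)


perfect = ndcg_at([5, 4, 3, 1, 1], 5)
```

A ranking that already lists tweets in descending quality order, such as 5, 4, 3, 1, 1, receives the maximum NDCG of 1.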

Conclusions


Summary

Method for quality-based classification and ranking of tweets

Proposed and evaluated a set of tweet features that capture tweet quality

Link-based features lead to the best performance


Future work

Consider different types of queries in Twitter
  E.g. searching for hot topics, movie reviews, facts, opinions, etc.
  Different features may be important in different scenarios
Incorporating recent hot topics
Personalized re-ranking


Q / A

Thank You

Related work

Spam detection
  Bag-of-words, keyword-based
  Feature-based approaches
  Combinations
Social networks
  Finding quality answers in Q-A systems
    E.g. Yahoo Answers
    Feature-based
Web search
  Quality-based ranking of web documents
    Feature-based quality score (WSDM’11)

ROC Curve

Area under the ROC curve: probability that a classifier will rank a randomly chosen positive instance higher than a randomly chosen negative one
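
This probabilistic reading of AUC can be computed directly by comparing every positive instance with every negative one. The Python sketch below does exactly that; counting ties as half a win is a common convention and an assumption here, and the quadratic pairwise loop is meant for illustration rather than large datasets.

```python
def auc(scores, labels):
    """AUC as the fraction of (positive, negative) pairs ranked correctly.

    scores: classifier scores, higher means "more likely positive".
    labels: 1 for positive (high-quality), anything else for negative.
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y != 1]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative instance")
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos
        for n in neg
    )
    return wins / (len(pos) * len(neg))


value = auc([0.9, 0.8, 0.3, 0.1], [1, 0, 1, 0])
```

In the example, three of the four positive-negative pairs are ordered correctly, so the AUC is 0.75; a classifier that scores at random would be expected to reach about 0.5.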
