You Are Where You Tweet: A Content-Based Approach to Geo-locating Twitter Users

CIKM’10
Zhiyuan Cheng, James Caverlee and Kyumin Lee

Texas A&M University

Presented by Yi Zhu (CUHK)

November 16, 2010


Outline

1 Introduction
    Motivation
    Problem

2 Experimental Setting
    Dataset
    Evaluation Metrics

3 Method
    Baseline
    Identifying Local Words
    Tweet Sparsity

4 Experimental Results

5 Conclusion


Examples

Mongkok

Hong Kong

Objective: locate a Twitter user based on the content of their tweets.


Motivation

Motivation

Location sparsity problem of Twitter
  26% of users have listed a user location as granular as a city name.
  Twitter has supported per-tweet geo-tagging since August 2009. However, fewer than 0.42% of tweets are tagged.


Motivation

Motivation

Personalized information services
  Local news
  Regional advertisements
  Location-based applications (earthquake detection)

Avoid the need for sensitive data (private user information, IP address)


Problem

Challenges

Tweet status updates are noisy, mixing a variety of daily interests.

Twitter users often rely on shorthand and non-standard vocabulary for informal communication.

A user's interests may span multiple locations beyond their immediate home location.

A user may have more than one associated location.


Problem

Problem Defined

Given the tweets of Twitter users, our goal is to estimate the city-level location of a user based purely on the content of their tweets.


Problem

Problem Defined

Formally, the location estimation problem is defined as follows:

Given a set of tweets $S_{tweets}(u)$ posted by user $u$, estimate the user's probability of being located in city $i$, $p(i \mid S_{tweets}(u))$, such that the city with maximum probability, $l_{est}(u)$, is the user's actual location $l_{act}(u)$.


Dataset

Data Crawling

API: twitter4j (open-source library for Java).

Two crawling strategies:
  Crawling through Twitter's public timeline API. (Active Twitter Users)
  Crawling by breadth-first search through social edges to collect each user's friends. (Sub Social Graph of Twitter)


Dataset

Dataset Description

From Sep 2009 to Jan 2010

Users: 1,074,375

Tweets: 29,479,600

75.05% of users list a location, but it is often overly general (California) or nonsensical (Wonderland).

21% of users list a location as granular as a city name.

5% of users list latitude/longitude coordinates.


Dataset

Dataset Filter

Keep only the users whose listed location has a valid city-level label.

Users: 130,689

Tweets: 4,124,960

Test Set:
  Extract users with 1000+ tweets and latitude/longitude coordinates (generated by smartphone).
  Users: 5,190
  Tweets: more than 5 million


Evaluation Metrics

Evaluation Metrics

Error Distance for user $u$:

$$ErrDist(u) = d(l_{act}(u), l_{est}(u))$$

Average Error Distance for all users $U$:

$$AvgErrDist(U) = \frac{\sum_{u \in U} ErrDist(u)}{|U|}$$

Accuracy:

$$Accuracy(U) = \frac{|\{u \mid u \in U \wedge ErrDist(u) \le 100\}|}{|U|}$$
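The three metrics can be sketched in a few lines of Python. This is only an illustration: the slides do not say how the distance $d$ is computed, so the great-circle (haversine) distance in miles, the Earth-radius constant, and all function names below are assumptions.

```python
import math

EARTH_RADIUS_MILES = 3958.8  # assumed mean Earth radius, in miles


def haversine_miles(a, b):
    """Great-circle distance in miles between two (lat, lon) points in degrees."""
    lat1, lon1, lat2, lon2 = map(math.radians, (a[0], a[1], b[0], b[1]))
    h = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * EARTH_RADIUS_MILES * math.asin(math.sqrt(h))


def err_dist(actual, estimated):
    """ErrDist(u): distance between the actual and the estimated location."""
    return haversine_miles(actual, estimated)


def avg_err_dist(pairs):
    """AvgErrDist(U): mean error distance over (actual, estimated) pairs."""
    return sum(err_dist(a, e) for a, e in pairs) / len(pairs)


def accuracy(pairs, threshold=100.0):
    """Accuracy(U): fraction of users whose error distance is <= threshold."""
    hits = sum(1 for a, e in pairs if err_dist(a, e) <= threshold)
    return hits / len(pairs)
```

Each element of `pairs` is an `(actual, estimated)` tuple of `(lat, lon)` coordinates for one user.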


Baseline

Baseline Location Estimation

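$$p(i \mid S_{words}(u)) = \sum_{w \in S_{words}(u)} p(i \mid w) \times p(w)$$

$S_{words}(u)$ is the set of words extracted from the tweets of user $u$.

$p(w)$ is the probability of the word $w$ in the whole dataset, $p(w) = \frac{count(w)}{t}$, where $t$ is the total number of tokens in the dataset.

$p(i \mid w)$ is the likelihood that word $w$ is issued by a user located in city $i$.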
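A minimal sketch of the baseline estimator follows. It assumes that $p(i \mid w)$ is estimated as the unsmoothed fraction of occurrences of $w$ that come from city $i$; the slides do not spell out this estimate, and the function names and toy data layout are hypothetical.

```python
from collections import Counter, defaultdict


def train(city_tokens):
    """city_tokens maps a city name to the list of tokens tweeted from it."""
    word_city = defaultdict(Counter)  # word -> per-city counts, count(w, i)
    word_total = Counter()            # word -> count(w) over the whole dataset
    for city, tokens in city_tokens.items():
        for w in tokens:
            word_city[w][city] += 1
            word_total[w] += 1
    t = sum(word_total.values())      # total number of tokens
    return word_city, word_total, t


def estimate_city(user_words, word_city, word_total, t):
    """Return (best city, scores), with scores[i] = sum over w of p(i|w) * p(w)."""
    scores = Counter()
    for w in set(user_words):         # S_words(u) is a set of words
        if w not in word_total:
            continue                  # unseen words contribute nothing
        p_w = word_total[w] / t
        for city, c in word_city[w].items():
            # Assumed estimate: p(i|w) = count(w, i) / count(w)
            scores[city] += (c / word_total[w]) * p_w
    best = scores.most_common(1)[0][0] if scores else None
    return best, dict(scores)
```

The estimated location $l_{est}(u)$ is the city with the highest score; a user whose words were never seen in training gets `None`.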


Baseline

Baseline Location Estimation Result

Accuracy: 10.12%

AvgErrDist: 1773 miles

Problems:
  Local Words: isolate the words that can distinguish the location of the user.
  Tweet Sparsity: location sparsity of words in tweets.


Identifying Local Words

Spatial variation model

Given a word, decide if it is local or non-local.

Spatial variation model (Backstrom et al., WWW’08)
  Analysis of the geographic distribution of terms in search engine query logs.
  $C d^{-\alpha}$ is the approximate probability that the query is issued from a place at distance $d$ from the center.
  $C$ is a constant that specifies the frequency at the center.
  $\alpha$ controls how fast the frequency falls off with distance.


Identifying Local Words

Identifying Local Words in Tweets

$C$ and $\alpha$ can be used to determine whether the word is local.

For a word $w$, given a center with central frequency $C$, compute the maximum-likelihood value.

For each city $i$, users from $i$ tweet word $w$ $n$ times:
  If $n > 0$, multiply the overall probability by $(C d_i^{-\alpha})^n$.
  If $n = 0$, multiply the overall probability by $1 - C d_i^{-\alpha}$.
  $d_i$ is the distance between city $i$ and the center of word $w$.


Identifying Local Words

Identifying Local Words in Tweets

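To avoid underflow, sum logarithms instead of multiplying probabilities.

Suppose $S$ is the set of occurrences of word $w$; then:

$$f(C, \alpha) = \sum_{i \in S} \log C d_i^{-\alpha} + \sum_{i \notin S} \log\left(1 - C d_i^{-\alpha}\right)$$

$f(C, \alpha)$ has exactly one local maximum (unimodal)
  Lattices
  Golden section search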
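The log-likelihood and a golden section search can be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation: it searches over $\alpha$ only, for one fixed center and a fixed $C$ (a full fit would also vary $C$ and the candidate center), it assumes $0 < C < 1$ and all distances at least 1 so that every probability stays inside $(0, 1)$, and the function names and toy distances are hypothetical.

```python
import math

INV_PHI = (math.sqrt(5) - 1) / 2  # 1 / golden ratio, about 0.618


def log_likelihood(C, alpha, d_present, d_absent):
    """f(C, alpha) for one candidate center.

    d_present: one distance per occurrence of the word.
    d_absent: one distance per city where the word never occurs.
    """
    present = sum(math.log(C * d ** (-alpha)) for d in d_present)
    absent = sum(math.log(1.0 - C * d ** (-alpha)) for d in d_absent)
    return present + absent


def golden_section_max(f, lo, hi, tol=1e-6):
    """Maximize a unimodal one-dimensional function f on [lo, hi]."""
    a, b = lo, hi
    c = b - INV_PHI * (b - a)
    d = a + INV_PHI * (b - a)
    fc, fd = f(c), f(d)
    while b - a > tol:
        if fc > fd:   # the maximum lies in [a, d]
            b, d, fd = d, c, fc
            c = b - INV_PHI * (b - a)
            fc = f(c)
        else:         # the maximum lies in [c, b]
            a, c, fc = c, d, fd
            d = a + INV_PHI * (b - a)
            fd = f(d)
    return (a + b) / 2


# Toy data: the word occurs near the center and is absent far away.
d_present = [1.0, 2.0, 3.0]
d_absent = [50.0, 100.0, 200.0, 400.0]
C = 0.5
best_alpha = golden_section_max(
    lambda alpha: log_likelihood(C, alpha, d_present, d_absent), 0.01, 5.0)
```

Because each step reuses one of the two previous function evaluations, the search needs only one new evaluation of $f$ per iteration.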


Tweet Sparsity

Laplace Smoothing (Add-One Smoothing)

$$p(i \mid w) = \frac{1 + count(w, i)}{V + N(w)}$$

$count(w, i)$: term count of word $w$ in city $i$;

$V$: the size of the vocabulary;

$N(w)$: total count of $w$ in all cities.
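As a quick illustration of the add-one formula, the sketch below applies it to toy counts. The function and argument names are hypothetical.

```python
def laplace_smoothed(count_w_i, vocab_size, total_w):
    """p(i|w) = (1 + count(w, i)) / (V + N(w))."""
    return (1 + count_w_i) / (vocab_size + total_w)


# A word seen 7 times overall, 3 of them in city i, with a vocabulary of 100 words.
p_seen = laplace_smoothed(3, 100, 7)
# A city where the word was never seen still gets a non-zero probability.
p_unseen = laplace_smoothed(0, 100, 7)
```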


Tweet Sparsity

State-Level Smoothing

State probability:

$$p_s(s \mid w) = \frac{\sum_{i \in S_c} p(i \mid w)}{|S_c|}$$

$S_c$: set of cities in the state $s$.

State-level smoothing:

$$p'(i \mid w) = \lambda \times p(i \mid w) + (1 - \lambda) \times p_s(s \mid w)$$

$i$: a city in the state $s$;
$1 - \lambda$: amount of smoothing.
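The two state-level formulas can be sketched as below for a single word $w$. The dictionary layout (`p_city` mapping a city to its unsmoothed $p(i \mid w)$) and the function names are assumptions made for this example.

```python
def state_probability(p_city, cities_in_state):
    """p_s(s|w): average of p(i|w) over the cities of state s."""
    total = sum(p_city.get(c, 0.0) for c in cities_in_state)
    return total / len(cities_in_state)


def state_smoothed(p_city, city, cities_in_state, lam):
    """p'(i|w) = lam * p(i|w) + (1 - lam) * p_s(s|w)."""
    p_s = state_probability(p_city, cities_in_state)
    return lam * p_city.get(city, 0.0) + (1 - lam) * p_s


# Toy example: the word was never observed in Dallas, but it was in the same state.
p_city = {"houston": 0.6, "austin": 0.2}
texas = ["houston", "austin", "dallas"]
p_dallas = state_smoothed(p_city, "dallas", texas, lam=0.7)
```

A city with no observations of the word borrows probability mass from the other cities in its state.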


Tweet Sparsity

Lattice-Based Neighborhood Smoothing

Per-lattice probability:

$$p(lat \mid w) = \sum_{i \in S_c} p(i \mid w)$$

$lat$: a lattice.
$S_c$: set of cities in $lat$.

Lattice probability:

$$p'(lat \mid w) = \mu \times p(lat \mid w) + (1 - \mu) \times \sum_{lat_i \in S_{neighbors}} p(lat_i \mid w)$$

$\mu$: parameter.
$S_{neighbors}$: the 8 lattices around $lat$.


Tweet Sparsity

Lattice-Based Neighborhood Smoothing

Lattice-based neighborhood smoothing:

$$p'(i \mid w) = \lambda \times p(i \mid w) + (1 - \lambda) \times p'(lat \mid w)$$

$i$: a city in the lattice $lat$;
$\lambda$: smoothing parameter.
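A sketch of lattice-based neighborhood smoothing for a single word follows. It assumes each lattice is identified by an integer `(row, col)` grid cell, so that the 8 neighbors are the surrounding cells; this indexing, the dictionary layout, and the function names are assumptions for illustration.

```python
def lattice_probability(p_city, lattice_of, lat):
    """p(lat|w): sum of p(i|w) over the cities that fall in lattice lat."""
    return sum(p for city, p in p_city.items() if lattice_of[city] == lat)


def neighbors(lat):
    """The 8 grid cells around lat, with lat given as (row, col)."""
    r, c = lat
    return [(r + dr, c + dc)
            for dr in (-1, 0, 1) for dc in (-1, 0, 1)
            if (dr, dc) != (0, 0)]


def smoothed_lattice_probability(p_city, lattice_of, lat, mu):
    """p'(lat|w) = mu * p(lat|w) + (1 - mu) * sum of p(lat_i|w) over neighbors."""
    own = lattice_probability(p_city, lattice_of, lat)
    around = sum(lattice_probability(p_city, lattice_of, n) for n in neighbors(lat))
    return mu * own + (1 - mu) * around


def lattice_smoothed(p_city, lattice_of, city, lam, mu):
    """p'(i|w) = lam * p(i|w) + (1 - lam) * p'(lat|w) for the lattice of city i."""
    lat = lattice_of[city]
    p_lat = smoothed_lattice_probability(p_city, lattice_of, lat, mu)
    return lam * p_city.get(city, 0.0) + (1 - lam) * p_lat


# Toy example: city "d" has no observations but shares a lattice with "a".
p_city = {"a": 0.4, "b": 0.2, "c": 0.1}
lattice_of = {"a": (0, 0), "b": (0, 1), "c": (5, 5), "d": (0, 0)}
p_d = lattice_smoothed(p_city, lattice_of, "d", lam=0.8, mu=0.5)
```

Unlike state-level smoothing, the neighborhood here follows geographic proximity rather than administrative borders.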


Tweet Sparsity

Model-Based Smoothing

$$p'(i \mid w) = C(w) \, d_i^{-\alpha(w)}$$

$C(w)$, $\alpha(w)$: optimized parameters for word $w$.
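Model-based smoothing replaces the observed counts with the fitted spatial model, as the sketch below shows. The parameter table, its toy values, and the clamp of distances below 1 (so the probability never exceeds $C(w)$) are assumptions for illustration.

```python
def model_based_probability(params, word, d_i):
    """p'(i|w) = C(w) * d_i ** (-alpha(w)), with d_i clamped to at least 1."""
    C, alpha = params[word]
    return C * max(d_i, 1.0) ** (-alpha)


# Hypothetical fitted parameters (C(w), alpha(w)) for one local word.
params = {"rockets": (0.5, 1.2)}
p_center = model_based_probability(params, "rockets", 1.0)
p_far = model_based_probability(params, "rockets", 500.0)
```

Every city receives a probability for every modeled word, whether or not the word was observed there.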


Tweet Sparsity

Smoothing Comparison

Method          Geographic Range     Parameters    Complexity
Laplace         None                 None          Low
State-Level     State                λ             High
Neighborhood    Neighbor Lattices    µ, λ          Highest
Model-Based     Global               None          Lowest


Model and Smoothing Comparison


Capacity of Estimator


Number of Tweets


Conclusion

A probabilistic framework for estimating the city-level location of Twitter users based on the content of their tweets.

Identifying local words and applying smoothing can improve the estimation.

100 tweets are enough to locate a user.


Thanks!

Q & A
