Spatio-temporal demographic classification of the Twitter users
Post on 06-Jul-2015
Paul Longley, Muhammad Adnan, Guy Lansley
Department of Geography, University College London
Web: http://www.uncertaintyofidentity.com
Outline
1. Introduction
• Geodemographics
• Social Media Geodemographics
2. Twitter
3. A geo-temporal demographic classification of Twitter users
• Residence of Twitter users
• Ethnic classification of Twitter users
• Age classification of Twitter users
• Computing the demographic classification
Introduction
• Geodemographics
• “Analysis of people by where they live” [1]
• Night time characteristics of the population
• Social Media Geodemographics
• Moving beyond the night time geography
• Who: Ethnicity, Gender, and Age of social media users
• When: What time of day conversations happen
• Where: Where social media conversations happen
[1] Sleight, P. (2004). Targeting Customers: How to Use Geodemographic and Lifestyle Data in Your Business.
Twitter (www.twitter.com)
• Online social-networking and micro-blogging service
• Launched in 2006
• Users can send messages of 140 characters or less
• Approximately 200 million active users [2]
• 350 million tweets daily
• In 2013, the UK and London ranked 4th and 3rd, respectively, among countries and cities by number of posted tweets [3]
[2] Twitter. 2012. What is Twitter? Retrieved 31st December, 2012, from https://business.twitter.com/basics/what-is-twitter/.
[3] Bennet, S. 2013. Revealed: The Top 20 Countries and Cities of Twitter [STATS]. Retrieved 31st December, 2013, from http://www.mediabistro.com/alltwitter/twitter-top-countries_b26726.
Data available through the Twitter API
• User Creation Date
• Followers
• Friends
• User ID
• Language
• Location
• Name
• Screen Name
• Time Zone
• Geo Enabled
• Latitude
• Longitude
• Tweet date and time
• Tweet text
Twitter data for the case study
• Approx. 8 million geo-tagged tweets (Jan – Dec, 2013)
• Sent by 385,050 unique users
• 155,249 users sent 5 or more tweets (7.6 million tweets)
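The activity threshold above (users with 5 or more tweets) can be sketched as a simple count-and-filter step. This is an illustrative sketch only: the function name `active_users` and the toy user IDs are hypothetical, and the slides do not describe how the filtering was actually implemented.

```python
from collections import Counter

def active_users(tweet_user_ids, min_tweets=5):
    """Return the set of user IDs that sent at least `min_tweets` tweets."""
    counts = Counter(tweet_user_ids)
    return {user for user, n in counts.items() if n >= min_tweets}

# Toy input: one user ID per geo-tagged tweet (hypothetical IDs).
tweets = ["u1"] * 6 + ["u2"] * 2 + ["u3"] * 5
kept = active_users(tweets)
```

Only tweets from the retained users would then be carried forward into the classification.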
Variables for creating a geo-temporal classification
1. Residence
• Where Twitter users live
2. Ethnicity
• Probable ethnic origins of Twitter users
3. Age
• Probable age of Twitter users
4. Land Use Category of a Tweet message
• Residential; Non-domestic building; Park etc.
5. Temporal Scales
• Day, Afternoon, Night, Peak travel hours
Residence of Twitter Users
• A 170 m × 170 m grid was used to find the probable residence of users
• A probable residence was found for 75,522 users
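One way to picture the grid step is to snap each tweet to a 170 m cell and take, per user, the cell with the most tweets. The sketch below makes two assumptions that are not stated on the slides: coordinates are already projected to metres (for example British National Grid eastings and northings), and the "most frequent cell" rule stands in for whatever criterion the authors actually used. The function names are hypothetical.

```python
from collections import Counter, defaultdict

CELL_SIZE_M = 170  # grid resolution stated on the slide

def grid_cell(easting, northing, cell_size=CELL_SIZE_M):
    """Map a projected coordinate (metres) to an integer grid-cell index."""
    return (int(easting // cell_size), int(northing // cell_size))

def probable_residence(tweets):
    """tweets: iterable of (user_id, easting, northing) tuples.

    Returns {user_id: cell} using the user's most frequently tweeted cell.
    ASSUMPTION: the modal cell is used as a proxy for residence.
    """
    per_user = defaultdict(Counter)
    for user, easting, northing in tweets:
        per_user[user][grid_cell(easting, northing)] += 1
    return {user: cells.most_common(1)[0][0] for user, cells in per_user.items()}

# Toy coordinates in metres (made up for illustration).
sample = [
    ("u1", 10, 10), ("u1", 100, 160), ("u1", 169, 0),  # all fall in cell (0, 0)
    ("u1", 171, 10),                                    # falls in cell (1, 0)
    ("u2", 400, 400),
]
homes = probable_residence(sample)
```

A stricter rule (for example, requiring a minimum share of a user's tweets in the modal cell) would leave some users without an assigned residence.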
Extracting demographic attributes of Twitter users by using their forenames and surnames
A name is a statement of the bearer’s cultural, ethnic, and linguistic identity [4]
[4] Mateos P, Longley P A, O’Sullivan D 2011. Ethnicity and population structure in personal naming networks. PloS ONE (Public Library of Science) 6 (9) e22943.
Analysing Names on Twitter
• Some examples of NAME variations on Twitter
• Approx. 68% of the accounts have real names
Fake Names
Castor 5.
WHAT IS LOVE?
MysticMind
KIRILL_aka_KID
Vanessa
Justin Bieber Home
Real Names
Kevin Hodge
Andre Alves
Jose de Franco
Carolina Thomas, Dr.
Prof. Martha Del Val
Fabíola Sanchez Fernandes
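A rough first pass at separating usable forename and surname pairs from other display names could look like the sketch below. The title list, the regular expression, and the function `split_name` are illustrative assumptions, not the procedure used in the study.

```python
import re
import unicodedata

# ASSUMPTION: a small, hand-picked list of honorifics to discard.
TITLES = {"dr", "prof", "mr", "mrs", "ms", "miss"}
# Letters only, optionally joined by an apostrophe or hyphen (e.g. O'Neil).
NAME_TOKEN = re.compile(r"^[^\W\d_]+(?:['-][^\W\d_]+)*$")

def split_name(display_name):
    """Return (forename, surname), or None if the string does not look like
    a forename-surname pair under these simple rules."""
    text = unicodedata.normalize("NFC", display_name)
    tokens = [t.rstrip(".") for t in re.split(r"[\s,]+", text.strip()) if t]
    tokens = [t for t in tokens if t and t.lower() not in TITLES]
    if len(tokens) < 2 or not all(NAME_TOKEN.match(t) for t in tokens):
        return None
    return tokens[0], " ".join(tokens[1:])

examples = ["Kevin Hodge", "Carolina Thomas, Dr.", "Prof. Martha Del Val",
            "MysticMind", "Castor 5.", "WHAT IS LOVE?"]
parsed = {name: split_name(name) for name in examples}
```

Purely syntactic rules like these still accept strings such as "Justin Bieber Home", so a lookup against a reference list of known forenames and surnames would be needed to reject them.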
Onomap: Names to Ethnicity classification
• Onomap was created by clustering names of 1 billion individuals around the world
• Onomap (www.onomap.org) was applied to forename – surname pairs
Kevin Hodge (English)
Pablo Mateos (Spanish)
…
…
…
…
Age estimation from ‘forenames’
• Monica dataset provided by CACI Ltd, UK
• Supplemented with UK birth certificate records
[5] Longley, P., Adnan, M., Lansley, G. 2013. “The geo-temporal demographics of Twitter usage”. Environment and Planning A. (In Press)
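The idea of inferring age from a forename can be illustrated with a lookup of births per year for that name, collapsed into the age bands used later in the classification. Everything concrete here is an assumption: the counts are made-up toy numbers (not Monica or birth-certificate data), the reference year is taken as 2013 to match the tweet collection period, and choosing the band with the largest count is only one plausible rule.

```python
def age_band(age):
    """Map an age in years to the bands used by variables V14 to V18."""
    if age <= 20:
        return "<=20"
    if age <= 30:
        return "21-30"
    if age <= 40:
        return "31-40"
    if age <= 50:
        return "41-50"
    return "50+"

def probable_age_band(births_by_year, reference_year=2013):
    """births_by_year: {birth_year: number of people given this forename}.

    Returns the age band that contains the most bearers of the name.
    """
    totals = {}
    for year, count in births_by_year.items():
        band = age_band(reference_year - year)
        totals[band] = totals.get(band, 0) + count
    return max(totals, key=totals.get)

# Made-up counts for a single hypothetical forename.
toy_counts = {1950: 10, 1985: 40, 1990: 55, 2000: 5}
band = probable_age_band(toy_counts)
```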
Age distribution of Twitter users
Twitter Users vs. 2011 Census (Greater London)
Land-use Categories
• Every tweet message was assigned a land-use category
Variables for creating a geo-temporal classification
1. Residence
V1: Tweet made near probable London residence
V2: Tweeter lives ‘outside the UK’
V3: Tweeter lives in the rest of the UK outside London
2. Total Number of Tweets
V4: Total number of tweets made by the user
3. Ethnicity
V5: West European
V6: East European
V7: Greek or Turkish
V8: South East Asian
V9: Other Asian
V10: African & Caribbean
V11: Jewish
V12: Chinese
V13: Other minority
4. Age
V14: <=20
V15: 21 - 30
V16: 31 - 40
V17: 41 - 50
V18: 50+
5. Tweets outside the UK
V19: In West Europe (not including UK)
V20: In East Europe
V21: In North America
V22: In Central or South America
V23: In Australasia
V24: In Africa
V25: In Middle East
V26: In Asia
V27: In Paris
Variables for creating a geo-temporal classification
6. Number of countries visited
V28: Number of countries tweeter has visited
7. London Land Use Category
V29: Residential location
V30: Non-domestic buildings
V31: Transport links and locations
V32: Green-spaces
V33: All other land uses
8. 2011 London Output Area Classification
V34: Intermediate Lifestyles
V35: High Density and High Rise Flats
V36: Settled Asians
V37: Urban Elites
V38: City Vibe
V39: London Life-Cycle
V40: Multi-Ethnic Suburbs
V41: Ageing-City Fringe
9. Temporal Scales
V42: Morning Peak Hours
V43: Week Day
V44: Afternoon
V45: Week Night
V46: Weekend
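Assigning a tweet to one of the temporal scales (V42 to V46) amounts to bucketing its timestamp. The slides do not give the hour boundaries, nor say whether the bands are mutually exclusive, so the cut-offs below are purely illustrative assumptions, as is the function name `temporal_band`.

```python
from datetime import datetime

def temporal_band(ts):
    """Bucket a tweet timestamp into one of the five temporal scales.

    ASSUMPTION: the hour boundaries are invented for illustration and the
    bands are treated as mutually exclusive.
    """
    if ts.weekday() >= 5:          # Saturday or Sunday
        return "Weekend"
    hour = ts.hour
    if 7 <= hour < 10:
        return "Morning Peak Hours"
    if 10 <= hour < 13:
        return "Week Day"
    if 13 <= hour < 18:
        return "Afternoon"
    return "Week Night"

bands = [
    temporal_band(datetime(2013, 7, 6, 12, 0)),   # a Saturday
    temporal_band(datetime(2013, 7, 8, 8, 30)),   # a Monday morning
    temporal_band(datetime(2013, 7, 8, 23, 0)),   # a Monday night
]
```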
Computing the geo-temporal classifications
• Segmentations were created using the K-means clustering algorithm
• K-means tries to find cluster centroids by minimising

$$V = \sum_{x=1}^{n} \sum_{y=1}^{n} (z_{y} - \mu_{x})^{2}$$

• Seven clusters
• Group A: London Residents
• Group B: Commuting Professionals
• Group C: Student Lifestyle
• Group D: The Daily Grind
• Group E: Spectators
• Group F: Visitors
• Group G: Workplace and tourist activity
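The minimisation that K-means performs can be sketched with Lloyd's algorithm, the standard iterative procedure: assign each point to its nearest centroid, recompute each centroid as the mean of its members, and repeat until assignments stop changing. This is a minimal standard-library sketch on toy two-dimensional points, not the authors' implementation; the real input would be one 46-variable profile per user with k = 7, and the slides do not say how the variables were scaled or how the centroids were initialised.

```python
import random

def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((p - q) ** 2 for p, q in zip(a, b))

def kmeans(points, k, iterations=100, seed=0):
    """Lloyd's algorithm. Returns (labels, centroids, within-cluster SS)."""
    rng = random.Random(seed)
    # ASSUMPTION: centroids start at k randomly sampled points.
    centroids = [list(p) for p in rng.sample(points, k)]
    labels = None
    for _ in range(iterations):
        # Assignment step: nearest centroid for every point.
        new_labels = [min(range(k), key=lambda c: sq_dist(p, centroids[c]))
                      for p in points]
        if new_labels == labels:   # converged: assignments unchanged
            break
        labels = new_labels
        # Update step: each centroid becomes the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    wcss = sum(sq_dist(p, centroids[lab]) for p, lab in zip(points, labels))
    return labels, centroids, wcss

# Two well-separated toy groups.
toy = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
labels, centroids, wcss = kmeans(toy, k=2)
```

Because the result depends on the starting centroids, K-means is usually run several times with different seeds and the solution with the lowest within-cluster sum of squares is kept.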
Group A: London Residents
• Tweets made near primary residential locations
• Tweets made on weeknights or weekends
Group B: Commuting Professionals
• Tweets made from
• Transport locations
• ‘Urban Elites’ LOAC classification
• Tweets made by individuals of intermediate age (21-30)
Group F: Visitors
• Tweeters live outside London
• Tweets originated from residential land uses
• Mixed age groups
Group G: Workplace and tourist activity
• Tweets sent from non-domestic buildings
• Full range of Twitter age cohorts
• Tweets originate from a mix of residents and international visitors
Conclusion
• Geo-temporal demographic classifications
• Census (night time geography)
• Social media data (day and travel time geography)
• Issues of representation
• An insight into the residential and travel geographies of individuals
• An insight into the spatial activity patterns of different kinds of social media users
Any questions?
Thank you for Listening