Muhammad Adnan Department of Geography, University College London Web: http://www.uncertaintyofidentity.com Twitter: @gisandtech Open Data: Analysis and Visualisation

Open Data: Analysis and Visualisation

Jan 12, 2015



Muhammad Adnan

This presentation gives an overview of open data. It presents several case studies on the spatio-temporal analysis and visualisation of social media data (Twitter), and explains how to create a heat map visualisation in R.
Transcript
Page 1: Open Data: Analysis and Visualisation

Muhammad Adnan

Department of Geography, University College London

Web: http://www.uncertaintyofidentity.com

Twitter: @gisandtech

Open Data: Analysis and Visualisation

Page 2: Open Data: Analysis and Visualisation

Dr. Muhammad Adnan

• Research Associate
– Working on an EPSRC funded project “Uncertainty of Identity”
– http://www.uncertaintyofidentity.com

Research Interests

• Data Mining
• Social Media Analysis
• Data Visualisation

Page 3: Open Data: Analysis and Visualisation

Outline

• Open Data
• Crowd-Sourced Data (Social Media)
• Analysis and Visualisation Challenges
• Twitter Case Study
– Spatial Analysis
– Temporal Analysis
• R
– A brief introduction
– How to create heat maps

Page 4: Open Data: Analysis and Visualisation

Open data

Data that is:

• Open and free to the public
• Complete
• Accessible
• Timely
• Machine processable
• Non-discriminatory

Page 5: Open Data: Analysis and Visualisation

Dataset examples

• National budgets
• Car registries
• National roads
• Water heights
• Schools
• Weather
• Public transport
• Council tax bands
• And many more

Page 6: Open Data: Analysis and Visualisation
Page 7: Open Data: Analysis and Visualisation
Page 8: Open Data: Analysis and Visualisation
Page 9: Open Data: Analysis and Visualisation
Page 10: Open Data: Analysis and Visualisation

Census Profiler
• http://www.censusprofiler.org/
• Users can visualise 2001 Census data

Page 11: Open Data: Analysis and Visualisation

Education Profiler
• http://www.educationprofiler.org/
• Users can visualise education datasets

Page 12: Open Data: Analysis and Visualisation

Open Data Profiler
• http://www.opendataprofiler.com/
• Users can visualise 60 different 2011 Census datasets

Page 13: Open Data: Analysis and Visualisation

Crowd-sourced datasets

• Twitter
– The public streaming API can be used to download live tweets
• Foursquare
– Has an API that can be used to access Foursquare data
• Facebook
– Facebook applications can access user information
• Flickr
• Wikipedia
• YouTube

Page 14: Open Data: Analysis and Visualisation

How big are crowd-sourced datasets?

• Facebook
– Number of active users: 850 million
– Average daily uploaded photos: 360 million
– Total data size: 30+ petabytes
• Twitter
– Number of active users: 200 million
– Daily tweets (posts): 350 million
• Foursquare
– Number of active users: 15 million
– Total check-ins: 1.5 billion

Page 15: Open Data: Analysis and Visualisation

What are the issues with these datasets?

• How representative are social media datasets of Census or Electoral Roll data?
• Who: ethnicity, gender, and age of social media users
• Where: where social media conversations are happening and who is leading them
– Intelligence about where people are located and what they are doing
• When: what time of day conversations happen

Page 16: Open Data: Analysis and Visualisation

Twitter (www.twitter.com)

• Online social-networking and micro-blogging service

• Launched in 2006

• Users can send messages of 140 characters or less

• Approximately 200 million active users

• 350 million tweets daily

• In 2012, UK and London were ranked 4th and 3rd, respectively, in terms of the number of posted tweets

Page 17: Open Data: Analysis and Visualisation

Basic Analysis of the Twitter data

Page 18: Open Data: Analysis and Visualisation

Data available through the Twitter API

• User Creation Date
• Followers
• Friends
• User ID
• Language
• Location
• Name
• Screen Name
• Time Zone
• Geo Enabled
• Latitude
• Longitude
• Tweet date and time
• Tweet text

Users can download a 1% sample of the live tweets through the API

Page 19: Open Data: Analysis and Visualisation

Created with approx. 100 million tweets

Page 20: Open Data: Analysis and Visualisation
Page 21: Open Data: Analysis and Visualisation

4 million geo-tagged tweets downloaded during August and December, 2012

Page 22: Open Data: Analysis and Visualisation

4 million geo-tagged tweets downloaded during August and December, 2012

Page 23: Open Data: Analysis and Visualisation

Hourly and Daily Twitter Activity in London

Page 24: Open Data: Analysis and Visualisation

Hourly Twitter Activity in London

Page 25: Open Data: Analysis and Visualisation

Daily Twitter Activity in London

[Figure: four panels (Monday, Tuesday, Wednesday, Thursday) showing tweet counts by hour; horizontal axis: Hour (1 to 24); vertical axis: 0 to 12,000]

Page 26: Open Data: Analysis and Visualisation

Daily Twitter Activity in London

[Figure: three panels (Friday, Saturday, Sunday) showing tweet counts by hour; horizontal axis: Hour (1 to 24); vertical axis: 0 to 12,000]

Page 27: Open Data: Analysis and Visualisation

Analysis of User Names on Twitter

• A name is a statement of a person’s ethnic, linguistic, and cultural identity.
• E.g. Alex Singleton is an Anglo-Saxon name. Similarly, Pablo Mateos is a Spanish (Hispanic) name.

Page 28: Open Data: Analysis and Visualisation

Analysing Names on Twitter

• Some examples of NAME variations on Twitter

Real Names

Kevin Hodge

Andre Alves

Jose de Franco

Carolina Thomas, Dr.

Prof. Martha Del Val

Fabíola Sanchez Fernandes

Fake Names

JustinBieber_Home.

WHAT IS LOVE?

MysticMind

KIRILL_aka_KID

Vanessa

Petuna

Page 29: Open Data: Analysis and Visualisation

Analysing Names on Twitter

• Some examples of NAME variations on Twitter

Real Names

Kevin Hodge -> F: ‘Kevin’ ; S: ‘Hodge’

Andre Alves -> F: ‘Andre’ ; S: ‘Alves’

Jose De Franco -> F: ‘Jose’ ; S: ‘De Franco’

Carolina Thomas, Dr. -> F: ‘Carolina’ ; S: ‘Thomas’

Prof. Martha Del Val -> F: ‘Martha’ ; S: ‘Del Val’

Fabíola Sanchez Fernandes -> F: ‘Fabíola’ ; S: ‘Fernandes’
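
The forename (F) and surname (S) splits listed above could be produced along the following lines. This is only a sketch: the function name `split_name`, the title list, and the particle list are assumptions made for illustration, not the rules used in the study. It strips titles, takes the first remaining token as the forename and the last as the surname, and keeps a particle such as "De" or "Del" attached to the surname.

```r
# Hypothetical helper: split a Twitter display name into forename and surname.
titles    <- c("dr", "prof", "mr", "mrs", "ms")   # assumed list of titles
particles <- c("de", "del", "da", "van", "von")   # assumed surname particles

split_name <- function(name) {
  # Replace commas and full stops with spaces, then split on whitespace
  tokens <- strsplit(gsub("[,.]", " ", name), "\\s+")[[1]]
  tokens <- tokens[tokens != ""]
  # Drop titles such as "Dr." or "Prof."
  tokens <- tokens[!(tolower(tokens) %in% titles)]
  n <- length(tokens)
  # A single token (e.g. "Vanessa") cannot be split into a pair
  if (n < 2) return(c(forename = NA_character_, surname = NA_character_))
  surname <- tokens[n]
  # Keep a particle attached to the surname, e.g. "Del Val"
  if (n > 2 && tolower(tokens[n - 1]) %in% particles) {
    surname <- paste(tokens[n - 1], surname)
  }
  c(forename = tokens[1], surname = surname)
}

split_name("Prof. Martha Del Val")
split_name("Carolina Thomas, Dr.")
```

Note that this sketch does not detect fake names: a string such as "WHAT IS LOVE?" would still be split into a pair, so a separate filtering step would be needed.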

Page 30: Open Data: Analysis and Visualisation

Where they tweet from:

Surname: JONES

Page 31: Open Data: Analysis and Visualisation

Where they tweet from:

Surname: DEE

Page 32: Open Data: Analysis and Visualisation

Where they tweet from:

Surname: SHAH

Page 33: Open Data: Analysis and Visualisation

Predicting Ethnicity of Twitter Users by using their ‘Names’

• A name is a statement of a person’s ethnic, linguistic, and cultural identity.
• E.g. Alex Singleton is an Anglo-Saxon name. Similarly, Pablo Mateos is a Spanish (Hispanic) name.

Page 34: Open Data: Analysis and Visualisation

Classifying Twitter Data to ethnic origins

• Applied ONOMAP (www.onomap.org) on FORENAME + SURNAME pairs

Kevin Hodge (English)

Pablo Mateos (Spanish)

Page 35: Open Data: Analysis and Visualisation

Top 10 Ethnic Groups of Twitter Users

Page 36: Open Data: Analysis and Visualisation

Tweeting Activity by different Ethnic Groups

[Figure: one map panel per ethnic group: English, Italian, Pakistani, Indian, Turkish, Greek, Bangladeshi, Spanish, German, French, Portuguese, Sikh]

Page 37: Open Data: Analysis and Visualisation

Comparison of Ethnic Groups between ‘2011 Census’ and ‘Twitter’

• Onomap groups were aggregated to match the appropriate groups from the Census

              London Total   White British   White Other   Indian   Pakistani   Bangladeshi   Black African   Chinese
Week Night    53611          71.35%          12.12%        2.63%    2.63%       1.82%         1.52%           1.74%
Week Day      80676          73.12%          11.80%        2.41%    2.41%       1.56%         1.25%           1.61%
Weekend       67351          72.86%          12.17%        2.61%    2.61%       1.67%         1.39%           1.73%
2011 Census   -              44.89%          12.65%        6.64%    2.74%       2.72%         7.02%           1.52%

Page 38: Open Data: Analysis and Visualisation

Comparison of the distribution of ethnicity with the 2011 Census

2011 Census Twitter

White British (Quintiles)

Page 39: Open Data: Analysis and Visualisation

Gender and Age Analysis of Twitter Users by using their ‘forenames’

Page 40: Open Data: Analysis and Visualisation

Gender Analysis of Twitter Users

[Figure: bar chart with categories Male, Female, Unisex, Not Found; vertical axis: 0% to 60%; two series: Number of Tweets, Number of Unique Users]

Page 41: Open Data: Analysis and Visualisation

Age estimation from ‘forenames’

[Figure: age distributions for the forenames PAUL, BETTY, GUY, and MUHAMMAD; horizontal axis: Age group (five-year bands from 0-4 to 85+); vertical axis: Percent (0% to 45%)]

Data: Monica (CACI, Ltd.) and Birth Certificate Data (Office for National Statistics)

Page 42: Open Data: Analysis and Visualisation

Age-Sex structure of Twitter Users and 2011 Census

Male Female

Page 43: Open Data: Analysis and Visualisation

Tweets by different Land-use Categories

Page 44: Open Data: Analysis and Visualisation

Temporal Activity: Tweets from different Land-use Categories

Page 45: Open Data: Analysis and Visualisation

Ethnic Segregation of Twitter Users

Page 46: Open Data: Analysis and Visualisation

Segregation Analysis

• To find out the level of integration/segregation of different types of Twitter users

• During different hours of the week and weekends

• Information Theory Index

 

Page 47: Open Data: Analysis and Visualisation

Segregation Analysis

• The value of the information theory index is between 0 (low segregation) and 1 (high segregation).

Ethnic Groups     H (Domestic buildings and gardens)   H (Week Nights)   H (Week Days)   H (Weekend)
British           0.483                                0.401             0.211           0.315
Irish             0.670                                0.571             0.357           0.475
White Other       0.630                                0.510             0.303           0.420
Pakistani         0.765                                0.679             0.488           0.633
Indian            0.748                                0.673             0.451           0.590
Bangladeshi       0.864                                0.834             0.671           0.784
Black Caribbean   0.831                                0.808             0.548           0.666
Black African     0.764                                0.704             0.492           0.640
Chinese           0.712                                0.608             0.403           0.524
Other             0.710                                0.593             0.374           0.497
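
The slides do not show how H was calculated. As a rough sketch, the widely used two-group form of the information theory index (one ethnic group against everyone else) can be computed as below. The function names and the toy counts are assumptions for illustration; `group` is the number of tweeting users from one group in each areal unit, and `total` is the number of all users in that unit (every unit is assumed to have at least one user).

```r
# Binary entropy in bits; terms with a zero proportion contribute nothing
binary_entropy <- function(p) {
  term <- function(x) ifelse(x > 0, x * log2(x), 0)
  -(term(p) + term(1 - p))
}

# Two-group information theory index:
#   H = sum_i t_i * (E - E_i) / (E * T)
# where E is the entropy of the whole study area, E_i the entropy of unit i,
# t_i the population of unit i, and T the total population.
info_theory_index <- function(group, total) {
  p_i <- group / total                 # group share in each unit
  P   <- sum(group) / sum(total)       # group share in the whole area
  E   <- binary_entropy(P)
  sum(total * (E - binary_entropy(p_i))) / (E * sum(total))
}

# Toy data: the group is confined to one unit (H = 1)
info_theory_index(group = c(10, 0), total = c(10, 10))
# Toy data: the group is spread evenly across both units (H = 0)
info_theory_index(group = c(5, 5), total = c(10, 10))
```

Running the function once per ethnic group and per time window (week nights, week days, weekend) would give a table of the same shape as the one above.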

Page 48: Open Data: Analysis and Visualisation

Extending the analysis to other cities

Page 49: Open Data: Analysis and Visualisation

Tweet density map of London

Page 50: Open Data: Analysis and Visualisation

Tweet density map of Paris

Page 51: Open Data: Analysis and Visualisation

Tweet density map of New York City

Page 52: Open Data: Analysis and Visualisation

Top 10 ethnic groups in London

Page 53: Open Data: Analysis and Visualisation

Top 10 ethnic groups in Paris

Page 54: Open Data: Analysis and Visualisation

Top 10 ethnic groups in NYC

Page 55: Open Data: Analysis and Visualisation

Tweeting Activity by different Ethnic Groups (NYC)

[Figure: one map panel per ethnic group: English, Spanish, German, Jewish, Irish, Italian, Portuguese, Scottish, Black Caribbean, Chinese]

Page 56: Open Data: Analysis and Visualisation

Tweeting Activity by different Ethnic Groups (Paris)

[Figure: one map panel per ethnic group: French, German, Turkish, Spanish, Italian, Portuguese, English, Polish]

Page 57: Open Data: Analysis and Visualisation

Gender Analysis

Page 58: Open Data: Analysis and Visualisation

Exploring the Languages on Twitter

Page 59: Open Data: Analysis and Visualisation

Data available through the Twitter API

• User Creation Date
• Followers
• Friends
• User ID
• Language
• Location
• Name
• Screen Name
• Time Zone
• Geo Enabled
• Latitude
• Longitude
• Tweet date and time
• Tweet text

Page 60: Open Data: Analysis and Visualisation

Twitter Languages (World)

Page 61: Open Data: Analysis and Visualisation

Twitter Languages (Europe)

Page 62: Open Data: Analysis and Visualisation

Twitter Language Maps

Page 63: Open Data: Analysis and Visualisation

Twitter Language Maps

Page 64: Open Data: Analysis and Visualisation

Twitter Language Maps

Page 65: Open Data: Analysis and Visualisation

Temporal Analysis of the data sets

Page 66: Open Data: Analysis and Visualisation

Temporal Analysis of the Twitter Data

• Data: 12 September, 2012 – 25 September, 2013

• We extracted a total of approx. 800 million tweets over the last year

• A temporal activity analysis of different cities could potentially reveal a lot of information about the residents of the city

• However, Twitter data is not clean and has many problems!

Page 67: Open Data: Analysis and Visualisation

Problems with the data

1) Extracting the data for individual cities or places

• Use bounding boxes to extract the data
– New York City: NE corner 40.91762, -73.7004; SW corner 40.47662, -74.2589

• http://isithackday.com could be used to find the bounding boxes of different cities
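
A bounding-box filter of this kind can be sketched in R as follows. The `tweets` data frame, its column names, and the `in_box` helper are made up for illustration. The New York City coordinates are the ones on the slide, with the first pair treated as the north-east corner because it has the larger latitude and the more easterly longitude.

```r
# Made-up sample of geo-tagged tweets (id, latitude, longitude)
tweets <- data.frame(
  id  = 1:4,
  lat = c(40.7128, 51.5074, 40.9000, 40.4000),
  lon = c(-74.0060, -0.1278, -73.8000, -74.0000)
)

# Bounding box for New York City, taken from the slide
nyc <- list(north = 40.91762, east = -73.7004,
            south = 40.47662, west = -74.2589)

# TRUE for every point that falls inside the box
in_box <- function(lat, lon, box) {
  lat >= box$south & lat <= box$north &
    lon >= box$west & lon <= box$east
}

nyc_tweets <- tweets[in_box(tweets$lat, tweets$lon, nyc), ]
nyc_tweets
```

A rectangle is only an approximation of a city boundary, so some tweets from neighbouring areas will be included.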

Page 68: Open Data: Analysis and Visualisation

Problems with the data

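2) Twitter timestamps are recorded in GMT or BST, depending on the date. Converting them to the local time of other cities is very time consuming.

• 12 p.m. in London is 4 a.m. in Los Angeles when both cities are on standard time (GMT and PST) or both are on daylight saving time (BST and PDT).
• The gap changes to 7 hours in the weeks when only one of the two cities has changed its clocks, so the time stamp has to be interpreted before it is converted.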
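
Base R can handle the GMT/BST distinction if timestamps are parsed with a named time zone rather than a fixed offset. The sketch below is an illustration with made-up dates, not the procedure used in the study. It parses London noon on three dates in 2013 and prints the corresponding Los Angeles clock time, relying on the time zone database that ships with R or the operating system.

```r
# Noon in London on three dates: winter (GMT), a March week in which the
# US has already moved to daylight saving time but the UK has not, and summer (BST)
london_noon <- as.POSIXct(
  c("2013-01-15 12:00", "2013-03-20 12:00", "2013-07-15 12:00"),
  tz = "Europe/London"
)

# The same instants expressed as Los Angeles local time
la_time <- format(london_noon, format = "%H:%M", tz = "America/Los_Angeles")
la_time
# "04:00" "05:00" "04:00"
```

Using the "Europe/London" zone name means R decides for each timestamp whether GMT or BST applies, which avoids hand-written offset rules.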

Page 69: Open Data: Analysis and Visualisation

Temporal Analysis of different cities

• Approx. 170 million tweets were sent from the following 30 cities.

[Figure: bar chart of the number of tweets per city; vertical axis: Number of Tweets (Millions), 0 to 40,000,000; cities in the order shown: Jakarta, Istanbul, Paris, Sao Paulo, New York City, London, Los Angeles, Rio de Janeiro, Mexico City, Riyadh, Tokyo, Chicago, Buenos Aires, Madrid, Dallas, Philadelphia, Manchester, Houston, Washington, Toronto, Boston, Seoul (Korea), Dubai, San Francisco, Osaka (Japan), Atlanta, Sydney, Melbourne, Glasgow, Dublin]

Page 70: Open Data: Analysis and Visualisation

Temporal Analysis of different cities

LONDON

Page 71: Open Data: Analysis and Visualisation

Temporal Analysis of different cities

LONDON

PARIS

Page 72: Open Data: Analysis and Visualisation

Temporal Analysis of different cities

JAKARTA

Page 73: Open Data: Analysis and Visualisation

Temporal Analysis of different cities

JAKARTA

RIYADH

Page 74: Open Data: Analysis and Visualisation

Temporal Analysis of different cities

JAKARTA

ISTANBUL

Page 75: Open Data: Analysis and Visualisation

Introduction to R

Page 76: Open Data: Analysis and Visualisation

What is R?

• The R statistical programming language is a free open source package based on the S language developed by Bell Labs.

• The language is very powerful for writing programs.

• Many statistical functions are already built in.

• Very easy to create maps and different visualizations.

Page 77: Open Data: Analysis and Visualisation

What is R?

• You will have to write some code to get things done!

• R is available at www.r-project.org

• Supports both 32-bit and 64-bit Windows PCs, Linux, Unix, and Mac OS operating systems

Page 78: Open Data: Analysis and Visualisation

Getting Started

• The R GUI?

Page 79: Open Data: Analysis and Visualisation

Getting Started

Page 80: Open Data: Analysis and Visualisation


Interacting with R

Math:

> 1 + 1
[1] 2
> 1 + 1 * 7
[1] 8
> (1 + 1) * 7
[1] 14
> sqrt(16)
[1] 4

Variables:

> x <- 1
> x
[1] 1
> y <- 2
> y
[1] 2
> z <- x+y
> z
[1] 3

Page 81: Open Data: Analysis and Visualisation

Importing Data

• How do we get data into R?

• First make sure your data is in an easy to read format such as CSV (Comma Separated Values)

• Use code:
– D <- read.csv("path", sep=",", header=TRUE)
– D <- read.table("path", sep=",", header=TRUE)

Page 82: Open Data: Analysis and Visualisation

Working with data.

• Accessing columns.
• D has our data in it… but you can’t see it directly.
• To select a column use D$column.
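
As a small self-contained illustration of importing a CSV file and selecting a column, the sketch below first writes a made-up file to a temporary location. The file contents and the `name` column are invented; the `wg` column name is borrowed from the histogram slide.

```r
# Write a small made-up CSV so the example can run anywhere
path <- tempfile(fileext = ".csv")
writeLines(c("name,wg", "a,1.5", "b,2.5", "c,4.0"), path)

# Read it into a data frame
D <- read.csv(path, sep = ",", header = TRUE)

head(D)      # show the first rows of the data
names(D)     # list the column names
D$wg         # select the column called wg
mean(D$wg)   # use the column in a calculation
```

Typing `D` on its own at the prompt, or calling `head(D)` or `str(D)`, is the usual way to inspect the data after loading it.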

Page 83: Open Data: Analysis and Visualisation

Basic Graphics

• Histogram
– hist(D$wg)
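
A runnable version of the histogram call might look like this. The data are randomly generated stand-ins, since the slides do not say what the `wg` column contains.

```r
# Made-up data frame with a numeric column called wg
set.seed(1)
D <- data.frame(wg = rnorm(100, mean = 50, sd = 10))

# Draw the histogram; hist() also returns the bin breaks and counts
h <- hist(D$wg, main = "Histogram of wg", xlab = "wg")
h$breaks
h$counts
```

The returned object is handy when the bin counts are needed for further analysis as well as for the plot.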

Page 84: Open Data: Analysis and Visualisation

How to create a heat map in R ?

Page 85: Open Data: Analysis and Visualisation

How to create a heat map in R ?

• Three steps:
– Read a CSV file
– Choose the colours for the heat map
– Create the heat map

Page 86: Open Data: Analysis and Visualisation

How to create a heat map in R ?

• Step 1: Read a CSV file

read.csv("FILE NAME", sep=",", header=TRUE)

Page 87: Open Data: Analysis and Visualisation

How to create a heat map in R ?

• Step 1: Read a CSV file

read.csv("FILE NAME", sep=",", header=TRUE)

• Assign it to a variable

Input <- read.csv("FILE NAME", sep=",", header=TRUE)

The assignment operator ‘<-’ is typed with the ‘<’ (less than) and ‘-’ (dash) symbols.

Page 88: Open Data: Analysis and Visualisation

How to create a heat map in R ?

• Step 1: Read a CSV file

Page 89: Open Data: Analysis and Visualisation

How to create a heat map in R ?

• Step 2: Choose the colours for the heat map

colours <- c(0) (Create a placeholder vector to hold the colours)

Page 90: Open Data: Analysis and Visualisation

How to create a heat map in R ?

• Step 2: Choose the colours for the heat map

colours <- c(0)

colours[1] <- "#FDD49E"

colours[2] <- "#FDBB84"

colours[3] <- "#FC8D59"

colours[4] <- "#EF6548"

colours[5] <- "#D7301F"

colours[6] <- "#B30000"

colours[7] <- "#7F0000"
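
The same seven colours can also be set up in a single call, which avoids the placeholder vector. The second part of this sketch is an optional alternative that is not in the slides: `colorRampPalette()` from base R interpolates a ramp of any length between two end colours, so the interpolated values will not match the hand-picked ones exactly.

```r
# The seven colours from the slide, lightest to darkest, in one call
colours <- c("#FDD49E", "#FDBB84", "#FC8D59", "#EF6548",
             "#D7301F", "#B30000", "#7F0000")

# Optional alternative: interpolate 7 colours between the two end colours
ramp <- colorRampPalette(c("#FDD49E", "#7F0000"))(7)

length(colours)
ramp
```

More colours give finer gradations in the heat map; fewer colours make the classes easier to tell apart.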

Page 91: Open Data: Analysis and Visualisation

How to create a heat map in R ?

• Step 2: Choose the colours for the heat map

colours <- c(0)

colours[1] <- "#FDD49E"

colours[2] <- "#FDBB84"

colours[3] <- "#FC8D59"

colours[4] <- "#EF6548"

colours[5] <- "#D7301F"

colours[6] <- "#B30000"

colours[7] <- "#7F0000"

Page 92: Open Data: Analysis and Visualisation

How to create a heat map in R ?

• Step 3: Create the heat map

heatmap(Input1_matrix, scale="col", Rowv = NA, Colv = NA, col=colours)

Page 93: Open Data: Analysis and Visualisation

How to create a heat map in R ?

• Step 3: Create the heat map

heatmap(Input1_matrix, scale="col", Rowv = NA, Colv = NA, col=colours)

Input1_matrix: the input data, as a numeric matrix

Page 94: Open Data: Analysis and Visualisation

How to create a heat map in R ?

• Step 3: Create the heat map

heatmap(Input1_matrix, scale="col", Rowv = NA, Colv = NA, col=colours)

scale: whether to apply scaling to the data. Options are ‘row’, ‘column’ (abbreviated to ‘col’ here), and ‘none’.

Page 95: Open Data: Analysis and Visualisation

How to create a heat map in R ?

• Step 3: Create the heat map

heatmap(Input1_matrix, scale="col", Rowv = NA, Colv = NA, col=colours)

Rowv = NA and Colv = NA switch off the reordering of rows and columns and the dendrograms. Leave them as they are!

Page 96: Open Data: Analysis and Visualisation

How to create a heat map in R ?

• Step 3: Create the heat map

heatmap(Input1_matrix, scale="col", Rowv = NA, Colv = NA, col=colours)

col: the colours chosen in Step 2
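
Putting the three steps together, a complete run might look like the sketch below. The data are invented so that the example is self-contained; with a real file, `Input` would come from `read.csv()`. The slides do not show how `Input1_matrix` is created, so the `data.matrix()` conversion here is an assumption: `heatmap()` needs a numeric matrix rather than a data frame.

```r
# Step 1 (stand-in): made-up tweet counts by time of day (rows) and day (columns)
Input <- data.frame(
  Mon = c(120, 340, 560, 410),
  Tue = c(130, 360, 590, 400),
  Wed = c(110, 330, 600, 430),
  row.names = c("Night", "Morning", "Afternoon", "Evening")
)

# Assumed conversion step: turn the data frame into a numeric matrix
Input1_matrix <- data.matrix(Input)

# Step 2: choose the colours, lightest to darkest
colours <- c("#FDD49E", "#FDBB84", "#FC8D59", "#EF6548",
             "#D7301F", "#B30000", "#7F0000")

# Step 3: draw the heat map without reordering rows or columns
heatmap(Input1_matrix, scale = "column", Rowv = NA, Colv = NA, col = colours)
```

With `scale = "column"` each column is standardised separately, so the colours show which rows are high or low within a given day rather than across the whole table.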

Page 97: Open Data: Analysis and Visualisation

Any Questions?

• Open Data
• Crowd-Sourced Data (Social Media)
• Analysis and Visualisation Challenges
• Twitter Case Study
– Spatial Analysis
– Temporal Analysis
• R
– A brief introduction
– How to create heat maps