Spatio-temporal linkage of real and virtual identity Muhammad Adnan (and Paul Longley) University College London
Nov 02, 2014
Geodemographics
• “Analysis of people by where they live [places]” (Sleight, 1993:3)
• Social similarity, not locational proximity
[Diagram labels: Home, Address, Person, Area]
Identity of individuals in the real world
• Name (Forename & Surname)
• Surnames have geographic concentrations
• Prospects for linkage with socio-economic data
• E.g. Analysing the socio-economic circumstances of different ethnic groups
An example – gbnames.publicprofiler.org
Longley Cheshire
An example – Output Area Classification
Kingston upon Hull Hereford
A socio-economic and ethnic classification
Wu
Source: Cheshire and Longley (2011)
Courtesy: James Cheshire
Wordle.net
The European scale
16 countries.
400 million people.
5.95 million unique surnames
Courtesy: James Cheshire
Onomap classification
UK Electoral Roll
Pablo Mateos
Surnames: Garcia, Pérez, Sánchez, Rodríguez, ...
Forenames: Juan, Rosa, Marta, ...
– Several iterations until self-contained cluster is exhausted
– Cluster assigned a cultural, ethnic & linguistic Onomap type
– Probability of ethnicity assigned to each name
Mateos et al (2007) CASA Working Paper 116
Forename-Surname clustering (based on Hanks and Tucker, 2000)
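The iteration described on this slide can be sketched as a simple expansion: start from one surname, collect the forenames that occur with it, then the surnames that occur with those forenames, and repeat until nothing new is added. This is only an illustrative simplification, not the Onomap implementation; the function name `grow_cluster` and the sample records are hypothetical, and no probability of ethnicity is computed here.

```python
from collections import defaultdict


def grow_cluster(records, seed_surname):
    """Expand a name cluster from one seed surname.

    records: iterable of (forename, surname) pairs.
    Returns (forenames, surnames) reachable from the seed by alternately
    following shared forenames and shared surnames.
    """
    forenames_of = defaultdict(set)  # surname -> forenames seen with it
    surnames_of = defaultdict(set)   # forename -> surnames seen with it
    for forename, surname in records:
        forenames_of[surname].add(forename)
        surnames_of[forename].add(surname)

    surnames = {seed_surname}
    forenames = set()
    while True:
        new_forenames = set().union(*(forenames_of[s] for s in surnames)) - forenames
        forenames |= new_forenames
        new_surnames = set().union(*(surnames_of[f] for f in forenames)) - surnames
        surnames |= new_surnames
        if not new_forenames and not new_surnames:
            break  # the self-contained cluster is exhausted
    return forenames, surnames


# Hypothetical sample records (not taken from the Electoral Roll).
records = [
    ("Juan", "Garcia"), ("Rosa", "Garcia"), ("Juan", "Pérez"),
    ("Rosa", "Sánchez"), ("Marta", "Sánchez"), ("Marta", "Rodríguez"),
    ("John", "Smith"), ("Mary", "Smith"), ("Mary", "Jones"),
]
cluster_forenames, cluster_surnames = grow_cluster(records, "Garcia")
print(sorted(cluster_surnames))
```

In practice a single bridging name would merge unrelated clusters, so a real procedure would need frequency thresholds or weights; that part is left out of this sketch.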
WorldNames CEL clusters
Source: Mateos et al (2011)
Uncertainty and virtual identity
• Identity increasingly shaped by online activities
  – => value may be leveraged from the fusion of physical and virtual data sources
• Data fusion and generalisation to relate physical and virtual properties
• Use of residence alongside activity patterns and social network information
Most of us have virtual identities
• Email address; social media accounts
• People use different procedures and providers to establish virtual identities
• Harvesting these data has interesting potential applications
  • Cyber crime
  • Cyber geodemographics (Facebook has already started this)
Most of us have virtual identities
• Facebook data mining engine
  • Analyses the words you use and tailors advertisements accordingly
Starting Point
http://worldnames.publicprofiler.org
• Worldnames holds data for a population of approximately 1 billion across 28 countries
• Approximately 1.6 million unique users have visited the website since 2008
• Worldnames has been archiving ‘Surname search’, ‘Email Address’, ‘Gender’, and ‘IP Address’ for searches over the past 6 months
  • c. 175,000 records: email validation
  • 150,000 usable ‘IP Address’ entries
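The slide does not say how records were validated. As a minimal sketch of that kind of filtering, the checks below keep an email address only if it has a plausible shape and an IP address only if it parses and is publicly routable. The helper names and the deliberately loose pattern are assumptions, not the Worldnames procedure.

```python
import ipaddress
import re

# Deliberately loose: exactly one '@', no whitespace, a dot in the domain.
EMAIL_RE = re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s.]+$")


def is_plausible_email(value: str) -> bool:
    """True if the string looks like local-part@domain.tld."""
    return EMAIL_RE.fullmatch(value.strip()) is not None


def is_usable_ip(value: str) -> bool:
    """True for a syntactically valid, publicly routable IP address."""
    try:
        ip = ipaddress.ip_address(value.strip())
    except ValueError:
        return False
    return ip.is_global  # excludes private, loopback and reserved ranges


print(is_plausible_email("a.goodeve@aol.com"))
print(is_usable_ip("128.40.214.196"))
print(is_usable_ip("192.168.1.10"))
```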
IP Address to Latitude/Longitude conversion
http://quova.com
An API to convert “IP addresses” to their corresponding latitude / longitude values
A search for an IP Address in UCL (128.40.214.196)
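The Quova API itself is not documented in these slides, so the sketch below does not call it. It only illustrates the general idea of IP geolocation with a local lookup table that maps network blocks to coordinates. The table contents, the assumption that 128.40.0.0/16 covers the UCL address above, and the approximate London coordinates are all illustrative.

```python
import ipaddress

# Hypothetical lookup table: network block -> (latitude, longitude).
GEO_TABLE = [
    # Assumed block for the UCL address on the slide; coordinates approximate.
    (ipaddress.ip_network("128.40.0.0/16"), (51.52, -0.13)),
    # Documentation-only range with dummy coordinates.
    (ipaddress.ip_network("192.0.2.0/24"), (0.0, 0.0)),
]


def ip_to_latlon(ip_text):
    """Return (lat, lon) for the most specific matching block, else None."""
    ip = ipaddress.ip_address(ip_text)
    best = None
    for network, coords in GEO_TABLE:
        if ip in network and (best is None or network.prefixlen > best[0].prefixlen):
            best = (network, coords)
    return best[1] if best else None


print(ip_to_latlon("128.40.214.196"))
```

A block-level lookup locates the registered network rather than the person, which is one source of the geocoding accuracy problems raised in the conclusion.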
Top Countries
Website was searched from 155 countries over the past 6 months
UNITED STATES 76708
UNITED KINGDOM 21892
CANADA 8154
GERMANY 7158
ITALY 4058
AUSTRALIA 2978
BRAZIL 2440
FRANCE 2028
ARGENTINA 1958
SPAIN 1830
NEW ZEALAND 1236
NETHERLANDS 1074
GREECE 1040
SWITZERLAND 992
BELGIUM 940
POLAND 880
AUSTRIA 874
MEXICO 834
IRELAND 710
SWEDEN 630
UK and Ireland
Europe
North America
South America
India, China, Japan, Singapore
Popular Surname Searches
SMITH 708
JONES 306
JOHNSON 258
ANDERSON 224
WILLIAMS 222
MILLER 218
MARTIN 202
WILSON 194
BROWN 194
MOORE 188
THOMAS 178
TAYLOR 170
CLARK 164
LEE 160
ROBERTS 156
DAVIS 152
CAMPBELL 144
LEWIS 138
HARRIS 138
MITCHELL 136
Popular Email Domains
GMAIL.COM 31842
HOTMAIL.COM 22098
YAHOO.COM 15542
AOL.COM 5550
COMCAST.NET 2696
HOTMAIL.CO.UK 1948
MSN.COM 1624
WEB.DE 1522
YAHOO.CO.UK 1290
GMX.DE 1260
SBCGLOBAL.NET 1246
BTINTERNET.COM 860
HOTMAIL.IT 844
VERIZON.NET 798
GOOGLEMAIL.COM 742
LIVE.COM 742
COX.NET 708
ATT.NET 632
MAILINATOR.COM 616
LIBERO.IT 616
Popular Email Domains by Surnames
Smith (English): GMAIL.COM, YAHOO.COM, HOTMAIL.COM, AOL.COM, MAILINATOR.COM
Jones (Welsh): GMAIL.COM, HOTMAIL.COM, YAHOO.COM, COMCAST.NET, GOOGLEMAIL.COM
Johnson (English): GMAIL.COM, HOTMAIL.COM, YAHOO.COM, MSN.COM, VERIZON.NET
Perez (Spanish): GMAIL.COM, HOTMAIL.COM, YAHOO.ES, CHARTER.NET, GRANDECOM.NET
Gupta (Indian): GMAIL.COM, HOTMAIL.COM, YAHOO.COM, GOOGLEMAIL.COM, INDIATIMES.COM
Meyer (German): GMAIL.COM, HOTMAIL.COM, YAHOO.COM, AOL.COM, GMX.DE
Popular Email Domains by Country
UK: GMAIL.COM, HOTMAIL.COM, HOTMAIL.CO.UK, YAHOO.CO.UK, YAHOO.COM
USA: GMAIL.COM, YAHOO.COM, HOTMAIL.COM, AOL.COM, COMCAST.NET
France: HOTMAIL.FR, GMAIL.COM, HOTMAIL.COM, YAHOO.FR, LAPOSTE.NET
Germany: WEB.DE, GMX.DE, T-ONLINE.DE, YAHOO.DE, GMAIL.COM
Brazil: HOTMAIL.COM, GMAIL.COM, YAHOO.COM.BR, IG.COM.BR, BOL.COM.BR
Japan: YAHOO.COM, YAHOO.CO.JP, GMAIL.COM, HOTMAIL.COM, MSN.COM
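Rankings like the ones above (by surname or by country) amount to counting email domains within each group. A minimal sketch, assuming the archive can be read as (group, email) pairs; the function names and the sample rows are hypothetical.

```python
from collections import Counter, defaultdict


def domain_of(email):
    """Return the part after the last '@', upper-cased as on the slides."""
    return email.rsplit("@", 1)[1].upper()


def top_domains(rows, n=5):
    """rows: iterable of (group, email). Returns {group: [(domain, count), ...]}."""
    counts = defaultdict(Counter)
    for group, email in rows:
        counts[group][domain_of(email)] += 1
    return {group: counter.most_common(n) for group, counter in counts.items()}


# Hypothetical rows; the group could equally be a surname instead of a country.
rows = [
    ("UK", "a@gmail.com"), ("UK", "b@hotmail.co.uk"), ("UK", "c@gmail.com"),
    ("Germany", "d@web.de"), ("Germany", "e@web.de"), ("Germany", "f@gmx.de"),
]
ranking = top_domains(rows, n=1)
print(ranking)
```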
Top GoogleMail.com users
Top surnames: BINDER, WATKINS, WHITE, WOODS, ROBINSON, SLEEMAN, BENNETT, RITCHIE, SHARP, ROLLINGS
GoogleMail.com users
• Surname ‘Binder’
Germany Switzerland
GoogleMail.com users
• Surname ‘Blackbourn’
New Zealand
Who uses their surname as part of their email address?
• Approximately 40% of the users have their surname as part of their email address
  • [email protected] (Surname: Harper)
  • [email protected] (Surname: Kempe)
• Top Countries (bar chart, axis 0 to 50): SOUTH AFRICA, SLOVENIA, UNITED KINGDOM, IRELAND, INDIA, MALAYSIA, PORTUGAL, GERMANY, COSTA RICA, AUSTRIA, LUXEMBOURG, BELGIUM, CANADA, NEW ZEALAND, AUSTRALIA, CHINA, TURKEY, CROATIA, SWITZERLAND, UNITED STATES
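The 40% figure implies a test of whether the searched surname appears in the part of the address before ‘@’. The slides do not give the matching rule, so the sketch below assumes a plain case-insensitive substring match; the function names and the example.com addresses are hypothetical.

```python
def surname_in_email(surname, email):
    """True if the surname occurs in the local part (left of '@')."""
    local_part = email.split("@", 1)[0].lower()
    return surname.lower() in local_part


def share_with_surname(records):
    """records: iterable of (surname, email). Returns the matching fraction."""
    records = list(records)
    if not records:
        return 0.0
    hits = sum(surname_in_email(surname, email) for surname, email in records)
    return hits / len(records)


# Hypothetical records; addresses are invented for illustration.
sample = [
    ("Harper", "j.harper@example.com"),
    ("Kempe", "kempe1970@example.org"),
    ("Smith", "bluesky@example.net"),
    ("Lee", "surfer42@example.com"),
    ("Wilson", "cat.lover@example.org"),
]
print(share_with_surname(sample))
```

A substring test over-counts short surnames (for example "Lee" inside "sleepy"), so a stricter rule such as matching on separators may be preferable.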
Who uses long email addresses?
• Grand mean email length of 8 characters
  • Number of characters on the left side of ‘@’
  • United Kingdom, USA, Canada, and European countries
• People from South American countries and India have long email addresses (average length: 13 characters)
• South Indians have longer email addresses than North Indians
BRAZIL [email protected] (14 characters)
CHILE [email protected] (25 characters)
URUGUAY [email protected] (17 characters)
INDIA [email protected] (18 characters)
ARGENTINA [email protected] (13 characters)
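The length measure used here is the number of characters left of ‘@’, averaged per country. A minimal sketch, assuming (country, email) pairs; the function names and the sample addresses are invented for illustration.

```python
from collections import defaultdict
from statistics import mean


def local_part_length(email):
    """Number of characters on the left side of '@'."""
    return len(email.split("@", 1)[0])


def mean_length_by_country(rows):
    """rows: iterable of (country, email). Returns {country: mean length}."""
    lengths = defaultdict(list)
    for country, email in rows:
        lengths[country].append(local_part_length(email))
    return {country: mean(values) for country, values in lengths.items()}


# Hypothetical rows.
rows = [
    ("UNITED KINGDOM", "jsmith@example.com"),      # 6 characters
    ("UNITED KINGDOM", "a.jones123@example.com"),  # 10 characters
    ("BRAZIL", "joao.silva.rio@example.com"),      # 14 characters
]
averages = mean_length_by_country(rows)
print(averages)
```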
What else we can infer from email addresses
• Internet service provider
  • A.GOODEVE@AOL.COM
  • [email protected]
  • [email protected] (Person lives in a rural area of northeast Oregon)
• Country of origin
  • [email protected]
  • [email protected]
• Probable temporal aspects
  • [email protected]
  • [email protected]
  • [email protected]
• Probable forename of a person
  • [email protected]
  • [email protected]
  • [email protected]
• How up to date someone is with technology
  • [email protected]
  • [email protected]
• Professional Affiliations
  • [email protected]
• Work Locations
  • [email protected]
  • [email protected]
  • [email protected]
• Studying
  • [email protected]
  • [email protected]
  • [email protected]
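Several of these inferences can be read from the structure of the address. The sketch below pulls out three hints: a country from a country-code top-level domain, a likely academic affiliation from the domain suffix, and a four-digit year in the local part. Treating the year as one reading of "temporal aspects" is an assumption, and the lookup tables, function name and sample address are hypothetical and far from complete.

```python
import re

# Small illustrative tables; a real analysis would need full lists.
COUNTRY_BY_TLD = {
    "uk": "United Kingdom", "de": "Germany", "fr": "France",
    "it": "Italy", "br": "Brazil", "jp": "Japan",
}
ACADEMIC_SUFFIXES = (".ac.uk", ".edu")
YEAR_RE = re.compile(r"(?<!\d)(?:19|20)\d{2}(?!\d)")  # standalone 4-digit year


def describe_email(email):
    """Return rough hints inferred from an email address."""
    local_part, domain = email.lower().rsplit("@", 1)
    tld = domain.rsplit(".", 1)[-1]
    year_match = YEAR_RE.search(local_part)
    return {
        "domain": domain,
        "country": COUNTRY_BY_TLD.get(tld),             # None for .com, .net, ...
        "academic": domain.endswith(ACADEMIC_SUFFIXES),
        "year": int(year_match.group(0)) if year_match else None,
    }


hints = describe_email("kempe1970@mail.example.ac.uk")
print(hints)
```

These are hints only: a generic domain such as GMAIL.COM says nothing about country, and a year in an address need not be a birth year.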
Conclusion and future work
• There are some interesting patterns found in the study of email addresses
  • some problems (accuracy of geocoding techniques)
• Prospect of data linkage of data coded to unit postcode level
  • cluster analysis and data mining techniques
• Future work may involve the data mining of Facebook and Twitter data
  • issues of generalisation
• Visualisation of the data
Any Questions?
Thanks for Listening
A research agenda
1 Acquire relevant real and virtual data sources and devise DBMS
2 Devise GB-wide classification of NICT usage at neighbourhood scale
3 Devise GB-wide classification of social network traffic
4 Develop enhanced worldnames site to harvest real and virtual user data
5 Undertake text analysis of worldnames user data and use to link classifications (2) and (3)
6 Devise, implement and analyse social networking application and cybergeodemographic classification