Olav ten Bosch
MSIS, Dublin, 14-16 April 2014
On the use of internet robots for official statistics
Overview
– Why internet as a data source (IAD)?
– Internet robots, how do they work?
– Applications:
‐ Airline tickets
‐ Housing market
‐ Clothing
‐ “Robot assisted data collection”
– Conclusion
Why IAD? (1)
Administrative sources
– Tax, social security services
– Municipalities / Provinces
– Supermarkets
Surveys
Internet sources
Less!!!
Faster, better, more efficient
New indicators
Which content is original, reliable, stable, representative and accessible?
Internet sources
Why IAD? (2)
– Internet prices for CPI?
– Real estate sites for housing statistics?
– Internet vacancies for job statistics?
– Social media sentiment for consumer confidence?
– Trade in second-hand goods as economic indicators?
– Travel activity for tourism statistics?
Robots / crawlers / bots / spiders / scrapers: how do they work? (1)
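[Diagram: you give commands to a browser; the browser sends internet requests to the website; the website returns code, images, style, data, etc.; the browser presents these to you as graphical markup.]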
Robots / crawlers / bots / spiders / scrapers: how do they work? (2)
[Diagram: a robot/spider/crawler handles navigation and sends internet requests to the website; the website returns code, images, style, data, etc.; the robot delivers data to you.]
Robots / crawlers / bots / spiders / scrapers: how do they work? (3)
[Diagram: a robot/spider/crawler handles navigation and sends internet requests to the website; the website returns code, images, style, data, etc.; the robot delivers data and is monitored actively.]
Generic software for:
‐ site navigation
‐ product details
‐ monitoring
Agile
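The extraction step of such a robot can be sketched as follows. This is a minimal illustration, not the software used in the projects described here: the sample page, the `price` class name and the helper names `PriceExtractor` and `extract_prices` are all assumptions made for the example. A real robot would first fetch the page over the internet (for instance with `urllib.request.urlopen`) and navigate from page to page; here a hard-coded page stands in for the website's response.

```python
from html.parser import HTMLParser


class PriceExtractor(HTMLParser):
    """Collects the text of every element whose class list contains 'price'.

    The class name 'price' is a hypothetical convention of the sample page.
    """

    def __init__(self):
        super().__init__()
        self._in_price = False
        self.prices = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; value may be None.
        classes = (dict(attrs).get("class") or "").split()
        if "price" in classes:
            self._in_price = True

    def handle_endtag(self, tag):
        # Any closing tag ends the current price element (no nesting assumed).
        self._in_price = False

    def handle_data(self, data):
        if self._in_price and data.strip():
            self.prices.append(data.strip())


def extract_prices(html):
    """Return the price strings found in one page of HTML."""
    parser = PriceExtractor()
    parser.feed(html)
    return parser.prices


# Invented sample page standing in for what a website would return.
SAMPLE_PAGE = """
<html><body>
  <div class="product"><span class="name">Ticket A</span><span class="price">129.00</span></div>
  <div class="product"><span class="name">Ticket B</span><span class="price">89.50</span></div>
</body></html>
"""

print(extract_prices(SAMPLE_PAGE))
```

The robot keeps only the data and discards the code, images and style that a browser would need for display.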
Airline tickets (1)
Robot collection versus manual collection
[Chart: ticket price Amsterdam - Milano, 11 Feb to 10 Aug; vertical axis 0 to 250; series: Robot, Manual]
Airline tickets (2)
Price of a ticket over time
[Chart: price w.r.t. average (-80% to 60%) against days before departure (-120 to 0); series: Barcelona, London, Milan, Rome]
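The chart on this slide plots each ticket price relative to the average price against the number of days before departure. One plausible way to compute such a series is sketched below. The formula (price minus average, divided by average) is an assumption, since the slide does not define it, and the observations are invented for illustration.

```python
from statistics import mean


def relative_to_average(prices):
    """Return each price as a fractional deviation from the mean price.

    Assumed definition: (price - average) / average, so 0.30 means 30% above average.
    """
    avg = mean(prices)
    return [(p - avg) / avg for p in prices]


# Invented observations: (days before departure, ticket price).
observations = [(-90, 80.0), (-60, 90.0), (-30, 100.0), (0, 130.0)]

days = [d for d, _ in observations]
deviations = relative_to_average([p for _, p in observations])

for day, dev in zip(days, deviations):
    print(f"{day:5d} days: {dev:+.0%}")
```

With these invented numbers the average is 100, so the ticket is 20% below average 90 days ahead and 30% above average on the day of departure.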
Housing market (1)
Housing market (2)
Dynamics of the ‘database behind’ become visible
Clothing (1):
2 sites: very volatile data
Clothing (2):
Challenges:
‐ from volatile data to stable statistics
‐ how to classify multiple less structured data sources
Seasonal pattern
Robot-assisted data collection (1)
– Use case: few price observations on many sites
– Example: price of a cinema ticket
– “Robot tool” to automatically check whether prices have changed
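The check performed by such a tool can be sketched as a comparison between the previous and the current price per site. This is only an illustration of the idea: the function name `find_changes`, the site names and the prices are hypothetical, and the actual tool may work differently. Sites that are flagged would then be looked at by a human collector.

```python
def find_changes(previous, current):
    """Return {site: (old_price, new_price)} for sites that need attention.

    A site is flagged when its price differs from the previous observation
    or when no current price could be read (new_price is None).
    """
    changes = {}
    for site, old_price in previous.items():
        new_price = current.get(site)  # None if the robot found no price
        if new_price != old_price:
            changes[site] = (old_price, new_price)
    return changes


# Invented cinema ticket prices from the previous and the current round.
previous = {"cinema-a.example": 9.50, "cinema-b.example": 10.00, "cinema-c.example": 8.75}
current = {"cinema-a.example": 9.50, "cinema-b.example": 10.50}

for site, (old, new) in find_changes(previous, current).items():
    print(f"{site}: {old} -> {new}")
```

Unchanged sites are confirmed automatically, so manual effort goes only to the sites where something changed or the price could no longer be found.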
Robot-assisted data collection (2)
Conclusion
– Using the internet as a data source, we can measure statistical phenomena in a completely different way
– It is powerful to combine fast internet data with reliable (but slower) administrative data
– We should redesign statistics with the possibilities of internet data in mind
Challenges:
– Legal framework
– The internet changes continuously: how to turn volatile data sources into reliable statistics?
– We need advanced statistical methods, processes and IT