Big Data ESSNet WP1: Web Scraping for Job Vacancy Statistics Nigel Swier
Today’s talk is just the tip of the iceberg…
Potential of On-line Job Vacancy (OJV) Data
Current Official Estimates (Survey) vs. Online data
Frequency: Quarterly (survey) vs. Real-time? (online)
Industry Sector
Enterprise Size
Job type / skills
Geography
National Totals
• More frequent
• More timely
• More granular
• Less burden
• Cheaper???
The Partners
SGA-1 partners (from Feb 2016):
• UK (lead)
• Germany
• Slovenia
• Greece
• Italy
• Sweden
SGA-2 partners (from Aug 2017):
• Belgium
• France
• Portugal
The People
• Wiesbaden, April 2016
• Rome, November 2016
• Thessaloniki, Sept 2017
• Milan, March 2018
Six challenges with using
On-line Job Vacancy (OJV) data
for statistical purposes
Challenge 1:
Not all jobs are advertised on-line. Coverage is
incomplete and not representative.
Recruitment by Channels, Germany 2016 (Source: JVS)
Challenge 2:
There is no definitive source of OJV data
• National Employment Agencies
• Job portals:
– Job Boards
– Job Search Engines
– Hybrid Portals
• Enterprise websites
• Data aggregators:
– Commercial providers
– CEDEFOP
Duplication
Image: Creative Commons
Challenge 3:
Much OJV data is unstructured. Text processing
and analysis is required to extract useful
information.
Challenge 4:
Some job ads are not within the scope of official
statistics definitions of a job vacancy
• International Jobs
• Ghost Vacancies
• Unpaid Student Internships
Challenge 5:
The official definition of a job vacancy does not
correspond directly to the concept of a live job ad
One ad, multiple
vacancies
Challenge 6:
The specific job vacancy data landscape varies
between countries:
• Size of country and number of job portals
• Digital penetration
• Characteristics of the economy and the labour market
• The role of National Employment Agencies
• Differences in the Job Vacancy Survey
• Language(s)
• Legal Issues
Summary of Challenges
OJV data is not representative of the labour market and
there are definitional issues that make it difficult to
compare directly with official statistics
Data Access
OJV Data Landscape
Job Boards
Private Employment
Agencies
Employers
Job Search
Engines
National Employment
Agency
Enterprise
Websites
Data Aggregators
Public Policy
Cedefop
Official Job Vacancy
Statistics
Approaches to Data Access
• Direct web scraping
• Point and click
• Programmatic (e.g. Python Scrapy)
• Web-scraping enterprise websites
• Agreed Access
• National employment agency
• Private job portals
• Commercial providers
• CEDEFOP
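As a rough illustration of the programmatic route, the sketch below extracts job ad fields from a page of HTML. It is only a sketch under stated assumptions: it uses Python's standard-library `html.parser` rather than Scrapy, it parses an inline sample string instead of fetching a live page, and the markup, the class names (`job`, `title`, `location`) and the two ads are invented for illustration.

```python
from html.parser import HTMLParser

# Invented sample markup; a real portal will have its own structure.
SAMPLE_PAGE = """
<ul>
  <li class="job"><h2 class="title">Staff Nurse</h2><span class="location">Leeds</span></li>
  <li class="job"><h2 class="title">Welder</h2><span class="location">Hull</span></li>
</ul>
"""


class JobAdParser(HTMLParser):
    """Collects one dict per <li class="job"> element."""

    def __init__(self):
        super().__init__()
        self.ads = []
        self._field = None  # name of the field currently being read, if any

    def handle_starttag(self, tag, attrs):
        css_class = dict(attrs).get("class", "")
        if tag == "li" and css_class == "job":
            self.ads.append({})          # start a new ad record
        elif css_class in ("title", "location") and self.ads:
            self._field = css_class      # next text node belongs to this field

    def handle_data(self, data):
        if self._field and data.strip():
            self.ads[-1][self._field] = data.strip()

    def handle_endtag(self, tag):
        self._field = None               # stop capturing at any closing tag


parser = JobAdParser()
parser.feed(SAMPLE_PAGE)
scraped_ads = parser.ads
```

In practice a framework such as Scrapy would also handle crawling, request throttling and robots.txt handling, none of which is shown here.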
Data Access by Country
Data Handling
• Data cleaning and deduplication
• Text analysis and classification
• Flow to stock transformation
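The deduplication and flow-to-stock steps could be sketched roughly as follows. This is an illustrative assumption, not the method used by any partner: the record fields (`title`, `employer`, `location`, `posted`, `removed`), the matching key, the rule of keeping the widest live period for merged duplicates, and the three sample ads are all invented.

```python
from datetime import date

# Invented records: the first two are the same ad seen on two portals.
ads = [
    {"title": "Nurse", "employer": "City Hospital", "location": "Leeds",
     "posted": date(2018, 1, 3), "removed": date(2018, 2, 10)},
    {"title": "nurse ", "employer": "City  Hospital", "location": "leeds",
     "posted": date(2018, 1, 5), "removed": date(2018, 2, 20)},
    {"title": "Welder", "employer": "Acme Ltd", "location": "Hull",
     "posted": date(2018, 2, 1), "removed": date(2018, 2, 15)},
]


def dedup_key(ad):
    """Normalise case and whitespace so trivially different copies match."""
    return tuple(" ".join(ad[f].lower().split())
                 for f in ("title", "employer", "location"))


def deduplicate(records):
    """Merge ads with the same key, keeping the widest live period."""
    merged = {}
    for ad in records:
        key = dedup_key(ad)
        if key not in merged:
            merged[key] = dict(ad)
        else:
            kept = merged[key]
            kept["posted"] = min(kept["posted"], ad["posted"])
            kept["removed"] = max(kept["removed"], ad["removed"])
    return list(merged.values())


def stock_on(records, reference_day):
    """Turn flows (posting and removal dates) into a stock of live ads."""
    return sum(1 for ad in records
               if ad["posted"] <= reference_day < ad["removed"])


unique_ads = deduplicate(ads)
```

Counting live ads on a reference date gives a stock figure that is closer in concept to a survey reference day than a count of new postings, although one ad may still cover several vacancies.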
Classifying textual data with machine learning
Can industry and occupation be classified from a job ad?
Occupation is fairly straightforward in this case
Industry is more difficult. This company is an employment
agency, not the employer. But there are clues…
Text pre-processing and feature extraction
• Text Standardisation
• Stop word removal
• White/blacklists
• Stemming (e.g. “making” => “mak”)
• Lemmatization:
– Standard (e.g. “making” => “make”)
– Sophisticated:
• Feature Extraction:
– Bag of words / n-grams
– Term frequency
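The steps listed above can be sketched in a few lines of standard-library Python. The stop word list, the suffix list and the sample ad are invented for illustration, and `toy_stem` is a deliberately crude suffix stripper, not a real stemming algorithm such as Porter's; lemmatization is not shown because it needs a dictionary-based tool.

```python
import re
from collections import Counter

STOP_WORDS = {"a", "an", "and", "are", "for", "in", "is", "of", "the", "to", "we"}
SUFFIXES = ("ing", "ed", "es", "s")  # toy list, checked in this order


def standardise(text):
    """Lower-case the text and replace punctuation with spaces."""
    return re.sub(r"[^a-z0-9\s]", " ", text.lower())


def toy_stem(token):
    """Strip the first matching suffix, keeping at least three characters."""
    for suffix in SUFFIXES:
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def preprocess(text):
    """Standardise, drop stop words, then stem what is left."""
    tokens = standardise(text).split()
    return [toy_stem(t) for t in tokens if t not in STOP_WORDS]


def ngrams(tokens, n):
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def bag_of_words(text, max_n=2):
    """Term frequencies for all 1-grams up to max_n-grams."""
    tokens = preprocess(text)
    counts = Counter()
    for n in range(1, max_n + 1):
        counts.update(ngrams(tokens, n))
    return counts


ad = "We are looking for a nurse for making home visits in the Bristol area."
features = bag_of_words(ad)
```

Here `toy_stem("making")` gives `"mak"`, matching the stemming example on the slide, and `features` holds the unigram and bigram counts that a classifier would take as input.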
Machine Learning
• Training data
• Libraries:
– scikit-learn
– RTextTools
• Best performing algorithms/approaches
– SVM with Linear Kernel (Portugal)
– Logistic Regression (France)
– Multinomial Naïve Bayes (Germany)
– Ensemble (Belgium)
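To make one of the listed approaches concrete, here is a minimal multinomial Naïve Bayes classifier with Laplace smoothing, written in plain Python so that it runs without any library. It is a teaching sketch, not the German study's implementation: the class name `TinyMultinomialNB`, the four training ads and the labels `health` and `ict` are invented, and a real study would use a library such as scikit-learn with far more training data and official occupation codes.

```python
import math
from collections import Counter, defaultdict


class TinyMultinomialNB:
    """Multinomial Naive Bayes over whitespace tokens (illustrative only)."""

    def __init__(self, alpha=1.0):
        self.alpha = alpha  # Laplace smoothing constant

    def fit(self, docs, labels):
        self.class_counts = Counter(labels)
        self.word_counts = defaultdict(Counter)
        for doc, label in zip(docs, labels):
            self.word_counts[label].update(doc.lower().split())
        self.vocab = {w for c in self.word_counts.values() for w in c}
        return self

    def predict(self, doc):
        total_docs = sum(self.class_counts.values())
        best_label, best_score = None, -math.inf
        for label, n_docs in self.class_counts.items():
            counts = self.word_counts[label]
            denom = sum(counts.values()) + self.alpha * len(self.vocab)
            score = math.log(n_docs / total_docs)  # log prior
            for word in doc.lower().split():
                if word in self.vocab:  # ignore words never seen in training
                    score += math.log((counts[word] + self.alpha) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Invented toy training set; labels are not official occupation codes.
train_ads = [
    "registered nurse hospital ward",
    "staff nurse care home",
    "software developer python",
    "java developer backend",
]
train_labels = ["health", "health", "ict", "ict"]

model = TinyMultinomialNB().fit(train_ads, train_labels)
```

With this toy data, `model.predict("python developer")` returns `"ict"` and `model.predict("nurse ward")` returns `"health"`.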
Results: Classifying Occupation
Occupation Coding Confusion Matrix, Portugal Study
Results: Classifying Industry
NACE Coding Confusion Matrix, Belgium Study
Other approaches to classifying data
• String matching
• Levenshtein distance
• Jaccard Similarity
• Phrase-based classification (PBC)
• Controlled vocabularies
• More precision
• Greater transparency
• Less Scalable
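The two string-matching measures named above can be sketched as follows. The function names, the small controlled vocabulary and the misspelt input are illustrative assumptions; the Jaccard similarity here is computed on word sets, which is only one of several possible choices (character n-grams are another).

```python
def levenshtein(a, b):
    """Minimum number of single-character edits turning a into b."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]


def jaccard(a, b):
    """Share of distinct words that two strings have in common."""
    set_a, set_b = set(a.lower().split()), set(b.lower().split())
    if not set_a and not set_b:
        return 1.0
    return len(set_a & set_b) / len(set_a | set_b)


def best_match(title, vocabulary):
    """Pick the vocabulary entry with the smallest edit distance."""
    return min(vocabulary, key=lambda term: levenshtein(title.lower(), term))


# Hypothetical controlled vocabulary of job titles.
VOCABULARY = ["software developer", "staff nurse", "welder"]
matched = best_match("Softwre Developer", VOCABULARY)
```

Because every decision traces back to a visible distance or overlap score, this kind of matching is easier to explain than a trained model, at the cost of maintaining the vocabulary by hand.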
Methodology
• Quality Assessment Frameworks
• Assessing Coverage
• Matching and Linking
• Time series analysis / Nowcasting
Assessment against aggregates
Assessment against statistical units
Also, illustrates an LSTM neural network nowcasting model using multiple OJV sources
JV count comparison for a selected company, UK Study
Time Series Analysis
Statistical Outputs
Experimental Outputs For Slovenia
Job Vacancy Flash Estimates
Job Vacancies by Local Areas
Key Conclusions (and Questions)
• Agreed access arrangements are generally better than direct
web scraping
• OJV data cannot replace the Job Vacancy Survey
• OJV data does not correspond to target concepts and only
measures part of the labour market. How useful are these
measures?
• If useful, how should these measures be presented alongside
the official estimates?
• A successful collaboration with CEDEFOP is essential. How do
we get the best possible quality data for official statistics
purposes?
Future Perspectives
Disruptive technologies
Drivers of Cedefop RLMI work
Complement skills intelligence toolkit
Better labour market information for better policies
Lack of comparable data and systematic analysis
Key characteristics of the project
• Based on previous feasibility study
– Interesting and unique set of results
– Data used for Eurostat hackathon
– Data used for various activities of WP 1
• Key features
– Preselected well analysed sources
– All 28 EU MS / all EU official languages
– Skills in ESCO v.1 + other attributes
• Time horizon
– Early release (Dec. 2018)
– CZ, DE, ES, FR, IT, IE, UK
– Final version (Dec 2020)
Connect to ESS net and Eurostat
• Valuable two-way cooperation
– Big Data Task Force
– EU hackathon
– Data4policy Sherpa Meeting
– ESS net WP1
• What next?
– Validation
– Production