
ESSnet Big Data

Specific Grant Agreement No 1 (SGA-1)

https://webgate.ec.europa.eu/fpfis/mwikis/essnetbigdata

https://ec.europa.eu/eurostat/cros/content/essnetbigdata_en

Framework Partnership Agreement Number 11104.2015.006-2015.720

Specific Grant Agreement Number 11104.2015.007-2016.085

Work Package 0

Co-ordination

Deliverable 0.2

Final Technical Report

Final version 2017-08-31

ESSnet co-ordinator:

Peter Struijs (CBS, Netherlands)

[email protected]

telephone : +31 45 570 7441

mobile phone : +31 6 5248 7775

Prepared by:

Martin van Sebille (WP 0, CBS, Netherlands) Peter Struijs (WP 0, CBS, Netherlands)

Nigel Swier (WP 1, ONS, United Kingdom) Monica Scannapieco (WP 2, ISTAT, Italy)

Maiki Ilves (WP 3, EE, Estonia) Anke Consten (WP 4, CBS, Netherlands)

David Salgado (WP 5, INE, Spain) Boro Nikic (WP 6, SURS, Slovenia)

Anna Nowicka (WP 7, GUS, Poland) Marc Debusschere (WP 9, SB, Belgium)

Contents

Executive summary

1. Introduction

1.1 Background

1.2 General approach

1.3 Organisation

2. Results of the work packages

2.1 Webscraping / Job Vacancies

2.2 Webscraping / Enterprise Characteristics

2.3 Smart Meters

2.4 AIS Data

2.5 Mobile Phone Data

2.6 Early Estimates

2.7 Multi Domains

3. Issues encountered

3.1 General issues

3.2 Issues at the level of the work packages

Annex: Communication and dissemination

Executive Summary

This is deliverable 0.2, the Final Technical Report, of the Specific Grant Agreement No 1 (SGA-1) of

the Framework Partnership Agreement (FPA) for the ESSnet Big Data. The FPA, which has 22

partners, covers the period from January 2016 to May 2018. SGA-1 covers the period from February

2016 to July 2017, on which this deliverable reports. This includes the period covered by deliverable

0.1, the Intermediate Technical Report.

The ESSnet has organised the core of its work around seven work packages, each work package

dealing with one pilot and a concrete output. The pilots cover five phases: (1) data access, (2) data

handling, (3) methodology and technology, (4) statistical output, and (5) future perspectives. SGA-1

covers only some of the five phases for each of the work packages, the rest being covered by SGA-2.

These are the main results obtained in SGA-1:

WP 1 Webscraping / Job Vacancies

For SGA-1, this work package focussed mainly on job portals. Since this pilot involves each country

taking its own specific approach, there are a lot of country specific results. However, general

selection criteria have been identified for targeting portals for scraping. Taking into account the

distinction between job vacancy and job advertisement, a conceptual model is proposed of how on-

line job advertisements correspond to the target population. In practical terms this may be defined

as all vacancies that are available to be measured by existing job vacancy surveys. As well as

providing a conceptual framework for understanding the coverage of job vacancies from on-line

sources and how these relate to the measurement of all job vacancies, this approach may also

provide the conceptual basis for an estimation framework, including an approach to data integration.

Furthermore, the work package has identified an opportunity to work with the European Centre for

the Development of Vocational Training (CEDEFOP).

WP 2 Webscraping / Enterprise Characteristics

Six use cases have been identified in the pilot: (1) enterprise URLs inventory, (2) e-commerce in

enterprises (about predicting whether or not an enterprise provides web sales facilities on its

website), (3) job vacancy ads on enterprises’ websites, (4) social media presence on enterprises’

webpages, (5) sustainability reporting on enterprises’ websites (linked to the UN Sustainable

Development Goals), and (6) relevant categories of enterprises’ activity sector (NACE) aimed at

checking or completing statistical business registers. A common use case template was developed

and has been used. For the use cases, a total of sixteen pilots were performed and all of them were

mapped to a general “logical architecture”. Also, a report was produced on legal aspects related to

web scraping of enterprise websites, aimed at showing the real possibilities for the NSIs to perform

activities of web scraping. These appear to be generally favourable, although the situation differs

from country to country.

WP 3 Smart Meters

This pilot has investigated data access and data handling of smart meter electricity data. It has

carried out a literature study and a survey on access to smart meter data, which was sent to the

NSIs of all EU member countries in the spring of 2016, with 18 responses. It appeared that only two

countries currently have access to data: Denmark and Estonia. Several countries were aware of

substantial legal barriers. Some countries such as Poland are in the process of drawing up legislation

that will enable the use of smart meter data. For two countries, Estonia and Denmark, the pilot has

defined and assessed the quality of smart meter electricity data, and a synthetic dataset was

analysed as well, aimed at generating demo output and developing and testing statistics and

algorithms for situations where linkage to enterprise or household characteristics is necessary.

WP 4 AIS Data

The work package investigates whether real-time measurement data of ship positions (measured by

the so-called AIS-system) can be used to improve the quality and internal comparability of existing

statistics and for new statistical products relevant for the ESS. Reports were produced on (1) the

possibilities and pitfalls of creating a database with AIS-data for official statistics, (2) deriving harbour

visits and linking data from maritime statistics with AIS-data, and (3) sea traffic analyses using AIS-

data. While the possibility of using AIS data from EMSA is still being investigated, AIS data from

Dirkzwager was used, and the data quality analysed. Visualisations were made, showing the coverage

of the ships by the data, and showing the path of a ship through time. A method to build a reference

frame of maritime ships was developed. First results show that AIS data can be used as a backbone

for maritime statistics. This is important, since the added value of running a pilot with AIS-data at

European level is linked to the fact that the source data is generic worldwide and data can be

obtained at European level.

WP 5 Mobile Phone Data

This work package has focused exclusively on data access during the SGA-1, which will be needed for

SGA-2. A preliminary analysis of the issues regarding the access to mobile phone data was made,

which was the basis for the design of a questionnaire surveying the status of this access across the

ESS. Belgium, Finland, France, and Italy were found to have succeeded in their negotiations to have

access to a concrete mobile phone data set that can be used for SGA-2 (together with those of the UK,

the Netherlands, and Germany, as new partners for the second phase). Spain and Romania are still in

contact with mobile network operators (MNOs) to pursue this goal. A workshop was held in Luxembourg to bring together mobile

network operators, national statistical offices, Eurostat and other stakeholders, including some other

international organizations (UN, OECD, ITU, DG Connect, DG Digit). Finally, with technical

assistance from the Estonian company Positium, which is an international expert in accessing and

processing mobile phone data for statistical purposes, a set of guidelines for the access to these data

has been produced with technical, business and practical recommendations for partners of the ESS.

WP 6 Early Estimates

The aim of this pilot is to investigate how a combination of multiple big data sources and existing

official statistical data can be used in order to create existing or new early estimates for statistics. A

list was compiled of possible data sources and the statistical domains where they could be employed,

and it was decided that the most promising and interesting ones concerned combining sources for

early estimates on economic indicators. The economic indicators and possible sources were further

specified. In this context two pilots were conducted, one by Statistics Finland and one by the

Slovenian NSI. The relationship between GDP and Slovenian traffic sensor data was investigated, and

Statistics Finland produced nowcasts of turnover indices. These results were used for a business case

for the research of SGA-2.

WP 7 Multi Domains

The aim of this pilot is to find out how a combination of big data sources, administrative data and

statistical data may enrich current statistical output. Three statistical domains are being investigated:

(1) population, (2) tourism/border crossings and (3) agriculture. For population, three areas are

looked at: daily (life) satisfaction, the mood of the population associated with public events (e.g., Brexit,

voting), and morbidity areas (e.g., flu). For tourism/border crossings, a number of possible data

sources have been identified and investigated, for instance with regard to traffic intensity

information. For agriculture, the focus is on recognizing crop types based on satellite data.

1. Introduction

1.1 Background

This is deliverable 0.2, the Final Technical Report, of the Specific Grant Agreement No 1 (SGA-1) of

the Framework Partnership Agreement (FPA) for the ESSnet Big Data. The FPA stretches from January

2016 to May 2018. SGA-1, and thus this report, covers the period from February 2016 to July 2017. A

second SGA (SGA-2) covers the period of January 2017 till the end of the FPA. This means that SGA-1

and SGA-2 have a time overlap from January till July 2017, but this report does not include activities

of SGA-2.

This Final Technical Report builds on deliverable 0.1 of SGA-1, the Intermediate Technical Report, of

January 2017, and can be seen as an extension. In fact, much of its content is the same, as several

work packages ended soon after the Intermediate Technical Report was written. However, contrary

to the Intermediate Technical Report, it does not have an annex with an overview of the activities

carried out by the work packages, as this will be included in the Final Report on the Implementation

of the Action, due 60 days following the closing date of the action. For the same reason, this report

does not comprise an evaluation of the budget needed and used.

The overall objective of this ESSnet is to prepare the ESS for integration of big data sources into the

production of official statistics. The FPA is founded on a consortium of 22 partners, consisting of 20

National Statistical Institutes (NSIs) and two Statistical Authorities. For SGA-1, all but two NSIs have

been involved as beneficiaries of the agreement, so SGA-1 was carried out by 18 partners.

For SGA-1 as well as SGA-2, the consortium has organised the core of its work around a number of

work packages, each work package (WP) dealing with one pilot and a concrete output. In SGA-1

there are seven work packages, focused on specific sources or domains:

1. WP 1 Webscraping / Job Vacancies

2. WP 2 Webscraping / Enterprise Characteristics

3. WP 3 Smart Meters

4. WP 4 AIS Data

5. WP 5 Mobile Phone Data

6. WP 6 Early Estimates

7. WP 7 Multi Domains

A separate work package, WP 0, was created for the co-ordination of the ESSnet. For dissemination a

separate work package was created as well, WP 9. That work package is also responsible for

facilitating communication. Given the overall objective, the findings need to be generalised. This will

be done in SGA-2, for which a new work package is added, WP 8 (methodology, but also covering

other overarching aspects, such as IT).

SGA-1 specifies the agreed outputs of the work packages, and their inputs, both in terms of the number of

days contributed by partner and work package and in terms of material costs. For SGA-1, the total

budget available is one million euro, but only 90% of costs, as a maximum, will be reimbursed. (The

same budget and percentage apply to SGA-2.)

For more specifics on the FPA, SGA-1 and SGA-2 reference is made to the actual agreements. For the

current deliverable it is useful to mention that an overview of the milestones and deliverables of

SGA-1 can be found on page 10 of the signed version of Annex II of SGA-1, an overview of the

distribution of manpower (by partner and work package) is given on page 43, and an overview of the

foreseen physical meetings on page 44 of the same document. For each work package the document

(Annex II of SGA-1) provides a description of tasks, specifying, among other things, the tasks to be

carried out, the milestones and deliverables, and the number of days each partner contributes to the

work package. A specification of the budget is given in Annex III of SGA-1.

The remainder of this chapter describes the approach generally taken to the pilots, and the way the

ESSnet has organised itself. The next chapter presents the results obtained so far, for the seven work

packages. The third chapter describes the issues encountered so far in the action, at a general level

and for the work packages, and also provides an outlook for SGA-2 for each of the work packages.

1.2 General approach

The pilots, as foreseen in the FPA, have one thing in common: they cover the complete statistical

process, from data acquisition to the production of statistical output. In addition, and in accordance

with the general objective to prepare the ESS for the integration of big data sources into the

production of official statistics, the pilots also consider future perspectives. Thus, all pilots recognise

the following five phases:

1. Data access

2. Data handling

3. Methodology and technology

4. Statistical output

5. Future perspectives

The tasks, milestones and deliverables of the work packages refer to these phases. However, SGA-1

covers only some of the five phases for each of the work packages, the rest being covered by SGA-2.

And the phases covered by SGA-1 are not the same for each pilot (work package), as for some areas

it was possible to plan ahead further (in time and phases) than for other areas. In particular, WP 5

concentrated on data access problems in SGA-1 and could not plan further ahead, as data processing

would depend on the results of the efforts to realise data access. Therefore, WP 5 was planned to

end in December 2016, whereas the other work packages would continue into 2017. For WP 6 and

WP 7, a longer exploration period was needed for the first two phases, therefore they were planned

to end in February 2017. This explains the overlap in time of SGA-1 and SGA-2.

At a practical level, this approach needed to be facilitated in several respects. First of all, an

organisational approach was needed to ensure that the agreed output would be produced with the

resources foreseen. This is the subject of the next section, 1.3. In order for the partners of the ESSnet

to be able to process big data, some IT facilities were considered necessary, although these would

probably not be needed at the beginning of the work of the work packages, when data access had to

be arranged first. IT facilities were ensured by subscribing to the so-called Sandbox in Ireland. This is

explained further in section 3.1. Facilities were also needed for communication, in order to share and

work on documents together and for virtual meetings, among other things. This is the subject of the

Annex to this report.

1.3 Organisation

The organisation has been carried out as foreseen in the agreement of SGA-1. Each work package

has a work package leader who is in charge of organising the realisation of the milestones and

deliverables of the work package. This includes the organisation of virtual and physical meetings of

the work package members. The results of the work packages are described in chapter 2.

At the level of the ESSnet as a whole, the main instrument for co-ordination is the monthly virtual

meeting of the work package leaders, including WP 0 and WP 9, supported by the secretary (Martin

van Sebille) provided by WP 0. These are called the meetings of the co-ordination group, or CG

meetings. Eighteen virtual meetings were held during SGA-1. One big physical co-ordination meeting

was held in Tallinn, Estonia, in June 2016, for which a separate report is available on the wiki of the

ESSnet:

https://webgate.ec.europa.eu/fpfis/mwikis/essnetbigdata/images/8/85/Minutes_WP0_20160613-15_Tallinn_meeting.pdf

That meeting was attended by almost all partners of the ESSnet. A smaller

physical co-ordination meeting was held in Brussels, Belgium, in November 2016, for which a

separate report has been made available on the wiki of the ESSnet:

https://webgate.ec.europa.eu/fpfis/mwikis/essnetbigdata/index.php/Coordinating_Group_2016_11_17-18_Brussels

In addition, a dissemination workshop was held in Sofia in February 2017 for a

wider audience, in which the main results of the ESSnet achieved up till then were presented and

discussed. Again, a separate report for the meeting is available on the wiki of the ESSnet:

https://webgate.ec.europa.eu/fpfis/mwikis/essnetbigdata/images/d/db/Dissemination_Workshop_2017_02_23-24_Sofia_Minutes.pdf

The aim of the virtual CG meetings and the physical co-ordination meetings was to stay in control of

the realisation of SGA-1. For most CG meetings the partners were asked to provide information on

the realisation of the foreseen budget in the form of a spreadsheet, which was consolidated by the

secretariat (WP 0), thereby enabling the CG to link the progress in producing results to the resources

actually spent. The meetings were also used, of course, to discuss cross-cutting issues. In addition to

the work package leaders, the virtual CG meetings and other co-ordination meetings were also

attended by the Eurostat project manager of the ESSnet, Albrecht Wirthmann.

In order to ensure the quality of the deliverables of the ESSnet, a Review Board was created,

consisting of three members: Lilli Japec (chair), Anders Holmberg and Piet Daas (who is also the

leader of WP 8 of SGA-2). All work package leaders have arranged that their deliverables were

reviewed by the Review Board. This has worked well, both in practical terms (planning etc.) and in

terms of content (usefulness of the reviews): all deliverables of SGA-1 have been reviewed and the

comments have been taken into account. The members of the Review Board are also invited to the

CG meetings.

The organisational arrangements of the ESSnet are considered to be quite adequate. There is no

need to make any changes to these arrangements in SGA-2.

2. Results of the work packages

2.1 Webscraping / Job Vacancies

The aim of this pilot is to demonstrate by concrete estimates which approaches (techniques,

methodology etc.) are most suitable to produce statistical estimates in the domain of job vacancies

and under which conditions these approaches can be used within the ESS. The focus is on feasibility

and the pilot explores different sources including job portals, job adverts on enterprise websites, and

other sources (e.g. government employment agency data and commercial data providers).

The business justification for this pilot is that on-line job advertisements can provide much more

detailed and timely information about job vacancies than official job vacancy surveys. In particular

they contain information about the types of jobs (e.g. occupation, associated skills) and where in a

country those jobs are advertised. Therefore, these data could provide valuable additional

information about the labour market for policy making.

For SGA-1, the work package is focussed mainly on job portals. For SGA-2 the intention is to explore

the potential of capturing vacancies from enterprise websites using approaches developed by WP 2.

This together with legal issues has involved close collaboration with WP 2 (led by Italy).

The work package is led by the United Kingdom with support from Germany, Greece, Italy, Slovenia

and Sweden. Italy has observer status to help coordination with WP 2. The planned SGA-1 activities

of the work package have been grouped into the following high level tasks:

1. Data Access: This task includes preparing an inventory of relevant job portals in each participating

country, qualitative assessment of the information available, review of job vacancy statistics,

coverage assessment and conceptual analysis in comparison with current definitions, and feasibility

of accessing data from third party sources.

2. Data Handling: This task includes studying the technical and legal aspects of web scraping job

portals, evaluating legal aspects, designing web scraping experiments, exploration of web scraping

technologies, executing data collection (including over sustained time periods) and quality assurance of

third party data sources. The aim is to test the feasibility of web scraping, but not to produce

production ready systems.

3. Methodology for Output Production: This task is focused on the processing steps required to

transform semi-structured web data from job portals into a structure suitable for analysis. This

includes data cleaning, evaluation and treatment of missing data, de-duplication of job adverts (both

within and across job portals), linking data with survey/business register data, coding and classifying

data, quality assessment of structured data and presentation of experimental results.
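
As an illustration of the de-duplication step named in task 3, the following is a minimal sketch and not the method used by any of the participating countries. It assumes that each scraped advert has already been reduced to a few structured fields; the field names (`title`, `employer`, `location`, `portal`) and the sample records are hypothetical. Adverts are treated as duplicates when their normalised fields match exactly, which is a much simpler rule than the matching explored in the virtual sprints.

```python
import re
import unicodedata


def normalise(text):
    """Lower-case, strip accents and punctuation, collapse whitespace."""
    text = unicodedata.normalize("NFKD", text)
    text = "".join(ch for ch in text if not unicodedata.combining(ch))
    text = re.sub(r"[^a-z0-9 ]+", " ", text.lower())
    return " ".join(text.split())


def dedup_key(advert):
    """Build a comparison key from hypothetical structured fields."""
    fields = ("title", "employer", "location")
    return tuple(normalise(advert.get(field, "")) for field in fields)


def deduplicate(adverts):
    """Keep the first advert seen for each key; later matches are dropped."""
    seen = {}
    for advert in adverts:
        seen.setdefault(dedup_key(advert), advert)
    return list(seen.values())


# Hypothetical sample: the first two records describe the same job
# as published on two different portals.
adverts = [
    {"portal": "A", "title": "Data Analyst", "employer": "Acme d.o.o.", "location": "Ljubljana"},
    {"portal": "B", "title": "data analyst ", "employer": "ACME D.O.O", "location": "Ljubljana"},
    {"portal": "B", "title": "Nurse", "employer": "City Hospital", "location": "Maribor"},
]
unique_adverts = deduplicate(adverts)
```

Exact matching on normalised keys will miss re-worded adverts and may merge distinct vacancies with identical titles, so in practice a similarity measure on the free-text description would probably be needed as well.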

Deliverables

Three deliverables were produced during SGA-1:

1.1 Inventory and Qualitative Assessment of Job Portals (Completed July 2016)

The first deliverable aimed to establish criteria for evaluating job portals to help inform decisions as

to which job portals should be targeted for web scraping. This was considered particularly important

for large countries such as Germany and the UK, which have so many internet job

portals that it is infeasible to scrape all of them. Therefore, selection criteria are needed to identify

which portals should be prioritised for web scraping.

1.2 Interim Technical Report (completed November 2016)

The Interim Technical Report focused on the initial work on the data access and data handling tasks

covering the technical issues of web scraping and working with web scraped data. This incorporated

the results of the individual pilots executed within each country along with the work of the first two

virtual sprints covering de-duplication and matching. This also included a review of definitional

issues, in particular, the differences between the target concept of a job vacancy and the target

measure of an on-line job advertisement.

1.3 Final Technical Report (completed July 2017)

The Final Technical Report elaborated on the work of the first report covering the work completed by

each country up until the end of SGA-1 and also included results of the virtual sprint on quality

frameworks.

In addition, WP 1 provided some input into the WP 2 deliverable:

2.1 WP2 report on Legal Issues Relating to the Scraping of Enterprise websites (February 2017)

Main findings

Since this pilot involves each country taking different approaches, there are various country specific

results that are difficult to summarise in this report. The following is a summary of the main findings

from the WP 1 pilot during the SGA-1 period.

On-line job advertisements are part of a complex data eco-system typically involving many job

portals of different types that are both competing and cooperating with other portals. The job portal

market is characterised by evolving business models and changing market shares. The complexity of

this ecosystem varies greatly between countries with Germany having over 1600 job portals and

Slovenia having about 30 (and only 2 main ones). Advertisements are typically posted and

republished multiple times and so duplication is a major challenge when compiling a definitive set of

advertised job vacancies from multiple on-line sources. These sources may also include jobs

advertised on a company’s own website.

Although we do not yet have a full grasp of the proportion of on-line jobs in relation to all job

vacancies as measured by official job vacancy surveys, it is clear that some jobs are not advertised

on-line, that the degree of on-line penetration varies between countries, and that coverage has likely

been increasing over time. This presents fundamental statistical challenges in terms of how these

data may be used for measuring both the level of job vacancies and the change over time.

There are also important differences between the target concept of a job vacancy and the target

measure, which is a job advertisement. Apart from a vacancy being advertised multiple times, a

single advertisement may also be used to fill more than one vacancy. In addition some

advertisements, known as ‘ghost vacancies’, may exist without a real underlying vacancy. Some on-

line jobs may only be advertised on either a job portal or an enterprise website. Some jobs

advertised through job portals may identify the employing business whereas others only identify the

employer.

Figure 1. Conceptual model for measuring job vacancies from on-line sources.

In addition, most job vacancies will exist for longer than the period for which they are advertised - a

vacancy needs to be created before it can be advertised and then normally takes some time to fill

after the advertisement has closed. Therefore, there is usually a lag from the time a job is first

advertised to the time it is filled. The Slovenian pilot estimated this to be about 45 days. These

factors mean that it is difficult to directly compare on-line data with official job vacancy survey

estimates.

The web scraping tools used within the pilot include simple “point and click” tools (e.g. Import.io) as

well as more sophisticated web scraping frameworks (e.g. Python Scrapy, Selenium). Point and click

approaches have proved suitable for countries with small volumes of data, but more sophisticated approaches

would be needed for production systems involving larger volumes of data.

However, the much greater technical challenge is around transforming raw web scraped data into

data ready for analysis. On-line job advertisements usually consist of a small number of structured

elements and a larger amount of text containing the full job description. However, even the

structured elements can be messy and considerable effort is required to clean and classify data prior

to analysis. In addition, job portals usually have their own taxonomies for classifying data. There are

also legal and ethical issues around web scraping to consider and while many of these are common

between NSIs, legal uncertainty around web scraping means that NSIs have followed advice from

their own legal departments.

For all these reasons, NSIs (and particularly those in larger countries) should be aiming to build

partnerships with job portal owners, or others with access to job vacancy data, rather than looking to

build their own large scale web scraping systems. WP 1 has identified an opportunity to work with

the European Centre for the Development of Vocational Training (CEDEFOP), which is currently undertaking an EU-wide project

to web scrape job vacancy data for all EU member states. A partnership is being developed with

CEDEFOP with a view to coordinating activities and in the longer term we expect that the data from

this initiative will become available to the wider ESS.

Thus, a key conclusion is that it is better for NSIs to focus on activities where they can add value

rather than simply replicate what CEDEFOP is already doing (and some commercial companies have

already done). One specific area where NSIs have a strategic advantage is in relation to official job

vacancy surveys and how they might be used to better understand on-line job vacancy data. This

involves matching survey reporting units with the on-line data and comparing counts from different

on-line sources, including both portals and enterprise websites. This approach has its own challenges

around matching survey reporting units with on-line advertisements. However, this approach can

form the basis for better understanding issues of data quality.
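The matching of survey reporting units with on-line advertisers can be sketched as follows. This is a minimal illustration under stated assumptions, not the method used by WP 1: the enterprise names, the counts, the 0.85 similarity threshold, the suffix list and the helper names `normalise` and `match_units` are all hypothetical, and fuzzy name matching with Python's `difflib` stands in for whatever record linkage method an NSI would actually apply.

```python
from difflib import SequenceMatcher

LEGAL_SUFFIXES = {"ltd", "limited", "plc", "llp"}  # assumed list, for illustration only


def normalise(name):
    """Lower-case a name, replace punctuation by spaces and drop common legal suffixes."""
    cleaned = "".join(ch if ch.isalnum() or ch.isspace() else " " for ch in name.lower())
    return " ".join(word for word in cleaned.split() if word not in LEGAL_SUFFIXES)


def match_units(survey_units, advertisers, threshold=0.85):
    """Map each survey reporting unit to its most similar advertiser name, or to None."""
    matches = {}
    for unit in survey_units:
        best_name, best_score = None, 0.0
        for advertiser in advertisers:
            score = SequenceMatcher(None, normalise(unit), normalise(advertiser)).ratio()
            if score > best_score:
                best_name, best_score = advertiser, score
        matches[unit] = best_name if best_score >= threshold else None
    return matches


# Hypothetical data: vacancies reported in the survey and advertisements counted on-line.
survey_counts = {"Acme Widgets Ltd": 4, "Northern Bakery PLC": 2, "Quiet Farm": 1}
online_counts = {"ACME Widgets Limited": 7, "Northern Bakery": 2}

links = match_units(survey_counts, online_counts)

# Pair the survey count with the on-line count (None when no advertiser was matched).
comparison = {
    unit: (survey_counts[unit], online_counts[advertiser] if advertiser else None)
    for unit, advertiser in links.items()
}
```

Comparing the two counts per enterprise, rather than only in aggregate, is what makes it possible to see where the sources diverge. Units left unmatched are themselves informative, since they may advertise only through channels that were not scraped.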

It is clear that the gap between on-line job advertisements and what is measured through on-line

surveys means that on-line data could not replace the existing surveys. However, both sources could

be used together with the survey providing control totals with the on-line data providing additional

granularity. An outline of how this could work is presented in Figure 2. There may also be scope to

use the more timely information available from on-line sources to produce nowcasts of job vacancy

rates that could be used to inform economic policy.

Figure 2. Outline Approach to Data Integration
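The idea of using the survey for control totals and the on-line data for additional granularity can be illustrated with a simple proportional allocation. This sketch does not reproduce the approach in Figure 2: the function name `allocate_to_control_total`, the occupation groups and all figures are hypothetical. It also assumes that the distribution of on-line advertisements across categories is a usable proxy for the distribution of vacancies, which the representativity issues discussed in this section call into question.

```python
def allocate_to_control_total(control_total, online_counts):
    """Distribute a survey control total over categories in proportion to on-line ad counts."""
    online_total = sum(online_counts.values())
    if online_total == 0:
        raise ValueError("on-line counts sum to zero, so there are no shares to allocate by")
    return {
        category: control_total * count / online_total
        for category, count in online_counts.items()
    }


# Hypothetical figures: one survey estimate and on-line ad counts by occupation group.
survey_total = 1000.0
ads_by_occupation = {"managers": 300, "technicians": 500, "clerical": 1200}

estimates = allocate_to_control_total(survey_total, ads_by_occupation)
```

By construction the category estimates add up to the survey total, so the level comes from the survey and only the breakdown comes from the on-line source.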

Concrete results

In terms of producing concrete results, Slovenia is the most advanced and some results are expected

early in SGA-2. Slovenia is a small country with a relatively small number of job portals and job

vacancies to be counted and so the problem is more manageable than for larger countries. For

example, it is estimated that the two largest Slovenian job portals capture 95% of all jobs advertised

on-line, which corresponds to about 25% of all Slovenian job vacancies. Administrative data on

all public sector jobs increases the overall coverage to 54%. Some capture and linkage of vacancies

listed on enterprise websites has already been done which suggests the coverage could be increased

further.

However, it is clear that producing meaningful concrete outputs remains challenging, especially for

larger countries. This can be illustrated by comparing a time series of total advertised vacancies

published by two large UK job search engines - Adzuna and Indeed - with the monthly, non-

seasonally adjusted estimates from the ONS JVS (Figure 3). This shows that all three series are quite

different both in terms of their levels and trends over time. Adzuna shows the highest level of

vacancies and has growth in vacancies during 2015 that is not seen in the other series. The Indeed

figures are lower than the ONS but closer than Adzuna and the long term trend is more similar.

However, neither Adzuna nor Indeed shows the seasonal pattern that is apparent in the JVS series.

Figure 3. Total job vacancies by selected sources (2013-2016)

Some of these differences may partly be explained by definitional differences. The stock of current

job vacancies as measured by the ONS JVS is very different to the average number of live job

advertisements over the month as measured by Adzuna. The large difference between the totals for

the two job portals clearly implies their definitions are very different. Another important definitional

issue is that the JVS does not include workers employed directly by employment agencies although

many on-line job ads may be for such positions and it is often impossible to distinguish between

these based on information contained in the advertisement. The proposed approach of matching

survey and on-line data should help in that it provides a basis for understanding these differences

between sources for individual enterprises.

Outlook

The primary focus for the remainder of this pilot (i.e., SGA-2) is to make as much progress as possible

on producing concrete and meaningful outputs using web scraped job vacancy data. As described in

section 3.2, there is a stack of issues, ranging from the complex and changing data ecosystem of

on-line job advertisements to duplication, messy and unstructured data, data access, legal issues,

definitional differences and lack of representativity.

We are clear that our main focus for SGA-2 should be on development of methods for producing

outputs from web scraped data rather than on the problems of scraping, cleaning and

classifying web data, which CEDEFOP is already tackling. However, as the availability of these data

is still some way in the future, most WP 1 partners are continuing to work on improving access to

other sources of data including both government and commercial job portals.

To achieve the primary goal of producing concrete outputs in the available time, our strategy is to

focus on how to take advantage of existing JVS data and the outline data integration model provides

a promising avenue for combining survey and on-line data.

On 21-22 September 2017, WP 1 participants will be meeting in Thessaloniki. This is also the location

of the head office of CEDEFOP and so this will provide an opportunity for CEDEFOP colleagues to join

the meeting, exchange information and discuss plans for collaboration for the remainder of the

ESSnet. Another planned action for SGA-2 is to develop proposals for an informal network to

continue after the end of the ESSnet.

As stated, the intention is that SGA-2 will also look further at the feasibility of using the approaches

developed by WP 2 (web scraping of enterprise websites) and the collection and analysis of job

adverts as a specific use case. The consensus is that this is most likely to be useful in terms of

identifying enterprises that currently have job vacancies. It will be difficult to obtain more detailed

information because of the technical difficulties of creating structured data from websites where the

structure is not known in advance.

In summary, there are substantial challenges ahead, but there is a commitment from WP 1 partners

to do the best job possible.

2.2 Webscraping / Enterprise Characteristics

The purpose of this work package is to investigate whether web scraping, text mining and inference

techniques can be used to collect, process and improve general information about enterprises.

Compared to the “webscraping / job vacancies” pilot, the additional challenges are scraping websites

on a much larger scale and collecting and analysing more unstructured data.

In particular, the aim is twofold:

1. to demonstrate whether business registers can be improved by using webscraping techniques

and by applying model-based approaches in order to predict for each enterprise the values of

some key variables;

2. to verify the possibility to produce statistical outputs using predicted data, in combination or not

with other sources of data (survey or administrative data): in particular, a set of benchmark

estimates might be the ones produced by the survey on “ICT use by enterprises”, a survey

common to EU Member States.

The identified use cases are:

1. Enterprise URLs Inventory. This use case is about the generation of a URL inventory of

enterprises for the Business register.

2. E-Commerce in Enterprises. This use case is about predicting whether an enterprise provides or

not web sales facilities on its website.

3. Job vacancies ads on enterprises’ websites. This use case is about investigating how enterprises

use their websites to handle the job ads.

4. Social Media Presence on Enterprises webpages, aimed at providing information on existence of

enterprises in social media.

5. Sustainability reporting on enterprises’ websites. One of the Sustainable Development Goals

targets set up by the UN is to encourage enterprises to produce regular sustainability reports

highlighting the sustainability actions taken. In order to measure the companies’ response to

this, NSIs look at what companies publish on their official website and track changes over time.

6. Relevant categories of Enterprises’ activity sector (NACE). Aimed at identifying relevant

categories of Enterprises’ activity sector from enterprises’ websites to check or complete

Business registers.
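As an illustration of use case 4, a deterministic check for social media presence could look for links to known platforms in a scraped page. This is a sketch only and not the pilots' implementation: the domain list, the class name `SocialLinkFinder`, the function `social_media_presence` and the sample page are hypothetical.

```python
from html.parser import HTMLParser
from urllib.parse import urlparse

SOCIAL_DOMAINS = {"facebook.com", "twitter.com", "linkedin.com", "youtube.com"}  # assumed list


class SocialLinkFinder(HTMLParser):
    """Collect the social media domains that a page links to."""

    def __init__(self):
        super().__init__()
        self.found = set()

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href") or ""
        host = (urlparse(href).hostname or "").lower()
        if host.startswith("www."):
            host = host[4:]
        if host in SOCIAL_DOMAINS:
            self.found.add(host)


def social_media_presence(html):
    """Return the set of social media domains linked from the given HTML text."""
    finder = SocialLinkFinder()
    finder.feed(html)
    return finder.found


# Hypothetical scraped page of an enterprise website.
page = """
<html><body>
  <a href="https://www.facebook.com/example-enterprise">Facebook</a>
  <a href="/careers">Jobs</a>
  <a href="https://twitter.com/example_ent">Twitter</a>
</body></html>
"""
present = social_media_presence(page)
```

A link-based rule of this kind only shows that a website points to a platform. It says nothing about how actively, or for what purpose, the enterprise uses it.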

Use case 3 is particularly useful for WP 1 to understand if the enterprises’ websites can be used as

information channels for WP 1.

WP 2 has six participating countries, namely Bulgaria, Italy (leader), Netherlands, Poland, Sweden

and the United Kingdom. The main results are described below.

Report on Legal Aspects

The participants in WP 2 delivered Deliverable 2.1, “Legal aspects related to Web

scraping of Enterprise Web Sites”, on time.

The deliverable reviewed the EU and Member States legislation on the regulation of official statistics,

in particular in areas such as Personal data protection, Copyright protection and Database protection,

in order to understand the real possibilities for the NSIs to perform activities of web scraping - on

small or large scale - on the websites of enterprises.

The deliverable shows that all six Member States involved in WP 2 on web scraping / Enterprise

Characteristics have a Statistical Law guaranteeing the NSIs similar prerogatives with regard to data

access and data processing.

The laws on personal data protection, database protection and copyright protection have been

ratified in almost all countries and some countries in addition to these legislative measures also

mentioned the development of ethical codes (Italy, UK, Netherlands).

In general, a picture quite favourable to data collection through web scraping of enterprise websites

seems to emerge, though with different shades. It ranges from the most favourable situations

(Italy, Netherlands, Bulgaria) to situations of greater uncertainty (UK, Poland, Sweden).

As to the challenges and recommendations from NSIs' legal departments on possible alternative

approaches to achieve the agreement with enterprises on web scraping of their websites on a large

scale, different degrees of intervention appear to be necessary at country level.

A code of Netiquette for web scraping for official statistics has been proposed by NL and UK and

shared by all the partners.
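This report does not list the rules of the proposed netiquette. As an assumption, one convention that such codes commonly include is honouring a website's robots.txt file. The sketch below shows how that check could be made with Python's standard `urllib.robotparser`. The user agent string, the function name `allowed_urls` and the example URLs are hypothetical.

```python
from urllib.robotparser import RobotFileParser

USER_AGENT = "nsi-statistics-bot"  # hypothetical identifier for the scraper


def allowed_urls(robots_txt, urls, user_agent=USER_AGENT):
    """Return the URLs that the given robots.txt content permits this user agent to fetch."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return [url for url in urls if parser.can_fetch(user_agent, url)]


# Hypothetical robots.txt content and candidate pages of an enterprise website.
robots = """User-agent: *
Disallow: /private/
"""
candidates = [
    "https://example.com/about",
    "https://example.com/private/staff",
]
permitted = allowed_urls(robots, candidates)
```

The robots.txt content is passed in as text here so that the example runs without network access. In practice it would first be downloaded from the website concerned.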

Report on Methodological and Technological Issues and Solutions

The participants in WP 2 delivered Deliverable 2.2, “Methodological and IT issues

and solutions related to Web scraping of Enterprise Web Sites”, on time.

Some figures on the performed work are:

Six different use cases were identified, which are the ones listed above.

Four use cases out of the six were selected to be demonstrated by specific pilots.

Sixteen different pilots were implemented, namely: six for use case 1 (with Bulgaria

implementing two pilots with two different technologies), four for use case 2, three for use case

3 and three for use case 4.

A detailed use case definition was carried out by first sharing a use case template. Participant

countries filled the template concerning the use cases they are involved in. All the use cases specified

according to the template are available on the project wiki.

The main findings of the technical work performed by WP 2 can be summarized as follows:

The complex pipeline for processing data scraped from enterprises’ websites has been defined in

detail and shared among the participants. This pipeline can be considered a reference to

which specific technological and methodological choices can be mapped. A set of logical building blocks

has been identified for each phase of the pipeline.

From a methodological perspective, both deterministic and machine learning methods were used

in the pilots. On one side, we learned that even with different methods good results can be

achieved. On the other side, however, we saw that in some cases there can be a convergence of

methods (e.g. the URL retrieval pilot where Italy, Bulgaria and the Netherlands applied the same

methodology). Predicted values can be used for a twofold purpose: (i) at unit level, to enrich the

information contained in the register of the population of interest; (ii) at population level, to

produce estimates. The issue of measuring the quality of data at the unit level was

addressed in the piloting phase. For instance, when employing machine learning

methods the quality can be measured by considering the same indicators produced to evaluate

the model fitted in the training set. Under given conditions (if the training set is representative of

the whole population), the measure of the accuracy (and also of other indicators like sensitivity

and specificity) calculated in this subset can be considered as a good estimate of the overall

accuracy. The issue of measuring the quality of population estimates making use of predicted

values has also been in focus. However, specific solutions to that are still under investigation.
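The unit-level quality indicators mentioned above (accuracy, sensitivity and specificity) can be computed from a labelled subset as in the following sketch. The labels are hypothetical and the function name `classification_quality` is introduced only for illustration. As noted above, the resulting figures estimate overall quality only if the labelled subset is representative of the whole population.

```python
def classification_quality(actual, predicted):
    """Return accuracy, sensitivity and specificity for binary labels (1 = positive).

    Assumes the sample contains at least one positive and one negative case.
    """
    pairs = list(zip(actual, predicted))
    tp = sum(1 for a, p in pairs if a == 1 and p == 1)  # true positives
    tn = sum(1 for a, p in pairs if a == 0 and p == 0)  # true negatives
    fp = sum(1 for a, p in pairs if a == 0 and p == 1)  # false positives
    fn = sum(1 for a, p in pairs if a == 1 and p == 0)  # false negatives
    return {
        "accuracy": (tp + tn) / len(pairs),
        "sensitivity": tp / (tp + fn),  # share of actual positives predicted as positive
        "specificity": tn / (tn + fp),  # share of actual negatives predicted as negative
    }


# Hypothetical labelled sample: 1 = the enterprise website offers web sales, 0 = it does not.
actual = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
predicted = [1, 1, 1, 0, 0, 0, 0, 0, 1, 0]

quality = classification_quality(actual, predicted)
```

In this hypothetical sample, one of four e-commerce sites is missed and one of six other sites is wrongly flagged, which gives an accuracy of 0.8, a sensitivity of 0.75 and a specificity of 5/6.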

From an IT perspective, performance is a key issue especially when downloading and processing

whole websites. Processing unstructured information is very CPU- and memory-intensive,

especially with machine learning algorithms, and is therefore not very efficient. Sustainability

is also a relevant issue: because big data tools and website technology both change very frequently,

tools need to be developed in an agile-like manner. For the

storage, the possible choices are between file system (CSV, JSON etc.), NoSQL database (Solr,

Cassandra, Hbase etc.) or relational database (MySQL, PostgreSQL, SQL Server etc.). The decision

on the particular data storage solution should be taken according to the volume of the data and

the type of data to be stored. Finally, although the frameworks are developed in particular

countries, it is possible to apply them in other countries as well without any major changes. For

instance URLSearcher developed by Istat was tested on Bulgarian and Polish websites as well.

Some indicators can be computed as outputs of the developed pilots, and can be considered as

experimental statistics. These include:

Rate(s) of retrieved URLs from a list of enterprises.

Rate(s) of enterprises engaged in e-commerce from enterprises websites.

Rate(s) of enterprises that have job advertisements on their websites.

Social media presence, in terms of both (i) Rate(s) of enterprises that are present on social media

from their websites and (ii) Percentage of enterprises using Twitter for a specific purpose.

In order to implement pilots, some specific software tools were implemented and made available

through the project wiki, the Sandbox and github.

2.3 Smart Meters

The aim of the smart meters pilot study is to demonstrate the use of data from electricity meters for

production of official statistics. Smart meters are devices which can be read from a distance and that

can measure e.g. electricity, water, gas consumption, at a high frequency. This kind of data can be of

use for statistics on energy use and production, and it can be relevant also as an additional source for

calculating census housing statistics, household costs, or impact on environment. In the pilot study,

theoretical and practical issues are addressed and topics of data access and processing and linking

data, as well as calculation and visualization of statistics, are covered.

Although smart meters are currently deployed in a few countries only, they will be made available in

several countries by 2020.

The pilot is carried out by members from four national statistical offices: Statistics Austria, Statistics

Denmark, Statistics Estonia, and Statistics Sweden. Statistics Estonia is responsible for coordinating

the work in this work package.

During the pilot study the following tasks were carried out:

Task 1 Data access. Description of the current status and future perspectives regarding the

availability of smart meters in the partner countries.

Task 2 Data handling. Description of the available data (structure, records, attributes), description of

the IT environments and assessment of the quality of input data.

Task 3 Methodology and techniques used to analyse the data. This task includes linking smart

meters data with other data sources and development of the methodology to produce

electricity consumption statistics about businesses, households and dwellings. Countries that

do not get access to the real data would produce synthetic data on which different

estimation methods can be tested and results compared.

Based on the results of the three tasks mentioned, two reports were delivered. The first report

covered the results of tasks 1 and 2 and the second report covered the results of task 3.

This pilot study is not linked directly with other pilots in this project, but it shares issues

with them, such as methods, IT technologies and quality. WP 3 will provide input to the

methodology work package (WP 8) that will start in the second phase of the project and which will

summarize approaches to methodology and quality assessment when dealing with big data sources.

Dissemination of WP 3 results is part of the dissemination work package (WP 9).

The European Commission has proposed a deployment plan for smart electricity meters in the EU

Member States on the basis of economic assessments of long-term costs and benefits and expects

to achieve a deployment rate of almost 72% by 2020. The use of smart meters raises privacy concerns

as, depending on the frequency of data collection, significant personal details about the lives and

private activities of customers can be revealed. Within the EU, consumer personal data is protected

by the EU's Directive on the processing of personal data. Most smart meters currently being installed

worldwide record electricity consumption data hourly, half-hourly or at 15-minute intervals. This can

provide a strong indication of, for example, occupancy, but has much less potential to reveal individual

appliance use.

In a literature study, three use cases were found where smart meters data was used

in official statistics: determining the occupancy of dwellings, identifying household

composition and testing different big data tools.

Expected improvements from smart meters data

Partners’ expectations from the smart meters data are the following:

Improvements and cost reductions to existing statistics;

Expansion of existing statistics - in terms of quality and aggregation levels;

Improved periodicity of electricity consumption statistics (from annual to quarterly, or even

monthly);

To use electricity data together with other sources to model

electricity consumption by end use;

To replace current data collection from businesses by on-line questionnaire by electricity

smart meters data source, i.e. to produce aggregated electricity consumption data according

to the requirements of Regulation (EC) No 1099/2008 of the European Parliament and of the

Council of 22 October 2008 on energy statistics and environmental statistics needs;

To investigate the possibility to improve the dwelling statistics. The registered place of living

does not necessarily agree with the actual place of living, and in order to estimate the

number of vacant or temporary dwellings (i.e. summer houses), electricity consumption

could be an important factor;

Smart meters data is also of interest for other types of statistics, for example as input to

price indices and for improving the Household Budget Survey (household cost for electricity);

It could be possible to use smart meters data in combination with other data sources, such as

mobile phone data. Day/night population estimates and statistics on residency are two

possibilities.

Data access

A survey on access to smart meters data was sent to the NSIs of all EU member countries in the spring

of 2016, and there were 18 responses. Only two countries, Denmark and Estonia, currently have

access to data.

Several countries were aware of substantial legal barriers. It was unclear if market participants could

even share data with each other. Some countries such as Poland are in the process of drawing up

legislation that will enable smart meters data use.

In terms of data hubs, only one country mentioned that one was under construction (Norway).

Denmark and Estonia already receive data through central data hubs, and a hub is being planned in

Sweden.

Table 1. Smart meters data in EU countries

NSI                                         Plans to explore smart meters?   Legal obstacles   Data hub available
Sweden                                      Yes                              Yes               No
Norway                                      Yes                              No                No
Hungary                                     Yes                              No                No
France                                      Yes                              Yes               No
Lithuania                                   No                               No                No
Cyprus                                      No                               No                No
Bosnia and Herzegovina                      No                               No                No
Poland                                      Yes                              Yes               No
Belgium                                     No                               No                No
Germany                                     Yes                              Yes               No
Portugal                                    No                               No                No
Luxembourg                                  No                               NA                No
The former Yugoslav Republic of Macedonia   No                               NA                No
Denmark                                     Data received                    No                Yes
Estonia                                     Data received                    No                Yes
Austria                                     No                               Yes               No
Greece                                      Yes                              Yes               No
Spain                                       No                               No                No

Data handling

The data received by Statistics Estonia contains hourly recordings from 709 000 metering points and

amounts to 1.5 TB. The most important tables in the data are metering data, metering points,

agreements and customers which contain information about by whom and where electricity was

consumed:

metering_data - hourly information of the amount of produced and consumed electricity,

metering_points - information about location and the type of the metering point (possible

types are: remotely readable, single and dual tariff manually readable),

agreements - information on when electricity contract was signed/ended and what type of

contract it is,

customers - information about private and legal persons who signed the contract.

Data with a different structure but similar content was delivered to Statistics Denmark by the

data hub owners. In both cases the data was delivered on an external hard disk.

Analysis of 2014 Estonian data shows that of all metering points, 89% belong to households and 11%

to businesses, and 49% were equipped with smart meters.

For the use of smart meters data for statistics, four additional steps are necessary:

1. Geocoding or normalizing the metering point addresses, so that linking with other sources

could be carried out;

2. Transformation of the variables (e.g. changing formats, coding);

3. Anonymization of personal data in the customers table;

4. Cleaning the data.
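The four preparation steps can be sketched as follows. This is an illustration under stated assumptions and not the procedure used by Statistics Estonia or Statistics Denmark: the field names, the decimal-comma input format, the sample values and all function names are hypothetical. Salted hashing is used here as a stand-in for the anonymisation step. Strictly speaking it is pseudonymisation, because anyone holding the salt can recompute the tokens.

```python
import hashlib
import re

SECRET_SALT = b"hypothetical-secret-held-by-the-NSI"  # assumption: a secret that is never published


def normalise_address(raw):
    """Step 1: lower-case, strip punctuation and collapse whitespace so equal addresses compare equal."""
    text = re.sub(r"[^\w\s]", " ", raw.lower())
    return re.sub(r"\s+", " ", text).strip()


def parse_kwh(value):
    """Step 2: transform a reading such as '1,25' (decimal comma, an assumed format) into a float."""
    return float(value.replace(",", "."))


def pseudonymise(customer_id):
    """Step 3: replace a customer identifier with a salted SHA-256 digest."""
    return hashlib.sha256(SECRET_SALT + customer_id.encode("utf-8")).hexdigest()


def clean(readings):
    """Step 4: drop repeated readings for the same point and hour, and negative values."""
    seen, kept = set(), []
    for reading in readings:
        key = (reading["point_id"], reading["hour"])
        if key in seen or reading["kwh"] < 0:
            continue
        seen.add(key)
        kept.append(reading)
    return kept


# Hypothetical records.
address = normalise_address("  Tatari  51,  Tallinn ")
token = pseudonymise("CUST-0001")
readings = [
    {"point_id": "P1", "hour": "2014-01-01T00", "kwh": parse_kwh("0,40")},
    {"point_id": "P1", "hour": "2014-01-01T00", "kwh": parse_kwh("0,40")},  # repeated reading
    {"point_id": "P1", "hour": "2014-01-01T01", "kwh": parse_kwh("-1,00")},  # implausible value
]
cleaned = clean(readings)
```

Normalising addresses before linking matters because, as the linking results later in this section show, the quality of address information largely determines how many metering points can be matched to statistical units.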

An overview of the input data quality is given in Table 2.

Table 2. Assessment of the electricity data based on input quality indicators.

Quality indicator: Under- and overcoverage

Assessment of Estonian data: Around 50% of households and companies did not have smart meters by the end of 2014. The smart meters do not measure electricity produced for own consumption; they only measure purchased electricity. The total amount of unmeasured consumption is negligible compared to total consumption. 1526 metering points were excluded from the analysis as those metering points did not represent actual end consumption, i.e. the metering points recorded transfer or selling of electricity. Although these metering points formed only 0.2% of all metering points, the consumption recorded by them was 75.4% of total consumption recorded in the data hub.

Assessment of Danish data: The 2013 data contains readings for almost all Danish meters. It contains hourly readings for only a small subset of companies using more than 100,000 kWh a year. The undercoverage on smart meters cannot be estimated at the moment. There are no known examples of overcoverage in the core data, but there can be overcoverage in some subpopulations due to poor linking with other registers.

Quality indicator: Percent of units that fail checks

Assessment of Estonian data: It is expected that the readings are in accordance with real consumption as it is invoiced to customers. There are no identified significant errors in the data.

Assessment of Danish data: The necessary quality checks are still in development.

Quality indicator: Percent of units that are adjusted

Assessment of Estonian data: In the Estonian data there were two cases where the end date of the grid agreement differed from the supply agreement's end date, and there were two duplicate agreements in the initial dataset. All the address information of metering points was normalized, and the corresponding address ids and address object ids were identified and stored in a separate table.

Assessment of Danish data: The metering data are not adjusted.

Quality indicator: Percent imputed

Assessment of Estonian data: There are no missing values detected in the records.

Assessment of Danish data: The core meter data does not have unexplained missing values, but missing data can occur in linked data. No imputation is applied and data is handled in read-only mode.

Quality indicator: Periodicity

Assessment of Estonian data: Data are provided yearly at the moment; higher frequency may be possible in the future.

Assessment of Danish data: The possibility of monthly delivery is being examined.

Quality indicator: Delay

Assessment of Estonian data: As the network operators have three months for correcting data, the data are provided after that period.

Assessment of Danish data: Still unknown.

Quality indicator: Common units

Assessment of Estonian data: No duplicate records were discovered.

Assessment of Danish data: The data hub aggregates all the data from the different providers and handles conflicting data. Metadata on this process is not available to Statistics Denmark, but there are no indications that it concerns more than a minuscule proportion of records.

Partners not having access to real data are working with generated synthetic data. The objective of

the use of synthetic data is twofold:

1. Generate demo output with “realistic” results;

2. Test, scale and develop (new) statistics and algorithms where linkage to enterprise or

household characteristics is necessary.

Australian data was used to generate the synthetic household electricity

consumption data.

Data linking

For linking purposes it was possible to use two strategies – linking by registry code or linking by

address information. The main challenge is to identify end production and consumption of statistical

units, therefore it is crucial to identify linkage between metering points in smart meters dataset and

units in administrative registers and also exclude all metering points not related to end consumption.

Data linking was carried out by two countries – Estonia and Denmark. For both countries it was

important to link statistical units by using address id and both had problems with finding a one-to-

one match between statistical unit and observed unit – a metering point. In the Estonian case study

most of the problems were related to the quality of address information, but in the Danish case there

were also difficulties identifying the periodicity of the reading and billing.

The main problems during the linking were:

The quality of address information. It was the main problem in the Estonian case to extract a

valid address id. In the Danish dataset a valid address id was used.

Many metering points or many statistical units on the same address. It was a problem in both

cases and reduced the rate of linking. In the Estonian data, metering points share the

same address id in 5% of cases.

Identifying the actual consumer, as the information in the smart meters dataset identifies

only the contract holder. This mainly affected businesses: of the 150 thousand

businesses in the statistical business register, only 22 thousand could be matched.

Own consumption of producers. There is a growing trend that more end consumers install

their own electricity production units and their own consumption is difficult to identify. The

same also applies to big industries and their own consumption. According to the dataset description, this

may not be a problem in the Danish case.

Apartment associations – those are registered as business entities, but are apartment

buildings. Mostly they are used for living but also many small companies are active on the

same address.

The quality of registers. The data in the Estonian register of buildings is not up to date with

regard to several building characteristics and thus was not used in this project.
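The difficulty of finding a one-to-one match on address id can be made concrete with a small sketch that groups metering points and statistical units by address and labels each address by the kind of match it yields. The identifiers, the sample data and the function name `link_by_address` are hypothetical. The actual linking in Estonia and Denmark involved further rules that are not reproduced here.

```python
from collections import defaultdict


def link_by_address(metering_points, statistical_units):
    """Group both sources by address id and label each address with its match type."""
    points_at, units_at = defaultdict(list), defaultdict(list)
    for point_id, address_id in metering_points:
        points_at[address_id].append(point_id)
    for unit_id, address_id in statistical_units:
        units_at[address_id].append(unit_id)

    links = {}
    for address_id in sorted(set(points_at) | set(units_at)):
        points = points_at.get(address_id, [])
        units = units_at.get(address_id, [])
        if not points or not units:
            kind = "unmatched"  # address appears in only one of the two sources
        elif len(points) == 1 and len(units) == 1:
            kind = "one-to-one"
        else:
            kind = "ambiguous"  # several metering points or several units at one address
        links[address_id] = (kind, points, units)
    return links


# Hypothetical data: (metering point id, address id) and (household id, address id).
metering_points = [("MP1", "A1"), ("MP2", "A2"), ("MP3", "A2"), ("MP4", "A3")]
households = [("HH1", "A1"), ("HH2", "A2"), ("HH3", "A4")]

links = link_by_address(metering_points, households)

# Share of metering points that ended up in a one-to-one link.
one_to_one_points = sum(len(points) for kind, points, _ in links.values() if kind == "one-to-one")
linking_rate = one_to_one_points / len(metering_points)
```

Labelling ambiguous and unmatched addresses separately shows why the linking rate falls: shared addresses need an additional rule, whereas unmatched addresses point to coverage or address quality problems.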

Producing statistics

There were three goals with regard to expected outputs. First, to assess whether current survey

based business statistics can be replaced by statistics produced from the electricity meter data,

second, to produce new household statistics and third, to identify vacant or seasonally vacant

dwellings.

Expected outputs for business statistics were final energy consumption statistics of businesses by

economic activity, by region and monthly, quarterly and annual aggregation. The goal was to find a

link between statistical units (businesses) and observed units (metering points) and identify the end

consumption of businesses. For linking two strategies were used – first, link business customers of

the Data Hub to the business register by using registry key and identify energy consumption by the

area of the economic activity; second, use address id to link the address of the business entity with

the address of the metering point and get an estimation of consumption. In the latter case two

strategies were possible – to use all metering points or only those whose contract was held by

businesses. Before linking we excluded many metering points related to open suppliers and other

network companies due to the fact that the consumption was not the end consumption. It was not

possible to exclude all the open suppliers from the further analysis as open suppliers were active in

many fields and their own consumption was significant.
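Once metering points have been linked to businesses, the aggregation by economic activity and month described above could be sketched as follows. The readings, the activity codes, the set of excluded points and the function name `consumption_by_activity_and_month` are hypothetical. Which points count as not being end consumption is assumed to have been decided beforehand.

```python
from collections import defaultdict


def consumption_by_activity_and_month(readings, point_to_activity, excluded_points):
    """Sum kWh by (economic activity, month), skipping points that are not end consumption."""
    totals = defaultdict(float)
    for point_id, timestamp, kwh in readings:
        if point_id in excluded_points or point_id not in point_to_activity:
            continue  # excluded as non end consumption, or not linked to any business
        month = timestamp[:7]  # "YYYY-MM" taken from an ISO 8601 timestamp string
        totals[(point_to_activity[point_id], month)] += kwh
    return dict(totals)


# Hypothetical hourly readings: (metering point id, timestamp, kWh).
readings = [
    ("MP1", "2014-01-01T00:00", 10.0),
    ("MP1", "2014-01-01T01:00", 12.0),
    ("MP2", "2014-02-01T00:00", 5.0),
    ("MP9", "2014-01-01T00:00", 9000.0),  # a network company point recording transfer
]
point_to_activity = {"MP1": "C", "MP2": "G", "MP9": "D"}  # assumed activity section codes
excluded = {"MP9"}

totals = consumption_by_activity_and_month(readings, point_to_activity, excluded)
```

Quarterly and annual figures follow by summing the monthly totals, so the same linked data can serve all three aggregation levels mentioned above.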

Our conclusion after the analysis was that the electricity smart meters data has potential as a source

for producing business statistics, but the methodology of the linking has to be improved. Current

estimates in certain areas differ significantly from the survey data, as it is difficult to estimate the real

consumer. On the other hand, the dynamics of electricity consumption could indicate the change in

economic activity by a sector and could be used as an early warning indicator.

Regarding the household statistics, the main goal was to link households with the electricity smart

meters data and then identify how the energy consumption is related to household size, number of

rooms in a living place and other indicators available in the registries. This work was carried out on

Estonian and Danish data.

Figure 1. Yearly mean consumption by household size and type of living place - house (H) or apartment (K). Estonian data.
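The kind of statistic shown in Figure 1 is a grouped mean over linked households. A minimal sketch is given below. The consumption values are invented for illustration and do not come from the Estonian data, and the function name `mean_consumption` is hypothetical.

```python
from collections import defaultdict
from statistics import mean


def mean_consumption(households):
    """Return mean yearly kWh for each (household size, dwelling type) group."""
    groups = defaultdict(list)
    for size, dwelling_type, annual_kwh in households:
        groups[(size, dwelling_type)].append(annual_kwh)
    return {group: mean(values) for group, values in groups.items()}


# Hypothetical linked households: (household size, H = house or K = apartment, yearly kWh).
households = [
    (1, "K", 1500.0),
    (1, "K", 1700.0),
    (2, "H", 4200.0),
    (2, "H", 3800.0),
    (2, "K", 2500.0),
]

means = mean_consumption(households)
```

Other register variables, such as the number of rooms, can be used as grouping keys in the same way.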

Although the linking quality was not very high, owing to the partial coverage of smart meters

and quality of address data, the results indicated there is potential to use smart meters data for

producing statistics of households. The main potential of using smart meters data is the possibility to

link with different registers and thereby reveal new information otherwise unavailable to the society.

Information about the average consumption by the household size, type of dwelling or any other

characteristics of the dwelling could be made available to the public as experimental statistics. This

gives an opportunity to the public to give feedback on the usefulness of this information as well as to

request some other information not yet available about the households' consumption.

The third output foreseen in this pilot was identification of the vacant living spaces. Information

about the vacant living spaces is a relevant housing statistic that can be used in the population and

housing census, tourism statistics or by different industries e.g. real estate sector. It is of interest to

know how many dwellings are empty for a long time and how many are occupied seasonally. In

addition to producing statistics on vacant dwellings, the information can be used on dwelling level to

validate the information in the registers. Statistics Estonia is planning to carry out the next

population and housing census by using only register information. The pilot census showed that one

of the largest quality problems is the accuracy of the address of the main residence. Namely, people

are quite often not living at the address given in the register. To overcome the problem, alternative

data sources are being sought. One possibility is to use mobile data, but electricity smart meters data

could also be beneficial. The work carried out regarding this output included a list of algorithms to

identify the vacant dwellings, computing the indicator for each dwelling, referring to occupancy

either on a certain day or during some other specified time period. This information was used to

validate the household’s main residence information in the register.

Estimating occupancy or vacancy of a living space is not a simple task. Several methods were

suggested and tested on synthetic data. The most promising are the outlier detection method and

random forest method. However, on the real data simpler methods were applied due to the lack of

data for the training set. The results are quite promising for those households that were included in

the analysis. For those, we obtained results that show about 18% of households do not live at the

address the population register has for them. This coincides quite nicely with the estimates obtained

from survey data which show this number to be about 20%.
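
A simple rule-based vacancy indicator of the kind that needs no training set could look like the sketch below. This is an illustration and not the method applied in the pilot; the function name and both thresholds (`vacant_below`, `min_share`) are hypothetical and would have to be calibrated on real data.

```python
def vacancy_indicator(daily_kwh, vacant_below=0.5, min_share=0.9):
    """Return True if a dwelling looks vacant over the given period.

    daily_kwh    -- daily consumption values (kWh) for one dwelling
    vacant_below -- assumed daily level under which a day counts as unoccupied
    min_share    -- assumed share of unoccupied days needed to flag the period
    """
    if not daily_kwh:
        raise ValueError("no readings for this dwelling")
    low_days = sum(1 for kwh in daily_kwh if kwh < vacant_below)
    return low_days / len(daily_kwh) >= min_share


# Invented example series: one occupied dwelling, one with stand-by usage only.
occupied = [6.2, 5.8, 7.1, 0.3, 6.5, 5.9, 6.0, 6.4, 5.5, 6.1]
standby_only = [0.2, 0.3, 0.2, 0.2, 0.3, 0.2, 0.2, 0.3, 0.2, 0.2]

occupied_flag = vacancy_indicator(occupied)
standby_flag = vacancy_indicator(standby_only)
```

Comparing such a flag with the registered main residence of a household gives a dwelling-level consistency check of the kind described above.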

Statistics Estonia plans to continue working with the electricity data and find more ways to use a

vacancy indicator. In an ideal case data could be used to not only validate the results but to improve

the main residence address information in the register based census.

One important outcome of this project is a quality assessment framework, which was used in this

report to evaluate all the outputs and can also be used in the future projects.

The main advantages of using smart meters data are:

Possibility to link it with other data sources and gain new knowledge;

Data source can be used to validate or improve current survey based statistics;

Data source could improve the speed of producing statistics and also increase the quality of

regional statistics.

The main problems related to the data source are:

The description of data source is not detailed enough and metadata about the variables was

often missing;

Address information is not standardized and geo-coding needs lots of resources. Address is

the key variable in the linking and the quality of address information is crucial for the

successful use of the data source;

Without additional information it is difficult to identify which metering point records the

actual end consumption and which one records transfers of the electricity;

The observed unit (metering point in a building) does not match the statistical unit (business,

household, dwelling). This leads to the loss of accuracy of the estimates as consumption is

assigned to the owner of the building not to the actual consumer;

Manual metering readings can cause a lot of work as is the case in Denmark. The dataset

contains dates when the manual readings are reported but no information about the period

the reported consumption refers to. However, this problem will disappear with

complete installation of the smart meters;

One metering point corresponds to many consumers (apartment unions, real estate

companies renting rooms) and one needs to develop methodology to extract the

consumption of the single consumer.

Outlook

To reveal the full potential of the data there is still work to be done. Some of this work includes

developing methods that improve the quality of the linking, using classification methods to identify

customer type (business or household) and combining different data sources to reveal new

information.

Ways to improve the quality and extend the use of the data are:

Use machine learning algorithms to clean the address data.

Produce statistics by using sub-sets of businesses that could be linked.

After improving the address quality conduct regional analyses of the energy consumption.

Use unsupervised classification algorithms to identify different types of electricity

consumption patterns. This can be used to identify whether electricity is used for heating, for

example.

Develop a model to identify producers of electricity for own usage and model the amount of

electricity used.

Link the electricity data with economic activity data to see the correlation between economic

activity and energy consumption.

Link electricity data with weather data and building register to identify the impact of weather

on the electricity consumption.

Foreseen activities during SGA-2

In the second part of the smart meters pilot study, the aim is to suggest potential uses other than the

ones analysed in SGA-1, especially when linked to other data sources (different registers, weather

information etc.). The expected output of this study is a list of potential outputs, suggestions on how to

visualize the results and an evaluation of the feasibility of producing the proposed outputs in the official

statistical system.

The second aim is to give recommendations and summarise lessons learned to other countries, so

they can apply them when starting to use smart meters data. When countries know what awaits

them, they can save time and resources when dealing with the data.

2.4 AIS Data

The aim of this work package is to investigate whether real-time measurement data of ship positions

(measured by the so-called AIS-system) can be used 1) to improve the quality and internal

comparability of existing statistics and 2) for new statistical products relevant for the ESS.

The added value of running a pilot with AIS-data at European level is that the source data are generic

worldwide and data can be obtained at European level.

Methodological, quality and technical results of the work package, including intermediate findings,

will be used as inputs for WP 8 of SGA-2. When carrying out the tasks listed below, care is taken that

these results will be stored for later use, by using the facilities described at WP 9.

Task 1 – Data access

This task involves exploration of the possibilities to collect the data at a European level. AIS-data are

available for national territories and the entire European territory. The aim of this task is to decide how

European data could be used for this project, and to investigate the possibilities of acquiring data

from EMSA (European Maritime Safety Agency), to be coordinated with Eurostat. The advantage of

using one AIS-dataset for the entire European territory is a) a better comparison of international

traffic between the countries and b) more synergy as all participating countries work on the same

dataset. A disadvantage is that these data are stored by private companies and handling fees have to

be paid.

Task 2 – Data handling

The aim of this task is to process and store the data in such a way that they can be used for consistent

multiple outputs. Key elements of this task are: 1. which programming language and environment

should be used for transformation? 2. where will the data be processed? and 3. how can we create

an environment which is easily accessible for all partners?

Task 3 – Methodology and Techniques

Develop traffic statistics: Linking with data from maritime statistics

AIS-data may be linked to data from maritime statistics. Added value of linking AIS-data to data from

maritime statistics is that the same reference population (= ship number) is used in all ports. As the

journeys and port visits of ships can be derived from AIS this linking provides the ESS information

about the origin/destination of the cargo, too. Aims of this task are: build a reference frame of ships

in European water, find out how data from maritime statistics can be linked to AIS-data and check

whether information improves the quality of current statistical outputs.

Traffic analyses

The number of ships during a certain time interval at certain coordinates (like inland waterways or at

certain points at sea) can be calculated by AIS-data. This information could be interesting for traffic

analyses and economic analyses. Aims of this task are: calculate the number of ships at certain

coordinates and visualise the results to analyse variations in time.

Concrete results of the tasks mentioned above are reports on:

1 creating a database with AIS-data for official statistics: possibilities and pitfalls

2 deriving harbour visits and linking data from maritime statistics with AIS-data

3 sea traffic analyses using AIS-data

Obtaining AIS data on European level

There were three levels to choose from for obtaining AIS data:

National level for each country collaborating in WP 4;

European level: all waters within the perimeter of the European countries;

World level: a data set covering the whole world.

National data is not very useful for European analysis. The only advantages are their pricing (most of

the participating countries already have national data) and the size (the data is not that big).

European data is much more suited for European as well as national statistics. Since European data

contains information about the routes vessels take within Europe, the data gives information of

routes, but also about harbours of loading and unloading within Europe. Hence, this data could also

have potential extra value for the national level. Concerning the world wide data, one would be able

to locate the harbours of loading and unloading around the world. However, this data is gathered by

satellites and, as a consequence, it is very expensive. Based on the remarks about the three levels of

AIS data it was decided to obtain European data. Several sources were identified to purchase the

data from, i.e.: EMSA, Kystverket, Hellenic Coastguard, Dirkzwager, Marine Traffic (.com) and the

Joint Research Centre (JRC).

The result of this investigation on possible sources for European AIS data is that we know for sure

that we cannot obtain European data from Kystverket, Hellenic Coastguard and JRC, because of legal

issues. At the beginning of this work package we decided not to use the European AIS data from

Marine Traffic, because this is a very expensive alternative that does not fit within our budget (see footnote 1). We

are still investigating the possibility of using AIS data from EMSA (together with Eurostat), because

they would provide the EMSA data for free. Dirkzwager could provide us with 6 months of European AIS

data at short notice and at a very good price. For that reason we decided to use this

Dirkzwager data within our work package. This dataset contains 6 months of AIS data (8 October

2015 - 12 April 2016) and contains AIS data from land based stations only. Satellite data is not

included. If the EMSA data becomes available during SGA-2 we will also investigate this data in WP 4.

Decisions concerning tools and environment

Figure 1 summarises our decisions on which programming language and environment to

use for pre-processing and analysing the European AIS data. For legal reasons we chose to keep the

data in the Netherlands and decode the data in Python. After the data was decoded, all files were

zipped to save space. The sets of files with positions and files with voyage related data were

uploaded to the UNECE Sandbox by a secure copy (scp). The locations data comprised 144 GB of

compressed data (about 200 GB uncompressed) and the messages comprised about 5 GB of

compressed data (about 7 GB uncompressed). It took approximately 7 hours to copy the data. After

uploading, the data was copied to the Hadoop File System (HDFS) and made available for use. For legal

reasons we decided to create an AIS group on the Sandbox, so that only the usernames of the members

of WP 4 can access the Dirkzwager data.

1 However, in July Statistics Netherlands and Marine Traffic signed a Memorandum of Understanding aimed at

sharing data and knowledge between both organisations.

From the HDFS, the data can be accessed using the tools that are available in the Hadoop stack,

which are: Pig, Hive, RHadoop and Spark. We chose to use Spark, because Spark makes it possible to

perform much more complex processing on data which is stored in the HDFS. Spark is compatible

with the programming languages Scala, Python, R and Java. However, it is advised to use it with

Python or Scala. Furthermore, within Spark one will be able to write SQL queries using SparkSQL.

Figure 1. Pre-processing, processing and storing the AIS data

For data analysis it was decided that resulting aggregates will be downloaded from the UNECE Sandbox

using HUE. Researchers of the different NSIs will be able to analyse the data using their tools of

choice, i.e. SPSS, SAS, R, or even import the data into a local database.

Results SGA-1 on the quality of AIS data

Dirkzwager has receivers all over the coastline and main ports of the Netherlands and a few

outside the Netherlands: Cherbourg, Gibraltar, Zeebrugge, Antwerp and Hamburg. AIS receivers on

land can only pick up signals within a range of about 40 nautical miles. Therefore, land receivers have a

very limited coverage of signals transmitted from sea, which results in loss of information on ships on

the open sea. The data we received comprised all data except the satellite data, and therefore also

included non-European data from partners (see footnote 2).

As described in deliverable 4.2, the coverage of ships in the Dirkzwager AIS data is good, but there is

also data missing and quite a lot of noise; for example, some vessels seemed to be located in the

Sahara (see https://maartenpouwels.carto.com/viz/8d319f16-8195-11e6-af04-0ecd1babdde5/public_map).

Another visualisation shows that following a ship over a couple of days gives

a reasonable view of its journey, although data are missing here too (see:

https://maartenpouwels.carto.com/viz/8d2f3bde-8197-11e6-bf3f-0ee66e2c9693/public_map).

We know there are three types of errors in AIS data:

Technical errors - related to dynamic data such as position of ship, speed, course and rotation, which

come from the AIS device (sensors, cables and antenna).

Human errors – related to static data (MMSI, see footnote 3; IMO number, ship’s name, call sign, type, length) or voyage

data (draught, destination), which are entered manually in the AIS device and are therefore a common

cause of errors. These values should be entered during installation of the AIS instrument (static) or if

voyage information changes. It is worth noting that voyage data must be manually updated after

each port visit.

Systematic errors - due to faulty or missing input by the ship crews. Apart from these systematic

errors, all of the parameters can be erroneous due to technical issues (e.g. meteorological factors,

distance to receiver). These errors can take any form. This can for example result in a wrong IMO or

MMSI.

AIS quality thus depends on correct installation of the AIS device, frequent manual updates of

information, and technical devices. Most of the issues we deal with by detecting and removing

erroneous data. As the amount of data is huge, there are many errors. However the amount of

remaining data is still ample for further analyses.
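
Basic plausibility checks for detecting erroneous records can be sketched as follows. This is an illustrative minimum and not the filtering used in WP 4. The IMO check-digit rule and the coordinate ranges are standard, but the record layout (`mmsi`, `lat`, `lon`) and the decision to drop rather than repair records are assumptions.

```python
def imo_is_valid(imo):
    """IMO number check: 7 digits, where the last digit equals the last digit
    of the sum of the first six digits weighted by 7, 6, 5, 4, 3, 2."""
    digits = str(imo)
    if len(digits) != 7 or not digits.isdigit():
        return False
    checksum = sum(int(d) * w for d, w in zip(digits[:6], range(7, 1, -1)))
    return checksum % 10 == int(digits[6])


def mmsi_is_plausible(mmsi):
    """Format check only: an MMSI consists of nine digits."""
    digits = str(mmsi)
    return len(digits) == 9 and digits.isdigit()


def position_is_valid(lat, lon):
    """Reject coordinates outside the globe. The AIS "not available"
    defaults (latitude 91, longitude 181) are rejected as well."""
    return -90.0 <= lat <= 90.0 and -180.0 <= lon <= 180.0


def clean(messages):
    """Keep only messages (hypothetical dicts) that pass the basic checks."""
    return [
        m for m in messages
        if mmsi_is_plausible(m["mmsi"]) and position_is_valid(m["lat"], m["lon"])
    ]


# Invented sample: one valid record, one truncated MMSI, one missing position.
sample = [
    {"mmsi": "244660000", "lat": 51.9, "lon": 4.1},
    {"mmsi": "24466", "lat": 51.9, "lon": 4.1},
    {"mmsi": "244660000", "lat": 91.0, "lon": 181.0},
]
cleaned = clean(sample)
```

Checks of this kind only catch values that are formally invalid. A scrambled value that happens to be formally valid passes them.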

These results made us decide to further investigate different AIS data sources. This was done by

subjecting the Dirkzwager data to a quality and metadata framework and then comparing Dirkzwager

to other data sources. We were interested to see how quality of Dirkzwager data matched national

AIS data. Therefore, national AIS data from Denmark, Greece and Poland was compared to AIS data

from Dirkzwager.

When investigating the quality of AIS data it is important to keep in mind that:

AIS is a radio signal, parts of the messages can get lost or scrambled due to factors such as

meteorology or magnetics.

2 Dirkzwager has 6 partners, amongst which AIShub (seems to filter the data more and covers all of Europe),

Marinetraffic (covers mostly the Mediterranean: Greece and Italy), one English partner (covers the English coast) and Portvision (covers the USA).

3 Maritime Mobile Service Identity

Messages are transmitted encoded. As a result, an error in one transmitted ‘byte’ can result

in an error in one or multiple fields in the decoded message. Most of the time, these errors

are detectable as the result yields an invalid variable, but sometimes they result in valid

variables. For instance, coincidentally the resulting MMSI can be a technically valid, but

incorrect MMSI, resulting from an erroneous detection. These errors can arise for every

variable, so this can for example result in erroneous latitude and longitude, yielding faulty

locations that are quite far away from the actual location of the ship. In turn, this can result

in a very long calculated journey distance for the ship.

Receivers have timeslots in which data is received. In busy areas with many ships, not all data

from all ships may fit into this time slot. This may result in the loss of data on some ships in

that time slot.

Ships can turn off their AIS transponder, resulting in the disappearance of a ship.

AIS was intended originally for safety at sea, to warn nearby ships. As it was not meant for

producing statistics, the variables entered manually by the shippers are not always reliable.
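
Faulty locations of the kind described above can be detected from the speed they imply. The sketch below drops any position that could only be reached from the last accepted position at an implausible speed. It is an illustration under stated assumptions, not the algorithm of deliverable 4.2: the 50-knot limit, the tuple layout and the function names are hypothetical.

```python
from math import asin, cos, radians, sin, sqrt

EARTH_RADIUS_NM = 3440.065  # mean Earth radius expressed in nautical miles


def haversine_nm(lat1, lon1, lat2, lon2):
    """Great-circle distance between two points in nautical miles."""
    phi1, phi2 = radians(lat1), radians(lat2)
    dphi = phi2 - phi1
    dlmb = radians(lon2 - lon1)
    a = sin(dphi / 2) ** 2 + cos(phi1) * cos(phi2) * sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_NM * asin(sqrt(a))


def drop_position_jumps(track, max_knots=50.0):
    """track: time-ordered list of (unix_seconds, lat, lon) for one ship.

    A message is kept only if the speed implied by the distance to the
    last kept message does not exceed max_knots (an assumed limit)."""
    kept = []
    for t, lat, lon in track:
        if kept:
            t0, lat0, lon0 = kept[-1]
            hours = (t - t0) / 3600.0
            if hours <= 0:
                continue  # duplicate or out-of-order timestamp
            if haversine_nm(lat0, lon0, lat, lon) / hours > max_knots:
                continue  # implausible jump, most likely a scrambled position
        kept.append((t, lat, lon))
    return kept


# Invented track near Rotterdam with one position "in the Sahara".
track = [(0, 51.95, 4.05), (60, 23.0, 10.0), (120, 51.955, 4.06)]
filtered = drop_position_jumps(track)
```

A limitation of this simple rule is that it trusts the first message of a track. If that first position is itself erroneous, the correct positions that follow are rejected, so a production version would need a more robust starting point.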

To provide an overview of different aspects of the quality of the Dirkzwager AIS data, we filled out a

preliminary framework for national statistical offices to conceptualise the quality of big data in

deliverable 4.3. Almost all factors of the quality framework are judged as mostly positive. Only

“spatial coverage” and “transparency and soundness of methods and processes for the metadata and

the data” are insufficient. Not all European coastal areas are covered, and Dirkzwager provides partly

pre-processed data, but documentation on how this is done is not available to us. Privacy is also an issue

that needs to be researched further.

Dirkzwager adds timestamps and performs validation checks on aspects such as position of ships and

ordering overlapping data sources. National AIS data of Denmark, Poland and Greece are completely

unfiltered and untreated.

In deliverable 4.3 AIS data from Dirkzwager is compared to national data for Denmark, Greece and

Poland. In almost all cases, national data contained (much) more data than Dirkzwager data.

Dirkzwager misses data on complete areas in coastal Europe. Thus, port visits and journeys cannot

be analyzed for all European ports and ship routes. The number of messages in areas covered by

Dirkzwager is usually lower in the Dirkzwager data compared to the national data. It is clear that

some of the Dirkzwager data is filtered depending on the data sources, but the exact nature of this

filtering is not clear, as the reduction of messages per ship differs. In general, we are not satisfied

with this filtering (or information on this filtering), and coverage of the Dirkzwager data. Coverage

differs per country, but if we want to analyze the whole of Europe it does not suffice. If Dirkzwager

data does cover a port, the data is sufficient to determine the port visits. However, it is not sufficient

to determine ships’ journeys, especially in areas with a complex geography. Our algorithm

(described in deliverable 4.2) can deal with this, in terms of calculating the right number of journeys,

but it will result in an underestimation of the calculated distances. The lower frequency of messages

can also impact calculated traffic estimates and underestimate emissions.
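
The underestimation of distances follows directly from summing straight segments between received positions: every message that is missing around a bend shortens the computed track. A small sketch with an invented track makes this concrete. The coordinates are hypothetical and the function names are not taken from deliverable 4.2.

```python
from math import asin, cos, radians, sin, sqrt

EARTH_RADIUS_NM = 3440.065  # mean Earth radius expressed in nautical miles


def haversine_nm(lat1, lon1, lat2, lon2):
    """Great-circle distance between two points in nautical miles."""
    phi1, phi2 = radians(lat1), radians(lat2)
    dphi = phi2 - phi1
    dlmb = radians(lon2 - lon1)
    a = sin(dphi / 2) ** 2 + cos(phi1) * cos(phi2) * sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_NM * asin(sqrt(a))


def track_length_nm(points):
    """Sum of great-circle distances between consecutive (lat, lon) points."""
    return sum(haversine_nm(*a, *b) for a, b in zip(points, points[1:]))


# Hypothetical journey around a headland: first north, then east.
dense_track = [(58.0, 5.0), (58.5, 5.0), (59.0, 5.0), (59.0, 6.0), (59.0, 7.0)]
# Same journey when only the first and the last message were received.
sparse_track = [dense_track[0], dense_track[-1]]

dense_nm = track_length_nm(dense_track)
sparse_nm = track_length_nm(sparse_track)
```

The number of journeys is the same in both cases (one), but `sparse_nm` is smaller than `dense_nm` because the shortcut across the bend replaces the route actually sailed.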

During SGA-1 we developed robust algorithms to handle the noise in the data. We developed an

output-driven method to define a journey. Using the departure of ships gives us the start of a

journey. The end of the journey can be determined in three ways:

The ship enters another port

The ship anchors

The ship leaves the area of AIS coverage
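
The journey definition above can be sketched as a single pass over the time-ordered messages of one ship. This is an illustration only: the message fields (`t`, `port`, `anchored`) and the six-hour gap used to decide that a ship has left the area of AIS coverage are assumptions, not the parameters of our algorithm.

```python
def split_journeys(messages, max_gap_s=6 * 3600):
    """Split the time-ordered messages of one ship into journeys.

    Each message is a dict with hypothetical fields:
      t        -- timestamp in seconds
      port     -- name of the port area the ship is in, or None
      anchored -- True if the ship reports being at anchor
    A journey starts when the ship leaves a port and ends when it enters
    a port, anchors, or is not heard for more than max_gap_s seconds."""
    journeys = []
    current = None  # the open journey, if any
    prev = None
    for msg in messages:
        if current is not None:
            if msg["t"] - prev["t"] > max_gap_s:
                current.update(end=prev["t"], reason="left_coverage")
            elif msg["port"] is not None:
                current.update(end=msg["t"], reason="entered_port", to_port=msg["port"])
            elif msg["anchored"]:
                current.update(end=msg["t"], reason="anchored")
            if "reason" in current:
                journeys.append(current)
                current = None
        # Departure: the previous message was inside a port, this one is not.
        if current is None and prev is not None \
                and prev["port"] is not None and msg["port"] is None:
            current = {"from_port": prev["port"], "start": msg["t"]}
        prev = msg
    if current is not None:  # track ends while a journey is still open
        current.update(end=prev["t"], reason="left_coverage")
        journeys.append(current)
    return journeys


# Invented messages: Rotterdam to Antwerp, then a departure followed by a long gap.
msgs = [
    {"t": 0, "port": "Rotterdam", "anchored": False},
    {"t": 600, "port": None, "anchored": False},
    {"t": 1200, "port": None, "anchored": False},
    {"t": 1800, "port": "Antwerp", "anchored": False},
    {"t": 2400, "port": None, "anchored": False},
    {"t": 3000, "port": None, "anchored": False},
    {"t": 28200, "port": None, "anchored": False},
]
journeys = split_journeys(msgs)
```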

Processing could be optimized by filtering out AIS data in which the speed and heading of the ship

have not changed since the last message. This optimization might be performed in the future, but is

not in scope of the current project.
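
Such a filter would be straightforward. A minimal sketch, with hypothetical field names `speed` and `heading`, is:

```python
def drop_unchanged(messages):
    """Keep a message only if its speed or heading differs from the
    previously kept message of the same ship (time-ordered input)."""
    kept = []
    for msg in messages:
        if kept and msg["speed"] == kept[-1]["speed"] \
                and msg["heading"] == kept[-1]["heading"]:
            continue
        kept.append(msg)
    return kept


# Invented messages of one ship.
stream = [
    {"t": 0, "speed": 10.0, "heading": 90},
    {"t": 10, "speed": 10.0, "heading": 90},
    {"t": 20, "speed": 12.0, "heading": 90},
    {"t": 30, "speed": 12.0, "heading": 95},
]
thinned = drop_unchanged(stream)
```

Because the ship still moves while speed and heading are constant, the last message of each unchanged run may need to be retained when positions are used to compute distances. The sketch ignores this.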

Results SGA-1 on linking European AIS data to data from maritime statistics and results on

possibilities to improve the quality of current statistical outputs

First results show that AIS data can be used as a backbone for maritime statistics. We have

developed a method to build a reference frame of maritime ships. From this, we can compose the

number of port visits, which seems to be more accurate than the maritime statistics. However, AIS

data do not contain the level of detail needed for the type and gross tonnage of the ships to be able

to generate port visit statistics. One method to accomplish this would be to combine AIS data with

Lloyd’s register of ships. We are also looking into other methods of deriving this from AIS data (e.g.,

deriving this from the type of terminal the ship is visiting).

European AIS data can improve current statistics. By using European AIS data it is possible to

determine ship routes in European waters. We performed four Proofs of Concept (PoCs). The

outcomes are promising (detailed results are described in deliverable 4.3). The first PoC, on

developing an algorithm to calculate the intra-port journey by using AIS data, succeeded. Intra-port

travel distances can become a new statistical product. In the future, it would be interesting to

develop an algorithm that can detect intra-port movements, i.e. where a ship that moves from one

terminal to the other within the same port can be automatically detected. Another interesting aspect

is the detection of anomalies in the movements of ships signalling problems in the ports.

The second PoC, on using AIS data to define ports, has succeeded: it is possible to build a data driven

algorithm for defining ports. In the near future, Statistics Netherlands and Marine Traffic will

collaborate on the possibilities of building a reference frame of ports. Then, it would also be

interesting to zoom in on defining the types of terminals.

From the third PoC we conclude that next destination as reported by captains is not a usable variable

compared to the observed next destination. We also conclude from this PoC that distances

in time and space can be measured. More work is needed to handle areas where coverage is not perfect.

It is also interesting to compare the port-to-port distances from Eurostat with the

port-to-port distances based on AIS data. This may result in using actual AIS journey data instead of the

average distance matrix in the future.

Finally, the last PoC shows that AIS data is useful to investigate fluvio-maritime transport. From the

perspective of traffic intensity, emissions and transit trade, it is interesting to further investigate

these fluvio-maritime journeys. This could also be used to gain insight into the relationship between

maritime and inland waterway transport.

Results SGA-1 on sea traffic analyses by using AIS-data

We explored the possibility of calculating the number of ships during a certain time interval at certain

coordinates by using AIS-data. This information could be interesting for traffic and economic

analyses. From our investigation we can conclude AIS data is also useful to analyse sea traffic and to

analyse variations in time (see deliverable 4.3 for a more detailed description).

The figure below gives an example of the visualization we made. In the visualization, one can choose

the date from a list of available dates at the bottom, where a slider is also available for selecting a

saturation threshold for the visualization. This means that all cells being more occupied than the

threshold are displayed as dark red and all less visited locations are less red. Playing with the slider

gives insight in more and less occupied regions in Europe. For the regions that are not displayed in

dark or less red, there is no data available in the Dirkzwager dataset on that specific day. Also very

low intensities are made invisible.

Figure 2. Result of the traffic analyses: the number of ships in each cell of the grid during one

day, based on the Lambert azimuthal equal-area projection, for a threshold of 50
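
The computation behind such a map can be sketched as follows: project each position to an equal-area plane, assign it to a grid cell, count the distinct ships per cell, and cap the count at the saturation threshold. This is a simplified illustration, not the WP 4 implementation. It uses the spherical form of the Lambert azimuthal equal-area projection, and the projection centre (52°N, 10°E), the 10 km cell size and all names are assumptions.

```python
from math import cos, floor, radians, sin, sqrt

EARTH_RADIUS_M = 6371000.0                 # spherical approximation
LAT0, LON0 = radians(52.0), radians(10.0)  # assumed projection centre


def laea(lat_deg, lon_deg):
    """Spherical Lambert azimuthal equal-area projection, in metres
    relative to the projection centre."""
    phi, lmb = radians(lat_deg), radians(lon_deg)
    k = sqrt(2.0 / (1.0 + sin(LAT0) * sin(phi)
                    + cos(LAT0) * cos(phi) * cos(lmb - LON0)))
    x = EARTH_RADIUS_M * k * cos(phi) * sin(lmb - LON0)
    y = EARTH_RADIUS_M * k * (cos(LAT0) * sin(phi)
                              - sin(LAT0) * cos(phi) * cos(lmb - LON0))
    return x, y


def ships_per_cell(messages, cell_m=10000.0):
    """messages: iterable of (mmsi, lat, lon) for one day.
    Returns the number of distinct ships seen in each grid cell."""
    ships = {}
    for mmsi, lat, lon in messages:
        x, y = laea(lat, lon)
        cell = (floor(x / cell_m), floor(y / cell_m))
        ships.setdefault(cell, set()).add(mmsi)
    return {cell: len(s) for cell, s in ships.items()}


def saturation(count, threshold=50):
    """Colour intensity in [0, 1]; cells at or above the threshold are fully saturated."""
    return min(count, threshold) / threshold


# Invented positions for one day.
day = [
    ("A", 52.0001, 10.0001),
    ("A", 52.0002, 10.0002),
    ("B", 52.0001, 10.0001),
    ("C", 60.0, 20.0),
]
counts = ships_per_cell(day)
```

Counting distinct ships rather than messages keeps a cell from appearing busy merely because one ship transmits frequently.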

Outlook

Although we are not satisfied with the quality of the Dirkzwager data yet, AIS data itself can help

improve current statistics. With new data sources like Marine Traffic, EMSA and Luxspace

(satellite data) available in the future, the possibilities of AIS data seem even more promising.

In SGA-2 we will focus on describing possibilities of using AIS as a source for making new statistical

products (e.g. intra-port distances, sea traffic and variations in time). We also wanted to involve

other statisticians working on maritime statistics in thinking about the use of AIS for improving

maritime statistics and for new statistical products. To this end, we sent out a questionnaire on the

use of AIS to maritime statisticians and all member countries of the ESSnet. We will describe the

results of this questionnaire, including ideas for making new statistical products by using AIS data, in

SGA-2.

In SGA-2 we will also investigate other AIS sources like Marine Traffic, Luxspace (satellite) and

hopefully EMSA. This will result in advice on which data source would best fit analyses for

Eurostat’s purposes. Furthermore, we will develop a methodology for calculating emissions and

report on the impact of this methodology on the (European) level of emissions statistics. All project

results of WP 4 will result in a consolidated report.

2.5 Mobile Phone Data

WP 5 on mobile phone data aims to do research on the potential use of these data for the

production of official statistics. Following the general bottom-up approach of the ESSnet, this work

package concentrates upon concrete sets of mobile phone data, investigating and producing a

concrete statistical output in a given statistical domain, assessing the methodological and

technological framework as well as some data quality aspects, and envisaging future perspectives for

their extensive use in official statistics production.

Several milestones must be reached to achieve these goals. Firstly, mobile phone data are produced in the

frantic activity of the telecommunication industry, in particular in mobile phone networks built,

operated, maintained and technologically streamlined by international telecommunication

corporations. These data are strongly protected by diverse national and international legal

regulations due to highly sensitive personal privacy and confidentiality issues. Furthermore, these

corporations already have an economic interest in exploiting these data commercially

themselves. For all these reasons and others, data access is a big issue to be tackled by the work

package. The short-term goal is to have access to concrete data sets to be used in the rest of

research activities. The long-term goal is to investigate the feasibility of a sustained access in

standard production conditions as well as the required characteristics of the accessed data.

Secondly, since usual survey methodology cannot be directly applied to these data, methodological

proposals must be produced in order to achieve high-quality statistical outputs as in traditional data

sources. This must be complemented with the technological requirements necessary to process

these data. This is the second general goal of the work package.

As a third general goal, depending on the specific agreements with mobile phone corporations to get

access to the data, concrete statistical outputs in the statistical domains of human mobility and

tourism will be produced out of the accessed data sets. The production of statistical outputs will

entail the production of point estimates and the assessment of their quality, especially through their

accuracy and timeliness.

Finally, future perspectives will be envisaged, judging by the preceding diverse elements necessary to

produce statistical outputs from mobile phone data.

This work package will serve as an input for WP 7 on combining different statistical domains and WP

8 on big data methodology. Findings in WP 5 will be valuable for the goals of these work

packages.

The degree of accomplishment of the preceding tasks and the main results are as follows:

Belgium, Finland, France, and Italy succeeded in their negotiations to have access to a

concrete mobile phone data set for SGA-2. Spain and Romania are still in contact with operators

in pursuit of this goal. The UK, the Netherlands and Germany will join WP 5 for SGA-2 with more

mobile phone data sets, mostly at an aggregated level.

No generic recommendation or golden rule to achieve success was found since the situation

is noticeably different by country (different legal regulations) and by mobile network

operator (different company structure and business interests).

The agreements in every country are indeed also different regarding the access conditions

and data characteristics. The main consequence for SGA-2 and the rest of activities of WP 5 is

that data sets are not homogeneous regarding space and time extension and granularity,

event origin (either passive or active events) and other minor factors. Although this may be

seen as an obstacle, it will be used in SGA-2 to test what can be achieved in each case, thereby

exploring several circumstances at the same time.

It is important to note that in all cases consultations with and/or resolutions of National Data

Protection Authorities have been necessary, possibly indicating a lack of clarity in legal

regulations regarding both the right of NSIs to access these data and the legal basis

for mobile network operators to share their clients' data.

An overview of these experiences is contained in deliverable 5.2 of this work package.

As a starting point for the activities of WP 5, we designed a questionnaire to take stock of the

current status of the access to mobile phone data across the ESS. This questionnaire has

been designed taking as a basis a preliminary analysis of the many entangled issues regarding

access to mobile phone data for official statistics purposes.

The questionnaire has basically three parts. In the first part, we enquire about legal issues

regarding statistical, telecommunication and personal data protection regulations, all being

highly entangled in the question of the access to data. This is complemented with some

requested information about the characteristics of the mobile network operators. In the

second part, we focus on the access conditions: (i) in-situ, transmission, or via trusted third-

party, (ii) access for research or for standard production, (iii) conditions on dissemination

regarding intellectual property rights and industrial secrecy on data extraction algorithms

and methods, (iv) combination with official data, (v) data extraction cost compensation, and

some related details. In the final part, data characteristics are investigated: (i) raw or pre-

processed micro data vs. aggregated data, (ii) event source of data (active events, signalling,

etc.), (iii) spatial and time coverage, (iv) spatial and time granularity, (v) data on roamers or

not, (vi) details about possible pre-processing (anonymisation, geolocation, etc.). In addition,

questions were asked about the characteristics of the MNO and on some other aspects.

Detailed results of the survey as well as the questionnaire itself are contained in deliverable

5.1. Regarding responses, as a general overview, we can say that 28 out of the 32 ESS

members were surveyed (the remaining 4 do not participate in big data activities within the

ESS, so that they can be safely understood as having no mobile phone data related activity).

We received responses from 25 NSIs, of which 14 reported contacts with mobile network

operators, only 7 of them succeeding in gaining access to mobile phone data. From the

corporations' point of view, 10 mobile network operators are reported to grant access to

their data, mainly for research purposes.

For the workshop in Luxembourg to gather mobile network operators, national statistical

offices, Eurostat and other stakeholders, 52 invitations to MNOs' representatives were issued

as well as to representatives of the members of the ESS (NSIs and Eurostat) and some other

international organizations (UN, OECD, ITU, DG Connect, DG Digit). STATEC acted as local

organizer (we explicitly acknowledge their support). Positium assisted the work package

members by delivering a talk on their on-going experience and taking an active part in the

debates and round tables. The meeting was organised in four sessions. Firstly, a presentation

of the context and actors was conducted by Eurostat, STATEC and Statistics Spain (INE) as

work package leader. In the second session, three examples of on-going experiences in three

different European countries were presented and complemented with a round table

exploring the lessons learnt so far. In the third session, the core issues regarding the access

to mobile phone data were presented both from the Official Statistics and mobile network

operators’ points of view. A tour-de-table with lively discussion was organized to exchange

the different views. A final session gathering the joint conclusions for the future closed the

meeting.

The detailed contents of the workshop were recorded in the minutes of the meeting.

Two deliverables were produced in order to accomplish the goals of SGA-1 for WP 5.

Deliverable 5.1 contains (i) a preliminary analysis of the issues regarding the access to mobile

phone data, which was the basis to design the questionnaire surveying the status of this

access across the ESS, and (ii) the results of this survey.

As mentioned, basically five main groups of issues were identified regarding the access to

mobile phone data, namely (i) the characteristics of the MNO, (ii) the legal requirements, (iii)

the access conditions, (iv) the data characteristics, and (v) some other aspects.

For diverse reasons, the characteristics of the MNO stand as an important factor in granting access to

their data. Although each corporation is different, three main typologies have been identified: (i)

MNOs having invested (or on the verge of investing) in the development of a business line around

the statistical exploitation of their data, (ii) MNOs not having developed this business line but

determined and interested to do so, and (iii) MNOs not having any interest whatsoever so far in this

business line. It is important to underline that different possible interlocutors within these companies

can be approached. As these companies are usually large, a number of departments or

operating units are involved in the question of granting access to data. In some cases, these

departments do not share the same vision, thus hampering the agreement. This reflects

the many facets involved in the question.

Legal requirements stand as the most visible obstacle to have access to mobile phone data. We have

detected that at least three kinds of legal regulations are at stake. Firstly, by and large, statistical

regulations do not seem to provide clear legal support for the right of NSIs to have access to

these data. This is mainly because National Statistical Laws are already some decades old

and did not explicitly foresee access to private data such as mobile phone data and other big data

sources. However, in many cases, a straightforward interpretation of the current wording of this legislation

still supports NSIs in requesting access under the usual strict confidentiality conditions. Secondly,

telecommunication regulations are clearly stringent on the conditions to access, to store and to

process these data, even by the MNOs themselves. In some cases, there even seems to be an

unsolved collision between these two kinds of legal requirements. Finally, in the European

framework personal data are strongly legally protected and National Data Protection Authorities

must be involved to clarify and to give explicit support to NSIs for their data request and conditions

for accessing, storing and processing. There seems to be a clear consensus that more clarity is

needed, especially in the European realm, regarding the legal status of the access to these data for

official statistics purposes.

The actual conditions to access the data are also part of the negotiations and of the agreement. Data

can be possibly accessed only in-situ in the company's premises or can be securely transmitted to the

NSIs' information systems, possibly through a trusted third party. In all cases so far, access has been

granted for research purposes, the conditions for long-term sustained production being still an open

question. Data are to be extracted from the MNOs' information systems through technological

solutions which the MNOs wish to keep under intellectual property rights and/or industrial secrecy.

How much information is to be shared between MNOs and NSIs thus stands as another point in the

negotiations. Equally, the combination of mobile phone data with official data, especially with micro

data (data at the statistical unit level) is another factor to take into account. As clearly seen from the

operational framework, data extraction and data pre-processing procedures entail some cost and

effort, which must be dealt with in the negotiations to reach an agreement.

Finally, the operational conditions under which confidentiality is scrupulously observed must be

agreed upon. Many characteristics of data can be possibly considered. Firstly, data can be at the

mobile phone level or can be aggregated at some level (also to be agreed upon). The actual origin of

the data within the mobile network is also an important issue. They can be generated from the active

events (calls, messages, Internet connections,...) of subscribers or from any signalling activity

between mobile devices and antennas possibly recorded in the network. The amount of information

(and also of effort) greatly varies from one scenario to another. Complementary data from the

subscribers' contracts (sociodemographic data) or from the operational framework of the network

(position of antennas) are another factor. Pre-processing regarding the procedures of anonymisation,

of geolocating each piece of data and of giving time references to them is an important issue to be

agreed upon. In the case of aggregated data, this must be complemented with details of the

procedure of aggregation.

Lastly, there are some other issues that NSIs must take into account. To name a few, access to data

from the main MNOs is to be considered, because having access to just one or two companies

introduces the unsolved question of partial coverage of the population and thus the subsequent

inference problem. Also, an a priori apparent collision of interests may arise between the private and

public sectors, both of them exploiting the same data. As in traditional survey sampling, this is only

apparent, as the public and private sectors may reinforce each other, especially if partnerships are

formed aimed at complementary actions. Finally, public opinion regarding the access to and sharing

of such highly sensitive data must also be taken into account, possibly with a joint transparent

communication strategy about the actual use of these data for the benefit of society.

Deliverable 5.2 summarises the main findings of SGA-1 activities in the form of guidelines and

recommendations for the partners of the ESS when initiating their own contacts and negotiations

with their national MNOs. This second deliverable has three main sections.

Firstly, it contains a technical description of the operational framework of mobile telecommunication

networks to understand how data are generated. Although it is not strictly necessary to have

expert knowledge of this technology to negotiate access to mobile phone data, it is remarkably

beneficial to understand several factors in the underlying complexity.

Mobile phone data for statistical exploitation do not exist in a cellular network. They are organic data

created, reproduced, stored, and deleted in a frantic business cycle providing a telecommunication

service, not a statistical service. A clear specification of the requested data must thus be formulated

and negotiated.

A mobile telecommunication network has a nested hierarchical structure so that at the top basic data

mainly for billing purposes are compiled whereas at the bottom (where multiple information systems

are geographically distributed across the national territory) a wealth of technical data exists.

The core set of variables for statistical exploitation embraces (i) (anonymous) identification variables

of each mobile device, (ii) time attribute(s), and (iii) geolocation attribute(s). The creation of these

variables depends on diverse factors operating in the network. Complementary variables can also be

extracted. A description of all these variables is included in deliverable 5.2.
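To make the three core variables concrete, the following minimal sketch shows one possible record layout. The names `MobileEvent`, `pseudonymise` and `make_event`, the use of a cell identifier as the geolocation attribute and the salted hash are all illustrative assumptions; they are not taken from deliverable 5.2, and a salted hash is only a simple form of pseudonymisation, not a guarantee of anonymity.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
import hashlib


@dataclass(frozen=True)
class MobileEvent:
    """One network event reduced to the three core variables (hypothetical layout)."""
    device_id: str       # (i) pseudonymised identifier, never the raw subscriber ID
    timestamp: datetime  # (ii) time attribute
    cell_id: str         # (iii) geolocation attribute, assumed here to be the serving cell


def pseudonymise(raw_id: str, salt: str) -> str:
    """Replace a raw identifier by a salted SHA-256 digest (illustration only)."""
    return hashlib.sha256((salt + raw_id).encode("utf-8")).hexdigest()


def make_event(raw_id: str, when: datetime, cell_id: str, salt: str) -> MobileEvent:
    """Build an event record without ever storing the raw identifier."""
    return MobileEvent(pseudonymise(raw_id, salt), when, cell_id)


# Made-up example values.
event = make_event(
    "subscriber-001",
    datetime(2017, 3, 1, 8, 15, tzinfo=timezone.utc),
    "cell-A",
    salt="demo-salt",
)
```

The same raw identifier always maps to the same pseudonym for a given salt, which keeps the events of one device linkable over time while the identifier itself stays with the MNO.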

From a wider perspective, the network complexity entails two immediate issues. On the one hand,

extraction costs must be carefully taken into account, especially when confronted with the firm

international principle in official statistics production of not paying for data for this purpose. On the

other hand, new professional skills are needed for the staff dealing with this task. Both aspects are

briefly tackled in this chapter.

All these technical aspects are summarised in a sequence of tables in deliverable 5.2 as a

collection of issues to be dealt with in negotiations with the MNOs, ranging from the premises where data

are to be processed, through data coverage, to network technology. This part of the deliverable draws

heavily on the technical assistance provided by Positium.

Secondly, we gather all business guidelines especially using the results of the workshop in

Luxembourg and the opinions and visions exchanged with MNOs during this event. These findings

help us understand and disentangle many of the factors behind the difficulties to access mobile

phone data by official statistics producers.

Several on-going initiatives of collaboration between NSIs and MNOs were presented. A round table

and a tour-de-table were held with a lively debate on the different aspects involved in the access to

data.

The following main issues were identified:

• Construct clear use cases to show feasibility and mutual trust.

• Consensus on partnerships outperforming mandatory scenarios.

• Concerns arise when moving from research to production.

• Distributed vs. centralised data processing models. Solutions based on development of open algorithms to be applied on secured data kept in data centres (e.g. MNOs' premises).

• Regulation on data protection and relationship with regulatory authorities is a big issue at national and European level. More clarity needed.

• Perception by society on the use of these sources. Communication strategy is needed. Transparency for citizens.

• Vicious circle in data access: it is necessary to build detailed case studies and delimit a precise set of data to be requested from MNOs, but some kind of data access is needed in advance for setting up these detailed case studies.

• Relationships with MNOs could be different depending on their current strategies on big data. NSIs must take into account these different starting points.

• Quality assurance framework for our users.

Thirdly, the individual experiences of each work package member are included in the deliverable in

the hope of providing illustrative guidance in the process of negotiating with MNOs. As main

guidelines, we have identified the following:

• See/show the window of opportunity of building up a partnership between an NSI and an MNO.

• Get the right people committing their organizations with technical skills and competence to build up the partnership.

• Show empathy and value arising from the NSI’s contribution to this partnership.

• Show absolute guarantees of confidentiality and privacy protection.

• Show the limits of producing statistical outputs with no collaboration in contrast to the combination of data sources and methodologies.

• Be aware of the complexity of data extraction and the implications for extraction costs and professional skills.

• What data? Mobile phone data for statistical exploitation do not exist in a mobile telecommunication network and a concrete specification must be formulated.

• Define a concrete small research project.

• Be attentive to legal issues.

• Analyse costs.

• Be transparent.

The overall goal is to provide relevant information about the access to mobile phone data seeking

optimal cost-effectiveness and efficiency within the ESS.

Outlook

For the SGA-2, the work package will concentrate on producing concrete statistical outputs using the

mobile data sets secured during SGA-1. The main goals for this second part of the ESSnet for this

work package will be:

From the methodological point of view we intend (i) to clarify the application of common

definitions of tourist, commuter, etc. on these data sets, and, in particular, the use of novel

techniques such as machine learning to apply these statistical concepts on these data; (ii) to

produce estimates together with their accuracy (especially bias) as the main measure of

quality; (iii) to research the question of inference from these data sets to the whole

population of analysis.

From the technological point of view, we will identify (i) needs, if any, for distributed

computing; (ii) special needs, if any, for data storage; (iii) novel needs, if any, for software.

From the data quality point of view, an evaluation of the statistical outputs especially

regarding accuracy and timeliness will be conducted and complemented with an assessment

of the adequacy of the current quality framework, again especially regarding accuracy and

timeliness.

Finally, some future prospects will be explored for the extensive use of mobile phone data in

the production of official statistics.

2.6 Early Estimates

The aim of WP 6, Early estimates, was to investigate the potential of combining big data and other

sources for the purpose of producing early estimates of statistical parameters.

The main goal of the WP 6 team was to explore how a combination of (early available) multiple big

data sources, administrative and existing official statistical data could be used in creating existing or

new early estimates for official statistics. The study included the exploration of:

• big data sources and statistical areas where those sources could be used;

• other administrative and statistical sources which could be combined with the investigated big data sources;

• possible business cases which could be tested in the SGA-2 period;

• data collection, data linking, data processing, methodological and IT issues;

• results of one or two pilots which may help us to determine the most promising business case for SGA-2. The proposed pilots were nowcasts of turnover indices and/or the Consumer Confidence Index.

This report (SGA-1) focuses on the results of the study relating to the list of potential big data sources and the

proposed business case for SGA-2. Three deliverables were produced, focusing on data sources and

business cases, on IT tools and on methodology, respectively.

Exploration of data sources

On the basis of brainstorming sessions at some of the NSIs involved in WP 6 (and WP 7), a

questionnaire sent to the NSIs participating in the ESSnet Big Data project and discussions among the members of

the WP 6 team, an initial list was prepared of possible big data and other sources and of the statistics which

could be calculated from these data sources.

Table 1. List of possible data sources with the statistical domains where they could be employed

STATISTICAL DOMAIN | DATA SOURCES | STATISTICS

Tourism (1) | Mobile phone data, traffic counters at border crossings (including vehicle number plate recognition), flight and train tickets, surveys… | Number of foreign tourists, number of (tourist) vehicles passing through the country

Tourism (2) | Mobile phone data, surveys… | Number of tourists, lengths of trips…

Population mobility | Mobile phone data, surveys… | Number of (short) travels per day, average travelled distance per day…

Health statistics | E-health recipes, personal health cards, pharmacies, surveys… | Use of medicines (by age groups, gender, territory…)

Agriculture | Airplane or satellite images, surveys… | Utilized agricultural area, arable land, share of permanent crops in unutilized areas…

Quick and dirty statistics (in all statistical domains) | NSI data & e.g. Google trends tool | Flash estimates of all kinds of early statistics

Statistics for internal NSI purposes | Newsfeeds, social media data | Monitor and detect the statistical products and areas which occur in statistical and other web news: • detect new statistical products which are very frequent and not yet covered by NSI production; • detect the statistical products produced by NSIs for which there is almost no demand; • this information will help the management of NSIs (together with stakeholders) to decide for which statistical products there is high public demand

Economic indicators: gross domestic product (GDP), consumer price index (CPI), retail sale, balance of payments, economic sentiment indicators | Big Data: job vacancies ads from job portals, traffic loops, social data (Twitter, Facebook, etc.), supermarket scanner data, bank transaction data, news feeds/messages. Registers and existing sources: Statistical Register of Employment, data from the Employment Agency, tax data, wages and salaries. Surveys: turnover data from various short-term surveys, Business confidence index, Consumer confidence index | Flash and (or) intermediate estimates of economic indicators

Among the proposed data sources and associated statistics, the WP 6 team decided that the most

promising and interesting option is to combine sources which could be used for early estimates of

economic indicators. The three main reasons for this decision are:

• There is a very high demand for flash estimates of economic indicators from stakeholders;

• Many of the sources are available in most of the countries so it is possible to test them and create the results for more than one country;

• Even if the country does not have access to any of the big data sources, it is still possible to test methods and processes on administrative and other existing sources.

Combining data sources

When we think of combining data sources in traditional statistical production, we mostly think

of combining them at the micro level. If a common identifier exists, the linking of data is quite

straightforward; otherwise, various record linkage methods are applied in order to derive an identifier for

the data set in which it is missing. In the area of big data, the issue of combining different data sets is

more complicated. Often the (big) data sources are completely different, so that we are not able to

employ record linkage techniques, or one of the data sets contains unstructured data, for which we need

to employ big data techniques like machine learning in order to link the data.
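The two micro-level situations described above (a common identifier exists, or it has to be derived by record linkage) can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the function name `link_records`, the field names `id` and `name`, the similarity threshold and the toy records. Real record linkage uses more careful comparison and blocking rules than the plain string similarity from Python's standard library used here.

```python
from difflib import SequenceMatcher


def link_records(left, right, key="id", name="name", threshold=0.85):
    """Link two lists of dicts.

    A record is joined exactly on `key` when the identifier is present;
    otherwise the best fuzzy match on `name` is accepted if its similarity
    reaches `threshold`. Returns (left_record, right_record, method) tuples.
    """
    by_key = {r[key]: r for r in right if r.get(key) is not None}
    links = []
    for rec in left:
        if rec.get(key) in by_key:
            links.append((rec, by_key[rec[key]], "exact"))
            continue
        best, best_score = None, 0.0
        for cand in right:
            score = SequenceMatcher(None, rec[name].lower(), cand[name].lower()).ratio()
            if score > best_score:
                best, best_score = cand, score
        if best is not None and best_score >= threshold:
            links.append((rec, best, "fuzzy"))
    return links


# Made-up records: one has an identifier, two do not.
survey = [
    {"id": 101, "name": "Alpha Trading Ltd"},
    {"id": None, "name": "Beta Logistics"},
    {"id": None, "name": "Gamma Foods"},
]
register = [
    {"id": 101, "name": "Alpha Trading Limited"},
    {"id": 202, "name": "Beta Logistic"},
    {"id": 303, "name": "Delta Mining"},
]

links = link_records(survey, register)
```

In this toy example the first record is linked through its identifier, the second through name similarity, and the third stays unlinked because no candidate is similar enough.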

The other possibility for linking data sources is linking at the macro level. Here we try to aggregate all

data sets making use of common identifiers and include the data in nowcasting models.

Nowcasting [4] is a very early estimate produced for an economic variable of interest over the most

recent reference period calculated on the basis of incomplete data using a statistical or econometric

model different from the one used for regular estimates. Soft data should not play a predominant

role in nowcasting models. Nowcasts may be produced during the very same reference period for

which the data is produced.

In one of the pilots conducted during SGA-1, several nowcasting methods were

investigated. Among them, the most promising in terms of practical implementation was

the Principal Component Analysis (PCA) method. The central idea of PCA is to reduce the

dimensionality of a data set consisting of a large number of interrelated variables, while retaining as

much as possible of the variation present in the data set. This is achieved by transforming to a new

set of variables, the principal components (PCs), which are uncorrelated, and which are ordered so

that the first few retain most of the variation present in all of the original variables.

When nowcasting early indicators with a PCA model, big data and other sources could be combined in two ways:

• as regressors in the nowcast equation $y = \alpha_1 x_1 + \alpha_2 x_2 + \cdots + \alpha_k x_k + \beta_1 y_1$, where the variables $x_1, x_2, \ldots, x_k$ are the principal components of a given set of (big) data and the variable $y_1$ is an aggregate of the combined (big) data source.

In the pilot conducted at SURS, where we tested the PCA model to estimate the Real

turnover index in industry (the time series of interest), the Real turnover of industrial enterprises

(the time series of enterprise data used to determine the principal components) was combined

with the Economic sentiment indicator, used as an additional predictor in the linear regression.

• as micro data in the nowcast equation $y = \alpha_1 x_1 + \alpha_2 x_2 + \cdots + \alpha_k x_k + \beta_1 y_1$, where the variables $x_1, x_2, \ldots, x_k, y_1$ are principal components of two given sets of (big) data.
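The first way of combining sources can be sketched as follows: principal components are extracted from a panel of series and then used, together with one additional aggregated predictor, as regressors for the series of interest. This is only a minimal sketch on synthetic data, not the SURS application. The function names `principal_components`, `fit_nowcast` and `nowcast` are hypothetical, and an intercept is added to the regression, which the equation in the text does not show.

```python
import numpy as np


def principal_components(X, k):
    """Return the scores of the first k principal components of the columns of X."""
    Z = (X - X.mean(axis=0)) / X.std(axis=0)          # standardise each series
    _, _, Vt = np.linalg.svd(Z, full_matrices=False)  # rows of Vt are the loadings
    return Z @ Vt[:k].T


def fit_nowcast(pcs, extra, y):
    """Least-squares fit of y on the components, one extra predictor and an intercept."""
    A = np.column_stack([pcs, extra, np.ones(len(y))])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    return coef


def nowcast(coef, pcs_new, extra_new):
    """Apply the fitted coefficients to the regressors of the most recent period."""
    return float(np.concatenate([pcs_new, [extra_new, 1.0]]) @ coef)


# Synthetic data standing in for T periods of N enterprise series.
rng = np.random.default_rng(0)
T, N, k = 60, 20, 2
factor = np.cumsum(rng.normal(size=T))
X = np.outer(factor, rng.uniform(0.5, 1.5, size=N)) + rng.normal(scale=0.3, size=(T, N))
sentiment = 0.5 * factor + rng.normal(scale=0.5, size=T)         # the y_1 of the equation
y = 2.0 * factor + 0.8 * sentiment + rng.normal(scale=0.2, size=T)

pcs = principal_components(X, k)
# The target is assumed unknown for the last period: fit on the rest, then nowcast.
coef = fit_nowcast(pcs[:-1], sentiment[:-1], y[:-1])
estimate = nowcast(coef, pcs[-1], sentiment[-1])
```

The components are computed from all periods, including the most recent one, because in a nowcasting setting the regressors are already available for a period in which the target series is not.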

Business case for SGA-2

Early estimates of economic indicators

During the SGA-1 period, the WP 6 team explored big data and other sources which could be

combined for purposes of early estimates and conducted two pilots at Statistics Finland and the

Statistical Office of the Republic of Slovenia based on early estimates of some economic indicators.

Statistics Finland tested a series of shrinkage and factor analytic methodologies to compute nowcasts

of the main Finnish turnover indexes, using continuously accumulating firm-level data. They showed

that the estimates based on large dimensional models provide an accurate and timelier alternative to

the ones produced currently by Statistics Finland, even after taking into account data revisions. In

particular, it was found that the turnovers for some economic sectors could be estimated with high

accuracy five days after the reference month has ended, giving more accurate and faster predictions

compared to the first official internal release. The Statistical Office of the Republic of Slovenia

worked on a PCA model, for which a shareable application was created and tested on real industry indices,

with promising nowcasting results. This method was also tested on some big data

sources such as on-line job vacancy data.

[4] Overview of GDP flash estimation methods; Eurostat 2016 Edition

Moreover, all of the countries involved in WP 6 (Finland, Poland, Netherlands, Slovenia) and some of

the other NSIs involved in the ESSnet Big Data project expressed quite high interest in earlier

estimates of the main economic indicators produced at NSIs. Given this interest in

investigating early estimates and the fact that many big data sources could be associated

with early economic indicators, it was decided to propose a pilot on early estimates of economic

indicators.

The aim of the pilot is to investigate big data and other existing sources for calculating flash and (or)

intermediate estimates of economic indicators. The economic indicators for which early estimates will be

considered are:

• Gross domestic product (GDP)

• Consumer price index (CPI)

• Retail sale

• Balance of payments

• Economic sentiment indicators

Some work will also be dedicated to exploring possible new leading economic indicators.

During the pilot, the correlation between the data sources and the early economic indicators

will be explored and, depending on the results (the sources identified for combination and the

economic indicator being tested), various models for flash and (or) intermediate estimates will be tested. The

most promising indicator is GDP, but the pilot will not limit itself to GDP, because the results

of analysing the data sources may suggest the calculation of (better) estimates of other economic

indicators.

Data sources

Many big data, statistical and other administrative sources could be linked to early economic

indicators. As one of the results of SGA-1, the list of possible sources which could be combined for

purposes of estimates of early economic indicators was prepared (Table 2). Some of the sources have

already been investigated; for some of them there is an issue with their accessibility. The availability of

time series for certain data sources should also be taken into account, given our goal of nowcasting

economic indicators.

Table 2. Overview of possible sources to be investigated

Big Data | Registers and existing sources | Surveys

Job vacancies ads from job portals | Statistical Register of Employment | Turnover data from various short-term surveys

Traffic loops | Data from the Employment Agency | Business confidence index

Social data (Twitter, Facebook, etc.) | Tax data | Consumer confidence index

Supermarket scanner data | Wages and salaries | …

News feeds/messages | … |

Bank transaction data | |

One of the data sources which could be most easily acquired is traffic sensor data. An additional

advantage is the availability of time series of traffic data, which is not the case for many other big

data sources. The first results obtained at the Statistical Office of the Republic of Slovenia

(Figure 1) show a fairly good fit between the curve of annual GDP and the curves of

estimates of annual GDP based on various annual aggregates of traffic density in Slovenia, using data

for the period 2005-2014. The model used for estimation was simple linear regression.

Figure 1. Correlation between GDP and traffic sensor data

Figure 1 shows five examples of estimates of annual GDP based on aggregated categories of vehicles

on Slovenian roads. The categories used are:

• Light trucks (up to 3.5 t)

• Medium trucks (3.5 to 7 t)

• Heavy trucks (more than 7 t)

• Trucks with trailer

• Semi-trailers

In the example at the bottom right, all categories of vehicles were used as the regressor. Surprisingly,
the best results were obtained when all vehicles were taken into account. Based on these encouraging
initial results, a more detailed analysis of traffic loop data has been started (most of the work is
planned for SGA-2).
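
The estimation approach described above (annual GDP regressed on one annual traffic aggregate) can be sketched as follows. This is only a minimal illustration of simple linear regression; the function names and the index values are hypothetical and are not SURS data.

```python
def fit_simple_linear_regression(x, y):
    """Ordinary least squares fit of y = intercept + slope * x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


def r_squared(x, y, intercept, slope):
    """Share of the variance of y explained by the fitted line."""
    mean_y = sum(y) / len(y)
    ss_res = sum((yi - (intercept + slope * xi)) ** 2 for xi, yi in zip(x, y))
    ss_tot = sum((yi - mean_y) ** 2 for yi in y)
    return 1.0 - ss_res / ss_tot


# Hypothetical annual indices (10 years), for illustration only.
traffic_index = [92.0, 95.5, 99.0, 94.0, 96.5, 98.0, 97.0, 96.0, 99.5, 103.0]
gdp_index = [90.0, 95.0, 100.0, 93.0, 95.5, 97.0, 96.0, 95.0, 99.0, 104.0]

intercept, slope = fit_simple_linear_regression(traffic_index, gdp_index)
estimates = [intercept + slope * t for t in traffic_index]
print("R-squared:", r_squared(traffic_index, gdp_index, intercept, slope))
```

Repeating the fit once per vehicle category, and once for all vehicles together, would give one fitted curve per panel of a figure like Figure 1.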

After some research it was found that the data can be acquired from the Slovenian Ministry of
Infrastructure (and from municipalities for local traffic). The Ministry offered several choices for the
data format and provided a sample of micro data, consisting of raw data and so-called “edited data”.
Raw data present counts of vehicles by category for each traffic loop, while edited data represent
time series for one or more traffic loops placed at the same location.

Raw data

The raw data come from every traffic sensor placed on Slovenian roads. As different kinds of sensors
count different categories of traffic, sensors at the same counting spot would need to be merged
according to a formula that adequately distributes these differing categories. The number of
categories differs according to the version of the sensor, as shown in the table:

QLD3 | Sensor QLD3 counts all vehicles
QLD5 | Sensor QLD5 distinguishes 5 vehicle categories
QLD6 | Sensor QLD6 distinguishes 10 vehicle categories
QLTC8 | Sensor QLTC8 distinguishes 10 vehicle categories
QLTC10 | Sensor QLTC10 distinguishes 10 vehicle categories
QLD | Counted with different versions of sensors
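
The report does not specify the formula for merging sensors with different category resolutions. One possible approach, shown here purely as an assumption, is to split the total of a coarse sensor (e.g. one that counts all vehicles) across categories in proportion to the counts of a finer sensor at the same counting spot. The function name and the category labels are hypothetical.

```python
def distribute_total(total_count, reference_counts):
    """Split a total vehicle count across categories in proportion to
    the category counts observed by a finer reference sensor.

    This proportional rule is an assumption for illustration; it is not
    the formula used by the data provider.
    """
    reference_total = sum(reference_counts.values())
    if reference_total == 0:
        raise ValueError("reference counts must contain at least one vehicle")
    return {
        category: total_count * count / reference_total
        for category, count in reference_counts.items()
    }


# Hypothetical 15-minute interval: a coarse sensor counted 200 vehicles,
# while a finer sensor at the same spot saw this category mix.
print(distribute_total(200, {"cars": 80, "trucks": 20}))
```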

Sensors have some common features. Every sensor counts traffic on 2 channels: the 2 opposing
lanes on regional roads, or the ordinary and fast lanes on expressways and highways. The counting
interval is also the same for every sensor, namely 15 minutes. The data output file is a text file with
11 categories of vehicles, regardless of the number of categories a sensor actually counts. The
uncounted categories are not marked but are filled with zeroes. The data also contain other
information, such as the highest, lowest and average speed in the interval, the average speed of
personal vehicles specifically, the average time gap between vehicles, the occupancy of the lanes
and the temperature.

Traffic sensors

In 2015 there were 659 automatic (non-manual) sensors in Slovenia.
The website promet.si (https://www.promet.si/portal/sl/stevci-prometa.aspx, accessed 26.1.2017)
provides real-time information from traffic sensors about the current traffic situation. These sensors
cover all highways as well as other roads in Slovenia. A map of all traffic loops in Slovenia is also
available, which allows their exact locations to be determined
(http://www.di.gov.si/fileadmin/di.gov.si/pageuploads/Prometni_podatki/2015_karta_stm.pdf).

Roads

There are 12 categories of roads in Slovenia. This is important to know because traffic density on
some roads may have a different influence on economic indicators (e.g. when excluding foreign
vehicles that merely cross the country). One of the steps in SGA-2 is to investigate which types of
roads and which categories of vehicles are most strongly correlated with economic indicators.
A more detailed description of the road categories can be found at
http://www.stat.si/StatWeb/File/DocSysFile/8025


Methodology

Nowcasting

The main idea is to test in practice at least one nowcasting method for estimating early economic
indicators. As mentioned, the PCA model has been tested at SURS for this purpose.

The model consists of two stages:
1. Principal component analysis (PCA)
   - dimensionality reduction
   - time series of (enterprise) data → standardize → choose the first few principal components
   - various conditions for choosing principal components:
     - The chosen principal components explain at least 70% (75%, 80%, 85%, 90%) of the variability of enterprise data.
     - Time series in the linear regression model are at least 7 (8, 10, 15, 20) times longer than the number of the chosen principal components.
     - The last chosen principal component explains at least 5% of the variability of enterprise data.
2. Linear regression
   - Y (dependent variable): time series of interest, e.g. turnover index
   - x1, x2, ⋯, xk (predictors): e.g. the chosen principal components
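
The two-stage model above can be sketched with NumPy as follows. This is not the SURS application: the function names are hypothetical, and applying the three selection conditions jointly (rather than as alternatives) is an assumption made for illustration.

```python
import numpy as np


def choose_components(explained_ratio, n_obs, min_cumulative=0.70,
                      min_obs_per_component=7, min_last_share=0.05):
    """Pick the number of leading principal components.

    Assumption: start from the smallest k reaching the cumulative share,
    then drop components while the series is too short or the last
    chosen component explains too little.
    """
    explained_ratio = np.asarray(explained_ratio, dtype=float)
    cumulative = np.cumsum(explained_ratio)
    reached = np.nonzero(cumulative >= min_cumulative)[0]
    k = int(reached[0]) + 1 if reached.size else len(explained_ratio)
    while k > 1 and (n_obs < min_obs_per_component * k
                     or explained_ratio[k - 1] < min_last_share):
        k -= 1
    return k


def pca_nowcast(X, y, x_new, **rules):
    """Stage 1: PCA on standardized columns of X. Stage 2: regress y on
    the chosen components and predict for the new observation x_new.

    Assumes that no column of X is constant.
    """
    mean = X.mean(axis=0)
    std = X.std(axis=0, ddof=1)
    Z = (X - mean) / std                       # standardize
    _, s, Vt = np.linalg.svd(Z, full_matrices=False)
    explained = s ** 2 / np.sum(s ** 2)        # variance share per component
    k = choose_components(explained, n_obs=X.shape[0], **rules)
    scores = Z @ Vt[:k].T                      # principal component scores
    design = np.column_stack([np.ones(len(y)), scores])
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    new_scores = ((x_new - mean) / std) @ Vt[:k].T
    return float(coef[0] + new_scores @ coef[1:]), k


# Synthetic data for illustration: 40 periods, 5 enterprise series.
rng = np.random.default_rng(1)
X = rng.normal(size=(40, 5))
y = X.sum(axis=1) + rng.normal(scale=0.1, size=40)
nowcast, n_components = pca_nowcast(X, y, X[-1])
print(n_components, nowcast)
```

The thresholds are keyword arguments so that the alternative values listed above (75%, 8 times, and so on) can be tried without changing the code.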

SURS, with the help of Statistics Finland, prepared an application (together with instructions on how
to use it) which allows:

inputting different kinds of data

testing various conditions for choosing principal components

producing quality indicators which compare results of different nowcasting methods

producing quality indicators which compare results


Figure 2. Comparison of estimates of real turnover index and actual index

With an operational application in which one of the nowcasting methods is implemented, the plan
for SGA-2 is to include big data sources in the model and to test the performance of the estimates
of early economic indicators. There will also be a focus on assessing the quality of the process and
of the calculated estimates of early indicators.

Nowcasting turnover indexes (Finnish experience)

The main motivations for the work done by Statistics Finland were:

A pressing issue was the long publication lag, together with requirements from FRIBS and from users (i.e. national accounts);

StatFi wanted a practical solution, so another aim was simplicity and tractability in terms of data source and method, if possible;

To propose methods that can be useful with big data and that are commonly used in the nowcasting of macroeconomic variables;

Using continuously accumulating firm-level data (hard data);

Estimating the common components underlying these data with factor analysis;

The common components would be predictors in the nowcasting equations;

In the machine learning literature, other options were available (such as LASSO, ridge and elastic net regressions) that can deal with high-dimensional econometric problems.


Firm-level data were used for these purposes. One of the issues that had to be solved was the high
dimensionality of the data. A widely used approach is factor analysis, which estimates the common
and idiosyncratic variance underlying the data. However, shocks to large companies can have a
sizeable impact on the Finnish economy (economic activity is concentrated in a few multinationals,
e.g. Nokia). That is why so-called shrinkage models were explored, in order to better capture some
of the firm-specific variation. They include all firm growth rates in the estimation and deal with
overfitting by shrinking parameter values towards 0. In general, these models outperform factor
models.
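
To illustrate what shrinking parameter values towards 0 means, the sketch below uses ridge regression, one of the shrinkage methods named above, in its closed form. The data are synthetic and the setup (24 periods, 60 firms) is hypothetical; it is not the model estimated by Statistics Finland.

```python
import numpy as np


def ridge_coefficients(X, y, penalty):
    """Closed-form ridge estimate (X'X + penalty * I)^-1 X'y, no intercept."""
    n_features = X.shape[1]
    return np.linalg.solve(X.T @ X + penalty * np.eye(n_features), X.T @ y)


# Synthetic example with more predictors (firms) than observations (months),
# a setting in which unpenalised least squares would overfit.
rng = np.random.default_rng(0)
X = rng.normal(size=(24, 60))
y = X[:, :3] @ np.array([0.5, 0.3, 0.2]) + rng.normal(scale=0.1, size=24)

penalties = (0.1, 1.0, 10.0, 100.0)
norms = [float(np.linalg.norm(ridge_coefficients(X, y, p))) for p in penalties]
for penalty, norm in zip(penalties, norms):
    print(penalty, norm)
```

The length of the coefficient vector falls as the penalty grows, which is the shrinkage effect. All firms stay in the model, but no single coefficient can grow freely to fit noise.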

Statistics Finland has also tested nowcasting the second month of the quarter and forecasting the
third, which makes it possible to compute real-time estimates of quarterly GDP. The methods were
similar, using the sales inquiry as the source of firm-level data. Turnover indexes are widely followed
in their own right, but are also used as source material for producing the Trend Indicator of Output
(TIO, i.e. the Finnish monthly economic activity indicator).

Conclusions and outlook

The first aim of ESSnet Big Data WP 6 (SGA-1) was to find out which pilot would combine multiple
(big) data sources and have real potential to be implemented (by at least two countries) in the SGA-2
period. The proposed pilot is Early estimators of economic indicators. The proposal for the pilot has
been made, and the positive response from other countries suggests that we are on the right track.
According to the plan, the WP 6 team worked on the following deliverables:

A detailed business plan for the pilot was prepared;

A list of tasks for each country involved in SGA-2 was prepared;

An initial investigation of available data sources in the participating countries was carried out.

The second aim of the report was to show which of the two proposed pilots has greater potential
to be implemented during the first wave of pilots. After the first few months of investigation it was
found that the pilot NowCasting turnover indices is much more feasible in terms of available data.
Another advantage of this pilot is the nowcasting models, which will be tested during the next
period. The project team sees a clear connection between those models and the proposed pilot on
early estimates, where the experience gained from nowcasting turnover indices could be used.

For the SGA-2 period, Italy and Portugal will join the existing WP 6 team (which for SGA-1 consisted
of Finland, Poland, the Netherlands and Slovenia). Internal WP 6 meetings have already been
organised with the new members, at which we discussed the organisation of work, possible (big)
data sources, nowcasting methods and the economic indicators for which we intend to calculate
early estimates. The most interesting among them are quarterly GDP (overall GDP or some of its
components) and new leading economic indicators that measure the business cycle. However, for
estimates related to GDP we need to bear in mind that the precision requirements are very strict.


2.7 Multi Domains

The aim of WP 7 is to find out how a combination of big data sources, administrative data and
statistical data may enrich statistical output (how they can be used to improve current statistics) in
the domains ‘Population’, ‘Tourism/border crossings’ and ‘Agriculture’.

This work package has a more scientific nature. From the methodological, qualitative and technical
points of view, it is required to work with professional independence. The activities in this work
package are independent of the results of work packages 1 to 5, although information is exchanged
with other work packages where relevant. WP 6 and WP 7 are especially closely related: both concern
the exploration of combining various sources, including big data sources, for the production of
statistics.

The work package team aims to describe the data collection, data linking, data processing and
methodological aspects of combining data in statistical domains; additional value could be created
by investigating the linkages between domains.

Since this work package considers many cross-cutting issues, such as methodology, quality and
technical requirements, care will be taken that its outputs can be used as inputs for WP 8 of SGA-2.
The participation of Statistics Netherlands, which leads WP 8 of SGA-2, is aimed at ensuring this.

Under SGA-1, apart from GUS (Statistics Poland), which leads WP 7, and CBS (Statistics
Netherlands), this work package is carried out by 2 other ESSnet Big Data partners:
CSO (Statistics Ireland) and ONS (Statistics United Kingdom).

The WP 7 team divided work into 4 main groups of tasks:

Task 1. Data availability/Data inventory

Task 2. Data feasibility

Task 3. Data combination (SGA-2)

Task 4. Summary plus future perspectives (SGA-2)

Similarities and differences between countries, such as the availability of registers and the legality
of data linkage, are taken into account when carrying out the tasks.

At the end of SGA-1, work was carried out mainly on the pilots, for which the previous tasks had
provided appropriate preparation. In the second part of SGA-1, WP 7 started its first practical work.
The results obtained so far are presented below by domain.

As the final results of SGA-1, WP 7 prepared at the end of January 2017 three final reports:

7.1 Report for Population domain

7.2 Report for Tourism/Border crossing

7.3 Report for Agriculture.

These (combined) reports contain basic information on data access (such as legal and privacy

aspects), data quality issues, methodology (including the combination of data) and the technical

aspects of the data.


They were submitted to the Review Board on 1 February 2017, after which WP 7 started preparing
for SGA-2, as this was the final phase of SGA-1 for the team. In line with the FPA, SGA-1 and SGA-2
overlapped, and WP 7 was one of the work packages that already started work under the second
agreement in March 2017.

All tasks were completed within the scheduled timeframes.

Population

The population domain mainly covers studies that analyse the attributes of the population. Survey
results are often presented in the form of population structure, for example the number of women
per 100 men, or as an index, for example net migration per 1000 people.

Because data found on the Internet are not representative, and conclusions can only be drawn
about persons who are active on the Internet, it was decided to link population indicators with
indicators from social research, such as life satisfaction or participation in elections (intention to
vote). This scope was also chosen because the results obtained can be compared with results from
traditional research techniques.

It was decided to split the use case into three parts:

Daily satisfaction (Life satisfaction);

The moods of population associated with public events (e.g., Brexit, Voting);

The morbidity areas (e.g., flu).

Accordingly, the aims of this study are:

To examine the level of daily satisfaction of the population by analysing the content of

messages for the presence of defined expressions describing emotional states, e.g.,

happiness, joy, sadness, fear, anger;

To present the moods of the population associated with various public events;

To observe morbidity areas, e.g., flu.
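
The first aim, checking messages for the presence of defined expressions describing emotional states, can be sketched as a simple lexicon lookup. The word lists below are a hypothetical mini-lexicon chosen for illustration; a real study would need curated, language-specific lists.

```python
import re
from collections import Counter

# Hypothetical mini-lexicon: emotion -> expressions that signal it.
EMOTION_LEXICON = {
    "happiness": {"happy", "happiness", "sohappy"},
    "joy": {"joy", "joyful"},
    "sadness": {"sad", "sadness"},
    "fear": {"fear", "afraid", "scared"},
    "anger": {"anger", "angry"},
}


def detect_emotions(message):
    """Count lexicon hits per emotion in one message (case-insensitive)."""
    tokens = re.findall(r"\w+", message.lower())
    counts = Counter()
    for emotion, words in EMOTION_LEXICON.items():
        hits = sum(1 for token in tokens if token in words)
        if hits:
            counts[emotion] = hits
    return counts


print(detect_emotions("Rousey is gonna quit UFC forever now lmao #SoHappy"))
```

Because hashtags are tokenised without the leading `#`, a tag such as `#SoHappy` is matched through the lexicon entry `sohappy`.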

Population indicators will be limited to the following indicators for persons using new technologies
to communicate, such as social media and blogs:

Resident population (and migration);

Number of women per 100 men;

Population structure.

These indicators will be related to social statistics, such as life satisfaction, moods and morbidity
areas.

Data obtained through the proposed solutions make it possible to:

extend the scope of the database;

obtain more current results;

add more detailed cross-sections for the study population of users of social media and the
Internet (currently there are no such subpopulations in similar thematic studies).


In the first stage (December 2016 - February 2017) it was decided to conduct a pilot of Use Case 1,
i.e. Daily satisfaction within Life satisfaction. The other use cases in the Population domain,
numbers 2 and 3, were scheduled for February and March 2017.

Table 1. Schedule for carrying out the Use Cases

Use Case 12.2016 01.2017 02.2017 03.2017

1.1.

1.2.

1.3.

The criteria for selecting the Use Case relate primarily to the availability of source data and the
reliability of the results. It was therefore decided to use a widely available source that can be
accessed by web scraping or through a dedicated API.

It was assumed that life satisfaction data could be drawn from social networking sites and web
pages. The advantage of social networking sites over websites is the possibility of obtaining the
geographical location of an entry. Specifically, the following portals / tools were identified as data
sources:

DS1 – Twitter

DS2 – Google Trends

DS3 – Comments on Specific News/Events on Web Portals such as gazeta.pl, bbc.co.uk,

irishtimes.com, spiegel.de

According to the data quality assessment, the quality of these sources is high enough to acquire and
process data in order to achieve the main objective of the Use Case.

The main aim of Use Case 1.1 is to obtain data on Daily Satisfaction. This means that life satisfaction
is assessed by analysing entries posted on the Internet. To achieve this objective, it was decided to
use API tools to retrieve data from the social networking site Twitter.

Due to the nature of entries on social networking sites, it is not possible to grade life satisfaction on
a scale, as is done e.g. in EU-SILC (European Union Statistics on Income and Living Conditions).

The main advantage of the proposed solution is the possibility of carrying out surveys more
frequently than with the survey modules currently in use. A further benefit is a lower burden on
respondents.

The realisation of the Use Case comprises seven steps, in which the following tasks will be carried
out:

Classification of tweets in order to extract the attributes related to life satisfaction;

Analysis of the possibility of obtaining the attributes of the population;

Preparation of a Python solution enabling the acquisition of tweets and their processing with
algorithms such as machine learning;

Training machine learning algorithms on the tweet classification;

Verification of the algorithms on the test data set;

Selection of the algorithms for conducting the survey in a test environment;

Conducting a pilot in a production environment.

The solution prepared in the first phase uses Python version 3 and the Apache Spark environment
for data acquisition and processing. For web scraping, Tweepy, a Python library for the Twitter API,
was used. The machine learning algorithms were implemented with the scikit-learn library. The
overall framework is presented in Figure 1.

Figure 1. Big Data Framework used for Use Case 1.1.

Examples from the training set are presented in Table 2.

Table 2. The content of the training set

No. | Text | Target | Language | Id
1 | Rousey is gonna quit UFC forever now lmao #SoHappy | Satisfied | EN | F1
2 | And I did absolutely nothing #satisfied ❤ | Satisfied | EN | F2
3 | To był cudowny weekend #love #happy #awesome #osom #bestweekend @ Czestochowa | Satisfied | PL | F3
4 | Połączenie nowoczesnego designu z funkcjonalnością sprawi, że osiągniesz jeszcze lepsze wyniki. | Neutral | PL | F4
5 | They want more happiness & more money in 2017 cause they're not satisfied w/the position of each. It don't matter the context. #Unsatisfied | Not satisfied | EN | F5
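
A training set of this shape (text plus target label) can be fed to a scikit-learn text classifier roughly as follows. The report does not state which algorithm was used, so the bag-of-words naive Bayes model here is an assumption, and the six training texts are invented toy examples rather than the actual training data. Tweet acquisition with Tweepy is left out because it requires API credentials.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Hypothetical toy training set in the (text, target) shape of Table 2.
TRAINING_SET = [
    ("what a wonderful happy weekend #happy", "Satisfied"),
    ("so happy and satisfied today", "Satisfied"),
    ("sad and unsatisfied with everything", "Not satisfied"),
    ("angry and sad about this awful day", "Not satisfied"),
    ("modern design combined with functionality", "Neutral"),
    ("the meeting starts at noon", "Neutral"),
]


def train_classifier(training_set):
    """Fit a bag-of-words naive Bayes classifier on (text, target) pairs."""
    texts, targets = zip(*training_set)
    model = make_pipeline(CountVectorizer(), MultinomialNB())
    model.fit(texts, targets)
    return model


model = train_classifier(TRAINING_SET)
predictions = model.predict(["happy wonderful weekend", "sad awful day"])
print(list(predictions))
```

In the terms of the Use Case steps, `fit` corresponds to training the algorithm on the labelled tweets, and `predict` on held-out tweets corresponds to verifying it on the test data set.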

The general outline of this study includes the following topics:

Web scraping

Analysis of comments / messages

Selection and arrangement of useful information

Quantitative and qualitative classification of posts/messages/comments (machine learning)

To sum up, the project uses large data sets from social networks to assess the current mood of
people in connection with public events, as well as related indicators such as life satisfaction. This
makes it possible to study the impact of public mood on decisions. An analysis of the moods of
citizens of different countries in relation to the same events, and of mood changes over time, could
be particularly interesting.

The work plan for the future is to develop more indicators of satisfaction in various areas of life,
e.g. work, commuting time etc.

[Figure 1 diagram elements: Twitter data; Tweepy; Sklearn; Training dataset; Machine learning algorithm; Data extracting; Predictive model; Labels; Feature vectors; Result set]


Tourism/border crossing

As data sources, we identified the following data owners (names in the original language):

Generalna Dyrekcja Dróg Krajowych i Autostrad for Poland (GDDKiA; see footnote 5)

Bundesanstalt für Straßenwesen for Germany (BASt)

Ředitelství Silnic a Dálnic for Czech Republic (RSD)

Národná diaľničná spoločnosť for Slovakia (NDS)

Kelių ir transporto tyrimo institutas for Lithuania (KTTI)

At present we have started cooperating with GDDKiA, BASt, KTTI and NDS, and we are awaiting an
answer from RSD.

At the same time, the Statistical Office in Rzeszów conducts a sample survey on border traffic. Days
are sampled within each quarter, and measurement points are sampled from all possible
measurement points along the border.

On the one hand we have a large amount of data but only for a few points in time, while on the
other hand we have data of high frequency but low volume. The volume of data from GDDKiA is
slowly increasing over time, and some of the measurement points were not available in the past. We
also have some mirror statistics: data from the German and Lithuanian sides are of high frequency
and plausible volume, while for other countries data frequency and volume are similar to those for
Poland. In addition, our sample survey provides high-frequency data for several points near the
border. We expect, however, that the role of surveys will decrease significantly as a result of our
efforts, and that survey data will be used only up to a certain point in the future. After that, we hope
to base the results on big data sources only.

In the first step we prepared a template to be filled in with spatio-temporal data. We selected the
length of the time series according to data availability, and the appropriate measurement points: all
points from the Continuous Traffic Measurement and all points near the border. Most of the
Continuous Traffic Measurement points are situated in the interior of Poland; only a few are near
the border. For several points mirror statistics are available. The template will be filled in for the
following data sources:

General Traffic Measurement from GDDKiA

Continuous Traffic Measurement from GDDKiA

Continuous Traffic Measurement for German border from BASt

General Traffic Measurement for Slovakian border from NDS

Continuous Traffic Measurement for Lithuanian border from KTTI

Survey on border traffic conducted by Statistical Office in Rzeszów as a support for improving

model parameters but not as a source of data for further estimations

Additional data will also be gathered:

Distance matrix between measurement points;

Number of registered vehicles at LAU1 level;

Other available data which can be connected with traffic.

5 General Directorate of National Roads and Motorways


Continuous Traffic Measurement does not cover all measurement points for the whole period: data
are lacking at the beginning of some time series, and there are also some missing data in the middle
of time series. Missing values will therefore be imputed using correlated time series that are
available for the missing periods. As a result, we will obtain a larger set of time series available for
the same period.
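
A simple form of imputation from a correlated series is regression imputation: fit a line between the incomplete series and a complete "donor" series on the periods where both are observed, then predict the gaps. The report does not specify the imputation method, so this choice, the function name and the numbers below are assumptions for illustration.

```python
import numpy as np


def impute_from_donor(target, donor):
    """Fill NaN gaps in `target` using a linear fit on a correlated `donor`.

    Assumes both series cover the same periods and overlap in at least
    two observed periods.
    """
    target = np.asarray(target, dtype=float)
    donor = np.asarray(donor, dtype=float)
    observed = ~np.isnan(target) & ~np.isnan(donor)
    slope, intercept = np.polyfit(donor[observed], target[observed], deg=1)
    filled = target.copy()
    gaps = np.isnan(target) & ~np.isnan(donor)
    filled[gaps] = intercept + slope * donor[gaps]
    return filled


# Hypothetical series: the target is missing at the start and in the middle.
donor_series = [10.0, 20.0, 30.0, 40.0, 50.0]
target_series = [np.nan, 41.0, 61.0, np.nan, 101.0]
print(impute_from_donor(target_series, donor_series))
```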

Although some data are missing, some spatial and temporal analyses have already been carried out
on the available data. Figure 2 presents a histogram of annual average daily traffic (AADT).

Figure 2. AADT for measurement points in General Traffic Measurement in 2015

Traffic intensity turns out to be concentrated in a way that resembles the Pareto 80/20 rule: 67% of
traffic flows through 32% of measurement points. The distribution is right-skewed and
concentrated, and data variability is very high. Basic statistics of the spatial distribution are
presented below.

Table 3. Distribution of traffic intensity in General Traffic Measurement in 2015

Statistics | Value
Mean | 12818
Standard deviation | 12329
Coefficient of variation | 96%
Median | 9622
Kurtosis | 5,20
Skewness | 1,92
Minimum | 179
Maximum | 73937
N | 148
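
The statistics in Table 3 and the concentration figure quoted above can be computed along the following lines. The function names are hypothetical, and the use of population moments and excess kurtosis is an assumption, since the report does not say which estimators were used. The last lines only check that the coefficient of variation in Table 3 is consistent with the reported mean and standard deviation.

```python
from math import ceil
from statistics import mean, pstdev


def distribution_summary(values):
    """Mean, standard deviation, coefficient of variation, skewness and
    excess kurtosis, all based on population moments (an assumption)."""
    n = len(values)
    mu = mean(values)
    sigma = pstdev(values)
    m3 = sum((v - mu) ** 3 for v in values) / n
    m4 = sum((v - mu) ** 4 for v in values) / n
    return {
        "mean": mu,
        "std": sigma,
        "cv": sigma / mu,
        "skewness": m3 / sigma ** 3,
        "excess_kurtosis": m4 / sigma ** 4 - 3,
    }


def top_share(values, fraction):
    """Share of the total carried by the busiest `fraction` of points."""
    ordered = sorted(values, reverse=True)
    k = ceil(fraction * len(ordered))
    return sum(ordered[:k]) / sum(ordered)


# Consistency check using the values reported in Table 3.
cv_table3 = 12329 / 12818
print(f"Coefficient of variation: {cv_table3:.0%}")
```

Applied to the AADT values of the measurement points, `top_share(values, 0.32)` would give the kind of concentration figure quoted in the text (67% of traffic through 32% of points).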

Basic temporal analysis reveals that most of the time series behave regularly. Simply checking the
R-squared of a linear trend model, we found that more than 65% of the Continuous Traffic
Measurement time series of relevant length have an R-squared greater than 0,5. This allows us to
produce reasonable forecasts for these time series. On the other hand, we discovered that some time
series show a large drop (level shift), probably connected with changes in the traffic network.
Examples of well-behaved and ill-behaved time series are presented below.
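
The screening step described above can be sketched as follows: fit a linear trend to each series, compute its R-squared, and report the share of series above the threshold. The function names and the two example series are hypothetical; the sketch assumes that no series is constant.

```python
import numpy as np


def linear_trend_r_squared(series):
    """R-squared of a linear trend y = intercept + slope * t."""
    y = np.asarray(series, dtype=float)
    t = np.arange(len(y))
    slope, intercept = np.polyfit(t, y, deg=1)
    residuals = y - (intercept + slope * t)
    ss_res = float(np.sum(residuals ** 2))
    ss_tot = float(np.sum((y - y.mean()) ** 2))
    return 1.0 - ss_res / ss_tot


def share_with_good_fit(series_list, threshold=0.5):
    """Share of series whose linear-trend R-squared exceeds the threshold."""
    scores = [linear_trend_r_squared(s) for s in series_list]
    return sum(score > threshold for score in scores) / len(scores)


# Hypothetical series: one steadily growing, one oscillating without trend.
examples = [[1, 2, 3, 4, 5], [1, -1, 1, -1, 1]]
print(share_with_good_fit(examples))
```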

[Figure 2 axes: AADT (bins from 179 to 67791); absolute frequency (0 to 60); cumulative distribution (0,00% to 120,00%)]


Figure 3. Examples of time series of traffic intensity

There are a few crucial steps in building a model of traffic intensity which will be performed in our

use case:

Data imputation;

Connecting traffic intensity variables with exogenous variables;

Including distance matrix or adjacency matrix to improve data coherence;

Modelling level shifts;

Building a prior traffic intensity based on General Traffic Measurements;

Building a posterior traffic intensity based on Continuous Traffic Measurements from several
sources by applying data combination methods;

Temporal disaggregation of yearly data to quarterly data.
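
The report does not say how level shifts will be modelled. As one possible starting point, the sketch below locates a single level shift by finding the split at which the mean before and the mean after differ most. The function name, the minimum segment length and the example series are assumptions for illustration.

```python
def find_level_shift(series, min_segment=2):
    """Return (index, shift): the split that maximises the absolute
    difference between the mean from `index` on and the mean before it.

    Returns (None, 0.0) if the series is too short or completely flat.
    """
    best_index, best_shift = None, 0.0
    for i in range(min_segment, len(series) - min_segment + 1):
        before = sum(series[:i]) / i
        after = sum(series[i:]) / (len(series) - i)
        shift = after - before
        if abs(shift) > abs(best_shift):
            best_index, best_shift = i, shift
    return best_index, best_shift


# Hypothetical yearly traffic intensities with a drop after the fourth value.
print(find_level_shift([100, 102, 101, 103, 60, 61, 59, 62]))
```

Once the shift is located, the series could be adjusted by the estimated shift or modelled with a step dummy, depending on the method eventually chosen.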

Agriculture

Agriculture is one of the sectors of the economy; its main task is to provide agricultural products.
Plant and livestock products are obtained through tillage, plant breeding and animal husbandry.
Agriculture also has a strong impact on the environment. In recent decades the agricultural sector
has changed considerably, and recent research in the sector has produced data at different stages
of agricultural production. These data can be processed and analysed, contributing to increased
efficiency, productivity and better use of resources.

The free, full and open data policy adopted for the Copernicus programme makes Sentinel data
products (see footnote 6) available to all users via a simple pre-registration. Registration is open to
all users through an online self-registration accessible via the Sentinels Scientific Data Hub. Member
States requiring data for national initiatives in the frame of the Sentinels Collaborative Ground
Segment need not register on this service; they are served via a dedicated access point.

Following registration, the user can immediately download Sentinel products generated
systematically from all acquired data. Note that, depending on the mission and the acquisition time
of the product, the full operational qualification may not yet be completed.

6 Source: https://sentinel.esa.int/web/sentinel/sentinel-data-access

[Figure 3 axes: AADT (0 to 16000) by year, 2006 to 2015; series: well-behaved, ill-behaved]


Satellite data have the potential to be important for this domain, for example for generating
agricultural maps. Because satellites pass over the same areas at constant time intervals, satellite
images allow changes in the field situation to be monitored. The main applications of satellite data
in the agriculture domain are monitoring of crop conditions, seasonal changes and soil properties,
and mapping of tillage activities.

Moreover, satellite data enable us to monitor changes in agricultural production or soil quality and
support policy for sustainable development. Agricultural maps based on satellite images provide
independent and objective estimates of the extent of cultivation in a given country or growing
season.

The use of aerial and satellite images for mapping land cover and identifying land use change is one
of the most advanced and most widely used methods of remote sensing. The most commonly used
methods are:

methods based on computer-aided interpretation of the types of land cover in high resolution
satellite images;

methods relying on semi-automatic classification of the types of land cover in high resolution
satellite images, using advanced techniques for identifying classes based on object classification
(using spectral, textural and contextual attributes of objects).

Object-oriented methods are also used to classify land use in microwave images, which is currently
of great importance given the free Sentinel-1 satellite data of the Copernicus programme.

Scientific work focuses on the creation and improvement of methods for processing and classifying
different types of remote sensing data, and is carried out in a number of research centres. In
particular, it concerns establishing optimal methods for the classification of multispectral optical
images and radar images. Different approaches to data classification are tested, at both pixel and
object level (spectral mixture analysis, decision tree analysis, neural network analysis, morphological
image analysis, multi-fractal analysis), in order to optimise the classification results. In parallel,
studies are conducted on combined methods that use optical and radar images to classify elements
covering the surface of the earth. Classification algorithms are developed continuously, depending
on needs.

An extremely important aspect of all work involving satellite data is their calibration by "in situ"
measurements.


3. Issues encountered

3.1 General issues

Planning and related issues

The planning of the milestones and deliverables of SGA-1 is given in the form of a Gantt chart in

Annex II of the agreement of SGA-1.

Has the planning been realised? By now all milestones and deliverables have been realised, but in a

few instances there was some minor delay. The only case where the delay was more substantial was

WP 5. Milestone 5.3, the organisation of a meeting with Telecom-providers, did not take place before

but after the summer of 2016. This had to do with the time needed for preparation, in particular

ensuring the participation of relevant Telecom-providers. The work package leader, the ESSnet co-

ordinator and the Eurostat project manager of the ESSnet together reached the conclusion that

holding the meeting after summer would yield a far better result, and they agreed on postponing the

meeting accordingly. As a consequence, deliverable 5.1 (a report on the current status of data access

to mobile phone data in the ESS) was produced with a delay, as was the case with deliverable 5.2 (a

report with recommendations), but both deliverables were realised before the end of SGA-1. The

output of WP 5 was not affected, only the timing – and for good reasons.

More details on the realisation of the planning of the specific work packages will be provided in the

Final Report on the Implementation of the Action, due 60 days following the closing date of the

action.

The issues that occurred in carrying out the action of SGA-1 for the specific work packages are

discussed in section 3.2. There were no major cross-cutting issues, and the CG meetings proved to be

effective in solving all matters that affected multiple work packages.

Use of the Sandbox⁷

The Sandbox is a shared platform for storage and computation of big data, hosted and managed by

ICHEC (Irish Centre for High-End Computing). The Sandbox is one of the outcomes of the HLG Big

Data project, carried out in 2014 and 2015 and facilitated by UNECE. The use of the Sandbox as a

training and collaboration platform was successfully tested during the project. After the end of

the project, an agreement was reached with ICHEC to grant the use of the Sandbox to organizations on a

subscription basis. The main characteristic of the Sandbox is the possibility to share datasets and

work with big data tools without any software installation and configuration. The tools included in

the Sandbox are accessible remotely from any computer simply through a web browser and do not

require special software or hardware.

A subscription to ICHEC for the use of the Sandbox was acquired by the ESSnet project. The

subscription gives all project participants the possibility to create an account in the Sandbox

to upload/download datasets and use the tools for analysis.

7 This section was kindly written by Antonio Virgillito, who is the Sandbox officer for the ESSnet.

A special section in the project wiki is dedicated to the Sandbox, with instructions on access and use

and documentation for all the tools:

https://webgate.ec.europa.eu/fpfis/mwikis/essnetbigdata/index.php/Sandbox

In the following we describe the ways the Sandbox has been used in SGA-1:

WP 2. Software for scraping web sites was uploaded to the Sandbox by members of WP 2 and is

available for all the users.

WP 3. Members of WP 3 requested the use of the Sandbox to experiment with the analysis of smart

meter data. Synthetic smart meter data sets were used during the HLG project but were removed

after the project ended and were no longer available. A search for public data on the topic therefore

led to an open data set published by the Australian government. The dataset was uploaded directly

to the Sandbox and is now available for all users. It contains 350M real observations of electricity

consumption readings of 80,000 anonymized domestic customers over a time range of 3 years.

Examples of the use of the smart meter datasets for analysis and visualization were developed

during one of the ESTP courses on big data held in 2016 and are available in the project wiki.

WP 4. Members of WP 4 uploaded to the Sandbox a sample of the datasets concerning ship

positioning data. The dataset is available for all the users and it represents six months of

observations of positions of ships in European waters. Examples of the use of the AIS datasets for

analysis and visualization were developed during one of the ESTP courses on big data held in 2016

and are available in the project wiki.

WP 5. Access to the UNECE Sandbox was requested in order to have a view of such a platform as a

potential model to store and process data in-situ within the MNOs' own premises.

3.2 Issues at the level of the work packages

1 Webscraping / Job Vacancies

The overriding issue for WP 1 is how to address the various technical challenges of using web scraped

job advertisements to produce concrete and meaningful outputs. This is summarised in Section 2.1.

Some of the issues encountered within this pilot are country specific. For example, Sweden has faced

an on-going problem in getting permission from their legal department to undertake web scraping.

Germany has a specific issue in gaining access to job vacancy survey micro data that is held by

another government department. The approach here has been to focus on other areas where

progress is possible. For example, in Sweden, the focus has been on analysing data obtained through

the Swedish Employment Agency. In Germany, a comparison of job portal and job vacancy survey

data was made based on aggregated data.

Some of the more general issues encountered within WP 1 are listed below:

Staffing

Some countries involved in WP 1 have experienced difficulties in retaining staff, in particular data

science specialists, who have been difficult to replace. The countries affected are Sweden, which lost a

key team member in May, and the UK, where three team members left during 2016. Germany also

lost a key subject matter expert. Thus, the experience is that it is difficult to recruit and retain staff

with strong data science skills.

Working across environments

For the most part, the experimental web scraping activities within this work package have been set

up in open environments, separate from existing production environments, which require higher

levels of security. This has created some practical difficulties when looking to combine data collected

from different environments, i.e. web scraped job vacancy data with survey or business register data.

If the web scraped data is moved into the secure environment, then the range of data science

analytical tools is often limited. If the data is to be combined in a less secure environment then

additional processing on the secure data may be needed. This additional processing could include

removal of variables and/or the creation of temporary unique identifiers to enable some processes

(e.g. matching on company name) to be done in a less secure environment using open source

machine learning tools with the processed data then being moved back into the secure environment

to incorporate the secure elements (e.g. the survey responses). This issue is not a show stopper, but

it does underline that the IT systems of NSIs are not yet geared up to combine and process data from

different sources using open source big data tools.
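
The workflow described above (removing variables, creating temporary unique identifiers, matching on company name in the less secure environment, and then moving the result back) can be illustrated with a minimal sketch. It is not the procedure used by any WP 1 partner: the field names (unit_id, company_name, vacancies, company, ad_id), the function names and the similarity threshold are all assumptions made for illustration, and simple string similarity from the Python standard library stands in for the open source machine learning tools mentioned in the text.

```python
import secrets
from difflib import SequenceMatcher


def prepare_for_open_env(secure_records, keep=("company_name",)):
    """Secure side: drop sensitive variables and attach a temporary random id.

    Returns the rows that may leave the secure environment and a key map
    (temp_id -> real unit_id) that must stay inside it.
    """
    key_map = {}
    export = []
    for rec in secure_records:
        temp_id = secrets.token_hex(8)  # temporary identifier, unrelated to unit_id
        key_map[temp_id] = rec["unit_id"]
        export.append({"temp_id": temp_id, **{k: rec[k] for k in keep}})
    return export, key_map


def normalise(name):
    """Lower-case a company name and strip simple punctuation and extra spaces."""
    return " ".join(name.lower().replace(".", " ").replace(",", " ").split())


def match_on_name(export, scraped, threshold=0.85):
    """Open side: link each scraped advert to the most similar company name."""
    links = []
    for ad in scraped:
        best_id, best_score = None, 0.0
        for row in export:
            score = SequenceMatcher(
                None, normalise(ad["company"]), normalise(row["company_name"])
            ).ratio()
            if score > best_score:
                best_id, best_score = row["temp_id"], score
        if best_score >= threshold:  # hypothetical cut-off, to be tuned in practice
            links.append({"temp_id": best_id, "ad_id": ad["ad_id"]})
    return links


def reattach(links, key_map):
    """Secure side again: translate temporary ids back to the real unit ids."""
    return [{"unit_id": key_map[l["temp_id"]], "ad_id": l["ad_id"]} for l in links]


# Hypothetical example data
secure_records = [
    {"unit_id": "U1", "company_name": "Acme Shipping Ltd.", "vacancies": 3},
    {"unit_id": "U2", "company_name": "Nordic Data AB", "vacancies": 1},
]
scraped = [
    {"ad_id": 10, "company": "ACME Shipping Ltd"},
    {"ad_id": 11, "company": "Completely Different"},
]

export, key_map = prepare_for_open_env(secure_records)
links = match_on_name(export, scraped)
result = reattach(links, key_map)
```

In this sketch only the temporary identifier and the company name leave the secure environment; the survey response (here the hypothetical vacancies variable) and the real unit identifier never do, and the link between them is restored only after the matches have been moved back.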

CEDEFOP

WP 1 has a good relationship with CEDEFOP and reached an earlier agreement to share data from

their pilot system. However, the analytical dataset lacked sufficient date and company information

that is needed to validate against survey data. CEDEFOP have now launched the next phase of the

web scraping project and so there is an opportunity to develop a deeper partnership with the Big

Data ESSnet. WP 1 participants met with CEDEFOP at a satellite meeting set up as part of the

ESSnet dissemination workshop in Sofia, in February 2017. This was followed by a partnership

agreement between Eurostat and CEDEFOP to help facilitate collaboration.

Work package expansion

For SGA-2, four new partners (France, Belgium, Denmark, Portugal) will join WP 1 bringing the total

number of partners to ten. While this creates the opportunity to spread knowledge more widely and to tap

into a wider pool of expertise, it also creates additional logistical challenges. Some of the new

partners joining during SGA-2 have already undertaken some work so there will be opportunities to

share information. However, there is still a need to “push deep” and focus efforts on producing the

meaningful concrete results that are proving elusive.

2 Webscraping / Enterprise Characteristics

There were two main issues in carrying out WP 2 work:

Issue 1: Legal and ethical limitations in accessing and storing data scraped from enterprises'

web sites.

Issue 2: Coordination of the pilots.

Concerning issue 1, the countries most affected (as detailed in Deliverable 2.1) were the UK and

Sweden. The impact on the development of pilots implementing the use cases was as follows:

Sweden worked only on the Job Vacancy use case, especially as a contact point with WP 1.

The UK worked on selected use cases (1, 2, 3), focusing especially on the analysis phase rather than

on the web scraping one (at massive scale).

The legal landscape in Poland is also not completely defined, though this did not have a major impact on

the pilots.

Concerning issue 2, there was a problem of coordinating the different pilots with each other. In

this respect, it was decided that:

The ICT Survey can be considered a benchmark; however, there are differences at country level.

Volunteering countries can use software developed by other countries, if there are resources

for doing that. So far, BNSI volunteered to use Istat’s software.

A “logical architecture” for the pilots was shared, so that even if the pilots do use different

software, it is possible to classify and compare the different solutions.

3 Smart Meters

The biggest issues the partners have faced during the project are related to getting access to data,

delays caused by the lack of knowledge and experience with big data, and project members

leaving/changing.

Data access

The most important issue has been getting access to data. At the beginning of the project Statistics

Estonia and Statistics Denmark had access to some data, Statistics Sweden was having discussions to

get the data and Statistics Austria did not plan to try to get access, for various reasons. The aim was

that during the project Statistics Denmark would get access to newer and more detailed data and

Statistics Sweden would get access to some test data.

During 2016 Statistics Sweden had several meetings with the data owners and during the meetings

and discussions common concerns and interests were recognized on which to build further

cooperation. Agreement was reached on getting test data at the beginning of 2017.

Statistics Denmark has had access to the aggregated annual consumption data. Access to detailed

electricity consumption data was achieved at the end of 2016 when a contract between the

datahub owner and the office was agreed on. As the access to detailed data was achieved after the

report was delivered, it was decided not to update the report with regard to the data handling part

when the new relevant information became available. These changes will be made during the second

part of the project.

Lack of skills

Although Statistics Estonia has had access to data from the beginning of the project, the office did

not have suitable IT infrastructure to store and handle the large amount of data. Nor did the

office have the knowledge to install and set up the system. To obtain suitable infrastructure to store

the data, several servers were rented from our IT partner using the resources foreseen by this

project. It took three months to get servers with a suitable configuration. However, a far bigger

problem was getting the new software installed and running, as the necessary competence was not available in

the office. The ESTP courses on big data, which two specialists from Statistics Estonia

attended, were of some help. In the end, with the help of an outside consultant the system was set up and at the

moment, the most common analyses can be carried out on the data. It takes some time to learn to

use the new system, but the issue that still needs to be solved is acquiring the skills to maintain and

upgrade the system, so more advanced tools could be added.

Personnel

Some delays in the work were caused by the change of personnel in the offices. Two members (one

from Statistics Sweden and one from Statistics Denmark) in this work package have been replaced by

new people during 2016.

4 AIS Data

Staffing issues

Very shortly after the first face-to-face meeting in May 2016 our team member from GUS (Dominik

Rozkrut) got another job and left our work package. It took until the beginning of September before

there were new participants from GUS available for this work package. Also one of the participants

from CBS in this work package (Maarten Pouwels) left this project in November 2016, because of

other priorities in his work. This was not a big issue, because Tessa de Wit could take over his tasks in

a very short time. Tessa had already been involved in this project since the beginning of September

2016.

Getting AIS data from EMSA

We have still not succeeded in getting free European AIS data from EMSA. We work together with

Eurostat on this topic. Together with Eurostat we visited DG Move in Brussels in July 2017 to explain

why we need this data. The High Level Steering Group must now decide whether AIS data from EMSA will be made

available to our work package. If we do not get access to the EMSA data we cannot deliver

deliverable 4.7 of SGA-2, but it would not have any consequences for the other deliverables in this

work package.

5 Mobile Phone Data

There is no issue of relevance to be mentioned during the execution of SGA-1 for WP 5.

Access to the UNECE Sandbox has been requested but not for data storing and data processing

purposes. The aim was to have a view of such a platform as a potential model to store and process

data in-situ within the MNOs' own premises.

Regarding the consultancy agreement with Positium, the technical assistance during the workshop in

Luxembourg was conducted according to the initial budget. Concerning the two reports requested

regarding both the access and the processing of mobile phone data, the first report was produced

and received in time to be used extensively in the composition of deliverable 1.2. The second report

concerning methodology for the processing of data was satisfactorily received some time afterwards

and will play a relevant role in the SGA-2. Furthermore, Positium was invited to the first physical

meeting of partners of the WP 5 in relation to diverse issues regarding both the reports and the

access to mobile phone data from MNOs. This technical assistance was also part of the original

budget of SGA-1.

6 Early Estimates

In the proposal for WP 6 we were very ambitious in the sense that we planned to carry out two

pilots which would give us "quick wins". Besides the pilot on nowcasting turnover indices,

the WP 6 team also tried to work on a pilot whose aim was to test the possibility of estimating

sentiment indicators such as the Consumer Confidence Index (CCI) using data from social media

(Facebook, Twitter, ...). However, early findings showed that access to a sufficient

amount of data was impossible, so we abandoned this idea. Access to historical data was also

a problem. The same issue occurred during the implementation of the pilot on early economic indicators,

where we did not have access to time series of big data sources.

For SGA-2, other issues are IT infrastructure and IT tools in case of access to big data sources. In the

SGA-1 period the WP 6 team focused more on the “big data methods” rather than big data sources

due to obvious problems with the access to those sources.

7 Multi Domains

There is no issue of relevance to be mentioned during the execution of SGA-1 for WP 7. Access to the

Sandbox was not necessary at this stage of work. For experimental purposes the team are using their

own server with dedicated software running on Apache Spark and Apache Hadoop. At the moment

they are using the Python language for pilot surveys.

Annex: Communication and Dissemination

1 Summary

WP 9, Dissemination, focuses on the internal and external communication of the ESSnet Big Data.

The major activities in SGA-1 consisted of creating and maintaining a Mediawiki collaboration and

communication platform which also serves as public extranet, mirrored on the CROS Portal and

complemented by a restricted-access project website; supplying project participants with training,

support and assistance in using the collaboration and dissemination tools; and preparing the SGA-1

Dissemination Workshop on 23-24 February 2017 in Sofia (Bulgaria).

2 Introduction: targets, objectives and tasks

The groups and individuals targeted by WP 9 include, in order of importance:

the 22 partners participating in the ESSnet Big Data, in particular the ones active in SGA-1;

non-participating ESS NSIs as well as European and international organisations involved in big

data initiatives;

other organisations or individuals interested in official statistics based on big data, as well as

the media and the general public.

The objectives of WP 9 tailored to these target audiences are:

ensuring the exchange of information on tasks, planning, timing, processes, intermediate

products and final results among pilot project participants;

providing all necessary tools, datasets and training to project participants in an easily

accessible way;

disseminating results, as best practices, to the ESS and the wider community of official

statistics, and in the appropriate formats to anyone else potentially interested.

The concrete tasks to achieve these objectives are:

building a collaboration and communication platform for ESSnet partners to post or consult

information, collaborate on outputs and comment or discuss processes and outputs;

one or more external websites presenting project outputs, customised to the different

categories of users (the target audiences identified above);

training, support and assistance, either individually or via tutorials and documentation, on

the use of the platform and its tools, the preparation of reports and deliverables (via

templates) and the dissemination of results via the ESSnet Big Data websites or other

channels such as articles, publications, presentations at conferences, …;

a general dissemination workshop on results and lessons learned at the end of SGA-1.

3 Collaboration and communication platform/extranet

A Mediawiki website https://webgate.ec.europa.eu/fpfis/mwikis/essnetbigdata/ has been created

and is being expanded gradually and continuously.

It serves two functions:

collaboration and communication platform for project participants: project information and

backgrounds, contact information (later moved to restricted-access site), resources and

tools;

extranet, presenting outputs in a well-structured way to anyone outside the project:

backgrounds, documentation, public reports and deliverables.

4 CROS Portal mirror site

On the CROS Portal a mirror site https://ec.europa.eu/eurostat/cros/content/essnetbigdata_en

using Confluence was created, providing access to the same content as the Mediawiki site (without

the ‘special’ project pages). This was achieved differently for pure navigation pages and content

pages: navigation pages at the two highest hierarchical levels were duplicated in the CROS Portal, but

the links to content pages in them lead to the Mediawiki content pages. This has the double

advantage of being manageable (navigation pages change infrequently) and maintaining only one

version of the ‘truth’.

5 Restricted-access project website

A restricted-access project website

https://webgate.ec.europa.eu/fpfis/wikis/pages/viewpage.action?spaceKey=EstatBigData&title=ESSnet+Big+Data

using Confluence was created to store all confidential project information (e.g.

personal or financial data). This site is linked to from the central Mediawiki site but persons not

specifically granted access cannot view its content.

6 Training, support and assistance

Project participants were given the needed training, support and assistance to use the Mediawiki

collaboration and communication platform, either individually or via tutorials and documentation

created in the website (see

https://webgate.ec.europa.eu/fpfis/mwikis/essnetbigdata/index.php/Category:Tutorial) and via

templates for reports and deliverables. Assistance is also available for the dissemination of results via

the websites or other channels such as articles, publications or presentations at conferences.

7 Dissemination Workshop

The results of SGA-1 were presented at a Dissemination Workshop on 23-24 February 2017 in Sofia (BG), with an attendance of about 80 persons (see https://webgate.ec.europa.eu/fpfis/mwikis/essnetbigdata/images/d/db/Dissemination_Workshop_2017_02_23-24_Sofia_Minutes.pdf)