ESSnet Big Data
Specific Grant Agreement No 1 (SGA-1)
https://webgate.ec.europa.eu/fpfis/mwikis/essnetbigdata
https://ec.europa.eu/eurostat/cros/content/essnetbigdata_en
Framework Partnership Agreement Number 11104.2015.006-2015.720
Specific Grant Agreement Number 11104.2015.007-2016.085
Work Package 0
Co-ordination
Deliverable 0.2
Final Technical Report
Final version 2017-08-31
ESSnet co-ordinator:
Peter Struijs (CBS, Netherlands)
[email protected]
telephone : +31 45 570 7441
mobile phone : +31 6 5248 7775
Prepared by:
Martin van Sebille (WP 0, CBS, Netherlands) Peter Struijs (WP 0, CBS, Netherlands)
Nigel Swier (WP 1, ONS, United Kingdom) Monica Scannapieco (WP 2, ISTAT, Italy)
Maiki Ilves (WP 3, EE, Estonia) Anke Consten (WP 4, CBS, Netherlands)
David Salgado (WP 5, INE, Spain) Boro Nikic (WP 6, SURS, Slovenia)
Anna Nowicka (WP 7, GUS, Poland) Marc Debusschere (WP 9, SB, Belgium)
Contents
Executive summary
1. Introduction
1.1. Background
1.2. General approach
1.3. Organisation
2. Results of the work packages
2.1. Webscraping / Job Vacancies
2.2. Webscraping / Enterprise Characteristics
2.3. Smart Meters
2.4. AIS Data
2.5. Mobile Phone Data
2.6. Early Estimates
2.7. Multi Domains
3. Issues encountered
3.1. General issues
3.2. Issues at the level of the work packages
Annex: Communication and dissemination
Executive Summary
This is deliverable 0.2, the Final Technical Report, of the Specific Grant Agreement No 1 (SGA-1) of
the Framework Partnership Agreement (FPA) for the ESSnet Big Data. The FPA, which has 22
partners, covers the period from January 2016 to May 2018. SGA-1 covers the period from February
2016 to July 2017, on which this deliverable reports. This includes the period covered by deliverable
0.1, the Intermediate Technical Report.
The ESSnet has organised the core of its work around seven work packages, each work package
dealing with one pilot and a concrete output. The pilots cover five phases: (1) data access, (2) data
handling, (3) methodology and technology, (4) statistical output, and (5) future perspectives. SGA-1
covers only some of the five phases for each of the work packages, the rest being covered by SGA-2.
These are the main results obtained in SGA-1:
WP 1 Webscraping / Job Vacancies
For SGA-1, this work package focussed mainly on job portals. Since this pilot involves each country
taking its own specific approach, there are a lot of country specific results. However, general
selection criteria have been identified for targeting portals for scraping. Taking into account the
distinction between job vacancy and job advertisement, a conceptual model is proposed of how on-
line job advertisements correspond to the target population. In practical terms this may be defined
as all vacancies that are available to be measured by existing job vacancy surveys. As well as
providing a conceptual framework for understanding the coverage of job vacancies from on-line
sources and how these relate to the measurement of all job vacancies, this approach may also
provide the conceptual basis for an estimation framework, including an approach to data integration.
Furthermore, the work package has identified an opportunity to work with the European Centre for
the Development of Vocational Training (CEDEFOP).
WP 2 Webscraping / Enterprise Characteristics
Six use cases have been identified in the pilot: (1) enterprise URLs inventory, (2) e-commerce in
enterprises (about predicting whether or not an enterprise provides web sales facilities on its
website), (3) job vacancy ads on enterprises’ websites, (4) social media presence on enterprises’
webpages, (5) sustainability reporting on enterprises’ websites (linked to the UN Sustainable
Development Goals), and (6) relevant categories of enterprises’ activity sector (NACE) aimed at
checking or completing statistical business registers. A common use case template was developed
and has been used. For the use cases, a total of sixteen pilots were performed and all of them were
mapped to a general “logical architecture”. Also, a report was produced on legal aspects related to
web scraping of enterprise websites, aimed at showing the real possibilities for the NSIs to perform
activities of web scraping. These appear to be generally favourable, although the situation differs
from country to country.
WP 3 Smart Meters
This pilot has investigated data access and data handling of smart meters electricity data. It has
carried out a literature study and a survey on access to smart meters data, which was sent to the
NSIs of all EU member countries in the spring of 2016, with 18 responses. It appeared that only two
countries currently have access to data: Denmark and Estonia. Several countries were aware of
substantial legal barriers. Some countries such as Poland are in the process of drawing up legislation
that will enable smart meters data use. For two countries, Estonia and Denmark, the pilot has
defined and assessed the quality of smart meter electricity data, and a synthetic dataset was
analysed as well, aimed at generating demo output and developing and testing statistics and
algorithms for situations where linkage to enterprise or household characteristics is necessary.
WP 4 AIS Data
The work package investigates whether real-time measurement data of ship positions (measured by
the so-called AIS-system) can be used to improve the quality and internal comparability of existing
statistics and for new statistical products relevant for the ESS. Reports were produced on (1) the
possibilities and pitfalls of creating a database with AIS-data for official statistics, (2) deriving harbour
visits and linking data from maritime statistics with AIS-data, and (3) sea traffic analyses using AIS-
data. While the possibility of using AIS data from EMSA is still being investigated, AIS data from
Dirkzwager was used, and the data quality analysed. Visualisations were made, showing the coverage
of the ships by the data, and showing the path of a ship through time. A method to build a reference
frame of maritime ships was developed. First results show that AIS data can be used as a backbone
for maritime statistics. This is important, since the added value of running a pilot with AIS-data at
European level is linked to the fact that the source data is generic worldwide and data can be
obtained at European level.
WP 5 Mobile Phone Data
This work package has focused exclusively on data access during the SGA-1, which will be needed for
SGA-2. A preliminary analysis of the issues regarding the access to mobile phone data was made,
which was the basis for the design of a questionnaire surveying the status of this access across the
ESS. Belgium, Finland, France, and Italy were found to have succeeded in their negotiations to have
access to a concrete mobile phone data set that can be used for SGA-2 (together with those of UK,
Netherlands, and Germany, as new partners for the second phase). Spain and Romania are still in
contact with mobile network operators (MNOs) in pursuit of this goal. A workshop was held in Luxembourg to bring together mobile
network operators, national statistical offices, Eurostat and other stakeholders, including some other
international organizations (UN, OECD, ITU, DG Connect, DG Digit). Finally, with the technical
assistance from the Estonian company Positium, which is an international expert in accessing and
processing mobile phone data for statistical purposes, a set of guidelines for the access to these data
has been produced with technical, business and practical recommendations for partners of the ESS.
WP 6 Early Estimates
The aim of this pilot is to investigate how a combination of multiple big data sources and existing
official statistical data can be used in order to create existing or new early estimates for statistics. A
list was compiled of possible data sources and the statistical domains where they could be employed,
and it was decided that the most promising and interesting ones concerned combining sources for
early estimates on economic indicators. The economic indicators and possible sources were further
specified. In this context two pilots were conducted, one by Statistics Finland and one by the
Slovenian NSI. The relationship between GDP and Slovenian traffic sensor data was investigated, and
Statistics Finland produced nowcasts of turnover indices. These results were used for a business case
for the research of SGA-2.
WP 7 Multi Domains
The aim of this pilot is to find out how a combination of big data sources, administrative data and
statistical data may enrich current statistical output. Three statistical domains are being investigated:
(1) population, (2) tourism/border crossings and (3) agriculture. For population, three areas are
looked at: daily (life) satisfaction, the mood of the population associated with public events (e.g., Brexit,
voting), and morbidity areas (e.g., flu). For tourism/border crossings, a number of possible data
sources have been identified and investigated, for instance with regard to traffic intensity
information. For agriculture, the focus is on recognizing crop types based on satellite data.
1. Introduction
1.1 Background
This is deliverable 0.2, the Final Technical Report, of the Specific Grant Agreement No 1 (SGA-1) of
the Framework Partnership Agreement (FPA) for the ESSnet Big Data. The FPA stretches from January
2016 to May 2018. SGA-1, and thus this report, covers the period from February 2016 to July 2017. A
second SGA (SGA-2) covers the period of January 2017 till the end of the FPA. This means that SGA-1
and SGA-2 have a time overlap from January till July 2017, but this report does not include activities
of SGA-2.
This Final Technical Report builds on deliverable 0.1 of SGA-1, the Intermediate Technical Report, of
January 2017, and can be seen as an extension. In fact, much of its contents is the same, as several
work packages ended soon after the Intermediate Technical Report was written. However, contrary
to the Intermediate Technical Report, it does not have an annex with an overview of the activities
carried out by the work packages, as this will be included in the Final Report on the Implementation
of the Action, due 60 days following the closing date of the action. For the same reason, this report
does not comprise an evaluation of the budget needed and used.
The overall objective of this ESSnet is to prepare the ESS for integration of big data sources into the
production of official statistics. The FPA is founded on a consortium of 22 partners, consisting of 20
National Statistical Institutes (NSIs) and two Statistical Authorities. For SGA-1, all but two NSIs have
been involved as beneficiaries of the agreement, so SGA-1 was carried out by 18 partners.
For SGA-1 as well as SGA-2, the consortium has organised the core of its work around a number of
work packages, each work package (WP) dealing with one pilot and a concrete output. In SGA-1
there are seven work packages, focused on specific sources or domains:
1. WP 1 Webscraping / Job Vacancies
2. WP 2 Webscraping / Enterprise Characteristics
3. WP 3 Smart Meters
4. WP 4 AIS Data
5. WP 5 Mobile Phone Data
6. WP 6 Early Estimates
7. WP 7 Multi Domains
A separate work package, WP 0, was created for the co-ordination of the ESSnet. For dissemination a
separate work package was created as well, WP 9. That work package is also responsible for
facilitating communication. Given the overall objective, the findings need to be generalised. This will
be done in SGA-2, for which a new work package is added, WP 8 (methodology, but also covering
other overarching aspects, such as IT).
SGA-1 specifies the agreed outputs of the work packages, and its inputs, both in terms of number of
days contributed by partner and work package and in terms of material costs. For SGA-1, the total
budget available is one million euro, but only 90% of costs, as a maximum, will be reimbursed. (The
same budget and percentage apply to SGA-2.)
For more specifics on the FPA, SGA-1 and SGA-2 reference is made to the actual agreements. For the
current deliverable it is useful to mention that an overview of the milestones and deliverables of
SGA-1 can be found on page 10 of the signed version of Annex II of SGA-1, an overview of the
distribution of manpower (by partner and work package) is given on page 43, and an overview of the
foreseen physical meetings on page 44 of the same document. For each work package the document
(Annex II of SGA-1) provides a description of tasks, specifying, among other things, the tasks to be
carried out, the milestones and deliverables, and the number of days each partner contributes to the
work package. A specification of the budget is given in Annex III of SGA-1.
The remainder of this chapter describes the approach generally taken to the pilots, and the way the
ESSnet has organised itself. The next chapter presents the results obtained so far, for the seven work
packages. The third chapter describes the issues encountered so far in the action, at a general level
and for the work packages, and also provides an outlook for SGA-2 for each of the work packages.
1.2 General approach
The pilots, as foreseen in the FPA, have one thing in common: they cover the complete statistical
process, from data acquisition to the production of statistical output. In addition, and in accordance
with the general objective to prepare the ESS for the integration of big data sources into the
production of official statistics, the pilots also consider future perspectives. Thus, all pilots recognise
the following five phases:
1. Data access
2. Data handling
3. Methodology and technology
4. Statistical output
5. Future perspectives
The tasks, milestones and deliverables of the work packages refer to these phases. However, SGA-1
covers only some of the five phases for each of the work packages, the rest being covered by SGA-2.
And the phases covered by SGA-1 are not the same for each pilot (work package), as for some areas
it was possible to plan ahead further (in time and phases) than for other areas. In particular, WP 5
concentrated on data access problems in SGA-1 and could not plan further ahead, as data processing
would depend on the results of the efforts to realise data access. Therefore, WP 5 was planned to
end in December 2016, whereas the other work packages would continue into 2017. For WP 6 and
WP 7, a longer exploration period was needed for the first two phases, therefore they were planned
to end in February 2017. This explains the overlap in time of SGA-1 and SGA-2.
At a practical level, this approach needed support in several respects. First of all, an
organisational approach was needed to ensure that the agreed output would be produced with the
resources foreseen. This is the subject of the next section, 1.3. In order for the partners of the ESSnet
to be able to process big data, some IT facilities were considered necessary, although these would
probably not be needed at the beginning of the work of the work packages, when data access had to
be arranged first. IT facilities were ensured by subscribing to the so-called Sandbox in Ireland. This is
explained further in section 3.1. Facilities were also needed for communication, in order to share and
work on documents together and for virtual meetings, among other things. This is the subject of the
Annex to this report.
1.3 Organisation
The organisation has been carried out as foreseen in the agreement of SGA-1. Each work package
has a work package leader who is in charge of organising the realisation of the milestones and
deliverables of the work package. This includes the organisation of virtual and physical meetings of
the work package members. The results of the work packages are described in chapter 2.
At the level of the ESSnet as a whole, the main instrument for co-ordination is the monthly virtual
meeting of the work package leaders, including WP 0 and WP 9, supported by the secretary (Martin
van Sebille) provided by WP 0. These are called the meetings of the co-ordination group, or CG
meetings. Eighteen virtual meetings were held during SGA-1. One big physical co-ordination meeting
was held in Tallinn, Estonia, in June 2016, for which a separate report is available on the wiki of the
ESSnet:
https://webgate.ec.europa.eu/fpfis/mwikis/essnetbigdata/images/8/85/Minutes_WP0_20160613-15_Tallinn_meeting.pdf.
That meeting was attended by almost all partners of the ESSnet. A smaller
physical co-ordination meeting was held in Brussels, Belgium, in November 2016, for which a
separate report has been made available on the wiki of the ESSnet:
https://webgate.ec.europa.eu/fpfis/mwikis/essnetbigdata/index.php/Coordinating_Group_2016_11_17-18_Brussels.
In addition, a dissemination workshop was held in Sofia in February 2017 for a
wider audience, in which the main results of the ESSnet achieved up till then were presented and
discussed. Again, a separate report for the meeting is available on the wiki of the ESSnet:
https://webgate.ec.europa.eu/fpfis/mwikis/essnetbigdata/images/d/db/Dissemination_Workshop_2017_02_23-24_Sofia_Minutes.pdf
The aim of the virtual CG meetings and the physical co-ordination meetings was to stay in control of
the realisation of SGA-1. For most CG meetings the partners were asked to provide information on
the realisation of the foreseen budget in the form of a spread sheet, which was consolidated by the
secretariat (WP 0), thereby enabling the CG to link the progress in producing results to the resources
actually spent. The meetings were also used, of course, to discuss cross-cutting issues. In addition to
the work package leaders, the virtual CG meetings and other co-ordination meetings were also
attended by the Eurostat project manager of the ESSnet, Albrecht Wirthmann.
In order to ensure the quality of the deliverables of the ESSnet, a Review Board was created,
consisting of three members: Lilli Japec (chair), Anders Holmberg and Piet Daas (who is also the
leader of WP 8 of SGA-2). All work package leaders have arranged that their deliverables were
reviewed by the Review Board. This has worked well, both in practical terms (planning etc.) and in
terms of contents (usefulness of the reviews): all deliverables of SGA-1 have been reviewed and the
comments have been taken into account. The members of the Review Board are also invited to the
CG meetings.
The organisational arrangements of the ESSnet are considered to be quite adequate. There is no
need to make any changes to these arrangements in SGA-2.
2. Results of the work packages
2.1 Webscraping / Job Vacancies
The aim of this pilot is to demonstrate by concrete estimates which approaches (techniques,
methodology etc.) are most suitable to produce statistical estimates in the domain of job vacancies
and under which conditions these approaches can be used within the ESS. The focus is on feasibility
and the pilot explores different sources including job portals, job adverts on enterprise websites, and
other sources (e.g. government employment agency data and commercial data providers).
The business justification for this pilot is that on-line job advertisements can provide much more
detailed and timely information about job vacancies than official job vacancy surveys. In particular
they contain information about the types of jobs (e.g. occupation, associated skills) and where in a
country those jobs are advertised. Therefore, these data could provide valuable additional
information about the labour market for policy making.
For SGA-1, the work package is focussed mainly on job portals. For SGA-2 the intention is to explore
the potential of capturing vacancies from enterprise websites using approaches developed by WP 2.
This together with legal issues has involved close collaboration with WP 2 (led by Italy).
The work package is led by the United Kingdom with support from Germany, Greece, Italy, Slovenia
and Sweden. Italy has observer status to help coordination with WP 2. The planned SGA-1 activities
of the work package have been grouped into the following high level tasks:
1. Data Access: This task includes preparing an inventory of relevant job portals in each participating
country, qualitative assessment of the information available, review of job vacancy statistics,
coverage assessment and conceptual analysis in comparison with current definitions, and feasibility
of accessing data from third party sources.
2. Data Handling: This task includes studying the technical and legal aspects of web scraping job
portals, designing web scraping experiments, exploring web scraping
technologies, executing data collection, including collection over sustained time periods, and quality assurance of
third party data sources. The aim is to test the feasibility of web scraping, but not to produce
production ready systems.
3. Methodology for Output Production: This task is focused on the processing steps required to
transform semi-structured web data from job portals into a structure suitable for analysis. This
includes data cleaning, evaluation and treatment of missing data, de-duplication of job adverts (both
within and across job portals), linking data with survey/business register data, coding and classifying
data, quality assessment of structured data and presentation of experimental results.
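As an illustration of the de-duplication step listed under task 3, the following is a minimal sketch, assuming that re-posted adverts agree on job title, employer and location once case, punctuation and whitespace are normalised. The field names, portal names and sample records are hypothetical and are not taken from any of the country pilots, which may have used different matching rules.

```python
import re
from collections import OrderedDict

def normalise(text):
    """Lower-case, replace punctuation with spaces and collapse whitespace."""
    text = re.sub(r"[^\w\s]", " ", text.lower())
    return " ".join(text.split())

def advert_key(advert):
    """Comparison key built from fields assumed to agree for re-posted adverts."""
    return (normalise(advert["title"]),
            normalise(advert["employer"]),
            normalise(advert["location"]))

def deduplicate(adverts):
    """Keep the first advert seen for each key and record every portal carrying it."""
    unique = OrderedDict()
    for advert in adverts:
        key = advert_key(advert)
        if key not in unique:
            unique[key] = dict(advert, portals=[advert["portal"]])
        elif advert["portal"] not in unique[key]["portals"]:
            unique[key]["portals"].append(advert["portal"])
    return list(unique.values())

# Hypothetical sample: the first two records are the same job on two portals.
adverts = [
    {"portal": "portal_a", "title": "Data Analyst", "employer": "Acme Ltd.", "location": "Leeds"},
    {"portal": "portal_b", "title": "data analyst ", "employer": "ACME Ltd", "location": "Leeds"},
    {"portal": "portal_a", "title": "Nurse", "employer": "City Hospital", "location": "Leeds"},
]
unique_adverts = deduplicate(adverts)
```

Exact matching on normalised fields only catches near-identical re-postings; adverts that are reworded between portals would need fuzzy matching on the description text, which this sketch does not attempt.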
Deliverables
Three deliverables were produced during SGA-1:
1.1 Inventory and Qualitative Assessment of Job Portals (Completed July 2016)
The first deliverable aimed to establish criteria for evaluating job portals to help inform decisions as
to which job portals should be targeted for web scraping. This was considered particularly important
for large countries such as Germany and the UK, where internet job portals are so numerous that
scraping all of them is infeasible. Therefore, selection criteria are needed to identify
which portals should be prioritised for web scraping.
1.2 Interim Technical Report (completed November 2016)
The Interim Technical Report focused on the initial work on the data access and data handling tasks
covering the technical issues of web scraping and working with web scraped data. This incorporated
the results on the individual pilots executed within each country along with the work of the first two
virtual sprints covering de-duplication and matching. This also included a review of definitional
issues, in particular, the differences between the target concept of a job vacancy and the target
measure of an on-line job advertisement.
1.3 Final Technical Report (completed July 2017)
The Final Technical Report elaborated on the work of the first report covering the work completed by
each country up until the end of SGA-1 and also included results of the virtual sprint on quality
frameworks.
In addition, WP 1 provided some input into the WP 2 deliverable:
2.1 WP2 report on Legal Issues Relating to the Scraping of Enterprise websites (February 2017)
Main findings
Since this pilot involves each country taking different approaches, there are various country specific
results that are difficult to summarise in this report. The following is a summary of the main findings
from the WP 1 pilot during the SGA-1 period.
On-line job advertisements are part of a complex data eco-system typically involving many job
portals of different types that are both competing and cooperating with other portals. The job portal
market is characterised by evolving business models and changing market shares. The complexity of
this ecosystem varies greatly between countries with Germany having over 1600 job portals and
Slovenia having about 30 (and only 2 main ones). Advertisements are typically posted and
republished multiple times and so duplication is a major challenge when compiling a definitive set of
advertised job vacancies from multiple on-line sources. These sources may also include jobs
advertised on a company’s own website.
Although we do not yet have a full grasp of the proportion of on-line jobs in relation to all job
vacancies as measured by official job vacancy surveys, it is clear that some jobs are not advertised
on-line, that the degree of on-line penetration varies between countries, and that coverage has likely
been increasing over time. This presents fundamental statistical challenges in terms of how these
data may be used for measuring both the level of job vacancies and the change over time.
There are also important differences between the target concept of a job vacancy and the target
measure, which is a job advertisement. Apart from a vacancy being advertised multiple times, a
single advertisement may also be used to fill more than one vacancy. In addition some
advertisements, known as ‘ghost vacancies’, may exist without a real underlying vacancy. Some on-
line jobs may only be advertised on either a job portal or an enterprise website. Some jobs
advertised through job portals may identify the employing business whereas others only identify the
employer.
Figure 1. Conceptual model for measuring job vacancies from on-line sources.
In addition, most job vacancies will exist for longer than the period for which they are advertised - a
vacancy needs to be created before it can be advertised and then normally takes some time to fill
after the advertisement has closed. Therefore, there is usually a lag from the time a job is first
advertised to the time it is filled. The Slovenian pilot estimated this to be about 45 days. These
factors mean that it is difficult to directly compare on-line data with official job vacancy survey
estimates.
The web scraping tools used within the pilot include simple “point and click” tools (e.g. Import.io) as
well as more sophisticated web scraping frameworks (e.g. Scrapy for Python, Selenium). Point and click
approaches have proved suitable for countries with small data volumes but more sophisticated approaches
would be needed for production systems involving larger volumes of data.
However, the much greater technical challenge is around transforming raw web scraped data into
data ready for analysis. On-line job advertisements usually consist of a small number of structured
elements and a larger amount of text containing the full job description. However, even the
structured elements can be messy and considerable effort is required to clean and classify data prior
to analysis. In addition, job portals usually have their own taxonomies for classifying data. There are
also legal and ethical issues around web scraping to consider and while many of these are common
between NSIs, legal uncertainty around web scraping means that NSIs have followed advice from
their own legal departments.
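The step from raw scraped markup to a record ready for cleaning can be sketched as follows. This is a minimal illustration using only the Python standard library; the HTML snippet and the class names (job-title, job-location, job-description) are invented for the example, since every portal uses its own markup, and it is not the code used in any of the pilots.

```python
from html.parser import HTMLParser

class AdvertParser(HTMLParser):
    """Collect text from elements whose class attribute names a field of interest."""

    # Hypothetical mapping from a portal's CSS classes to output field names.
    FIELDS = {
        "job-title": "title",
        "job-location": "location",
        "job-description": "description",
    }

    def __init__(self):
        super().__init__()
        self.record = {}
        self._current = None

    def handle_starttag(self, tag, attrs):
        css_class = dict(attrs).get("class")
        if css_class in self.FIELDS:
            self._current = self.FIELDS[css_class]

    def handle_endtag(self, tag):
        # Simplification: any closing tag ends the current field.
        self._current = None

    def handle_data(self, data):
        text = " ".join(data.split())  # collapse line breaks and repeated spaces
        if self._current and text:
            previous = self.record.get(self._current, "")
            self.record[self._current] = (previous + " " + text).strip()

raw_html = """
<div class="advert">
  <h2 class="job-title">  Senior Statistician </h2>
  <span class="job-location">Ljubljana</span>
  <p class="job-description">Analyse survey data.
     Experience with R or Python required.</p>
</div>
"""
parser = AdvertParser()
parser.feed(raw_html)
record = parser.record
```

The structured elements (title, location) come out as short fields, while the description remains free text that still has to be coded and classified, which is where most of the effort described above lies.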
For all these reasons, NSIs (and particularly those in larger countries) should be aiming to build
partnerships with job portal owners, or others with access to job vacancy data, rather than looking to
build their own large scale web scraping systems. WP 1 has identified an opportunity to work with
the European Centre for the Development of Vocational Training (CEDEFOP), which is currently undertaking an EU-wide project
to web scrape job vacancy data for all EU member states. A partnership is being developed with
CEDEFOP with a view to coordinating activities and in the longer term we expect that the data from
this initiative will become available to the wider ESS.
Thus, a key conclusion is that it is better for NSIs to focus on activities where they can add value
rather than simply replicate what CEDEFOP is already doing (and some commercial companies have
already done). One specific area where NSIs have a strategic advantage is in relation to official job
vacancy surveys and how they might be used to better understand on-line job vacancy data. This
involves matching survey reporting units with the on-line data and comparing counts from different
on-line sources, including both portals and enterprise websites. This approach has its own challenges
around matching survey reporting units with on-line advertisements. However, this approach can
form the basis for better understanding issues of data quality.
It is clear that the gap between on-line job advertisements and what is measured through on-line
surveys means that on-line data could not replace the existing surveys. However, both sources could
be used together with the survey providing control totals with the on-line data providing additional
granularity. An outline of how this could work is presented in Figure 2. There may also be scope to
use the more timely information available from on-line sources to produce nowcasts of job vacancy
rates that could be used to inform economic policy.
Figure 2. Outline Approach to Data Integration
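One simple way to read the idea of survey control totals combined with on-line granularity is proportional allocation: within each industry, the survey total is distributed over occupations according to their shares in the on-line data. The sketch below assumes this particular method, which the report does not prescribe; the industries, occupations and counts are invented for illustration.

```python
from collections import defaultdict

def allocate_to_control_totals(survey_totals, online_counts):
    """Scale on-line counts within each industry so that they sum to the survey total."""
    online_by_industry = defaultdict(float)
    for (industry, _occupation), count in online_counts.items():
        online_by_industry[industry] += count

    estimates = {}
    for (industry, occupation), count in online_counts.items():
        if industry not in survey_totals or online_by_industry[industry] == 0:
            continue  # no control total, or nothing on-line to distribute
        share = count / online_by_industry[industry]
        estimates[(industry, occupation)] = survey_totals[industry] * share
    return estimates

# Hypothetical survey control totals (vacancies per industry).
survey_totals = {"construction": 200, "health": 500}
# Hypothetical de-duplicated on-line advert counts by industry and occupation.
online_counts = {
    ("construction", "electrician"): 30,
    ("construction", "site manager"): 10,
    ("health", "nurse"): 120,
    ("health", "care assistant"): 80,
}
estimates = allocate_to_control_totals(survey_totals, online_counts)
```

This allocation implicitly assumes that the occupational mix of on-line adverts within an industry matches that of all vacancies, which is exactly the kind of coverage assumption that matching survey reporting units with on-line data would help to test.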
Concrete results
In terms of producing concrete results, Slovenia is the most advanced and some results are expected
early in SGA-2. Slovenia is a small country with a relatively small number of job portals and job
vacancies to be counted and so the problem is more manageable than for larger countries. For
example, it is estimated that 95% of all jobs advertised on-line are captured in the two largest
Slovenian job portals, which account for about 25% of all Slovenian job vacancies. Administrative data on
all public sector jobs increase the overall coverage to 54%. Some capture and linkage of vacancies
listed on enterprise websites has already been done which suggests the coverage could be increased
further.
However, it is clear that producing meaningful concrete outputs remains challenging, especially for
larger countries. This can be illustrated by comparing a time series of total advertised vacancies
published by two large UK job search engines - Adzuna and Indeed - with the monthly, non-
seasonally adjusted estimates from the ONS JVS (Figure 3). This shows that all three series are quite
different both in terms of their levels and trends over time. Adzuna shows the highest level of
vacancies and has growth in vacancies during 2015 that is not seen in the other series. The Indeed
figures are lower than the ONS but closer than Adzuna and the long term trend is more similar.
However, neither Adzuna nor Indeed has the seasonal pattern that is apparent from the JVS series.
Figure 3. Total job vacancies by selected sources (2013-2016)
Some of these differences may partly be explained by definitional differences. The stock of current
job vacancies as measured by the ONS JVS is very different to the average number of live job
advertisements over the month as measured by Adzuna. The large difference between the totals for
the two job portals clearly implies their definitions are very different. Another important definitional
issue is that the JVS does not include workers employed directly by employment agencies although
many on-line job ads may be for such positions and it is often impossible to distinguish between
these based on information contained in the advertisement. The proposed approach of matching
survey and on-line data should help in that it provides a basis for understanding these differences
between sources for individual enterprises.
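The definitional difference between a stock and an average of live advertisements can be made concrete with a small sketch. It assumes, for illustration only, that the stock is counted on a single reference day and that each advert is described by its first and last live date; the dates are invented and do not come from the ONS, Adzuna or Indeed.

```python
from datetime import date, timedelta

def live_on(adverts, day):
    """Number of adverts whose posting period includes the given day."""
    return sum(1 for start, end in adverts if start <= day <= end)

def average_live(adverts, year, month):
    """Mean daily number of live adverts over a calendar month."""
    day = date(year, month, 1)
    counts = []
    while day.month == month:
        counts.append(live_on(adverts, day))
        day += timedelta(days=1)
    return sum(counts) / len(counts)

# Hypothetical adverts as (first live day, last live day).
adverts = [
    (date(2016, 6, 1), date(2016, 6, 10)),
    (date(2016, 6, 5), date(2016, 6, 30)),
    (date(2016, 6, 20), date(2016, 7, 15)),
]
stock_on_reference_day = live_on(adverts, date(2016, 6, 3))
monthly_average = average_live(adverts, 2016, 6)
```

With the same three adverts, the count on the reference day and the monthly average differ, so two sources can report different levels even when they observe identical postings.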
Outlook
The primary focus for the remainder of this pilot (i.e., SGA-2) is to make as much progress as possible
on producing concrete and meaningful outputs using web scraped job vacancy data. As described in
section 3.2, there is a range of issues, including the complex and changing data ecosystem of on-line job
advertisements, duplication, messy and unstructured data, data access, legal issues, definitional
differences and lack of representativity.
We are clear that our main focus for SGA-2 should be on development of methods for producing
outputs from web scraped data rather than on the problems of scraping, cleaning and
classifying web data, which CEDEFOP is already tackling. However, as the availability of these data
is still some way in the future, most WP 1 partners are continuing to work on improving access to
other sources of data including both government and commercial job portals.
To achieve the primary goal of producing concrete outputs in the available time, our strategy is to
focus on how to take advantage of existing JVS data. The outline data integration model provides
a promising avenue for combining survey and on-line data.
On 21-22 September 2017, WP 1 participants will be meeting in Thessaloniki. This is also the location
of the head office of CEDEFOP and so this will provide an opportunity for CEDEFOP colleagues to join
the meeting, exchange information and discuss plans for collaboration for the remainder of the
ESSnet. Another planned action for SGA-2 is to develop proposals for an informal network to
continue after the end of the ESSnet.
As stated, the intention is that SGA-2 will also look further at the feasibility of using the approaches
developed by WP 2 (web scraping of enterprise websites) and the collection and analysis of job
adverts as a specific use case. The consensus is that this is most likely to be useful in terms of
identifying enterprises that currently have job vacancies. It will be difficult to obtain more detailed
information because of the technical difficulties of creating structured data from websites where the
structure is not known in advance.
In summary, there are substantial challenges ahead, but there is a commitment from WP 1 partners
to do the best job possible.
2.2 Webscraping / Enterprise Characteristics
The purpose of this work package is to investigate whether web scraping, text mining and inference
techniques can be used to collect, process and improve general information about enterprises.
Compared to the “webscraping / job vacancies” pilot, the challenges are scraping websites on a much
larger scale and collecting and analysing more unstructured data.
In particular, the aim is twofold:
1. to demonstrate whether business registers can be improved by using webscraping techniques
and by applying model-based approaches in order to predict for each enterprise the values of
some key variables;
2. to verify whether statistical outputs can be produced using predicted data, whether or not in
combination with other sources of data (survey or administrative data): in particular, a set of benchmark
estimates might be the ones produced by the survey on “ICT use by enterprises”, a survey
common to EU Member States.
The identified use cases are:
1. Enterprise URLs Inventory. This use case is about the generation of a URL inventory of
enterprises for the Business register.
2. E-Commerce in Enterprises. This use case is about predicting whether or not an enterprise
provides web sales facilities on its website.
3. Job vacancies ads on enterprises’ websites. This use case is about investigating how enterprises
use their websites to handle the job ads.
4. Social Media Presence on Enterprises’ webpages. This use case aims to provide information on the
presence of enterprises on social media.
5. Sustainability reporting on enterprises’ websites. One of the Sustainable Development Goals
targets set up by the UN is to encourage enterprises to produce regular sustainability reports
highlighting the sustainability actions taken. In order to measure the companies’ response to
this, NSIs look at what companies publish on their official website and track changes over time.
6. Relevant categories of Enterprises’ activity sector (NACE). Aimed at identifying relevant
categories of Enterprises’ activity sector from enterprises’ websites to check or complete
Business registers.
Use case 3 is particularly useful for WP 1 to understand if the enterprises’ websites can be used as
information channels for WP 1.
WP 2 has six participating countries, namely Bulgaria, Italy (leader), Netherlands, Poland, Sweden
and the United Kingdom. The main results are described below.
Report on Legal Aspects
The WP 2 participants delivered Deliverable 2.1, “Legal aspects related to Web
scraping of Enterprise Web Sites”, on time.
The deliverable reviewed the EU and Member States legislation on the regulation of official statistics,
in particular in areas such as Personal data protection, Copyright protection and Database protection,
in order to understand the real possibilities for the NSIs to perform activities of web scraping - on
small or large scale - on the websites of enterprises.
The deliverable shows that all six Member States involved in WP 2 on web scraping / Enterprise
Characteristics have a Statistical Law that guarantees the NSIs similar prerogatives with regard to data
access and data processing.
Laws on personal data protection, database protection and copyright protection have been
ratified in almost all countries. In addition to these legislative measures, some countries also
mentioned the development of ethical codes (Italy, UK, Netherlands).
In general, the picture that emerges is fairly favourable to data collection through web scraping of
enterprise websites, but it varies by country. It ranges from the most favourable situations
(Italy, Netherlands, Bulgaria) to those with greater uncertainty (UK, Poland, Sweden).
As to the challenges and recommendations from NSIs' legal departments on possible alternative
approaches to achieve the agreement with enterprises on web scraping of their websites on a large
scale, different degrees of intervention appear to be necessary at country level.
A code of Netiquette for web scraping for official statistics has been proposed by NL and UK and
shared by all the partners.
Report on Methodological and Technological Issues and Solutions
The WP 2 participants delivered Deliverable 2.2, “Methodological and IT issues
and solutions related to Web scraping of Enterprise Web Sites”, on time.
Some figures on the performed work are:
- Six different use cases were identified, which are the ones listed above.
- Four use cases out of the six were selected to be demonstrated by specific pilots.
- Sixteen different pilots were implemented, namely: six for use case 1 (with Bulgaria
  implementing two pilots with two different technologies), four for use case 2, three for use case
  3 and three for use case 4.
A detailed use case definition was carried out by first sharing a use case template. Participant
countries filled the template concerning the use cases they are involved in. All the use cases specified
according to the template are available on the project wiki.
The main findings of the technical work performed by WP 2 can be summarized as follows:
- The complex pipeline for processing data scraped from enterprises’ websites has been defined in
  detail and shared among the participants. This pipeline can be considered a reference to
  which specific technological and methodological choices can be mapped. A set of logical building blocks
  has been identified for each phase of the pipeline.
- From a methodological perspective, both deterministic and machine learning methods were used
  in the pilots. On the one hand, we learned that good results can be achieved even with different
  methods. On the other hand, we saw that in some cases methods can converge
  (e.g. the URL retrieval pilot, where Italy, Bulgaria and the Netherlands applied the same
  methodology). Predicted values can be used for a twofold purpose: (i) at unit level, to enrich the
  information contained in the register of the population of interest; (ii) at population level, to
  produce estimates. The issue of measuring the quality of data at unit level was
  addressed in the piloting phase. For instance, when machine learning methods are employed,
  quality can be measured by the same indicators that are produced to evaluate
  the model fitted on the training set. Under given conditions (if the training set is representative of
  the whole population), the accuracy (and also other indicators such as sensitivity
  and specificity) calculated on this subset can be considered a good estimate of the overall
  accuracy. The issue of measuring the quality of population estimates that make use of predicted
  values has also been in focus. However, specific solutions are still under investigation.
- From an IT perspective, performance is a key issue, especially when downloading and processing
  whole websites. Processing unstructured information is very CPU- and memory-intensive,
  especially with machine learning algorithms, and as a result not very efficient. Sustainability
  is also a relevant issue: because big data tools and website technology change very frequently,
  tools need to be developed in an agile way. For
  storage, the possible choices are a file system (CSV, JSON etc.), a NoSQL database (Solr,
  Cassandra, HBase etc.) or a relational database (MySQL, PostgreSQL, SQL Server etc.). The decision
  on the particular data storage solution should be taken according to the volume and
  the type of data to be stored. Finally, although the frameworks were developed in particular
  countries, they can be applied in other countries without any major changes. For
  instance, URLSearcher, developed by Istat, was also tested on Bulgarian and Polish websites.
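The unit-level quality indicators mentioned above (accuracy, sensitivity and specificity) can be sketched as follows. This is a minimal illustration, not the code used in the pilots: the function name and the example labels are hypothetical, and the labels stand for a binary variable such as "enterprise offers web sales" (1) or not (0) on a labelled subset.

```python
def classification_quality(actual, predicted):
    """Accuracy, sensitivity and specificity for binary labels (1 = yes, 0 = no)."""
    pairs = list(zip(actual, predicted))
    tp = sum(1 for a, p in pairs if a == 1 and p == 1)  # true positives
    tn = sum(1 for a, p in pairs if a == 0 and p == 0)  # true negatives
    fp = sum(1 for a, p in pairs if a == 0 and p == 1)  # false positives
    fn = sum(1 for a, p in pairs if a == 1 and p == 0)  # false negatives
    return {
        "accuracy": (tp + tn) / len(pairs),
        # Share of actual positives that the model recovers.
        "sensitivity": tp / (tp + fn) if (tp + fn) else float("nan"),
        # Share of actual negatives that the model recovers.
        "specificity": tn / (tn + fp) if (tn + fp) else float("nan"),
    }


if __name__ == "__main__":
    # Hypothetical labelled subset: known values against model predictions.
    actual = [1, 1, 1, 0, 0, 0, 0, 1]
    predicted = [1, 1, 0, 0, 0, 1, 0, 1]
    print(classification_quality(actual, predicted))
```

As the text notes, these figures only estimate overall quality if the labelled subset is representative of the whole population.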
Some indicators can be computed as outputs of the developed pilots, and can be considered as
experimental statistics. These include:
- Rate(s) of retrieved URLs from a list of enterprises.
- Rate(s) of enterprises engaged in e-commerce from enterprises’ websites.
- Rate(s) of enterprises that have job advertisements on their websites.
- Social media presence, in terms of both (i) Rate(s) of enterprises that are present on social media
  from their websites and (ii) Percentage of enterprises using Twitter for a specific purpose.
To implement the pilots, specific software tools were developed and made available
through the project wiki, the Sandbox and GitHub.
2.3 Smart Meters
The aim of the smart meters pilot study is to demonstrate the use of data from electricity meters for
production of official statistics. Smart meters are devices that can be read remotely and that
can measure, for example, electricity, water or gas consumption at a high frequency. This kind of data can be of
use for statistics on energy use and production, and it can also be relevant as an additional source for
calculating census housing statistics, household costs, or impact on the environment. The pilot study
addresses theoretical and practical issues and covers data access, processing and linking,
as well as the calculation and visualization of statistics.
Although smart meters are currently deployed in a few countries only, they will be made available in
several countries by 2020.
The pilot is carried out by members from four national statistical offices: Statistics Austria, Statistics
Denmark, Statistics Estonia, and Statistics Sweden. Statistics Estonia is responsible for coordinating
the work in this work package.
During the pilot study the following tasks were carried out:
Task 1 Data access. Description of the current status and future perspectives regarding the
availability of smart meters in the partner countries.
Task 2 Data handling. Description of the available data (structure, records, attributes), description of
the IT environments and assessment of the quality of input data.
Task 3 Methodology and techniques used to analyse the data. This task includes linking smart
meters data with other data sources and development of the methodology to produce
electricity consumption statistics about businesses, households and dwellings. Countries that
do not get access to the real data would produce synthetic data on which different
estimation methods can be tested and results compared.
Based on the results of the three tasks mentioned, two reports were delivered. The first report
covered the results of tasks 1 and 2 and the second report covered the results of task 3.
This pilot study is not linked directly with other pilots in this project, but it shares issues
with them, such as methods, IT technologies and quality. WP 3 will provide input to the
methodology work package (WP 8) that will start in the second phase of the project and which will
summarize approaches to methodology and quality assessment when dealing with big data sources.
Dissemination of WP 3 results is part of the dissemination work package (WP 9).
The European Commission has proposed a deployment plan for smart electricity meters in the EU
Member States on the basis of economic assessments of long-term costs and benefits, and expects
a deployment rate of almost 72% by 2020. The use of smart meters raises privacy concerns
as, depending on the frequency of data collection, significant personal details about the lives and
private activities of customers can be revealed. Within the EU, consumer personal data is protected
by the EU's Directive on the processing of personal data. Most smart meters currently being installed
worldwide record electricity consumption data hourly, half-hourly or at 15-minute intervals. This can
provide a strong indication of, for example, occupancy, but has much less potential to reveal individual
appliance use.
A literature study identified three use cases in which smart meters data was used
in official statistics: determining occupancy of a dwelling, identifying household
composition, and testing different big data tools.
Expected improvements from smart meters data
Partners’ expectations from the smart meters data are the following:
- Improvements and cost reductions to existing statistics;
- Expansion of existing statistics in terms of quality and aggregation levels;
- Improved periodicity of electricity consumption statistics (from annual to quarterly, or even
  monthly);
- To use electricity data together with other sources to model electricity consumption
  by end use;
- To replace the current collection of data from businesses by on-line questionnaire with
  smart meter electricity data, i.e. to produce aggregated electricity consumption data according
  to the requirements of Regulation (EC) No 1099/2008 of the European Parliament and of the
  Council of 22 October 2008 on energy statistics, and to environmental statistics needs;
- To investigate the possibility of improving the dwelling statistics. The registered place of living
  does not necessarily agree with the actual place of living, and in order to estimate the
  number of vacant or temporarily occupied dwellings (e.g. summer houses), electricity consumption
  could be an important factor;
- Smart meters data is also of interest for other types of statistics, for example as input to
  price indices and for improving the Household Budget Survey (household cost for electricity);
- It could be possible to use smart meters data in combination with other data sources, such as
  mobile phone data. Day/night population estimates and statistics on residency are two
  possibilities.
Data access
A survey on access to smart meters data was sent to the NSIs of all EU member countries in the spring
of 2016, and there were 18 responses. Only two countries, Denmark and Estonia, currently have
access to data.
Several countries were aware of substantial legal barriers. It was unclear if market participants could
even share data with each other. Some countries such as Poland are in the process of drawing up
legislation that will enable smart meters data use.
In terms of data hubs, only one country mentioned that one was under construction (Norway).
Denmark and Estonia already receive data through central data hubs, and a hub is being planned in
Sweden.
Table 1. Smart meters data in EU countries

NSI                                         Plans to explore smart meters?   Legal obstacles   Data hub available
Sweden                                      Yes                              Yes               No
Norway                                      Yes                              No                No
Hungary                                     Yes                              No                No
France                                      Yes                              Yes               No
Lithuania                                   No                               No                No
Cyprus                                      No                               No                No
Bosnia and Herzegovina                      No                               No                No
Poland                                      Yes                              Yes               No
Belgium                                     No                               No                No
Germany                                     Yes                              Yes               No
Portugal                                    No                               No                No
Luxembourg                                  No                               NA                No
The former Yugoslav Republic of Macedonia   No                               NA                No
Denmark                                     Data received                    No                Yes
Estonia                                     Data received                    No                Yes
Austria                                     No                               Yes               No
Greece                                      Yes                              Yes               No
Spain                                       No                               No                No
Data handling
The data received by Statistics Estonia contains hourly recordings from 709 000 metering points and
amounts to 1.5 TB. The most important tables in the data are metering data, metering points,
agreements and customers, which together show who consumed electricity and where:
- metering_data: hourly information on the amount of produced and consumed electricity;
- metering_points: information about the location and type of the metering point (possible
  types are: remotely readable, single and dual tariff manually readable);
- agreements: information on when the electricity contract was signed/ended and what type of
  contract it is;
- customers: information about the private and legal persons who signed the contract.
Data with a different structure but similar content was delivered to Statistics Denmark by the
data hub owners. In both cases the data was delivered on an external hard disk.
Analysis of the 2014 Estonian data shows that 89% of all metering points belong to households and 11%
to businesses, and that 49% were remotely readable (smart) meters.
For the use of smart meters data for statistics, four additional steps are necessary:
1. Geocoding or normalizing the metering point addresses, so that linking with other sources
could be carried out;
2. Transformation of the variables (e.g. changing formats, coding);
3. Anonymization of personal data in the customers table;
4. Cleaning the data.
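The four steps above can be sketched on a single record as follows. This is a simplified illustration under stated assumptions: the field names, the secret key and the cleaning rule are hypothetical, string normalisation stands in for real geocoding against an address register, and keyed hashing is pseudonymisation rather than full anonymisation.

```python
import hashlib
import hmac
from datetime import datetime

# Hypothetical secret key; in practice it would be stored separately from the data.
SECRET_KEY = b"replace-with-a-secret-key"


def normalize_address(raw):
    """Step 1 (simplified): lower-case, drop commas, collapse whitespace."""
    return " ".join(raw.lower().replace(",", " ").split())


def pseudonymize(customer_id):
    """Step 3 (simplified): replace the identifier with a keyed SHA-256 hash."""
    return hmac.new(SECRET_KEY, customer_id.encode("utf-8"), hashlib.sha256).hexdigest()


def prepare(record):
    """Apply all steps to one raw record; return None if it fails cleaning."""
    # Step 2: transform variables (decimal comma to float, text to timestamp).
    try:
        kwh = float(record["kwh"].replace(",", "."))
        hour = datetime.strptime(record["hour"], "%Y-%m-%d %H:%M")
    except (ValueError, AttributeError):
        return None
    # Step 4: an assumed cleaning rule that rejects negative consumption.
    if kwh < 0:
        return None
    return {
        "address": normalize_address(record["address"]),
        "hour": hour,
        "kwh": kwh,
        "customer": pseudonymize(record["customer_id"]),
    }


if __name__ == "__main__":
    raw = {"address": "  Tartu mnt 5,  Tallinn ", "hour": "2014-01-01 13:00",
           "kwh": "1,25", "customer_id": "C-0001"}
    print(prepare(raw))
```

Because the hash is keyed and deterministic, the same customer receives the same pseudonym across tables, so linking remains possible after the identifier has been removed.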
An overview of the input data quality is given in Table 2.
Table 2. Assessment of the electricity data based on input quality indicators.

Quality indicator: Under- and overcoverage
  Estonian data: Around 50% of households and companies did not have smart meters by the end of 2014.
  The smart meters do not measure electricity produced for own consumption; they only measure
  purchased electricity. The total amount of unmeasured consumption is negligible compared to
  total consumption. 1526 metering points were excluded from the analysis because they did not
  represent actual end consumption, i.e. they recorded transfer or selling of electricity.
  Although these metering points formed only 0.2% of all metering points, the consumption recorded
  by them was 75.4% of total consumption recorded in the data hub.
  Danish data: The 2013 data contains readings for almost all Danish meters. It contains hourly
  readings for only a small subset of companies using more than 100,000 kWh a year. The
  undercoverage of smart meters cannot be estimated at the moment. There are no known examples of
  overcoverage in the core data, but there can be overcoverage in some subpopulations due to
  poor linking with other registers.

Quality indicator: Percent of units that fail checks
  Estonian data: The readings are expected to be in accordance with real consumption, as this is
  what is invoiced to customers. No significant errors were identified in the data.
  Danish data: The necessary quality checks are still in development.

Quality indicator: Percent of units that are adjusted
  Estonian data: There were two cases in which the end date of the grid agreement differed from the
  supply agreement's end date, and there were two duplicate agreements in the initial dataset.
  All the address information of metering points was normalized, and the corresponding address ids
  and address object ids were identified and stored in a separate table.
  Danish data: The metering data are not adjusted.

Quality indicator: Percent imputed
  Estonian data: No missing values were detected in the records.
  Danish data: The core meter data does not have unexplained missing values, but missing data can
  occur in linked data. No imputation is applied and data is handled in read-only mode.

Quality indicator: Periodicity
  Estonian data: Data are provided yearly at the moment; higher frequency may be possible in the
  future.
  Danish data: The possibility of monthly delivery is being examined.

Quality indicator: Delay
  Estonian data: As the network operators have three months for correcting data, the data are
  provided after that period.
  Danish data: Still unknown.

Quality indicator: Common units
  Estonian data: No duplicate records were discovered.
  Danish data: The data hub aggregates all the data from the different providers and handles
  conflicting data. Metadata on this process is not available to Statistics Denmark, but there are
  no indications that it concerns more than a minuscule proportion of records.
Partners not having access to real data are working with generated synthetic data. The objective of
the use of synthetic data is twofold:
1. Generate demo output with “realistic” results;
2. Test, scale and develop (new) statistics and algorithms where linkage to enterprise or
household characteristics is necessary.
Australian data was used as the basis for generating synthetic household electricity
consumption data.
Data linking
Two strategies were available for linking: linking by registry code or linking by
address information. The main challenge is to identify the end production and consumption of statistical
units. It is therefore crucial to identify the links between metering points in the smart meters dataset and
units in administrative registers, and to exclude all metering points not related to end consumption.
Data linking was carried out by two countries, Estonia and Denmark. For both countries it was
important to link statistical units by address id, and both had problems finding a
one-to-one match between the statistical unit and the observed unit, a metering point. In the Estonian
case study most of the problems were related to the quality of address information, but in the Danish case there
were also difficulties in identifying the periodicity of the reading and billing.
The main problems during the linking were:
- The quality of address information. Extracting a valid address id was the main problem in the
  Estonian case. In the Danish dataset a valid address id was used.
- Many metering points or many statistical units at the same address. This was a problem in both
  cases and reduced the rate of linking. In the Estonian case, metering points share
  an address id in 5% of cases.
- Identifying the actual consumer, as the information in the smart meters dataset identifies
  only the contract holder. This mainly affected businesses: of the 150 thousand
  businesses in the statistical business register, only 22 thousand could be matched.
- Own consumption of producers. There is a growing trend for end consumers to install
  their own electricity production units, and their own consumption is difficult to identify. The
  same applies to big industries and their own consumption. According to the dataset description,
  this may not be a problem in the Danish case.
- Apartment associations. These are registered as business entities but are apartment
  buildings. They are mostly used for living, but many small companies are also active at the
  same address.
- The quality of registers. The data in the Estonian register of buildings is not up to date with
  regard to several building characteristics and thus was not used in this project.
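Linking by address id, and the one-to-one matching problem described above, can be sketched as follows. The identifiers and the three-way split into unique links, ambiguous addresses and unlinked points are illustrative assumptions, not the procedure actually used by Statistics Estonia or Statistics Denmark.

```python
from collections import defaultdict


def link_by_address(metering_points, units):
    """Link metering points to statistical units that share an address id.

    metering_points: list of (point_id, address_id); address_id may be None.
    units:           list of (unit_id, address_id).
    Returns (one_to_one, ambiguous, unlinked).
    """
    points_at = defaultdict(list)
    units_at = defaultdict(list)
    unlinked = []
    for point_id, address_id in metering_points:
        if address_id is None:
            unlinked.append(point_id)  # no valid address id could be extracted
        else:
            points_at[address_id].append(point_id)
    for unit_id, address_id in units:
        units_at[address_id].append(unit_id)

    one_to_one, ambiguous = {}, {}
    for address_id, point_ids in points_at.items():
        unit_ids = units_at.get(address_id, [])
        if not unit_ids:
            unlinked.extend(point_ids)  # no unit registered at this address
        elif len(point_ids) == 1 and len(unit_ids) == 1:
            one_to_one[point_ids[0]] = unit_ids[0]
        else:
            # Several points and/or several units share the address.
            ambiguous[address_id] = (point_ids, unit_ids)
    return one_to_one, ambiguous, unlinked


if __name__ == "__main__":
    points = [("P1", "A1"), ("P2", "A2"), ("P3", "A2"), ("P4", "A3"), ("P5", None)]
    units = [("U1", "A1"), ("U2", "A2"), ("U3", "A4")]
    print(link_by_address(points, units))
```

Keeping the ambiguous addresses separate makes the linking rate explicit and leaves room for a later rule, for example allocating consumption among the units at an address.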
Producing statistics
There were three goals with regard to expected outputs. First, to assess whether current survey
based business statistics can be replaced by statistics produced from the electricity meter data,
second, to produce new household statistics and third, to identify vacant or seasonally vacant
dwellings.
The expected outputs for business statistics were final energy consumption statistics of businesses by
economic activity and by region, aggregated monthly, quarterly and annually. The goal was to find a
link between statistical units (businesses) and observed units (metering points) and to identify the end
consumption of businesses. Two linking strategies were used: first, linking business customers of
the Data Hub to the business register by registry key and identifying energy consumption by the
area of economic activity; second, using the address id to link the address of the business entity with
the address of the metering point and obtain an estimate of consumption. In the latter case two
variants were possible: using all metering points, or only those whose contract was held by
businesses. Before linking, we excluded many metering points related to open suppliers and other
network companies because their consumption was not end consumption. It was not
possible to exclude all the open suppliers from further analysis, as open suppliers were active in
many fields and their own consumption was significant.
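Once metering points are linked to businesses, the aggregation to monthly and quarterly totals by economic activity can be sketched as follows. The readings, the point-to-activity mapping and the activity code are invented for illustration; unlinked points are simply skipped here, which is an assumption rather than the pilot's documented treatment.

```python
from collections import defaultdict
from datetime import datetime


def monthly_by_activity(readings, activity_of):
    """Sum hourly kWh readings per (activity, year, month).

    readings:    iterable of (point_id, timestamp, kwh).
    activity_of: dict mapping linked point_id to an economic activity code.
    """
    totals = defaultdict(float)
    for point_id, timestamp, kwh in readings:
        activity = activity_of.get(point_id)
        if activity is None:
            continue  # metering point could not be linked to a business
        totals[(activity, timestamp.year, timestamp.month)] += kwh
    return dict(totals)


def to_quarters(monthly_totals):
    """Roll monthly totals up to (activity, year, quarter)."""
    quarterly = defaultdict(float)
    for (activity, year, month), kwh in monthly_totals.items():
        quarterly[(activity, year, (month - 1) // 3 + 1)] += kwh
    return dict(quarterly)


if __name__ == "__main__":
    readings = [
        ("P1", datetime(2014, 1, 1, 0), 1.5),
        ("P1", datetime(2014, 1, 1, 1), 2.5),
        ("P2", datetime(2014, 2, 1, 0), 4.0),
        ("P3", datetime(2014, 1, 1, 0), 9.0),  # not linked, so ignored
    ]
    activity_of = {"P1": "C", "P2": "C"}  # hypothetical activity code
    monthly = monthly_by_activity(readings, activity_of)
    print(monthly)
    print(to_quarters(monthly))
```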
Our conclusion after the analysis was that electricity smart meters data has potential as a source
for producing business statistics, but that the linking methodology has to be improved. Current
estimates in certain areas differ significantly from the survey data, as it is difficult to identify the real
consumer. On the other hand, the dynamics of electricity consumption could indicate changes in
economic activity by sector and could be used as an early warning indicator.
Regarding the household statistics, the main goal was to link households with the electricity smart
meters data and then identify how the energy consumption is related to household size, number of
rooms in a living place and other indicators available in the registries. This work was carried out on
Estonian and Danish data.
Figure 1. Yearly mean consumption by household size and type of living place: house (H) or apartment (K). Estonian data.
Although the linking quality was not very high because of the partial coverage of smart meters
and the quality of address data, the results indicated that there is potential to use smart meters data for
producing household statistics. The main potential of smart meters data lies in the possibility to
link it with different registers and thereby reveal new information that is otherwise unavailable to society.
Information about the average consumption by the household size, type of dwelling or any other
characteristics of the dwelling could be made available to the public as experimental statistics. This
gives an opportunity to the public to give feedback on the usefulness of this information as well as to
request some other information not yet available about the households' consumption.
The third output foreseen in this pilot was identification of vacant living spaces. Information
about vacant living spaces is a relevant housing statistic that can be used in the population and
housing census, in tourism statistics or by different industries, e.g. the real estate sector. It is of interest to
know how many dwellings are empty for a long time and how many are occupied seasonally. In
addition to producing statistics on vacant dwellings, the information can be used at dwelling level to
validate the information in the registers. Statistics Estonia is planning to carry out the next
population and housing census using only register information. The pilot census showed that one
of the largest quality problems is the accuracy of the address of the main residence: people
quite often do not live at the address given in the register. To overcome this problem, alternative
data sources are being sought. One possibility is to use mobile data, but electricity smart meters data
could also be beneficial. The work carried out for this output included drawing up a list of algorithms to
identify vacant dwellings and computing an indicator for each dwelling that refers to occupancy
either on a certain day or during some other specified time period. This information was used to
validate the household’s main residence information in the register.
Estimating occupancy or vacancy of a living space is not a simple task. Several methods were
suggested and tested on synthetic data. The most promising are the outlier detection method and
the random forest method. However, simpler methods were applied to the real data because of the lack of
data for a training set. The results are quite promising for those households that were included in
the analysis. For those, the results show that about 18% of households do not live at the
address the population register has for them. This is close to the estimate obtained
from survey data, which puts this figure at about 20%.
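The report does not specify which simpler methods were applied to the real data. As one possible illustration only, a threshold rule can flag a dwelling as vacant when nearly all of its daily consumption values fall below a small baseline. The threshold, the minimum share of low days and the example series below are hypothetical and are not taken from the pilot.

```python
def share_of_low_days(daily_kwh, threshold_kwh):
    """Share of days on which consumption is below the threshold."""
    if not daily_kwh:
        raise ValueError("no readings for this dwelling")
    low = sum(1 for kwh in daily_kwh if kwh < threshold_kwh)
    return low / len(daily_kwh)


def is_vacant(daily_kwh, threshold_kwh=0.5, min_share=0.9):
    """Flag a dwelling as vacant if at least min_share of days are low-consumption days."""
    return share_of_low_days(daily_kwh, threshold_kwh) >= min_share


def mismatch_share(registered_dwellings, threshold_kwh=0.5, min_share=0.9):
    """Share of dwellings registered as a main residence that are flagged vacant.

    registered_dwellings: dict mapping dwelling id to its list of daily kWh values.
    """
    flags = [is_vacant(series, threshold_kwh, min_share)
             for series in registered_dwellings.values()]
    return sum(flags) / len(flags)


if __name__ == "__main__":
    # Hypothetical daily consumption for five registered main residences.
    dwellings = {
        "D1": [0.1] * 10,               # standby consumption only
        "D2": [3.0] * 10,
        "D3": [0.2] * 8 + [4.0] * 2,    # occupied on two days
        "D4": [5.0] * 10,
        "D5": [2.0, 0.1] * 5,
    }
    print("share flagged vacant:", mismatch_share(dwellings))
```

A rule of this kind needs no training set, which matches the constraint described above, but its thresholds would have to be validated against known occupancy before use.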
Statistics Estonia plans to continue working with the electricity data and find more ways to use a
vacancy indicator. In an ideal case data could be used to not only validate the results but to improve
the main residence address information in the register based census.
One important outcome of this project is a quality assessment framework, which was used in this
report to evaluate all the outputs and can also be used in the future projects.
The main advantages of using the smart meters data are:
- Possibility to link it with other data sources and gain new knowledge;
- Data source can be used to validate or improve current survey based statistics;
- Data source could improve the speed of producing statistics and also increase the quality of
  regional statistics.
The main problems related to the data source are:
- The description of the data source is not detailed enough and metadata about the variables was
  often missing;
- Address information is not standardized and geocoding needs a lot of resources. The address is
  the key variable in the linking, and the quality of address information is crucial for the
  successful use of the data source;
- Without additional information it is difficult to identify which metering point records
  actual end consumption and which one records transfers of electricity;
- The observed unit (metering point in a building) does not match the statistical unit (business,
  household, dwelling). This leads to a loss of accuracy in the estimates, as consumption is
  assigned to the owner of the building and not to the actual consumer;
- Manual meter readings can cause a lot of work, as is the case in Denmark. The dataset
  contains the dates when the manual readings were reported but no information about the period
  the reported consumption refers to. However, this problem will disappear once
  smart meters have been installed everywhere;
- One metering point can correspond to many consumers (apartment unions, real estate
  companies renting rooms), so a methodology is needed to extract the
  consumption of a single consumer.
Outlook
To reveal the full potential of the data there is still work to be done. Some of this work includes
developing methods that improve the quality of the linking, using classification methods to identify
customer type (business or household) and combining different data sources to reveal new
information.
Ways to improve the quality and extend the use of the data are:
- Use machine learning algorithms to clean the address data.
- Produce statistics by using sub-sets of businesses that could be linked.
- After improving the address quality, conduct regional analyses of energy consumption.
- Use unsupervised classification algorithms to identify different types of electricity
  consumption patterns. This can be used to identify whether electricity is used for heating, for
  example.
- Develop a model to identify producers of electricity for own usage and model the amount of
  electricity used.
- Link the electricity data with economic activity data to see the correlation between economic
  activity and energy consumption.
- Link electricity data with weather data and the building register to identify the impact of weather
  on electricity consumption.
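The idea of using unsupervised classification to find types of consumption patterns can be sketched with a small k-means clustering of load profiles. The report does not name an algorithm; k-means is chosen here only as a familiar example. The four-block daily profiles and the initial centroids are hypothetical.

```python
def normalise(profile):
    """Scale a load profile so that its values sum to 1 (shape, not level)."""
    total = sum(profile)
    return [value / total for value in profile] if total else list(profile)


def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(profiles, initial_centroids, iterations=20):
    """Plain k-means: assign each profile to its nearest centroid, then update centroids."""
    centroids = [list(c) for c in initial_centroids]
    labels = []
    for _ in range(iterations):
        labels = [
            min(range(len(centroids)), key=lambda k: squared_distance(p, centroids[k]))
            for p in profiles
        ]
        for k in range(len(centroids)):
            members = [p for p, label in zip(profiles, labels) if label == k]
            if members:  # keep the old centroid if a cluster is empty
                centroids[k] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centroids


if __name__ == "__main__":
    # Hypothetical consumption per block: night, morning, afternoon, evening.
    raw_profiles = [
        [1, 2, 2, 5],  # evening peak
        [1, 2, 1, 6],  # evening peak
        [4, 2, 2, 2],  # night-heavy
        [5, 2, 1, 2],  # night-heavy
    ]
    profiles = [normalise(p) for p in raw_profiles]
    labels, centroids = kmeans(profiles, [profiles[0], profiles[2]])
    print(labels)
```

Normalising each profile first means the clusters reflect when electricity is used rather than how much. Interpreting a cluster, for example as electric heating, would still require external information.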
Foreseen activities during SGA-2
In the second part of the smart meters pilot study, the aim is to suggest potential uses other than the
ones analysed in SGA-1, especially when linked to other data sources (different registers, weather
information etc.). The expected output of this study is a list of potential outputs, suggestions on how to
visualize the results and an evaluation of the feasibility of producing the proposed outputs in the official
statistical system.
The second aim is to give recommendations and summarise lessons learned to other countries, so
they can apply them when starting to use smart meters data. When countries know what awaits
them, they can save time and resources when dealing with the data.
2.4 AIS Data
The aim of this work package is to investigate whether real-time measurement data of ship positions
(measured by the so-called AIS, the Automatic Identification System) can be used 1) to improve the
quality and internal comparability of existing statistics and 2) for new statistical products relevant for the ESS.
The added value of running a pilot with AIS data at European level is that the source data are generic
worldwide and data can be obtained at European level.
Methodological, quality and technical results of the work package, including intermediate findings,
will be used as inputs for WP 8 of SGA-2. When carrying out the tasks listed below, care is taken that
these results are stored for later use, using the facilities described under WP 9.
Task 1 – Data access
This task involves exploring the possibilities of collecting the data at European level. AIS data are
available for national territories and for the entire European territory. The aim of this task is to decide how
European data could be used for this project, and to investigate the possibilities of acquiring data
from EMSA (European Maritime Safety Agency), to be coordinated with Eurostat. The advantages of
using one AIS dataset for the entire European territory are a) a better comparison of international
traffic between the countries and b) more synergy, as all participating countries work on the same
dataset. A disadvantage is that these data are stored by private companies and handling fees have to
be paid.
Task 2 – Data handling
The aim of this task is to process and store the data in such a way that they can be used for multiple,
consistent outputs. Key questions for this task are: 1. Which programming language and environment
should be used for transformation? 2. Where will the data be processed? 3. How can we create
an environment that is easily accessible for all partners?
Task 3 – Methodology and Techniques
Develop traffic statistics: Linking with data from maritime statistics
AIS data may be linked to data from maritime statistics. The added value of this linking
is that the same reference population (= ship number) is used in all ports. As the
journeys and port visits of ships can be derived from AIS, this linking also provides the ESS with information
about the origin and destination of the cargo. The aims of this task are to build a reference frame of ships
in European waters, to find out how data from maritime statistics can be linked to AIS data and to check
whether this information improves the quality of current statistical outputs.
Traffic analyses
The number of ships during a certain time interval at certain coordinates (such as inland waterways or
certain points at sea) can be calculated from AIS data. This information could be interesting for traffic
analyses and economic analyses. The aims of this task are to calculate the number of ships at certain
coordinates and to visualise the results in order to analyse variations in time.
Concrete results of the tasks mentioned above are reports on:
1. creating a database with AIS data for official statistics: possibilities and pitfalls
2. deriving harbour visits and linking data from maritime statistics with AIS data
3. sea traffic analyses using AIS data
Obtaining AIS data on European level
There were three levels to choose from for obtaining AIS data:
National level for each country collaborating in WP 4;
European level: all waters within the perimeter of the European countries;
World level: a data set covering the whole world.
National data is not very useful for European analysis. Its only advantages are its price (most of
the participating countries already have national data) and its size (the data is not that big).
European data is much better suited to European as well as national statistics. Since European data
contains information about the routes vessels take within Europe, it gives information on
routes, but also on harbours of loading and unloading within Europe. Hence, this data could also
have extra value at the national level. Worldwide data would make it possible
to locate the harbours of loading and unloading around the world. However, this data is gathered by
satellites and is, as a consequence, very expensive. Based on these considerations about the three levels of
AIS data, it was decided to obtain European data. Several sources were identified from which to obtain the
data: EMSA, Kystverket, Hellenic Coastguard, Dirkzwager, Marine Traffic (.com) and the
Joint Research Centre (JRC).
The result of this investigation into possible sources for European AIS data is that we know for sure
that we cannot obtain European data from Kystverket, Hellenic Coastguard and JRC, because of legal
issues. At the beginning of this work package we decided not to use the European AIS data from
Marine Traffic, because this is a very expensive alternative that does not fit within our budget [1]. We
are still investigating the possibility of using AIS data from EMSA (together with Eurostat), because
EMSA would provide the data for free. Dirkzwager could provide us with 6 months of European AIS
data within a short period of time and at a very good price. For that reason we decided to use the
Dirkzwager data within our work package. This dataset contains 6 months of AIS data (8 October
2015 - 12 April 2016) from land-based stations only. Satellite data is not
included. If the EMSA data becomes available during SGA-2 we will also investigate this data in WP 4.
Decisions concerning tools and environment
Figure 1 describes the results of deciding which programming language and environment we should
use for pre-processing and analysing the European AIS data. For legal reasons we chose to keep the
data in the Netherlands and decode the data in Python. After the data was decoded, all files were
zipped to save space. The sets of files with positions and files with voyage-related data were
uploaded to the UNECE Sandbox by secure copy (scp). The location data comprises 144 GB of
compressed data (about 200 GB uncompressed) and the messages comprise about 5 GB of
compressed data (about 7 GB uncompressed). It took approximately 7 hours to copy the data. After
uploading, the data was copied to the Hadoop Distributed File System (HDFS) and made available for use. For legal
reasons we decided to create an AIS group on the Sandbox, so that only the user accounts of the members
of WP 4 can access the Dirkzwager data.
[1] However, in July Statistics Netherlands and Marine Traffic signed a Memorandum of Understanding aimed at
sharing data and knowledge between both organisations.
From the HDFS, the data can be accessed using the tools that are available in the Hadoop stack,
which are: Pig, Hive, RHadoop and Spark. We chose to use Spark, because Spark makes it possible to
perform much more complex processing on data which is stored in the HDFS. Spark is compatible
with the programming languages Scala, Python, R and Java. However, it is advised to use it with
Python or Scala. Furthermore, within Spark one will be able to write SQL queries using SparkSQL.
Figure 1. Pre-processing, processing and storing the AIS data
For data analysis it was decided that the resulting aggregates will be downloaded from the UNECE Sandbox
using HUE. Researchers of the different NSIs will be able to analyse the data using their tools of
choice, e.g. SPSS, SAS or R, or even import the data into a local database.
Results SGA-1 on the quality of AIS data
Dirkzwager has receivers all along the coastline and in the main ports of the Netherlands, and a few
outside the Netherlands: Cherbourg, Gibraltar, Zeebrugge, Antwerp and Hamburg. AIS receivers on
land can only pick up signals within a range of about 40 sea miles. Therefore, land receivers have
very limited coverage of signals transmitted at sea, which results in a loss of information on ships on
the open sea. The data we received comprised all data except the satellite data, and thus also included
non-European data from partners [2].
As described in deliverable 4.2, the coverage of ships in the Dirkzwager AIS data is good, but there is
also missing data and quite a lot of noise. For example, some vessels seemed to be located in the
Sahara (see https://maartenpouwels.carto.com/viz/8d319f16-8195-11e6-af04-0ecd1babdde5/public_map).
Another visualisation shows that following a ship for a couple of days gives
a reasonable view of its journey, but here, too, data is missing (see
https://maartenpouwels.carto.com/viz/8d2f3bde-8197-11e6-bf3f-0ee66e2c9693/public_map).
We know there are three types of errors in AIS data:
Technical errors: related to dynamic data such as the position, speed, course and rotation of the ship, which
come from the AIS device (sensors, cables and antenna).
Human errors: related to static data (MMSI [3], IMO number, ship’s name, call sign, type, length) or voyage
data (draught, destination), which are entered manually in the AIS device and are therefore a common
cause of errors. Static values should be entered when the AIS instrument is installed, and voyage values
whenever the voyage information changes. It is worth noting that voyage data must be updated manually after
each port visit.
Systematic errors: due to faulty or missing input by ship crews. Apart from these systematic
errors, all parameters can be erroneous due to technical issues (e.g. meteorological factors,
distance to receiver). These errors can take any form and can, for example, result in a wrong IMO number or
MMSI.
AIS quality thus depends on correct installation of the AIS device, frequent manual updates of
information, and the technical devices involved. We deal with most of these issues by detecting and removing
erroneous data. Because the volume of data is huge, the number of errors is large; however, the amount of
data that remains after cleaning is still ample for further analyses.
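The detection step can be illustrated with a few field-level plausibility checks. The sketch below is a minimal illustration and not the procedure used in the work package: the function names and the record layout are hypothetical. It relies on two generally known conventions, namely that an IMO number carries a check digit and that AIS reports latitude 91 and longitude 181 when no position is available. The MMSI check is deliberately simplified to a length check.

```python
def imo_is_valid(imo: str) -> bool:
    """Verify the check digit of a 7-digit IMO number (weights 7 down to 2)."""
    if len(imo) != 7 or not imo.isdigit():
        return False
    total = sum(int(digit) * weight for digit, weight in zip(imo[:6], range(7, 1, -1)))
    return total % 10 == int(imo[6])


def mmsi_is_plausible(mmsi: int) -> bool:
    """Simplified assumption: accept any nine-digit number without leading zero."""
    return 100_000_000 <= mmsi <= 999_999_999


def position_is_valid(lat: float, lon: float) -> bool:
    """Reject out-of-range values, including the 91/181 'not available' defaults."""
    return -90.0 <= lat <= 90.0 and -180.0 <= lon <= 180.0


def clean(records):
    """Keep only the records (hypothetical dict layout) that pass every check."""
    return [
        r for r in records
        if mmsi_is_plausible(r["mmsi"])
        and position_is_valid(r["lat"], r["lon"])
        and (r.get("imo") is None or imo_is_valid(r["imo"]))
    ]


# Made-up records: one clean, one without a position, one with a mistyped IMO number.
sample = [
    {"mmsi": 244000001, "lat": 51.95, "lon": 4.05, "imo": "9074729"},
    {"mmsi": 244000002, "lat": 91.0, "lon": 181.0, "imo": None},
    {"mmsi": 244000003, "lat": 53.50, "lon": 9.90, "imo": "9074728"},
]
print(len(clean(sample)), "of", len(sample), "records kept")
```

Checks of this kind only catch values that are invalid in themselves. A position in the Sahara has a valid latitude and longitude, so errors of that type need additional checks against the ship's track or against a land mask.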
These results made us decide to investigate different AIS data sources further. This was done by
subjecting the Dirkzwager data to a quality and metadata framework and then comparing Dirkzwager
to other data sources. We were interested to see how the quality of the Dirkzwager data compared with
that of national AIS data. Therefore, national AIS data from Denmark, Greece and Poland was compared to AIS data
from Dirkzwager.
When investigating the quality of AIS data it is important to keep in mind that:
AIS is a radio signal; parts of the messages can get lost or scrambled due to factors such as
meteorology or magnetics.
Messages are transmitted encoded. As a result, an error in one transmitted ‘byte’ can result
in an error in one or multiple fields in the decoded message. Most of the time, these errors
are detectable because the result yields an invalid variable, but sometimes they result in valid
variables. For instance, an erroneous detection can coincidentally produce an MMSI that is
technically valid but incorrect. These errors can arise for every
variable, so they can, for example, result in an erroneous latitude and longitude, yielding faulty
locations that are quite far away from the actual location of the ship. In turn, this can result
in a very high journey distance for the ship.
Receivers have time slots in which data is received. In busy areas with many ships, not all data
from all ships may fit into a time slot. This may result in the loss of data on some ships in
that time slot.
Ships can turn off their AIS transponder, resulting in the disappearance of a ship.
AIS was originally intended for safety at sea, to warn nearby ships. As it was not meant for
producing statistics, the variables entered manually by ship crews are not always reliable.
[2] Dirkzwager has 6 partners, among which AIShub (which seems to filter the data more and covers all of Europe),
Marine Traffic (which mostly covers the Mediterranean: Greece and Italy), one English partner (which covers the English coast) and Portvision (which covers the USA).
[3] Maritime Mobile Service Identity
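The effect of a single corrupted character on the decoded fields can be made concrete. AIS payloads are transmitted as text in which every character carries six bits, so a field such as the MMSI is spread over several characters. The sketch below is not the decoder used in WP 4; it only unpacks the message type and the MMSI of a position report, using a sample payload of the kind that appears in public AIS documentation, and then replaces one character to imitate a transmission error. The function names are hypothetical.

```python
def payload_to_bits(payload: str) -> str:
    """Unpack the 6-bit character armouring of an AIS payload into a bit string."""
    bits = []
    for char in payload:
        value = ord(char) - 48
        if value > 40:
            value -= 8
        bits.append(format(value, "06b"))
    return "".join(bits)


def decode_header(payload: str) -> dict:
    """Decode only the message type (bits 0-5) and the MMSI (bits 8-37)."""
    bits = payload_to_bits(payload)
    return {"type": int(bits[0:6], 2), "mmsi": int(bits[8:38], 2)}


SAMPLE_PAYLOAD = "177KQJ5000G?tO`K>RA1wUbN0TKH"

original = decode_header(SAMPLE_PAYLOAD)

# Imitate a transmission error: the fourth character 'K' is received as 'L'.
corrupted = decode_header(SAMPLE_PAYLOAD[:3] + "L" + SAMPLE_PAYLOAD[4:])

print(original)   # {'type': 1, 'mmsi': 477553000}
print(corrupted)  # {'type': 1, 'mmsi': 477569384}
```

The corrupted message still decodes to a nine-digit MMSI, so a format check alone would not reveal the error. It would only show up when the message is compared with the other messages of the same ship or with a ship register.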
To provide an overview of different aspects of the quality of the Dirkzwager AIS data, we filled out a
preliminary framework for national statistical offices to conceptualise the quality of big data in
deliverable 4.3. Almost all factors of the quality framework are judged mostly positive. Only
“spatial coverage” and “transparency and soundness of methods and processes for the metadata and
the data” are insufficient: not all European coastal areas are covered, and Dirkzwager provides partly
pre-processed data without documentation available to us on how the pre-processing is done. Privacy
is also an issue that needs to be researched further.
Dirkzwager adds timestamps and performs validation checks on aspects such as position of ships and
ordering overlapping data sources. National AIS data of Denmark, Poland and Greece are completely
unfiltered and untreated.
In deliverable 4.3, AIS data from Dirkzwager is compared to national data for Denmark, Greece and
Poland. In almost all cases, the national data contained (much) more data than the Dirkzwager data.
Dirkzwager misses data on complete areas in coastal Europe. Thus, port visits and journeys cannot
be analysed for all European ports and ship routes. In areas that Dirkzwager does cover, the number of
messages is usually lower in the Dirkzwager data than in the national data. It is clear that
some of the Dirkzwager data is filtered depending on the data source, but the exact nature of this
filtering is not clear, as the reduction in messages differs per ship. In general, we are not satisfied
with this filtering (or with the information on it), nor with the coverage of the Dirkzwager data. Coverage
differs per country, but it does not suffice for an analysis of the whole of Europe. If Dirkzwager
data does cover a port, the data is sufficient to determine the port visits. However, it is not sufficient
to determine ships’ journeys, especially in areas with complex geography. Our algorithm
(described in deliverable 4.2) can deal with this in terms of calculating the right number of journeys,
but the calculated distances will be underestimated. The lower frequency of messages
can also affect calculated traffic estimates and lead to an underestimation of emissions.
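The underestimation of distances follows from geometry: when positions are reported less often, the straight segments between them cut corners. The sketch below is a minimal illustration under stated assumptions, not the distance calculation of deliverable 4.2. It uses a made-up curved track around a headland, great-circle (haversine) distances on a spherical Earth with a radius of 6371 km, and hypothetical function names.

```python
import math

EARTH_RADIUS_KM = 6371.0  # spherical approximation (assumption)


def haversine_km(p, q):
    """Great-circle distance in km between two (lat, lon) points in degrees."""
    lat1, lon1 = map(math.radians, p)
    lat2, lon2 = map(math.radians, q)
    a = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def track_length_km(points):
    """Sum of the distances between consecutive positions."""
    return sum(haversine_km(a, b) for a, b in zip(points, points[1:]))


# Made-up track: a quarter turn around a headland, one position every 5 degrees of arc.
track = [
    (52.0 + 0.1 * math.sin(math.radians(t)), 4.0 + 0.1 * math.cos(math.radians(t)))
    for t in range(0, 91, 5)
]

dense = track_length_km(track)        # all 19 positions
sparse = track_length_km(track[::6])  # only every sixth position, same start and end

print(f"dense track:  {dense:.2f} km")
print(f"sparse track: {sparse:.2f} km")
```

Both versions start and end at the same positions, so the number of journeys is unaffected, but the sparse version is shorter. The more the route bends, the larger the shortfall.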
During SGA-1 we developed robust algorithms to handle the noise in the data. We developed an
output-driven method to define a journey. Using the departure of ships gives us the start of a
journey. The end of the journey can be determined in three ways:
The ship enters another port
The ship anchors
The ship leaves the area of AIS coverage
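These three end conditions can be expressed as a small state machine. The sketch below only illustrates the rules listed above and is not the algorithm of deliverable 4.2. It assumes that each message of a ship has already been labelled as being in a port, anchored or at sea, and it treats a silence longer than a hypothetical `max_gap_hours` as leaving the area of AIS coverage. All names and labels are illustrative.

```python
def split_journeys(observations, max_gap_hours=6.0):
    """Split one ship's time-ordered (hour, state) observations into journeys.

    state is 'port:<name>', 'anchored' or 'sea' (hypothetical labels).
    A journey starts at the first 'sea' observation after a port observation.
    """
    journeys = []
    current = None   # the journey in progress, if any
    previous = None  # the previous (hour, state) observation

    for hour, state in observations:
        if current is not None:
            reason = None
            if hour - previous[0] > max_gap_hours:
                reason, end, destination = "left coverage", previous[0], None
            elif state.startswith("port:"):
                reason, end, destination = "entered port", hour, state[5:]
            elif state == "anchored":
                reason, end, destination = "anchored", hour, None
            if reason is not None:
                current.update(end=end, reason=reason, destination=destination)
                journeys.append(current)
                current = None
        elif previous and previous[1].startswith("port:") and state == "sea":
            current = {"origin": previous[1][5:], "start": hour}
        previous = (hour, state)

    # Data ends while the ship is under way: treat this as leaving coverage.
    if current is not None:
        current.update(end=previous[0], reason="left coverage", destination=None)
        journeys.append(current)
    return journeys


# Made-up observations for one ship (hours since the start of the data set).
example = [
    (0, "port:Rotterdam"), (1, "sea"), (2, "sea"), (3, "port:Antwerp"),
    (5, "sea"), (6, "anchored"), (7, "port:Antwerp"),
    (8, "sea"), (9, "sea"), (30, "sea"),
]
for journey in split_journeys(example):
    print(journey)
```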
Processing could be optimized by filtering out AIS data in which the speed and heading of the ship
have not changed since the last message. This optimization might be performed in the future, but is
not in scope of the current project.
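Such a filter could look roughly as follows. This is a sketch of the idea only, since the optimisation was not implemented in the project; the record layout and the function name are assumptions.

```python
def drop_unchanged(messages):
    """Drop messages whose speed and heading equal those of the ship's previous message.

    messages: time-ordered dicts with 'mmsi', 'speed' and 'heading' (hypothetical layout).
    """
    last_state = {}  # mmsi -> (speed, heading) of the previous message
    kept = []
    for message in messages:
        state = (message["speed"], message["heading"])
        if last_state.get(message["mmsi"]) != state:
            kept.append(message)
            last_state[message["mmsi"]] = state
    return kept


# Made-up messages: ship 1 is moored for three messages and then departs.
stream = [
    {"mmsi": 1, "hour": 0, "speed": 0.0, "heading": 90},
    {"mmsi": 1, "hour": 1, "speed": 0.0, "heading": 90},
    {"mmsi": 2, "hour": 1, "speed": 12.5, "heading": 180},
    {"mmsi": 1, "hour": 2, "speed": 0.0, "heading": 90},
    {"mmsi": 1, "hour": 3, "speed": 4.0, "heading": 95},
]
print([m["hour"] for m in drop_unchanged(stream)])
```

Long stays in port shrink to a single message, which is where most of the saving would be expected. The timestamps of the dropped messages are lost, so a gap-based rule for leaving coverage would have to be applied before this filter.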
Results SGA-1 on linking European AIS data to data from maritime statistics and results on
possibilities to improve the quality of current statistical outputs
First results show that AIS data can be used as a backbone for maritime statistics. We have
developed a method to build a reference frame of maritime ships. From this, we can derive the
number of port visits, which seems to be more accurate than the maritime statistics. However, AIS
data do not contain the level of detail on the type and gross tonnage of ships that is needed
to generate port visit statistics. One way to add this would be to combine AIS data with
Lloyd’s register of ships. We are also looking into other methods of deriving this information from AIS data (e.g.
from the type of terminal the ship is visiting).
European AIS data can improve current statistics. By using European AIS data it is possible to
determine ship routes in European waters. We performed four Proofs of Concept (PoCs). The
outcomes are promising (detailed results are described in deliverable 4.3). The first PoC, on
developing an algorithm to calculate intra-port journeys using AIS data, succeeded. Intra-port
travel distances can become a new statistical product. In the future, it would be interesting to
develop an algorithm that automatically detects intra-port movements, i.e. cases where a ship moves from one
terminal to another within the same port. Another interesting aspect
is the detection of anomalies in ship movements that signal problems in the ports.
The second PoC, on using AIS data to define ports, has succeeded: it is possible to build a data driven
algorithm for defining ports. In the near future, Statistics Netherlands and Marine Traffic will
collaborate on the possibilities of building a reference frame of ports. Then, it would also be
interesting to zoom in on defining the types of terminals.
From the third PoC we conclude that the next destination as reported by captains is not a usable variable
when compared with the observed next destination. We also conclude from this PoC that distances
can be measured in time and space. More work is needed to handle areas where coverage is not perfect.
It would also be interesting to compare Eurostat's port-to-port distances with port-to-port
distances based on AIS data. This may result in using actual AIS journey data instead of the
average distance matrix in the future.
Finally, the last PoC shows that AIS data is useful to investigate fluvio-maritime transport. From the
perspective of traffic intensity, emissions and transit trade, it is interesting to further investigate
these fluvio-maritime journeys. This could also be used to gain insight into the relationship between
maritime and inland waterway transport.
Results SGA-1 on sea traffic analyses by using AIS-data
We explored the possibility of calculating the number of ships during a certain time interval at certain
coordinates by using AIS-data. This information could be interesting for traffic and economic
analyses. From our investigation we can conclude AIS data is also useful to analyse sea traffic and to
analyse variations in time (see deliverable 4.3 for a more detailed description).
The figure below gives an example of the visualization we made. In the visualization, one can choose
a date from the list of available dates at the bottom, where a slider is also available for selecting a
saturation threshold. All cells that are more occupied than the
threshold are displayed in dark red and all less visited locations in lighter red. Playing with the slider
gives insight into more and less occupied regions in Europe. For regions that are not displayed in
any shade of red, no data is available in the Dirkzwager dataset on that specific day. Very
low intensities are also made invisible.
Figure 2. Result of the traffic analyses: the number of ships in each cell of the grid during one
day, based on the Lambert azimuthal equal-area projection, for a threshold of 50
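The counting behind such a map can be sketched in a few lines. The example below is an illustration under stated assumptions and not the code used for Figure 2: it uses the spherical form of the Lambert azimuthal equal-area projection, centred at 52°N, 10°E (the centre of the European ETRS89-LAEA grid, assumed here), a hypothetical cell size of 10 km and made-up messages for a single day.

```python
import math
from collections import defaultdict

EARTH_RADIUS_KM = 6371.0                              # sphere instead of ellipsoid (assumption)
LAT0, LON0 = math.radians(52.0), math.radians(10.0)   # projection centre (assumption)


def laea_km(lat_deg, lon_deg):
    """Spherical Lambert azimuthal equal-area projection, result in km from the centre."""
    lat, lon = math.radians(lat_deg), math.radians(lon_deg)
    dlon = lon - LON0
    k = math.sqrt(2.0 / (1.0 + math.sin(LAT0) * math.sin(lat)
                         + math.cos(LAT0) * math.cos(lat) * math.cos(dlon)))
    x = EARTH_RADIUS_KM * k * math.cos(lat) * math.sin(dlon)
    y = EARTH_RADIUS_KM * k * (math.cos(LAT0) * math.sin(lat)
                               - math.sin(LAT0) * math.cos(lat) * math.cos(dlon))
    return x, y


def ships_per_cell(messages, cell_km=10.0):
    """Count distinct ships per grid cell; messages are (mmsi, lat, lon) for one day."""
    ships = defaultdict(set)
    for mmsi, lat, lon in messages:
        x, y = laea_km(lat, lon)
        ships[(math.floor(x / cell_km), math.floor(y / cell_km))].add(mmsi)
    return {cell: len(members) for cell, members in ships.items()}


def saturate(counts, threshold=50):
    """Colour intensity between 0 and 1; cells at or above the threshold are fully saturated."""
    return {cell: min(n, threshold) / threshold for cell, n in counts.items()}


# Made-up messages for one day: two ships share a cell, a third is elsewhere.
day = [
    (244000001, 52.05, 10.05),
    (244000001, 52.06, 10.06),
    (244000002, 52.05, 10.05),
    (244000003, 51.90, 4.10),
]
counts = ships_per_cell(day)
print(counts)
print(saturate(counts, threshold=2))
```

Because the projection preserves area, every cell covers the same surface, which keeps counts comparable between northern and southern waters. Lowering the threshold makes moderately busy cells appear fully saturated, which corresponds to moving the slider in the visualization.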
Outlook
Although we are not yet satisfied with the quality of the Dirkzwager data, AIS data itself can help
improve current statistics. With new data sources such as Marine Traffic, EMSA and Luxspace
(satellite data) becoming available in the future, the possibilities of AIS data seem even more promising.
In SGA-2 we will focus on describing the possibilities of using AIS as a source for new statistical
products (e.g. intra-port distances, sea traffic and variations in time). We also wanted to involve
other statisticians working on maritime statistics in thinking about the use of AIS for improving
maritime statistics and for new statistical products. To this end, we sent out a questionnaire on the
use of AIS to maritime statisticians and all member countries of the ESSnet. We will describe the
results of this questionnaire, including ideas for new statistical products based on AIS data, in
SGA-2.
In SGA-2 we will also investigate other AIS sources such as Marine Traffic, Luxspace (satellite) and
hopefully EMSA. This will result in advice on which data source would best fit analyses for
Eurostat’s purposes. Furthermore, we will develop a methodology for calculating emissions and
report on the impact of this methodology on the (European) level of emissions statistics. All project
results of WP 4 will be brought together in a consolidated report.
2.5 Mobile Phone Data
WP 5 on mobile phone data aims to research the potential use of these data for the
production of official statistics. Following the general bottom-up approach of the ESSnet, this work
package concentrates on concrete sets of mobile phone data. It investigates and produces a
concrete statistical output in a given statistical domain, assesses the methodological and
technological framework as well as some data quality aspects, and considers future perspectives for
the extensive use of these data in official statistics production.
Several milestones must be reached to achieve these goals. Firstly, mobile phone data are produced in the
intense activity of the telecommunication industry, in particular in mobile phone networks built,
operated, maintained and technologically streamlined by international telecommunication
corporations. These data are strongly protected by diverse national and international legal
regulations due to highly sensitive personal privacy and confidentiality issues. Furthermore, these
corporations already have economic interests in exploiting these data commercially
themselves. For all these reasons and others, data access is a major issue to be tackled by the work
package. The short-term goal is to have access to concrete data sets to be used in the rest of the
research activities. The long-term goal is to investigate the feasibility of sustained access under
standard production conditions as well as the required characteristics of the accessed data.
Secondly, since usual survey methodology cannot be directly applied to these data, methodological
proposals must be produced in order to achieve high-quality statistical outputs as in traditional data
sources. This must be complemented with the technological requirements necessary to process
these data. This is the second general goal of the work package.
As a third general goal, depending on the specific agreements with mobile phone corporations to get
access to the data, concrete statistical outputs in the statistical domains of human mobility and
tourism will be produced from the accessed data sets. The production of statistical outputs will
entail the production of point estimates and the assessment of their quality, especially their
accuracy and timeliness.
Finally, future perspectives will be considered in the light of the preceding elements necessary to
produce statistical outputs from mobile phone data.
This work package will serve as an input for WP 7 on combining different statistical domains and WP
8 on big data methodology. Findings in WP 5 will be valuable for the goals of these work
packages.
The degree of accomplishment of the preceding tasks and the main results are as follows:
Belgium, Finland, France and Italy succeeded in their negotiations to gain access to a
concrete mobile phone data set for SGA-2. Spain and Romania are still in talks
in pursuit of this goal. The UK, the Netherlands and Germany will join WP 5 for SGA-2 with more
mobile phone data sets, mostly at an aggregated level.
No generic recommendation or golden rule to achieve success was found since the situation
is noticeably different by country (different legal regulations) and by mobile network
operator (different company structure and business interests).
The agreements in every country also differ regarding the access conditions
and data characteristics. The main consequence for SGA-2 and the rest of the activities of WP 5 is
that the data sets are not homogeneous regarding spatial and temporal extent and granularity,
event origin (either passive or active events) and other minor factors. Although this may be
seen as an obstacle, it will be used in SGA-2 to test what can be achieved in every case, thus
de facto exploring several circumstances at the same time.
It is important to note that in all cases consultations with and/or resolutions of national data
protection authorities have been necessary, possibly indicating a lack of clarity in legal
regulations regarding both the right of NSIs to access these data and the legal basis
for mobile network operators to share their clients' data.
An overview of these experiences is contained in deliverable 5.2 of this work package.
As a starting point for the activities of WP 5, we designed a questionnaire to take stock of the
current status of the access to mobile phone data across the ESS. This questionnaire has
been designed taking as a basis a preliminary analysis of the many entangled issues regarding
access to mobile phone data for official statistics purposes.
The questionnaire has basically three parts. In the first part, we enquire about legal issues
regarding statistical, telecommunication and personal data protection regulations, all being
highly entangled in the question of the access to data. This is complemented with some
requested information about the characteristics of the mobile network operators. In the
second part, we focus on the access conditions: (i) in-situ, transmission, or via trusted third-
party, (ii) access for research or for standard production, (iii) conditions on dissemination
regarding intellectual property rights and industrial secrecy on data extraction algorithms
and methods, (iv) combination with official data, (v) data extraction cost compensation, and
some related details. In the final part, data characteristics are investigated: (i) raw or pre-
processed micro data vs. aggregated data, (ii) event source of data (active events, signalling,
etc.), (iii) spatial and time coverage, (iv) spatial and time granularity, (v) data on roamers or
not, (vi) details about possible pre-processing (anonymisation, geolocation, etc.). In addition,
questions were asked about the characteristics of the mobile network operator (MNO) and some other aspects.
Detailed results of the survey as well as the questionnaire itself are contained in deliverable
5.1. Regarding responses, as a general overview, 28 out of the 32 ESS
members were surveyed (the remaining 4 do not participate in big data activities within the
ESS, so they can safely be assumed to have no activity related to mobile phone data).
We received responses from 25 NSIs, of which 14 reported contacts with mobile network
operators; only 7 of these succeeded in gaining access to mobile phone data. From the
corporations' point of view, 10 mobile network operators are reported to grant access to
their data, mainly for research purposes.
For the workshop in Luxembourg, which brought together mobile network operators, national statistical
offices, Eurostat and other stakeholders, 52 invitations were issued to MNOs' representatives,
as well as invitations to representatives of the members of the ESS (NSIs and Eurostat) and some other
international organisations (UN, OECD, ITU, DG Connect, DG Digit). STATEC acted as local
organiser (we explicitly acknowledge their support). Positium assisted the work package
members by delivering a talk on their ongoing experience and taking an active part in the
debates and round tables. The meeting was organised in four sessions. Firstly, a presentation
of the context and actors was given by Eurostat, STATEC and Statistics Spain (INE) as
work package leader. In the second session, three examples of ongoing experiences in three
different European countries were presented and complemented with a round table
exploring the lessons learnt so far. In the third session, the core issues regarding access
to mobile phone data were presented from the points of view of both official statistics and mobile network
operators. A tour-de-table with lively discussion was organised to exchange
the different views. A final session gathering the joint conclusions for the future closed the
meeting.
Detailed contents of the workshop were recorded in the minutes of the meeting.
Two deliverables were produced in order to accomplish the goals of SGA-1 for WP 5.
Deliverable 5.1 contains (i) a preliminary analysis of the issues regarding the access to mobile
phone data, which was the basis to design the questionnaire surveying the status of this
access across the ESS, and (ii) the results of this survey.
As mentioned, basically five main groups of issues were identified regarding the access to
mobile phone data, namely (i) the characteristics of the MNO, (ii) the legal requirements, (iii)
the access conditions, (iv) the data characteristics, and (v) some other aspects.
For diverse reasons, the characteristics of the MNO are an important factor in granting access to
their data. Although each corporation is different, three main types have been identified: (i)
MNOs that have invested (or are on the verge of investing) in the development of a business line around
the statistical exploitation of their data, (ii) MNOs that have not developed this business line but
are determined and interested to do so, and (iii) MNOs that have so far shown no interest whatsoever in this
business line. It is important to underline that different interlocutors within these companies
can be approached. As these companies are usually large, a number of departments or
operating units are involved in the question of granting access to data. In some cases, these
departments do not share the same vision, which hampers agreement. This reflects
the many facets involved in the question.
Legal requirements are the most visible obstacle to gaining access to mobile phone data. We have
found that at least three kinds of legal regulations are at stake. Firstly, by and large, statistical
regulations do not seem to provide clear legal support for the right of NSIs to access
these data. This is mainly because national statistical laws are already some decades old
and did not explicitly foresee access to private data such as mobile phone data and other big data
sources. However, in many cases a straightforward interpretation of the current wording of this legislation
still supports NSIs in requesting access under the usual strict confidentiality conditions. Secondly,
telecommunication regulations are clearly stringent on the conditions for accessing, storing and
processing these data, even by the MNOs themselves. In some cases, there even seems to be an
unresolved conflict between these two kinds of legal requirements. Finally, in the European
framework personal data are strongly protected by law, and national data protection authorities
must be involved to clarify and give explicit support to NSIs for their data requests and for the conditions
for accessing, storing and processing the data. There seems to be a clear consensus that more clarity is
needed, especially at the European level, regarding the legal status of access to these data for
official statistics purposes.
The actual conditions for accessing the data are also part of the negotiations and of the agreement. Data
may be accessible only in situ on the company's premises, or may be securely transmitted to the
NSIs' information systems, possibly through a trusted third party. In all cases so far, access has been
granted for research purposes; the conditions for long-term sustained production are still an open
question. Data are to be extracted from the MNOs' information systems through technological
solutions which are intended to remain under intellectual property rights and/or industrial secrecy.
How much information is to be shared between MNOs and NSIs is thus another point in the
negotiations. Equally, the combination of mobile phone data with official data, especially with micro
data (data at the statistical unit level), is another factor to take into account. As can be clearly seen from the
operational framework, data extraction and data pre-processing procedures entail some cost and
effort, which must be dealt with in the negotiations to reach an agreement.
Finally, the operational conditions under which confidentiality is scrupulously observed must be
agreed upon. Many characteristics of the data come into consideration. Firstly, data can be at the
level of the individual mobile phone or aggregated at some level (also to be agreed upon). The actual
origin of the data within the mobile network is also an important issue: data can be generated from
the active events (calls, messages, Internet connections, ...) of subscribers or from any signalling
activity between mobile devices and antennas that may be recorded in the network. The amount of
information (and also of effort) varies greatly from one scenario to another. Complementary data
from the subscribers' contracts (sociodemographic data) or from the operational framework of the
network (position of antennas) are another factor. Pre-processing, that is, the procedures for
anonymising the data, geolocating each record and assigning time references, is an important issue to
be agreed upon. In the case of aggregated data, this must be complemented with details of the
aggregation procedure.
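To make the kind of pre-processing under negotiation more concrete, the following minimal sketch pseudonymises device identifiers and aggregates events to counts of distinct devices per antenna and hour. It is an illustration only, not a procedure used by any MNO or NSI in the work package: the keyed hash, the secret key, the suppression threshold `MIN_DEVICES` and the record layout are all assumptions introduced here.

```python
import hashlib
import hmac
from collections import defaultdict
from datetime import datetime

# Hypothetical secret key; it would presumably never leave the operator (assumption).
SECRET_KEY = b"replace-with-a-secret-kept-by-the-operator"
# Hypothetical disclosure threshold: cells with fewer devices are suppressed.
MIN_DEVICES = 3


def pseudonymise(device_id: str) -> str:
    """Replace a device identifier by a keyed hash (HMAC-SHA256)."""
    return hmac.new(SECRET_KEY, device_id.encode("utf-8"), hashlib.sha256).hexdigest()


def aggregate(events):
    """Count distinct pseudonymised devices per (antenna, hour).

    `events` is an iterable of (device_id, iso_timestamp, antenna_id) tuples;
    this layout is an assumption made for the example.
    """
    devices = defaultdict(set)
    for device_id, timestamp, antenna_id in events:
        hour = datetime.fromisoformat(timestamp).replace(minute=0, second=0, microsecond=0)
        devices[(antenna_id, hour.isoformat())].add(pseudonymise(device_id))
    # Keep only the cells that reach the disclosure threshold.
    return {cell: len(ids) for cell, ids in devices.items() if len(ids) >= MIN_DEVICES}


# Made-up events for illustration.
events = [
    ("A", "2017-01-26T08:05:00", "ant-1"),
    ("B", "2017-01-26T08:20:00", "ant-1"),
    ("C", "2017-01-26T08:59:00", "ant-1"),
    ("A", "2017-01-26T08:30:00", "ant-1"),  # same device again: counted once
    ("D", "2017-01-26T09:10:00", "ant-2"),  # single device in its cell: suppressed
]
counts = aggregate(events)
```

The level of aggregation (here antenna by hour) and the threshold are exactly the kind of parameters that, as described above, would have to be agreed between the NSI and the MNO.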
Lastly, there are some other issues that NSIs must take into account. To name a few, access to data
from all the main MNOs should be considered, because having access to just one or two companies
raises the unsolved question of partial coverage of the population and hence the subsequent
inference problem. Also, an apparent conflict of interests may arise between the private and public
sectors, both of them exploiting the same data. As in traditional survey sampling, this conflict is only
apparent, as the public and private sectors may reinforce each other, especially if partnerships aimed
at complementary actions are formed. Finally, public opinion regarding the access to and sharing of
such highly sensitive data must also be taken into account, possibly through a joint, transparent
communication strategy about the actual use of these data for the benefit of society.
Deliverable 5.2 summarises the main findings of SGA-1 activities in the form of guidelines and
recommendations for the partners of the ESS when initiating their own contacts and negotiations
with their national MNOs. This second deliverable has three main sections.
Firstly, it contains a technical description of the operational framework of mobile telecommunication
networks, to explain how data are generated. Although expert knowledge of this technology is not
strictly necessary to negotiate access to mobile phone data, it is very helpful to understand several
factors in the underlying complexity.
Mobile phone data for statistical exploitation do not exist in a cellular network. They are organic data
created, reproduced, stored, and deleted in a frantic business cycle providing a telecommunication
service, not a statistical service. A clear specification of the requested data must thus be formulated
and negotiated.
A mobile telecommunication network has a nested hierarchical structure so that at the top basic data
mainly for billing purposes are compiled whereas at the bottom (where multiple information systems
are geographically distributed across the national territory) a wealth of technical data exists.
The core set of variables for statistical exploitation comprises (i) (anonymous) identification variables
for each mobile device, (ii) time attribute(s), and (iii) geolocation attribute(s). The creation of these
variables depends on diverse factors operating in the network. Complementary variables can also be
extracted. A description of all these variables is included in deliverable 5.2.
From a wider perspective, the network complexity entails two immediate issues. On the one hand,
extraction costs must be carefully taken into account, especially when confronted with the firm
international principle in official statistics production of not paying for data for this purpose. On the
other hand, new professional skills are needed for the staff dealing with this task. Both aspects are
briefly tackled in this chapter.
All these technical aspects are summarised in a sequence of tables in deliverable 5.2 as a collection
of issues to be addressed with the MNOs in a negotiation, ranging from the premises where data are
to be processed, through data coverage, to network technology. This part of the deliverable closely
follows the technical assistance provided by Positium.
Secondly, we gather all business guidelines, drawing in particular on the results of the workshop in
Luxembourg and the opinions and views exchanged with MNOs during this event. These findings
help us understand and disentangle many of the factors behind the difficulties that official statistics
producers face in accessing mobile phone data.
Several ongoing initiatives of collaboration between NSIs and MNOs were presented. A round table
and a tour de table were held, with a lively debate on the different aspects involved in access to
data.
The following main issues were identified:
• Construct clear use cases to show feasibility and mutual trust.
• Consensus on partnerships outperforming mandatory scenarios.
• Concerns arise when moving from research to production.
• Distributed vs. centralised data processing models. Solutions based on the development of open algorithms to be applied on secured data kept in data centres (e.g. MNOs' premises).
• Regulation on data protection and the relationship with regulatory authorities is a big issue at national and European level. More clarity is needed.
• Perception by society of the use of these sources. A communication strategy is needed. Transparency for citizens.
• Vicious circle in data access: it is necessary to build detailed case studies and delimit a precise set of data to be requested from MNOs, but some kind of data access is needed in advance to set up these detailed case studies.
• Relationships with MNOs could differ depending on their current strategies on big data. NSIs must take these different starting points into account.
• Quality assurance framework for our users.
Thirdly, the individual experiences of each work package member are included in the deliverable in
the hope of providing illustrative guidance for the process of negotiating with MNOs. We have
identified the following main guidelines:
• See and show the window of opportunity for building a partnership between an NSI and an MNO.
• Get the right people, with the technical skills and competence to build the partnership, to commit their organisations.
• Show empathy, and show the value arising from the NSI's contribution to the partnership.
• Show absolute guarantees of confidentiality and privacy protection.
• Show the limits of producing statistical outputs without collaboration, in contrast to the combination of data sources and methodologies.
• Be aware of the complexity of data extraction and its implications for extraction costs and professional skills.
• What data? Mobile phone data for statistical exploitation do not exist in a mobile telecommunication network, and a concrete specification must be formulated.
• Define a concrete small research project.
• Be attentive to legal issues.
• Analyse costs.
• Be transparent.
The overall goal is to provide relevant information about the access to mobile phone data seeking
optimal cost-effectiveness and efficiency within the ESS.
Outlook
For SGA-2, the work package will concentrate on producing concrete statistical outputs using the
mobile data sets secured during SGA-1. The main goals of this work package for the second part of
the ESSnet will be:
• From the methodological point of view, we intend (i) to clarify the application of common definitions of tourist, commuter, etc. to these data sets, and in particular the use of novel techniques such as machine learning to apply these statistical concepts to the data; (ii) to produce estimates together with their accuracy (especially bias) as the main measure of quality; (iii) to investigate the question of inference from these data sets to the whole population of analysis.
• From the technological point of view, we will identify (i) needs, if any, for distributed computing; (ii) special needs, if any, for data storage; (iii) novel needs, if any, for software.
• From the data quality point of view, an evaluation of the statistical outputs, especially regarding accuracy and timeliness, will be conducted and complemented with an assessment of the adequacy of the current quality framework, again especially regarding accuracy and timeliness.
• Finally, some future prospects will be explored for the extensive use of mobile phone data in the production of official statistics.
2.6 Early Estimates
The aim of WP 6, Early estimates, was to investigate the potential of big data and other sources, and
of their combination, for early estimates of statistical parameters.
The main goal of the WP 6 team was to explore how a combination of (early available) multiple big
data sources, administrative data and existing official statistical data could be used to produce
existing or new early estimates for official statistics. The study included the exploration of:
• big data sources and statistical areas where those sources could be used;
• other administrative and statistical sources which could be combined with the investigated big data sources;
• possible business cases which could be tested in the SGA-2 period;
• data collection, data linking, data processing, methodological and IT issues;
• results of one or two pilots which may help to determine the most promising business case for SGA-2. The proposed pilots were nowcasts of turnover indices and/or of the consumer confidence index.
This report (SGA-1) focuses on the results of the study related to the list of potential big data sources
and the proposed business case for SGA-2. Three deliverables were produced, focusing on data
sources and business cases, on IT tools and on methodology, respectively.
Exploration of data sources
An initial list of possible big data and other sources, and of the statistics which could be calculated
from them, was prepared on the basis of brainstorming sessions at some of the NSIs involved in WP 6
(and WP 7), a questionnaire sent to the NSIs participating in the ESSnet Big Data project, and
discussion among the members of the WP 6 team.
Table 1. List of possible data sources with the statistical domains where they could be employed

Statistical domain: Tourism (1)
Data sources: Mobile phone data, traffic counters at border crossings (including number plate recognition), flight and train tickets, surveys…
Statistics: Number of foreign tourists, number of (tourist) vehicles passing through the country

Statistical domain: Tourism (2)
Data sources: Mobile phone data, surveys…
Statistics: Number of tourists, lengths of trips…

Statistical domain: Population mobility
Data sources: Mobile phone data, surveys…
Statistics: Number of (short) trips per day, average distance travelled per day…

Statistical domain: Health statistics
Data sources: E-health prescriptions, personal health cards, pharmacies, surveys…
Statistics: Use of medicines (by age group, gender, territory…)

Statistical domain: Agriculture
Data sources: Airplane or satellite images, surveys…
Statistics: Utilised agricultural area, arable land, share of permanent crops in unutilised areas…

Statistical domain: Quick and dirty statistics (in all statistical domains)
Data sources: NSI data & e.g. Google trends tool
Statistics: Flash estimates of all kinds of early statistics

Statistical domain: Statistics for internal NSI purposes
Data sources: Newsfeeds, social media data
Statistics: Monitor and detect the statistical products and areas which occur in statistical and other web news:
• Detect new statistical products which are very frequent and not yet covered by NSI production
• Detect the statistical products produced by NSIs for which there is almost no demand
• This information helps the management of NSIs (together with stakeholders) to decide for which statistical products there is high public demand

Statistical domain: Economic indicators:
• Gross domestic product (GDP)
• Consumer price index (CPI)
• Retail sales
• Balance of payments
• Economic sentiment indicators
Data sources:
Big Data: Job vacancy ads from job portals, traffic loops, social data (Twitter, Facebook, etc.), supermarket scanner data, bank transaction data, news feeds/messages
Registers and existing sources: Statistical Register of Employment, data from the Employment Agency, tax data, wages and salaries
Surveys: Turnover data from various short-term surveys, business confidence index, consumer confidence index
Statistics: Flash and (or) intermediate estimates of economic indicators
Among the proposed data sources and associated statistics, the WP 6 team decided that the most
promising and interesting are the combinations of sources which could be used for early estimates of
economic indicators. The three main reasons for this decision are:
• There is very high demand from stakeholders for flash estimates of economic indicators;
• Many of the sources are available in most of the countries, so it is possible to test them and produce results for more than one country;
• Even if a country does not have access to any of the big data sources, it is still possible to test methods and processes on administrative and other existing sources.
Combining of data sources
When we think of combining data sources in traditional statistical production, we mostly think of
combining them at the micro level. If a common identifier exists, linking the data is quite
straightforward; otherwise, various record linkage methods are applied in order to derive the
identifier in the data set where it is missing. In the area of big data, combining different data sets is
more complicated. Often the (big) data sources are completely different, so that record linkage
techniques cannot be employed, or one of the data sets contains unstructured data, for which big
data techniques such as machine learning are needed in order to link the data.
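The micro-level case can be sketched as follows: records are linked on a common identifier where one exists, and otherwise by a simple string similarity on names. This is a minimal illustration and not a method used in the pilots; the record layout, the normalisation rules and the similarity threshold of 0.85 are assumptions, and production record linkage would normally rely on more elaborate (for example probabilistic) methods.

```python
from difflib import SequenceMatcher


def normalise(name: str) -> str:
    """Lower-case a name and drop dots and commas (illustrative rules only)."""
    return " ".join(name.lower().replace(".", " ").replace(",", " ").split())


def link(left, right, threshold=0.85):
    """Link records of `left` to records of `right`.

    Each record is a dict with keys "id" (may be None) and "name".
    Returns a list of (left_index, right_index, method) tuples.
    """
    by_id = {r["id"]: j for j, r in enumerate(right) if r["id"] is not None}
    links = []
    for i, rec in enumerate(left):
        # Case 1: a common identifier exists, so linking is straightforward.
        if rec["id"] is not None and rec["id"] in by_id:
            links.append((i, by_id[rec["id"]], "exact id"))
            continue
        # Case 2: no identifier; take the most similar name above the threshold.
        best_j, best_score = None, 0.0
        for j, cand in enumerate(right):
            score = SequenceMatcher(
                None, normalise(rec["name"]), normalise(cand["name"])
            ).ratio()
            if score > best_score:
                best_j, best_score = j, score
        if best_j is not None and best_score >= threshold:
            links.append((i, best_j, "name similarity"))
    return links


# Made-up records: a register with identifiers and a scraped source without.
register = [
    {"id": "SI001", "name": "Alpha Trade d.o.o."},
    {"id": "SI002", "name": "Beta Logistics d.d."},
]
scraped = [
    {"id": "SI002", "name": "Beta Logistics"},
    {"id": None, "name": "ALPHA TRADE DOO"},
    {"id": None, "name": "Gamma Foods"},
]
links = link(scraped, register)
```

In this toy example the third scraped record remains unlinked, which mirrors the situation described above in which sources are too different for record linkage to succeed.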
The other possibility is to link data sources at the macro level. Here we try to aggregate all data sets
using common identifiers and to include the aggregated data in nowcasting models.
Nowcasting⁴ is a very early estimate produced for an economic variable of interest over the most
recent reference period, calculated on the basis of incomplete data using a statistical or econometric
model different from the one used for regular estimates. Soft data should not play a predominant
role in nowcasting models. Nowcasts may be produced during the very same reference period for
which the data are produced.
In one of the pilots conducted during SGA-1, several nowcasting methods were investigated. Among
them, the most promising in terms of practical implementation was principal component analysis
(PCA). The central idea of PCA is to reduce the dimensionality of a data set consisting of a large
number of interrelated variables, while retaining as much as possible of the variation present in the
data set. This is achieved by transforming to a new set of variables, the principal components (PCs),
which are uncorrelated and are ordered so that the first few retain most of the variation present in
all of the original variables.
When nowcasting early indicators with a PCA model, big data and other sources could be combined in two ways:
• as regressors in the nowcast equation $y = \alpha_1 x_1 + \alpha_2 x_2 + \cdots + \alpha_k x_k + \beta_1 y_1$, where the variables $x_1, x_2, \ldots, x_k$ are the principal components of a given set of (big) data and the variable $y_1$ is an aggregate of the combined (big) data source. In the pilot conducted at SURS, where we tested the PCA model in order to estimate the real turnover index in industry (the time series of interest), the real turnover of industrial enterprises (the time series of enterprise data used to determine the principal components) was combined with the economic sentiment indicator, used as an additional predictor in the linear regression.
• as micro data in the nowcast equation $y = \alpha_1 x_1 + \alpha_2 x_2 + \cdots + \alpha_k x_k + \beta_1 y_1$, where the variables $x_1, x_2, \ldots, x_k, y_1$ are principal components of the two given sets of (big) data.
Business case for SGA-2
Early estimates of economic indicators
During the SGA-1 period, the WP 6 team explored big data and other sources which could be
combined for purposes of early estimates and conducted two pilots at Statistics Finland and the
Statistical Office of the Republic of Slovenia based on early estimates of some economic indicators.
Statistics Finland tested a series of shrinkage and factor analytic methodologies to compute nowcasts
of the main Finnish turnover indexes, using continuously accumulating firm-level data. They showed
that the estimates based on large dimensional models provide an accurate and timelier alternative to
the ones produced currently by Statistics Finland, even after taking into account data revisions. In
particular, it was found that the turnovers for some economic sectors could be estimated with high
accuracy five days after the reference month has ended, giving more accurate and faster predictions
compared to the first official internal release. The Statistical Office of the Republic of Slovenia
worked on a PCA model, for which a shareable application was created and tested on real industry
indices, with promising nowcasting results. This method was also tested on some big data sources
such as on-line job vacancy data.
⁴ Overview of GDP flash estimation methods; Eurostat 2016 Edition
Moreover, all of the countries involved in WP 6 (Finland, Poland, Netherlands, Slovenia) and some of
the other NSIs involved in the ESSnet Big Data project expressed considerable interest in earlier
estimates of the main economic indicators produced by NSIs. Given this interest in investigating
early estimates, and the fact that many big data sources could be associated with early economic
indicators, it was decided to propose the pilot on early estimates of economic indicators.
The aim of the pilot is to investigate big data and other existing sources for calculating flash and (or)
intermediate estimates of economic indicators. Early estimates will be considered for the following
economic indicators:
• Gross domestic product (GDP)
• Consumer price index (CPI)
• Retail sales
• Balance of payments
• Economic sentiment indicators
Some work will also be dedicated to exploring possible new leading economic indicators.
During the pilot, the correlation between the data sources and the early economic indicators will be
explored and, depending on the results (the sources identified for combination and the early
economic indicators to be tested), various models for flash and (or) intermediate estimates will be
tested. The most promising indicator is GDP, but the pilot will not limit itself to GDP, because the
analysis of the data sources may suggest calculating (better) estimates of other economic indicators.
Data sources
Many big data, statistical and other administrative sources could be linked to early economic
indicators. As one of the results of SGA-1, a list of possible sources which could be combined for the
purpose of early estimates of economic indicators was prepared (Table 2). Some of the sources have
already been investigated; for some of them accessibility is an issue. Since the goal is to nowcast
economic indicators, the availability of time series for a given data source should also be taken into
account.
Table 2. Overview of possible sources to be investigated
Big Data | Registers and existing sources | Surveys
Job vacancy ads from job portals | Statistical Register of Employment | Turnover data from various short-term surveys
Traffic loops | Data from the Employment Agency | Business confidence index
Social data (Twitter, Facebook, etc.) | Tax data | Consumer confidence index
Supermarket scanner data | Wages and salaries | …
News feeds/messages | … |
Bank transaction data | |
One of the data sources which could be acquired most easily is traffic sensor data. An additional
advantage is the availability of time series of traffic data, which is not the case for many other big
data sources. First results obtained at the Statistical Office of the Republic of Slovenia (Figure 1)
show a reasonably good fit between the curve of annual GDP and the curves of estimates of annual
GDP based on various annual aggregates of traffic density in Slovenia, using data for the period
2005-2014. The model used for estimation was simple linear regression.
Figure 1. Correlation between GDP and traffic sensor data
Figure 1 shows 5 examples of estimates of annual GDP based on aggregated categories of vehicles
on Slovenian roads. The categories used are:
• Light trucks (up to 3.5 t)
• Medium trucks (3.5-7 t)
• Heavy trucks (more than 7 t)
• Trucks with trailer
• Semi-trailers
In the example at the bottom right, all categories of vehicles were used as a regressor. Surprisingly,
the best results were obtained when all vehicles were taken into account. On the basis of these
encouraging initial results, a more detailed analysis of traffic loop data has been started (most of this
work is planned for SGA-2).
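The estimation step described above, a simple linear regression of annual GDP on an annual traffic aggregate, can be sketched as follows. The figures are synthetic and chosen only for illustration; they are not the Slovenian data, and the function names are hypothetical.

```python
def fit_simple_ols(x, y):
    """Return (intercept, slope) of the least-squares line y = a + b * x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


def r_squared(x, y, intercept, slope):
    """Share of the variance of y explained by the fitted line."""
    mean_y = sum(y) / len(y)
    ss_res = sum((yi - (intercept + slope * xi)) ** 2 for xi, yi in zip(x, y))
    ss_tot = sum((yi - mean_y) ** 2 for yi in y)
    return 1.0 - ss_res / ss_tot


# Synthetic illustration only: NOT the Slovenian figures.
traffic = [100.0, 104.0, 109.0, 101.0, 103.0, 106.0]  # annual traffic aggregate (index)
gdp = [30.0, 31.1, 32.4, 30.2, 30.9, 31.6]            # annual GDP (arbitrary units)

intercept, slope = fit_simple_ols(traffic, gdp)
fit = r_squared(traffic, gdp, intercept, slope)
estimates = [intercept + slope * t for t in traffic]
```

Repeating the fit with a different vehicle category as the regressor, and comparing the resulting goodness of fit, corresponds to comparing the panels of Figure 1.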
After some research it was found that the data can be acquired from the Slovenian Ministry of
Infrastructure (and from municipalities for local traffic). They offered us several choices for the data
format and also provided us with a sample of micro data. Samples of both raw data and so-called
“edited data” were provided. Raw data present the counts of vehicles by category per traffic loop,
while edited data represent time series of data from one or more traffic loops placed at the same
location.
Raw data
These are the raw data from every traffic sensor placed on Slovenian roads. As there are different
kinds of sensors that count different categories of traffic, we would need to merge the sensors at the
same counting spot according to a formula that adequately distributes these differing categories. The
number of categories differs according to the version of the sensor, as shown in the table:
QLD3 | Sensor QLD3 counts all vehicles
QLD5 | Sensor QLD5 distinguishes 5 vehicle categories
QLD6 | Sensor QLD6 distinguishes 10 vehicle categories
QLTC8 | Sensor QLTC8 distinguishes 10 vehicle categories
QLTC10 | Sensor QLTC10 distinguishes 10 vehicle categories
QLD | Counted with different versions of sensors
The sensors have some common features. Every sensor counts traffic on 2 channels, these being the
2 opposing lanes on regional roads or the ordinary and fast lanes on expressways and highways. The
counting interval is also the same for every sensor: 15 minutes. The data output file is a text file with
11 categories of vehicles, regardless of the number of categories a sensor actually counts. The
uncounted categories are not marked, but are filled with zeroes. The data also contain other
information, such as the highest, lowest and average speed in the interval, the average for personal
vehicles specifically, the average time gap between vehicles, the occupancy of the lanes and the
temperature.
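A first processing step on such raw data could be to sum the 15-minute counts over both channels into daily totals per counting spot, as in the sketch below. The record layout (spot identifier, channel, timestamp, list of 11 counts) is hypothetical, since the actual file format is not described here, and taking the sum of the 11 category columns as the all-vehicle total is an assumption.

```python
from collections import defaultdict

N_CATEGORIES = 11  # the output file always carries 11 category columns


def daily_totals(records):
    """Sum 15-minute counts over both channels to daily totals per counting spot.

    `records` is an iterable of (spot_id, channel, timestamp, counts) tuples,
    where `timestamp` is "YYYY-MM-DD HH:MM" and `counts` holds 11 integers.
    This record layout is hypothetical.
    """
    totals = defaultdict(lambda: [0] * N_CATEGORIES)
    for spot_id, _channel, timestamp, counts in records:
        if len(counts) != N_CATEGORIES:
            raise ValueError("expected 11 category counts per record")
        day = timestamp[:10]
        row = totals[(spot_id, day)]
        for k, value in enumerate(counts):
            row[k] += value  # uncounted categories are zero-filled, so they add nothing
    return dict(totals)


# Made-up records for one counting spot.
records = [
    ("spot-1", 1, "2015-03-02 08:00", [12, 3, 0, 0, 0, 0, 0, 0, 0, 0, 0]),
    ("spot-1", 2, "2015-03-02 08:00", [9, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0]),
    ("spot-1", 1, "2015-03-02 08:15", [15, 2, 0, 0, 0, 0, 0, 0, 0, 0, 0]),
    ("spot-1", 1, "2015-03-03 08:00", [7, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]),
]
totals = daily_totals(records)
# Assumption: the all-vehicle total is the sum over the 11 category columns.
all_vehicles = {key: sum(row) for key, row in totals.items()}
```

Because uncounted categories are filled with zeroes rather than marked, a zero in such totals cannot by itself distinguish "no vehicles" from "category not counted by this sensor"; the sensor version would have to be carried along to make that distinction.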
Traffic sensors
In 2015 there were 659 non-manual sensors in Slovenia.
The website promet.si provides information about traffic sensors
(https://www.promet.si/portal/sl/stevci-prometa.aspx, 26.1.2017), giving real-time information
about the current traffic situation. These traffic sensors cover all highways as well as other roads in
Slovenia. A map of all traffic loops in Slovenia is also available, which allows us to determine their
exact location
(http://www.di.gov.si/fileadmin/di.gov.si/pageuploads/Prometni_podatki/2015_karta_stm.pdf)
Roads
There are 12 categories of roads in Slovenia. This is very important to know, because traffic density
on some roads may influence economic indicators differently (e.g. it may be desirable to exclude
foreign vehicles which merely cross the country). One of the steps in SGA-2 is to investigate which
types of roads and which categories of vehicles are most correlated with economic indicators.
A more detailed description of the categories of roads can be found at
http://www.stat.si/StatWeb/File/DocSysFile/8025
Methodology
Nowcasting
The main idea is to test in practice at least one of the nowcasting methods for the purpose of
estimating early economic indicators. As mentioned above, the PCA model has been tested at SURS
for this purpose.
The model consists of two stages:
1. Principal component analysis (PCA)
   • dimensionality reduction
   • time series of (enterprise) data → standardise → choose the first few principal components
   • various conditions for choosing principal components:
      - The chosen principal components explain at least 70% (75%, 80%, 85%, 90%) of the variability of the enterprise data.
      - The time series in the linear regression model are at least 7 (8, 10, 15, 20) times longer than the number of chosen principal components.
      - The last chosen principal component explains at least 5% of the variability of the enterprise data.
2. Linear regression
   • y (dependent variable): time series of interest, e.g. turnover index
   • x_1, x_2, …, x_k (predictors): e.g. the chosen principal components
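The conditions for choosing the number of principal components can each be written as a small rule, as in the sketch below. It treats the three conditions as separate alternatives to be tested, which is how they are listed above; how (or whether) the SURS application combines them is not specified here. The function names and the explained-variance shares are made up for illustration, and the default thresholds are simply the first values listed above.

```python
def k_by_cumulative_share(explained, threshold=0.70):
    """Smallest k whose first k components together explain at least `threshold`."""
    total = 0.0
    for k, share in enumerate(explained, start=1):
        total += share
        if total >= threshold:
            return k
    return len(explained)


def k_by_series_length(n_obs, factor=7):
    """Largest k such that the series is at least `factor` times longer than k."""
    return n_obs // factor


def k_by_last_share(explained, min_share=0.05):
    """Largest k such that the k-th component still explains at least `min_share`."""
    k = 0
    for share in explained:  # shares are assumed sorted in decreasing order
        if share < min_share:
            break
        k += 1
    return k


# Made-up shares of variability explained by the leading components
# (remaining components omitted).
explained = [0.48, 0.21, 0.09, 0.06, 0.04, 0.03]

k_70 = k_by_cumulative_share(explained, 0.70)  # cumulative-share rule at 70%
k_len = k_by_series_length(60, 7)              # 60 observations, factor 7
k_5 = k_by_last_share(explained, 0.05)         # last component explains >= 5%
```

The chosen components would then enter the second-stage linear regression as predictors; the series-length rule gives only an upper bound on k and says nothing about how much variability is retained.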
SURS, with the help of Statistics Finland, prepared an application (together with instructions on how
to use it) which allows:
• inputting different kinds of data
• testing various conditions for choosing principal components
• producing quality indicators which compare the results of different nowcasting methods
Figure 2. Comparison of estimates of real turnover index and actual index
Now that an operational application implementing one of the nowcasting methods is available, the
plan for SGA-2 is to include big data sources in the model and to test the performance of the
estimates of early economic indicators. There will also be a focus on assessing the quality of the
process and of the calculated estimates of early indicators.
Nowcasting turnover indexes (Finnish experience)
The main motivations for, and features of, the work done by Statistics Finland were:
• A pressing issue was the long publication lag, together with requirements from FRIBS and from users (i.e. national accounts);
• StatFi wanted a practical solution, so another aim was simplicity and tractability in terms of data source and method, if possible;
• To propose methods that can be useful for big data and that are commonly used in the nowcasting of macroeconomic variables;
• Using continuously accumulating firm-level data (hard data);
• Estimating the common components underlying these data with factor analysis;
• The common components would be predictors in nowcasting equations;
• In the machine learning literature, other options were available (such as LASSO, ridge and elastic net regressions) that could deal with high-dimensional econometric problems.
Firm-level data were used for these purposes. One of the issues which had to be solved was the high
dimensionality of the data. Factor analysis, which estimates the common and idiosyncratic variance
underlying the data, is widely used. However, shocks to large companies can have a sizeable impact
on the Finnish economy (economic activity is concentrated in a few multinationals, e.g. Nokia). That
is why so-called shrinkage models were explored, in order to better capture some of the firm-specific
variation. They include all the firm growth rates in the estimation and deal with the risk of
overfitting by shrinking parameter values towards 0. These models in general outperform factor
models.
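The idea of shrinking parameter values towards 0 can be illustrated with ridge regression, one member of the shrinkage family. The sketch below is not the model used by Statistics Finland: the data are synthetic, the penalty value is arbitrary, and no intercept is fitted. It only shows that a positive penalty pulls the coefficients towards zero relative to ordinary least squares.

```python
def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (m[i][n] - sum(m[i][j] * x[j] for j in range(i + 1, n))) / m[i][i]
    return x


def ridge(X, y, lam):
    """Coefficients minimising ||y - X b||^2 + lam * ||b||^2 (no intercept)."""
    p = len(X[0])
    xtx = [[sum(row[i] * row[j] for row in X) for j in range(p)] for i in range(p)]
    xty = [sum(row[i] * yi for row, yi in zip(X, y)) for i in range(p)]
    for i in range(p):
        xtx[i][i] += lam  # the penalty is added to the diagonal of X'X
    return solve(xtx, xty)


def norm(b):
    return sum(v * v for v in b) ** 0.5


# Synthetic "firm growth rates" (rows: periods, columns: firms), illustration only.
X = [
    [1.0, 0.5, -0.2],
    [-0.5, 1.0, 0.3],
    [0.3, -0.4, 1.0],
    [-0.8, -0.6, -0.5],
    [0.6, 0.2, -0.9],
    [-0.6, -0.7, 0.3],
]
true_b = [0.6, 0.3, 0.1]
y = [sum(w * v for w, v in zip(true_b, row)) for row in X]

b_ols = ridge(X, y, 0.0)     # no penalty: ordinary least squares
b_ridge = ridge(X, y, 10.0)  # arbitrary penalty: coefficients shrink towards 0
```

LASSO and elastic net differ in the form of the penalty (LASSO can set coefficients exactly to zero), but the principle of trading a little bias for lower variance when there are many firm-level regressors is the same.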
Statistics Finland has also tested nowcasting the second month of the quarter and forecasting the
third, which makes it possible to compute real-time estimates of quarterly GDP. The methods were
similar, using the sales inquiry as the source of firm-level data. Turnover indexes are widely followed
in their own right, but are also used as source material for producing the Trend Indicator of Output
(TIO, i.e. the Finnish monthly economic activity indicator).
Conclusions and outlook
The first aim of ESSnet Big Data WP 6 (SGA-1) was to find out which pilot would combine multiple
(big) data sources and have real potential to be implemented (by at least two countries) in the SGA-2
period. The proposed pilot is Early estimates of economic indicators. The proposal for the pilot has
been made, and the positive response from other countries shows that we are on the right track.
According to the plan, the WP 6 team worked on the following deliverables:
• A detailed business plan for the pilot was prepared;
• A list of tasks for each country involved in SGA-2 was prepared;
• An initial investigation of the available data sources in the participating countries was carried out.
The second aim of the report was to show which of the two proposed pilots has the greater potential
to be implemented during the first wave of pilots. After the first few months of investigation it was
found that the pilot on nowcasting turnover indices is much more feasible in terms of available data.
Another advantage of this pilot is that the nowcasting models will be tested during the next period.
The project team sees a clear connection between those models and the proposed pilot on early
estimates, where the experience gained from nowcasting turnover indices could be used.
For the SGA-2 period, Italy and Portugal will join the existing WP 6 team (which for SGA-1 consisted
of Finland, Poland, Netherlands, and Slovenia). Internal WP 6 meetings have already been organised
with the new members, at which we discussed the organisation of work, possible (big) data sources,
nowcasting methods and the economic indicators for which we intend to calculate early estimates.
The most interesting among them are quarterly GDP (overall GDP or some of its components) and
new leading economic indicators which measure the business cycle. However, in the case of
estimates related to GDP we need to bear in mind that the demands on the precision of the
estimates are very strict.
2.7 Multi Domains
The aim of WP 7 is to find out how a combination of big data sources, administrative data and
statistical data may enrich statistical output (how they can be used to improve current statistics) in
the domains ‘Population’, ‘Tourism/border crossings’ and ‘Agriculture’.
This work package has a more scientific nature. From the methodological, qualitative and technical
point of view it is required to work with professional independence. The activities in this work
package are independent from the results of the work packages 1 to 5, although, where relevant,
information with other work packages is exchanged. Especially, WP 6 and WP 7 are related. These
work packages both concern the exploration of combining various sources, including big data
sources, for the production of statistics.
The work package team aims to describe the data collection, data linking, data processing and
methodological aspects of combining data in statistical domains. Additional value could be created
by investigating the linkages between domains.
Since this work package considers many crosscutting issues, such as methodology, quality and
technical requirements, care will be taken that its outputs can be used as inputs for the WP 8 of SGA-
2. The participation of Statistics Netherlands, which leads WP 8 of SGA-2, is aimed at ensuring this.
Under SGA-1, apart from GUS (Statistics Poland) which is leading WP 7 and CBS (Statistics
Netherlands), this work package is carried out by 2 other representatives of ESSnet Big Data partners:
CSO (Statistics Ireland) and ONS (Statistics United Kingdom).
The WP 7 team divided work into 4 main groups of tasks:
Task 1. Data availability/Data inventory
Task 2. Data feasibility
Task 3. Data combination (SGA-2)
Task 4. Summary plus future perspectives (SGA-2)
Similarities and differences between countries, for example concerning the availability of registers
and the legality of data linkage, are taken into account when carrying out the tasks.
At the end of SGA-1, work was carried out mainly in the field of pilots. Previous tasks had provided
appropriate preparation for further studies. In the second part of SGA-1 WP 7 started first practical
work. The results obtained so far, divided into individual domains, are presented below.
As the final results of SGA-1, WP 7 prepared at the end of January 2017 three final reports:
7.1 Report for Population domain
7.2 Report for Tourism/Border crossing
7.3 Report for Agriculture.
These (combined) reports contain basic information on data access (such as legal and privacy
aspects), data quality issues, methodology (including the combination of data) and the technical
aspects of the data.
The reports were submitted to the Review Board on 1 February 2017. As this was the final phase of
SGA-1 for the team, WP 7 then started preparing for SGA-2. The FPA provided for an overlap between
SGA-1 and SGA-2, and WP 7 was one of the work packages that started work under the second
agreement in March 2017.
All tasks were completed within the scheduled timeframes.
Population
The population domain mainly covers studies that analyse the attributes of the population. Survey
results are often presented in the form of population structure, for example the number of women per
100 men, or as an index, for example net migration per 1000 people.
Because data found on the Internet are not representative, and conclusions can be drawn only for
persons who are active on the Internet, it was decided to link the population indicators with
indicators from social research, such as life satisfaction or participation in elections (intention
to vote). This scope was also chosen because the results can be compared with results obtained
through traditional research techniques.
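The two structural indicators named above as examples (women per 100 men, net migration per 1000 people) are simple ratios. The following minimal sketch shows how they could be computed; the function names and the input figures are illustrative assumptions, not values from the pilot.

```python
def women_per_100_men(women: int, men: int) -> float:
    """Number of women per 100 men in a population."""
    return 100.0 * women / men


def net_migration_per_1000(immigrants: int, emigrants: int, population: int) -> float:
    """Net migration (immigrants minus emigrants) per 1000 inhabitants."""
    return 1000.0 * (immigrants - emigrants) / population


# Hypothetical figures, for illustration only.
print(women_per_100_men(women=1070, men=1000))
print(net_migration_per_1000(immigrants=1500, emigrants=1000, population=250000))
```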
It was decided to split the use case into three parts:
Daily satisfaction (Life satisfaction);
The moods of population associated with public events (e.g., Brexit, Voting);
The morbidity areas (e.g., flu).
Accordingly, the aim of this study is:
To examine the level of daily satisfaction of the population by analysing the content of
messages for the presence of defined expressions describing emotional states, e.g.,
happiness, joy, sadness, fear, anger;
To present the moods of the population associated with various public events;
To observe morbidity areas, e.g., flu.
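The first aim, analysing messages for the presence of defined expressions describing emotional states, can be illustrated with a small keyword matcher. This is only a sketch: the report names the emotional states but not the expressions, so the lexicon below is a hypothetical assumption.

```python
import re
from collections import Counter

# Hypothetical lexicon: the emotional states come from the text above,
# the expressions listed for each state are assumptions for illustration.
EMOTION_LEXICON = {
    "happiness": {"happy", "happiness", "sohappy"},
    "joy": {"joy", "joyful"},
    "sadness": {"sad", "sadness"},
    "fear": {"afraid", "fear", "scared"},
    "anger": {"angry", "anger", "furious"},
}

TOKEN_PATTERN = re.compile(r"[a-z]+")


def detect_emotions(message: str) -> Counter:
    """Count, per emotional state, how many defined expressions occur in a message."""
    tokens = TOKEN_PATTERN.findall(message.lower())
    counts = Counter()
    for emotion, expressions in EMOTION_LEXICON.items():
        hits = sum(1 for token in tokens if token in expressions)
        if hits:
            counts[emotion] = hits
    return counts


print(dict(detect_emotions("I am sad and scared today")))
```

A plain lexicon like this cannot handle negation or irony, which is one reason to move on to trained classifiers.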
Population indicators will be limited to:
Residence population (and migration);
Number of women per 100 men;
Population structure;
of persons using new technologies to communicate, such as social media, blogs etc. These indicators
will be related to social statistics, such as life satisfaction, moods and morbidity areas.
Data obtained through the proposed solutions make it possible to:
extend the scope of the database;
obtain more current results;
add more detailed cross-sections for the study population of users of social media and the
Internet (currently there are no such subpopulations in similar thematic studies).
In the first stage (December 2016 - February 2017) it was decided to conduct a pilot of Use Case 1,
i.e. Daily satisfaction within Life satisfaction. The other Use Cases in the Population domain,
numbers 2 and 3, are carried out in February and March 2017.
Table 1. Schedule for carrying out the Use Cases
Use Case 12.2016 01.2017 02.2017 03.2017
1.1.
1.2.
1.3.
The criteria for selecting the Use Case relate primarily to the availability of source data and the
reliability of the results. It was therefore decided to use widely available sources that can be
accessed through web scraping or a dedicated API.
It was assumed that life satisfaction data can be derived from social networking sites and web
pages. The advantage of social networking sites over other websites is that the geographical
location of an entry can be obtained. Specifically, the following portals and tools were identified
as data sources:
DS1 – Twitter
DS2 – Google Trends
DS3 – Comments on Specific News/Events on Web Portals such as gazeta.pl, bbc.co.uk,
irishtimes.com, spiegel.de
According to the data quality assessment, the data from these sources are of relatively high
quality, sufficient to acquire and process them in order to achieve the main objective of the Use Case.
The main aim of Use Case 1.1 is to obtain data on Daily Satisfaction. This means that life
satisfaction is assessed by analysing entries posted on the Internet. To achieve this objective, it
was decided to use API tools to retrieve data from the social networking site Twitter.
Due to the nature of entries on social networking sites, it is not possible to grade life
satisfaction on a scale, as is done, for example, in EU-SILC (European Union Statistics on Income
and Living Conditions).
The main advantage of the proposed solution is that measurements can be carried out more frequently
than with the survey modules currently in use. A further benefit is a lower burden on respondents.
The realisation of the Use Case includes seven steps, in which the following tasks will be carried
out in turn:
Classification of tweets in order to extract from them the attributes related to life
satisfaction;
Analysis of the possibility to obtain the attributes of the population;
Preparation of a Python solution for acquiring tweets and processing them with
Machine Learning algorithms;
Training the Machine Learning algorithms to classify tweets;
Verification of the algorithms on the test data set;
The choice of algorithms to conduct the survey in a test environment;
Conducting a pilot in a production environment.
The solution prepared in the first phase uses Python version 3 and the Apache Spark environment
for data acquisition and processing. For the web scraping, the Twitter API library Tweepy was
used. The Machine Learning algorithms were implemented with the scikit-learn library. The overall
set-up is presented in Figure 1.
Figure 1. Big Data Framework used for Use Case 1.1.
Examples from the training set are presented in Table 2.
Table 2. The content of the training set
No. | Text | Target | Language | Id
1 | Rousey is gonna quit UFC forever now lmao #SoHappy | Satisfied | EN | F1
2 | And I did absolutely nothing #satisfied ❤ | Satisfied | EN | F2
3 | To był cudowny weekend #love #happy #awesome #osom #bestweekend @ Czestochowa | Satisfied | PL | F3
4 | Połączenie nowoczesnego designu z funkcjonalnością sprawi, że osiągniesz jeszcze lepsze wyniki. | Neutral | PL | F4
5 | They want more happiness & more money in 2017 cause they're not satisfied w/the position of each. It don't matter the context. #Unsatisfied | Not satisfied | EN | F5
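To make the path from a labelled training set to a predictive model concrete, the sketch below trains a small multinomial naive Bayes classifier on bag-of-words features. It is not the project's code: the pilot used scikit-learn, while this stand-in is written in plain Python so that it runs without dependencies, and the toy training texts are invented for illustration (only the three target labels come from Table 2).

```python
import math
import re
from collections import Counter, defaultdict


def tokenize(text):
    """Lower-case a message and split it into words and hashtags."""
    return re.findall(r"[#\w]+", text.lower())


def train(examples):
    """Count word occurrences per label from (text, label) pairs."""
    word_counts = defaultdict(Counter)
    label_counts = Counter()
    vocabulary = set()
    for text, label in examples:
        tokens = tokenize(text)
        word_counts[label].update(tokens)
        label_counts[label] += 1
        vocabulary.update(tokens)
    return word_counts, label_counts, vocabulary


def predict(model, text):
    """Return the label with the highest naive Bayes log-probability."""
    word_counts, label_counts, vocabulary = model
    total_docs = sum(label_counts.values())
    known_tokens = [t for t in tokenize(text) if t in vocabulary]
    best_label, best_score = None, -math.inf
    for label, n_docs in label_counts.items():
        total_words = sum(word_counts[label].values())
        score = math.log(n_docs / total_docs)
        for token in known_tokens:
            # Laplace (add-one) smoothing avoids zero probabilities.
            score += math.log(
                (word_counts[label][token] + 1) / (total_words + len(vocabulary))
            )
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Invented toy training set; a real one would hold many labelled tweets.
TRAINING_SET = [
    ("what a great day #happy", "Satisfied"),
    ("lovely weekend #happy #love", "Satisfied"),
    ("new product launch next week", "Neutral"),
    ("office opens at nine", "Neutral"),
    ("terrible service #unsatisfied", "Not satisfied"),
    ("so disappointed #unsatisfied", "Not satisfied"),
]

model = train(TRAINING_SET)
print(predict(model, "feeling #happy today"))
```

With a training set this small the model mostly reacts to hashtags; verification on a separate test set, as listed in the steps above, is what would show whether a classifier generalises.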
The general outline of this study includes the following topics:
Webscraping
Analysis of comments/messages
The selection and arrangement of useful information
Quantitative and qualitative classification of posts/messages/comments (machine learning)
To sum up, the project establishes the use of large data sets from social networks to assess the
current mood of people in connection with public events, as well as related indicators such as
life satisfaction. This makes it possible to study the impact of public mood on decisions. An
analysis of the moods of citizens of different countries in relation to the same events, and of
mood changes over time, could be particularly interesting.
The work plan for the future is to develop more indicators of satisfaction in various areas of
life, e.g. work, commuting time etc.
[Figure 1 labels: Twitter data; Tweepy; Sklearn; Training Dataset; Machine Learning algorithm; Data extracting; Predictive model; Labels; Feature vectors; Result set]
Tourism/border crossing
As regards data sources, we identified the following data owners (names in the original language):
Generalna Dyrekcja Dróg Krajowych i Autostrad for Poland⁵ (GDDKiA)
Bundesanstalt für Straßenwesen for Germany (BASt)
Ředitelství Silnic a Dálnic for Czech Republic (RSD)
Národná diaľničná spoločnosť for Slovakia (NDS)
Kelių ir transporto tyrimo institutas for Lithuania (KTTI)
So far, we have started cooperating with GDDKiA, BASt, KTTI and NDS, and we are awaiting an answer
from RSD.
In parallel, the Statistical Office in Rzeszów conducts a sample survey on border traffic. In this
survey, days are sampled within each quarter and measurement points are sampled from all possible
measurement points along the border.
On the one hand, we have a large amount of data but only for a few points in time; on the other
hand, we have high-frequency data but of low volume. The volume of data from GDDKiA is slowly
increasing over time, as some of the measurement points were not available in the past. We also have
some mirror statistics: data from the German and Lithuanian side are of high frequency and plausible
volume, while for other countries data frequency and volume are similar to those for Poland. In
addition, our sample survey provides high-frequency data for several points near the border.
However, we expect the role of surveys to decrease significantly as a result of our efforts, so
that survey data may be used only up to a certain moment in the future. After that moment we hope
to base results on big data sources only.
In the first step we prepared a template to be filled with spatio-temporal data. We selected the
length of the time series according to data availability and chose appropriate measurement points:
all points from Continuous Traffic Measurement and all points near the border. Most of the points
from Continuous Traffic Measurement are situated in the interior of Poland; only a few are near
the border. For several points mirror statistics are available. The template will be filled in for
the following data sources:
General Traffic Measurement from GDDKiA
Continuous Traffic Measurement from GDDKiA
Continuous Traffic Measurement for German border from BASt
General Traffic Measurement for Slovakian border from NDS
Continuous Traffic Measurement for Lithuanian border from KTTI
Survey on border traffic conducted by the Statistical Office in Rzeszów, as support for improving
model parameters but not as a source of data for further estimations
Additional data will also be gathered:
Distance matrix between measurement points;
Number of registered vehicles at LAU1 level;
Other available data which can be connected with traffic.
5 General Directorate of National Roads and Motorways
Continuous Traffic Measurement has not been conducted at all measurement points for the whole
period: for some points data are lacking at the beginning of the time series, and there are also
some gaps in the middle of time series. Missing data will therefore be imputed using correlated
time series that are available for the missing period. As a result, we shall obtain a larger set of
time series covering the same period.
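One simple way to impute gaps from a correlated series is to regress the incomplete series on a complete "donor" series over the periods where both are observed, and to use the fitted line for the missing periods. The sketch below illustrates this idea only; the report does not specify the imputation model, and the function names and figures are hypothetical.

```python
def fit_line(x, y):
    """Ordinary least squares fit of y = intercept + slope * x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def impute_from_donor(target, donor):
    """Fill None values in `target` using a linear fit on a correlated donor series."""
    observed = [(d, t) for d, t in zip(donor, target) if t is not None]
    slope, intercept = fit_line([d for d, _ in observed], [t for _, t in observed])
    return [
        t if t is not None else intercept + slope * d
        for d, t in zip(donor, target)
    ]


# Hypothetical yearly traffic counts at two nearby measurement points.
donor_series = [100, 110, 120, 130, 140]      # complete series
target_series = [None, 220, 240, None, 280]   # gaps at the start and in the middle
print(impute_from_donor(target_series, donor_series))
```

Observed values are kept unchanged; only the gaps are filled, so the choice of donor series matters most where the two points react differently to changes in the road network.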
Although some data are missing, some spatial and temporal analyses have already been carried out
on the available data. Figure 2 presents a histogram of annual average daily traffic (AADT).
Figure 2. AADT for measurement points in General Traffic Measurement in 2015
Traffic intensity is concentrated in a way that resembles the Pareto (80/20) rule, although less
extremely: 67% of traffic flows through 32% of measurement points. The distribution is right-skewed
and concentrated, and data variability is very high. Basic statistics of the spatial distribution
are presented below.
Table 3. Distribution of traffic intensity in General Traffic Measurement in 2015
Statistics Value
Mean 12818
Standard deviation 12329
Coefficient of variation 96%
Median 9622
Kurtosis 5,20
Skewness 1,92
Minimum 179
Maximum 73937
N 148
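The coefficient of variation in Table 3 is the standard deviation divided by the mean, and the concentration statement (a given share of traffic passing through a given share of points) can be computed by sorting the points by AADT. The sketch below checks the first against the table values and illustrates the second on invented numbers; the function names and the toy list are assumptions.

```python
def coefficient_of_variation(mean, std_dev):
    """Standard deviation expressed as a fraction of the mean."""
    return std_dev / mean


def share_of_top_points(aadt_values, top_fraction):
    """Share of total traffic carried by the busiest `top_fraction` of points."""
    ordered = sorted(aadt_values, reverse=True)
    top_count = max(1, round(top_fraction * len(ordered)))
    return sum(ordered[:top_count]) / sum(ordered)


# Mean and standard deviation as reported in Table 3.
cv = coefficient_of_variation(mean=12818, std_dev=12329)
print(f"Coefficient of variation: {cv:.0%}")

# Invented AADT values for five points, for illustration only.
toy_aadt = [100, 40, 30, 20, 10]
print(share_of_top_points(toy_aadt, top_fraction=0.2))
```

The mean in Table 3 (12818) lying well above the median (9622) is consistent with the right-skewed distribution described in the text.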
Basic temporal analysis reveals that most of the time series exhibit clear regularities. A simple
check of the R-squared of a linear trend model shows that more than 65% of the Continuous Traffic
Measurement time series of relevant length have an R-squared greater than 0,5. This allows us to
produce reasonable forecasts for these time series. On the other hand, we discovered that some time
series show a large fall (level shift), probably connected with changes in the traffic network.
Examples of well-behaved and ill-behaved time series are presented below.
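The R-squared check described above can be sketched as follows: fit a linear trend to each series against its time index, compute R-squared, and count the share of series above the 0.5 threshold. The series used here are invented for illustration.

```python
def linear_trend_r_squared(series):
    """R-squared of an ordinary least squares linear trend fitted to a series."""
    n = len(series)
    mean_t = (n - 1) / 2
    mean_y = sum(series) / n
    sxx = sum((t - mean_t) ** 2 for t in range(n))
    syy = sum((y - mean_y) ** 2 for y in series)
    sxy = sum((t - mean_t) * (y - mean_y) for t, y in enumerate(series))
    if syy == 0:
        return 0.0  # constant series: there is no variance to explain
    # For a single regressor, R-squared equals the squared correlation.
    return sxy ** 2 / (sxx * syy)


def share_above_threshold(all_series, threshold=0.5):
    """Share of series whose linear trend R-squared exceeds the threshold."""
    above = sum(1 for s in all_series if linear_trend_r_squared(s) > threshold)
    return above / len(all_series)


# Invented examples: a steadily growing series and an irregular one.
steady = [9000, 9500, 10100, 10400, 11000, 11600]
irregular = [1, 3, 1, 3, 1, 3]
print(linear_trend_r_squared(steady), linear_trend_r_squared(irregular))
print(share_above_threshold([steady, irregular]))
```

Note that a single large level shift can still yield a high R-squared for a linear trend, so a high value alone does not rule out the ill-behaved case; level shifts need a separate check.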
[Figure 2 chart: absolute frequency (axis 0 to 60) and cumulative distribution (axis 0,00% to 120,00%) by AADT (axis 179 to 67791)]
Figure 3. Examples of time series of traffic intensity
There are a few crucial steps in building a model of traffic intensity which will be performed in our
use case:
Data imputation;
Connecting traffic intensity variables with exogenous variables;
Including distance matrix or adjacency matrix to improve data coherence;
Modelling level shifts;
Building a prior traffic intensity based on General Traffic Measurements;
Building a posterior traffic intensity based on Continuous Traffic Measurements from several
sources by applying combining data methods;
Temporal disaggregation of yearly data to quarterly data.
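The report does not state which combining method is used for the prior and posterior steps. As one possible illustration, the sketch below applies a standard precision-weighted (normal-normal) update, in which a prior estimate and a new measurement are averaged with weights inversely proportional to their variances. All names and numbers are hypothetical.

```python
def combine_prior_and_measurement(prior_mean, prior_var, obs_mean, obs_var):
    """Precision-weighted combination of a prior estimate and a measurement.

    Returns the posterior mean and variance under a normal-normal model.
    """
    prior_precision = 1.0 / prior_var
    obs_precision = 1.0 / obs_var
    posterior_var = 1.0 / (prior_precision + obs_precision)
    posterior_mean = posterior_var * (
        prior_precision * prior_mean + obs_precision * obs_mean
    )
    return posterior_mean, posterior_var


# Hypothetical example: a prior AADT from a periodic measurement and a
# more recent value from a continuous measurement, equally uncertain.
mean, var = combine_prior_and_measurement(
    prior_mean=10000, prior_var=4.0, obs_mean=12000, obs_var=4.0
)
print(mean, var)
```

With equal variances the result is the simple average; a more precise source pulls the posterior towards its own value.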
Agriculture
Agriculture is one of the sectors of the economy; its main task is to provide agricultural products.
Plant and livestock products are obtained through tillage, plant breeding and animal husbandry.
Agriculture is also an area which has a strong impact on the environment. In recent decades, the
agricultural sector has seen much change. Recent research in this sector has led to data being
produced at different stages of agricultural production. These data can be processed and analysed,
contributing to increased efficiency, productivity or better use of resources.
The free, full and open data policy adopted for the Copernicus programme foresees access to the
Sentinel⁶ data products for all users, via a simple pre-registration. Registration is open to all
users upon completion of an on-line self-registration accessible via the Sentinels Scientific Data Hub.
Member States requiring data for national initiatives in the frame of the Sentinels Collaborative
Ground Segment need not register on this service; they are served via a dedicated access point.
Following registration, the user can immediately download Sentinel products generated
systematically from all acquired data. Depending on the mission and the acquisition time of the
product, the full operational qualification may not yet be completed.
6 Source: https://sentinel.esa.int/web/sentinel/sentinel-data-access
[Figure 3 chart: AADT (axis 0 to 16000) by year (2006 to 2015) for a well-behaved and an ill-behaved time series]
Satellite data have the potential to be important for this domain, for example for generating
agricultural maps. Because satellites pass over the same areas on successive orbits at a constant
time interval, satellite images allow us to monitor changes in the field situation. The main
applications of satellite data in the agriculture domain are monitoring of crop conditions,
seasonal changes and soil properties, and mapping of tillage activities.
Moreover, satellite data enable us to monitor changes in agricultural production or soil quality
and support policy for sustainable development. Agricultural maps based on satellite images provide
independent and objective estimates of the extent of cultivation in a given country or growing season.
The use of aerial and satellite images for mapping land cover and identifying land use change is one
of the most advanced and most widely used methods of remote sensing. The most commonly used
methods are:
computer-aided interpretation of the types of land cover in high resolution satellite
images;
semi-automatic classification of the types of land cover in high resolution satellite
images using advanced techniques for identifying classes, based on object classification (using
spectral, textural and contextual attributes of objects).
Object-oriented methods are also used to classify land use from microwave images, which is
currently of great importance given the free Sentinel-1 satellite data of the Copernicus programme.
Scientific work focuses on the creation and improvement of methods for processing and
classifying different types of remote sensing data. This work is carried out in a number of research
centres. In particular, it concerns the establishment of optimal methods for the classification
of multispectral optical images and radar images. Different approaches to data classification are
tested, both at pixel and at object level (spectral mixing analysis, decision tree analysis,
neural network analysis, morphological image analysis, multi-fractal analysis), in order to
optimise the results of the classification. In parallel, studies are conducted into methods that
combine optical and radar images to classify elements covering the surface of the earth.
Classification algorithms are developed continuously, depending on needs.
An extremely important aspect of all work with satellite data is their calibration
by "in situ" measurements.
3. Issues encountered
3.1 General issues
Planning and related issues
The planning of the milestones and deliverables of SGA-1 is given in the form of a Gantt chart in
Annex II of the agreement of SGA-1.
Has the planning been realised? By now all milestones and deliverables have been realised, but in a
few instances there was some minor delay. The only case where the delay was more substantial was
WP 5. Milestone 5.3, the organisation of a meeting with Telecom-providers, did not take place before
but after the summer of 2016. This had to do with the time needed for preparation, in particular
ensuring the participation of relevant Telecom-providers. The work package leader, the ESSnet co-
ordinator and the Eurostat project manager of the ESSnet together reached the conclusion that
holding the meeting after summer would yield a far better result, and they agreed on postponing the
meeting accordingly. As a consequence, deliverable 5.1 (a report on the current status of data access
to mobile phone data in the ESS) was produced with a delay, as was the case with deliverable 5.2 (a
report with recommendations), but both deliverables were realised before the end of SGA-1. The
output of WP 5 was not affected, only the timing – and for good reasons.
More details on the realisation of the planning of the specific work packages will be provided in the
Final Report on the Implementation of the Action, due 60 days following the closing date of the
action.
The issues that occurred in carrying out the action of SGA-1 for the specific work packages are
discussed in section 3.2. There were no major cross-cutting issues, and the CG meetings proved to be
effective in solving all matters that affected multiple work packages.
Use of the Sandbox⁷
The Sandbox is a shared platform for storage and computation of big data, hosted and managed by
ICHEC (Irish Centre for High-End Computing). The Sandbox is one of the outcomes of the HLG Big
Data project, carried out in 2014 and 2015 and facilitated by UNECE. The use of the Sandbox as a
training and collaboration platform was successfully tested during the project. After the end of
the project, an agreement was reached with ICHEC to grant the use of the Sandbox to organizations
on a subscription basis. The main characteristic of the Sandbox is the possibility to share
datasets and work with big data tools without any software installation and configuration. The
tools included in the Sandbox are accessible remotely from any computer simply through a web
browser and require no special software or hardware.
A subscription to ICHEC for the use of the Sandbox was acquired by the ESSnet project. The
subscription allows all project participants to create an account in the Sandbox, to upload and
download datasets and to use the tools for analysis.
7 This section was kindly written by Antonio Virgillito, who is the Sandbox officer for the ESSnet.
A special section in the project wiki is dedicated to the Sandbox, with instructions on access and use
and documentation for all the tools:
https://webgate.ec.europa.eu/fpfis/mwikis/essnetbigdata/index.php/Sandbox
The Sandbox has been used in SGA-1 in the following ways:
WP 2. Software for scraping websites was uploaded to the Sandbox by members of WP 2 and is
available to all users.
WP 3. Members of WP 3 requested the use of the Sandbox for experimenting with the analysis of smart
meter data. Synthetic smart meter data sets were used during the HLG project but were removed
after the project ended and were no longer available. A search for public data on the topic led
to an open data set published by the Australian government. The dataset was uploaded directly
to the Sandbox and is now available to all users. It contains 350M real observations of electricity
consumption readings of 80 000 anonymized domestic customers over a time range of 3 years.
Examples of the use of the smart meter datasets for analysis and visualization were developed
during one of the ESTP courses on big data held in 2016 and are available in the project wiki.
WP 4. Members of WP 4 uploaded to the Sandbox a sample of the datasets concerning ship
positioning data. The dataset is available to all users and represents six months of
observations of positions of ships in European waters. Examples of the use of the AIS datasets for
analysis and visualization were developed during one of the ESTP courses on big data held in 2016
and are available in the project wiki.
WP 5. Access to the UNECE Sandbox was requested in order to have a view of such a platform as a
potential model to store and process data in-situ within the MNOs' own premises.
3.2 Issues at the level of the work packages
1 Webscraping / Job Vacancies
The overriding issue for WP 1 is how to address the various technical challenges of using web scraped
job advertisements to produce concrete and meaningful outputs. This is summarised in Section 2.1.
Some of the issues encountered within this pilot are country specific. For example, Sweden has faced
an on-going problem in getting permission from their legal department to undertake web scraping.
Germany have a specific issue in gaining access to job vacancy survey micro data that is held by
another government department. The approach here has been to focus on other areas where
progress is possible. For example, in Sweden, the focus has been on analysing data obtained through
the Swedish Employment Agency. In Germany, a comparison of job portal and job vacancy survey
data was made based on aggregated data.
Some of the more general issues encountered within WP 1 are listed below:
Staffing
Some countries involved in WP 1 have experienced difficulties in retaining staff, in particular data
science specialists, who have been difficult to replace. The countries affected are Sweden, which
lost a key team member in May, and the UK, where three team members left during 2016. Germany also
lost a key subject matter expert. The experience is thus that it is difficult to recruit and retain
staff with strong data science skills.
Working across environments
For the most part, the experimental web scraping activities within this work package have been set
up in open environments, separate from existing production environments, which require higher
levels of security. This has created some practical difficulties when looking to combine data collected
in different environments, i.e. web scraped job vacancy data with survey or business register data.
If the web scraped data are moved into the secure environment, the range of data science
analytical tools is often limited. If the data are to be combined in a less secure environment,
additional processing of the secure data may be needed. This could include the removal of variables
and/or the creation of temporary unique identifiers, so that some processes (e.g. matching on
company name) can be done in a less secure environment using open source machine learning tools.
The processed data are then moved back into the secure environment to incorporate the secure
elements (e.g. the survey responses). This issue is not a show stopper, but it does underline that
the IT systems of NSIs are not yet geared up to combine and process data from different sources
using open source big data tools.
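The workflow with temporary unique identifiers can be sketched as follows. This is a hypothetical illustration, not a description of any partner's system: field names such as `company_name` and `reported_vacancies` are invented, and a real set-up would also need agreed rules on which variables may leave the secure environment.

```python
import secrets


def prepare_export(secure_records, export_fields):
    """Create rows that may leave the secure environment.

    Each row carries only the allowed fields plus a random temporary
    identifier; the lookup table stays in the secure environment.
    """
    export_rows = []
    secure_lookup = {}
    for record in secure_records:
        temp_id = secrets.token_hex(8)  # random, carries no information
        secure_lookup[temp_id] = record
        row = {field: record[field] for field in export_fields}
        row["temp_id"] = temp_id
        export_rows.append(row)
    return export_rows, secure_lookup


def rejoin(processed_rows, secure_lookup):
    """Back in the secure environment, attach the secure variables again."""
    combined = []
    for row in processed_rows:
        merged = dict(secure_lookup[row["temp_id"]])
        merged.update({k: v for k, v in row.items() if k != "temp_id"})
        combined.append(merged)
    return combined


# Hypothetical survey record held in the secure environment.
survey = [{"company_name": "Acme Ltd", "reported_vacancies": 3}]
export_rows, lookup = prepare_export(survey, export_fields=["company_name"])

# In the less secure environment, matching adds web scraped information.
processed = [dict(row, scraped_adverts=5) for row in export_rows]

print(rejoin(processed, lookup))
```

The exported rows never contain the survey responses, and the temporary identifiers are meaningless outside the secure environment because the lookup table is not exported.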
CEDEFOP
WP 1 has a good relationship with CEDEFOP and reached an earlier agreement to share data from
their pilot system. However, the analytical dataset lacked sufficient date and company information
that is needed to validate against survey data. CEDEFOP have now launched the next phase of the
web scraping project and so there is an opportunity to develop a deeper partnership with the Big
Data ESSnet. WP 1 participants met with CEDEFOP as part of a satellite meeting set up as part of the
ESSnet dissemination workshop in Sofia, in February 2017. This was followed by a partnership
agreement between Eurostat and CEDEFOP to help facilitate collaboration.
Work package expansion
For SGA-2, four new partners (France, Belgium, Denmark, Portugal) will join WP 1, bringing the total
number of partners to ten. While this creates the opportunity to spread knowledge more widely and to
tap into a wider pool of expertise, it also creates additional logistical challenges. Some of the
new partners joining during SGA-2 have already undertaken some work, so there will be opportunities
to share information. However, there is still a need to “push deep” and focus efforts on producing
the meaningful concrete results that are proving elusive.
2 Webscraping / Enterprise Characteristics
There were two main issues in carrying out WP 2 work:
Issue 1: Legal and ethical limitations in accessing and storing data scraped from enterprises’
websites.
Issue 2: Coordination of pilots.
Concerning issue 1, the countries most affected (as detailed in Deliverable 2.1) were the UK and
Sweden. The impact on the development of pilots implementing the use cases was as follows:
Sweden worked only on the Job Vacancy use case, especially as a contact point with WP 1.
The UK worked on selected use cases (1, 2, 3), focusing on the analysis phase rather than
on the web scraping phase (at massive scale).
Poland also has a legal landscape that is not completely defined, though this did not have a major
impact on the pilots.
Concerning issue 2, there was a problem of coordinating the different pilots with each other. In
this respect, it was decided that:
The ICT Survey can be considered as a benchmark, although there are differences at country level.
Volunteering countries can use software developed by other countries, if there are resources
for doing that. So far, BNSI volunteered to use Istat’s software.
A “logical architecture” for the pilots was shared, so that even if the pilots use different
software, it is possible to classify and compare the different solutions.
3 Smart Meters
The biggest issues the partners have faced during the project are related to getting access to data,
delays caused by the lack of knowledge and experience with big data, and project members
leaving/changing.
Data access
The most important issue has been getting access to data. In the beginning of the project Statistics
Estonia and Statistics Denmark had access to some data, Statistics Sweden was having discussions to
get the data and Statistics Austria did not plan to try to get access for different reasons. The aim was
that during the project Statistics Denmark would get access to newer and more detailed data and
Statistics Sweden would get access to some test data.
During 2016 Statistics Sweden had several meetings with the data owners, in which common concerns
and interests were identified on which to build further cooperation. Agreement was reached on
obtaining test data at the beginning of 2017.
Statistics Denmark has had access to aggregated annual consumption data. Access to detailed
electricity consumption data was achieved at the end of 2016, when a contract between the
datahub owner and the office was agreed. As access to the detailed data was achieved after the
report was delivered, it was decided not to update the data handling part of the report when the
new information became available. These changes will be made during the second part of the project.
Lack of skills
Although Statistics Estonia has had access to data from the beginning of the project, the office did
not have a suitable IT infrastructure to store and handle the large amount of data, nor did it have
the knowledge to install and set up such a system. To obtain a suitable infrastructure for storing
the data, several servers were rented from the office's IT partner using the resources foreseen by
this project. It took three months to get servers with a suitable configuration. However, a far
bigger problem was getting the new software installed and running, as the office had no competence
in this area. The ESTP courses on big data, attended by two specialists from Statistics Estonia,
were of some help. In the end, the system was set up with the help of an outside consultant, and
at the moment the most common analyses can be carried out on the data. It takes some time to learn
to use the new system, but the issue that still needs to be solved is acquiring the skills to
maintain and upgrade the system, so that more advanced tools can be added.
Personnel
Some delays in the work were caused by the change of personnel in the offices. Two members (one
from Statistics Sweden and one from Statistics Denmark) in this work package have been replaced by
new people during 2016.
4 AIS Data
Staffing issues
Very shortly after the first face-to-face meeting in May 2016, our team member from GUS (Dominik
Rozkrut) took up another job and left our work package. It took until the beginning of September
before new participants from GUS were available for this work package. One of the participants
from CBS (Maarten Pouwels) also left the project in November 2016 because of other priorities in
his work. This was not a big issue, because Tessa de Wit, who had been involved in the project
since the beginning of September 2016, could take over his tasks within a very short time.
Getting AIS data from EMSA
We have still not succeeded in getting free European AIS data from EMSA. We are working together
with Eurostat on this topic. Together with Eurostat we visited DG Move in Brussels in July 2017 to
explain why we need these data. The High Level Steering Group now has to decide whether EMSA's AIS
data will be made available to our work package. If we do not get access to the EMSA data we cannot
deliver deliverable 4.7 of SGA-2, but this would not have any consequences for the other
deliverables of this work package.
5 Mobile Phone Data
There is no issue of relevance to be mentioned during the execution of SGA-1 for WP 5.
Access to the UNECE Sandbox has been requested but not for data storing and data processing
purposes. The aim was to have a view of such a platform as a potential model to store and process
data in-situ within the MNOs' own premises.
Regarding the consultancy agreement with Positium, the technical assistance during the workshop in
Luxembourg was provided in accordance with the initial budget. Of the two reports requested, on
access to and processing of mobile phone data respectively, the first was produced and received in
time to be used extensively in the preparation of deliverable 1.2. The second report, concerning
methodology for the processing of data, was received in satisfactory form some time afterwards
and will play a relevant role in SGA-2. Furthermore, Positium was invited to the first physical
meeting of the WP 5 partners to discuss various issues regarding both the reports and the
access to mobile phone data from MNOs. This technical assistance was also part of the original
budget of SGA-1.
6 Early Estimates
In the proposal for WP 6 we were very ambitious in the sense that we planned to carry out two
pilots which would give us "quick wins". Besides the pilot on nowcasting the turnover indices,
the WP 6 team also tried to work on a pilot whose aim was to test the possibility of estimating
sentiment indicators such as the Consumer Confidence Index (CCI) using data from social media
(Facebook, Twitter, ...). However, early findings showed that access to a sufficient amount of data
was impossible, so we abandoned this idea. Access to historical data was also a problem. The same
issue occurred during the implementation of the pilot on early economic indicators, where we did
not have access to time series of big data sources.
For SGA-2, further issues are the IT infrastructure and IT tools needed if access to big data sources
is obtained. In the SGA-1 period the WP 6 team focused more on “big data methods” than on big data
sources, because of the evident problems with access to those sources.
7 Multi Domains
There is no issue of relevance to be mentioned during the execution of SGA-1 for WP 7. Access to the
Sandbox was not necessary at this stage of work. For experimental purposes the team is using its
own server with dedicated software running on Apache Spark and Apache Hadoop. At the moment
the team is using the Python language for pilot surveys.
Annex: Communication and Dissemination
1 Summary
WP 9, Dissemination, focuses on the internal and external communication of the ESSnet Big Data.
The major activities in SGA-1 consisted of creating and maintaining a Mediawiki collaboration and
communication platform which also serves as public extranet, mirrored on the CROS Portal and
complemented by a restricted-access project website; supplying project participants with training,
support and assistance in using the collaboration and dissemination tools; and preparing the SGA-1
Dissemination Workshop on 23-24 February 2017 in Sofia (Bulgaria).
2 Introduction: targets, objectives and tasks
The groups and individuals targeted by WP 9 include, in order of importance:
- the 22 partners participating in the ESSnet Big Data, in particular the ones active in SGA-1;
- non-participating ESS NSIs as well as European and international organisations involved in big
  data initiatives;
- other organisations or individuals interested in official statistics based on big data, as well as
  the media and the general public.
The objectives of WP 9 tailored to these target audiences are:
- ensuring the exchange of information on tasks, planning, timing, processes, intermediate
  products and final results among pilot project participants;
- providing all necessary tools, datasets and training to project participants in an easily
  accessible way;
- disseminating results, as best practices, to the ESS and the wider community of official
  statistics, and in the appropriate formats to anyone else potentially interested.
The concrete tasks to achieve these objectives are:
- building a collaboration and communication platform for ESSnet partners to post or consult
  information, collaborate on outputs and comment or discuss processes and outputs;
- one or more external websites presenting project outputs, customised to the different
  categories of users (the target audiences identified above);
- training, support and assistance, either individually or via tutorials and documentation, on
  the use of the platform and its tools, the preparation of reports and deliverables (via
  templates) and the dissemination of results via the ESSnet Big Data websites or other
  channels such as articles, publications, presentations at conferences, …;
- a general dissemination workshop on results and lessons learned at the end of SGA-1.
3 Collaboration and communication platform/extranet
A Mediawiki website https://webgate.ec.europa.eu/fpfis/mwikis/essnetbigdata/ has been created
and is being expanded gradually and continuously.
It serves two functions:
- collaboration and communication platform for project participants: project information and
  backgrounds, contact information (later moved to restricted-access site), resources and
  tools;
- extranet, presenting outputs in a well-structured way to anyone outside the project:
  backgrounds, documentation, public reports and deliverables.
4 CROS Portal mirror site
On the CROS Portal a mirror site https://ec.europa.eu/eurostat/cros/content/essnetbigdata_en
using Confluence was created, providing access to the same content as the Mediawiki site (without
the ‘special’ project pages). This was achieved differently for pure navigation pages and content
pages: navigation pages at the two highest hierarchical levels were duplicated in the CROS Portal, but
the links to content pages in them lead to the Mediawiki content pages. This has the double
advantage of being manageable (navigation pages change infrequently) and maintaining only one
version of the ‘truth’.
5 Restricted-access project website
A restricted-access project website
https://webgate.ec.europa.eu/fpfis/wikis/pages/viewpage.action?spaceKey=EstatBigData&title=ESSnet+Big+Data
using Confluence was created to store all confidential project information (e.g.
personal or financial data). This site is linked to from the central Mediawiki site but persons not
specifically granted access cannot view its content.
6 Training, support and assistance
Project participants were given the necessary training, support and assistance to use the Mediawiki
collaboration and communication platform, either individually or via tutorials and documentation
created on the website (see
https://webgate.ec.europa.eu/fpfis/mwikis/essnetbigdata/index.php/Category:Tutorial) and via
templates for reports and deliverables. Assistance is also available for the dissemination of results via
the websites or other channels such as articles, publications or presentations at conferences.
7 Dissemination Workshop
The results of SGA-1 were presented at a Dissemination Workshop on 23-24 February 2017 in Sofia (BG),
with an attendance of about 80 persons (see
https://webgate.ec.europa.eu/fpfis/mwikis/essnetbigdata/images/d/db/Dissemination_Workshop_2017_02_23-24_Sofia_Minutes.pdf).