1 data analytics Workshop for official statistics Technical Report

Mar 26, 2021


data analytics Workshop for official statistics

Technical Report


Table of Contents

Introduction
  Workshop objective
  Report structure
Key conclusions and Recommendations from the Workshop
  Summary conclusions
  Recommendations
Annex 1 – Description of Workshop sessions
  1.1 Opening session
  1.2 General Sessions: Expert lectures
    1.2.1 Organisation of the general sessions
    1.2.2 Expert lecture: Use cases and best practices in data analytics
    1.2.3 Expert lecture: Trends in data analytics architectures
    1.2.3 Expert lecture: Future developments in data analytics and data science
    1.2.4 Expert lecture: Engaging users and policymakers for data analytics
  1.3 General session – Mapping of existing data analytics initiatives in the ESS as well as in comparable sectors
    1.3.1 Organisation of the session
    1.3.2 Expert lecture: The Big Data Platform initiative of the EC Joint Research Centre
    1.3.3 Expert lecture: The Logical Statistical Data warehouse of the Centre of Excellence on Data Warehouse
    1.3.4 Expert lecture: Methods and algorithms in the UN Global Platform
  1.4 Parallel group session 1 – Data analytics in practice - real examples from the ESS
    1.4.1 Objective of the session
    1.4.2 Summary of the session
  1.5 Parallel group session 2 – Statistical models and methods for data analytics
    1.5.1 Objective of the session
    1.5.2 Summary of the session
  1.6 Parallel group session 3 – Engaging with external stakeholders
    1.6.1 Objective of the session
    1.6.2 Summary of the session
  1.7 Parallel group session 4 – What are the success factors for data analytics?
    1.7.1 Objective of the session
    1.7.2 Summary of the session
  1.8 Parallel group session 5 – Emerging data analytics tools and techniques
    1.8.1 Objective of the session
    1.8.2 Summary of the session
  1.9 Parallel group session 6 – Facilitating the data analytics of others
    1.9.1 Objective of the session
    1.9.2 Summary of the session
  1.10 Parallel group session 7 – Joint ESS data analysis undertakings
    1.10.1 Objective of the session
    1.10.2 Summary of the session
  1.11 Parallel group session 8 – Infrastructure and architecture as a solid basis for data analytics
    1.11.1 Objective of the session
    1.11.2 Summary of the session
  1.12 Parallel group session 9 – Data analytics in a changing environment
    1.12.1 Objective of the session
    1.12.2 Summary of the session
Annex 2 – Programme of the Workshop
  Day 1 (10 September 2018)
  Day 2 (11 September 2018)
  Day 3 (12 September 2018)
Workshop programme flyer

Introduction

Inspired by the outcomes of a series of thematic workshops dedicated to the modernisation of official statistics [1], the "data analytics Workshop for official statistics" (daWos) was a two-day event organised by Eurostat and held in Amsterdam (10-11 September 2018). It aimed to address the challenges and opportunities in the area of data analytics for the National Statistical Institutes (NSIs), the European Statistical System (ESS) and the broader community of Official Statistics.

Data analytics (DA) is the process of examining data sets in order to draw conclusions about the information they contain, increasingly with the aid of specialised systems and software. Data analytics methodologies include exploratory data analysis (EDA), which aims to find patterns and relationships in data, and confirmatory data analysis (CDA), which applies statistical techniques to determine whether hypotheses about a data set are true or false. EDA is often compared to detective work, while CDA is akin to the work of a judge or jury during a court trial, a distinction first drawn by Tukey in his 1977 book on EDA [2].
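To make the EDA/CDA distinction concrete, here is a minimal sketch in Python using only the standard library. The two samples and their names (`region_a`, `region_b`) are invented for illustration, and the confirmatory step uses a two-sided permutation test, which is only one possible choice of test.

```python
import random
import statistics

# Hypothetical measurements for two groups (invented numbers, illustration only).
region_a = [12.1, 13.4, 11.8, 14.2, 12.9, 13.7, 12.5, 13.1]
region_b = [14.0, 15.2, 13.9, 15.8, 14.6, 15.1, 14.3, 15.5]


def explore(sample):
    """EDA step: summarise a sample to look for patterns worth testing."""
    return {
        "n": len(sample),
        "mean": statistics.mean(sample),
        "median": statistics.median(sample),
        "stdev": statistics.stdev(sample),
    }


def permutation_test(x, y, n_resamples=2000, seed=0):
    """CDA step: two-sided permutation test on the difference of means.

    Returns an approximate p-value for the null hypothesis that both
    samples come from the same distribution.
    """
    rng = random.Random(seed)
    observed = abs(statistics.mean(x) - statistics.mean(y))
    pooled = list(x) + list(y)
    hits = 0
    for _ in range(n_resamples):
        rng.shuffle(pooled)
        left, right = pooled[:len(x)], pooled[len(x):]
        if abs(statistics.mean(left) - statistics.mean(right)) >= observed:
            hits += 1
    # Add-one correction keeps the estimate strictly positive.
    return (hits + 1) / (n_resamples + 1)


summary_a = explore(region_a)
summary_b = explore(region_b)
p_value = permutation_test(region_a, region_b)
print(summary_a)
print(summary_b)
print("approximate p-value:", p_value)
```

The exploratory summaries suggest a hypothesis (the second group looks higher); the permutation test then checks whether that difference is plausibly due to chance.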

[1] These include, among others, the 2014 and 2016 ESS Big Data Workshops, the 2016 ESS Visualisation Workshop and the ESS Workshop on dissemination of Official Statistics as open data.

[2] J.W. Tukey (1977): Exploratory Data Analysis, Pearson.


More advanced types of DA, sometimes also referred to as data science [3], include data mining, which involves

sorting through large data sets to identify trends, patterns and relationships; predictive analytics, which seeks

to predict customer behaviour, equipment failures and other future events; and machine learning, an

artificial intelligence technique that uses automated algorithms to churn through data sets more quickly than

data scientists can do via conventional analytical modelling. Big data analytics applies data mining, predictive

analytics and machine learning tools to sets of big data that often contain unstructured and semi-structured

data. Text mining provides a means of analysing documents, emails and other text-based content.

DA applications involve more than just analysing data. Particularly on advanced analytics projects, much of

the required work takes place upfront, in collecting, integrating and preparing data and then developing,

testing and revising analytical models to ensure that they produce accurate results. Altogether, DA

technologies and techniques are widely used in commercial industries to enable organisations to make more-

informed business decisions and by scientists and researchers to verify or disprove scientific models, theories

and hypotheses.

Workshop objective

The approach to the daWos event was thematically focused. Beyond sharing actual experience and practice in DA (e.g., related to data access and sharing, applications and services, tools and techniques, architecture and technologies, or skills and know-how development) from different perspectives (e.g., in terms of methodology, technology, or policy), the event also aimed to build capabilities in this area so that data can be used in a better way, or at least to provide new insights for Official Statistics.

The workshop aimed at increasing awareness of DA; identifying emerging best practices in order to find synergies and options for joint development efforts; and presenting and sharing best practices, original ideas, new tools, and past (successful or failed) experiences. Overall, these objectives were meant to contribute to the broader objective of facilitating cooperation within the ESS in the development and implementation of common methodological solutions and practical applications, aligned with the ESS Vision 2020 flagship strategy.

To achieve all these objectives, the workshop included:

- Presentations from experts in different areas related to DA. These lectures aimed at exposing the audience to current trends and state-of-the-art developments in the field of DA – not necessarily connected to Official Statistics, and drawing not only on experiences in the ESS but also on other statistical systems and the private sector – so as to provide useful insights into DA and related issues.

- Parallel group sessions, the pillar of the workshop, devoted to the identification and exchange of relevant experiences in specific issues of DA, within and outside the ESS.

In this setting, participants could share the results achieved through existing DA activities, i.e., in-house use cases and projects. They were also invited to detail the mainstreaming of these activities, e.g., how they were implemented in practice and whether they were actually deployed in production. In this way, the needs and requirements regarding methodological development and technological investment could be addressed, together with any other issues encountered, for instance in terms of management, financing and human capital.

[3] D. Donoho (2017): 50 Years of Data Science, doi:10.1080/10618600.2017.1384734.


Report structure

To make it easier to read and use, this report first presents the key conclusions from the Workshop (next section).

The Workshop sessions are then described in Annex 1, summarising the presentations and discussions that

took place. Annex 2 includes the Workshop programme. All presentations are available on the dedicated

webpage on the CROS portal.

Key conclusions and Recommendations from the Workshop

This section recalls the main conclusions and recommendations arising from the parallel group and plenary sessions, which are presented in detail in Annex 1.

Summary conclusions

In recent years, new sources of data have emerged and, with them, the opportunity to deliver information about the impact of certain policies in various creative ways. Still, whereas the statistical community has made good progress on using these new sources, many questions and challenges remain when it comes to applying DA so as to extract the relevant information. Advanced, state-of-the-art DA tools and techniques are increasingly necessary to harness new data sources. The availability of brand new techniques and technologies makes a significant difference in leveraging today's accessible computing power to process large amounts and various types of data into relevant information through statistical analysis and modelling. By creating faster, more agile and more tailored insights into data and content, it is – sometimes prophetically – believed that they can help answer increasingly complex questions that were previously considered beyond reach, from description ("what has happened?"), through diagnosis ("why did it happen?") and prediction ("what will happen?"), to prescription ("what should we do?").
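The four types of question can be illustrated on a toy data set. The sketch below is plain Python with invented numbers; the variable names (`spending`, `outcome`), the target value and the candidate list are assumptions made for the example, and the simple linear model stands in for whatever method an NSI would actually use.

```python
# Hypothetical yearly observations (invented): programme spending vs. an outcome indicator.
spending = [1.0, 2.0, 3.0, 4.0, 5.0]
outcome = [2.1, 3.9, 6.2, 7.8, 10.1]

n = len(spending)
mean_x = sum(spending) / n
mean_y = sum(outcome) / n

# Description: "what has happened?"
described = {"mean_outcome": mean_y, "min": min(outcome), "max": max(outcome)}

# Diagnosis: "why did it happen?" A correlation is only a first hint,
# not a causal explanation.
sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(spending, outcome))
sxx = sum((x - mean_x) ** 2 for x in spending)
syy = sum((y - mean_y) ** 2 for y in outcome)
correlation = sxy / (sxx * syy) ** 0.5

# Prediction: "what will happen?" Ordinary least-squares line.
slope = sxy / sxx
intercept = mean_y - slope * mean_x


def predict(x):
    """Predicted outcome for a given spending level under the fitted line."""
    return intercept + slope * x


# Prescription: "what should we do?" Pick the smallest candidate spending
# level whose predicted outcome reaches a (hypothetical) target.
target = 12.0
candidates = [5.0, 5.5, 6.0, 6.5, 7.0]
recommended = next((c for c in candidates if predict(c) >= target), None)

print(described, correlation, slope, intercept, recommended)
```

Each step reuses the output of the previous one, which is why the later questions are usually considered harder than the earlier ones.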

Although the guiding principles of evidence-based policymaking do not seem to change (for example, the requirements in terms of transparency, privacy and ethics, quality and robustness, and timeliness), the rise of DA – together with new technologies and trends in sharing, handling, processing and analysing data – certainly calls for an upgrade of evidence-based policymaking practices in the NSIs. One of the main challenges facing NSIs is not only to build the data analytics capabilities required to harness the new data sources (e.g., machine learning, data mining, pattern recognition, natural language processing, etc.), but also to understand how these capabilities can be deployed to help satisfy the demand from policymakers and policy users for faster, more agile and more tailored insight. Beyond the need for new approaches and technologies, job profiles and skill sets also need to be adapted for an effective use of data analytics services. There seems to be a general understanding that new practices will require dynamic and innovative partnerships, including public-private ones. This also represents a further challenge for NSIs, since their role may decrease while that of independent data institutes might grow.

Furthermore, many users of Official Statistics are not willing to compromise at all on the quality and trustworthiness of information for the promise of some improvements in timeliness, frequency and granularity. In the context of a “post-truth” society, not only quality and trust, but also openness, reproducibility, reliability and sustainability of the statistical products and services are essential to ensure that evidence-based policymaking is transparent and defensible. All such considerations raise a number of significant issues that need to be addressed by NSIs before deciding whether to embrace DA in the actual and regular production of Official Statistics.


Recommendations

Following the various discussions that took place at the daWos event, including the sharing of best practices, practical experience and lessons learned by the participants, useful recommendations emerged to address DA from different angles:

- Applications and services: A catalogue identifying the different DA use cases implemented in the NSIs could be useful. The applications and services should be mapped against existing standards for comparability. This should contribute to the identification of emerging best practices, possible synergies and options for joint development efforts, addressing policy issues at cross-border level.

- Tools and techniques: A catalogue of the different DA software, algorithms and models that have been or are still used in the NSIs – including the wealth of assets made freely available on the market – could be established to provide new users with some guarantee regarding the quality of the tools and techniques. This software and these algorithms and models could be considered for further – possibly centralised – validation/certification by the NSIs and the community of Official Statistics. When engaging in new software development, NSIs should start small and quickly build software blocks/components that are modular, interoperable and reusable, so that they can be shared within the community and help enrich an ecosystem of DA tools and techniques. Prior to using black-box DA systems – e.g. based on artificial intelligence – the issues of interpretability and/or explicability to the final user of the product should be carefully considered. In general, the reproducibility of production workflows should also be addressed to ensure transparency and trust in the products.

- Architecture and technologies: The data and process architecture needed to support future DA capabilities is still under consideration. Besides traditional databases or new big data infrastructure, logical data warehouses are worth exploring since, through data virtualisation, they make it possible to deal with legacy infrastructure without huge investment. Curation of metadata and abstract knowledge is essential in that perspective. Virtualised containerised environments are also worth studying for serving applications.

- Governance: Existing (good) governance regarding data should be extended to (good) governance of methods and algorithms. Shared protocols towards a code of good practice for using software resources could be developed or, more simply, already existing protocols, e.g. best practices from the open source community, could be adopted. FAIR principles (findable, accessible, interoperable, and reusable) could be enforced for tools and techniques, and a good level of documentation should be adopted for methods and algorithms.

- Skills and expertise: DA is an exciting field that combines scientific inquiry, statistical knowledge, subject-matter expertise, and computer programming. The necessary adaptation of job profiles and skill sets to emerging roles for an effective use of DA services within the ESS needs to be addressed (e.g., through the design of appropriate training).


Annex 1 – Description of Workshop sessions

This section first describes the contents of the opening session, then the general sessions and finally the group sessions (followed by wrap-up sessions aiming at summarising the discussions that took place during the group sessions and highlighting the main topics, concerns and decisions of interest). Sessions were organised around specific topics selected by Eurostat as follows (detailed descriptions are given in the corresponding sections):

- Opening session

- General sessions: presentations on use cases and best practices in data analytics, trends in data analytics architectures, future developments in data analytics and data science, and engaging users and policymakers for data analytics. In addition, a general session comprising three expert presentations on other ongoing initiatives concluded the first day.

- Group session 1: Data analytics in practice - real examples from the ESS

- Group session 2: Statistical models and methods for data analytics

- Group session 3: Engaging with external stakeholders

- Group session 4: What are the success factors for data analytics?

- Group session 5: Emerging data analytics tools and techniques

- Group session 6: Facilitating the data analytics of others

- Group session 7: Joint ESS data analysis undertakings

- Group session 8: Infrastructure and architecture as a solid basis for data analytics

- Group session 9: Data analytics in a changing environment

1.1 Opening session

Participants were welcomed, on behalf of Eurostat, by Ms Martina Hahn, Head of Unit Methodology and Corporate Architecture in Eurostat. First, the participants were reminded that the event was organised to contribute to the goals defined in the ESS Vision 2020. The event allowed participants to discuss how well the ESS is equipped with the necessary DA skills and technology for Official Statistics, how much this really helps users and how efficient the investment by NSIs and Eurostat in this “new” activity is.

Ms Hahn stressed that the presence at the event of a mix of statisticians and IT experts, as well as of several international institutions (among others OECD and JRC), was a good sign of interest in this topic. The objectives of the event were also recalled:

- Identify and discuss the different use cases for DA by Official Statistics producers, in particular the impact and opportunities created by external trends and developments (e.g. new user demands and the emergence of a rich DA market).

- Take stock of the development of DA in the ESS and identify the main challenges ahead for fulfilling the use cases for DA, in particular the challenges related to data confidentiality.

- Discuss the data and process architecture needed to support these new use cases and the future DA capabilities in the NSIs and other ESS members, identifying in particular emerging best practices and possible synergies and options for joint development efforts.


1.2 General Sessions: Expert lectures

1.2.1 Organisation of the general sessions

The general sessions were organised as “expert lectures” delivered by four selected experts from different fields (NSIs, academia and consultancy), showing diverse viewpoints and covering general issues in order to give a fairly complete picture of the potential of DA. In addition, at the end of the second day, a plenary session on the existing DA initiatives in the ESS was organised. The sessions were followed by discussions with the participants. The four expert lectures are summarised below.

1.2.2 Expert lecture: Use cases and best practices in data analytics

Speaker: Mr Cédric Archambeau | Principal Applied Scientist, Amazon

This presentation covered a few of the data-driven approaches Amazon has adopted to make product recommendations to its customers. The speaker showed how the expertise built over the years in machine learning and data science helps Amazon scale in a wide range of other domains and create better customer experiences. The application of machine learning to problems in natural language processing, computer vision, search and forecasting was also discussed.

In the first part of the talk, an overview of a number of machine learning applications was given. The speaker explained how these applications fit within the Amazon ecosystem to address the challenges that are faced and how they help scale. While machine learning is routinely used in recommendation, fraud detection and ad allocation, it also plays a key role in devices such as the Kindle or the Echo, as well as in the automation of Kiva-enabled fulfilment centres, statistical machine translation and automated Fresh produce inspection. In the second part, the speaker discussed how machine learning is made more accessible within the company by, for example, automating the fine-tuning of machine learning and, in particular, deep learning algorithms. Applying complex predictive systems, such as machine learning-based systems, in the wild requires manually tuning and adjusting knobs, broadly referred to as system parameters or hyper-parameters. Black-box optimisation, and in particular Bayesian optimisation, provides a natural framework for addressing this problem by taking the human expert out of the fine-tuning loop. Bayesian optimisation was also briefly introduced in this context.
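As a rough illustration of the idea, the sketch below tunes a single hyper-parameter with a small Gaussian-process surrogate and a lower-confidence-bound rule, written in plain Python. It is not the system described in the talk: the objective `validation_loss`, the kernel length scale, the noise level and `kappa` are all assumptions chosen for the example, and the hyper-parameter is assumed to be rescaled to [0, 1].

```python
import math


def rbf(a, b, length=0.3):
    """Squared-exponential kernel between two scalars."""
    return math.exp(-0.5 * ((a - b) / length) ** 2)


def solve(matrix, rhs):
    """Solve matrix @ x = rhs by Gaussian elimination with partial pivoting."""
    n = len(rhs)
    aug = [row[:] + [rhs[i]] for i, row in enumerate(matrix)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= factor * aug[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        s = sum(aug[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (aug[i][n] - s) / aug[i][i]
    return x


def gp_posterior(xs, ys, x_new, noise=1e-4):
    """Posterior mean and standard deviation of a zero-mean GP at x_new."""
    gram = [[rbf(a, b) + (noise if i == j else 0.0) for j, b in enumerate(xs)]
            for i, a in enumerate(xs)]
    k_star = [rbf(a, x_new) for a in xs]
    alpha = solve(gram, ys)
    weights = solve(gram, k_star)
    mean = sum(k * a for k, a in zip(k_star, alpha))
    var = max(1.0 - sum(k * w for k, w in zip(k_star, weights)), 0.0)
    return mean, math.sqrt(var)


def bayes_opt(objective, n_iter=12, kappa=2.0):
    """Minimise objective on [0, 1]; each step evaluates the grid point
    with the lowest value of (posterior mean - kappa * posterior std)."""
    grid = [i / 100 for i in range(101)]
    xs = [0.0, 0.5, 1.0]                      # initial design
    ys = [objective(x) for x in xs]
    for _ in range(n_iter):
        def lcb(x):
            mean, std = gp_posterior(xs, ys, x)
            return mean - kappa * std
        x_next = min((x for x in grid if x not in xs), key=lcb)
        xs.append(x_next)
        ys.append(objective(x_next))
    best = min(range(len(xs)), key=lambda i: ys[i])
    return xs[best], ys[best]


def validation_loss(setting):
    """Hypothetical stand-in for an expensive training-and-validation run."""
    return (setting - 0.3) ** 2 + 0.1 * math.sin(8 * setting)


best_x, best_y = bayes_opt(validation_loss)
print("best setting:", best_x, "loss:", best_y)
```

The surrogate replaces the human expert's intuition about where to try next: points with a low predicted loss or a high uncertainty are evaluated first, so far fewer evaluations are needed than with an exhaustive grid.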

1.2.3 Expert lecture: Trends in data analytics architectures

Speaker: Mr Rick van der Lans | Founder of R20/Consultancy BV, ambassador of Kadenza

To deliver data in support of their business intelligence needs, most organisations have developed a classic data warehouse. New groups of business users with data science and data investigative needs have also developed the data lake, which is most often a standalone system with almost no relationship to the existing data warehouse. On top of this, some organisations have already started to develop a third data delivery system for delivering data to specific business users: the data marketplace. Again, this third data delivery system is being developed as an analytical island, not to mention the still-newer world of streaming analytics.

Developing all these data delivery systems independently is far from ideal. Development-wise, wheels are reinvented, resulting in low productivity, metadata replication, and inconsistencies across reports and analyses. It is crucial for organisations to somehow bring these systems together. One solution is to deploy a unified data delivery architecture based on data virtualization technology. Such an architecture can support a wide range of business users, from those demanding a highly agile environment such as the marketplace to those requiring governable and auditable reports. This session discussed how these environments can be merged into a unified architecture.
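The core idea of data virtualization can be sketched in a few lines of Python: a logical view exposes several physical stores through one interface and reads them on demand instead of copying them. Everything below is hypothetical (the `VirtualTable` class, the two source adapters and the sample rows are invented for the example); real data virtualization products add query optimisation, caching, security and metadata management on top.

```python
import csv
import io

# Two hypothetical, independently built data stores (invented contents).
WAREHOUSE_ROWS = [
    {"region": "NL", "year": 2017, "population": 17.1},
    {"region": "BE", "year": 2017, "population": 11.4},
]
DATA_LAKE_CSV = "region,year,population\nLU,2017,0.6\nNL,2016,17.0\n"


def read_warehouse():
    """Source adapter: yields rows from the structured warehouse."""
    yield from WAREHOUSE_ROWS


def read_data_lake():
    """Source adapter: parses raw CSV on demand and normalises the types."""
    for row in csv.DictReader(io.StringIO(DATA_LAKE_CSV)):
        yield {
            "region": row["region"],
            "year": int(row["year"]),
            "population": float(row["population"]),
        }


class VirtualTable:
    """A logical view over several sources; no data is copied up front."""

    def __init__(self, *sources):
        self._sources = sources

    def select(self, predicate=lambda row: True):
        """Lazily stream matching rows from every underlying source."""
        for source in self._sources:
            for row in source():
                if predicate(row):
                    yield row


population = VirtualTable(read_warehouse, read_data_lake)
regions_2017 = sorted(
    row["region"] for row in population.select(lambda r: r["year"] == 2017)
)
print(regions_2017)
```

Consumers query `population` without knowing which rows live in the warehouse and which in the data lake, which is what allows legacy systems to stay in place behind the unified layer.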


Figure 1 – Data virtualization overview.

1.2.3 Expert lecture: Future developments in data analytics and data science

Speaker: Mr Fernando Perez-Cruz | Chief Data Scientist at the Swiss Data Science Center

Implicit generative modelling has recently scratched the surface of how deep learning can be used as a universal simulator. Until recently, deep learning has been used quite successfully to solve long-standing discriminative problems in computer vision, speech and natural language processing, basically showing that hand-coded, human-engineered features are suboptimal in the presence of (i) complex problems in which humans have only a basic understanding of the variability of the data, and (ii) large labelled data sets.

Figure 2 – Generative Adversarial Networks (GANs) in the bigger AI “big picture”. Flowcharts show how the different parts of an AI system relate to each other within different AI disciplines; shaded boxes indicate components that are able to learn from data.

Recently, Variational Auto-Encoders and Generative Adversarial Networks (GANs) have shown that the same representation learning can be used for generative modelling. These implicit generative models do not provide an interpretable model for the available data, but a universal simulator that is able to generate data similar to those used for training. These tools can be used to simplify complex simulations (e.g. climate models) or to work with limited observations (e.g. cosmology or particle physics), opening the door to Artificial Intelligence powered advances in many different fields of science. In this talk, the speaker first presented the general approaches and methods as well as their potential use and current shortcomings. In the second part of the talk, the speaker described a recent application of GANs for password guessing. This is an ideal application for understanding the need for GANs, why they work and what their limitations are.

1.2.4 Expert lecture: Engaging users and policymakers for data analytics

Speaker: Mr Franco Accordino | DG CONNECT

There is a growing need to improve forward thinking in policymaking practices and to enable citizens and

policy makers to co-create ideas and share evidence in order to feed policy reflections, and eventually

inform policy decisions. New policies are often thought up on the basis of current trends rather than by

capturing future opportunities offered, for instance, by long-term advances in science and technology.

The need to focus on short-term measures often prevents governments and businesses from orientating

their policy choices towards future possibilities, partly because they have been elected to come up with

tangible responses to current challenges that matter to citizens and partly because long-term investment

decisions may be too risky. This may make it difficult to put in place sustainable solutions to structural

problems.

Figure 3 – Building blocks for users', policymakers' and citizens' engagement.

The challenges can be articulated along two main axes, highlighting typical tensions between different

policymaking mindsets: (i) evidence about the status of the real world vs. inspiration from longer-term

thinking; (ii) delegated leadership vs. participatory leadership. Policy Making 3.0 is a participatory and

evidence-based model designed to provide an answer to the above challenges. The essential elements

of the Policy Making 3.0 process are the following:

1. The implementation of policies co-developed by policymakers and stakeholders has an impact

on the real world (individuals, society, economy, environment etc.).

2. The real world is monitored and data are gathered, measured and analysed through knowledge

mining and statistical tools, which makes it possible to identify trends, issues and challenges and

to elicit scientific evidence.

3. The scientific evidence provides information that stakeholders and policymakers can use to

reshape policies.

4. Stakeholders and policymakers interact in social networks where factors other than

evidence emerge, such as personal opinions, corporate interests, lobbying, ideological values and

other ‘non-measurable’ factors (i.e. that cannot be easily sensed and automatically captured).


Such factors often prevail over the scientific evidence. There are also boundary constraints that

come in the form of values and laws (e.g. constitutional rules).

5. Policies may also be inspired by desirable visions and aspirations that are not necessarily in line

with current, short-term trends and can also be considered as part of the ‘emotional’ and

intuitive factors that influence decisions.

The Policy Making 3.0 model is implemented by Futurium, an online laboratory set up to co-develop

futures and policy ideas. This laboratory combines the informal nature of social networks with a

methodological approach of foresight to engage stakeholders in the co-creation of the futures that they

all want. The architecture consists of the following components: front-end participatory tools, knowledge

harvesting tools for both policymakers and stakeholders, data-crawling tools to extract knowledge from

popular social networks and embed it into the Futurium, and data-gathering tools to fetch real-world data

and input it into the Futurium.

DORIS (Data Oriented Services) is a pool of services enabling policy and support departments to do their daily job in an easy and cost-effective manner. DORIS is largely based on algorithms and data from various sources and providers, meaningfully integrated and made accessible through tailored interfaces (dashboards). It is customisable to specific use cases and is available as a general-purpose service (drive-in) via a web app or an Application Programming Interface (API).

Figure 4 – Example of data service: DORIS.

1.3 General session – Mapping of existing data analytics initiatives in the ESS as well as in comparable sectors

1.3.1 Organisation of the session

The general session at the end of the first day was devoted to mapping the existing DA initiatives in the ESS as well as in comparable sectors. The presentations of the three experts discussed:

- The Big Data Platform initiative of the EC Joint Research Centre;

- The Logical Statistical Data warehouse of the Centre of Excellence on Data Warehouse;

- Methods and algorithms in the UN Global Platform.

1.3.2 Expert lecture: The Big Data Platform initiative of the EC Joint Research Centre

Speaker: Mr Pierre Soille | EC – DG JRC

The project background is the explosion of digital data sources that led to the big data paradigm (Volume, Velocity, and Variety of data streams), with Earth observation (EO) entering the big data era thanks to the Copernicus Sentinel satellites (full, free, and open data). Indeed, the increasing amount of free and open geospatial

data of interest to major societal questions calls for the development of innovative data-intensive

computing platforms for the efficient and effective extraction of information from these data. Following this,

the big data task force of the Joint Research Centre (DG JRC) of the European Commission recommended

in late 2014 starting a big data pilot project on EO and Social Sensing. The development of the JRC Earth

Observation Data and Processing Platform (JEODPP) started in 2016.

The JEODPP platform is versatile in the sense that it accommodates different service levels to satisfy the

needs of a variety of users: batch processing, provision of legacy environments, and interactive

visualization and processing. All services are accessed through a web browser so that no dedicated client

software needs to be installed on the devices accessing the platform. A simplified representation of the

JEODPP architecture is shown in the figure below in the form of a three-layer stack with the resources

layer at its base, followed by the service layer, and the client layer at its top. The platform already

supports a variety of projects serving policy areas in agriculture, forestry, environment, disaster risk

management, development, health, and energy.

Figure 5 – JEODPP platform architecture: simplified view with its main layers and components.

The exponential growth of data and data sources is a matter of fact. The big data paradigm is permeating all fields. FAIR data principles also apply to data analysis. The challenge of turning data into insights is eased by platforms where data are co-located with processing. Jupyter notebooks contribute to reproducible analysis as well as knowledge sharing and collaborative working. The importance of interactive analysis and visualisation should be taken into account. Open standards, including open APIs, are needed to avoid platform lock-in.

The project evolves into Big Data Analytics (2019-2020), with innovative approaches (AI/machine learning) for combining large amounts of data originating from different sources. The project is supported by the JEODPP, and the initial focus will be on geospatial data and their combination with other data sources. It will be the key enabler of data and knowledge sharing across the JRC and towards partners, and the link with Copernicus Data and Information Access Services.


1.3.3 Expert lecture: The Logical Statistical Data warehouse of the Centre of Excellence on Data Warehouse

Speaker: Ms Sonia Quaresma | INE, PT

A Statistical Data Warehouse (S-DWH) can be defined as a single corporate Data Warehouse fully based on metadata. An S-DWH is specialised in supporting the production of multiple-purpose statistical information. With an S-DWH, different aggregate data on different topics should not be produced independently from each other but as integrated parts of a comprehensive information system where statistical concepts, micro data, macro data and infrastructures are shared. The Information Systems connect the business to the infrastructure; this is represented by a conceptual organization of the effective S-DWH, which is able to support tactical demands.

In the layered architecture, in terms of data system, different assets are identified:

- the staging data are usually of a temporary nature, and their contents can be erased, or archived, after the DW has been loaded successfully;

- the operational data are held in a database designed to integrate data from multiple sources for additional operations on the data. The data are then passed back to operational systems for further operations and to the data warehouse for reporting;

- the Data Warehouse is the central repository of data, which is created by integrating data from one or more disparate sources and stores current as well as historical data;

- data marts are kept in the access layer and are used to get data out to the users. Data marts are derived from the primary information of a data warehouse, and are usually oriented to specific business lines.

Therefore, micro data, macro data and infrastructure are shared.
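
The flow of data through the four layers can be illustrated with a short Python sketch. It is a minimal illustration only: the record fields, the function names and the in-memory lists standing in for each layer are assumptions made for the example and are not part of the S-DWH definition.

```python
from collections import defaultdict

# Hypothetical raw records from two sources (field names are assumptions).
survey = [{"unit": "A", "turnover": "100"}, {"unit": "B", "turnover": "250"}]
register = [{"unit": "A", "sector": "retail"}, {"unit": "B", "sector": "retail"}]


def load_staging(*sources):
    """Staging layer: temporary copies of the raw inputs, kept per source."""
    return [list(src) for src in sources]


def integrate(staging):
    """Operational layer: integrate records from multiple sources by unit."""
    merged = defaultdict(dict)
    for source in staging:
        for record in source:
            merged[record["unit"]].update(record)
    return list(merged.values())


def load_warehouse(warehouse, integrated, period):
    """Warehouse layer: append typed records, keeping earlier periods."""
    for record in integrated:
        row = {**record, "turnover": float(record["turnover"]), "period": period}
        warehouse.append(row)
    return warehouse


def build_data_mart(warehouse, period):
    """Access layer: a data mart with turnover by sector for one period."""
    mart = defaultdict(float)
    for row in warehouse:
        if row["period"] == period:
            mart[row["sector"]] += row["turnover"]
    return dict(mart)


staging = load_staging(survey, register)
warehouse = load_warehouse([], integrate(staging), period="2018")
staging.clear()  # staging contents can be erased once the warehouse is loaded
mart = build_data_mart(warehouse, period="2018")
```

Each function only reads from the layer below it, so a data mart can be rebuilt at any time from the warehouse without going back to the sources.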

The management of the metadata used and produced in all the different layers of the warehouse is specifically defined in the Metadata framework and the Micro data linking. Metadata are used for description, identification and retrieval of information, and they link the various layers of the S-DWH through the mapping of different metadata description schemes. The metadata cover all statistical actions, all classifiers that are in use, input and output variables, selected data sources, descriptions of output tables, questionnaires and so on. All these meta-objects are collected during the design phase into one metadata repository. This configures a metadata-driven system that is also well suited to supporting the management of actions or IT modules in generic workflows.
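
As a rough illustration of what "metadata-driven" can mean in practice, the sketch below lets a small metadata repository decide both which variables are expected and which processing steps are run. The repository layout, the variable names and the step names are all hypothetical and chosen only for this example.

```python
# Hypothetical metadata repository: every name below is an assumption.
METADATA = {
    "variables": {
        "turnover": {"type": float, "unit": "EUR", "role": "input"},
        "employees": {"type": int, "unit": "persons", "role": "input"},
    },
    "workflow": ["validate", "convert"],
}


def validate(record, metadata):
    """Reject records that lack a variable declared in the metadata."""
    missing = [name for name in metadata["variables"] if name not in record]
    if missing:
        raise ValueError(f"missing variables: {missing}")
    return record


def convert(record, metadata):
    """Cast each declared variable to the type recorded in the metadata."""
    return {
        name: spec["type"](record[name])
        for name, spec in metadata["variables"].items()
    }


STEPS = {"validate": validate, "convert": convert}


def run_workflow(record, metadata):
    """Generic workflow: the steps to run are read from the metadata."""
    for step_name in metadata["workflow"]:
        record = STEPS[step_name](record, metadata)
    return record


clean = run_workflow({"turnover": "1250.5", "employees": "12"}, METADATA)
```

Adding a variable or reordering the steps then only requires a change to the metadata, not to the processing code.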

A distributed computing platform leads to the Logical Statistical Data warehouse (LSDW) for the future. An LSDW means adding Semantic Data Abstraction. The abstraction layer allows a conceptual generalization of the sources and of the integration outputs. Being based on a semantic middleware that supports context integration, it describes each data taxonomy and relates it to each use-case ontology in place. Moreover, a Virtualization Interface allows access to all corporate data in different contexts, as well as the creation of new contexts.
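
The idea of a virtualization interface over a semantic abstraction layer can be sketched as follows. This is a toy model under stated assumptions: the class, its method names, the concept names and the two sources are invented for illustration and do not describe the actual LSDW implementation.

```python
class VirtualizationInterface:
    """Resolves queries on shared concepts against heterogeneous sources."""

    def __init__(self):
        self.sources = {}   # source name -> list of records
        self.mappings = {}  # source name -> {concept: local field name}
        self.contexts = {}  # context name -> list of concepts

    def register_source(self, name, records, mapping):
        """Declare a source and how its fields map to shared concepts."""
        self.sources[name] = records
        self.mappings[name] = mapping

    def create_context(self, name, concepts):
        """Define a new context as a selection of shared concepts."""
        self.contexts[name] = list(concepts)

    def query(self, context):
        """Return records expressed in the concepts of the given context."""
        concepts = self.contexts[context]
        rows = []
        for name, records in self.sources.items():
            mapping = self.mappings[name]
            if not all(c in mapping for c in concepts):
                continue  # this source cannot serve the context
            for record in records:
                rows.append({c: record[mapping[c]] for c in concepts})
        return rows


# Hypothetical sources with different local field names.
lsdw = VirtualizationInterface()
lsdw.register_source(
    "survey", [{"firm_id": 1, "empl": 12}],
    {"unit": "firm_id", "employment": "empl"},
)
lsdw.register_source(
    "register", [{"id": 2, "staff": 40, "nace": "G"}],
    {"unit": "id", "employment": "staff", "activity": "nace"},
)
lsdw.create_context("labour", ["unit", "employment"])
lsdw.create_context("sectoral", ["unit", "activity"])
labour = lsdw.query("labour")
sectoral = lsdw.query("sectoral")
```

No data are copied into a central store here: each query is answered from the sources at request time, and a new context is just a new selection of concepts.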

Figure 6 – Logical Statistical Data Warehouse.


1.3.4 Expert lecture: Methods and algorithms in the UN Global Platform

Speaker: Mr Joni Karanka | ONS, UK

The UN Global Working Group (GWG) on Big Data for official statistics was created in 2014 by the UN

Statistical Commission to explore the benefits and challenges of the use of new data sources and

technologies for official statistics and SDG indicators. The GWG addresses issues pertaining to

methodology, quality, technology, data access, legislation, privacy, management and finance, and

provides adequate cost-benefit analyses. The main assets of statistical offices are data and algorithms

for exploration and data science.

The UN Global Platform is envisaged as a marketplace for sharing and developing core catalogues of

services, data, metadata, methods, APIs, information technology tools and training materials. The global

network operates as a federated network of platforms at the national, regional and global levels, which

ensures interoperability and information-sharing among the platforms in the network through agreed

and defined interfaces. The platform is intended for use by its trusted partners and should meet the

requirements for research and development in the use of multisource data. Transparent partnership

agreements will need to be developed with private- and public-sector organizations so that the platform

partners contribute and derive value through a business model that is individually sustainable for all

stakeholders. The platform offers technology infrastructure and constitutes a network for data

innovation to facilitate global collaboration of the official statistical community.

The presentation discussed algorithms in official statistics, methods as assets, and the methods service of the UN Global Working Group on Big Data, which uses Algorithmia.

1.4 Parallel group session 1 – Data analytics in practice - real examples from the ESS

Chair: Mr Mervyn Ó'Lúing | Central Statistics Office (IE)

1.4.1 Objective of the session

While intuitively understandable, the concept of DA can mean different things to different stakeholders.

At this session, examples from ESS members illustrated what DA could actually achieve for a

statistical institute. A discussion on the lessons learned from these examples was also initiated.

1.4.2 Summary of the session

Different use cases were discussed, e.g. some prototypes/pilots and experimental products. DA may be truly disrupting the way NSIs process data (e.g. from data collection to decision-making, through indicator estimation). However, there are many limitations to adopting and running those pilots in production, and those were discussed. It was stressed that, prior to adopting a "full operational stack" in production, DA tools/methods (or parts of them), whose selection is often ad hoc, need to be made truly reusable and applicable. Beyond development and deployment, the operational integration of DA is challenged by cultural/corporate considerations, since DA is often seen by production units as "yet another tool" with little demonstrated added value.

Figure 7 – Technology components and main features for methods and algorithms sharing.

1.5 Parallel group session 2 – Statistical models and methods for data analytics

Chair: Susie Fortier | StatCan (CA)

1.5.1 Objective of the session

Advanced methods (e.g. model-based estimation, multivariate methods, forecasting/nowcasting and

microsimulation) are already being used by statistical institutes - sometimes in regular production, other

times in supporting processes or in analytical projects.

At this session, the use of advanced statistical methods in DA for official statistics was discussed. The

session also included a presentation from Statistics Iceland on rare events of time series (outliers and extreme

values) and models for population projections taking into account fertility, mortality and immigration.

1.5.2 Summary of the session

The presentation introduced the probabilistic methods used for the demographic projections of the Icelandic population. The key points made during the presentation concerned the description of the data and the formulation of the mathematical problem, recalling the definition of rare events in time series (outliers and extreme values) and introducing methods and models such as dynamical (ARDL) models (short term) and ARIMA/exponential smoothing models (long term). Finally, functional data modelling with time series coefficient functions of orthonormal function expansions was presented.
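
Two of the ingredients named above, outlier detection and exponential smoothing, can be illustrated with a minimal sketch. It does not reproduce the models of the presentation: the outlier rule used here (a modified z-score based on the median absolute deviation), the plain form of simple exponential smoothing, the parameter values and the numbers in the series are all assumptions chosen for illustration.

```python
from statistics import median


def flag_outliers(series, threshold=3.5):
    """Flag points whose modified z-score (based on the MAD) is large."""
    med = median(series)
    mad = median(abs(x - med) for x in series)
    if mad == 0:
        return [False] * len(series)
    return [abs(0.6745 * (x - med) / mad) > threshold for x in series]


def exponential_smoothing(series, alpha=0.5):
    """Simple exponential smoothing; returns the one-step-ahead forecast."""
    level = series[0]
    for x in series[1:]:
        level = alpha * x + (1 - alpha) * level
    return level


# Hypothetical yearly counts with one implausible value.
counts = [4100, 4050, 4150, 4080, 9000, 4120]
flags = flag_outliers(counts)
cleaned = [x for x, is_outlier in zip(counts, flags) if not is_outlier]
forecast = exponential_smoothing(cleaned, alpha=0.5)
```

Removing flagged points before smoothing is only one possible treatment; an outlier may also be a genuine rare event that the projection should take into account.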

1.6 Parallel group session 3 – Engaging with external stakeholders

Chair: Laust Hvas Mortensen | Statistics Denmark (DK)

1.6.1 Objective of the session

External users of DA may come from many different stakeholder groups (policymakers, policy analysts,

NGOs, journalists, researchers etc.). Moreover, requests for DA may arrive at a central contact point of a

statistical institute, or directly at the entity concerned. This raises the question of how to ensure an

equitable treatment of stakeholders, and a sound use of resources. At this session, different approaches

taken to the interaction with external DA stakeholders were discussed.

1.6.2 Summary of the session

Some of the topics discussed during this session are:

- how to communicate probability concepts, in particular uncertainty. The way we communicate

uncertainty and change for different types of statistics needs to be tailored to suit the audience. We

should offer users the opportunity to “zoom in” according to their needs and be able to find detailed

information when they require it;

- transparency (assumptions to produce official statistics). Good statistical practice is fundamentally

based on transparent assumptions, reproducible results, and valid interpretations. In some situations,

guideline principles may conflict, requiring individuals to prioritize principles according to context.

However, in all cases, stakeholders have an obligation to act in good faith, to act in a manner that is

consistent with these guidelines, and to encourage others to do the same. Above all, professionalism

in statistical practice presumes the goal of advancing knowledge while avoiding harm; using statistics

in pursuit of unethical ends is inherently unethical;

- presenting data using infographics. These types of graphics present complex information quickly and

clearly. Infographics are increasingly popular because they can provide a great amount of complex

information succinctly, using visually appealing elements that draw attention and facilitate retention;


- importance of confidentiality in maintaining trust in official statistics. With Big Data, data subjects may

be unaware they are generating data and what it can be used for, despite the efforts of the social

media platforms in this respect. One of the challenges is to manage the acceptance of data re-use and

data linkage, which would not necessarily be expected by data subjects;

- identify the users and provide the specific products. Understanding how our statistics are used is

essential to maximising the public value of official statistics and ensuring that users are able to make

sound and informed judgements from official statistics.
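
The first topic in the list above, tailoring the communication of uncertainty and letting users "zoom in", can be made concrete with a small sketch. The function name, the three detail levels and the use of a normal-approximation 95% interval are assumptions made for this example, not a recommended standard.

```python
from math import sqrt
from statistics import mean, stdev


def describe_estimate(sample, detail="headline"):
    """Describe a sample mean at increasing levels of detail."""
    estimate = mean(sample)
    std_error = stdev(sample) / sqrt(len(sample))
    # 1.96 is the normal-approximation multiplier for a 95% interval.
    low = estimate - 1.96 * std_error
    high = estimate + 1.96 * std_error
    if detail == "headline":
        return f"about {estimate:.0f}"
    if detail == "interval":
        return f"{estimate:.1f} (95% interval {low:.1f} to {high:.1f})"
    return (f"{estimate:.2f}, standard error {std_error:.2f}, "
            f"n = {len(sample)}, 95% interval [{low:.2f}, {high:.2f}]")


# Hypothetical sample values.
sample = [10, 12, 14]
headline = describe_estimate(sample)
interval = describe_estimate(sample, detail="interval")
full = describe_estimate(sample, detail="full")
```

The same estimate is reported three ways, so a general reader sees a rounded figure while an expert user can request the interval and the sample size.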

1.7 Parallel group session 4 – What are the success factors for data analytics?

Chair: Mervyn Ó'Lúing | Central Statistics Office (IE)

1.7.1 Objective of the session

Due to their non-traditional nature, DA initiatives might face obstacles during development as well as

during subsequent deployment.

At this session, based on practical experience, the different aspects (organisational and infrastructural as

well as technical) to ensure the success of DA initiatives in NSIs were explored.

1.7.2 Summary of the session

A presentation from Statistics Poland on what are not success factors for DA was discussed. The key question concerned the use of advanced DA and the issues surrounding the publication of the results. Two cases were described. The first case concerned human capital composite indicators and the risk of government authorities misinterpreting the data. The second case concerned a big data application and the MapReduce paradigm, which raise questions regarding the quality of new data sources.
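
For readers unfamiliar with the MapReduce paradigm mentioned in the second case, the sketch below shows its three phases (map, shuffle, reduce) in plain Python, with a simple quality counter for unusable records. It runs in a single process and is not the application presented at the session; the record layout and the region codes are hypothetical.

```python
from collections import defaultdict
from functools import reduce


def map_phase(records, mapper):
    """Apply the mapper to every record, yielding (key, value) pairs."""
    for record in records:
        yield from mapper(record)


def shuffle(pairs):
    """Group values by key, as the framework does between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups, reducer):
    """Combine the values of each key into one result."""
    return {key: reduce(reducer, values) for key, values in groups.items()}


# Hypothetical web-scraped records; some have no usable region code.
adverts = [{"region": "PL1"}, {"region": "PL2"}, {"region": None}, {"region": "PL1"}]


def mapper(advert):
    # Route unusable records to a quality counter instead of dropping them.
    yield (advert["region"] or "MISSING", 1)


counts = reduce_phase(shuffle(map_phase(adverts, mapper)), lambda a, b: a + b)
```

Keeping a "MISSING" key makes the share of unusable records visible in the output, which is one simple way of monitoring the quality of a new data source.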

1.8 Parallel group session 5 – Emerging data analytics tools and techniques

Chair: Joni Karanka | ONS (UK)

1.8.1 Objective of the session

Countless advanced tools for DA are being developed. Some of them have a huge potential, whereas

others appear to not be applicable to official statistics (for instance because of their "black box" or

proprietary nature) - and yet others look impressive, but are in reality just statistical methods relabelled

as "data science".

At this session, it was discussed how to actually cut through the hype to find and deploy the modern data

analysis tools and techniques that are genuinely useful for official statistics.

1.8.2 Summary of the session

The state-of-the-art and advanced DA tools and techniques already in use in the ESS (i.e. automatic data retrieval, web-scraping, data cleaning, natural language processing, machine learning, pattern recognition) and some examples (CBS-NL) were presented. Moreover, technical solutions (tools and software) to share data and enable users to analyse them, and the need for a common "playground" platform (e.g. like the UNECE sandbox), were discussed.

1.9 Parallel group session 6 – Facilitating the data analytics of others

Chair: Fabio Ricciato | Eurostat

1.9.1 Objective of the session

NSIs – regardless of whether they focus on providing complete DA services – might wish to provide

standardised components (data access, analytical tools) – either for internal use or for external users of

data. By providing small components rather than complete deliverables, a community of data analysts

could be fostered. At the same time, while the reputational risks are mitigated by not providing any


analyses, the mere act of providing certain analysis tools might be construed as non-objectivity. At this

session, the nature and the approach to provide and share data analysis components were discussed.

1.9.2 Summary of the session

NSIs should improve the accessibility of their data for analysts. The improvements involve further development of APIs, metadata and open data standards. Access to microdata for external data analysts should generally also be improved. Beyond data, NSIs should provide analytical tools that take into account the typology of the user (everyday users vs. expert users) and the cost of the engagement. Ultimately, DA services should be made available to all users. A collaborative/participative approach for users consists in sharing the source code. Computing/testing platforms help to further engage external users with methods and tools, but more resources are needed.

1.10 Parallel group session 7 – Joint ESS data analysis undertakings

Chair: Matyas Meszaros | Eurostat

1.10.1 Objective of the session

Many statistical institutes face similar data analysis requests. As for many other areas, there might be a

potential for ESS members to collaborate on certain aspects of DA: sharing tools, approaches and

methods. This is all the more so as both the ESS Vision 2020 and the ESS priorities beyond 2020 set out needs

that could be tackled by DA initiatives.

At this session, the potential for ESS collaboration on DA was explored.

1.10.2 Summary of the session

This session discussed some tangible actions and initiatives to launch for truly engaging NSIs in DA so as

to foster (or enhance when it already exists) cooperation. Examples from past collaborations run within

the ESS framework were regarded as best practices (e.g. JDemetra, VTL, μ-ARGUS and τ-ARGUS). Some of

the suggested solutions are: the definition of an ESS-wide catalogue of ongoing projects and their use, the

adoption of a common repository for sharing resources, the use of generic and concise documentation

with examples, the running of training courses/online courses at basic level, the creation of a

helpdesk/community support, the sharing of small building blocks with enabled multilingualism.

1.11 Parallel group session 8 – Infrastructure and architecture as a solid basis for data analytics

Chair: Sonia Quaresma | INE (PT)

1.11.1 Objective of the session

The support for systematic DA (beyond ad hoc one-shot exercises) could imply new architectural and

infrastructural requirements for a statistical institute. In some cases, this might be best tackled in a

general overhaul of legacy systems.

At this session, the landscape of present and future infrastructure to support DA was discussed.

1.11.2 Summary of the session

The session included a presentation on data architecture by ISTAT, which illustrated how the legacy architecture is taken into account and integrated into the new model (e.g. by adopting virtual layers within a logical warehouse). Existing Statistical Data warehouses (SDw) should be modernised if they are to stay relevant. The Logical Statistical Data warehouse (LSDw) is the next evolutionary step up from the SDw. New sources increase the complexity of IT components, which moves DWH architectures towards logical architectures. The Logical DWH is a new management architecture combining the strengths of traditional repository warehouses with alternative data management and access strategies. A Logical DWH is an evolution and augmentation of DWH practices, not a replacement. Data Virtualization enables the Logical DWH. The Logical Statistical Data Warehouse is a virtual central statistical data store based on logical layers for managing all available data of interest, improving the ability to: produce the necessary information, (re)use data to create new data/new outputs, perform DA, execute analysis, produce reports, and support dashboard tools.

Case Study: SBS-ICT by Web Mining. The case study focuses on the use of survey data as a ground truth to create a classification model enabling the prediction of variables of the Enterprises ICT Survey.
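
The principle of the case study, using survey answers as ground truth to train a classifier on web-mined text, can be sketched with a small multinomial naive Bayes model. The choice of naive Bayes, the example texts and the target variable (whether an enterprise sells online) are assumptions for illustration; the report does not state which model or variables were used.

```python
from collections import Counter, defaultdict
from math import log


def train(texts, labels):
    """Fit a multinomial naive Bayes model on labelled website texts."""
    word_counts = defaultdict(Counter)
    label_counts = Counter(labels)
    for text, label in zip(texts, labels):
        word_counts[label].update(text.lower().split())
    vocabulary = {w for counts in word_counts.values() for w in counts}
    return word_counts, label_counts, vocabulary


def predict(model, text):
    """Return the label with the highest posterior log-probability."""
    word_counts, label_counts, vocabulary = model
    total = sum(label_counts.values())
    scores = {}
    for label, n_docs in label_counts.items():
        n_words = sum(word_counts[label].values())
        score = log(n_docs / total)
        for word in text.lower().split():
            if word in vocabulary:
                # Laplace smoothing avoids zero probabilities.
                count = word_counts[label][word]
                score += log((count + 1) / (n_words + len(vocabulary)))
        scores[label] = score
    return max(scores, key=scores.get)


# Hypothetical data: website text paired with a survey answer (ground truth)
# on whether the enterprise sells online.
texts = ["add to cart checkout payment", "about us contact address",
         "shop online cart delivery", "company history contact"]
labels = ["yes", "no", "yes", "no"]
model = train(texts, labels)
prediction = predict(model, "online shop with cart and checkout")
```

Once trained on the enterprises that answered the survey, such a model can be applied to the websites of enterprises outside the sample, which is what makes the survey useful as ground truth.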

1.12 Parallel group session 9 – Data analytics in a changing environment

Chair: Jacopo Grazzini | Eurostat

1.12.1 Objective of the session

New challenges (such as the General Data Protection Regulation) and opportunities (such as the Third

Data Package) are entering the DA scene. Further initiatives are also appearing on the horizon.

At this session, how to position DA in this changing environment was discussed.

1.12.2 Summary of the session

Though it is recognised that DA products are mature enough, the investment needed by NSIs to integrate these solutions in production was discussed. Beyond feasibility, the need for NSIs to adapt was also acknowledged. In acknowledging the potential of DA, it was also mentioned that transformations are needed for NSIs to adopt new roles/capabilities. The impact of algorithmic decision-making and how NSIs can communicate about it, e.g. to comply with the requirements for transparency, accountability and traceability, was further discussed. In particular, the need for humans to be involved in the decision-making process was stressed, since the adoption of DA methods and tools may offer little control. Beyond the question of the data themselves, ethical/legal issues preventing the adoption of DA in production, as well as possible ways to facilitate it, were raised.


Annex 2 – Programme of the Workshop

Day 1 (10 September 2018)

09:00–12:00 daWos satellite meeting: Implementing and showcasing data analytics

12:00–13:00 Welcome and registration of participants

13:00–14:00 lunch

14:00–14:30 Opening session – Ice breaker

14:30–15:30 Expert lecture on use cases and best practices in data analytics

Cédric Archambeau – Principal Applied Scientist, Amazon

15:30–16:30 Group sessions

- Data analytics in practice - real examples from the ESS

- Statistical models and methods for data analytics

- Engaging with external stakeholders

16:30–17:00 coffee break

17:00–17:30 Plenary wrap-up session reporting on group sessions I

17:30–19:00 Report and mapping of existing initiatives in the ESS and in comparable sectors

20:00 dinner

Day 2 (11 September 2018)

9:00–10:00 Expert lecture on trends in data analytics architectures

Rick van der Lans – Founder of R20/Consultancy BV

10:00–11:00 Group sessions

- What are the success factors for data analytics?

- Emerging data analytics tools and techniques

- Facilitating the data analytics of others

11:00–11:30 coffee break

11:30–12:00 Plenary wrap-up session reporting on group sessions II

12:00–13:00 Expert lecture on future developments in data analytics and data science

Fernando Perez-Cruz – Chief Data Scientist at the Swiss Data Science Center

13:00–14:00 lunch

14:00–15:00 Group sessions III

- Joint ESS data analysis undertakings

- Infrastructure and architecture as a solid basis for data analytics

- Data analytics in a changing environment


15:00–15:30 Plenary wrap-up session reporting on group sessions III

15:30–16:00 coffee break

16:00–17:00 Expert lecture on engaging users (and policymakers) for data analytics

Franco Accordino – Directorate-General for Communications Networks, Content and Technology (European Commission)

17:00–18:00 Closing session

Day 3 (12 September 2018)

09:00–13:00 daWos satellite meetings

- The geographical dimension of data analytics

- Balancing confidentiality and utility in data analytics

13:00–14:00 lunch

Workshop programme flyer