European Data
Grant agreement number: RI-283304
Deliverable D7.5.1 Technology adaptation and development framework
in EUDAT WP7.3
Authors Emanuel Dima, Christian Pagé, Yvonne Küstermann,
Reinhard Budich
Status Draft/Review/Approval/Final
Version v.1.0
Date November 21, 2013
Abstract:
This deliverable reports on the progress of the construction and integration of the Generic
Execution Framework (GEF), as well as of additional required tools and components. The focus is
on the lessons learned from the first stage of technology adaptation and construction. It describes
how existing EUDAT user technologies have been incorporated, including any necessary adaptations,
and outlines the expected behavior of the framework with respect to User Community needs. The
final report (D7.5.2) will describe and assess the EUDAT GEF, including any adaptations that were
necessary to accommodate the requirements of the user communities.
Document identifier: EUDAT-DEL-WP7-D7.5.1
Deliverable lead Christian Pagé
Related work package 7
Author(s) Emanuel Dima, Christian Pagé, Yvonne Küstermann, Reinhard
Budich
Contributor(s) Stéphane Coutin, Pascal Dugénie
Due date of deliverable 01/10/2013
Actual submission date 21/11/2013
Reviewed by Morris Riedel, Ari Lukkarinen
Approved by
Dissemination level PUBLIC
Website www.eudat.eu
Call FP7-INFRA-2011-1.2.2
Project number 283304
Instrument CP-CSA
Start date of project 01/10/2011
Duration 36 months
Disclaimer: The content of the document herein is the sole responsibility of the publishers and it does
not necessarily represent the views expressed by the European Commission or its services.
While the information contained in the document is believed to be accurate, the author(s) or any
other participant in the EUDAT Consortium make no warranty of any kind with regard to this material
including, but not limited to the implied warranties of merchantability and fitness for a particular purpose.
Neither the EUDAT Consortium nor any of its members, their officers, employees or agents shall
be responsible or liable in negligence or otherwise howsoever in respect of any inaccuracy or omission
herein.
Without derogating from the generality of the foregoing neither the EUDAT Consortium nor any of
its members, their officers, employees or agents shall be liable for any direct or indirect or consequential
loss or damage caused by or arising from any information advice or inaccuracy or omission herein.
A Catalog of Services available through the GEF will be offered to the users, through
an API and a user interface. The catalog will be generated automatically by querying
all the registered GEF endpoints for their available services. Each service's metadata will
contain a human-readable description, the locations (i.e. GEF endpoints/data centers) where
the service is available, the data types that the service accepts, and the details about the
other parameters required by the service.
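A service record in such a catalog might look like the following sketch. Every field name and value below is hypothetical and only illustrates the kinds of information listed above; it is not a finalized GEF schema:

```python
import json

# Hypothetical catalog entry for one GEF service. All names and values
# are illustrative assumptions, not part of the actual GEF interface.
service_record = {
    "name": "subset-netcdf",
    "description": "Extract a spatial/temporal subset from a NetCDF dataset",
    "locations": [  # GEF endpoints / data centers where the service runs
        "https://gef.example-datacenter-a.eu",
        "https://gef.example-datacenter-b.eu",
    ],
    "accepted_data_types": ["application/x-netcdf"],
    "parameters": {  # other parameters the service requires
        "bounding_box": "lon1,lon2,lat1,lat2",
        "time_range": "ISO-8601 interval",
    },
}

print(json.dumps(service_record, indent=2))
```

Serializing such records as JSON would let both the API and the user interface consume the same catalog data.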
Currently, the location of a dataset can be determined only indirectly and approximately,
from the server domain specified in the URL.
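For illustration, extracting this approximate location from a dataset URL amounts to parsing out the host name (the URL below is invented for the example):

```python
from urllib.parse import urlparse

# Hypothetical dataset URL; only the host part hints at the hosting data center.
url = "http://data.example-center.eu/thredds/fileServer/cmip5/tas_Amon.nc"
location_hint = urlparse(url).hostname
print(location_hint)  # data.example-center.eu
```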
At each GEF endpoint, the same functionality should be available via the OPTIONS
method of the HTTP protocol.
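As a self-contained sketch of this idea, the following example starts a stub endpoint and queries it with an HTTP OPTIONS request. The `/services` path and the JSON response layout are assumptions made for the illustration, not the actual GEF interface:

```python
import http.client
import json
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer

# Stand-in GEF endpoint: answers OPTIONS requests with its service listing.
class StubEndpoint(BaseHTTPRequestHandler):
    def do_OPTIONS(self):
        body = json.dumps({"services": ["subset", "regrid"]}).encode()
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, *args):  # keep the example's output quiet
        pass

def list_services(host, port):
    """Ask one endpoint, via HTTP OPTIONS, which services it offers."""
    conn = http.client.HTTPConnection(host, port)
    conn.request("OPTIONS", "/services")
    services = json.loads(conn.getresponse().read())["services"]
    conn.close()
    return services

server = HTTPServer(("127.0.0.1", 0), StubEndpoint)
threading.Thread(target=server.serve_forever, daemon=True).start()
services = list_services("127.0.0.1", server.server_address[1])
server.shutdown()
print(services)  # ['subset', 'regrid']
```

A catalog builder would run such a query against every registered endpoint and merge the answers.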
1.6 Related work
SHIWA (SHaring Interoperable Workflows for large-scale scientific simulations on Available
DCIs) is a FP7 project that aims to develop new technologies for workflow systems interop-
erability.3 The project provides an execution platform where the workflows can be executed
on various Distributed Computing Infrastructures (DCIs).
The SCAPE (SCAlable Preservation Environment) FP7 project is aiming to build a scal-
able platform for digital preservation.4 The preservation processes will be realized as data
pipelines and implemented as workflows expressed in the Taverna workflow system. SCAPE
will deploy large scale workflows and execute them on cloud infrastructures, also collecting
the provenance data produced during this process.
Many other research projects use various workflow systems for complex data analysis or
contribute to the workflow ecosystem in other ways. Contrail5 offers autonomic workflow
execution on cloud infrastructures. e-LICO6 provides services and tools to assist the user in
designing scientific workflows. Wf4Ever7 provides a management environment for Research
Objects, which it defines as comprising scientific workflows, the provenance data gathered
at execution, the interconnections between them and other resources and the related social
aspects.
Work on converting workflows from one workflow representation to another has been done
in the frame of the SCI-BUS8 project (conversion from the desktop based KNIME system
to the DCI based system gUSE9). A more general solution to the problem of workflow
translation was given in the frame of the SHIWA project by introducing an intermediate
workflow language, IWIR10.
3 http://www.shiwa-workflow.eu
4 http://www.scape-project.eu/
5 http://contrail-project.eu
6 http://www.e-lico.eu
7 http://www.wf4ever-project.org/
8 http://www.sci-bus.eu
9 L. de la Garza, J. Krüger, C. Schärfe, M. Röttig, S. Aiche, K. Reinert, and O. Kohlbacher, 2013. From the Desktop to the Grid: Conversion of KNIME Workflows to gUSE. http://ceur-ws.org/Vol-993/paper9.pdf
10 Kassian Plankensteiner, Johan Montagnat, and Radu Prodan. 2011. IWIR: a language enabling portability across grid workflow systems. In Proceedings of the 6th workshop on Workflows in support of large-scale science (WORKS '11). ACM, New York, NY, USA, 97-106. http://doi.acm.org/10.1145/2110497.2110509
All the steps of the processing should work in parallel, like a bash pipeline. The simplicity
of the GEF makes this scenario possible with an appropriate backend. A suitable project
for this case is Storm15, a free and open-source framework for processing massive streams
of data (already used by Twitter).
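The pipeline idea can be sketched with plain Python generators, where every stage consumes items as soon as the previous stage emits them. The stage names and sample data are invented for the illustration; a Storm topology would additionally distribute such stages over a cluster:

```python
# Each stage is a generator: it pulls items lazily from the previous stage,
# so all stages are "live" at once, like commands in a bash pipeline.
def read_records(lines):
    for line in lines:
        yield line.strip()

def parse(records):
    for rec in records:
        station, value = rec.split(",")
        yield station, float(value)

def keep_positive(pairs):
    for station, value in pairs:
        if value > 0:
            yield station, value

raw = ["ham,1.5", "bremen,-0.2", "kiel,3.0"]
result = list(keep_positive(parse(read_records(raw))))
print(result)  # [('ham', 1.5), ('kiel', 3.0)]
```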
The DoW also specifies that the GEF should translate any workflow into a common format
and execute it. The current prototype implementation accepts only some types of workflows
and executes them using their native engines, thus ensuring 100% compatibility. The existing
work on workflow translation could be used to provide a common enactment engine
for all the workflow systems.
3 User Interface
Meeting the needs of all users of a software system is a challenging process, as one can
conclude from the diversity of existing user interface technologies and design choices.
For example, in the climate community, the "ESGF based ENES data infrastructure provides
a rich set of different data access methods to meet the different user demands"16. There
are too many interfaces for downloading ESGF17 data to list them here, as a first glance
at this diversity on the data how-to page proves.
An attempt to categorize users could, for example, look like the following:
The expert user wants an efficient interface, no matter how complicated it is. This
user repeats very similar tasks many times. If the interface is not optimized, using it is
too tedious to allow concentrating on the relevant scientific tasks. A web interface is not
a good solution here: clicking through options takes too much time in comparison to a
command line interface, which allows the user to vary just one aspect of a request and then
execute the variation.
The novice or occasional user needs a rich interface, but not a complicated one with a
multitude of possibilities, which could be overwhelming.
Anybody else: people generally interested in the topic, or people following a link in a news-
paper article. They need an interface to the GEF with an extremely restricted selection
of workflows and options.
15 http://storm-project.net
16 Citation from https://verc.enes.org/help/how-to-./data-access, last access November 21, 2013.
17 ESGF means Earth System Grid Federation and is an international collaboration for a data infrastructure in climate
Accessing the data is possible via web interfaces, via scripting interfaces, e.g. in python24, or
via scripts called from the command line. This diversity can confuse: which is the right
method for downloading the data? To the authors' knowledge, there is no summary text
that describes all access methods. When asked, the support staff of the DKRZ, one of the
three access sites for the ENES data, point to this page25, but add the information
that it is also possible to get the data from CERA26 in case it is replicated there.
To check how well the current federation infrastructure is working, we have interviewed
members of the community. To illustrate how unnecessarily difficult the data download is,
here – to express it in Scrum terms27 – is one user story of a PhD student of the Max-
Planck-Fellowship Program at the Institute for Meteorology in Hamburg. She had a task in
mind (let us call it her scientific workflow) and knew which data she needed (data download
workflow). A first attempt using one of the web interfaces for the download did not work,
and using one of the download scripts failed due to lacking knowledge of how the
experiment names are encoded (internal knowledge of the file management). After finding the
right web interface and the right credentials, the download started, but was interrupted
because the quota was exceeded. The PhD student ended up solving the download task by
asking her supervisor. The supervisor downloaded the data for her, instead of telling her how
to do it (we can assume that explaining how to succeed in downloading would also have
been difficult). Thus, in the end, the download part of the workflow was accomplished via
"social engineering". A lot of knowledge played a role: knowing the right web portals, the
replication sites for faster download, a machine with fast network access to download to, and
directories with sufficient permissions and quota, to mention only a part of the required
knowledge. Choosing a PhD student for this user story was done on purpose: an experienced
user cannot show all the difficulties, because this type of user no longer notices them.
This user story illustrates how important even a mere download workflow is for the climate
community.
Another example of the required technical "download knowledge" is given on the CERA
page: "Jblob is a command-line based program for downloading data from the CERA
database. Please note, this program does not replace the graphical user interface. It is
mostly useful for people who know which data to download and for batch downloads."28.
When it comes to subsetting the data, to avoid transferring data which is not needed, there
is so far not even an interface to use.
The ESGF implements the required standards for distributed data, such as the Data
Reference Syntax (DRS)29, which specifies how the data files must be structured, as well
as the required metadata described by a Common Information Model (CIM)30 and a Controlled
Vocabulary (CV). This ensures uniformity among the data centers and the data sets.
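As an illustration of the kind of structure the DRS imposes, a CMIP5-style dataset path is assembled from a fixed sequence of components. The concrete values below are invented for the sketch; the DRS document cited above gives the authoritative component list:

```python
# CMIP5 DRS orders path components roughly as activity/product/institute/
# model/experiment/frequency/realm/MIP table/ensemble/variable.
# All concrete values here are made-up examples.
components = [
    "cmip5", "output1", "MPI-M", "MPI-ESM-LR", "historical",
    "mon", "atmos", "Amon", "r1i1p1", "tas",
]
drs_path = "/".join(components)
print(drs_path)
```

Because every data center lays its files out this way, a tool can locate a dataset from its metadata alone, without site-specific knowledge.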
Currently, there is a capability embedded in the ESGF software stack that enables the
extraction of spatial and temporal data subsets through the data query. However, the ENES
24 https://github.com/stephenpascoe/esgf-pyclient
25 https://verc.enes.org/help/how-to-./data-access
26 http://cera-www.dkrz.de
27 In Scrum, so-called user stories are used to guide implementation.
28 Emphasis added.
29 Taylor, K. E., Balaji, V., Hankin, S., Juckes, M., Lawrence, B., and Pascoe, S. (2010). CMIP5 Data Reference Syntax (DRS) and Controlled Vocabularies.
30 Guilyardi, E., Balaji, V., Callaghan, S., DeLuca, C., Devine, G., Denvil, S., Valcke, S. et al. (2011). The CMIP5 model and simulation documentation: a new standard for climate modelling metadata. CLIVAR Exchanges, 16(2), 42-46.
community has several data processing tools which can be used to apply complex data
processing to data subsets. But, as said before, these tools can only be used after the data
has been downloaded by the user to the user's own computer systems.
The data volumes used by the community's users are currently increasing rapidly. This
happens not only because there are more users, but also because of the increase in the data
volumes generated by the community, due, for example, to increased spatial resolution,
ensembles of simulations, and a larger number of experiments developed to enable the
community to answer more scientific questions.
This raises the question of whether there is a need for reserving bandwidth. Reserving
bandwidth over a normal internet connection is not possible with current network technologies
and requires network research. A possible approach is a software-defined network architecture.
This would mean that the user receives information from the GEF on how long the download
takes depending on when the user starts it. The scientist could then decide whether, and if
so when, to start the download. This is investigated in EUDAT in task WP7.2.
The scientist can concentrate on the semantics: on what data to use for which computation
to answer the scientific question. This means that the time for the data subsetting must
also be estimated. Whether the data just needs to be transferred, or subsetted prior to the
transfer, is handled under the hood of the GEF.
The metadata taskforce is implementing the search functionality, which returns a handle
(PID) for every request. This PID is subsequently used by the GEF. Currently, in ESGF, the
user has to search manually31. Having a search interface that returns a PID is, however, a
long-term aim; in the near future the request will return a URL, a DOI or a PID.
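Once a request returns a PID, resolving it is an HTTP lookup against a Handle System proxy such as hdl.handle.net. The sketch below only builds the resolver URL; the PID value is invented for the example:

```python
# The Handle System proxy at hdl.handle.net resolves PIDs over HTTP.
# The prefix/suffix used here are made up for illustration.
def resolver_url(pid):
    return "https://hdl.handle.net/" + pid

url = resolver_url("11022/0000-0000-0000-EXAMPLE")
print(url)
```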
5.1.2 Data Subsetting
Currently, subsetting the data is done after downloading it. This results in unnecessary data
transfer and thus bandwidth usage. In our workflow, the subsetting should be done directly
at the data centers, prior to the data transfer.
This approach is realistic since the subsetting is done via the cdo command suite32, which
is portable and thus easy to install at all the heterogeneous data centers. To accelerate a
cdo run, it would be possible to distribute it over a compute cluster via a Pig call (see
Section 2.3.4), if cdo were made map-reduce-capable. The current version of cdo does
support OpenMP; the map-reduce paradigm is not supported yet.
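For illustration, a subsetting step at a data center could shell out to cdo's `sellonlatbox` operator, which cuts a longitude/latitude box out of a file. The sketch below only assembles the command line without running it; the file names are placeholders, and actually executing the command would of course require cdo to be installed:

```python
# Build (but do not run) a cdo call that extracts a lon/lat box from a file.
# sellonlatbox is a real cdo operator; the file names are invented.
def subset_command(lon1, lon2, lat1, lat2, infile, outfile):
    box = "sellonlatbox,{},{},{},{}".format(lon1, lon2, lat1, lat2)
    return ["cdo", box, infile, outfile]

cmd = subset_command(-10, 30, 35, 70, "tas_Amon_full.nc", "tas_Amon_europe.nc")
print(" ".join(cmd))
```

Passing the result to `subprocess.run(cmd)` on a machine with cdo installed would perform the actual subsetting.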
5.1.3 Scientific Computing
For some research questions it makes sense to provide standard workflows where the user can
choose custom parameters.33 For more complex questions it is necessary that the scientist
can design the part of the workflow that follows the data retrieval.
For designing workflows it is possible to use cross-community tools which provide a GUI.
Examples are Kepler, Taverna and VisTrails. These three examples were all investigated
31 Some search capabilities are http://esgf.org/wiki/ESGF_Search_API, http://esgf.org/wiki/ESGF_Search_REST_API and https://github.com/stephenpascoe/esgf-pyclient.
32 https://code.zmaw.de/projects/cdo
33 Example workflows can be found at https://verc.enes.org/computing/workflows.