Curation in Natural Sciences
Z. Z. Vilakazi, iThemba LABS / UCT-CERN Research Centre
Jan 24, 2016
Acknowledgments
● Common effort of the ALICE and LCG Collaborations.
● Thanks to my colleagues of the ALICE-MUON Collaboration.
● Special thanks to Jean Cleymans, Bruce Becker, Artur Szostak, Gareth de Vaux, Sukalyan Chattopadhyay, Corrado Cicalo, Timm Steinbeck, Volker Lindenstruth, Heinz Tilsner, Florent Staley and others.
Topics for discussion
Management of large data sets
Inter-operability
Standards and protocols
Security and certification
Digital Curation
Maintenance of digital research data and other digital materials over their entire life-cycle, for current and future generations of users.
Processes of digital archiving and preservation
Also includes all the processes needed for good data creation and management, and the capacity to add value to data to generate new sources of information and knowledge.
Centre for Digital Curation
Digital Curation (2)
Curation and long-term preservation of digital resources will be of increasing importance for a wide range of activities within research and education.
Through sensors, experiments, digitisation and computer simulation, digital resources and data are growing in volume and complexity at a staggering rate.
The cost of producing these resources is very high: satellites, particle accelerators, genome sequencing, and large scale digitisation and electronic publishing collectively represent a cumulative investment of billions of pounds in digital research and learning.
Long-term curation and preservation of digital resources is seen as a challenge which is difficult if not impossible for individual institutions to resolve on their own due to the complexity and scale of the challenges involved.
Curation in Physical Sciences
Data is being generated in large volumes.
In laboratories, old archival material (design specifications, codes, etc.) can serve as a reference resource.
Remote information access through online publications.
Data management and real-time remote analysis: heavily dependent on bandwidth.
New middleware is being developed for access to data across geographically disparate centres.
Data sharing in astro, nuclear and particle physics: usually characterised by large collaborations (in excess of 100 people).
MetaData are essential for the selection of events
Can use the Grid file catalogue for one part of the MetaData
During the Data Challenge we used the file catalogue for storing part of the MetaData
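The idea of selecting data through MetaData held in a file catalogue can be illustrated with a minimal sketch. It assumes a hypothetical in-memory catalogue; the field names (`lfn`, `run`, `beam`, `trigger`), the example paths and the `select` helper are illustrative only and are not the API of the Grid file catalogue or of AliEn.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CatalogueEntry:
    """One catalogue record: a logical file name plus a few metadata fields.

    All field names here are hypothetical, chosen for illustration.
    """
    lfn: str      # logical file name (made-up example path)
    run: int      # run number
    beam: str     # colliding system, e.g. "Pb-Pb"
    trigger: str  # trigger class, e.g. "central" or "dimuon"


def select(catalogue, **criteria):
    """Return the entries whose metadata match every key=value pair given."""
    return [
        entry for entry in catalogue
        if all(getattr(entry, key) == value for key, value in criteria.items())
    ]


# A tiny made-up catalogue.
catalogue = [
    CatalogueEntry("/example/run1001/raw_001.root", 1001, "Pb-Pb", "central"),
    CatalogueEntry("/example/run1001/raw_002.root", 1001, "Pb-Pb", "dimuon"),
    CatalogueEntry("/example/run1002/raw_001.root", 1002, "p-p", "dimuon"),
]

# Select only the files holding Pb-Pb dimuon data.
dimuon_files = [e.lfn for e in select(catalogue, beam="Pb-Pb", trigger="dimuon")]
```

In this sketch the analysis job only ever sees the list of logical file names returned by the query, which is the reason the MetaData have to be recorded at production time.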
Data Handling and Computation for Physics Analysis
[Figure: data-flow diagram. Elements: detector, raw data, event filter (selection & reconstruction), event reprocessing, event simulation, processed data, event summary data, analysis objects (extracted by physics topic), batch physics analysis, interactive physics analysis. Stages: simulation, reconstruction, analysis. Credit: les.robertson@cern.ch, CERN]
Experimental conditions in heavy-ion colliders
Beam: Pb-Pb, Ca-Ca, p-p, p-A
Rates: 8000 events/s minimum bias; 50-100/s central events (2-5% of total)
Acquisition rate: 100 Hz (central), 1000 Hz (dimuons)
1 month/year (10^6 s) = 10^7 central events
Multiplicity: dn/dy from 2000 to 8000, so a total of about 60000
Consequences
More than 60 GBytes produced per second in ALICE:
● High Level Trigger (HLT) + compression to reduce raw data to 1.2 GB/s: 2 to 3 PB/year in 1 month of data taking
● Very fast acquisition and network
ALICE will be one of the largest databases in history.
Need a GRID to distribute and analyse data.
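The quoted yearly volume can be cross-checked with a short calculation. This is a rough sketch under stated assumptions: decimal units (1 GB = 10^9 bytes, 1 PB = 10^15 bytes), and two readings of "1 month of data taking", namely the effective 10^6 s quoted for the running conditions and a full 30-day calendar month. The function and variable names are illustrative.

```python
GB = 10**9   # bytes, decimal convention (assumption)
PB = 10**15  # bytes, decimal convention (assumption)


def volume_pb(rate_gb_per_s, seconds):
    """Data volume in PB written at a constant rate over a given time."""
    return rate_gb_per_s * GB * seconds / PB


effective_month_s = 10**6            # effective running time quoted in the slides
calendar_month_s = 30 * 24 * 3600    # a full 30-day month (assumption)

low = volume_pb(1.2, effective_month_s)   # about 1.2 PB
high = volume_pb(1.2, calendar_month_s)   # about 3.1 PB

# Reduction factor achieved by HLT + compression: 60 GB/s down to 1.2 GB/s.
reduction = 60 / 1.2                      # about 50
```

The 2 to 3 PB/year figure lies between these two estimates, so it is consistent with a sustained 1.2 GB/s if the live time exceeds the nominal 10^6 s.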
The Grid Vision
The GRID: networked data processing centres, with "middleware" software as the "glue" between resources.
Researchers perform their activities regardless of geographical location, interact with colleagues, and share and access data.
Scientific instruments and experiments provide huge amounts of data.
Classification of Grids
Computational Grids (including CPU-scavenging Grids), which focus primarily on computationally intensive operations.
Data Grids, for the controlled sharing and management of large amounts of distributed data.
Equipment Grids, which have a primary piece of equipment (e.g. a telescope), and where the surrounding Grid is used to control the equipment remotely and to analyse the data produced.
Grid beyond high energy physics
Owing to the computational power of EGEE, new communities are requesting services for different research fields.
Normally these communities do not need the complex structure required by the HEP communities.
In many cases, their production runs are shorter and well defined within the year.
The amount of CPU required is much lower, as are the storage requirements.
20 applications from 7 domains: High Energy Physics, Biomedicine, Earth Sciences, Computational Chemistry, Astronomy, Geophysics and Financial Simulation.
LCG services – built on two major science grid infrastructures
EGEE – Enabling Grids for E-Science
OSG – US Open Science Grid
LCG Service Hierarchy
Tier-0 – the accelerator centre: data acquisition & initial processing; long-term data curation; distribution of data to Tier-1 centres.
Tier-1 – "online" to the data acquisition process, high availability: managed mass storage (grid-enabled data service); data-heavy analysis; national, regional support.
Tier-1 centres: Canada – TRIUMF (Vancouver); France – IN2P3 (Lyon); Germany – Forschungszentrum Karlsruhe; Italy – CNAF (Bologna); Netherlands Tier-1 (Amsterdam); Nordic countries – distributed Tier-1; Spain – PIC (Barcelona); Taiwan – Academia Sinica (Taipei); UK – CLRC (Oxford); US – FermiLab (Illinois) and Brookhaven (NY).
Tier-2 – ~100 centres in 20 countries: simulation; end-user analysis (batch and interactive).
Tier0 / Tier1 / Tier2 Networks
Tier-2s and Tier-1s are inter-connected by the general purpose research networks.
Any Tier-2 may access data at any Tier-1.
[Figure: network diagram of the Tier-1 centres (IN2P3, TRIUMF, ASCC, FNAL, BNL, Nordic, CNAF, SARA, PIC, RAL, GridKa) surrounded by Tier-2 centres, one of them labelled "Cape Town ?"]
Summary of Tier0/1/2 Roles
Tier0 (CERN): safe keeping of RAW data (first copy); first pass reconstruction, distribution of RAW data and reconstruction output to Tier1; reprocessing of data during LHC down-times;
Tier1: safe keeping of a proportional share of RAW and reconstructed data; large scale reprocessing and safe keeping of corresponding output; distribution of data products to Tier2s and safe keeping of a share of simulated data produced at these Tier2s;
Tier2: Handling analysis requirements and proportional share of simulated event production and reconstruction.
Very difficult to estimate Network requirements!
N.B. there are differences in roles by experiment.
Essential to test using the complete production chain of each!
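The division of roles summarised above can be written down as a small lookup table. This is only an illustrative sketch: the role strings paraphrase the summary, while the `ROLES` dictionary and the `tiers_for` helper are hypothetical names, not part of any LCG software. As noted, the real roles differ by experiment.

```python
# Roles per tier, paraphrased from the summary above.
ROLES = {
    "Tier0": [
        "safe keeping of RAW data (first copy)",
        "first pass reconstruction",
        "distribution of RAW data and reconstruction output to Tier1",
        "reprocessing of data during LHC down-times",
    ],
    "Tier1": [
        "safe keeping of a proportional share of RAW and reconstructed data",
        "large scale reprocessing and safe keeping of corresponding output",
        "distribution of data products to Tier2s",
        "safe keeping of a share of simulated data produced at Tier2s",
    ],
    "Tier2": [
        "handling analysis requirements",
        "proportional share of simulated event production and reconstruction",
    ],
}


def tiers_for(keyword):
    """Return the tiers with at least one role mentioning the keyword."""
    keyword = keyword.lower()
    return [
        tier for tier, roles in ROLES.items()
        if any(keyword in role.lower() for role in roles)
    ]


reprocessing_tiers = tiers_for("reprocessing")  # Tier0 and Tier1
analysis_tiers = tiers_for("analysis")          # Tier2 only
```

A table of this kind makes it easy to see which network links each activity loads, which is the quantity the summary flags as hard to estimate.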
Physics Data Challenge(s)
[Figure: data challenge workflow between Tier1 and Tier2 centres. Steps: production of RAW; shipment of RAW to CERN; reconstruction of RAW in all T1's; analysis. Legend: AliEn job control; data transfer. Credit: F. Carminati (CERN)]
ALICE Network in the World
[Figure: world map of active sites: Yerevan, CERN, Saclay, Lyon, Dubna, Cape Town (ZA), Birmingham, Cagliari, NIKHEF, GSI, Catania, Bologna, Torino, Padova, IRB, Kolkata (India), OSU/OSC, LBL/NERSC, Merida, Bari]
37 people, 21 institutions
http://www.to.infn.it/activities/experiments/alice-grid
Undersea Cable Capacity
Asymmetric Inter-regional Bandwidth
Result: Sample Bandwidth Costs for African Universities
Source: IEEAF
Topics for discussion
Management of large data sets: $$ and R; database management; skills
Digital divide: cyber infrastructure (network / HR / libraries / data sets / LAN etc.)
Inter-operability: e.g. Astro-Grid, mammo Grid etc.
Standards and protocols: preservation and quality; access (meaning of numbers) / terminology and use of unfamiliar data; configuration management; e.g. Particle data book
Security and certification: certification authorities
Dialogue between researchers & librarians: role of libraries and curators; guidelines; academic training programme / schools outreach
Schools: new curriculum development (lost data)
Research students: access to previous theses
Resource management
Challenges
Strategy for Natural sciences across different domains