Long-Term Data Preservation [email protected] WLCG Overview Board, March 2013 Twitter: #DPHEP International Collaboration for Data Preservation and Long Term Analysis in High Energy Physics
Long-Term Data Preservation

Feb 22, 2016

chung

Transcript
Page 1: Long-Term Data Preservation

Long-Term Data Preservation

[email protected] Overview Board, March 2013

Twitter: #DPHEP

International Collaboration for Data Preservation and Long Term Analysis in High Energy Physics

Page 2: Long-Term Data Preservation

Overview

• Summary of DPHEP Blueprint recommendations

• Opportunities: collaboration with other disciplines & funding

• A “2020 vision” and its implementation

Page 3: Long-Term Data Preservation

DPHEP BLUEPRINT

Page 4: Long-Term Data Preservation

DPHEP Entities

Organisational body, with its description, input and positioning, and DPHEP output:

DPHEP
– Description: Organisation for Data Preservation in High-Energy Physics
– Input and positioning: Projects in data preservation at experiment and laboratory level
– DPHEP Output: Working groups on common projects, status report documents

DPHEP Chair
– Description: Overall coordination of DPHEP
– Input and positioning: Appointed by ICFA, represents DPHEP in relationship with other bodies
– DPHEP Output: Yearly reports to ICFA, representation to other related scientific bodies

DPHEP Project Manager
– Description: Project management, administrative, technical, funding
– Input and positioning: Main operational coordinator, maintains contacts, organises meetings, leads proposals for funding
– DPHEP Output: Reports to the steering committee

Advisory committee
– Description: Group of external personalities
– Input and positioning: Synergy with the wider HEP community, input from other fields and initiatives
– DPHEP Output: Project proposals, documents for scrutiny

Steering committee
– Description: Internal executive body, chaired by the DPHEP Chair
– Input and positioning: Contributions from the participating members
– DPHEP Output: Strategic and operational decisions

Funding bodies
– Description: Funding agencies are invited to take note of the progress reports and periodically analyse the relevance of the funding
– Input and positioning: Direct funding to the DPHEP organisation, under the supervision of the Project Manager
– DPHEP Output: Quarterly progress reports

Page 5: Long-Term Data Preservation

DPHEP Entities

Implemented via multi-lateral Collaboration Agreement (draft circulated)

Page 6: Long-Term Data Preservation

DPHEP Entities

Chair of Study Group was Cristinel Diaconu / CPPM & DESY, who continues in this role

Page 7: Long-Term Data Preservation

DPHEP Entities

CERN provides Project Manager 2013 – 2015, after which the role may rotate

Page 8: Long-Term Data Preservation

DPHEP Entities

Broadened to include “influential” names, e.g. from APA, SCIDIP-ES

Page 9: Long-Term Data Preservation

DPHEP Entities

Representatives of parties to Collaboration Agreement

Page 10: Long-Term Data Preservation

DPHEP Entities

e.g. EU, NSF, STFC, INFN, …

Page 11: Long-Term Data Preservation

DPHEP Blueprint Deliverables

Objective: Deliverable (Measurable)

• Positioning as forum: Catalogue of technical knowledge and practical solutions. Description of possible alternatives for governance.

• Co-ordination of projects: Common R&D projects meet the expectations of the stakeholders.

• Harmonisation and liaison: Synchronisation of preservation projects in the field. Identification of areas where external knowledge needs to be transferred to HEP.

• Design sustainable future: Characterisation of discipline-wide toolkit for preservation. Business plan for long-term preservation in HEP.

• Outreach and advocacy: Understanding of needs/opportunities for medium- and small-sized collaborations. Concrete discussions with funding bodies/laboratories.

Proposed activities of the DPHEP Organization – p85, Blueprint document. These deliverables are to be met within 2 years of becoming fully operational.

Page 12: Long-Term Data Preservation

DPHEP Preservation Levels

Preservation Model: Use case

1. Provide additional documentation: Publication-related information search

2. Preserve the data in a simplified format: Outreach, simple training analyses

3. Preserve the analysis level software and data format: Full scientific analysis based on existing reconstruction

4. Preserve the reconstruction and simulation software and basic level data: Full potential of the experimental data
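
For tooling that needs to tag a dataset with the level at which it is preserved, the four levels above could be written down as a small lookup. This is only a sketch: the class, member and function names are hypothetical and do not come from the Blueprint, and it assumes the levels are cumulative (level 4 also covers what level 3 allows).

```python
from enum import IntEnum


class PreservationLevel(IntEnum):
    """DPHEP preservation levels 1-4 (member names are illustrative only)."""
    DOCUMENTATION = 1
    SIMPLIFIED_FORMAT = 2
    ANALYSIS_SOFTWARE_AND_FORMAT = 3
    RECONSTRUCTION_AND_SIMULATION = 4


# Use case for each level, as listed on the slide.
USE_CASE = {
    PreservationLevel.DOCUMENTATION: "Publication-related information search",
    PreservationLevel.SIMPLIFIED_FORMAT: "Outreach, simple training analyses",
    PreservationLevel.ANALYSIS_SOFTWARE_AND_FORMAT:
        "Full scientific analysis based on existing reconstruction",
    PreservationLevel.RECONSTRUCTION_AND_SIMULATION:
        "Full potential of the experimental data",
}


def supports_full_analysis(level):
    """True for levels 3 and 4, assuming the levels are cumulative."""
    return level >= PreservationLevel.ANALYSIS_SOFTWARE_AND_FORMAT
```
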

Page 15: Long-Term Data Preservation

DPHEP Levels

HepMC / Rivet toolkit may play a useful – and sustainable – role here. See DPHEP7

Page 16: Long-Term Data Preservation

DPHEP Summary

• There is a lot of knowledge and experience in the existing DPHEP community that can be leveraged for other efforts, e.g. LHC & LEP

• LHC is clearly of key interest to WLCG OB but we should not forget LEP before it is too late!

• On-going (small) effort to document current situation and options for moving forward

CERNLIB felt to be (a) critical factor but there are many external distributions

Page 17: Long-Term Data Preservation

OPPORTUNITIES & FUNDING

Page 18: Long-Term Data Preservation

Collaboration with others

• Many other disciplines, ranging from science to arts & humanities, already (very) active

• Numerous conferences and workshops have been up and running for years

• We have been accepted – partly due to halo effect of the Higgs discovery – with open arms

• Concrete discussions on further collaboration and funding are advancing well

Not limited to Data Preservation – e.g. SKA!

Page 19: Long-Term Data Preservation

Funding

• DASPOS is up and running with NSF funding

• Research Data Alliance – with indirect EU, NSF, AUS and other funding – will play a role

– Co-chair of RDA WG on DP

• Clear signs that EU Horizon 2020 will include Data Preservation

– e-IRG meeting, EIROforum w/s, RDA, …

• Now is the time to firm up partnerships & prepare for up-coming projects

STFC and other UK bodies particularly active in above activities: how can we profit from this?

Page 20: Long-Term Data Preservation

A 2020 VISION

Page 21: Long-Term Data Preservation

2020 Vision for LT DP in HEP

• Long-term: disruptive change(s), e.g. LC era

– All archived data – e.g. that described in Blueprint, including LHC data – easily findable, fully usable by designated communities with clear (Open) access policies and possibilities to annotate further

– Best practices, tools and services well run-in, fully documented and sustainable; built in common with other disciplines, based on standards

Vision achievable, but we are far from this today

Page 22: Long-Term Data Preservation

Long-Term Commitment

• To achieve long-term data preservation, we need long-term commitment(s)

• By 2035, there will have been:

– 3-4 updates to the ESPP;

– 4-5 new DGs;

– X re-organizations of CERN-IT.

We need commitments that outlive all of these!

Page 23: Long-Term Data Preservation

2020 Vision – The OAIS Model

Page 24: Long-Term Data Preservation

OAIS Components

• In the OAIS model, there are the concepts of producer and consumer

• DASPOS aims to take data produced by e.g. CMS and show that e.g. ATLAS can reproduce a full analysis, using the software, meta-data, documentation etc.

• This exercise will be started at DPHEP7 (March 21-22) and hopefully repeated regularly – e.g. annually – so that by 2020 the entire process is well understood, documented and repeatable

It is proposed that the (Archive) Information Packages are simply XML documents stored in Invenio

• The exact tool-set and feature requirement is still TBD

• Some tools used on a daily basis – e.g. Twiki! – not suitable for long-term archives

Good opportunity for sharing experiences and best practices with other disciplines / projects, e.g. SCIDIP-ES, APA

APA – “Too Big an Issue for any single organisation – we must work together”
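
To make the idea of an information package as a plain XML document concrete, here is a minimal Python sketch using only the standard library. Every element and attribute name (`InformationPackage`, `Producer`, `Software`, `Package`, `Documentation`, `Reference`) is invented for illustration; it is neither an OAIS schema nor an Invenio record format, and the dataset identifier and URL are placeholders.

```python
import xml.etree.ElementTree as ET


def build_information_package(dataset_id, experiment, software, documentation_urls):
    """Return an XML string describing one archived dataset.

    All element and attribute names are hypothetical; the real tool-set
    and feature requirements are stated on the slide to be TBD.
    """
    root = ET.Element("InformationPackage", attrib={"id": dataset_id})
    # The OAIS "producer": the experiment that generated the data.
    ET.SubElement(root, "Producer").text = experiment
    # Software a consumer would need in order to repeat the analysis.
    software_node = ET.SubElement(root, "Software")
    for name, version in software.items():
        ET.SubElement(software_node, "Package",
                      attrib={"name": name, "version": version})
    # Pointers to the documentation that accompanies the data.
    docs_node = ET.SubElement(root, "Documentation")
    for url in documentation_urls:
        ET.SubElement(docs_node, "Reference").text = url
    return ET.tostring(root, encoding="unicode")


xml_text = build_information_package(
    "example-dataset-001",                       # placeholder identifier
    "CMS",
    {"analysis-framework": "1.0"},               # placeholder software entry
    ["https://example.org/docs/analysis-note"],  # placeholder URL
)
# A consumer can parse the package back without any special tooling.
parsed = ET.fromstring(xml_text)
```

Keeping the package as self-describing text is what lets a consumer from another experiment read it years later with generic tools.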

Page 25: Long-Term Data Preservation

Archival Storage

• Experience from WLCG and beyond tells us that data loss and corruption will (and does) occur!

– See WLCG SIRs, Tim Bell’s presentation to DPHEP3

• But there are things that we can do to mitigate risks and recover (often), e.g. rule-based systems: apply checksum and other “tests” upon schedule and/or actions

• What is the current situation at WLCG sites?

• Can we coordinate / agree suitable actions?

• Coordinate via HEPiX, IEEE MSST, APA, EUDAT, RDA etc.

• Collaboration with industry, e.g. IBM-led FP7 project

Recovery often performed by experiments by re-replicating data: how will this be done in the long-term?
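
The checksum "test" mentioned above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not a description of what any WLCG site runs: the manifest format (a mapping from file path to the digest recorded at ingest), the function names, and the choice of SHA-256 are all hypothetical. A scheduler or storage event hook would call `verify_archive` periodically.

```python
import hashlib
from pathlib import Path


def file_checksum(path, algorithm="sha256", chunk_size=1 << 20):
    """Hash a file in chunks so large files need not fit in memory."""
    digest = hashlib.new(algorithm)
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_archive(manifest):
    """Compare each file against the digest recorded at ingest.

    `manifest` maps a file path to its expected hex digest (a hypothetical
    format). Returns a dict mapping path to 'ok', 'corrupt' or 'missing'.
    """
    report = {}
    for path, expected in manifest.items():
        if not Path(path).is_file():
            report[path] = "missing"
        elif file_checksum(path) != expected:
            report[path] = "corrupt"
        else:
            report[path] = "ok"
    return report
```

Files reported as `corrupt` or `missing` are the candidates for recovery, for example by re-replicating from another copy.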

Page 26: Long-Term Data Preservation

DPHEP Level 4

• Retaining the full potential of the data is the only really interesting option – but it is by far the most difficult!

• Difficult does not mean impossible – and we can profit from a period of “meta-stability” while we concentrate on this

• Past experiments typically ported / re-wrote major parts of their offline environment several times over a period of decades

• This is inevitable for LHC too – we could make this easier, but it will require an initial investment!

Collaboration with others who face similar problems could help but much of this we have to solve ourselves

Page 27: Long-Term Data Preservation

Where to Invest?

Tools and Services, e.g. Invenio Archival Storage Functionality

Support to the Experiments for DPHEP Level 4

Page 28: Long-Term Data Preservation

Suggested Topics for DPHEP7

• “Ingest Issues” (10’)

– How did you (the experiment) decide what data to save, how to make it discoverable / available, how is it documented, where is the data / meta-data etc. What are the access policies and target communities?

– What tools do you use?

• “Archive issues”: (10’)

– How is the archive managed? How are errors detected and handled? What is the experience?

– What storage system / services are used?

• “Offline environment issues”: (20’)

– What have been the key challenges in keeping the offline environment alive? What are the key lessons learned / pitfalls to be avoided? What would you have done differently if long-term preservation had been a goal from the early days of the experiment?

DPHEP8: around or during CHEP? TBD in coming weeks… Doodle

Outline for site / experiment talks at DPHEP7, March 21-22, CERN

Page 29: Long-Term Data Preservation

S.W.O.T.

Strengths: DPHEP is well established within the community and recent contacts to other disciplines are very encouraging

Weaknesses: Effort is very scarce within the project at a time when manpower is already stretched to the limit elsewhere

Opportunities: Through a convergence of events there are clear possibilities for significant funding and collaboration in the EU’s Horizon 2020 programme and most likely corresponding programmes in other areas of the world, e.g. NSF-funded projects

Threats: Failure to invest now would jeopardise attempts to “rescue” LEP data as well as to take other preservation events (BaBar, Tevatron, Hera etc.) to a stable and sustainable state. It could also limit our ability to prepare for – and hence participate in – future projects

Page 30: Long-Term Data Preservation

Summary

• We have outlined the current status of Long-Term Data Preservation in HEP and areas for fruitful collaboration with others

• Funding, e.g. through EU Horizon 2020, is looking good – we need to invest now to secure this!

Much work needs to be done to turn a dream into reality – particularly and critically in the area of future-proof offline environments

• However, this is expected to result in a cost-saving in the long-term by reducing effort in inevitable migrations

Page 31: Long-Term Data Preservation

Where to Invest – Summary

Tools and Services, e.g. Invenio: could be solved. (2-3 years?)

Archival Storage Functionality: should be solved. (i.e. “now”)

Support to the Experiments for DPHEP Level 4: must be solved – but how?

Page 32: Long-Term Data Preservation

International Collaboration for Data Preservation and Long Term Analysis in High Energy Physics