Data Preservation in HEP: Use Cases, Business Cases, Costs

Exa-Scale Data Preservation in HEP
APA/C-DAC Conference
February 2014
International Collaboration for Data Preservation and
Long Term Analysis in High Energy Physics
Background
• Whilst this talk concerns data from High Energy
Physics (HEP) experiments at CERN and
elsewhere, many points are generic
• The scale: 100PB today, reaching ~5EB by 2030
– “Trusted” repositories of this size – and with a lifetime
of at least decades – are a sine qua non of our work
• I will also talk about costs, business cases,
problems and opportunities…
BEFORE!
Data flow to permanent storage: 4-6 GB/sec
200-400 MB/sec
1-2 GB/sec
1.25 GB/sec
1-2 GB/sec
CERN-JRC meeting Bob Jones
Tier 0 – Tier 1 – Tier 2
Tier-0 (CERN):
• Data recording
• Initial data reconstruction
• Data distribution
Tier-1 (11 centres):
• Permanent storage
• Re-processing
• Analysis
Tier-2 (~130 centres):
• Simulation
• End-user analysis
Tier-2 centres in India:
• Kolkata (ALICE)
• Mumbai (CMS)
Frédéric Hemmer
The LHC Computing Grid, February 2010
Managing 100 PBytes of data
27 January 2014
CERN-JRC meeting
Bob Jones
LHC Schedule
Timeline: 2009 to 2024, then … 2030?
• First run: LHC startup at 900 GeV, then 7 TeV, L = 6x10^33 cm^-2 s^-1, bunch spacing = 50 ns
• LS1: Phase-0 Upgrade (design energy, nominal luminosity)
• Second run: 14 TeV, L = 1x10^34 cm^-2 s^-1, bunch spacing = 25 ns
• LS2: Phase-1 Upgrade (design energy, design luminosity)
• Third run: 14 TeV, L = 2x10^34 cm^-2 s^-1, bunch spacing = 25 ns
• LS3: Phase-2 Upgrade (High Luminosity)
• HL-LHC (2030?): 14 TeV, L = 1x10^35 cm^-2 s^-1, bunch spacing = 12.5 ns
ATLAS Higgs Candidates
AFTER!
CERN has a ~100 PB archive
But it's still
early days for the LHC!
• Only EYETS (19 weeks) (no Linac4 connection during Run 2)
• LS2: starting in July 2018, 18 months + 3 months BC (Beam Commissioning)
• LS3:
– LHC: starting in 2023, 30 months + 3 months BC
– Injectors: starting in 2024, 13 months + 3 months BC
[Chart: LHC and injector schedule by quarter, 2015 to 2035, showing Run 2, LS 2, Run 3, LS 3, Run 4, LS 4, Run 5 and LS 5]
LHC schedule approved by CERN management and LHC experiments' spokespersons and technical coordinators, Monday 2nd December 2013
LS1 Status Report – 116th LHCC, Frédérick Bordry, 4th December 2013
ECFA
European Committee for Future Accelerators
High Luminosity LHC (HL-LHC)
Update of the European Strategy for Particle Physics adopted
30 May 2013 in a special session of CERN Council at Brussels.
Statement c:
c) The discovery of the Higgs boson is the start of a major programme of work
to measure this particle’s properties with the highest possible precision for
testing the validity of the Standard Model and to search for further new physics
at the energy frontier. The LHC is in a unique position to pursue this programme.
Europe’s top priority should be the exploitation of the full potential of the LHC,
including the high-luminosity upgrade of the machine and detectors with a
view to collecting ten times more data than in the initial design, by around
2030. This upgrade programme will also provide further exciting opportunities
for the study of flavour physics and the quark-gluon plasma.
October 1, 2013
HL-LHC Workshop
Data: Outlook for HL-LHC
[Chart: data volume in PB (axis 0 to 450) for Run 1, Run 2, Run 3 and Run 4, by experiment: CMS, ATLAS, ALICE, LHCb]
• Very rough estimate of new RAW data per year of running, using a
simple extrapolation of the current data volume scaled by the output rates.
• To be added: derived data (ESD, AOD), simulation, user data…
Predrag Buncic, October 3, 2013
ECFA Workshop Aix-Les-Bains
1 – Long Tail of Papers
2 – New Theoretical Insights
3 – “Discovery” to “Precision”
Zimmermann
Alain Blondel, TLEP design study, r-ECFA, 2013-07-20
Volume: 100PB + ~50PB/year
(+400PB/year from 2020)
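As a rough cross-check of these volume figures against the ~5EB by 2030 quoted in the Background, the sketch below adds one year's growth per calendar year. It assumes growth accrues in every year from 2014 to 2030 inclusive and ignores shutdown periods; the function name and its parameters are illustrative and do not come from the talk.

```python
def projected_archive_pb(year, start_year=2014, start_pb=100.0,
                         early_rate_pb=50.0, late_rate_pb=400.0,
                         step_year=2020):
    """Archive volume in PB at the end of `year`.

    Assumption: one full year's growth is added for every calendar year
    from `start_year` onwards, with the higher rate applying from
    `step_year`. Long shutdowns are ignored.
    """
    total = start_pb
    for y in range(start_year, year + 1):
        total += late_rate_pb if y >= step_year else early_rate_pb
    return total


if __name__ == "__main__":
    pb_2030 = projected_archive_pb(2030)
    print(f"Projected archive at end of 2030: {pb_2030:.0f} PB "
          f"(~{pb_2030 / 1000:.1f} EB)")
```

Under these assumptions the total at the end of 2030 is 4800 PB, about 4.8 EB, which is in line with the ~5EB estimate.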
1. DPHEP Portal
2. Digital library tools (Invenio) & services (CDS, INSPIRE,
ZENODO) + related tools (HepData, RIVET, …)
3. Sustainable software, coupled with advanced virtualization
techniques, “snap-shotting” and validation frameworks
4. Proven bit preservation at the 100PB scale, together with a
sustainable funding model with an outlook to 2040/50
5. Open Data (“Open everything”)
DSS
Case B) increasing archive growth
Start with 10PB, then +50PB/year, then +50% every 3y (or +15% / year)
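The two growth figures quoted for Case B can be checked against each other, and the growth curve sketched, with a few lines of Python. This is a minimal sketch under stated assumptions: the percentage growth is taken to apply to the yearly increment (not the total volume), and the year at which the archive switches from linear to percentage growth (`linear_years`) is a hypothetical parameter, since the slide does not give it.

```python
def equivalent_annual_rate(growth, years):
    """Annual compound rate equivalent to `growth` over `years` years."""
    return (1.0 + growth) ** (1.0 / years) - 1.0


def case_b_volumes(total_years, linear_years, start_pb=10.0,
                   linear_step_pb=50.0, growth=0.5, growth_period=3):
    """Archive volume in PB at the end of each year, starting at year 0.

    Assumptions (not from the slide): the yearly increment stays at
    `linear_step_pb` for the first `linear_years` years, then the
    increment itself compounds at the equivalent annual rate.
    """
    rate = equivalent_annual_rate(growth, growth_period)
    volumes = [start_pb]
    increment = linear_step_pb
    for year in range(1, total_years + 1):
        if year > linear_years:
            increment *= 1.0 + rate
        volumes.append(volumes[-1] + increment)
    return volumes


if __name__ == "__main__":
    rate = equivalent_annual_rate(0.5, 3)
    print(f"+50% every 3 years is about +{rate * 100:.1f}% per year")
    print(case_b_volumes(total_years=6, linear_years=3))
```

The first function shows that +50% every 3 years corresponds to roughly +14.5% per year, consistent with the rounded +15% / year on the slide.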
Internet
Services
CERN IT Department
CH-1211 Genève 23
Switzerland
www.cern.ch/it
Total cost: ~59.9M$
(~2M$ / year)
Summary
1. DPHEP portal: build in collaboration with other
disciplines, including RDA IG and the APA…
2. Digital libraries: continue existing collaborations
3. Sustainable “bit preservation” – certified
repositories as part of EINFRA-1-2014
4. “Knowledge capture & preservation”: BIG
CHALLENGE not addressed in a multi-disciplinary
way: next funding round?
5. Open “Big Data”: a Big Opportunity (for RDA?)
Portal Example # 1
Portal Platform – Zenodo?
Documentation projects with INSPIRE
• Internal notes from all HERA experiments now available on INSPIRE
– Experiments no longer need to provide dedicated hardware for such things
– Password protected now, simple to make publicly available in the future
• The ingestion of other documents is under discussion, including theses,
preliminary results, conference talks and proceedings, paper drafts, ...
• More experiments working with INSPIRE, including CDF, D0 as well as BaBar
David South | Data Preservation and Long Term Analysis in HEP | CHEP 2012, May 21-25 2012
LEP Cost would be “now” …
• Completely different, of course …
• Direct resource cost is already compatible with zero for LEP experiments
– Total ALEPH DATA + MC (analysis format) = 30 TB
– ALEPH: Shift50 = 320 CernUnit. A single one of today’s pizza boxes largely exceeds this
– CDF data: O(10 PB), bought today for <400kEur
– CDF CPU ~ 1MSi2k = 4 kHS06 = 40kEur
• Here the main problem is knowledge/support, clearly
– Can you trust a “NP peak” 10 years later, when experts are gone?
• ALEPH reproducibility test (M. Maggi, by NO means a DP solution) ~0.5 FTE for 3 months
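The CDF figures quoted on this slide imply simple unit costs. The sketch below works them out, assuming 1 PB = 1000 TB and treating the order-of-magnitude volume (10 PB) and the quoted upper bound (<400kEur) as exact values; the function and variable names are illustrative.

```python
def unit_cost(total_eur, quantity):
    """Cost per unit, given a total cost in EUR and a quantity."""
    return total_eur / quantity


# CDF data: O(10 PB) for <400 kEUR, so this is an upper bound per TB.
storage_eur_per_tb = unit_cost(400_000, 10 * 1000)

# CDF CPU: ~1 MSi2k = 4 kHS06 = 40 kEUR.
cpu_eur_per_hs06 = unit_cost(40_000, 4_000)

# Conversion implied by the slide: 1 MSi2k = 4 kHS06.
si2k_per_hs06 = 1_000_000 / 4_000

if __name__ == "__main__":
    print(f"Storage: under {storage_eur_per_tb:.0f} EUR per TB")
    print(f"CPU: about {cpu_eur_per_hs06:.0f} EUR per HS06")
    print(f"Conversion: {si2k_per_hs06:.0f} Si2k per HS06")
```

With these inputs the storage comes out at under 40 EUR per TB and the CPU at about 10 EUR per HS06, which supports the point that hardware is not the main cost.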
Open Data?
Costs and Scale
• There are 4 (main) collaborations + detectors at the LHC: the largest has
3000 members
• The annual cost of WLCG (infrastructure, operations, services) is
~EUR100M
• The CERN database services cost around 2MCHF per year for Materials
(licenses, maintenance, hardware) and 2MCHF for personnel
• The central grid Experiment Integration Support team varied between 4 and 10
people, plus significant effort at sites and within experiments
• The DPHEP Full Costs of Curation workshop concluded that a team of ~4
people, with access to experts, could “make significant progress” (be
careful with this number!)
Conclusions
• Long-term data preservation is a journey, not a
destination
• As such, it is best not to venture out alone
• A clear understanding of costs & benefits is necessary
to secure funding
• We are eager to share our knowledge and experience
(exa-scale “bit preservation”)
• We have learned a lot through collaboration within
the APA – and are keen to learn more in the future