Exa-Scale Data Preservation in HEP
[email protected]
APA/C-DAC Conference, February 2014
International Collaboration for Data Preservation and Long Term Analysis in High Energy Physics

Background
• Whilst this talk concerns data from High Energy Physics (HEP) experiments at CERN and elsewhere, many points are generic
• The scale: 100PB today, reaching ~5EB by 2030
  – “Trusted” repositories of this size, and with a lifetime of at least decades, are a sine qua non of our work
• I will also talk about costs, business cases, problems and opportunities…

BEFORE!

Data flow to permanent storage: 4-6 GB/sec
[Figure: data-flow diagram; other rates shown: 200-400 MB/sec, 1-2 GB/sec, 1.25 GB/sec, 1-2 GB/sec]
CERN-JRC meeting, Bob Jones

Tier 0 – Tier 1 – Tier 2
Tier-0 (CERN):
• Data recording
• Initial data reconstruction
• Data distribution
Tier-1 (11 centres):
• Permanent storage
• Re-processing
• Analysis
Tier-2 (~130 centres):
• Simulation
• End-user analysis
Tier-2 centres in India:
• Kolkata (ALICE)
• Mumbai (CMS)
Frédéric Hemmer, The LHC Computing Grid, February 2010

Managing 100 PBytes of data
27 January 2014, CERN-JRC meeting, Bob Jones

LHC Schedule
[Figure: timeline 2009 2010 2011 2012 2013 2014 2015 2016 2017 2018 2019 2020 2021 2022 2023 2024 … 2030?]
• First run: LHC startup at 900 GeV, then 7 TeV, L = 6x10^33 cm^-2 s^-1, bunch spacing = 50 ns
• LS1: Phase-0 Upgrade (design energy, nominal luminosity)
• Second run: 14 TeV, L = 1x10^34 cm^-2 s^-1, bunch spacing = 25 ns
• LS2: Phase-1 Upgrade (design energy, design luminosity)
• Third run: 14 TeV, L = 2x10^34 cm^-2 s^-1, bunch spacing = 25 ns
• LS3: HL-LHC Phase-2 Upgrade (High Luminosity)
• … 2030?: 14 TeV, L = 1x10^35 cm^-2 s^-1, spacing = 12.5 ns

ATLAS Higgs Candidates

AFTER!

CERN has ~100 PB archive

But it's still early days for the LHC!
[Figure: LHC long-term schedule chart by quarter, 2015 to 2035, with rows for the LHC and the injectors, covering Run 2, LS 2, Run 3, LS 3, Run 4, LS 4, Run 5 and LS 5]
• Only EYETS (19 weeks) (no Linac4 connection during Run2)
• LS2 starting in 2018 (July): 18 months + 3 months BC (Beam Commissioning)
• LS3, LHC: starting in 2023 => 30 months + 3 BC
• LS3, injectors: in 2024 => 13 months + 3 BC
LS1 Status Report – 116th LHCC, Frédérick Bordry, 4th December 2013
LHC schedule approved by CERN management and LHC experiments spokespersons and technical coordinators, Monday 2nd December 2013

ECFA: European Committee for Future Accelerators
High Luminosity LHC (HL-LHC)
Update of the European Strategy for Particle Physics, adopted 30 May 2013 in a special session of CERN Council at Brussels. Statement c:
c) The discovery of the Higgs boson is the start of a major programme of work to measure this particle’s properties with the highest possible precision for testing the validity of the Standard Model and to search for further new physics at the energy frontier. The LHC is in a unique position to pursue this programme. Europe’s top priority should be the exploitation of the full potential of the LHC, including the high-luminosity upgrade of the machine and detectors with a view to collecting ten times more data than in the initial design, by around 2030. This upgrade programme will also provide further exciting opportunities for the study of flavour physics and the quark-gluon plasma.
October 1, 2013, HL-LHC Workshop

Data: Outlook for HL-LHC
[Figure: bar chart of RAW data volume in PB (axis 0 to 450) for ALICE, ATLAS, CMS and LHCb, for Run 1, Run 2, Run 3 and Run 4]
• Very rough estimate of new RAW data per year of running, using a simple extrapolation of current data volume scaled by the output rates.
• To be added: derived data (ESD, AOD), simulation, user data…
Predrag Buncic, October 3, 2013, ECFA Workshop Aix-Les-Bains

1 – Long Tail of Papers
2 – New Theoretical Insights
3 – “Discovery” to “Precision”
[Figure: items 4 and 5 appear without recoverable labels; credits shown: Zimmermann; Alain Blondel, TLEP design study, r-ECFA 2013-07-20]
Volume: 100PB + ~50PB/year (+400PB/year from 2020)

1. DPHEP Portal
2. Digital library tools (Invenio) & services (CDS, INSPIRE, ZENODO) + related tools (HepData, RIVET, …)
3. Sustainable software, coupled with advanced virtualization techniques, “snap-shotting” and validation frameworks
4. Proven bit preservation at the 100PB scale, together with a sustainable funding model with an outlook to 2040/50
5. Open Data (“Open everything”)

DSS
Case B) increasing archive growth
Start with 10PB, then +50PB/year, then +50% every 3y (or +15% / year)
Internet Services, CERN IT Department, CH-1211 Genève 23, Switzerland, www.cern.ch/it

Case B) increasing archive growth (continued; figures not reproduced)
Total cost: ~59.9M$ (~2M$ / year)

Summary
1. DPHEP portal: build in collaboration with other disciplines, including RDA IG and the APA…
2. Digital libraries: continue existing collaborations
3. Sustainable “bit preservation”: certified repositories as part of EINFRA-1-2014
4. “Knowledge capture & preservation”: BIG CHALLENGE, not addressed in a multi-disciplinary way: next funding round?
5. Open “Big Data”: a Big Opportunity (for RDA?)

Portal Example # 1

Portal Platform – Zenodo?

Documentation projects with INSPIRE
> Internal notes from all HERA experiments now available on INSPIRE
  • Experiments no longer need to provide dedicated hardware for such things
  • Password protected now, simple to make publicly available in the future
> The ingestion of other documents is under discussion, including theses, preliminary results, conference talks and proceedings, paper drafts, ...
  • More experiments working with INSPIRE, including CDF, D0 as well as BaBar
David South | Data Preservation and Long Term Analysis in HEP | CHEP 2012, May 21-25 2012

LEP
Cost would be “now” … Completely different, of course …
• Direct resource cost is already compatible with zero for LEP experiments
• Total ALEPH DATA + MC (analysis format) = 30 TB
• ALEPH: Shift50 = 320 CernUnit.
  One of today’s pizza boxes largely exceeds this
• CDF data: O(10 PB), bought today for <400kEur
• CDF CPU ~ 1MSi2k = 4 kHS06 = 40kEur
Here the main problem is knowledge / support, clearly
• Can you trust a “NP peak” 10 years later, when experts are gone?
• ALEPH reproducibility test (M.Maggi, by NO means a DP solution): ~0.5 FTE for 3 months

Open Data?

Costs and Scale
• There are 4 (main) collaborations + detectors at the LHC: the largest has 3000 members
• The annual cost of WLCG (infrastructure, operations, services) is ~EUR100M
• The CERN database services cost around 2MCHF per year for materials (licenses, maintenance, hardware) and 2MCHF for personnel
• The central grid Experiment Integration Support team varied between 4 and 10 people, plus significant effort at sites and within experiments
• The DPHEP Full Costs of Curation workshop concluded that a team of ~4 people, with access to experts, could “make significant progress” (be careful with this number!)

Conclusions
• Long-term data preservation is a journey, not a destination
• As such, it is best not to venture out alone
• A clear understanding of costs & benefits is necessary to secure funding
• We are eager to share our knowledge and experience (exa-scale “bit preservation”)
• We have learned a lot through collaboration via the APA, and are keen to learn more in the future
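
Back-of-envelope check of the scale
The volume figures quoted in this talk (100PB today, ~50PB/year, +400PB/year from 2020, ~5EB by 2030) can be cross-checked with a few lines of arithmetic. The sketch below is only an illustration, not a model used by the speaker: the function name, the choice of 2014 as the starting year, and the two readings of "+400PB/year" (replacing the 50PB/year rate, or added on top of it) are all assumptions.

```python
def project_archive_pb(start_year, end_year, start_pb, yearly_pb,
                       step_year, step_yearly_pb, additive=False):
    """Return {year: archive size in PB at the start of that year}.

    Hypothetical helper: growth is linear at `yearly_pb` per year until
    `step_year`; from `step_year` onwards the yearly rate is either
    `step_yearly_pb` (additive=False) or `yearly_pb + step_yearly_pb`
    (additive=True). Both readings of the slide are assumptions.
    """
    volume = start_pb
    sizes = {start_year: volume}
    for year in range(start_year, end_year):
        if year >= step_year:
            rate = yearly_pb + step_yearly_pb if additive else step_yearly_pb
        else:
            rate = yearly_pb
        volume += rate          # data written during `year`
        sizes[year + 1] = volume
    return sizes


# Figures from the talk: 100PB now (taken as 2014), ~50PB/year,
# 400PB/year from 2020, horizon 2030.
replacing = project_archive_pb(2014, 2030, 100.0, 50.0, 2020, 400.0)
on_top = project_archive_pb(2014, 2030, 100.0, 50.0, 2020, 400.0, additive=True)

print(f"2030, 400PB/year replaces 50PB/year: {replacing[2030] / 1000:.1f} EB")
print(f"2030, 400PB/year on top of 50PB/year: {on_top[2030] / 1000:.1f} EB")
```

Under either reading the result lands between 4 and 5 EB, which is in line with the "~5EB by 2030" quoted in the Background; derived data, simulation and user data are not included here.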