LHC Computing Models
Commissione I
31/1/2005
Francesco Forti, Pisa
Referee group:
Forti (chair), Belforte, Menasce, Simone,
Taiuti, Ferrari, Morandin, Zoccoli
Outline
 Comparative analysis of the computing
models (little about Alice)
 Referee comments
 Roadmap: what’s next
Disclaimer: it is hard to digest and summarize the available
information. Advance apologies for errors and omissions.
A Little Perspective
 In 2001 the Hoffmann Review was conducted to quantify
the resources needed for LHC computing
 Documented in CERN/LHCC/2001-004
 As a result the LHC Computing Grid project was
launched to start building up the needed capacity and
competence and provide a prototype for the experiments
to use.
 In 2004 the experiments ran Data Challenges to verify
their ability to simulate, process and analyze their data
 In Dec 2004 the Computing Model documents were
submitted to the LHCC, which reviewed them on Jan 17-18, 2005
 The Computing TDRs and the LCG TDR are expected
this spring/summer.
Running assumptions
 2007 Luminosity 0.5×10^33 cm^-2 s^-1
 2008-9 Luminosity 2×10^33 cm^-2 s^-1
 2010 Luminosity 1×10^34 cm^-2 s^-1
 but trigger rate is independent of luminosity
 7 months run pp = 10^7 s (real time 1.8×10^7 s)
 1 month run AA = 10^6 s (real time 2.6×10^6 s)
 4 months shutdown
Data Formats
 Names differ, but the concepts are similar:
 RAW data
 Reconstructed event (ESD, RECO, DST)
 Tracks with associated hits, Calorimetry objects, Missing
energy, trigger at all levels, …
 Can be used to refit, but not to do pattern recognition
 Analysis Object Data (AOD, rDST)
 Tracks, particles, vertices, trigger
 Main source for physics analysis
 TAG
 Number of vertices, tracks of various types, trigger, etc.
 Enough information to select events, but otherwise very
compact.
General strategy
 Similar general strategy for the models:
 Tier 0 at CERN:
 1st pass processing in quasi-real time after rapid calibration
 RAW data storage
 Tier 1s (6 for Alice, CMS, LHCb; 10 for Atlas):
 Reprocessing; Centrally organized analysis activities
 Copy of RAW data; some ESD; all AOD; some SIMU
 Tier 2s (14-30)
 User analysis (chaotic analysis); Simulation
 Some AOD depending on user needs
Event sizes
Parameter                        Unit        ALICE p-p  ALICE Pb-Pb  ATLAS  CMS    LHCb
Nb of assumed Tier1 not at CERN              6                       10     6      6
Nb of assumed Tier2 not at CERN                                      30     25     14
Event recording rate             Hz          100        100          200    150    2000
RAW Event size                   MB          1          12.5         1.6    1.5    0.025
REC/ESD Event size               MB          0.2        2.5          0.5    0.25   0.075
AOD Event size                   kB          50         250          100    50     57
TAG Event size                   kB          10         10           1      10     N/A
Running time per year            M seconds   10         1            10     10     10
Events/year                      Giga        1          0.1          2      1.5    20
Storage for real data            PB          -          -            -      -      -
RAW SIM Event size               MB          0.4        300          2      2      0.4
REC/ESD SIM Event size           MB          0.04       2.1          0.5    0.4    -
Events SIM/year                  Giga        1          0.01         0.2    1.5    4
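As a back-of-envelope cross-check of the table above, the annual RAW data volume follows directly from the trigger rate, the running time and the event size. The sketch below is illustrative only: the input values are taken from the table as reconstructed here, and the simple product formula is an assumption of this note rather than a quotation from the experiments' documents.

```python
# Illustrative estimate of annual RAW data volume:
#   volume = trigger rate * running time * RAW event size.
# Values are those of the table above; compression, duplication and
# safety factors are not included.

MB = 1e6   # bytes
PB = 1e15  # bytes

experiments = {
    # name: (trigger rate [Hz], running time [s], RAW event size [MB])
    "ALICE p-p":   (100,  1e7, 1.0),
    "ALICE Pb-Pb": (100,  1e6, 12.5),
    "ATLAS":       (200,  1e7, 1.6),
    "CMS":         (150,  1e7, 1.5),
    "LHCb":        (2000, 1e7, 0.025),
}

for name, (rate_hz, t_run_s, raw_mb) in experiments.items():
    events = rate_hz * t_run_s                  # events per year
    raw_volume_pb = events * raw_mb * MB / PB   # one copy of RAW, per year
    print(f"{name:12s}  {events/1e9:5.2f} Gevents/yr  {raw_volume_pb:5.2f} PB RAW/yr")
```

For ATLAS this gives 2 Gevents and about 3.2 PB of RAW per year, consistent with the scale quoted in the computing model documents.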
First Pass Reconstruction
 Assumed to be in real time
 CPU power calculated to process data in 10^7 s.
 Fast calibration prior to reconstruction
 Disk buffer at T0 to hold events before
reconstruction
 Atlas: 5 days; CMS: 20 days; LHCb: ?
Parameter                    Unit        ALICE p-p  ALICE Pb-Pb  ATLAS  CMS  LHCb
Time to reconstruct 1 event  k SI2k sec  5.4        675          15     25   2.4
Time to simulate 1 event     k SI2k sec  35         15000        100    45   50
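To put the reconstruction times in context, a rough estimate of the first-pass CPU follows from the slide's rule that the CPU power is calculated to process one year of data in 10^7 s. The sketch below uses only the event counts and per-event times from the tables above; the experiments' own estimates add efficiency and safety factors that are deliberately left out here.

```python
# Rough first-pass reconstruction CPU estimate, following the rule
# "CPU power calculated to process data in 10^7 s":
#   CPU = (events per year) * (reco time per event) / 10^7 s.
# Event counts from the event-size table, reconstruction times from the
# table above; efficiency and safety factors are intentionally omitted.

T_PROCESS_S = 1e7  # time available to process one year of data [s]

experiments = {
    # name: (events per year [10^9], reco time per event [kSI2k*s])
    "ALICE p-p":   (1.0,  5.4),
    "ALICE Pb-Pb": (0.1,  675),
    "ATLAS":       (2.0,  15),
    "CMS":         (1.5,  25),
    "LHCb":        (20.0, 2.4),
}

for name, (gevents, t_reco_ksi2k_s) in experiments.items():
    cpu_msi2k = gevents * 1e9 * t_reco_ksi2k_s / T_PROCESS_S / 1000.0
    print(f"{name:12s}  ~{cpu_msi2k:5.1f} MSI2k for first-pass reconstruction")
```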
Streaming
 All experiments foresee RAW data streaming,
but with different approaches.
 CMS: O(50) streams based on trigger path
 Classification is immutable, defined by L1+HLT
 Atlas: 4 streams based on event types
 Primary physics, Express line, Calibration, Debugging
and diagnostic
 LHCb: >4 streams based on trigger category
 B-exclusive, Di-muon, D* Sample, B-inclusive
 Streams are not created in the first pass, but during
the “stripping” process
 It is not clear what the best/right solution is.
Probably bound to evolve in time.
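As an illustration of what "streaming by trigger classification" means in practice, the toy sketch below assigns an event to an output stream from the set of triggers it fired. The stream and trigger names are made up for the example and are not taken from any experiment's trigger menu; real menus (O(50) streams for CMS, 4 for ATLAS, more than 4 for LHCb) are far richer.

```python
# Toy illustration of RAW data streaming by trigger classification.
# Stream and trigger names are hypothetical, for illustration only.

def assign_stream(fired_triggers):
    """Return an output stream name given the set of trigger names an event fired."""
    if "calibration" in fired_triggers:
        return "calibration"
    if "express" in fired_triggers:
        return "express"
    if fired_triggers & {"single_muon", "dimuon"}:
        return "muon_stream"
    if fired_triggers & {"single_electron", "diphoton"}:
        return "egamma_stream"
    if fired_triggers:
        return "other_physics"
    return "debug"  # events with no recognized trigger decision

# A first-match policy avoids duplicating events across streams; how the
# streams should actually be defined is one of the open questions above.
print(assign_stream({"dimuon", "single_electron"}))  # -> "muon_stream"
```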
Data Storage and Distribution
 RAW data and the output of the 1st
reconstruction are stored on tape at the T0.
 Second copy of RAW shared among T1s.
 CMS and LHCb distribute reconstructed data
together (zipped) with RAW data.
 No navigation between files to access RAW.
 Space penalty, especially if RAW turns out to be
larger than expected
 Storing multiple versions of reconstructed data can
become inefficient
 Atlas distributes RAW immediately before reco
 T1s could do processing in case of T0 backlog
Data Storage and Distribution
 Number of copies of reco data varies
 Atlas assumes ESD have 2 copies at T1s
 CMS assumes a 10% duplication among T1s for
optimization reasons
 Each T1 is responsible for permanent archival of
its share of RAW and reconstructed data.
 When and how to throw away old versions of
reconstructed data is unclear
 All AOD are distributed to all T1s
 AOD are the primary source for data analysis
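Since every Tier-1 holds a full AOD copy, the disk taken by AOD alone scales with the number of Tier-1s. The sketch below is a hedged back-of-envelope based on the event-size table earlier in this note (one reconstruction version, no safety factors), shown for ATLAS and CMS only.

```python
# Illustrative estimate of the disk taken by AOD copies at the Tier-1s:
#   total = events/year * AOD size * number of Tier-1 copies.
# Sizes and Tier-1 counts are from the event-size table; a single
# reconstruction version and no safety factors are assumed here.

kB, PB = 1e3, 1e15  # bytes

experiments = {
    # name: (events per year [10^9], AOD size [kB], Tier-1 copies)
    "ATLAS": (2.0, 100, 10),
    "CMS":   (1.5, 50,  6),
}

for name, (gevents, aod_kb, n_t1) in experiments.items():
    one_copy_pb = gevents * 1e9 * aod_kb * kB / PB
    print(f"{name}: {one_copy_pb:.2f} PB per AOD copy, "
          f"{one_copy_pb * n_t1:.1f} PB across {n_t1} Tier-1s")
```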
Calibration
 Initial calibration is performed at the T0 on
a subset of the events
 It is then used in the first reconstruction
 Further calibration and alignment is
performed offline in the T1s
 Results are inserted in the conditions
database and distributed
 Plans are still very vague
 Atlas's plans are maybe a bit more defined
Reprocessing
 Data need to be reprocessed several
times because of:
 Improved software
 More accurate calibration and alignment
 Reprocessing mainly at T1 centers
 LHCb is planning on using the T0 during the
shutdown; it is not obvious that it will be available
 Number of passes per year: Alice 3, Atlas 2, CMS 2, LHCb 4
Analysis
 The analysis process is divided into:
 Organized and scheduled (by working groups)
 Often requires large data samples
 Performed at T1s
 User-initiated (chaotic)
 Normally on small, selected samples
 Largely unscheduled, with huge peaks
 Mainly performed at T2s
 Quantitatively very uncertain
Analysis data source
 Steady-state analysis will use mainly AOD-style
data, but…
 … initially access to RAW data in the analysis
phase may be needed.
 CMS and LHCb emphasize this need by storing
raw+reco (or raw+rDST) data together, in
streams defined by physics channel
 Atlas relies on Event Directories formed by
querying the TAG database to locate the events
in the ESD and in the RAW data files
Simulation
 Simulation is performed at T2 centers, dynamically
adapting the share of CPU with analysis
 Simulation data is stored at the corresponding T1
 Amount of simulation data planned varies:
Parameter        Unit  ALICE p-p  ALICE Pb-Pb  ATLAS  CMS   LHCb
Events/year      Giga  1          0.1          2      1.5   20
Events SIM/year  Giga  1          0.01         0.4    1.5   4
Ratio SIM/data   %     100%       10%          20%    100%  20%
 Dominated by CPU power
 100% may be too much; 10% may be too little
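A rough translation of these simulation volumes into CPU, using the per-event simulation times from the first-pass-reconstruction slide and assuming production is spread uniformly over a calendar year, supports the statement that the plan is dominated by CPU power. This is a hedged sketch: efficiency and safety factors are ignored.

```python
# Back-of-envelope CPU needed to produce the planned simulation samples,
# assuming production is spread uniformly over one calendar year.
# Event counts from the table above, per-event simulation times from the
# first-pass-reconstruction slide; efficiencies are ignored.

SECONDS_PER_YEAR = 3.15e7

experiments = {
    # name: (simulated events per year [10^9], sim time per event [kSI2k*s])
    "ALICE p-p":   (1.0,  35),
    "ALICE Pb-Pb": (0.01, 15000),
    "ATLAS":       (0.4,  100),
    "CMS":         (1.5,  45),
    "LHCb":        (4.0,  50),
}

for name, (gevents, t_sim_ksi2k_s) in experiments.items():
    cpu_msi2k = gevents * 1e9 * t_sim_ksi2k_s / SECONDS_PER_YEAR / 1000.0
    print(f"{name:12s}  ~{cpu_msi2k:5.1f} MSI2k of continuous simulation capacity")
```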
GRID
 The level of reliance/use of GRID middleware is
different for the 4 experiments:
 Alice: heavily relies on advanced, not yet available,
Grid functionality to store and retrieve data, and to
distribute CPU load among T1s and T2s
 Atlas: the Grid is built in the project, but basically
assuming stability of what is available now.
 CMS: designed to work without Grid, but will make
use of it if available.
 LHCb: flexibility to use the grid, but not strict
dependence on it.
Number of times the word "grid" appears in the computing model documents (all included):
           Alice  Atlas  CMS  LHCb
            49      9     65    1
@CERN
 Computing at CERN beyond the T0
 Atlas: “CERN Analysis Facility”
 but only for CERN-based people, not for the
collaboration
 CMS: T1 and T2 at CERN
 but T1 has no tape since T0 does the storing
 LHCb: unclear, explicit plan to use the event
filter farm during the shutdown periods
 Alice: doesn't need anything at CERN; the Grid
will supply the computing power.
Overall numbers
[Table: 2008 resource requirements from the 2005 plans (ATLAS with 20% simulation; ALICE for a standard year) compared with the 2001 Hoffmann Review (HR) estimates, for ATLAS, CMS, LHCb and ALICE. Rows cover Tier-0/CERN CPU (MSI2k), disk (PB) and tape (PB); Tier-1 and Tier-2 CPU, disk and tape; the totals; the increase factors now/HR for CPU, disk and tape; and the WAN bandwidth into and out of the Tier-0 and per Tier-1 (Gb/s); individual cell values omitted.]
Referee comments
 Sum of comments from the LHCC review and the Italian
referees
 We still need to interact with the experiments
 We will compile a list of questions after today’s
presentations
 We plan to hold four phone meetings next week to
discuss the answers
 Some are just things the experiments know they
need to do
 Stated here to reinforce them
LHCC Overall Comments
 The committee was very impressed with the quality of
the work that was presented. In some cases, the
computing models have evolved significantly from the
time of the Hoffmann review.
 In general there is a large increase in the amount of disk
space required. There is also an increase in overall
CPU power wrt the Hoffmann Review. The increase is
primarily at Tier-1's and Tier-2's. Also the number
of Tier-1 and Tier-2 centers has increased.
 The experiences from the recent data challenges have
provided a foundation for testing the validity of the
computing models. The tests are at this moment
incomplete. The upcoming data challenges and service
challenges are essential to test key features such as
data analysis and network reliability.
LHCC Overall Comments II
 The committee was concerned about the dependence on
precise scheduling required by some of the computing
models.
 The data analysis models in all 4 experiments are
essentially untested. The risk is that distributed user
analysis is not achievable on a large scale.
 Calibration schemes and use of conditions data have not
been tested. These are expected to have an impact of
only about 10% in resources but may impact the timing
and scheduling.
 The reliance on the complete functionality of GRID tools
varies from one experiment to another. There is some
risk that disk/cpu resource requirements will increase if
key GRID functionality is not used. There is also a risk
that additional manpower will be required for
development, operations and support.
LHCC Overall Comments III
 The contingency factors on processing times and RAW
data size vary among the experiments.
 The committee did not review the manpower
requirements required to operate these facilities.
 The committee did not review the costs. Will this be
done? It would be helpful if the costing could be
somewhat standardized across the experiments before it
is presented to the funding agencies.
 The committee listened to a presentation on networks for
the LHC. A comprehensive analysis of the peak
network demands for the 4 experiments combined is
recommended (see below.)
LHCC Recommendations
 The committee recommends that the average and the peak
computing requirements of the 4 experiments be studied in more
detail. A month by month analysis of the CPU, disk, tape access
and network needs for all 4 experiments is required. A clear
statement on computing resources required to support HI running in
CMS and ATLAS is also required. Can the peak demands during
the shutdown period be reduced/smoothed?
 Plans for distributed analysis during the initial period should be
worked out.
 Dependence of the computing model on raw event size,
reconstruction time, etc. should be addressed for each experiment.
 Details of the ramp up (2006-2008) should be determined and a
plan for the evolution of required resources should be worked out.
 A complete accounting of the offline computing resources required
at CERN is needed from (2006-2010). In addition to production
demands, the resource planning for calibration, monitoring, analysis
and code testing and development should be included - even though
the resources may seem small.
 The committee supports the requests for Tier-1/Tier-2 functionality at
CERN. This planning should be refined for the 4 experiments.
LHCC Conclusions
 Aside from issues of peak capacity, the
committee is reasonably certain that
the computing models presented are
robust enough to handle the demands
of LHC production computing during
early running (through 2010.) There is
a concern about the validity of the data
analysis components of the models.
Additional comments from
INFN Referees
 Basic parameters such as event size and
reconstruction CPU time have very large
uncertainties
 Study the dependence of the computing models on
these key parameters and determine what the
brick-wall limits are
 Data formats are not well defined
 Some are better than others
 Need to verify that the proposed formats are good for
real-life analysis. For example:
 can you do event display on AODs ?
 can you run an alignment systematic study on ESDs ?
Additional Comments II
 Many more people need to try and do analysis
with the existing software and provide feedback
 Calibration and conditions database access have
not been sufficiently defined and can represent
bottlenecks
 No cost-benefit analysis has been performed so
far
 Basically the numbers are what the experiments
would like to have
 No optimization done yet on the basis of the available
resources
 In particular: amount of disk buffers; duplication of
data; reuse of tapes
Additional Comments III
 Are the models flexible enough?
 Given the large unknowns, will the models be able to
cope with large changes in the parameters? For
example:
 assuming all reconstructed data is on disk may drive the
experiments (and the funding agencies) into a cost brick-wall
if the size is larger than expected, or effectively limit the data
acquisition rate.
 evolution after 2008 is not fully charted and understood. Is
there enough flexibility to cope with a resource limited world?
 Are the models too flexible?
 Assuming the grid will optimize things for you (Alice)
may be too optimistic
 Buffers and safety factors aimed at flexibility are
sometimes large and not fully justified
Additional Comments IV
 The bandwidth is crucial
 Peaks in T0→T1 traffic need to be understood
 The required bandwidth has not been fully
evaluated, especially at lower levels and for
“reverse” flow
 T1↔T1, T2→T1 (e.g. MC data produced at T2)
 Incoming at CERN (not T0) of reprocessed data
and MC
 Need to compile tables with the same
safety factors assumed
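To make the point concrete, a steady-state estimate of the Tier-0 export bandwidth can be put together from the earlier tables. This is a hedged sketch for ATLAS and CMS only: it assumes one RAW copy plus the reconstructed output are shipped to the Tier-1s in real time, and it ignores peaks, catch-up after downtime, reprocessing traffic and safety factors, which is exactly what still needs to be evaluated.

```python
# Hedged steady-state estimate of Tier-0 -> Tier-1 export bandwidth:
#   bandwidth = trigger rate * (RAW + REC/ESD size shipped per event).
# Peaks, catch-up, reprocessing and safety factors are deliberately
# ignored; sizes and rates are from the event-size table.

MB_TO_GBIT = 8 / 1000.0  # MB/s -> Gb/s

experiments = {
    # name: (trigger rate [Hz], RAW size [MB], REC/ESD size [MB])
    "ATLAS": (200, 1.6, 0.5),
    "CMS":   (150, 1.5, 0.25),
}

for name, (rate_hz, raw_mb, rec_mb) in experiments.items():
    throughput_mb_s = rate_hz * (raw_mb + rec_mb)
    print(f"{name:6s} ~{throughput_mb_s:5.0f} MB/s "
          f"= {throughput_mb_s * MB_TO_GBIT:3.1f} Gb/s sustained out of the Tier-0")
```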
Specific comments on experiments
 Coming from LHCC review
 Not fully digested and not yet integrated by
INFN referees
 Useful to collect them here for future
reference
 Some duplication unavoidable. Your
patience is appreciated.
ATLAS I
 Impressed by overall level of thought and planning which
have gone into the overall computing model so far.
 In general fairly specific and detailed
 Welcome thought being given to the process of and
support for detector calibration and conditions database.
 needs more work
 looking forward to the DC3 and LCG Service Challenge results
 An accurate, rapid calibration on 10% of data is crucial for the
model
ATLAS II
 Concern about the evidence basis and experience
with several aspects of the computing model
 large reduction factor assumed in event size and processing
time, not really justified
 data size and processing time variation with background and
increasing luminosity
 lead to large (acknowledged but somewhat hidden)
uncertainties in estimates
 Data size and number of copies, particularly for the
ESD, have significant impact on the total costs.
 We note that these are larger for Atlas than for other
experiments.
 Also very large number of copies of the AOD
 Depend critically on analysis patterns which are poorly
understood at this time and require a fair amount of
resources
ATLAS III
 Concern about the lack of practical experience with the
distributed analysis model
 especially if AOD are not the main data source at the beginning
 need resources to develop the managerial software needed to
handle the distributed environment (based on Grid MW), for
example if Tier1s need to help in case of backlog at Tier0
 Need to include HI physics in the planning.
 Availability of computing resources during the shutdown should
not be taken for granted.
 Real time data processing introduces a factor 2 extra
resource requirement for reconstruction.
 It is not clear that this assumption is justified/valid, cf. the ability to
keep up with data taking on average.
 The ATLAS TAG model is yet to be verified in practice.
 We are unclear exactly how it will work.
 Primary interface for physicists, need iterations to get it right.
ATLAS IV
 Monte Carlo
 Agree that assumption of 20% fully reconstructed Monte Carlo is
a risk and a larger number would be better/safer.
 Trigger rates
 We note that the total cost of computing scales with trigger rates.
This is clearly a knob that can be turned.
 The CERN Analysis Facility is more a mixture of a Tier-1
and Tier-2
 No doubt Atlas needs computing at CERN for calibration and
analysis
CMS I
 Uncertainty of factor ~2 on many numbers taken as input
to the model
 c.f. ATLAS assumptions
 Event size: 0.3 MB in MC, inflated to 1.5 MB
 factor 2.5 for conservative thresholds/zero suppression at startup
 Safety factor of 2 in the Tier-0 RECO resources should be made
explicit
 Should we try to use same factor for all four experiments?
 Fully simulated Monte Carlo
 100% of real data rate seems like a reasonable goal
 but so would 50% (Atlas assumes 20%)
 Heavy Ion
 Need a factor of 10 improvement in RECO speed wrt current
performance
 Ratio of CPU to IO means that this is possibly best done at Tier-2 sites!
CMS II
 Use of "CMS" Tier-0 resources during 4-month
shutdown?
 Maybe needed for CMS and/or ALICE heavy ion RECO
 Re-RECO of CMS pp data on Tier-0 may not be affordable?
 We find clear justification for a sizable CERN-based
“analysis” facility
 Especially for detector-related (time critical) activities
 monitoring, calibration, alignment
 Is distinction between Tier-1 and Tier-2 at CERN useful?
 c.f. ATLAS
CMS III
 CMS attempts to minimize reliance on some of the currently least
mature aspects of the Grid
 e.g., global data catalogues, resource brokers, distributed analysis
 Streaming by RECO physics objects
 Specific streams placed at specific Tier-1 sites
 RECO+RAW (FEVT full event) is the basic format for first year or two
 Conservative approach, but in our view not unreasonably so
 Some potential concerns:
 More difficult to balance load across all Tier-1s
 Politics: which Tier-1s get the most sexy streams?
 Analysis at Tier-1 restricted largely to organized production activities
 AOD production, dataset skimming, calibration/alignment jobs?
 except perhaps for one or two "special" T1s
CMS IV
 Specific baseline presented, but
 A lot of thought has gone into considering alternatives
 Model has some flexibility to respond to real life
 Presented detailed resources for 2008
 Needs for 2007 covered by need to ramp up for 2008
 No significant scalability problems apparent for future
growth
 The bottom line:
 Assumptions and calculation of needed resources
seem reasonable
 Within overall uncertainty of perhaps a factor ~2?
LHCb I
•LHCb presented a computing model based on a significantly
revised DAQ plan, with a planned output of 2 kHz
•The committee did not try to evaluate the merit of the new data
collection strategy, but tried to assess whether computing resources
seem appropriate given the new strategy.
•It’s notable that computing resources required for new plan are
similar (within 50% except for disk) to those in the Hoffmann report
even though event rate is increased by an order of magnitude,
largely because of reduction in simulation requirements in new
plan.
The committee was impressed by the level of planning that has gone into the
LHCb computing model, and by the clarity and detail of the presentations.
In general, the committee believes that LHCb presented a well reasoned plan with
appropriate resources for their proposed computing model.
LHCb II
Time variation of resource requirements. In the LHCb computing plan as presented,
the peak cpu and network needs exceed the average by a factor of 2. This variation
must be considered together with expected resource use patterns of other experiments.
LHCb (and others) should consider scenarios to smooth out peaks in resource
requirements.
Monte Carlo. Even in the new plan, Monte Carlo production still consumes more
than 50% of cpu resources. Any improvement in performance of MC or reduction in
MC requirements would therefore have a significant impact on cpu needs.
The group’s current MC estimates, while difficult to justify in detail,
seem reasonable for planning.
Event size. The committee was concerned about the LHCb computing model’s
reliance on the small expected event size (25 kB). The main concern is I/O during
reconstruction and stripping. LHCb believe that a factor of 2 larger event size
would still be manageable.
rDST size. The rDST size has almost as large an impact on computing
resources as the raw event size. The committee recommends that LHCb
develop an implementation of the rDST as soon as possible to understand
whether the goal of 50kB (including raw) can be achieved.
LHCb III
Event reconstruction and stripping strategy. The multi-year plan of event
reconstruction and stripping seems reasonable, although 4 strippings per
year may be ambitious. If more than 4 streams are written, there may be
additional
storage requirements.
User analysis strategy. The committee was concerned about the use of Tier
1 centers as the primary user analysis facility. Are Tier 1 centers prepared
to provide this level of individual user support? Will LHCb’s planned
analysis activities interfere with Tier 1 production activities?
Calibration. Although it is not likely to have a large impact on computing plans,
we recommend that details of the calibration plan be worked out as soon as
possible.
Data challenges. Future data challenges should include detector calibration and
user analysis to validate those parts of the computing model.
Safety factors. We note that LHCb has included no explicit safety factors (other
than prescribed efficiency factors) in computing needs given their model. This
issue should be addressed in a uniform way among the experiments.
The Grid and the experiments
 Use of Grid functionality will be crucial for the
success of LHC computing.
 Experiments in general and the Italian
community in particular need to ramp up their
use of LCG in the data challenges
 Verify the models
 Feedback to developers
 Strong interaction between experiments and
LCG team mandatory to match requirements
and implementation
 Cannot accommodate large overheads due to lack
of optimization of resource usage.
Conclusion and Roadmap
 These computing models are one step on the
way to LHC computing
 Very good outcome, in general specific and concrete
 Some interaction and refinement in the upcoming
months
 In the course of 2005:
 Computing TDRs of the experiments.
 Memorandum of understanding for the computing
resources for LCG phase II.
 Specific planning for CNAF and Tier2s in Italy.
 Expect to start building up the capacity in 2006.