Monitoring ATLAS computing performance

IO PERFORMANCE
Lessons
DISCLAIMER
 I won’t be mentioning good things here.
 In hindsight things look obvious
 No plan survives the first data intact
DESIGN PROBLEMS
 T/P separation is way too complex for the average HEP physicist turned programmer
 This led to a copy/paste approach
 Even good documentation can’t help that
 Code bloat
 Very difficult to remove obsolete persistent classes/converters
 Needed tools were added late: custom compressions, DataPool; only now working on error matrix compression
 No unit test infrastructure
 Should have had a way to create a “full” object
 Should have forced a loopback t→p→t round trip (see the sketch after this list)
 No central place to control what’s written out
 No tools for automatic code generation
 Fabrizio spent 3 months just fixing part of the Trigger classes
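As an illustration of the missing pieces above, a minimal sketch of a loopback t→p→t unit test. The class and converter names (TrackParticleCnv_p1 and the dict-based "object") are hypothetical placeholders, not the real Athena T/P API; the point is only that every converter should be exercised in both directions on a fully populated object.

# Hypothetical loopback test: transient -> persistent -> transient must be lossless.
# "TrackParticleCnv_p1" and the dict-based "object" are placeholders, not real ATLAS classes.
import unittest

class TrackParticleCnv_p1:
    """Stand-in for a T/P converter with the usual two directions."""
    def transToPers(self, trans):
        return {"pt": trans["pt"], "eta": trans["eta"], "phi": trans["phi"]}
    def persToTrans(self, pers):
        return dict(pers)

def make_full_object():
    # "Should have had a way to create a full object": every member filled,
    # so a field forgotten in the converter shows up as a test failure.
    return {"pt": 42.0, "eta": 1.3, "phi": -0.7}

class LoopbackTest(unittest.TestCase):
    def test_t_p_t_roundtrip(self):
        cnv = TrackParticleCnv_p1()
        original = make_full_object()
        restored = cnv.persToTrans(cnv.transToPers(original))
        self.assertEqual(original, restored)

if __name__ == "__main__":
    unittest.main()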
MANAGEMENT PROBLEMS
 At least some performance tests should have been done before full system deployment, if only to understand what affects performance.
 Trying to understand, after the fact, changes in what was written out is not how things should go.
 Way too many tools (ara, mana, event loop …)
 No real support
 Code bloat
 Opportunity cost
 Still there is no single good tool
 that people would be happy to use
 that would give us the possibility to monitor and optimize
 Starting to think about analysis metadata storage and access one year after data taking started is a bit late.
MANAGEMENT PROBLEMS
 No recommended way to do analysis
 Waiting to see what people will be doing is not the best idea. People can’t know whether their approach will scale or not.
 We can’t test all approaches and surely can’t optimize sites for all of them
[Diagram: typical analysis workflow. Group production makes AOD → D3PDmaker (Grid) → D3PD/NTUPLE → skim/slim (Grid) → download → simple ROOT analysis or PROOF-based analysis (local); each step is marked Grid (G) or Local (L), and as CPU bound or IO bound]
DPD PROBLEMS
 Having thousands of simple variables in DPD files is just so … not OO.
 DPDs are expensive to produce
 Train-based production will alleviate the problem
 If train production does not use the tag DB, the tag DB should be dropped.
 Probably way too large, and difficult to overhaul. If we have problems finding out whether an AOD collection is used or not, the problem is 10 times bigger with DPDs.
 Too small
 Should/could be merged using the latest ROOT version (see the sketch after this list)
 Even worse with skimmed/slimmed ones. No simple grid-based merge tool?
 No single, generally used framework to use them
 In half a year it will be way too late to start thinking about all of this, as people will already be used to their hacked-together but working tools.
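Since no grid-based merge tool exists yet, here is a minimal local sketch of merging small DPD files with ROOT’s hadd (which uses TFileMerger underneath); the input directory and output file name are placeholders.

# Minimal sketch: merge many small DPD files into one larger file with hadd.
# The input directory and output file name are placeholders.
import glob
import subprocess

inputs = sorted(glob.glob("user.analysis.dpd/*.root"))
subprocess.check_call(["hadd", "-f", "merged_dpd.root"] + inputs)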
Full average, real time [s]:
  setup        44.99
  stagein     106.72
  stageout     34.1
  exec        951.04
  TOTAL      1136.85
CPU: 469.62   WALL: 614.2   EFF: 76%
Overhead: 185%   REAL EFF.: 41%
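A small sketch of how the derived numbers above appear to be computed, assuming EFF = CPU/WALL, REAL EFF. = CPU/TOTAL and Overhead = TOTAL/WALL (these reproduce the quoted 76%, 41% and 185%):

# Reproduce the derived efficiency numbers from the stage breakdown above.
setup, stagein, stageout, exec_time = 44.99, 106.72, 34.1, 951.04
cpu, wall = 469.62, 614.2

total = setup + stagein + stageout + exec_time   # 1136.85
print(f"TOTAL     {total:.2f} s")
print(f"EFF       {cpu / wall:.0%}")    # CPU / payload WALL, ~76%
print(f"REAL EFF. {cpu / total:.0%}")   # CPU / full job time, ~41%
print(f"Overhead  {total / wall:.0%}")  # full job time / payload WALL, ~185%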
DPD PROBLEMS
 Current situation
 local disk
 ROOT 5.28.00e
 A lot of space for improvement!

Egamma read tests (first column: % of data read):

                   %   Real time[s]  CPU time[s]  HDD reads  Transferred [MB]  HDD time [s]
Egamma std        100         25.91        21.11       5099               254           10
                   10         16.95        10.59       5431               254           12
                    1         13.51         7.95       4973               237           10
Egamma 30MB       100         25.69        20.64       2400               254           11
                   10         14.71         9.53       2399               254           11
                    1         13.94         7.97       4986               237           11
Egamma reordered  100         20.66        20.29       2052               251            3
                   10         11.29         9.64       2661               251            4
                    1         11.00         7.79       3047               236            7

 Now so much better that we are getting HDD-seek limited even when 100% of the data is read
 Improvements to come
 Proper basket size optimization
 Too many reads with TTC (see the TTreeCache sketch after the plots below)
 Two jobs, or one read/write job, bring CPU efficiency down to an unacceptable level
 Multi-tree TTC
 Calculations typically done in analysis jobs won’t hide disk latency on a 4 or 8 core CPU
 Needs better file organization
 Even reordering existing files would make sense
[Plots: CPU/Wall ratio (0.0–1.0) and read speed [MB/s] (0–14) vs. fraction of data read (100/10/1 %) for the Egamma, JetMEt, Susy and Photon samples]
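For reference, a minimal PyROOT sketch of switching on a 30 MB TTreeCache for a D3PD-style read, in the spirit of the “Egamma 30MB” row in the table above; file and tree names are placeholders.

# Minimal sketch: read a flat D3PD-style ntuple with a 30 MB TTreeCache.
# File and tree names below are placeholders.
import ROOT

f = ROOT.TFile.Open("egamma_d3pd.root")
tree = f.Get("physics")

tree.SetCacheSize(30 * 1024 * 1024)   # 30 MB TTreeCache
tree.AddBranchToCache("*", True)      # cache all branches (or only the ones you read)

for i in range(tree.GetEntries()):
    tree.GetEntry(i)
    # ... user analysis on the event ...

print("bytes read:", f.GetBytesRead(), "in", f.GetReadCalls(), "read calls")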
NO EFFICIENCY FEEDBACK
 Up to now we had no resource contention. That’s changing.
 People would not mind running faster and more efficiently, but have
no idea how good/bad they are and what to change.
 Will someone see the effect of not turning on TTC in a grid-based job? Not likely. Consequently they won’t turn it on.
 I don’t know how to optimally split a task. Do you?
 Can be changed relatively easily: once a week send a mail to everyone that used the grid, telling them what they consumed, how efficient they were, and what they can do to improve (see the sketch below).
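A rough sketch of such a weekly feedback mail, assuming per-user job summaries (CPU and wall seconds) are available from the grid accounting; the data source, field names and addresses are all hypothetical.

# Rough sketch of a weekly efficiency-feedback mail. The job summaries and
# addresses below are hypothetical placeholders for real grid accounting data.
import smtplib
from email.message import EmailMessage

jobs = [
    {"user": "alice@example.org", "cpu_s": 4.7e5, "wall_s": 1.1e6},
    {"user": "bob@example.org",   "cpu_s": 9.0e5, "wall_s": 1.0e6},
]

def send_report(job):
    eff = job["cpu_s"] / job["wall_s"]
    msg = EmailMessage()
    msg["Subject"] = "Your grid usage last week"
    msg["From"] = "grid-feedback@example.org"
    msg["To"] = job["user"]
    body = f"Wall time used: {job['wall_s']:.0f} s, CPU efficiency: {eff:.0%}.\n"
    if eff < 0.7:
        body += "Hint: enable TTreeCache and read only the branches you need.\n"
    msg.set_content(body)
    with smtplib.SMTP("localhost") as smtp:
        smtp.send_message(msg)

for job in jobs:
    send_report(job)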
MY 2 CENTS
 If one core is not 100% used when reading a fully optimized file, why bother with multicore things?
 Due to the very low “real” information density in production DPDs, any analysis/slimming/skimming scheme is bound to be very inefficient.
 A user that has done at least one round of slim/skim can opt for a map-reduce approach in the form of PROOF.
 But PROOF is not an option at really large scale, i.e. on production DPDs.
THINKING BIG: SKIMSLIMSERVICE
 Organize large-scale map-reduce at a set of sites capable of keeping all of the production DPDs on disk.
 Has to return (register) the produced skimmed/slimmed dataset in under 5 min. Limit the size of the returned DS.
 That eliminates 50% of all grid jobs.
 Makes users produce 100-variable instead of 600-variable slims (see the sketch below).
 Relieves them of thinking about efficiency and gives a result in 5 min instead of 2 days.
 We don’t distribute DPDs.
 We do optimizations.
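A minimal sketch of the kind of skim/slim job such a service would run: keep only a handful of branches (slim) and only events passing a selection (skim). This is plain PyROOT with placeholder file, tree and branch names, not a description of the actual service.

# Minimal skim/slim sketch: keep only selected branches and only events
# passing a cut. All file, tree and branch names are placeholders.
import ROOT

fin = ROOT.TFile.Open("production_dpd.root")
tree = fin.Get("physics")

# Slim: deactivate everything, then re-enable the ~100 branches the user needs.
tree.SetBranchStatus("*", 0)
for name in ["el_pt", "el_eta", "el_phi", "RunNumber", "EventNumber"]:
    tree.SetBranchStatus(name, 1)

# Skim: copy only the events passing the (placeholder) selection.
fout = ROOT.TFile.Open("skimmed_slimmed.root", "RECREATE")
skimmed = tree.CopyTree("EventNumber % 2 == 0")
skimmed.Write()
fout.Close()
fin.Close()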