Department of Computer Science and Engineering

A First Look at
Reproducibility and
Non-Determinism in
CMS Software and ROOT Data
Peter Ivie, Charles Zheng, and Douglas Thain
University of Notre Dame
CHEP 2016
What we usually work on:
HTCondor
Lobster Data Analysis
Global Filesystems
Parrot + CVMFS
Parrot
CVMFS Driver
meta
data
squid
squid
proxy
squid
proxy
proxy
HTTP GET
Content
Addressable
Storage
www
server
data
meta
data
data
data
data
data
CAS Cache
CMS
So ware
967 GB
31M files
h p://cernvm.cern.ch/portal/filesystem
Today, a different question:
What happens if we attempt to run
the same thing twice?
Build CAS
HTTP GET
CMS
Task
Reproducibility in Scientific Computing
DASPOS Project
www.daspos.org
Different Layers of Reproducibility
(1) Can someone else follow simple instructions
and get the same software environment?
– Requires careful dependency management.
(2) Can I run the exact same code on the same
computer twice and get the same result?
– Requires deterministic computing techniques.
...
(99) Can someone else construct equivalent
expts and observe the same physical laws?
?
ATLAS
vs
CMS
– Difficult to do this without steps 1-98 working.
Some Reasons for Non-Determinism
• Provenance data intermixed with output.
• Unique identifiers embedded in objects.
• Pseudo-random numbers seeded by
environmental data (pid,uid,time)
• Entropic randomness (/dev/urandom)
• Parallelism results in re-ordering of data.
• I/O interactions with OS not consistent.
• OS/Compiler details affect ordering/precision
of floating point operations.
A very preliminary look…
An initial experiment.
CVMFS + SCRAM for consistent environment
LHE
SEED
A1
GEN
SIM
B1
DIGI
RECO
C1
Mini
AOD
D1
Same machine, same day, same user, same shell, . . .
LHE
A2
GEN
SIM
B2
DIGI
RECO
C2
Mini
AOD
Not one of the output files was the same!
D2
What constitutes the same output?
The same bytes?
(take a checksum)
The same data?
(ignore metadata)
The same physics?
(fit to a model)
0000000
0000020
0000040
0000060
0000100
0000120
0000140
0000160
0000200
5089
0000
00d6
a0a0
c0ee
80c0
d4ff
331a
b3b3
474e
8002
0100
00ff
0040
30ff
2aa5
3333
c0b3
0a0d
0000
5032
0000
c8c8
8060
ff2a
4d4d
c0c0
0a1a
e001
544c
00c0
4100
008b
00ff
664d
cccc
0000
0308
ff45
8000
e169
4000
e040
6666
e5cc
0000000
0000020
0000040
0000060
0000100
0000120
0000140
0000160
0000200
5089
0000
00d6
a0a0
c0ee
80c0
d4ff
331a
b3b3
474e
8002
0100
00ff
0040
30ff
2aa5
3333
c0b3
0a0d
0000
5032
0000
c8c8
8060
ff2a
4d4d
c0c0
0a1a
e001
544c
00c0
4100
008b
00ff
664d
cccc
0000
0308
ff45
8000
e169
4000
e040
6666
e5cc
Structure of a Typical ROOT File
TStorageFile
UniqueID
CycleID
Timestamp
TTree
TBranch
TBucket
UniqueID
Data
Data
…
TBranch
TBucket
UniqueID
Data
Data
…
TBucket
UniqueID
Data
Data
…
TBucket
UniqueID
Data
Data
…
TFile
UniqueID
CycleID
ROOT
File #1
TTree
Timestamp
TBranch
TBucket
UniqueI
D
Data
TFile
UniqueID
CycleID
Timestamp
TBranch
TBucket
UniqueI
D
Data
TBucket
UniqueI
D
Data
ROOT
File #2
TTree
TBranch
TBucket
UniqueI
D
Data
TBucket
UniqueI
D
Data
TBranch
TBucket
UniqueI
D
Data
TBucket
UniqueI
D
Data
TBucket
UniqueI
D
Data
ROOT-Diff Tool
1. Different structure.
2. Same structure, different content.
3. Same structure, same content, different metadata.
4. Same bytes everywhere.
Now, what’s different?
Stage
LHE
GENSIM
DIGI – 1
DIGI – 2
DIGI – 3
MINIAOD
# Objects
605
717
1171
6563
204
2723
Ignored
9
9
9
9
21
9
Not Equal
Structural
Equal
Content
Equal
4 592 580
1 707 694
1 1161 1158
21 6533 6526
7 176 175
4 2710 2671
Physics data is not the same.
Bitwise
Equal
0
0
0
0
0
0
What is the cost of root-diff?
Running time for comparing various size of root files
root-diff
md5sum
sha1sum
Time(Seconds)
100
10
For files < RAM, root-diff is faster
than standard checksums.
1
0
1000
2000
For files > RAM, root-diff is
non-sequential
I/O.
3000 slower,
4000due to
5000
6000
7000
Size(MB)
Can we eliminate or prevent some
kinds of non-determinism?
Parrot Modifies Syscalls
Use Parrot to mount remote filesystems
without requiring root access.
Unix
App
Trap
Syscalls
% parrot_run bash
Parrot
LibCVMFS
Redirect
Elsewhere
CVMFS
Global
Filesystem
% cd /cvmfs/cms.cern.ch
% source cmsset_default.sh
% scram
...
http://ccl.cse.nd.edu/software/parrot
Use Parrot to Eliminate
Some Sources of Non-Determinism
parrot_run --time-warp
--uid 0
--hostname thematrix
--mount /dev/urandom /some/fixed/data
% whoami
root
% uname
thematrix
% date –utc
Mon Jan 1 00:00:05 UTC 2001
http://ccl.cse.nd.edu/software/parrot
Does it work? Almost!
“Time Warp” makes LHE produce
bitwise-equal results.
(But only for <121 events.)
Results of a First Look
• ROOT-DIFF is a general algorithm that can determine
equivalence of ROOT files, ignoring the incidental metadata
inherent to ROOT.
• CMS codes (at first look) do not produce identical physics
data on consecutive runs, when compared with ROOT-DIFF.
• “Time Warp” mode causes LHE to give bitwise equal results
for a small number of events, therefore the code is sensitive
to OS effects, even with a constant seed.
• We have not fully isolated the factors that result in
non-deterministic behavior!
• Going forward:
– Push towards one good deterministic example to be shared.
– Clearly identify all conventional sources of non-determinism.
– Evaluate effect of increasing concurrency on determinism.
Example code here:
https://github.com/cooperative-computing-lab
/root-diff
/repro-example-chep-2016
Questions and Discussion Welcome
Peter Ivie
[email protected]
Charles Zheng
[email protected]
Douglas Thain
[email protected]
@ProfThain