A First Look at Reproducibility and Non-Determinism in CMS Software and ROOT Data Peter Ivie, Charles Zheng, and Douglas Thain University of Notre Dame CHEP 2016 What we usually work on: HTCondor Lobster Data Analysis Global Filesystems Parrot + CVMFS Parrot CVMFS Driver meta data squid squid proxy squid proxy proxy HTTP GET Content Addressable Storage www server data meta data data data data data CAS Cache CMS So ware 967 GB 31M files h p://cernvm.cern.ch/portal/filesystem Today, a different question: What happens if we attempt to run the same thing twice? Build CAS HTTP GET CMS Task Reproducibility in Scientific Computing DASPOS Project www.daspos.org Different Layers of Reproducibility (1) Can someone else follow simple instructions and get the same software environment? – Requires careful dependency management. (2) Can I run the exact same code on the same computer twice and get the same result? – Requires deterministic computing techniques. ... (99) Can someone else construct equivalent expts and observe the same physical laws? ? ATLAS vs CMS – Difficult to do this without steps 1-98 working. Some Reasons for Non-Determinism • Provenance data intermixed with output. • Unique identifiers embedded in objects. • Pseudo-random numbers seeded by environmental data (pid,uid,time) • Entropic randomness (/dev/urandom) • Parallelism results in re-ordering of data. • I/O interactions with OS not consistent. • OS/Compiler details affect ordering/precision of floating point operations. A very preliminary look… An initial experiment. CVMFS + SCRAM for consistent environment LHE SEED A1 GEN SIM B1 DIGI RECO C1 Mini AOD D1 Same machine, same day, same user, same shell, . . . LHE A2 GEN SIM B2 DIGI RECO C2 Mini AOD Not one of the output files was the same! D2 What constitutes the same output? The same bytes? (take a checksum) The same data? (ignore metadata) The same physics? (fit to a model) 0000000 0000020 0000040 0000060 0000100 0000120 0000140 0000160 0000200 5089 0000 00d6 a0a0 c0ee 80c0 d4ff 331a b3b3 474e 8002 0100 00ff 0040 30ff 2aa5 3333 c0b3 0a0d 0000 5032 0000 c8c8 8060 ff2a 4d4d c0c0 0a1a e001 544c 00c0 4100 008b 00ff 664d cccc 0000 0308 ff45 8000 e169 4000 e040 6666 e5cc 0000000 0000020 0000040 0000060 0000100 0000120 0000140 0000160 0000200 5089 0000 00d6 a0a0 c0ee 80c0 d4ff 331a b3b3 474e 8002 0100 00ff 0040 30ff 2aa5 3333 c0b3 0a0d 0000 5032 0000 c8c8 8060 ff2a 4d4d c0c0 0a1a e001 544c 00c0 4100 008b 00ff 664d cccc 0000 0308 ff45 8000 e169 4000 e040 6666 e5cc Structure of a Typical ROOT File TStorageFile UniqueID CycleID Timestamp TTree TBranch TBucket UniqueID Data Data … TBranch TBucket UniqueID Data Data … TBucket UniqueID Data Data … TBucket UniqueID Data Data … TFile UniqueID CycleID ROOT File #1 TTree Timestamp TBranch TBucket UniqueI D Data TFile UniqueID CycleID Timestamp TBranch TBucket UniqueI D Data TBucket UniqueI D Data ROOT File #2 TTree TBranch TBucket UniqueI D Data TBucket UniqueI D Data TBranch TBucket UniqueI D Data TBucket UniqueI D Data TBucket UniqueI D Data ROOT-Diff Tool 1. Different structure. 2. Same structure, different content. 3. Same structure, same content, different metadata. 4. Same bytes everywhere. Now, what’s different? Stage LHE GENSIM DIGI – 1 DIGI – 2 DIGI – 3 MINIAOD # Objects 605 717 1171 6563 204 2723 Ignored 9 9 9 9 21 9 Not Equal Structural Equal Content Equal 4 592 580 1 707 694 1 1161 1158 21 6533 6526 7 176 175 4 2710 2671 Physics data is not the same. Bitwise Equal 0 0 0 0 0 0 What is the cost of root-diff? Running time for comparing various size of root files root-diff md5sum sha1sum Time(Seconds) 100 10 For files < RAM, root-diff is faster than standard checksums. 1 0 1000 2000 For files > RAM, root-diff is non-sequential I/O. 3000 slower, 4000due to 5000 6000 7000 Size(MB) Can we eliminate or prevent some kinds of non-determinism? Parrot Modifies Syscalls Use Parrot to mount remote filesystems without requiring root access. Unix App Trap Syscalls % parrot_run bash Parrot LibCVMFS Redirect Elsewhere CVMFS Global Filesystem % cd /cvmfs/cms.cern.ch % source cmsset_default.sh % scram ... http://ccl.cse.nd.edu/software/parrot Use Parrot to Eliminate Some Sources of Non-Determinism parrot_run --time-warp --uid 0 --hostname thematrix --mount /dev/urandom /some/fixed/data % whoami root % uname thematrix % date –utc Mon Jan 1 00:00:05 UTC 2001 http://ccl.cse.nd.edu/software/parrot Does it work? Almost! “Time Warp” makes LHE produce bitwise-equal results. (But only for <121 events.) Results of a First Look • ROOT-DIFF is a general algorithm that can determine equivalence of ROOT files, ignoring the incidental metadata inherent to ROOT. • CMS codes (at first look) do not produce identical physics data on consecutive runs, when compared with ROOT-DIFF. • “Time Warp” mode causes LHE to give bitwise equal results for a small number of events, therefore the code is sensitive to OS effects, even with a constant seed. • We have not fully isolated the factors that result in non-deterministic behavior! • Going forward: – Push towards one good deterministic example to be shared. – Clearly identify all conventional sources of non-determinism. – Evaluate effect of increasing concurrency on determinism. Example code here: https://github.com/cooperative-computing-lab /root-diff /repro-example-chep-2016 Questions and Discussion Welcome Peter Ivie [email protected] Charles Zheng [email protected] Douglas Thain [email protected] @ProfThain
© Copyright 2026 Paperzz