Data Preservation Challenges at CMS and LHC, and The DASPOS Project
Mike Hildreth
representing the DASPOS Team
Mike Hildreth, 1 May, 2017
DASPOS
Data And Software Preservation for Open Science
multi-disciplinary effort
Notre Dame, Chicago, UIUC, Washington, Nebraska, NYU, (Fermilab, BNL)
Links HEP effort (DPHEP+experiments) to Biology, Astrophysics,
Digital Curation, and other disciplines
includes physicists, digital librarians, computer scientists
aim to achieve some commonality across disciplines in
meta-data descriptions of archived data
What’s in the data, how can it be used?
computational description (ontology/metadata development)
how was the data processed?
can computation replication be automated?
impact of access policies on preservation infrastructure
DASPOS
In parallel, will build test technical infrastructure to implement a
knowledge preservation system
“Scouting party” to figure out where the most pressing
problems lie, and some solutions
incorporate input from multi-disciplinary dialogue, use-case definitions, and policy discussions
Will translate needs of analysts into a technical
implementation of meta-data specification
Will develop means of specifying processing steps and the
requirements of external infrastructure (databases, etc.)
Will implement “physics query” infrastructure across a small-scale distributed network
End result: “template architecture” for
data/software/knowledge preservation systems
DASPOS Overview
• Digital Librarian Expertise
  • How to catalogue and share data
  • How to curate and archive large digital collections
• Computer Science Expertise
  • How to build databases and query infrastructure
  • How to develop distributed storage networks
• Science Expertise
  • What does the data mean?
  • How was it processed?
  • How will it be re-used?
DASPOS Process
• Multi-pronged approach for individual topics
  • NYU/Nebraska: RECAST and other developments
  • UIUC/Chicago: workflows, containers
  • ND: metadata, containers, workflows, environment specification
• Shared validation & examples
• Workshops & all-hands meetings
• Shared collaboration with CERN, DPHEP
• Outreach to other disciplines
Prototype Architecture
[Architecture diagram:
• Inspire ↔ Preservation Archive (metadata; container images)
• “Containerizer Tools” (PTU, Parrot scripts) — used to capture processes; deliverables stored in the DASPOS git store
• Container Cluster — test bed capable of running containerized processes
• Data Archive — metadata; workflow images; instructions to reproduce; data(?)
• Tools — run containers/workflows; discovery/exploration; unpack/analyze
• Data paths and domain-specific metadata links connect the components
• Open questions: policy & curation, access policies, public archives?]
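The capture → archive → re-run loop implied by the diagram can be illustrated with a toy content-addressed archive. Everything here (`Archive`, `deposit`, `find`) is invented for illustration; in the actual prototype PTU/Parrot and Docker do the capture and the DASPOS git store holds the deliverables.

```python
import hashlib

class Archive:
    """Toy preservation archive: content-addressed blobs plus metadata."""
    def __init__(self):
        self.blobs = {}
        self.metadata = {}

    def deposit(self, image_bytes, meta):
        # Name the captured image by a hash of its contents, so identical
        # captures deduplicate and references never go stale.
        digest = hashlib.sha256(image_bytes).hexdigest()[:12]
        self.blobs[digest] = image_bytes
        self.metadata[digest] = meta
        return digest

    def find(self, **query):
        # Discovery/exploration: match stored metadata against a query.
        return [d for d, m in self.metadata.items()
                if all(m.get(k) == v for k, v in query.items())]

archive = Archive()
digest = archive.deposit(b"<container image>",
                         {"experiment": "CMS", "tool": "PTU"})
print(archive.find(experiment="CMS"))  # returns the digest just deposited
```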
Prototype Architecture: Status
[Same architecture diagram, annotated with status: components range from roughly done (“~ Done”) through “Under development” to “Not done”.]
Infrastructure I: Environment Capture
Umbrella
Umbrella specifies a reproducible environment while avoiding duplication and enabling precise adjustments.
[Diagram: three variants of one job share components drawn from an online data archive —
• “Run the experiment”: input1 + Mysim 3.1 + RedHat 6.1 + Linux 83
• “Same thing, but use different input data”: input2 + Mysim 3.1 + RedHat 6.1 + Linux 83
• “Same thing, but update the OS”: input2 + Mysim 3.1 + RedHat 6.2 + Linux 83
The archive holds the shared pieces: input1/input2, calib1/calib2, RedHat 6.1/6.2, Linux 83/84, Mysim 3.1/3.2.]
Umbrella
Current version of Umbrella can work with:
• Docker – create container, mount volumes
• Parrot – download tarballs, mount at runtime
• Amazon – allocate VM, copy and unpack tarballs
• Condor – request compatible machine
• Open Science Framework – deploy uploaded containers

Example Umbrella apps:
• Povray ray-tracing application: http://dx.doi.org/doi:10.7274/R0BZ63ZT
• OpenMalaria simulation: http://dx.doi.org/doi:10.7274/R03F4MH3
• CMS high energy physics simulation: http://dx.doi.org/doi:10.7274/R0765C7T
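An Umbrella specification is a JSON document that names each layer of the environment separately, which is what makes the “same thing, but change one piece” variants cheap. A minimal sketch — the field names only approximate published Umbrella examples, so check the Umbrella documentation for the exact schema:

```python
import json

# Illustrative Umbrella-style specification; field names are approximate.
spec = {
    "comment": "Mysim 3.1 on RedHat 6.1 with input1",
    "hardware": {"arch": "x86_64", "cores": "1", "memory": "2GB"},
    "kernel": {"name": "linux", "version": ">=2.6.32"},
    "os": {"name": "redhat", "version": "6.1"},
    "software": {"mysim-3.1": {"mountpoint": "/opt/mysim"}},
    "data": {"input1": {"mountpoint": "/data/input1"}},
    "environ": {"PATH": "/opt/mysim/bin:/usr/bin:/bin"},
}

# "Same thing, but update the OS": edit one stanza instead of
# rebuilding the whole environment.
spec_new_os = {**spec, "os": {"name": "redhat", "version": "6.2"}}
print(json.dumps(spec_new_os["os"]))
```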
Infrastructure II: Workflow Capture
PRUNE
PRUNE connects together precisely reproducible executions and gives each item a unique identifier.
[Diagram: the invocation
    output1 = sim( input1, calib1 ) IN ENV myenv1.json
is stored in content-addressed form as
    Bab598 = fffda7( 3ba8c2, 64c2fa ) IN ENV c8c832
with the named components — inputs, outputs, calibrations, environments, and OS/simulation versions (RedHat 6.1/6.2, Linux 83/84, Mysim 3.1/3.2) — held in an online data archive.]
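The translation from named files to opaque identifiers can be sketched in a few lines. This is a toy version of the content-addressed naming shown above; `content_id` and the 6-character truncated hashes are illustrative, not PRUNE's actual scheme.

```python
import hashlib

def content_id(data: bytes) -> str:
    """Toy content hash: every artifact is named by its bytes, not its filename."""
    return hashlib.sha256(data).hexdigest()[:6]

sim_id   = content_id(b"<sim executable>")
input_id = content_id(b"<input1 contents>")
calib_id = content_id(b"<calib1 contents>")
env_id   = content_id(b"<myenv1.json contents>")

# "output1 = sim( input1, calib1 ) IN ENV myenv1.json" becomes:
invocation = f"{sim_id}( {input_id}, {calib_id} ) IN ENV {env_id}"
print(invocation)

# Because the identifier depends only on content, a repeated execution
# with identical inputs can be recognized and reused, not recomputed.
```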
PRUNE
• Works across multiple workflow repositories
• Is interfaced with Umbrella for environment specification on multiple platforms
• Result: reproducible, flexible workflow preservation
Infrastructure III: Metadata
• HEP Data Model Workshop (“VoCamp15ND”)
  • Participants from HEP, Libraries, & Ontology Community* (*new collaborations for DASPOS)
• Define preliminary Data Models for CERN Analysis Portal
  • describe:
    • main high-level elements of an analysis
    • main research objects
    • main processing workflows and products
    • main outcomes of the research process
  • re-use components of developed formal ontologies
    • PROV, Computational Observation Pattern, HEP Taxonomy, etc.
• Patterns implemented in JSON-LD format for use in CERN Analysis Portal
  • will enable discovery, cross-linking of analysis descriptions
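A hedged sketch of a JSON-LD record of the kind these patterns target — an analysis described as a PROV activity with linked research objects. The vocabulary terms and identifiers below are illustrative, not the actual CAP schema:

```python
import json

# Illustrative JSON-LD: the @context maps short names to vocabulary IRIs,
# so records from different analyses can be discovered and cross-linked.
record = {
    "@context": {
        "prov": "http://www.w3.org/ns/prov#",
        "title": "http://purl.org/dc/terms/title",
    },
    "@id": "urn:example:analysis/some-search",      # hypothetical identifier
    "@type": "prov:Activity",
    "title": "Example analysis description",
    "prov:used": [{"@id": "urn:example:dataset/aod-2012"}],
}
print(json.dumps(record, indent=2))
```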
Detector Final State Description
• Published paper at the “International Conference on Knowledge Engineering and Knowledge Management” (http://ekaw2016.cs.unibo.it)
• Extraction (https://github.com/gordonwatts/HEPOntologyParserExperiments) of test data sets from CMS and ATLAS publications to examine pattern usability and the ability to facilitate data access across experiments
Computational Activity
• Continued testing and validation of the Computational Activity and Computational Environment patterns (https://github.com/Vocamp/ComputationalActivity)
• Work on aligning the pattern with other vocabularies for software annotation and attribution, including GitHub and the Mozilla Science-led “Code as a research object” effort (https://github.com/codemeta/codemeta)
Overall Metadata work structure
Integration of patterns into a knowledge-flow system that captures provenance and reproducibility information from a computational perspective, as well as links to “higher level” metadata descriptions of the data in terms of physics vocabularies.
Technology I: Containers
• Tools like chroot and Docker sandbox the execution of an application
• Offer the ability to convert an application to a container/image
• Virtualize only the essential functions of the compute-node environment, allowing the local system to provide the rest
  • much faster computation
  • becoming the preferred solution over VMs for many computing environments

[Diagram: App A and App B, each with its own bins/libs, share a Docker Engine running on the host OS and server.]

Comparison of execution time for an ATLAS application using PTU (packaged environment, redirecting system calls) or a Docker container:
  Native execution:                        49m02s
  PTU capture:                             122m53s
  PTU re-run:                              114m05s
  Native execution in container (Docker):  58m40s
Technology I: Containers
Portability = Preservation!
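Re-running a preserved application then reduces to launching its image with the right mounts. A minimal sketch: the image name, mount paths, and script below are hypothetical, and the command is only constructed, not executed, so the sketch runs without Docker installed.

```python
# Build (but do not run) a docker invocation for a preserved application.
# Image and path names are invented for illustration.
def docker_command(image, workdir, script):
    return ["docker", "run", "--rm",
            "-v", f"{workdir}:/work",   # mount the job directory
            "-w", "/work",              # start inside it
            image, "/bin/sh", "-c", script]

cmd = docker_command("daspos/atlas-analysis:1.0", "/tmp/job", "./run.sh")
print(" ".join(cmd))
# To execute for real: subprocess.run(cmd, check=True)
```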
Technology II: Smart Containers
Smart Containers
What is it? In theory:
1. Put data, metadata, and provenance in the “same world”
2. Enhance data by linking to other things
[Diagram: DATA, METADATA, and PROVENANCE linked together]
Smart Containers: Searching for a Container
In practice:
1. You: execute a search
2. Machine: searches the knowledge graph of available containers and returns matches
3. You: select the one you want (“I’d like this one!”)
4. Machine: identifies dependencies, pulls together any additional containers you need, and runs your selection

Behind this sit: an API to write metadata; metadata storage and standardization; specification of data location; machine-readable labels added to containers; and things linked together into a knowledge graph.
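The search-then-resolve flow above can be sketched with a toy knowledge graph of labeled containers. Container names, labels, and the dependency scheme are all invented for illustration:

```python
# Toy knowledge graph: each container carries machine-readable labels
# and a list of containers it depends on.
graph = {
    "sim-container":      {"labels": {"task": "simulation"},  "needs": ["calib-container"]},
    "calib-container":    {"labels": {"task": "calibration"}, "needs": []},
    "analysis-container": {"labels": {"task": "analysis"},    "needs": ["sim-container"]},
}

def search(task):
    """Step 1-2: match a query against container labels."""
    return [name for name, node in graph.items()
            if node["labels"].get("task") == task]

def resolve(name, seen=None):
    """Step 4: pull in the selected container plus everything it depends on,
    dependencies first."""
    if seen is None:
        seen = []
    for dep in graph[name]["needs"]:
        resolve(dep, seen)
    if name not in seen:
        seen.append(name)
    return seen

print(search("analysis"))              # → ['analysis-container']
print(resolve("analysis-container"))   # → ['calib-container', 'sim-container', 'analysis-container']
```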
Containers Workshop
RECAST
[Diagram: “Analysis” — data and new models fed through the preserved workflow]
• Preserved workflows can be used to compare new models with a
published analysis
• Reinterpretation possible with full detector simulation, analysis chain
• “Folding” rather than “unfolding” as in HEPData
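“Folding” can be illustrated in a few lines: apply a preserved detector response to a new model's truth-level prediction and compare with the published detector-level measurement, instead of unfolding the data back to truth level. The numbers below are invented for illustration.

```python
# Truth-level prediction of a hypothetical new model, events per bin.
truth_new_model = [100.0, 80.0, 40.0]

# Preserved detector response matrix: response[i][j] is the probability
# that a truth-bin-j event is reconstructed in bin i (made-up values).
response = [
    [0.8, 0.1, 0.0],
    [0.1, 0.7, 0.1],
    [0.0, 0.1, 0.6],
]

def fold(truth, resp):
    """Fold the truth prediction through the detector response."""
    n = len(truth)
    return [sum(resp[i][j] * truth[j] for j in range(n)) for i in range(n)]

reco_prediction = fold(truth_new_model, response)
# Compare reco_prediction directly with the published detector-level yields.
print(reco_prediction)
```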
CERN Analysis Portal & REANA
REANA architecture
(credit: @suenjedt, @tiborsimko)
Workflow Preservation
• JSON specification of workflow; individual processing steps:
  • packtivity bundles an executable (docker container), environment, and executable description
    • working on implementation of the step description with Umbrella
    • can either create containers for submission or run on a separate back-end
  • yadage captures how the pieces fit together into a parametrized workflow
    • allows for re-use of a stored processing chain, component by component
• much of the original infrastructure was developed by Lukas Heinrich
REANA Workflows
• Workflows as graphs: nodes and edges; templates give good composability and modularity.
• Workflow schematic, as stored in CAP:
[Workflow graph: parameters ([seeds], [kAww], [kHww], [kHzz], [kAzz], [nevents]) feed prepare → [param_card] → grid → [gridpack]; then, per seed, a subchain runs madevent → [lhefile] → pythia → [hepmcfile] → delphes → [delphesoutput] → analysis → [analysis_output]; a final rootmerge combines the outputs into [mergedfile].]
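A packtivity-style step bundles an environment, a process, and a publisher, which yadage then wires into the graph above. A hedged sketch whose field names follow the general shape of packtivity/yadage specifications — the image name and command line are illustrative, not a real step definition:

```python
# Illustrative packtivity-style step: environment + process + publisher.
step = {
    "environment": {"environment_type": "docker-encapsulated",
                    "image": "example/madgraph",   # hypothetical image
                    "imagetag": "1.0"},
    "process": {"process_type": "string-interpolated-cmd",
                "cmd": "madevent --seed {seed} --out {lhefile}"},  # illustrative flags
    "publisher": {"publisher_type": "frompar-pub",
                  "outputmap": {"lhefile": "lhefile"}},
}

def render(step, pars):
    """Interpolate workflow parameters into the step's command, as the
    workflow engine would before dispatching the container."""
    return step["process"]["cmd"].format(**pars)

cmd = render(step, {"seed": 42, "lhefile": "/work/out.lhe"})
print(cmd)  # → madevent --seed 42 --out /work/out.lhe
```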
Next Steps: DASPOS 2.0?
• Another scouting expedition?
Our goal is ultimately to change how science is done in a computing context so that it has greater integrity and productivity. We have developed some prototype techniques (in DASPOS 1) that improve the expression and archival of artifacts. Going forward, we want to study how the systematic application of these techniques can enable new, higher-level scientific reasoning about a very large, multidisciplinary body of work. For this to have impact, we will develop small communities of practice that will apply these techniques using the archives and tools relevant to their discipline.
Another way to phrase this: study and prototype the kinds of knowledge preservation tools that might make doing science easier and would enable broader, better science.
Preservation Tools, Techniques, and
Policies IG: Initial Meeting
Co-Chairs: Mike Hildreth (Notre Dame),
Ruth Duerr (Ronin Inst.) + ?
Tools are Key!
• This Interest Group is focused on bridging the gap
between researchers and archives
[Diagram: Tools bridging the Researcher/Data Generator and the Archivist/Data Scientist]
References
• Douglas Thain, Peter Ivie, and Haiyan Meng, “Techniques for Preserving Scientific Software Executions: Preserve the Mess or Encourage Cleanliness?”, 12th International Conference on Digital Preservation (iPres), November 2015. DOI: 10.7274/R0CZ353M

Umbrella:
• Haiyan Meng and Douglas Thain, “Umbrella: A Portable Environment Creator for Reproducible Computing on Clusters, Clouds, and Grids”, Workshop on Virtualization Technologies in Distributed Computing (VTDC) at HPDC, June 2015. DOI: 10.1145/2755979.2755982
• Haiyan Meng, Rupa Kommineni, Quan Pham, Robert Gardner, Tanu Malik, and Douglas Thain, “An Invariant Framework for Conducting Reproducible Computational Science”, Journal of Computational Science, April 2015. DOI: 10.1016/j.jocs.2015.04.012

And the Parrot packaging work as well:
• Haiyan Meng, Matthias Wolf, Peter Ivie, Anna Woodard, Michael Hildreth, and Douglas Thain, “A Case Study in Preserving a High Energy Physics Application with Parrot”, Journal of Physics: Conference Series (CHEP 2015), December 2015.

RECAST demo:
• https://recast-demo.cern.ch/

Metadata work:
• K. Janowicz, P. Hitzler, B. Adams, D. Kolas, and C. Vardeman II (2014), “Five Stars of Linked Data Vocabulary Use”, Semantic Web Journal 5 (3), 17376.
• Charles Vardeman II, Adila Krisnadhi, Michelle Cheatham, Krzysztof Janowicz, Holly Ferguson, Pascal Hitzler, and Aimee P. C. Buccellato (2015), “An Ontology Design Pattern and Its Use Case for Modeling Material Transformation”, Semantic Web Journal, to appear. http://www.semantic-web-journal.net/system/files/swj1303.pdf