The DASPOS Project
Mike Hildreth, representing the DASPOS Team
1 May, 2017

DASPOS
• Data And Software Preservation for Open Science
• Multi-disciplinary effort: Notre Dame, Chicago, UIUC, Washington, Nebraska, NYU, (Fermilab, BNL)
• Links the HEP effort (DPHEP + experiments) to Biology, Astrophysics, Digital Curation, and other disciplines
• Includes physicists, digital librarians, computer scientists
• Aims to achieve some commonality across disciplines in the metadata descriptions of archived data
  • What's in the data? How can it be used?
• Computational description (ontology/metadata development)
  • How was the data processed? Can computation replication be automated?
• Impact of access policies on preservation infrastructure

DASPOS
• In parallel, will build a test technical infrastructure to implement a knowledge preservation system
• A "scouting party" to figure out where the most pressing problems lie, and some solutions
  • Incorporates input from multi-disciplinary dialogue, use-case definitions, policy discussions
• Will translate the needs of analysts into a technical implementation of a metadata specification
• Will develop means of specifying processing steps and the requirements of external infrastructure (databases, etc.)
• Will implement "physics query" infrastructure across a small-scale distributed network
• End result: a "template architecture" for data/software/knowledge preservation systems

DASPOS Overview
• Digital Librarian Expertise
  • How to catalogue and share data
  • How to curate and archive large digital collections
• Science Expertise
  • What does the data mean?
  • How was it processed?
  • How will it be re-used?
• Computer Science Expertise
  • How to build databases and query infrastructure
  • How to develop distributed storage networks

DASPOS Process
• Multi-pronged approach for individual topics
  • NYU/Nebraska: RECAST and other developments
  • UIUC/Chicago: workflows, containers
  • ND: metadata, containers, workflows, environment specification
• Shared validation & examples
• Workshops & all-hands meetings
• Shared collaboration with CERN, DPHEP
• Outreach to other disciplines

Prototype Architecture
• "Containerizer" tools feed a preservation archive (Inspire), a data archive, and a container cluster
• Test bed: capable of running containerized processes
• PTU, Parrot scripts: used to capture processes
  • Deliverable: metadata and container images, stored in the DASPOS git store
• Data archive: metadata, workflow images, instructions to reproduce, data(?)
• Tools: run containers/workflows; discovery/exploration; unpack/analyze
• Data path and domain-specific metadata links connect the components
• Open questions: policy & curation, access policies, public archives?

Prototype Architecture: Status
• Same diagram, with each component marked "~ done", "under development", or "not done"
• Policy & curation, access policies, and public archives remain open questions
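The capture-and-archive flow in the prototype architecture (containerizer → metadata + image → archive) can be sketched as a minimal metadata-record builder. This is purely illustrative: the field names and the SHA-256 image identifier are assumptions for the sketch, not the actual DASPOS git-store format.

```python
import hashlib
import json
from datetime import datetime, timezone

def archive_record(image_bytes: bytes, command: str, inputs: list[str]) -> dict:
    """Build a metadata record for a captured container image.

    The archive stores the image plus enough metadata to re-run it:
    a content hash (so the image can be verified on retrieval),
    the command that was captured, and the input files it read.
    """
    return {
        "image_sha256": hashlib.sha256(image_bytes).hexdigest(),
        "image_size": len(image_bytes),
        "command": command,
        "inputs": inputs,
        "captured_at": datetime.now(timezone.utc).isoformat(),
    }

# A toy "image": in practice this would be a Docker image or a PTU package.
record = archive_record(b"fake-image-bytes", "mysim --events 1000", ["input1.dat"])
print(json.dumps(record, indent=2))
```

Hashing the stored image gives the archive a stable, content-derived identifier, so a later re-run can verify that it retrieved exactly the bytes that were captured.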
Infrastructure I: Environment Capture

Umbrella
• Umbrella specifies a reproducible environment while avoiding duplication and enabling precise adjustments
• Example scenarios, each a small delta on a shared specification:
  • Run the experiment (input1, Mysim 3.1, RedHat 6.1, Linux 83)
  • Same thing, but use different input data (input2)
  • Same thing, but update the OS (RedHat 6.2)
• Common components (OS images, software, inputs, calibrations) are fetched from an online data archive rather than duplicated

Umbrella
• The current version of Umbrella can work with:
  • Docker: create container, mount volumes
  • Parrot: download tarballs, mount at runtime
  • Amazon: allocate VM, copy and unpack tarballs
  • Condor: request a compatible machine
  • Open Science Framework: deploy uploaded containers
• Example Umbrella apps:
  • Povray ray-tracing application: http://dx.doi.org/doi:10.7274/R0BZ63ZT
  • OpenMalaria simulation: http://dx.doi.org/doi:10.7274/R03F4MH3
  • CMS high energy physics simulation: http://dx.doi.org/doi:10.7274/R0765C7T

Infrastructure II: Workflow Capture

PRUNE
• PRUNE connects together precisely reproducible executions and gives each item a unique identifier
  • Human-readable form: output1 = sim( input1, calib1 ) IN ENV myenv1.json
  • Identifier form: Bab598 = fffda7( 3ba8c2, 64c2fa ) IN ENV c8c832
• Environments, inputs, and calibrations resolve to items in an online data archive

PRUNE
• Works across multiple workflow repositories
• Is interfaced with Umbrella for environment specification on multiple platforms
• Result: reproducible, flexible workflow preservation

Infrastructure III: Metadata
• HEP Data Model Workshop ("VoCamp15ND")
  • Participants from HEP, Libraries, & the Ontology Community (new collaborations for DASPOS)
• Define preliminary data models for the CERN Analysis Portal, describing:
  • the main high-level elements of an analysis
  • the main research objects
  • the main processing workflows and products
  • the main outcomes of the research process
• Re-use components of developed formal ontologies: PROV, Computational Observation Pattern, HEP Taxonomy, etc.
• Patterns implemented in JSON-LD format for use in the CERN Analysis Portal
  • Will enable discovery and cross-linking of analysis descriptions

Detector Final State Description
• Published a paper at the International Conference on Knowledge Engineering and Knowledge Management (http://ekaw2016.cs.unibo.it)
• Extraction of test data sets from CMS and ATLAS publications (https://github.com/gordonwatts/HEPOntologyParserExperiments) to examine pattern usability and the ability to facilitate data access across experiments

Computational Activity
• Continued testing and validation of the Computational Activity and Computational Environment patterns (https://github.com/Vocamp/ComputationalActivity)
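The JSON-LD encoding of an activity pattern can be illustrated with a small, self-contained sketch. The `ca:` prefix, the placeholder IRI, and the property layout below are hypothetical stand-ins rather than the actual Vocamp/CAP vocabulary; they only show how a processing step, its inputs, and its products link together in JSON-LD.

```python
import json

# A hedged sketch of a JSON-LD description of one processing step.
# "used" and "generated" are mapped onto the real W3C PROV properties;
# the "ca:" namespace and the ComputationalActivity type mirror the
# Vocamp pattern in spirit only.
activity = {
    "@context": {
        "prov": "http://www.w3.org/ns/prov#",
        "ca": "http://example.org/computational-activity#",  # placeholder IRI
        "used": {"@id": "prov:used", "@type": "@id"},
        "generated": {"@id": "prov:generated", "@type": "@id"},
    },
    "@id": "ca:run-42",
    "@type": "ca:ComputationalActivity",
    "used": ["ca:input1", "ca:calib1"],
    "generated": ["ca:output1"],
    "ca:environment": "ca:myenv1",
}

doc = json.dumps(activity, indent=2)
print(doc)
```

Because the document is plain JSON with an `@context`, a generic JSON-LD processor can expand the short terms to full IRIs, which is what makes cross-linking of analysis descriptions machine-actionable.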
• Work on aligning the pattern with other vocabularies for software annotation and attribution, including the GitHub- and Mozilla Science-led "Code as a research object" effort (https://github.com/codemeta/codemeta)

Overall Metadata Work Structure
• Integration of the patterns into a knowledge flow system that captures provenance and reproducibility information from a computational perspective, as well as links to "higher level" metadata descriptions of the data in terms of physics vocabularies

Technology I: Containers
• Tools like chroot and Docker sandbox the execution of an application
• Offer the ability to convert an application into a container/image
• Virtualize only the essential functions of the compute-node environment, allowing the local system to provide the rest
  • Much faster computation
  • Becoming the preferred solution over VMs for many computing environments
• Comparison of execution time for an ATLAS application using PTU (packaged environment, redirecting system calls) or a Docker container:
  • Native execution time: 49m02s
  • PTU capture time: 122m53s
  • PTU re-run time: 114m05s
  • Native execution in a container [Docker]: 58m40s
• Portability = preservation!

Technology II: Smart Containers
• What is it? In theory:
  1. Put data, metadata, and provenance in the "same world"
  2. Enhance the data by linking it to other things
• Searching for a container, in practice:
  1. You: execute a search
  2. Machine: searches a knowledge graph of available containers and returns matches
  3. You: select the one you want ("I'd like this one!")
  4. Machine: identifies dependencies, pulls together any additional containers you need, and runs your selection
• Components: an API to write metadata; metadata storage and standardization; specification of data location; machine-readable labels; linking things together into a knowledge graph

Containers Workshop

RECAST
• "Analysis" = data + workflow, confronted with new models
• Preserved workflows can be used to compare new models with a published analysis
• Reinterpretation is possible with the full detector simulation and analysis chain
• "Folding" rather than "unfolding", as in HEPData

CERN Analysis Portal & REANA
• REANA architecture (@suenjedt, @tiborsimko)

Workflow Preservation
• JSON specification of workflow
• Individual processing steps:
  • packtivity bundles an executable (Docker container), an environment, and an executable description
  • Working on an implementation of the step description with Umbrella
  • Can either create containers for submission or run on a separate back-end
• yadage captures how the pieces fit together into a parametrized workflow
  • Allows for re-use of a stored processing chain, component by component
• Much of the original infrastructure was developed by Lukas Heinrich

REANA Workflows
• Workflows are graphs (nodes, edges); templates give good composability and modularity
• Workflow schematic, as stored in CAP: prepare → grid (gridpack) → madevent → pythia → delphes → analysis → rootmerge, parametrized by seeds, couplings (kAww, kHww, kHzz, kAzz), and the number of events, with parallel subchains merged at the rootmerge step

Next Steps: DASPOS 2.0?
• Another scouting expedition? Our goal is ultimately to change how science is done in a computing context so that it has greater integrity and productivity. We have developed some prototype techniques (in DASPOS 1) that improve the expression and archival of artifacts.
Going forward, we want to study how the systematic application of these techniques can enable new, higher-level scientific reasoning about a very large (multi-disciplinary) body of work. For this to have impact, we will develop small communities of practice that will apply these techniques using the archives and tools relevant to their discipline. Another way to phrase this: to study and prototype the kinds of knowledge preservation tools that might make doing science easier and would enable broader, better science.

Preservation Tools, Techniques, and Policies IG: Initial Meeting
• Co-chairs: Mike Hildreth (Notre Dame), Ruth Duerr (Ronin Inst.) + ?
• Tools are key! This Interest Group is focused on bridging the gap between researchers and archives: tools connect the researcher/data generator with the archivist/data scientist

References
• Douglas Thain, Peter Ivie, and Haiyan Meng, "Techniques for Preserving Scientific Software Executions: Preserve the Mess or Encourage Cleanliness?", 12th International Conference on Digital Preservation (iPres), November 2015. DOI: 10.7274/R0CZ353M

Umbrella:
• Haiyan Meng and Douglas Thain, "Umbrella: A Portable Environment Creator for Reproducible Computing on Clusters, Clouds, and Grids", Workshop on Virtualization Technologies in Distributed Computing (VTDC) at HPDC, June 2015. DOI: 10.1145/2755979.2755982
• Haiyan Meng, Rupa Kommineni, Quan Pham, Robert Gardner, Tanu Malik, and Douglas Thain, "An Invariant Framework for Conducting Reproducible Computational Science", Journal of Computational Science, April 2015. DOI: 10.1016/j.jocs.2015.04.012

And the Parrot packaging work as well:
• Haiyan Meng, Matthias Wolf, Peter Ivie, Anna Woodard, Michael Hildreth, and Douglas Thain, "A Case Study in Preserving a High Energy Physics Application with Parrot", Journal of Physics: Conference Series (CHEP 2015), December 2015.

RECAST demo:
• https://recast-demo.cern.ch/

Metadata work:
• K. Janowicz, P. Hitzler, B. Adams, D. Kolas, and C. Vardeman II, "Five Stars of Linked Data Vocabulary Use", Semantic Web Journal, 5 (3), 173-176, 2014.
• Charles Vardeman II, Adila Krisnadhi, Michelle Cheatham, Krzysztof Janowicz, Holly Ferguson, Pascal Hitzler, and Aimee P. C. Buccellato, "An Ontology Design Pattern and Its Use Case for Modeling Material Transformation", Semantic Web Journal, to appear. http://www.semantic-web-journal.net/system/files/swj1303.pdf