Workflow Refinement Provenance with PASOA • Using the PASOA model, we will treat Pegasus and each refinement stage as a separate actor, called in turn Original DAX Pegasus Reduction Refinement Site Selection Refinement Data Staging Refinement Registration Refinement Clustering Refinement align_warp align_warp align_warp align_warp reslice reslice reslice reslice softmean slicer slicer slicer convert convert convert Interactions • The actors have interactions, sending messages between themselves containing descriptions of workflows • The contents of interactions are represented as XML documents in a PASOA provenance store • The first interaction is the DAX file entering Pegasus • This is represented by reference with the following fragment <workflow url = “…”/> Interactions (2) • Every subsequent interaction is an addition to this original document, as refinements to the workflow are accumulated • We will describe the additions caused by each refinement type within the slides • Here is an example of a partially refined workflow representation: <workflow url = “…”> <removed job = “…”/> <siteSelection job = “…”> <logicalSite>…</logicalSite> <jobManager>…</jobManager> <runDirectory>…</runDirectory> <queueName>…</queueName> </siteSelection> </workflow> Relationships • In addition to interactions, the relationships between workflow nodes before and after refinement are documented in a provenance store • The relationships describe a graph with which to find the provenance of data produced by refined workflows Pegasus Refine Refine Refine Refine Refine Condor DAGMan data item Job Job Job Relationships (2) • In the relationships created by Pegasus passing a partially workflow from one refinement to the next, is a set of ‘identical to’ relationships between the output of one refinement step and the input of the next Pegasus • Each refinement also creates a set of relationships. • These are particular to each refinement and described within the slides Refine Refine Refine Refine Refine align_warp align_warp align_warp align_warp reslice reslice reslice reslice Partition 1 Workflow 2 The Workflow is Partitioned into 2 pieces, then Workflow 1 is refined and executed and Partition 2 then Workflow 2 is refined and executed Workflow 3, softmean slicer slicer slicer convert convert convert Provenance of Partitioning • We will not (for now) treat partitioning as a refinement actor but as something done manually, leading to interleaving of refinement and execution • We annotate just the first partition refinements in the following slides Pegasus Refine Refine Refine Refine Refine Condor Job Job Refine Refine Job Based on available data products Workflow 2 is reduced. In this case files ReslicedImage 1, ReslicedHeader 1, ReslicedImage 2, and ReslicedHeader 2 are found on Storage systems S1 and S2, so they do not need to be recomputed. align_warp align_warp reslice reslice Relationships for Reduction • Relationships are asserted between the nodes in the refined workflow that have not been reduced to the corresponding nodes in the original workflow • The relationship type is ‘identical to’ • No relationships are asserted to the removed nodes of the original workflow, thereby marking them as removed when looking at the whole provenance Pre-refinement workflow align_warp align_warp align_warp align_warp reslice reslice reslice reslice align_warp align_warp reslice reslice identical to Post-refinement workflow Interactions for Reduction • The post-refinement workflow has the same content as the pre-refinement workflow but with the following added for each removed job: <removed job = “…”/> Pegasus Post-refinement workflow Pre-refinement workflow Reduction Refinement Site Selection: the location where the workflow components are to be executed are selected. In our workflow we have the following resources available R1, R2 and R3 align_warp at R1 align_warp at R2 reslice at R1 reslice at R1 Relationships for Site Selection • Every node in the post-refinement workflow after site selection is related to the corresponding node in the prerefinement workflow with the ‘site selection refinement of’ relationship type Pre-refinement workflow align_warp align_warp reslice reslice align_warp at R1 align_warp at R2 reslice at R1 reslice at R1 site selection refinement of Post-refinement workflow Interactions for Site Selection • The post-refinement workflow has the same content as the pre-refinement workflow but with the following added for each refined job: Pegasus Pre-refinement workflow <siteSelection job = “…”> <logicalSite>…</logicalSite> <jobManager>…</jobManager> <runDirectory>…</runDirectory> <queueName>…</queueName> </siteSelection> Post-refinement workflow Site Selection Refinement Data stage-in and data stage out nodes are added to the workflow, Data is moved from the storage systems to the computations, data is transferred between compute resources and some intermediate data products are staged out to long term storage. Anatomy Image3 Tr(S1->R1) Anatomy Header 3 Tr(S1->R1) Reference Image Tr(S2->R1) Reference Header Tr(S2->R1) Reference Image Tr(S2->R2) align_warp at R1 Reference Header Tr(S2->R2) Anatomy Image 4 Tr(S1->R1) Anatomy Header 4 Tr(S1->R1) align_warp at R2 Warp Params4 Tr(R2->R1) reslice at R1 Resliced Image 3 Tr(R1->S1) Resliced Header 3 Tr(R1->S1) reslice at R1 Resliced Image 4 Tr(R1->S1) Resliced Header 4 Tr(R1->S1) Relationships for Data Staging • Every data staging node introduced in the refinement is related to the job that it was introduced to provide the data for with the ‘staging introduced for’ relationship type • Every application node in the post-refinement workflow after site selection is related to the corresponding node in the prerefinement workflow with the ‘identical to’ relationship type Pre-refinement workflow align_warp at R1 staging introduced for Identical to Anatomy Image3 Tr(S1->R1) Anatomy Header 3 Tr(S1->R1) Reference Image Tr(S2->R1) Reference Header Tr(S2->R1) align_warp at R1 Post-refinement workflow Interactions for Data Staging • For each data stage job add: <transfer id=“…”> <from url=“…”/> <to url=“…”/> </transfer> • And for each of its dependencies with application nodes: <child ref=“…"> <parent ref=“…"/> </child> Pegasus Pre-refinement workflow Post-refinement workflow Data Staging Refinement Data registration nodes are added to the workflow Anatomy Image3 Tr(S1->R1) Anatomy Header 3 Tr(S1->R1) Reference Image Tr(S2->R1) Reference Header Tr(S2->R1) Reference Image Tr(S2->R2) align_warp at R1 Reference Header Tr(S2->R2) Anatomy Image 4 Tr(S1->R1) Anatomy Header 4 Tr(S1->R1) align_warp at R2 Warp Params4 Tr(R2->R1) reslice at R1 reslice at R1 Resliced Image 3 Tr(R1->S1) Resliced Header 3 Tr(R1->S1) Resliced Image 4 Tr(R1->S1) Resliced Header 4 Tr(R1->S1) Register Resliced Image 3 Register Resliced Header 3 Register Resliced Image 4 Register Resliced Header 4 Relationships for Registration • For every non-registration node in the postrefinement workflow, an ‘identical to’ relationship is added to the corresponding node in the pre-refinement workflow • For every registration node in the postrefinement workflow, a ‘registration introduced for’ relationship is added to the data staging it depends on Pre-refinement workflow Resliced Image3 Tr(R1->S1) Identical to registration introduced for Resliced Image3 Tr(R1->S1) Register Resliced Image3 Post-refinement workflow Interactions for Registration • For each registration job add: <register id=“…”> <catalog url=“…”/> <source file=“…” site=“…”/> <physical file=“…” site=“…”/> </transfer> Pre-refinement • And for each of its dependencies with data staging nodes: <child ref=“…"> <parent ref=“…"/> </child> Pegasus Post-refinement workflow workflow Registration Refinement Nodes that are destined to the same resources can be clustered together to improve the overall workflow performance. The workflow is ready for execution and is sent to a workflow engine. Anatomy Image3 Tr(S1->R1) Anatomy Header 3 Tr(S1->R1) Reference Image Tr(S2->R1) Reference Header Tr(S2->R1) Reference Image Tr(S2->R2) align_warp at R1 Reference Header Tr(S2->R2) Anatomy Image 4 Tr(S1->R1) Anatomy Header 4 Tr(S1->R1) align_warp at R2 Warp Params4 Tr(R2->R1) reslice at R1 reslice at R1 Resliced Image 3 Tr(R1->S1) Resliced Header 3 Tr(R1->S1) Resliced Image 4 Tr(R1->S1) Resliced Header 4 Tr(R1->S1) Register Resliced Image 3 Register Resliced Header 3 Register Resliced Image 4 Register Resliced Header 4 Relationships for Clustering • For every non-clustered node in the post-refinement workflow, an ‘identical to’ relationship is added to the corresponding node in the pre-refinement workflow • For every clustered job node in the postrefinement workflow, a ‘clustering of’ relationship is added to the jobs from which it is clustered Pre-refinement workflow reslice at R1 reslice at R1 clustering of 2 reslices at R1 Post-refinement workflow Interactions for Clustering • For each clustered job add: <clustered newid=“…”> <job id=“…”/> <job id=“…”/> … </clustered> Pegasus Post-refinement workflow Pre-refinement workflow Clustering Refinement Upon successful execution of partition 1, partition 2 is refined, it is possible that this partition will skip some refinement steps. Here we perform site selection. softmean at R1 Partition 2 slicer at R1 slicer at R2 slicer at R3 convert at R1 convert at R2 convert at R3 Data transfer nodes are added. Resliced Image 1 Tr(S1->R1) Resliced Header 1 Tr(S1->R1) Resliced Image 2 Tr(S2->R1) Resliced Header 2 Tr(S2->R1) softmean at R1 Atlas Image Tr(R1->R2) Atlas Header Tr(R1->R2) Atlas Image Tr(R1->R3) slicer at R1 slicer at R2 slicer at R3 convert at R1 convert at R2 convert at R3 Atlas X Tr(R1->S1) Atlas Y Tr(R2->S1) Atlas Z Tr(R3->S1) Atlas Header Tr(R1->R3) Resliced Image 1 Tr(S1->R1) Resliced Header 1 Tr(S1->R1) Resliced Image 2 Tr(S2->R1) Resliced Header 2 Tr(S2->R1) softmean at R1 Atlas Image Tr(R1->R2) Atlas Header Tr(R1->R2) Atlas Image Tr(R1->R3) slicer at R1 slicer at R2 slicer at R3 convert at R1 convert at R2 convert at R3 Atlas X Tr(R1->S1) Atlas Y Tr(R2->S1) Atlas Z Tr(R3->S1) Register Atlas X Register Atlas Y Register Atlas Z Atlas Header Tr(R1->R3) Data registration nodes are added Resliced Image 1 Tr(S1->R1) Resliced Header 1 Tr(S1->R1) Resliced Image 2 Tr(S2->R1) Resliced Header 2 Tr(S2->R1) softmean at R1 Atlas Image Tr(R1->R2) Atlas Header Tr(R1->R2) Atlas Image Tr(R1->R3) slicer at R1 slicer at R2 slicer at R3 convert at R1 convert at R2 convert at R3 Atlas X Tr(R1->S1) Atlas Y Tr(R2->S1) Atlas Z Tr(R3->S1) Register Atlas X Register Atlas Y Register Atlas Z Atlas Header Tr(R1->R3) Workflow 1 Partition id1 Workflow 2 Workflow 5 Reduction id2 Workflow 4 Site Selection id3 Workflow 6 Data Staging id4 Workflow 7 Registration id4 Workflow 8 Workflow Execution (DAGMan) id6 Clustering id5 Control Dependency Workflow 3 Site Selection id7 Data Staging id8 Workflow 9 Registration id9 Workflow 10 Clustering id10 Workflow 11 Workflow 12 Workflow Execution (DAGMan) id11
© Copyright 2026 Paperzz