redux - Provenance Challenge

REDUX – automatic capture, efficient storage
Roger S. Barga
Microsoft Research (MSR)
Luciano Digiampietri
University of Campinas, São Paulo, Brazil
Considerations
What information needs to be captured?
Which version of BLAST did I use?
What codes (activities) did I invoke to get this result, and what were the parameters?
What data transformations did I use to get this result?
What machine was used to perform the alignment?
Were any steps skipped in this experiment, or were any shims inserted?
Did the experiment design differ between these two results? If so, where?...
Are there any branches in the workflow that have not been explored?
Additional Issues to Consider…
Result of a provenance query is an executable workflow
It may not be possible to rerun an experiment, to either validate or recreate a result,
because the original workflow is lost (activities have been updated).
Allow the user to control what is shared/exposed – one size doesn’t fit all
Provenance storage costs can quickly grow out of hand…
Implementation
Extended enactment engine of WinOE to automatically
capture steps during execution leading to a result
Provenance capture is automatic & transparent
A multilayer model for representing result provenance
Abstract Workflow  Service Instantiation  Data Instantiation  Runtime
Store provenance in an RDBMS (SQL Server), utilizing
previous traces to significantly reduce storage costs
The current query interface is SQL; eventually a forms-based interface.
Version and lock the executables
Updating any activity changes the workflow version number, resulting
in a new version. The user can rerun an experiment by invoking the
workflow using the fully specified reference found in the provenance record.
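A minimal sketch of the idea that updating any activity yields a new workflow version. The slides do not say how version numbers are assigned; deriving the version from a content hash of the activity code is an assumption made here for illustration, and `workflow_version` is a hypothetical helper.

```python
import hashlib

def workflow_version(activities: dict[str, bytes]) -> str:
    """Derive a version id from the exact bytes of every activity (assumed scheme)."""
    digest = hashlib.sha256()
    for name in sorted(activities):  # stable order, independent of dict order
        digest.update(name.encode("utf-8"))
        digest.update(b"\0")
        digest.update(hashlib.sha256(activities[name]).digest())
    return digest.hexdigest()[:12]

# Hypothetical activity code; only "convert" changes between the two versions.
v1 = workflow_version({"align": b"align-code-v1", "convert": b"convert-code-v1"})
v2 = workflow_version({"align": b"align-code-v1", "convert": b"convert-code-v2"})
same_as_v1 = workflow_version({"convert": b"convert-code-v1", "align": b"align-code-v1"})
```

Under this scheme a provenance record that stores `v1` pins the exact code that produced the result, so a rerun can request that version rather than whatever is currently installed.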
Abstract Workflow
Data Model for Abstract Workflow
Bound to Activities (code) and Data
Data Model for Workflow Instance
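A minimal sketch of how the four layers (Abstract Workflow, Service Instantiation, Data Instantiation, Runtime) might reference one another. All class and field names are hypothetical and are not taken from the REDUX schema.

```python
from dataclasses import dataclass, field

@dataclass
class AbstractWorkflow:
    workflow_id: str
    steps: list[str]                 # activity types, no code bound yet

@dataclass
class ServiceInstantiation:
    abstract: AbstractWorkflow
    executables: dict[str, str]      # step -> versioned executable reference

@dataclass
class DataInstantiation:
    services: ServiceInstantiation
    inputs: dict[str, str]           # input port -> data reference

@dataclass
class RuntimeRecord:
    instance: DataInstantiation
    execution_id: int
    events: list[tuple[str, str]] = field(default_factory=list)  # (step, produced data)

# Illustrative values only.
abstract = AbstractWorkflow("atlas", ["slicer", "convert"])
services = ServiceInstantiation(abstract, {"slicer": "slicer@v1", "convert": "convert@v1"})
instance = DataInstantiation(services, {"slicer.in": "atlas.img"})
run = RuntimeRecord(instance, execution_id=1)
run.events.append(("slicer", "atlas-x.pgm"))
```

Because each layer only points at the one above it, two runs that differ only at runtime can share the same abstract workflow and bindings.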
Provenance Queries – Query 1
Provenance queries 1, 4, 5, 7, 8 and 9
Find the process that led to Atlas X Graphic / everything that caused
Atlas X Graphic to be as it is. This should tell us the new brain images
from which the averaged atlas was generated, the warping performed etc.
Returns ExecutableWorkflowId (process), ExecutionId (id of specific execution of the
process), EventId (event where data was produced) and ExecutableWorkflow_
ExecutableActivityId (activity that produced the data) of the processes that generated
the Atlas X Graphic
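A hedged sketch of how such a lineage query could be written as a recursive SQL query. The four returned column names come from the slide; the `Event` and `EventInput` tables, their remaining columns, and the sample rows are assumptions, and SQLite stands in for SQL Server.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Event (                      -- hypothetical table
    EventId INTEGER PRIMARY KEY,
    ExecutableWorkflowId INTEGER,
    ExecutionId INTEGER,
    ExecutableWorkflow_ExecutableActivityId TEXT,
    OutputData TEXT
);
CREATE TABLE EventInput (EventId INTEGER, InputData TEXT);  -- hypothetical table
INSERT INTO Event VALUES
    (1, 10, 100, 'softmean', 'Atlas Image'),
    (2, 10, 100, 'slicer',   'Atlas X Slice'),
    (3, 10, 100, 'convert',  'Atlas X Graphic'),
    (4, 10, 100, 'convert',  'Atlas Y Graphic');
INSERT INTO EventInput VALUES
    (2, 'Atlas Image'), (3, 'Atlas X Slice'), (4, 'Atlas Y Slice');
""")

# Start at the event that produced the target, then repeatedly add the
# events that produced each input of an event already in the lineage.
lineage_rows = conn.execute("""
WITH RECURSIVE lineage(EventId) AS (
    SELECT EventId FROM Event WHERE OutputData = ?
    UNION
    SELECT producer.EventId
    FROM lineage
    JOIN EventInput ON EventInput.EventId = lineage.EventId
    JOIN Event AS producer ON producer.OutputData = EventInput.InputData
)
SELECT ExecutableWorkflowId, ExecutionId, EventId,
       ExecutableWorkflow_ExecutableActivityId
FROM Event
WHERE EventId IN (SELECT EventId FROM lineage)
ORDER BY EventId
""", ("Atlas X Graphic",)).fetchall()
```

With the sample rows, the result contains events 1 to 3 and leaves out event 4, which only contributed to the Atlas Y Graphic.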
Provenance Queries – Query 7a
Our layered model allows the detection of differences in several ways
A user has run the workflow twice, in the second instance replacing each
procedure (convert) in the final stage with two procedures: pgmtoppm,
then pnmtojpeg. Find the differences between the two workflow runs. The
exact level of detail in the difference that is detected by a system is
up to each participant.
Provenance Queries – Query 7b
The Workflow Model captures information about the instances of the activities
and the links among the ports (or activity interfaces). At this layer, our
model supports provenance queries that ask, for example, which activities
from Workflow 2 are not included in Workflow 1:
Activities used by the second workflow but not the first
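A minimal sketch of this comparison as a SQL set difference. The `WorkflowActivity` table and its rows are hypothetical (only convert, pgmtoppm and pnmtojpeg are named in the query text), and SQLite stands in for SQL Server.

```python
import sqlite3

wf_conn = sqlite3.connect(":memory:")
wf_conn.executescript("""
CREATE TABLE WorkflowActivity (WorkflowId INTEGER, ActivityName TEXT);
INSERT INTO WorkflowActivity VALUES
    (1, 'slicer'), (1, 'convert'),
    (2, 'slicer'), (2, 'pgmtoppm'), (2, 'pnmtojpeg');
""")

# EXCEPT keeps the activities of workflow 2 that never appear in workflow 1.
only_in_second = [name for (name,) in wf_conn.execute("""
SELECT ActivityName FROM WorkflowActivity WHERE WorkflowId = 2
EXCEPT
SELECT ActivityName FROM WorkflowActivity WHERE WorkflowId = 1
ORDER BY ActivityName
""")]
```

Swapping the two `WorkflowId` values gives the opposite direction: activities dropped from the first workflow (here, convert).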
Provenance Queries – Query 7c
The Runtime Level contains information about the execution of the
workflow (produced data, timestamps, activities invoked, etc.). Here the
model allows queries about produced data, data flow (see Q2 and Q3),
date/time, etc.
One example query that illustrates the difference between two
workflows, at this level, is: What is the data produced by the second
workflow that was not produced by the first?
Data produced by workflow 2 that was not produced by workflow 1:
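A hedged sketch of the runtime-level comparison. It assumes data products are compared by content hash, which the slides do not state; the file names and byte strings are illustrative only.

```python
import hashlib

def content_digest(data: bytes) -> str:
    """Identify a data product by its content rather than its name."""
    return hashlib.sha256(data).hexdigest()

# Hypothetical outputs of the two runs: name -> content.
run1_outputs = {"atlas-x.pgm": b"slice-x", "atlas-x.gif": b"gif-bytes"}
run2_outputs = {
    "atlas-x.pgm": b"slice-x",
    "atlas-x.ppm": b"ppm-bytes",
    "atlas-x.jpg": b"jpeg-bytes",
}

seen_in_run1 = {content_digest(content) for content in run1_outputs.values()}
new_in_run2 = sorted(
    name for name, content in run2_outputs.items()
    if content_digest(content) not in seen_in_run1
)
```

Comparing by content means an unchanged intermediate file is recognized as shared even if a later run stores it under a different name.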
Efficiently Storing Provenance Data
For Provenance Query 7
The two workflows share more than 99% of
the provenance data (space) and 46%
of the database tuples.
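A minimal sketch of how a second trace can reuse tuples already stored for the first. The content-addressed `TupleStore` is an assumed mechanism for illustration, not the REDUX storage design, and the toy traces are not meant to reproduce the percentages above.

```python
import hashlib

class TupleStore:
    """Stores each distinct provenance tuple once; traces hold references."""

    def __init__(self):
        self.rows = {}  # digest -> tuple

    def add_trace(self, tuples):
        """Return (references for this trace, number of newly stored tuples)."""
        refs, newly_stored = [], 0
        for row in tuples:
            key = hashlib.sha256(repr(row).encode("utf-8")).hexdigest()
            if key not in self.rows:
                self.rows[key] = row
                newly_stored += 1
            refs.append(key)
        return refs, newly_stored

# Hypothetical (activity, input, output) tuples for the two runs.
trace1 = [("slicer", "atlas.img", "x.pgm"), ("convert", "x.pgm", "x.gif")]
trace2 = [
    ("slicer", "atlas.img", "x.pgm"),
    ("pgmtoppm", "x.pgm", "x.ppm"),
    ("pnmtojpeg", "x.ppm", "x.jpg"),
]

store = TupleStore()
refs1, new1 = store.add_trace(trace1)
refs2, new2 = store.add_trace(trace2)
shared_in_trace2 = len(trace2) - new2
```

Here the second trace adds only two new tuples and shares one with the first, so the store holds four tuples instead of five.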
To Sum Up…
Extended Windows Workflow Foundation
Transparently capture execution trace leading to a result
A layered provenance model
Relational database (SQL Server) as provenance store
Store provenance as delta/edit over existing traces
Initial query facility built over this provenance data
Unique aspects of our system
Result of a provenance query is an executable workflow
Coupled code versioning to provenance collection
An open (and interesting) data management challenge