An Open Provenance Model for Scientific Workflows

Professor Luc Moreau
University of Southampton
www.ecs.soton.ac.uk/~lavm
Provenance & PASOA Teams

University of Southampton: Luc Moreau, Paul Groth, Simon Miles, Victor Tan, Miguel Branco, Sofia Tsasakou, Sheng Jiang, Steve Munroe, Zheng Chen
IBM UK (EU Project Coordinator): John Ibbotson, Neil Hardman, Alexis Biller
University of Wales, Cardiff: Omer Rana, Arnaud Contes, Vikas Deora, Ian Wootten, Shrija Rajbhandari
Universitat Politècnica de Catalunya (UPC): Steven Willmott, Javier Vazquez
SZTAKI: Laszlo Varga, Arpad Andics, Tamas Kifor
German Aerospace: Andreas Schreiber, Guy Kloss, Frank Danneman
Contents
Motivation
Provenance Concept Map
Process documentation in a concrete bioinformatics application
Conclusions
Motivation
Peer Review/Audit
Academic publishing
Accounting
Healthcare
Banking
e-Science datasets
How can e-Science results be peer-reviewed and validated?
Current Solutions
Proprietary, Monolithic
Silos, Closed
Do not inter-operate with other applications
Not adaptable to new regulations
Provenance
Oxford English Dictionary:
the fact of coming from some particular source or quarter; origin, derivation
the history or pedigree of a work of art, manuscript, rare book, etc.;
concretely, a record of the passage of an item through its various owners.
Concept vs representation
Application Drivers
Aerospace engineering: maintain a historical record of design processes, up to 99 years.
Organ transplant management: tracking of previous decisions, crucial to maximise the efficiency in matching and recovery rate of patients
Bioinformatics: verification and auditing of “experiments” (e.g. for drug approval)
High Energy Physics: tracking, analysing, verifying data sets in the ATLAS Experiment of the Large Hadron Collider (CERN)
Provenance Concept Map
[Figure: concept map linking Application, Services, Data product, Process, Process Documentation, P-structure, P-assertions, Provenance (concept), Provenance (representation) and Provenance Query. Edge labels: documents; is defined as a past; has a structure; produces; is an execution of; is represented by; has; is obtained by; contains; assert; consists of; operates over.]
Making Applications Provenance Aware
[Figure: Application, Data Product, Provenance Store]
Assert p-assertions and record them as Process Documentation
Obtain the provenance of data by issuing provenance queries
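The record-then-query cycle can be illustrated with a toy in-memory store. This is only a sketch, not the PASOA implementation: the class `ProvenanceStore`, its methods `record` and `query`, the file names and the dictionary layout of a p-assertion are all hypothetical names chosen for this example.

```python
from collections import defaultdict


class ProvenanceStore:
    """Toy in-memory store (hypothetical API, not the PASOA interface)."""

    def __init__(self):
        # Process documentation: p-assertions grouped by the data item they mention.
        self._by_item = defaultdict(list)

    def record(self, asserter, item, statement):
        """Record one p-assertion made by `asserter` about `item`."""
        self._by_item[item].append({"asserter": asserter, "statement": statement})

    def query(self, item):
        """Return every p-assertion recorded about `item`, in recording order."""
        return list(self._by_item[item])


store = ProvenanceStore()
# While running, the application asserts what it did to produce a data product.
store.record("service-A", "result.dat", "received input.dat")
store.record("service-A", "result.dat", "sent result.dat")

# Later, a user obtains the provenance of the data product by querying the store.
for p in store.query("result.dat"):
    print(p["asserter"], "asserted:", p["statement"])
```

Recording and querying are deliberately separate calls: the application only asserts, and the provenance is derived afterwards from whatever has been recorded.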
Process Documentation
[Figure: a service receiving messages M1 and M4, applying functions f1 and f2, and sending messages M2 and M3]
Interaction p-assertions: I received M1, M4; I sent M2, M3
Relationship p-assertions: M3 = f1(M1); M2 = f2(M1,M4); M2 is in reply to M1
Service state p-assertions: I received M1 at time t; I used algorithm x.y.z
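The three kinds of p-assertions on this slide can be written down as plain records. The sketch below is an assumption-laden illustration: the class and field names are hypothetical and are not taken from the PASOA schema; only the messages M1 to M4, the functions f1 and f2 and the algorithm label come from the slide.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class InteractionPAssertion:
    asserter: str
    direction: str  # "received" or "sent"
    message: str


@dataclass(frozen=True)
class RelationshipPAssertion:
    asserter: str
    output: str                 # message produced by the service
    function: str               # function applied inside the service
    inputs: Tuple[str, ...]     # messages the output was computed from


@dataclass(frozen=True)
class ServiceStatePAssertion:
    asserter: str
    message: str                # interaction the state refers to
    state: str


# Process documentation asserted by the single service shown on the slide.
documentation = [
    InteractionPAssertion("service", "received", "M1"),
    InteractionPAssertion("service", "received", "M4"),
    InteractionPAssertion("service", "sent", "M2"),
    InteractionPAssertion("service", "sent", "M3"),
    RelationshipPAssertion("service", "M3", "f1", ("M1",)),
    RelationshipPAssertion("service", "M2", "f2", ("M1", "M4")),
    ServiceStatePAssertion("service", "M1", "used algorithm x.y.z"),
]

# Which messages did the service assert that it sent?
sent = [p.message for p in documentation
        if isinstance(p, InteractionPAssertion) and p.direction == "sent"]
print(sent)  # ['M2', 'M3']
```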
Data flow
Interaction p-assertions allow us to specify a flow of data between services
Relationship p-assertions allow us to characterise the flow of data “inside” a service
Overall data flow (internal + external) constitutes a directed acyclic graph (DAG), which characterises the process that led to a result
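A minimal sketch of how such a DAG can be traversed to find what a result was derived from. It assumes messages are identified by name, so a message sent by one service and received by another is the same node (the external flow); relationship p-assertions supply the internal edges. The message `M5` and function `g` are hypothetical additions used to show a second step; they do not appear on the slides.

```python
# Relationship p-assertions as a mapping: output -> (function, inputs).
relationships = {
    "M3": ("f1", ["M1"]),
    "M2": ("f2", ["M1", "M4"]),
    # Hypothetical downstream service consuming M2 and M3.
    "M5": ("g", ["M2", "M3"]),
}


def provenance(item, relationships):
    """Return all items that `item` was derived from (its ancestors in the DAG)."""
    ancestors = set()
    stack = [item]
    while stack:
        current = stack.pop()
        # Items with no relationship p-assertion are inputs to the whole process.
        _, inputs = relationships.get(current, (None, []))
        for parent in inputs:
            if parent not in ancestors:
                ancestors.add(parent)
                stack.append(parent)
    return ancestors


print(sorted(provenance("M5", relationships)))  # ['M1', 'M2', 'M3', 'M4']
print(sorted(provenance("M3", relationships)))  # ['M1']
```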
Process Documentation in a Concrete Bioinformatics Application
Biology
How do protein sequences fold into a 3D structure?
Structure of protein sequences may help to answer this question.
Structure can be quantified by textual compressibility.
Which amino acid groupings maximize compressibility?
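To make "textual compressibility under a grouping" concrete, here is a rough sketch. The slides do not say which compressor or encoding the experiment uses, so `zlib` is a stand-in, and the sequence and the two-group (hydrophobic vs. other) alphabet are invented toy data, not experimental input.

```python
import zlib


def compressed_size(text: str) -> int:
    """Size in bytes of the zlib-compressed text (zlib is an assumed stand-in)."""
    return len(zlib.compress(text.encode("ascii"), 9))


def recode(sequence: str, grouping: dict) -> str:
    """Replace each amino acid letter by the symbol of its group."""
    return "".join(grouping[aa] for aa in sequence)


# Hypothetical toy sequence and grouping, for illustration only.
sequence = "MKVLAAGIVGLLLAQPSFAMKVLAAGIVGLLLAQPSFA" * 4
hydrophobic = set("AVLIMFWC")
grouping = {aa: ("h" if aa in hydrophobic else "p") for aa in set(sequence)}

original = compressed_size(sequence)
grouped = compressed_size(recode(sequence, grouping))
print("original:", original, "bytes; grouped:", grouped, "bytes")
```

Comparing the compressed sizes obtained with different candidate groupings is one simple way to rank them; the experiment's actual scoring may differ.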
Collaboration Diagram
Actual Call DAG
The P-Structure
The logical structure of a provenance store
Interaction Record
The set of p-assertions pertaining to a given interaction (i.e., message exchange between a sender and a receiver)
Interaction Key
A unique identifier for an interaction
Sender identity
Receiver identity
Local id
View
The set of p-assertions created by an asserter involved in an interaction (sender or receiver view)
Asserter
The identity of an asserter
Interaction P-Assertion
An assertion of the contents of a message by an actor that has sent or received that message
Interaction P-Assertion Content
The content of an interaction p-assertion: here, the invocation of blast (through a wrapper)
Interaction Content
Provenance-related information passed in application messages
Actor State P-Assertion
An assertion made by an actor about its internal state in the context of a specific interaction
Relationship P-Assertion
With respect to an interaction, a relationship p-assertion is an assertion, made by an actor, that describes how the actor obtained output data or the whole message sent in that interaction by applying some function to input data or messages from other interactions.
Subject Id
The identity of the subject of a relationship
Object Id
The identity of the object of a relationship
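The nesting described in these definitions (store, interaction records indexed by interaction key, one view per asserter) can be sketched as follows. All class, field and actor names are hypothetical and chosen for illustration; the public specification, not this sketch, defines the real structure.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass(frozen=True)
class InteractionKey:
    """Unique identifier for an interaction."""
    sender: str
    receiver: str
    local_id: str


@dataclass
class View:
    """P-assertions created by one asserter involved in the interaction."""
    asserter: str
    p_assertions: List[str] = field(default_factory=list)


@dataclass
class InteractionRecord:
    """All p-assertions pertaining to one interaction."""
    key: InteractionKey
    sender_view: View
    receiver_view: View


class PStructure:
    """Logical structure of a provenance store: records indexed by key."""

    def __init__(self):
        self.records: Dict[InteractionKey, InteractionRecord] = {}

    def record(self, key, asserter, p_assertion):
        if asserter not in (key.sender, key.receiver):
            raise ValueError("asserter must be the sender or the receiver")
        rec = self.records.setdefault(
            key, InteractionRecord(key, View(key.sender), View(key.receiver)))
        view = rec.sender_view if asserter == key.sender else rec.receiver_view
        view.p_assertions.append(p_assertion)
        return rec


store = PStructure()
key = InteractionKey("workflow", "blast-wrapper", "1")
store.record(key, "workflow", "sent invocation of blast")
store.record(key, "blast-wrapper", "received invocation of blast")
print(store.records[key].receiver_view.p_assertions)
# ['received invocation of blast']
```

Because both parties file their p-assertions under the same interaction key, sender and receiver can document the same message exchange independently.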
Process Documentation Characteristics
Common logical structure of the provenance store shared by all asserting and querying actors
Can be produced autonomously, asynchronously by the different application components
Open, extensible model, for which we are producing a public specification
Tools can operate on it (e.g. visualisation, reasoning)
Performance (HPDC’05)
Standardisation Philosophy
Thin layer common between systems: extensible data model
Model can be extended for specific:
technologies (WS, Web, …), or
application domains (Bio, Healthcare, Desktop, …)
Service interfaces
Proposed List of Specifications
Generic Profiles
WS-Prov-Intro
WS-Prov-DM-Sec
WS-Prov-DM-Link
WS-Prov-Glo
WS-Prov-DM
WS-Prov-DM-Infer
WS-Prov-DM-DS
WS-Prov-Primer
WS-Prov-DM-Rel
WS-Prov-Rec
WS-Prov-Query
Technology Bindings
WS-Prov-SOAP
WS-Prov-WWW
Domain Specific Profiles
Conclusions
To Sum Up
Standardising the documentation of Business Processes
Distribution, Finance, Aerospace, Healthcare, Automobile, Pharmaceutical
Provenance: Architecture, Methodology
Record, Provenance Store, Query
Compliance check, Rerun/Reproduce, Analyse
Slide from John Ibbotson
Conclusions
Crucial topic for many applications
Full architectural specification
Implementation available for download
Methodology to make applications provenance-aware
Draft standardisation proposal to be released
www.pasoa.org
www.gridprovenance.org
Provenance Challenge
Provenance Challenge Workshop at OGF18, Washington, September 11-14
twiki.ipaw.info
Questions