Annotating research objects

Wf4Ever: Annotating
research objects
Stian Soiland-Reyes, Sean Bechhofer
myGrid, University of Manchester
Open Annotation Rollout, Manchester, 2013-06-24
This work is licensed under a
Creative Commons Attribution 3.0 Unported License
Motivation: Scientific workflows
Coordinated execution of
services and linked resources
Dataflow between services
Web services (SOAP, REST)
Command line tools
Scripts
User interactions
Components (nested workflows)
http://www.myexperiment.org/workflows/3355
http://www.biovel.eu/
Method becomes:
Documented visually
Shareable as single definition
Reusable with new inputs
Repurposable with other services
Reproducible?
http://www.taverna.org.uk/
But workflows are complex machines
• Will it still work after a year? 10 years?
• Expanding components, we see a workflow involves a series of specific tools and services which:
  • Depend on datasets, software libraries, other tools
  • Are often poorly described or understood
  • Over time evolve, change, break or are replaced
• User interactions are not reproducible
  • But can be tracked and replayed
[Workflow diagram: inputs, components, configuration, outputs]
http://www.myexperiment.org/workflows/3355
Electronic Paper Not Enough
[Diagram of the research lifecycle: Investigation, Hypothesis, Experiment, Result, Analysis, Conclusions, Publish; only the electronic paper (and some data) gets published]
Open Research movement: Openly share the data of your experiments
http://figshare.com/
http://datadryad.org/
http://www.force11.org/beyondthepdf2
RESEARCH OBJECT (RO)
http://www.researchobject.org/
Research objects goal: Openly share everything about your
experiments, including how those things are related
http://www.researchobject.org/
What is in a research object?
A Research Object bundles and relates digital
resources of a scientific experiment or
investigation:
Data used and results produced in
experimental study
Methods employed to produce and analyse
that data
Provenance and settings for the experiments
People involved in the investigation
Annotations about these resources, that are
essential to the understanding and
interpretation of the scientific outcomes
captured by a research object
http://www.researchobject.org/
Gathering everything
Research Objects (RO) aggregate related resources, their
provenance and annotations
Conveys “everything you need to know” about a
study/experiment/analysis/dataset/workflow
Shareable, evolvable, contributable, citable
ROs have their own provenance and lifecycles
Why Research Objects?
i. To share your research materials (RO as a social object)
ii. To facilitate reproducibility and reuse of methods
iii. To be recognized and cited (even for constituent resources)
iv. To preserve results and prevent decay (curation of workflow definition; using provenance for partial rerun)
A Research object
http://alpha.myexperiment.org/packs/387
Quality Assessment of a research object
Quality Monitoring
Annotations in research objects
Types: “This document contains an hypothesis”
Relations: “These datasets are consumed by that tool”
Provenance: “These results came from this workflow run”
Descriptions: “Purpose of this step is to filter out invalid data”
Comments: “This method looks useful, but how do I install it?”
Examples: “This is how you could use it”
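For illustration only, such annotations boil down to a few RDF statements in an annotation body; a Turtle sketch with made-up resource names such as <workflow1> and <results.csv>:

  @prefix dct:     <http://purl.org/dc/terms/> .
  @prefix rdfs:    <http://www.w3.org/2000/01/rdf-schema#> .
  @prefix roterms: <http://purl.org/wf4ever/roterms#> .
  @prefix wfprov:  <http://purl.org/wf4ever/wfprov#> .

  # Typing: "This document contains an hypothesis"
  <hypothesis.txt> a roterms:Hypothesis .

  # Provenance relation: "These results came from this workflow run"
  <results.csv> wfprov:wasOutputFrom <runs/2013-06-21/> .

  # Description and comment on a workflow step
  <workflow1> dct:description "Purpose of this step is to filter out invalid data" ;
              rdfs:comment    "This method looks useful, but how do I install it?" .
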
Annotation guidelines – which
properties?
Descriptions: dct:title, dct:description, rdfs:comment, dct:publisher, dct:license,
dct:subject
Provenance: dct:created, dct:creator, dct:modified, pav:providedBy,
pav:authoredBy, pav:contributedBy, roevo:wasArchivedBy, pav:createdAt
Provenance relations: prov:wasDerivedFrom, prov:wasRevisionOf,
wfprov:usedInput, wfprov:wasOutputFrom
Social networking: oa:Tag, mediaont:hasRating, roterms:technicalContact,
cito:isDocumentedBy, cito:isCitedBy
Dependencies: dcterms:requires, roterms:requiresHardware,
roterms:requiresSoftware, roterms:requiresDataset
Typing: wfdesc:Workflow, wf4ever:Script, roterms:Hypothesis, roterms:Results,
dct:BibliographicResource
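A hedged example of a few of these guideline properties combined on one aggregated workflow (all URIs and values are invented for illustration):

  @prefix dct:     <http://purl.org/dc/terms/> .
  @prefix pav:     <http://purl.org/pav/> .
  @prefix prov:    <http://www.w3.org/ns/prov#> .
  @prefix wfdesc:  <http://purl.org/wf4ever/wfdesc#> .
  @prefix roterms: <http://purl.org/wf4ever/roterms#> .

  <workflow1>
      a                        wfdesc:Workflow ;                  # typing
      dct:title                "Metagenome analysis" ;            # description
      dct:license              <http://creativecommons.org/licenses/by/3.0/> ;
      pav:authoredBy           <#alice> ;                         # provenance
      prov:wasDerivedFrom      <workflow0> ;                      # provenance relation
      roterms:requiresSoftware <http://www.taverna.org.uk/> .     # dependency
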
What is provenance?
Attribution: who did it?
Activity: what happens to it?
Date and tool: when was it made? using what?
Derivation: how did it change?
Origin: where is it from?
Attributes: what is it?
Annotations: what do others say about it?
Licensing: can I use it?
Aggregation: what is it part of?
Photo by Dr Stephen Dann, licensed under Creative Commons Attribution-ShareAlike 2.0 Generic
http://www.flickr.com/photos/stephendann/3375055368/
Attribution
[Diagram: the data wasAttributedTo Alice; Alice actedOnBehalfOf the lab]
Who collected this sample? Who helped?
Which lab performed the sequencing?
Who did the data analysis?
Who curated the results?
Who produced the raw data this analysis is based on?
Who wrote the analysis workflow?
Why do I need this?
i. To be recognized for my work
ii. Who should I give credits to?
iii. Who should I complain to?
iv. Can I trust them?
v. Who should I make friends with?
Agent types: Person, Organization, SoftwareAgent
Roles: prov:wasAttributedTo, prov:actedOnBehalfOf, dct:creator, dct:publisher, pav:authoredBy, pav:contributedBy, pav:curatedBy, pav:createdBy, pav:importedBy, pav:providedBy, ...
http://practicalprovenance.wordpress.com/
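A minimal attribution sketch in Turtle, assuming hypothetical URIs for the results, Alice and her lab:

  @prefix prov: <http://www.w3.org/ns/prov#> .
  @prefix dct:  <http://purl.org/dc/terms/> .
  @prefix pav:  <http://purl.org/pav/> .

  <results.csv> prov:wasAttributedTo <#alice> ;    # who did the analysis
                pav:curatedBy        <#alice> ;    # who curated the results
                dct:publisher        <#the-lab> .  # which lab published the data

  <#alice>   a prov:Person ;
             prov:actedOnBehalfOf <#the-lab> .     # Alice worked for the lab

  <#the-lab> a prov:Organization .
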
Derivation
[Diagram: a metagenome wasDerivedFrom a sample; a sequence wasDerivedFrom the metagenome; the results wasDerivedFrom the sequence; new results wasRevisionOf old results (related terms: wasQuotedFrom, wasInfluencedBy)]
Which sample was this metagenome sequenced from?
Which metagenomes was this sequence extracted from?
Which sequence was the basis for the results?
What is the previous revision of the new results?
Why do I need this?
i. To verify consistency (did I use the correct sequence?)
ii. To find the latest revision
iii. To backtrack where a diversion appeared after a change
iv. To credit work I depend on
v. Auditing and defence for peer review
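The derivation chain above, sketched with PROV (identifiers hypothetical):

  @prefix prov: <http://www.w3.org/ns/prov#> .

  <metagenome.fasta> prov:wasDerivedFrom <sample42> .
  <sequence17.fasta> prov:wasDerivedFrom <metagenome.fasta> .
  <results-v2.csv>   prov:wasDerivedFrom <sequence17.fasta> ;
                     prov:wasRevisionOf  <results-v1.csv> .   # find the previous/latest revision
  <discussion.html>  prov:wasQuotedFrom  <results-v2.csv> .
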
Activities
[Diagram: a Sequencing activity used a sample and wasAssociatedWith Alice (hadRole: lab technician); a metagenome wasGeneratedBy the sequencing ("2012-06-21"); a workflow run, wasInformedBy/wasStartedBy the sequencing and wasStartedAt a later time, wasAssociatedWith a workflow server with hadPlan a workflow definition; results wasGeneratedBy the run]
What happened? When? Who?
What was used and generated?
Why was this workflow started?
Which workflow ran? Where?
Why do I need this?
i. To see which analysis was performed
ii. To find out who did what
iii. What was the metagenome used for?
iv. To understand the whole process ("make me a Methods section")
v. To track down inconsistencies
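A rough PROV-O sketch of these two activities; qualified associations carry the role and the plan, and all URIs are hypothetical:

  @prefix prov: <http://www.w3.org/ns/prov#> .
  @prefix xsd:  <http://www.w3.org/2001/XMLSchema#> .

  <#sequencing> a prov:Activity ;
      prov:used              <sample42> ;
      prov:startedAtTime     "2012-06-21T09:00:00Z"^^xsd:dateTime ;
      prov:wasAssociatedWith <#alice> ;
      prov:qualifiedAssociation [
          a prov:Association ;
          prov:agent   <#alice> ;
          prov:hadRole <#lab-technician>        # Alice's role in this activity
      ] .

  <metagenome.fasta> prov:wasGeneratedBy <#sequencing> .

  <#workflow-run> a prov:Activity ;
      prov:used              <metagenome.fasta> ;
      prov:wasInformedBy     <#sequencing> ;    # why the workflow was started
      prov:wasAssociatedWith <#workflow-server> ;
      prov:qualifiedAssociation [
          a prov:Association ;
          prov:agent   <#workflow-server> ;
          prov:hadPlan <workflow.t2flow>        # which workflow definition ran
      ] .

  <results.csv> prov:wasGeneratedBy <#workflow-run> .
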
Provenance Working Group
PROV model
Copyright © 2013 W3C® (MIT, ERCIM, Keio, Beihang), All Rights Reserved.
http://www.w3.org/TR/prov-primer/
Provenance of what?
Who made the (content of) research object? Who maintains it?
Who wrote this document? Who uploaded it?
Which CSV was this Excel file imported from?
Who wrote this description? When? How did we get it?
What is the state of this RO? (Live or Published?)
What did the research object look like before? (Revisions) – are
there newer versions?
Which research objects are derived from this RO?
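A hedged sketch of such RO-level provenance, reusing properties from the guidelines above (URIs hypothetical; the roevo class name is assumed from the RO evolution ontology):

  @prefix dct:   <http://purl.org/dc/terms/> .
  @prefix pav:   <http://purl.org/pav/> .
  @prefix prov:  <http://www.w3.org/ns/prov#> .
  @prefix roevo: <http://purl.org/wf4ever/roevo#> .
  @prefix xsd:   <http://www.w3.org/2001/XMLSchema#> .

  <ro/v2/> a roevo:SnapshotRO ;                 # state: a published snapshot (vs. live)
      dct:creator        <#alice> ;             # who made the content
      pav:createdBy      <#ro-portal> ;         # which tool/agent created the RO itself
      dct:created        "2013-06-24"^^xsd:date ;
      prov:wasRevisionOf <ro/v1/> .             # what it looked like before
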
Research object model at a glance
The Research Object («ore:Aggregation», «ro:ResearchObject») ore:aggregates its Resources («ore:AggregatedResource», «ro:Resource») and its Annotations («oa:Annotation», «ro:AggregatedAnnotation»).
Each Annotation points via oa:hasTarget at the RO or at aggregated Resources, and via oa:hasBody at an annotation graph («trig:Graph»).
The Manifest («ore:ResourceMap», «ro:Manifest») describes the Research Object.
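A minimal Turtle sketch of this structure (file names hypothetical):

  @prefix ro:  <http://purl.org/wf4ever/ro#> .
  @prefix ore: <http://www.openarchives.org/ore/terms/> .
  @prefix oa:  <http://www.w3.org/ns/oa#> .

  <.ro/manifest.rdf> a ro:Manifest, ore:ResourceMap ;
      ore:describes <./> .

  <./> a ro:ResearchObject, ore:Aggregation ;
      ore:aggregates <workflow1.t2flow>, <results.csv>, <.ro/annotations/1> .

  <.ro/annotations/1> a ro:AggregatedAnnotation, oa:Annotation ;
      oa:hasTarget <workflow1.t2flow> ;
      oa:hasBody   <.ro/bodies/workflow1.ttl> .   # an RDF graph with the actual statements
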
Wf4Ever architecture
[Architecture diagram: the REST resources (Research object, Manifest, ORE Proxy, Resource, Annotation, Annotation graph) are uploaded to a blob store; an external reference instead redirects to its original location; RDF content is also imported as named graphs into a graph store, queryable over SPARQL]
http://www.wf4ever-project.org/wiki/display/docs/RO+API+6
Where do RO annotations come
from?
Imported from uploaded resources, e.g. embedded in
workflow-specific format (creator: unknown!)
Created by users filling in Title, Description etc. on website
By automatically invoked software agents, e.g.:
A workflow transformation service extracts the workflow
structure as RDF from the native workflow format
Provenance trace from a workflow run, which describes the
origin of aggregated output files in the research object
How we are using the OA model
Multiple oa:Annotation contained within the manifest RDF and
aggregated by the RO.
Provenance (PAV, PROV) on oa:Annotation (who made the link)
and body resource (who stated it)
Typically a single oa:hasTarget, either the RO or an aggregated
resource.
oa:hasBody to a trig:Graph resource (read: RDF file) with the
“actual” annotation as RDF:
<workflow1> dct:title "The wonderful workflow" .
Multiple oa:hasTarget for relationships, e.g. graph body:
<workflow1> roterms:inputSelected <input2.txt> .
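Putting that together, one annotation in the manifest and its body graph could look like this sketch (URIs hypothetical):

  # In the manifest:
  @prefix oa:  <http://www.w3.org/ns/oa#> .
  @prefix pav: <http://purl.org/pav/> .

  <.ro/annotations/5> a oa:Annotation ;
      oa:hasTarget  <workflow1>, <input2.txt> ;    # multiple targets for a relationship
      oa:hasBody    <.ro/bodies/anno5.ttl> ;
      pav:createdBy <#ro-portal> .                 # who made the link

  <.ro/bodies/anno5.ttl> pav:authoredBy <#alice> . # who stated it

  # In the body graph .ro/bodies/anno5.ttl:
  @prefix dct:     <http://purl.org/dc/terms/> .
  @prefix roterms: <http://purl.org/wf4ever/roterms#> .

  <workflow1> dct:title "The wonderful workflow" ;
              roterms:inputSelected <input2.txt> .
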
What should we also be using?
Motivations
myExperiment: commenting, describing, moderating, questioning,
replying, tagging – made our own vocabulary as OA did not exist
Selectors on compound resources
E.g. description on processors within a workflow in a workflow
definition. How do you find this if you only know the workflow
definition file?
Currently: annotations target separate URIs for each component; those URIs are described in the workflow structure graph, which is itself the body of an annotation targeting the workflow definition file (a selector-based alternative is sketched after this list)
Importing/referring to annotations from other OA systems
(how to discover those?)
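For the selector case, one possible approach (not what the RO model currently does) using OA specific resources; the fragment value and file names are hypothetical:

  @prefix oa:  <http://www.w3.org/ns/oa#> .
  @prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .

  <#anno> a oa:Annotation ;
      oa:hasTarget [
          a oa:SpecificResource ;
          oa:hasSource   <workflow1.t2flow> ;              # the workflow definition file
          oa:hasSelector [
              a oa:FragmentSelector ;
              rdf:value "processor/FilterInvalidData"      # hypothetical fragment naming a processor
          ]
      ] ;
      oa:hasBody <.ro/bodies/filter-description.ttl> .
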
What is the benefit of OA for us?
Existing vocabulary – no need for our project to try to
specify and agree on our own way of tracking
annotations.
Potential interoperability with third-party annotation
tools
E.g. We want to annotate a figure in a paper and
relate it to a dataset in a research object – don’t
want to write another tool for that!
Existing annotations (pre research object) in Taverna
and myExperiment map easily to OA model
History lesson (AO/OAC/OA)
When forming the Wf4Ever Research Object model, we found:
Open Annotation Collaboration (OAC)
Annotation Ontology (AO)
What was the difference?
Technically, for Wf4Ever’s purposes: They are equivalent
Political choice: AO – supported by Utopia (Manchester)
We encouraged the formation of W3C Open Annotation
Community Group and a joint model
Next: Research Object model v0.2 and RO Bundle will use the
OA model – since we only used 2 properties, mapping is 1:1
http://www.wf4ever-project.org/wiki/display/docs/2011-09-26+Annotation+model+considerations
Saving a research object:
RO bundle
Single, transferable research object
Self-contained snapshot
Which files in ZIP, which are URIs? (Up to user/application)
Regular ZIP file, explored and unpacked with standard tools
JSON manifest is programmatically accessible without RDF
understanding
Works offline and in desktop applications – no REST API
access required
Basis for RO-enabled file formats, e.g. Taverna run bundle
Exchanged with myExperiment and RO tools
Workflow Results Bundle
ZIP folder structure (RO Bundle):
  mimetype   (application/vnd.wf4ever.robundle+zip)
  .ro/manifest.json
  workflowrun.prov.ttl   (RDF provenance of the run)
  intermediates/de/def2e58b-50e2-4949-9980-fd310166621a.txt
  inputA.txt
  outputA.txt
  outputB/1.txt, 2.txt, 3.txt
  outputC.jpg
[Figure callouts: URI references, execution environment, aggregating in Research Object, attribution, workflow]
https://w3id.org/bundle
RO Bundle: .ro/manifest.json
JSON-LD context → RDF (http://json-ld.org/)
The manifest records:
  Who made the RO and when (RO provenance), identifying people by http://orcid.org/
  What is aggregated: file in the ZIP or external URI
  Format and attribution ("Who?") for each aggregated resource
  External URIs placed in folders
  Embedded annotations, and external annotations, e.g. a blogpost
https://w3id.org/bundle
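The real manifest is JSON-LD; after applying its @context it corresponds to RDF roughly along these lines, a hedged Turtle rendering with hypothetical names and values:

  @prefix ore: <http://www.openarchives.org/ore/terms/> .
  @prefix dct: <http://purl.org/dc/terms/> .
  @prefix pav: <http://purl.org/pav/> .
  @prefix xsd: <http://www.w3.org/2001/XMLSchema#> .

  </> a ore:Aggregation ;
      pav:createdBy  <http://orcid.org/0000-0002-0000-0000> ;   # hypothetical ORCID of the RO creator
      pav:createdOn  "2013-06-24T12:00:00Z"^^xsd:dateTime ;
      ore:aggregates </inputA.txt>,                             # file inside the ZIP
                     </outputC.jpg>,
                     <http://example.com/external/dataset> .    # external URI reference

  </outputC.jpg> dct:format "image/jpeg" .                      # format of an aggregated file
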
Research Object as RDFa
http://www.oeg-upm.net/files/dgarijo/motifAnalysisSite/
<body resource="http://www.oeg-upm.net/files/dgarijo/motifAnalysisSite/"
      typeof="ore:Aggregation ro:ResearchObject">
  <span property="dc:creator prov:wasAttributedTo"
        resource="http://delicias.dia.fi.upm.es/members/DGarijo/#me"></span>
  <h3 property="dc:title">Common Motifs in Scientific Workflows:
      <br>An Empirical Analysis</h3>
  <li><a property="ore:aggregates" href="t2_workflow_set_eSci2012.v.0.9_FGCS.xls"
         typeof="ro:Resource">Analytics for Taverna workflows</a></li>
  <li><a property="ore:aggregates" href="WfCatalogue-AdditionalWingsDomains.xlsx"
         typeof="ro:Resource">Analytics for Wings workflows</a></li>
http://mayor2.dia.fi.upm.es/oeg-upm/files/dgarijo/motifAnalysisSite/
W3C community group for RO
http://www.w3.org/community/rosc/