Bee Workflow Manager

BeeWM
Béla Hullár
ETHZ, SIS
Bee Introduction
• iBrain, iBrain2, screeningBee,… BeeWM
• lightweight generic workflow management system
(main development focus on HCS)
• can handle different cluster types (LSF, SGE/UGE) and
data storage solutions (OpenBIS, Filesystem)
• Can be used without OpenBIS, but the main focus is
on supporting processing of data stored in OpenBIS
• Validation of processing results
• Data-provenance tracking, which can be used to avoid
re-computation (if a result of a compatible analysis exists
in OpenBIS, no computation is performed)
• Web interface
• Simple web service interface, which allows easy
integration
• Support for automatic data-processing
SyBIT Retreat 2014
Cluster integration
• LSF over ssh (Brutus)
• DRMAA – SGE locally or over ssh
• Can run any command-line tool that is able to run on
the cluster
• Resubmission in case of error; if the runtime limit
was exceeded, the job is resubmitted to a longer queue
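The resubmission rule can be sketched as follows; the function name, queue labels, and exception type are illustrative placeholders, not BeeWM's actual API:

```python
class RuntimeLimitError(Exception):
    """Raised when a cluster job is killed for exceeding its queue's runtime limit."""

def submit_with_escalation(submit, job, queues):
    """Try the job in each queue in order; on a runtime-limit
    violation, resubmit to the next (longer) queue."""
    for queue in queues:
        try:
            return submit(job, queue)
        except RuntimeLimitError:
            continue  # escalate to the next, longer queue
    raise RuntimeError(f"job {job!r} failed in all queues {queues}")
```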
Data access
• Data is staged automatically to the cluster and
results are moved back to storage
• Main focus on OpenBIS, but it can be used with
filesystem too
• Data is cached on the cluster filesystem
Workflows
• XML description
• A workflow is composed of Modules;
a Module is one processing step
• Modules can be independent or can depend on
data generated by other modules
• A module can be simple or parallel (submitting
multiple cluster jobs)
• Validation steps can be defined:
o Some file or files should exist
o A file should contain a given string
o File size
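The three validation step types above can be sketched as simple checks; this is an illustrative Python sketch, not BeeWM's implementation:

```python
import os

def validate_exists(path):
    """Check that the expected output file exists."""
    return os.path.exists(path)

def validate_contains(path, needle):
    """Check that the file contains a given string."""
    with open(path, "r", errors="replace") as f:
        return needle in f.read()

def validate_size(path, min_bytes=1):
    """Check that the file has at least a minimum size in bytes."""
    return os.path.getsize(path) >= min_bytes
```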
Using Bee
• RESTful API for submitting and managing workflows
• A web UI
• OpenBIS integration
o Web app
o Automatic processing plugin
Web Interface
Web Interface - Submission
Web Interface - Monitoring
OpenBIS Integration
Equivalence Checking
• To avoid unnecessary re-computation
• To make complex analyses of large numbers of
datasets easier
• To provide provenance information
Use case: HCS in InfectX/TargetInfectX
Basic Concepts
• Derived measurement: generated by the WM
• Raw measurement: anything else, e.g. datasets
generated by measurement devices, parameter
files, ...
• Resolved version number: 1.0.2, 3.1.2
• Unresolved version number: 1.0.*, 3.*
• Note: when we define the workflow we can use an
unresolved version, but the actual analysis is done
with a particular version, which will be stored with
the results
• Matching: 3.* ~ 3.1.2 (but 1.0.* does not match 1.1.1)
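Assuming the prefix semantics above (a trailing * matches any remaining version components), unresolved-vs-resolved matching can be sketched as:

```python
def version_matches(unresolved, resolved):
    """Check whether an unresolved version pattern (e.g. '3.*')
    matches a fully resolved version (e.g. '3.1.2').
    A trailing '*' matches any remaining components."""
    u = unresolved.split(".")
    r = resolved.split(".")
    for i, part in enumerate(u):
        if part == "*":
            return True  # wildcard covers the rest
        if i >= len(r) or part != r[i]:
            return False
    return len(u) == len(r)  # a fully resolved pattern must match exactly
```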
• Dataset equivalence class:
o Raw measurement: dataset id
o Derived measurement: 3-tuple: module name, module version, and the list of
dataset equivalence classes of all input datasets (recursively).
• Two datasets A and B are defined as equivalent under
the equivalence relation if and only if, for their
equivalence classes EA and EB, the
following hold:
o the module name of EA is a string match of the module name of EB
o the module version of EA matches the resolved module version of EB
o recursively the same is true for the list of equivalence classes of the input
datasets
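A minimal sketch of this recursive equivalence check, with equivalence classes modeled as tuples; the data model is an illustration, not BeeWM's actual one:

```python
def version_matches(unresolved, resolved):
    """A trailing '*' matches any remaining version components."""
    u, r = unresolved.split("."), resolved.split(".")
    for i, part in enumerate(u):
        if part == "*":
            return True
        if i >= len(r) or part != r[i]:
            return False
    return len(u) == len(r)

def equivalent(ea, eb):
    """Recursive equivalence of dataset equivalence classes.
    Raw classes compare by dataset id; derived classes compare
    module name, module version (possibly unresolved vs resolved),
    and, recursively, the classes of all input datasets."""
    if ea[0] == "raw" or eb[0] == "raw":
        return ea == eb  # raw measurements: same dataset id
    _, name_a, ver_a, inputs_a = ea
    _, name_b, ver_b, inputs_b = eb
    if name_a != name_b or not version_matches(ver_a, ver_b):
        return False
    if len(inputs_a) != len(inputs_b):
        return False
    return all(equivalent(a, b) for a, b in zip(inputs_a, inputs_b))
```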
Implementation details
Provenance
Raw
Derived
• When a result dataset is ready, the tree of
provenance objects is built and serialized
• Provenance information is stored as JSON (via JAXB)
• Each result dataset has parent datasets, which are
the input datasets of the workflow
• When a workflow is submitted, we query all the
child datasets of the workflow inputs from
OpenBIS
• The provenance information tree of these objects is
built up automatically by JAXB
• For each output dataset definition in the
workflow description we build the tree of
unresolved provenance and try to find a match in
the queried list
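A sketch of how such a provenance tree could be built and serialized to JSON; the dictionary layout is an assumption for illustration (BeeWM itself serializes Java objects via JAXB):

```python
import json

def provenance_tree(dataset):
    """Build a serializable provenance tree for a dataset:
    raw measurements carry only their dataset id; derived
    measurements carry module name/version and, recursively,
    the provenance of all their input datasets."""
    if dataset["kind"] == "raw":
        return {"kind": "raw", "id": dataset["id"]}
    return {
        "kind": "derived",
        "module": dataset["module"],
        "version": dataset["version"],
        "inputs": [provenance_tree(d) for d in dataset["inputs"]],
    }

def serialize_provenance(dataset):
    """Serialize the provenance tree to JSON, to be stored with the result."""
    return json.dumps(provenance_tree(dataset), sort_keys=True)
```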
Other Issues
• Not all output datasets are stored:
o We only have to find matching datasets for the relevant datasets
o Example: A -> A1, A2 and B -> B1; A1 and B1 are relevant, A2 is not. We
found an equivalent dataset for A1 but not for B1, so B must run; since A2
is an input of B and was not stored, A has to be run again!
o After finding all the equivalent datasets for the workflow, module states
have to be set properly and the necessary datasets have to be staged from
storage
• Reuse of equivalent datasets as input:
o Instead of generating raw provenance from the dataset id, the full
provenance information has to be used where these datasets were
reused
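The rerun decision in the A/B example can be sketched as a fixed-point computation over the module graph; the names and data structures are assumptions for illustration:

```python
def modules_to_run(modules, relevant, equivalents, stored):
    """Decide which modules must be (re)run.
    modules: {name: {"inputs": [...], "outputs": [...]}}
    relevant: outputs the workflow must deliver
    equivalents: outputs for which an equivalent dataset exists
    stored: outputs actually available from storage
    A module runs if one of its relevant outputs has no equivalent,
    or if a module that runs needs an output of it that is not stored."""
    producer = {out: name for name, m in modules.items() for out in m["outputs"]}
    run = {name for name, m in modules.items()
           if any(o in relevant and o not in equivalents for o in m["outputs"])}
    changed = True
    while changed:  # propagate the "must run" state upstream to a fixed point
        changed = False
        for name in list(run):
            for inp in modules[name]["inputs"]:
                p = producer.get(inp)
                if p and p not in run and inp not in stored:
                    run.add(p)
                    changed = True
    return run
```

With A producing A1 (relevant, equivalent found) and A2 (not stored), and B consuming A2 to produce B1 (relevant, no equivalent), both A and B must run.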