BeeWM
Béla Hullár
ETHZ, SIS

SyBIT Retreat 2014

Introduction
• iBrain, iBrain2, screeningBee, …

BeeWM
• Lightweight, generic workflow management system (main development focus on HCS)
• Can handle different cluster types (LSF, SGE/UGE) and data-storage solutions (OpenBIS, filesystem)
• Can be used without OpenBIS, but the main focus is on supporting processing of data stored in OpenBIS
• Validation of processing results
• Data-provenance tracking, which can be used to avoid re-computation (if a result of a compatible analysis already exists in OpenBIS, no computation is performed)
• Web interface
• Simple web-service interface, which allows easy integration
• Support for automatic data processing

Cluster Integration
• LSF over SSH (Brutus)
• DRMAA – SGE locally or over SSH
• Can run any command-line tool that runs on the cluster
• Resubmission in case of error – if the runtime limit was violated, the job is resubmitted to a longer queue

Data Access
• Data is staged automatically to the cluster and moved back to the storage
• Main focus on OpenBIS, but it can be used with a plain filesystem too
• Data is cached on the cluster filesystem

Workflows
• XML description
• A workflow is composed of modules; a module is one processing step
• Modules can be independent or can depend on data generated by other modules
• A module can be simple or parallel (submitting multiple cluster jobs)
• Validation steps can be defined:
  o Some file or files should exist
  o A file should contain a given string
  o File size

Using Bee
• RESTful API for submitting and managing workflows
• A web UI
• OpenBIS integration
  o Web app
  o Automatic-processing plugin

Web Interface
Web Interface – Submission
Web Interface – Monitoring
OpenBIS Integration

Equivalence Checking
• To avoid unnecessary re-computation
• To make complex analyses of large numbers of datasets easier
• Providing provenance
information

Use Case: HCS in InfectX/TargetInfectX

Basic Concepts
• Derived measurement: generated by the WM
• Raw measurement: anything else, e.g. datasets generated by measurement devices, parameter files, …
• Resolved version number: 1.0.2, 3.1.2
• Unresolved version number: 1.0.*, 3.*
• Note: when we define the workflow we can use an unresolved version, but the actual analysis is done with a particular resolved version, which is stored with the results
• Matching: 3.* ~ 3.1.2, but 1.0.* does not match 1.1.1

Dataset Equivalence
• Dataset equivalence class:
  o Raw measurement: the dataset id
  o Derived measurement: a 3-tuple of module name, module version, and the list of dataset equivalence classes of all input datasets (recursively)
• Two datasets A and B are equivalent under the equivalence relation if and only if, for their equivalence classes EA and EB:
  o the module name of EA is a string match of the module name of EB
  o the module version of EA matches the resolved module version of EB
  o recursively, the same holds for the lists of equivalence classes of the input datasets

Implementation Details
(Provenance diagram: raw and derived datasets)
• When a result dataset is ready, the tree of provenance objects is built and serialized
• Provenance information is stored as JSON (via JAXB)
• Each result dataset has parent datasets, which are the input datasets from the workflow
• When a workflow is submitted, we query all child datasets of the workflow inputs from OpenBIS
• The provenance-information trees of these objects are built up automatically by JAXB
• For each output-dataset definition in the workflow description, we build the tree of the unresolved provenance and try to find a match in the queried list

Other Issues
• Not all output datasets are stored:
  o We only have to find matching datasets for the relevant datasets
  o A → A1, A2 and B → B1; A1 and B1 are
relevant, A2 is not. We found an equivalent dataset for A1 but not for B1; A2 is an input of B, so A had to be run again!
  o After finding all the equivalent datasets for a workflow, the module states have to be set properly and the necessary datasets have to be staged from storage
• Reuse of equivalent datasets as input:
  o Instead of generating raw provenance from the dataset id, the full provenance information of the reused datasets has to be used
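The matching and equivalence rules described under Basic Concepts and Dataset Equivalence can be sketched in a few lines of Python. This is a minimal illustration under a prefix reading of unresolved versions; all names are hypothetical and are not BeeWM's actual API.

```python
# Sketch of the version-matching and equivalence rules from the slides.
# All names are hypothetical; this is not BeeWM's actual implementation.

def version_matches(unresolved: str, resolved: str) -> bool:
    """'3.*' matches '3.1.2'; '1.0.*' does not match '1.1.1'."""
    if unresolved.endswith("*"):
        return resolved.startswith(unresolved[:-1])  # prefix match: '3.*' -> '3.'
    return unresolved == resolved                    # already resolved: exact match

# Equivalence class of a dataset:
#   raw measurement     -> its dataset id (a plain string)
#   derived measurement -> (module name, module version, [input classes ...])
def equivalent(ea, eb) -> bool:
    """ea carries unresolved versions (from the workflow description),
    eb carries resolved versions (from a result stored in OpenBIS)."""
    if isinstance(ea, str) or isinstance(eb, str):
        return ea == eb  # raw measurements: equivalent iff same dataset id
    name_a, ver_a, inputs_a = ea
    name_b, ver_b, inputs_b = eb
    return (name_a == name_b
            and version_matches(ver_a, ver_b)
            and len(inputs_a) == len(inputs_b)
            and all(equivalent(a, b) for a, b in zip(inputs_a, inputs_b)))

# Unresolved provenance from the workflow vs. a stored result in OpenBIS:
wanted = ("segment", "3.*",   ["DS-1", ("illumcorr", "1.*",   ["DS-2"])])
stored = ("segment", "3.1.2", ["DS-1", ("illumcorr", "1.0.4", ["DS-2"])])
print(equivalent(wanted, stored))  # True: the stored result can be reused
```

The recursion mirrors the provenance tree: a match succeeds only when the module names and versions agree at every level, all the way down to the raw dataset ids.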
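The three validation-step types listed under Workflows (some files should exist, a file should contain a given string, a file-size check) could look like the following standalone sketch; the function names are invented for illustration and do not reflect BeeWM's validation code.

```python
import os

# Illustrative sketch of the three validation-step types a module can
# declare. Hypothetical names, not BeeWM's actual validation API.

def files_exist(*paths) -> bool:
    """Some file or files should exist."""
    return all(os.path.exists(p) for p in paths)

def file_contains(path, needle: str) -> bool:
    """A file should contain a given string."""
    with open(path, encoding="utf-8") as fh:
        return needle in fh.read()

def file_size_at_least(path, min_bytes: int) -> bool:
    """File-size check, e.g. guard against empty or truncated output."""
    return os.path.getsize(path) >= min_bytes

def validate(checks) -> bool:
    """Run all checks declared for a module; the step fails if any fails."""
    return all(check() for check in checks)
```

A module would then pass, for example, `[lambda: files_exist("out.csv"), lambda: file_contains("out.csv", "well")]` to `validate` after its cluster jobs finish.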