Perl Object Layer & Pipelines
Pipelines
–Steve Fischer
John Iodice Deborah Pinney
Mark Heiges Ed Robinson
Perl Object Layer
–Brian Brunk
Mark Gibson Dave Barkan
Pipeline Introduction
• Sequential steps of
– Plugin calls
– Script calls
– Cluster jobs
• Purpose
– Codifies the process of creating the data set
– Reduces human resources
– Reduces human error and omissions
Two Pipeline Types
• Resources pipeline
– Downloads resources from external sources
– Loads resources into database
– Example: NRDB files
• Analysis pipeline
– Extract data from database
– Run analysis programs on data
• On main or cluster server
– Put value added data back into database
Resource Pipeline
• Invoked by:
– loadresources xmlfile propfile
• Take a tour of a resources XML file
Resources Repository
• Destination of downloads
• Houses files in a file system
• Serves as a cache for files
• Has API to access files by name and version
• If you request an existing file by name and
version, repository returns it without
downloading
– But the wget arguments must match (these are
remembered by the repository)
• Particularly useful if multiple projects want to
synchronize their data input
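
The caching rule above can be sketched in a few lines of Perl. This is a minimal in-memory illustration, not the repository's real API: the package name SketchRepository, the fetch method, the version string 'v1' and the downloader code ref (a stand-in for the wget call) are all hypothetical. The slides do not say what happens when the wget arguments differ, so the sketch assumes the request is rejected.

```perl
use strict;
use warnings;

# Hypothetical in-memory stand-in for the resources repository.
package SketchRepository;

sub new {
    my ($class, $downloader) = @_;
    # $downloader is a code ref standing in for the real wget call.
    return bless { downloader => $downloader, entries => {} }, $class;
}

# Return the content for (name, version), downloading only on a cache miss.
sub fetch {
    my ($self, $name, $version, $wgetArgs) = @_;
    my $key   = "$name/$version";
    my $entry = $self->{entries}{$key};

    if ($entry) {
        # Assumption: a cached entry is only valid for the wget arguments
        # it was originally fetched with, and a mismatch is an error.
        die "wget arguments for $key differ from those remembered\n"
            unless $entry->{args} eq $wgetArgs;
        return $entry->{content};
    }

    my $content = $self->{downloader}->($name, $version, $wgetArgs);
    $self->{entries}{$key} = { args => $wgetArgs, content => $content };
    return $content;
}

package main;

my $downloads = 0;
my $repo = SketchRepository->new(
    sub { $downloads++; return "data for $_[0] $_[1]" }
);

my $first  = $repo->fetch('NRDB', 'v1', '--passive-ftp');   # downloads
my $second = $repo->fetch('NRDB', 'v1', '--passive-ftp');   # served from cache
my $mismatchAccepted = eval { $repo->fetch('NRDB', 'v1', '--tries=3'); 1 };
```

Two projects that ask for the same name and version with the same arguments receive the same cached file, which is how the repository keeps their inputs in sync.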
Analysis Pipeline
• Take a tour of the analysis pipeline file
• Take a tour of the Steps.pm file
• Take a tour of the property file
Pipeline Directory Structure
• The directory which houses all the information
for the pipeline including:
– Input data
– Logs
– Result data
– Pipeline control information:
• Which steps have been completed
• Property files to control cluster
• Structured for easy comprehension
• Take a tour of the directory structure
Analysis Pipeline API
• GUS::Pipeline::Manager.pm
– Declares properties
– Prevents completed steps from being rerun
– Calls plugins
– Executes commands
– Eases communication with cluster
• GUS::Pipeline::MakeTaskDirs.pm
– Helps make directories expected by distribjob on the
cluster
• GUS::Pipeline::TaskRunAndValidate.pm
– Helps run a series of tasks on the cluster
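
One way to keep completed steps from rerunning is to record each finished step in the pipeline directory and check for that record before running. The sketch below illustrates the idea only; it is not how GUS::Pipeline::Manager is implemented. The function run_step_once, the ".done" marker files and the step names are hypothetical.

```perl
use strict;
use warnings;
use File::Temp qw(tempdir);
use File::Spec;

# Hypothetical helper: run a step once, recording completion as a marker file.
sub run_step_once {
    my ($signalDir, $stepName, $action) = @_;
    my $marker = File::Spec->catfile($signalDir, "$stepName.done");

    return 0 if -e $marker;    # completed on an earlier run, so skip it

    $action->();               # if this dies, no marker is written

    open(my $fh, '>', $marker) or die "Cannot write $marker: $!";
    close($fh) or die "Cannot close $marker: $!";
    return 1;
}

# A temporary directory stands in for the pipeline's control directory.
my $signalDir = tempdir(CLEANUP => 1);
my @log;

# Simulate invoking the whole pipeline twice.
for my $attempt (1 .. 2) {
    run_step_once($signalDir, 'extractSeqs', sub { push @log, 'extractSeqs' });
    run_step_once($signalDir, 'runBlast',    sub { push @log, 'runBlast' });
}
```

Because a failed step leaves no marker, restarting the pipeline resumes at the first step that has not yet finished.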
DJob
• Manages the distribution of tasks across a
compute cluster
• Handles the case of a very large number of
inputs which are processed independently and
uniformly
• For example, blasting a set of EST against a
genome
• Now available for clusters using the PBS cluster
scheduler
• http://core.pcbi.upenn.edu/tools/liniactools.html
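
Distributing many independent, uniform inputs usually starts by cutting the input list into fixed-size subtasks that can be handed to cluster nodes. The sketch below shows only that splitting step. The function make_subtasks and the EST names are hypothetical, and the slides do not describe how DJob itself divides the work.

```perl
use strict;
use warnings;

# Split a list of inputs into subtasks of at most $size inputs each.
sub make_subtasks {
    my ($size, @inputs) = @_;
    die "subtask size must be positive\n" unless $size > 0;

    my @subtasks;
    while (@inputs) {
        # splice removes and returns up to $size inputs from the front.
        push @subtasks, [ splice(@inputs, 0, $size) ];
    }
    return @subtasks;
}

my @ests     = map { "EST_$_" } 1 .. 7;
my @subtasks = make_subtasks(3, @ests);
my @sizes    = map { scalar @$_ } @subtasks;   # sizes of the three subtasks
```

Because each input is processed independently, any subtask can run on any node and a failed subtask can be rerun without touching the others.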
Perl Object Layer
http://www.cbil.upenn.edu/~brunkb/PERL_Objects.html
Perl Object Layer
• Simplifies database interactivity
• Manages parent-child relationships
• Manages submits (inserts, updates and deletes)
– Submits children recursively
– Automatic versioning
– Sets default attributes (Ex. row_user_id)
• Enforces read/write permissions
• Code generator - objects consistent with db
• Extracts meta data from db
• Prints to XML and parses XML into objects
DbiDatabase Module
• Creates login to the database
• Allows use of all database objects
• Has methods to get meta information
– Ex: getTable(tableName) returns a DbiTable for
access of FK and PK attributes
• DbiDatabase object automatically instantiated by
plugins
• DbiDatabase objects must be explicitly
instantiated in scripts
Object Constructor
• TableName->new($hashRef)
Retrieving objects from DB
• retrieveFromDB(\@attributesToNotRetrieve)
• Returns 1 if successful
– Attribute values already set on the object
constrain the query
• Returns 0 if not successful
– No rows or multiple rows match
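
The return-value rule can be illustrated with a small mock. The package SketchRow and its in-memory table are hypothetical stand-ins for a generated table class and the database. The mock's retrieveFromDB imitates only the 1/0 behaviour described above and ignores the \@attributesToNotRetrieve argument.

```perl
use strict;
use warnings;

package SketchRow;

# Hypothetical in-memory "table"; the real layer queries the database.
my @TABLE = (
    { id => 1, name => 'alpha', source => 'x' },
    { id => 2, name => 'beta',  source => 'x' },
);

sub new {
    my ($class, $attrs) = @_;
    return bless { %{ $attrs || {} } }, $class;
}

sub get { return $_[0]->{ $_[1] } }

# Return 1 and fill in the remaining attributes when exactly one row
# matches the attributes already set; return 0 for no rows or several.
sub retrieveFromDB {
    my ($self) = @_;
    my @matches = grep {
        my $row = $_;
        !grep { !defined $row->{$_} || $row->{$_} ne $self->{$_} } keys %$self
    } @TABLE;
    return 0 unless @matches == 1;
    %$self = %{ $matches[0] };
    return 1;
}

package main;

my $unique    = SketchRow->new({ name => 'alpha' });
my $found     = $unique->retrieveFromDB();                               # one row
my $ambiguous = SketchRow->new({ source => 'x' })->retrieveFromDB();     # two rows
my $missing   = SketchRow->new({ name => 'gamma' })->retrieveFromDB();   # no rows
```

A caller would typically test the return value before using the object, since a 0 covers both the missing and the ambiguous case.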
Getting and Setting Attributes
• Attributes can be set using the individual
object
– Preferred, for additional functionality
– Ex: setRowUserId($userId);
• Attributes can be set using the superclass
– set('row_user_id',$userId);
• Get methods use similar syntax
– getRowUserId()
– get('row_user_id')
Managing submits to database
• submit($notDeep, $noTran)
– $notDeep = 1 only submits self but not
children
– $noTran = 1 does not begin or commit a
transaction
• addToSubmitList($object)
– Additional $object gets submitted after
main object and its children are submitted
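
The submit order described above can be shown with a mock that records what gets written. SketchObject, addChild and the object names are hypothetical. Transactions, and therefore $noTran, are left out, and the mock does not try to show how $notDeep interacts with the submit list, which the slides do not specify.

```perl
use strict;
use warnings;

package SketchObject;

our @WRITTEN;    # records the order in which objects are "written"

sub new {
    my ($class, $name) = @_;
    return bless { name => $name, children => [], extra => [] }, $class;
}

sub addChild        { push @{ $_[0]{children} }, $_[1]; return $_[0] }
sub addToSubmitList { push @{ $_[0]{extra} },    $_[1]; return $_[0] }

# Write self, then children (unless $notDeep), then the submit list.
sub submit {
    my ($self, $notDeep) = @_;
    push @WRITTEN, $self->{name};
    unless ($notDeep) {
        $_->submit() for @{ $self->{children} };
    }
    $_->submit() for @{ $self->{extra} };
    return 1;
}

package main;

my $gene = SketchObject->new('gene');
$gene->addChild(SketchObject->new('exon1'));
$gene->addChild(SketchObject->new('exon2'));
$gene->addToSubmitList(SketchObject->new('similarity'));
$gene->submit();                         # deep submit
my @deep = @SketchObject::WRITTEN;

@SketchObject::WRITTEN = ();
my $rna = SketchObject->new('rna');
$rna->addChild(SketchObject->new('protein'));
$rna->submit(1);                         # $notDeep = 1: self only
my @shallow = @SketchObject::WRITTEN;
```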
Managing Parents
• setParent($p)
• getParent($className, $retrieveIfNoParent,
\@doNotRetrieveAttributes)
• retrieveParentFromDB($className,
\@doNotRetrieveAttributes)
Managing Memory
• undefPointerCache()
– MUST be called in each loop to allow
garbage collection.
– Removes all child and parent pointers so
they cannot be retrieved.
• All other methods are automatic
– addToPointerCache($ob)
– getFromPointerCache($object_reference)
– removeFromPointerCache($ob)
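
A likely reason the call is mandatory is that Perl frees memory by reference counting, so a parent and child that point at each other are never released on their own. The slides do not state this, so treat it as an assumption. The sketch below demonstrates the effect with a hypothetical SketchNode class and a hypothetical undefPointers method; it is not the object layer's pointer cache.

```perl
use strict;
use warnings;

package SketchNode;

our $DESTROYED = 0;    # counts objects that Perl has actually freed

sub new { return bless { parent => undef, children => [] }, $_[0] }

sub addChild {
    my ($self, $child) = @_;
    push @{ $self->{children} }, $child;
    $child->{parent} = $self;    # parent <-> child forms a reference cycle
    return $self;
}

# Break the cycle by dropping the parent and child pointers.
sub undefPointers {
    my ($self) = @_;
    $_->{parent} = undef for @{ $self->{children} };
    $self->{children} = [];
    return;
}

sub DESTROY { $DESTROYED++ }

package main;

# Loop without cleanup: the cycle keeps both objects alive.
for my $i (1 .. 3) {
    my $parent = SketchNode->new();
    $parent->addChild(SketchNode->new());
}
my $freedWithoutCleanup = $SketchNode::DESTROYED;

# Loop with cleanup: both objects are freed in every iteration.
for my $i (1 .. 3) {
    my $parent = SketchNode->new();
    $parent->addChild(SketchNode->new());
    $parent->undefPointers();
}
my $freedWithCleanup = $SketchNode::DESTROYED;
```

In a plugin that loops over thousands of rows, the first pattern grows memory steadily while the second stays flat.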
Managing deletes
• Deletes occur in two steps
• markDeleted($doChildren)
– Mark self deleted
– If $doChildren = 1 then does this
recursively
• Deletes occur with submit
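
The two-step delete can be sketched with a mock in which a hash stands in for the database rows. SketchRecord, its %STORE hash and addChild are hypothetical. Deleting children before their parent is an assumption made here so that child rows never outlive the row they reference.

```perl
use strict;
use warnings;

package SketchRecord;

our %STORE;    # stands in for database rows, keyed by name

sub new {
    my ($class, $name) = @_;
    my $self = bless { name => $name, children => [], deleted => 0 }, $class;
    $STORE{$name} = 1;
    return $self;
}

sub addChild { push @{ $_[0]{children} }, $_[1]; return $_[0] }

# Step 1: flag the object (and optionally its children); nothing is removed yet.
sub markDeleted {
    my ($self, $doChildren) = @_;
    $self->{deleted} = 1;
    if ($doChildren) {
        $_->markDeleted(1) for @{ $self->{children} };
    }
    return;
}

# Step 2: submit carries out the deletes that were flagged.
sub submit {
    my ($self) = @_;
    $_->submit() for @{ $self->{children} };
    delete $STORE{ $self->{name} } if $self->{deleted};
    return 1;
}

package main;

my $gene = SketchRecord->new('gene');
$gene->addChild(SketchRecord->new('exon'));

$gene->markDeleted(1);
my $rowsAfterMark = keys %SketchRecord::STORE;      # rows still present

$gene->submit();
my $rowsAfterSubmit = keys %SketchRecord::STORE;    # rows now removed
```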
Managing Children
• getChildren($className,
$retrieveIfNoChildren, $getDeletedToo,
$where,\@doNotRetrieveAttributes)
• getAllChildren($retrieve, $getDeletedToo,
$where)
• retrieveChildrenFromDB($className,
$resetIfHave,
$where,\@doNotRetrieveAttributes )
• retrieveAllChildrenFromDB($recursive,
$resetIfHave)
Methods for dealing with sequence
• getSequence()
• setSequence($sequence)
– removes returns and non-sequence
characters and then sets.
• getFeatureSequence()
– retrieves substring of sequence to which
that feature points
• toFasta($type)
– If $type = 1, the id used is the aa_sequence_id
(or na_sequence_id); otherwise it is the
source_id
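
The clean-up done by setSequence and the output of toFasta can be approximated with two standalone functions. Both functions are hypothetical helpers, not the object layer's code. The set of characters kept (letters and '*'), the default 60-column wrap and the id 'src_001' are assumptions.

```perl
use strict;
use warnings;

# Strip line breaks and anything that is not a letter or '*'.
sub clean_sequence {
    my ($raw) = @_;
    (my $seq = $raw) =~ s/[^A-Za-z*]//g;
    return $seq;
}

# Format a FASTA record, wrapping the sequence at $width characters.
sub to_fasta {
    my ($id, $seq, $width) = @_;
    $width ||= 60;
    my $body = join "\n", unpack("(A$width)*", $seq);
    return ">$id\n$body\n";
}

my $seq   = clean_sequence("ACGT acgt\n12 NNNN\r\n");
my $fasta = to_fasta('src_001', $seq, 4);
```

Cleaning on the way in means every later consumer, including the FASTA writer, can assume the stored sequence contains no whitespace.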
Printing
• toString()
• toXML($indent, $suppressDef, $doXmlIds,
$family)
– $suppressDef = 1 default attributes below
modification_date are suppressed
– $doXmlIds = 1 will print XML ids in the
object tags
– $family = 1 will print parent/child
relationships in object tags rather than
nesting children
Checking read and write permissions
• checkReadPermission()
• checkWritePermission()
