Using q: a step-by-step guide
Artur B. Veloso, Thomas E. Wilson
University of Michigan Medical School
Introduction
Installation
STEP 1: Check the prerequisites
STEP 2: Install and configure
Submitting and monitoring jobs
STEP 3: Establish a worker script
STEP 4: Wrap your worker script into a q file and submit it
STEP 5: Monitor your job(s)
Collecting input values
STEP 6: Handle multiple input values
STEP 7: Handle combinations of inputs
STEP 8: Collect information from the system at submission time
STEP 9: Inherit information from the environment
Executing multiple tasks
STEP 10: Serial jobs
STEP 11: Parallel jobs (arrays and threads)
Using modular pipelines
STEP 12: Create modular pipelines (slaves)
STEP 13: Pass information upstream from slave to master
STEP 14: Simplify your code with embedded files
Advanced scripting
STEP 15: Diversify your pipeline with other scripting languages
STEP 16: Optimize pipelines using built-in commands
Job and pipeline management
STEP 17: Manage jobs and handle errors
STEP 18: Protect your output data
STEP 19: Manage your pipeline
STEP 20: Organize your pipeline definition files
Web interfaces and data distributions
STEP 21: Use the q remote web interface
STEP 22: Share your work with others
Summary
Index
Using q: a step-by-step guide
Introduction
The q utility is a platform for integrating the management of data analysis pipelines. In brief, q defines a
minimal scripting language that acts as a wrapper around the worker scripts that actually execute the work
of your pipeline: scripts that you very likely already have and use, but that you have difficulty
organizing and tracking. Along the way, q allows you to make efficient use of system resources as your
work becomes increasingly complex.
This document provides a step-by-step tutorial on how to wrap worker scripts into an organized q
pipeline, submit the work to a job scheduler, and monitor and manage the resulting jobs. We'll start
simple, but build up to show you how complex pipelines can be assembled with just a few lines of code.
Concepts are presented in pieces, but your pipelines will likely combine many of these pieces.
STEP 1: Check the prerequisites
q requires that you already have the following available to you:
1) a Linux (or Linux-compatible) host server, with:
2) a user account, accessible via ssh or PuTTY
3) a job scheduler, either SGE (Sun Grid Engine) or Torque PBS
4) Perl
5) the bash system shell, /bin/bash (need not be the shell offered by the login host)
6) the GNU time utility, /usr/bin/time (different from the time utility built into many Linux systems)
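If you are unsure whether your system meets these requirements, a few standard commands can confirm most of them (the exact paths and output will vary by system, and 'qsub' may come from either SGE or Torque PBS):
$ which perl            # Perl interpreter
$ ls -l /bin/bash       # bash system shell
$ ls -l /usr/bin/time   # GNU time utility
$ which qsub            # job scheduler submission command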
STEP 2: Install and configure
To install q, unzip the q zip file into some server folder and run the configuration script:
$ cd /path/to/some/folder
$ gunzip q-<#.#.#>.zip
$ cd q-<#.#.#>
$ perl configure.pl
reading system information ............................
generating q program target
done
created q program target:
/path/to/some/folder/q-<#.#.#>/q
$ ./q
q version #.#.#
q is a utility for submitting, monitoring and managing data analysis pipelines
usage: q <command> [options] <masterFile> [...]
masterFile = path to a master instructions file
use 'q --help' or 'q <command> --help' for extended help
In most cases you will want to add '/path/to/some/folder/q-<#.#.#>' to your server system PATH
variable so that it is available to you from any server directory by simply typing 'q'. Details of how to do
this will depend on your server and account configuration.
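As one common approach (assuming your login shell is bash and reads ~/.bashrc), you could append an export line and then start a new session:
$ echo 'export PATH=$PATH:/path/to/some/folder/q-<#.#.#>' >> ~/.bashrc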
STEP 3: Establish a worker script
Let's imagine you already have the following worker script that you have previously prepared to do
some useful bit of data analysis work:
$ cat myScript.sh
#!/bin/bash
#$   -N parse_input_file
#PBS -N parse_input_file
INPUT_FILE="$1"
MATCH_PATTERN="$2"
OUTPUT_FILE="$INPUT_FILE.$MATCH_PATTERN"
echo "parsing $INPUT_FILE $MATCH_PATTERN lines to $OUTPUT_FILE"
grep $MATCH_PATTERN $INPUT_FILE | cut -f1 > $OUTPUT_FILE
that you currently submit to your job scheduler as follows:
$ qsub myScript.sh /path/to/my/input.file myPattern
Your job 45678 ("parse_input_file") has been submitted
and that generates the following output:
parsing /path/to/my/input.file myPattern lines to /path/to/my/input.file.myPattern
If everything above makes sense to you, skip to Step 4.
A detailed description of shell scripting and job schedulers is beyond the scope of this document, but
the following brief descriptions should give you enough information to continue with this tutorial.
The first script line is the "shebang" line. It tells the system what interpreter to use to execute the script:
#!/bin/bash
The following lines are scheduler directives. They tell the job scheduler (either SGE or PBS) certain
details about your job. In this instance, we have specified a job name ('parse_input_file').
#$   -N parse_input_file
#PBS -N parse_input_file
Notice that the script file provides directives for both SGE (#$) and PBS (#PBS). You will only use one
or the other scheduler on your system, but you may share your script with someone working with the
other scheduler, so best practice is to include both. For brevity, examples below will only use the SGE-format directive.
The next script lines collect and extend the input variables:
INPUT_FILE="$1"
MATCH_PATTERN="$2"
OUTPUT_FILE="$INPUT_FILE.$MATCH_PATTERN"
that you passed on the command line when submitting your job to the job scheduler using the 'qsub'
command ('/path/to/my/input.file' is assigned to $1, 'myPattern' is assigned to $2):
$ qsub myScript.sh /path/to/my/input.file myPattern
The last script lines provide some useful feedback text into the job's log file ('echo ...') and finally do the
actual work ('grep ...'):
echo "parsing $INPUT_FILE $MATCH_PATTERN lines to $OUTPUT_FILE"
grep $MATCH_PATTERN $INPUT_FILE | cut -f1 > $OUTPUT_FILE
In this case, the work uses the system utility 'grep' to find all lines in your input file that match your input
pattern, pipes those output lines ('|') to the system utility 'cut', which strips out just the first field/column
of every line, and finally redirects the final output ('>') to a new file.
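As a concrete illustration of what that pipeline accomplishes, consider a small hypothetical tab-delimited input file; only the first column of the matching lines survives:
$ cat input.file
geneA   myPattern      12
geneB   otherPattern   7
geneC   myPattern      3
$ grep myPattern input.file | cut -f1
geneA
geneC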
STEP 4: Wrap your worker script into a q file and submit it
The essence of q is that instead of submitting your worker script directly via qsub as was done above,
you will create a q wrapper around your script:
$ cat myMaster.q
$INPUT_FILE /path/to/my/input.file
$MATCH_PATTERN myPattern
qsub myScript.sh $INPUT_FILE $MATCH_PATTERN
and use q to submit the work to the scheduler:
$ q submit myMaster.q
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
no syntax errors detected
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
job_name                      array   job_ID   job_#   depends on job_#
parse_input_file ------------         45678    1
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
In the q master file ('myMaster.q'), the following is the basic white-space-delimited syntax for creating
and assigning values to variables:
$INPUT_FILE /path/to/my/input.file
Notice that the q command-line syntax follows the convention of many bioinformatics utility suites
that encompass many different commands - 'q <command>'. Use 'q --help' to see the available
commands. Many q commands deliberately mimic the job scheduler commands that they call (e.g. 'qsub' becomes
'q submit').
You might be thinking that it is pointless to wrap one script (the q file) around another script (your
worker script) – but keep reading! One nice feature you can already see is that q provides extensive
syntax checking of q instructions files. No jobs will be queued if any syntax errors exist.
STEP 5: Monitor your job(s)
Even with the incredibly simple pipeline defined in steps 3 and 4, q offers immediate advantages for job
monitoring. A key feature is that with q, all management is done at the pipeline level, rather than at the
job or user level. Thus, you can follow related groups of jobs as a unit.
To monitor the status of all pipeline jobs we just submitted:
$ q status myMaster.q
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
updating myMaster.q.status
updated    01/01/2012 01:00:00
qType      SGE
submitted  01/01/2012 12:00:00    myUser
job_name               array   job_ID   exit_status   start_time           wall_time   maxvmem
parse_input_file -------       45678    0             Sun 01/01/12 12:00   00:05:00    0.549G
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
q checks with the system on the job(s) status(es) and reports various bits of information. In this case,
we see that the job we just submitted ('parse_input_file') has already completed successfully, since its
exit status is 0; the status would instead be 'r' if the job were still running, etc. The information is permanently stored such
that the pipeline status will always be available by a call to 'q status'.
You can similarly recall other information about the pipeline, including the exact script file that ran a job:
$ q script --job 45678 myMaster.q
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
updating myMaster.q.status
=================================================================================
job: 45678
---------------------------------------------------------------------------------
.myMaster.q.data/script/SGE/parse_input_file.sh
#!/bin/bash
echo "q: running on host: $HOSTNAME"
echo
source "/path/to/some/folder/q-<#.#.#>/lib/utilities.sh"
checkPredecessors
getTaskID
#$ -N parse_input_file
#$ -j y
#$ -o .myMaster.q.data/log/SGE
INPUT_FILE="$1"
MATCH_PATTERN="$2"
OUTPUT_FILE="$INPUT_FILE.$MATCH_PATTERN"
echo "parsing $INPUT_FILE $MATCH_PATTERN lines to $OUTPUT_FILE"
grep $MATCH_PATTERN $INPUT_FILE | cut -f1 > $OUTPUT_FILE
=================================================================================
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
and the log file that the job created:
$ q report --job 45678 myMaster.q
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
updating myMaster.q.status
=================================================================================
job: 45678
---------------------------------------------------------------------------------
.myMaster.q.data/log/SGE/parse_input_file
---------------------------------------------------------------------------------
q: target script: .myMaster.q.data/script/SGE/parse_input_file.sh
q: execution started: Sun 01/01/12 12:00
q: running on host: myhost.org
parsing /path/to/my/input.file myPattern lines to /path/to/my/input.file.myPattern
q: exit_status: 0; walltime: 5:00.00; ...
q: execution ended: Sun 01/01/12 12:05
=================================================================================
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Notice that the script that was actually submitted to the scheduler was parsed and modified in order to
support some of the useful reporting features of q. Also, notice that you did not have to know anything
about the jobs, such as where their files have been deposited, in order to check any of the above
information. All you need to know to monitor and manage a pipeline is the path to the master q
instructions file that queued the pipeline jobs.
STEP 6: Handle multiple input values
The value of q becomes more evident as pipelines become capable of handling more inputs and more
jobs. First, let's modify our pipeline to allow it to handle many different search patterns at once.
$ cat myMaster.q
$INPUT_FILE /path/to/my/input.file
$MATCH_PATTERNS myPattern1 myPattern2
qsub myScript.sh $MATCH_PATTERN $MATCH_PATTERNS
$ cat myScript.sh
#!/bin/bash
#q require $INPUT_FILE $MATCH_PATTERN
#$ -N parse_input_file
OUTPUT_FILE="$INPUT_FILE.$MATCH_PATTERN"
echo "parsing $INPUT_FILE $MATCH_PATTERN lines to $OUTPUT_FILE"
grep $MATCH_PATTERN $INPUT_FILE | cut -f1 > $OUTPUT_FILE
Notice that we now establish a list of match patterns in the q line:
$MATCH_PATTERNS myPattern1 myPattern2
which are queued via a set of qsub calls by the line:
qsub myScript.sh $MATCH_PATTERN $MATCH_PATTERNS
where the syntax '$MATCH_PATTERN $MATCH_PATTERNS' tells q to execute qsub once for every
$MATCH_PATTERN in the list of $MATCH_PATTERNS, i.e. one job will be queued using 'myPattern1' and a
second using 'myPattern2'. Thus, with the same number of script lines, we were able to handle
multiple search patterns. Importantly, the two jobs are executed in parallel, which means that both will
be queued for immediate execution with neither job depending on the other. That makes sense, since
patterns can be searched for independently.
To support passing of multiple inputs, additional changes were made to the worker script. These
changes alter the way in which input information is passed to the worker to a more extensible and
versatile format. Variables are no longer passed on the command line via the implicit variables '$1' and
'$2', so the following lines were simply deleted:
INPUT_FILE="$1"
MATCH_PATTERN="$2"
Instead, the standard behavior of q is to pass all of its currently defined variables to the worker script as
environment variables. This means that a shell script queued by q can simply use a q variable without
having to define the variable itself, for example in the line:
echo "parsing $INPUT_FILE $MATCH_PATTERN lines to $OUTPUT_FILE"
The line that was added to the worker script is a new q-specific directive, '#q require':
#q require $INPUT_FILE $MATCH_PATTERN
which tells q that this worker script expects to receive values for variables $INPUT_FILE and
$MATCH_PATTERN. If the q file fails to define these, a syntax error will be thrown at submission time and
no jobs will be queued. The '#q require' directive is not necessary within a worker script; scripts will
work fine and receive passed variables even if this line is omitted. However, it is highly recommended
to use '#q require' directives as they allow q to perform syntax checking and thereby ensure that worker
scripts will have their required information at execution time (which could be hours or days away,
depending on how busy your server is).
STEP 7: Handle combinations of inputs
Next, let's imagine that you actually have multiple input files that you wish to search for each of your
multiple patterns:
$ cat myMaster.q
$INPUT_PATH /path/to/my/inputs
$INPUT_FILES $INPUT_PATH/input1.file $INPUT_PATH/input2.file
$MATCH_PATTERNS myPattern1 myPattern2
qsub myScript.sh $INPUT_FILE $INPUT_FILES * $MATCH_PATTERN $MATCH_PATTERNS
First, notice how q allows you to use variables in subsequent q lines, such that the line:
$INPUT_FILES $INPUT_PATH/input1.file $INPUT_PATH/input2.file
establishes a list of $INPUT_FILES using the previously defined $INPUT_PATH. More importantly, we
have now modified the call to 'qsub' to execute a combinatorial matrix of jobs:
qsub myScript.sh $INPUT_FILE $INPUT_FILES * $MATCH_PATTERN $MATCH_PATTERNS
where the operator '*' tells q to execute a job for every combination of $INPUT_FILES and
$MATCH_PATTERNS (here, 2 x 2 = 4 jobs). Continue to notice that we are greatly expanding the amount of queued work
through minor modifications of existing script lines. If we change the operator to '+':
qsub myScript.sh $INPUT_FILE $INPUT_FILES + $MATCH_PATTERN $MATCH_PATTERNS
we have now created a linear matrix of jobs, which queues one job for each pairing of the nth items on the
$INPUT_FILES and $MATCH_PATTERNS lists (i.e. one job using 'input1.file' and 'myPattern1' and a
second job using 'input2.file' and 'myPattern2').
STEP 8: Collect information from the system at submission time
To create a truly robust pipeline – one that can be recalled again and again – one might prefer that the
list of input files be determined dynamically at the time that the pipeline is queued, i.e. at submission
time. This is achieved in q by the following small modification:
$ cat myMaster.q
$INPUT_PATH /path/to/my/inputs
$INPUT_FILES run stat --printf "%n " $INPUT_PATH/*.file 2>/dev/null
dieUnless $INPUT_FILES
$MATCH_PATTERNS myPattern1 myPattern2
qsub myScript.sh $INPUT_FILE $INPUT_FILES * $MATCH_PATTERN $MATCH_PATTERNS
The following line collects all files matching '*.file' from $INPUT_PATH:
$INPUT_FILES run stat --printf "%n " $INPUT_PATH/*.file 2>/dev/null
where the keyword 'run' tells q that it should execute the command 'stat ...' immediately on the system
and assign the results of that command to $INPUT_FILES. If you do not understand the command,
consult 'stat --help' to see why that line does what is needed, which is to provide the list of files to q as
a white-space-delimited list. In this instance, we have also chosen to include an error catch:
dieUnless $INPUT_FILES
which will throw a syntax error if no matching files were found. Other similar q instructions die only if
files exist ('dieIf'), or exit the q file quietly without throwing an error ('exit', 'exitUnless', 'exitIf').
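To preview what q will capture into $INPUT_FILES, you can run the same command interactively on your server (a hypothetical example assuming two matching files already exist under /path/to/my/inputs):
$ stat --printf "%n " /path/to/my/inputs/*.file 2>/dev/null
/path/to/my/inputs/input1.file /path/to/my/inputs/input2.file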
With a dynamically populated pipeline, it is of course possible that the set of input files might change
over time. One might imagine running a pipeline today, and then wanting to run it again tomorrow on
only the new files that have appeared in $INPUT_PATH. q provides implicit management of this situation
by keeping track of the content of jobs that have been queued by a pipeline. To submit only new work,
use the command 'q extend' instead of 'q submit'. 'q extend' is also useful if you add new tasks to an
existing pipeline (see more below); only new tasks will be submitted by 'q extend'.
In addition to collecting system information, you might sometimes want to do a bit of work on the
system at submission time. Examples might be to create an output directory or to provide command
line feedback about the submission. For example:
$ cat myMaster.q
$INPUT_PATH /path/to/my/inputs
mkdir -p $INPUT_PATH
will create the directory '/path/to/my/inputs' when you execute 'q submit'. Notice that when you don't
need a return value, you can simply place the submission-time command inline; no 'run' keyword is
needed. As with variable assignments, any command recognized on the system can be executed at
submission time. In this way, your q files take on part of the flavor of shell scripts.
STEP 9: Inherit information from the environment
Some categories of information used by a pipeline are of broader scope than even the master file.
Some might be system-specific, perhaps the path to a program file. Others might define user
preferences, such as whether the job scheduler should send emails. Still others might define
information common to a project, where multiple different masters within that project should all get the
same value. To support such environment-level assignments, before 'q submit' acts on a master
instructions file it looks in three locations for files named 'environment.q' and 'environment.sh':
1) The directory in which the q program target resides
2) ~/.q (i.e. a directory named '.q' in the user's home directory)
3) The directory in which the master file resides
Any 'environment.q' files that are found are executed before the master file is executed, i.e. as if they
were the first lines in the master file. Any lines found in 'environment.sh' files are incorporated at the
top of all job scripts queued by the master. Environment files are acted on in the order indicated above,
so that, for example, a user might choose to override a piece of environment information specified in
the q directory. The following example:
$ cat environment.q
$INPUT_PATH /path/to/my/inputs
$ cat ~/.q/environment.sh
#$ -m N
$ cat myMaster.q
$INPUT_FILE $INPUT_PATH/input.file
$MATCH_PATTERN myPattern
qsub myScript.sh
shows how an environment-level path specification was passed to 'myMaster.q' and that a user has
specified that no jobs should send emails.
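Put another way (an equivalent view shown only for illustration), the 'environment.q' assignment above makes 'myMaster.q' behave as if it had been written as:
$INPUT_PATH /path/to/my/inputs
$INPUT_FILE $INPUT_PATH/input.file
$MATCH_PATTERN myPattern
qsub myScript.sh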
STEP 10: Execute multiple tasks – serial jobs
Up to this point, our pipeline only executes one task, albeit on multiple data values. Now let's make our
pipeline do some additional work on $OUTPUT_FILE:
$ cat myScript.sh
#!/bin/bash
#q require $INPUT_FILE $MATCH_PATTERN
#$ -N parse_input_file
OUTPUT_FILE="$INPUT_FILE.$MATCH_PATTERN"
echo "parsing $INPUT_FILE $MATCH_PATTERN lines to $OUTPUT_FILE"
grep $MATCH_PATTERN $INPUT_FILE | cut -f1 > $OUTPUT_FILE
wc -l $OUTPUT_FILE
Keeping our work very simple for the purpose of this tutorial, we have merely added a step to our
worker script that counts the number of lines in $OUTPUT_FILE:
wc -l $OUTPUT_FILE
where the count information is written to the log file and accessible using 'q report'. The key point is
that the 'grep ...' and 'wc ...' commands are executed sequentially by the worker script, i.e. in a serial
fashion. So, one queued job executes two serial tasks. Contrast that with the following:
$ cat myMaster.q
$INPUT_FILE /path/to/my/input.file
$MATCH_PATTERN myPattern
$OUTPUT_FILE $INPUT_FILE.$MATCH_PATTERN
qsub myScript1.sh
qsub myScript2.sh
$ cat myScript1.sh
#!/bin/bash
#q require $INPUT_FILE $MATCH_PATTERN $OUTPUT_FILE
#$ -N parse_input_file
echo "parsing $INPUT_FILE $MATCH_PATTERN lines to $OUTPUT_FILE"
grep $MATCH_PATTERN $INPUT_FILE | cut -f1 > $OUTPUT_FILE
$ cat myScript2.sh
#!/bin/bash
#q require $OUTPUT_FILE
#$ -N wc_output_file
echo "counting lines in $OUTPUT_FILE"
wc -l $OUTPUT_FILE
Here, the following lines tell q to submit two serial jobs, each of which will use a different worker script
to execute a single task:
qsub myScript1.sh
qsub myScript2.sh
where 'serial' means that the job corresponding to 'myScript2.sh' will not start execution until the job
corresponding to 'myScript1.sh' has exited successfully. Because the second worker script uses the
$OUTPUT_FILE of the first script, it is necessary that $OUTPUT_FILE be defined in 'myMaster.q' and
passed to each worker script, instead of being defined in 'myScript1.sh'. This is easily seen in the
following lines.
$OUTPUT_FILE $INPUT_FILE.$MATCH_PATTERN
#q require $INPUT_FILE $MATCH_PATTERN $OUTPUT_FILE
#q require $OUTPUT_FILE
It might not be obvious why it can be advantageous to use two serial jobs instead of incorporating serial
work into a single job. First, it may help you organize your work. More importantly, consider a situation
where one serial command uses a lot of memory but runs for a short time, while the second command
uses little memory but takes a long time. Queuing these as a single job with a large requested memory
would occupy all of that memory for the long time that the second command is running, which is
wasteful of system resources. In such instances, it is better to separate the work into two jobs.
STEP 11: Execute multiple tasks – parallel jobs (arrays and threads)
One of the most critical issues in efficient job execution and resource management is the ability to
submit independent jobs in a parallel fashion to different available processors. We have already
encountered one way in which q allows the submission of parallel jobs – matrices of input values to a
single task are always submitted in parallel (see Step 7).
A second parallelization strategy is implicit to the fact that worker scripts support any job scheduler
directives you might already be using. One relevant directive class requests that multiple processors
be assigned to the job, to be used by some appropriately aware program called by the job. q does not
alter this behavior so it will not be demonstrated further.
A third parallelization strategy is again implicit to available scheduler directives, namely the execution of
array jobs through the '-t' directive:
$ cat myMaster.q
$INPUT_PATH /path/to/my/inputs
$N_FILES 2
$MATCH_PATTERN myPattern
qsub myScript.sh
$ cat myScript.sh
#!/bin/bash
#q require $INPUT_PATH $N_FILES $MATCH_PATTERN
#$ -N parse_input_file
#$ -t 1-$N_FILES
INPUT_FILE="$INPUT_PATH/input$TASK_ID.file"
OUTPUT_FILE="$INPUT_FILE.$MATCH_PATTERN"
echo "parsing $INPUT_FILE $MATCH_PATTERN lines to $OUTPUT_FILE"
grep $MATCH_PATTERN $INPUT_FILE | cut -f1 > $OUTPUT_FILE
Array jobs are multiple iterations of the same job that differ only in the value of a task ID variable that is
set and passed by the job scheduler to the job. q bridges different platforms by always making the task
ID available through the common environment variable $TASK_ID. Thus, given that $N_FILES was
assigned a value of 2, the following lines:
#$ -t 1-$N_FILES
INPUT_FILE="$INPUT_PATH/input$TASK_ID.file"
will result in an array of two jobs being queued: one with $TASK_ID=1 that will act on the $INPUT_FILE
$INPUT_PATH/input1.file, and another with $TASK_ID=2 that will act on the $INPUT_FILE
$INPUT_PATH/input2.file. Array jobs are managed by the job scheduler, but q keeps track of them
for you in status and report calls, for example:
$ q status myMaster.q
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
updating myMaster.q.status
updated    01/01/2012 01:00:00
qType      SGE
submitted  01/01/2012 12:00:00    myUser
job_name               array   job_ID   exit_status   start_time           wall_time   maxvmem
parse_input_file ------- @     45678    0             Sun 01/01/12 12:00   00:05:00    0.549G
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
tells you that the job 'parse_input_file' is an array job because the '@' symbol has appeared in the
array column. Notice that you get only one status line for the entire array, but 'q report' will show a log
file for each task within the array.
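For reference, each scheduler exposes the array task index under its own environment variable (e.g. SGE_TASK_ID on SGE). Conceptually, the normalization that q performs resembles the following bash sketch; this is an illustration only, not q's actual code, and the PBS-side variable name may differ between scheduler versions:
# hypothetical sketch: fall back from the SGE name to a PBS name
TASK_ID=${SGE_TASK_ID:-$PBS_ARRAYID}
export TASK_ID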
Finally, q supports the creation of "threads" in q files, which allow series of jobs to be handled in parallel.
Consider the following (which is not well-written code; see Step 12):
$ cat myMaster.q
$INPUT_FILE /path/to/my/input.file
thread fork1
$MATCH_PATTERN myPattern1
$OUTPUT_FILE $INPUT_FILE.$MATCH_PATTERN
qsub myScript1.sh
qsub myScript2.sh
thread fork2
$MATCH_PATTERN myPattern2
$OUTPUT_FILE $INPUT_FILE.$MATCH_PATTERN
qsub myScript1.sh
qsub myScript2.sh
thread fusion fork1 fork2
qsub myScript3.sh
Here, we seek to call 'myScript1.sh' and 'myScript2.sh' on both 'myPattern1' and 'myPattern2'. Similar
to the above examples, we know that 'myScript2.sh' needs to begin after 'myScript1.sh', but
'myScript1.sh' can begin work immediately on each of the two patterns. The indicated thread
arrangement achieves this, because the line:
thread fork2
forces a break in the job dependency chain such that any job following the thread designation is no longer
dependent on jobs preceding it. As in any multi-threading environment, it is often necessary that job
threads be brought back together, which is manifest in the lines:
thread fusion fork1 fork2
qsub myScript3.sh
which declare that thread 'fusion' depends on successful completion of threads 'fork1' and 'fork2', and
thus that 'myScript3.sh' will not begin until both instances of 'myScript2.sh' have exited successfully.
One of the core functions of q is to keep track of these potentially many and complex job dependencies
for you. This information is communicated to you at submission time in the following manner:
$ q submit myMaster.q
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
no syntax errors detected
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
job_name                      array   job_ID   job_#   depends on job_#
parse_input_file ------------         45678    1
wc_output_file --------------         45679    2       1
parse_input_file ------------         45680    3
wc_output_file --------------         45681    4       3
myScript3_job ---------------         45682    5       2,4
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
where the rightmost two columns use a simplified job number designation to show you that job 2
depends on job 1, job 4 depends on job 3, and job 5 depends on both 2 and 4.
STEP 12: Create modular pipelines (slaves)
The last version of 'myMaster.q' shown in Step 11 is not a particularly good design because the lines:
$OUTPUT_FILE $INPUT_FILE.$MATCH_PATTERN
qsub myScript1.sh
qsub myScript2.sh
are repeated within the code. This is never good, since you might accidentally create two different
versions without realizing it. What you want is a modular structure with two calls to the above series of
lines. This is achieved in q using the 'invoke' instruction as follows:
$ cat myMaster.q
$INPUT_FILE /path/to/my/input.file
thread fork1
$MATCH_PATTERN myPattern1
invoke mySlave.q
thread fork2
$MATCH_PATTERN myPattern2
invoke mySlave.q
$ cat mySlave.q
$OUTPUT_FILE $INPUT_FILE.$MATCH_PATTERN
qsub myScript1.sh
qsub myScript2.sh
Here, we have created a module of code within a new q file called 'mySlave.q' and made a call to that
code in each master thread via the line:
invoke mySlave.q
In a well-designed pipeline, only master files identify the primary data to be analyzed. Downstream q
files called by the master are generic and only do what they are told by the master, hence their
designation as "slave" files. Slave files receive all variable information defined in the master at the point
at which the slave is invoked (and can also pass information back to the master, see Step 13). If a
slave attempts to use a variable that a master has not defined, a submission-time error is thrown and
no jobs are queued. Slaves can be invoked with multiple values and using linear and combinatorial
matrices exactly as described above for qsub commands; for example, the following are all valid and
result in multiple iterative parallel calls to 'mySlave.q':
invoke mySlave.q $VAR $VARS
invoke mySlave.q $VAR 1 2 3
invoke mySlave.q $VAR1 $VAR1S + $VAR2 $VAR2S
invoke mySlave.q $VAR1 $VAR1S * $VAR2 $VAR2S
Importantly, slave files can themselves invoke other q slaves – this is common in well-developed
pipelines. There is no limit to the amount of slave nesting that is possible. myMaster.q could invoke
mySlave1.q, which invokes mySlave2.q, which invokes mySlave3.q, ad infinitum.
Because slaves are generic and modular, it follows that you will often want to reuse them in different
pipelines. Just one example might be a slave that validates the md5sum of downloaded files, which
might be important for many projects you are working on. A well-designed generic and modular
structure also means that slaves can be easily shared with investigators seeking to perform the same
task as you, without the receiver having to rewrite your slave code to their purpose.
STEP 13: Pass information upstream from slave to master
In addition to slave files that queue jobs, many pipelines use slave files whose only purpose is to assign
values to variables. One common use might be a slave that communicates elements of a file
organization schema to a number of different masters, as illustrated here:
$ cat myMaster.q
invoke mySlave.q
$INPUT_FILE $INPUT_PATH/input.file
$MATCH_PATTERN myPattern
qsub myScript.sh
$ cat mySlave.q
preserve $INPUT_PATH /path/to/my/files
where the slave command 'preserve' tells q that '/path/to/my/files' should be passed upstream to the
master that invoked the slave and placed into master variable $INPUT_PATH. A shortcut for slaves that
assign values to multiple variables is 'preserve all':
$ cat mySlave.q
$INPUT_PATH /path/to/my/input/files
$OUTPUT_PATH /path/to/my/output/files
preserve all
which passes all slave values back to the master with the same variable names as defined in the slave.
A critical behavior occurs when a slave is subjected to multiple parallel invocations. Here, the slave
value that is being preserved is appended to a growing list in the master, with one new value added for
every slave invocation in the matrix. Thus, the following:
$ cat myMaster.q
invoke mySlave.q $FILE_NUMBER 1 2
$MATCH_PATTERN myPattern
qsub myScript.sh $INPUT_FILE $INPUT_FILES
$ cat mySlave.q
$INPUT_PATH /path/to/my/input/files
preserve $INPUT_FILES $INPUT_PATH/input$FILE_NUMBER.file
is yet another way of calling 'myScript.sh' on a list of files. Here, when the code reaches the master line:
qsub myScript.sh $INPUT_FILE $INPUT_FILES
$INPUT_FILES carries the value '/path/to/my/input/files/input1.file /path/to/my/input/files/input2.file',
NOT simply '/path/to/my/input/files/input2.file'. This behavior is by design, to allow slaves to create lists
in the master, but it is sometimes unexpected by people who think that only the last slave value will be
retained. It is to highlight this distinction that q uses the keyword 'preserve' instead of the more
common 'return', since 'return' does not append in other common scripting languages.
STEP 14: Simplify your code with embedded files
Up to this point, we have shown every worker script or q slave called by a master file as being recorded
as a new file on the file system. This becomes tedious when many small files are defined in a pipeline.
A method of simplifying the number of files you work with is to embed workers and slaves into a q file.
Thus, the last example in Step 13 can be re-written as:
$ cat myMaster.q
invoke mySlave.q $FILE_NUMBER 1 2
$MATCH_PATTERN myPattern
qsub myScript.sh $INPUT_FILE $INPUT_FILES
<file name="mySlave.q">
$INPUT_PATH /path/to/my/input/files
preserve $INPUT_FILES $INPUT_PATH/input$FILE_NUMBER.file
</file>
<file name="myScript.sh">
#!/bin/bash
#q require $INPUT_FILE $MATCH_PATTERN
#$ -N parse_input_file
OUTPUT_FILE="$INPUT_FILE.$MATCH_PATTERN"
echo "parsing $INPUT_FILE $MATCH_PATTERN lines to $OUTPUT_FILE"
grep $MATCH_PATTERN $INPUT_FILE | cut -f1 > $OUTPUT_FILE
</file>
where '<file name="...">...</file>' is an XML-like tag schema for marking off a block of text to be treated
as if it were a file on the file system. These embedded "virtual" files are called without any file path
information, i.e. simply as 'myScript.sh'. Here, it should be noted that throughout this tutorial we have
not provided file path information to disk scripts for brevity and clarity, but in fact qsub and invoke calls
do need to be made with appropriate file paths, such as:
$ cat myMaster.q
$SCRIPT_PATH /path/to/my/scripts
$INPUT_FILE /path/to/my/input.file
$MATCH_PATTERN myPattern
qsub $SCRIPT_PATH/myScript.sh $INPUT_FILE $MATCH_PATTERN
STEP 15: Diversify your pipeline with other scripting languages
Up to this point, we have only considered worker scripts written in the shell scripting language, which
are indeed a common and often exclusive target of job scheduler qsub commands. However, q allows
you to use worker scripts written in virtually any scripting language. Common non-shell examples are
Perl, Python, and R. The only requirements of your target script are that it:
1) be written in ASCII text (i.e. non-binary)
2) have an interpreter that can be specified on the shebang line
3) support „#‟ format comment lines
Thus, the following example uses Perl to perform the same bit of work that we have done up to now
using 'grep ...':
$ cat myMaster.q
$INPUT_FILE /path/to/my/input.file
$MATCH_PATTERNS myPattern1 myPattern2
qsub myScript.pl $MATCH_PATTERN $MATCH_PATTERNS
<file name="myScript.pl">
#!/usr/bin/perl
#q require $INPUT_FILE $MATCH_PATTERN
#$ -N parse_input_file
my $OUTPUT_FILE = "$ENV{INPUT_FILE}.$ENV{MATCH_PATTERN}";
print "parsing $ENV{INPUT_FILE} $ENV{MATCH_PATTERN} lines to $OUTPUT_FILE\n";
open my $inH, "<", $ENV{INPUT_FILE} or die "$!\n";
open my $outH, ">", $OUTPUT_FILE or die "$!\n";
while(<$inH>){
    $_ =~ m|$ENV{MATCH_PATTERN}| or next;
    chomp;
    my ($field) = split("\t", $_);
    print $outH "$field\n";
}
close $inH;
close $outH;
</file>
Notice that most facets of calling this non-shell worker script remain the same, including the use of
embedded files, the placement of q and job scheduler directives within the worker, and the ability to call
the worker with multiple parallel input values. It is also still true that q variables are passed to the
worker script as environment variables – the syntax of how these values are accessed will depend on
the scripting language (Perl uses the %ENV hash).
STEP 16: Optimize pipelines using built-in commands
Although you can use almost any text-based scripting language with q, there are advantages to shell-based
worker scripts. First, the example in Step 15 demonstrates that standard Linux shell commands
such as 'grep' and 'cut' streamline many tasks and simplify your code compared to what you will often
get when you rewrite these tasks in another language.
More importantly for this tutorial, q provides an expanding set of utilities that are available to any shell
worker script as single-word commands that can be called without you needing to worry about loading
the utilities source file (q does that for you). These q-provided utility functions are:
checkTaskID
    Ensure that $TASK_ID is defined, i.e. that this script has been successfully called as part of an
    array job.
checkForData "some command"
    Check that a data stream will have at least one line of data. Exit quietly without error if the
    stream has no data. This helps prevent errors when later steps of a data analysis stream crash
    if provided with zero lines of data.
checkPipe
    Check whether all components of a data stream have exited successfully, i.e. with exit status 0.
    If any part of a stream failed, the worker fails with exit status 100.
waitForFile $FILE [$TIME_OUT]
    Wait for up to $TIME_OUT seconds (default: 60) for a $FILE to appear on disk with at least 1 byte
    of data. This can help prevent a 2nd job in a series from beginning before a file written by the
    1st job has been recorded in the file system.
snipStream "some command" [$MAX_LINES]
    Excerpt a data stream into the log file being written by this job. The log "snip" will consist of
    $MAX_LINES (default: 10) lines taken from the head and tail of the stream.
snipFile $FILE [$MAX_LINES]
    Similar to snipStream, except applied to a file.
The following shows most of these commands in use in a single worker script:
$ cat myScript.sh
#!/bin/bash
#q require $INPUT_PATH $N_FILES $MATCH_PATTERN
#$ -N parse_input_file
#$ -t 1-$N_FILES
checkTaskID
INPUT_FILE="$INPUT_PATH/input$TASK_ID.file"
waitForFile $INPUT_FILE
OUTPUT_FILE="$INPUT_FILE.$MATCH_PATTERN"
checkForData "grep $MATCH_PATTERN $INPUT_FILE"
echo "parsing $INPUT_FILE $MATCH_PATTERN lines to $OUTPUT_FILE"
grep $MATCH_PATTERN $INPUT_FILE | cut -f1 > $OUTPUT_FILE
checkPipe
snipFile $OUTPUT_FILE
These built-in shell commands, in addition to other shell utilities used behind the scenes by q, require
functions specific to the bash shell. This is one reason that q requires that '/bin/bash' be available on
your host server system.
STEP 17: Manage jobs and handle errors
No matter how carefully you use the above features of q to optimize flow control in your pipeline, job
errors inevitably occur that affect not only the job that failed but also any jobs that were dependent on it
(called “successor” jobs). An important feature of q is that it remembers the dependency chains applied
to the jobs in your pipeline.
Here, we must consider one key difference between the SGE and PBS job schedulers. In PBS, when a
job fails (exit status != 0), any successor jobs are automatically deleted and will not run. This is not true
with SGE, where most non-zero exit statuses cause the job to fail and successor jobs to begin
immediately. However, SGE jobs that exit with status 100 are transitioned to error state 'Eqw' and held
indefinitely. Successor jobs continue to wait their turn with status 'hqw'.
q understands this difference, and this particular behavior of SGE, by wrapping all worker jobs in such a
way that they either succeed with exit status 0 or fail with exit status 100. In this way, successor jobs
will never begin on an SGE system when a predecessor job fails. This configuration also supports the
SGE-specific command 'q clear'. This command removes the Eqw state from a job, a series of jobs, or
all jobs in a pipeline, and causes them to restart anew. This is useful if jobs failed because of a data
source error that has now been corrected. It is useless if the underlying problem was a code error,
since clearing an error causes the exact same job to try again.
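Conceptually, the succeed-with-0-or-fail-with-100 wrapping described above behaves like the following bash sketch (a hypothetical illustration of the idea only, not q's actual implementation; the worker path is made up):
# run the wrapped worker, then collapse any failure to exit status 100
bash /path/to/worker/script.sh "$@"
[ $? -eq 0 ] || exit 100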
More generally, on both SGE and PBS, q supports manipulation of queued jobs using the 'q delete' and
'q resubmit' commands. 'q delete' kills pending or running job(s), while 'q resubmit' re-queues exact
replicas of previously queued jobs, in a manner analogous to their associated job scheduler
commands, 'qdel' and 'qresub'. The key difference is that the q commands obey the job dependency
chain specified by your pipeline definition files. Thus, if you target 'q delete' to a specific job, q ensures
that the entire dependency tree below that job is also deleted and recorded as such.
STEP 18: Protect your output data
The purpose of most pipelines is to create derivative data output files. It is important that these files be
protected against accidental loss. q provides two internal functions to assist with this – write-protection
and backup copying. Here, the added value of q is that it allows you to record what files you want
write-protected and backed up within the q files that define the pipeline, where information like file paths
are readily available:
$ cat myMaster.q
$INPUT_PATH /path/to/my/inputs
$INPUT_FILES $INPUT_PATH/input1.file $INPUT_PATH/input2.file
$MATCH_PATTERNS myPattern1 myPattern2
qsub myScript.sh $INPUT_FILE $INPUT_FILES * $MATCH_PATTERN $MATCH_PATTERNS
protect $INPUT_FILES
backupDir /path/to/my/backup
backup $INPUT_PATH
where 'protect' takes a single file, a list of files, a file glob, or a 'find' command that tells q what files to
write-protect, 'backupDir' is the local or remote directory where backup copies should be placed, and
'backup' identifies the directory (not the files) that should be backed up. When 'q submit' is applied to
'myMaster.q', protect and backup jobs are automatically added after all other jobs of the pipeline have
been queued. Additionally, you can reapply these functions using 'q protect' and 'q backup' from the
command line, and reverse them using 'q unprotect' and 'q restore'. Once again, notice how these
functions of q are applied at the pipeline level. The pipeline carries the information needed to protect
itself, and you protect the pipeline with a single command directed at the master file that defines the
pipeline, without needing to remember exactly what files it created.
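In spirit, the protection and backup actions correspond to ordinary shell operations like the following (a hypothetical illustration only, not q's internal code):
$ chmod a-w /path/to/my/inputs/*.file                       # write-protect the listed files
$ rsync -a /path/to/my/inputs/ /path/to/my/backup/inputs/   # copy the directory to the backup location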
q's approach to backup is robust and streamlined, but fairly simplistic. Sometimes much more complex
backup strategies are desirable, for example maintenance of incremental backups. In those instances,
users should use rsync directly.
STEP 19: Manage your pipeline
Often, you do not want a master file to submit jobs again, which would just re-execute finished work. q
provides internal protection against this eventuality by causing 'q submit' and 'q extend' to check
whether work has previously been queued and either prompt for permission to re-queue it ('q submit')
or ignore repeated jobs ('q extend'). To make absolutely sure that no work can get re-queued (for
example by users who like to use the q option '--force', which bypasses repeat-job checks), you can
execute 'q lock' on a master file, and reverse it using 'q unlock'.
Other times you will want to delete everything, or just the most recent things, you have done in a
pipeline. To support this, q maintains a rolling stack of the history of submission steps that have been
taken, given that q inherently does allow incremental submissions of a pipeline at different times for
various reasons discussed in Step 8. You can force a copy of the current status onto the stack using 'q
archive' (although typically the automatic copies are all that is needed), and view the most recently
archived status copy using 'q status --archive'. To delete just the most recent set of submitted jobs, use
'q rollback'. To delete all jobs and wipe the pipeline slate clean, use 'q purge'. Both 'q rollback' and 'q
purge' will, after getting permission, delete any running jobs as well as remove all status, log and script
files for the stack levels being deleted. They will not, however, remove data output files that might have
been created by the pipeline.
Lastly, use 'q move' to move a pipeline to a new file location. This is analogous to the Linux command
'mv', but you should not use 'mv' to move a q master file that has previously submitted jobs. This is
because the database of files that q uses to keep track of your pipeline is tied to the master file.
'q move' knows this and handles movement of all required files, as well as renaming content within
status, log and report files as needed.
STEP 20: Organize your pipeline definition files
One of the consistent attributes of an efficient pipeline is that it is neat and tidy, with an organized file
schema. There are many ways one might do this, but the following organization of q instructions files
and worker scripts is consistent with the logic of q and recommended:
/path/to/some/folder
|__projects
|  |__<projectName> ...
|     |__masters
|     |  |__<masterClass> ...
|     |     |__README
|     |     |__<masterClass>.q.template
|     |     |__<masterName>.q ...
|     |__slaves
|        |__<commonClass> ...
|        |  |__<slaveName>.q
|        |  |__<slaveName>.sh
|        |__<masterClass> ...
|           |__<slaveName>.q
|           |__<slaveName>.sh
|__slaves
   |__<commonClass> ...
      |__<slaveName>.q
      |__<slaveName>.sh
In words, work is often logically organized into projects (e.g. exome sequencing). Within each project,
you will often want or need to divide work into different classes of master tasks (e.g. read mapping and
variant calling). These classes then each have multiple iterations of named master files, where an
individual master file provides information about a specific set of data inputs (e.g. a new sequencing
run). As a unit of logical work, a master class specifies help information in a README file and provides
a q template file to make it easy to create new master file iterations.
Slave files called by the masters are kept under separate directories to prevent them from being
confused with masters and to support robust modular pipeline structures. The recommended
organization recognizes multiple types of modular slaves. Some slaves are specific to a master class
and define the work performed by the master class. These slaves are held in a slave directory of the
same name as the master class. Other slaves provide functions common to a project, for example a
file schema definition. These are held under common slave class directories. Still other slaves provide
common utilities used across many projects (e.g. file system utilities or standardized read mapping
algorithms). These are logically held outside of the projects scope.
The file schema above does not include data input files or pipeline output files. You should organize
these files as makes sense for your work, but in general it is recommended that data files be kept in
separate directories from the q instructions files and worker scripts that use and create them.
For the most part the file schema above is only a recommendation. When run from the command line,
q can just as easily be used with any alternative organization you prefer. However, if you wish to use
the web interface described below, you will need to honor at least the
'.../<projectName>/masters/<masterClass>/...' portion of the organization schema above.
STEP 21: Use the q remote web interface
Throughout this tutorial we have assumed you were working at the command line, which is where you
should learn the basic concepts of what q does. However, many users will soon want to transition to
q's graphical interface. This interface takes the form of a web server referred to as 'q remote', which is
accessed by users via any standard web browser. Once connected, q running on the job server can be
controlled by intuitive buttons and inputs, with q output displayed in the browser.
There are two modes in which q remote can be run, depending on whether you are in a position to install
a web server on your job server. In 'daemon mode', the q web server is installed on your local
computer, for example on your desktop or laptop PC, and accesses your job server via ssh or PuTTY.
This requires that each end user install Perl and two Perl modules on their PC, as well as generate and
install an RSA key for host authentication. Any Windows, Mac or Linux machine can be used. Once
installed, the q remote daemon can be made to point at any number of job servers by creating multiple
configuration files.
In 'server mode', the q web server is installed on the job server. The job server must therefore run
Apache httpd or some other http server. Users point their web browser to an address determined by
the web server. It is up to the system administrator to ensure that users accessing the web site are
subjected to appropriate authentication and authorization, for example using LDAP, since once
connected they will be able to run virtually any job on your server! Jobs are submitted to the job
scheduler under the user name assigned to the web server.
The interfaces provided by daemon and server modes are essentially identical once loaded. The modes
differ in the way that they are set up and accessed. In general, server mode will provide a faster
interface, but it requires installation by a system administrator on any job server to be accessed, while
the daemon is entirely under the user's control and is not specific to any one job server. More detailed
instructions for installing the q remote daemon and server are beyond the scope of this tutorial.
Additional help can be found in the README files within '/path/to/some/folder/q-<#.#.#>/remote',
and from documentation for the various things that need to be installed.
Once installed, point your web browser to the address provided to you either by the daemon or by your
system administrator. Once loaded, you first need to 'Add' a new project by providing the full server file
path to a project directory, where a project directory is defined as one that contains a 'masters'
subdirectory according to the file schema in Step 20. The project directory must already exist on the
job server; q remote does not create it for you. From there, select or create master classes and
master names, again according to the file schema in Step 20. Once you have selected a master file,
the remaining inputs should be intuitive if you have basic mastery of the q command line. Mouse-over
"tool tips" provide context-specific help for all inputs.
One functionality built into q remote is the ability to edit q instructions files, worker scripts, READMEs and
other files from within the web page. Here, the most common sequence is to create a new master file
as a replicate of the master class template, enter your sample information into the new master file,
then Save, Submit, and Monitor.
Finally, once you are enjoying the ease of managing q pipelines from the remote web interface, you
might want to be able to poke and prod a bit more freely around the file system without having to open
a separate ssh/PuTTY window. q remote supports this through the "Shell" button. The other reason you
may need to use Shell is to allow you to execute commands on the system as the web server user
when using q remote in server mode.
STEP 22: Share your work with others
Congratulations! You have performed all required analyses on your input data, hopefully having
invented and/or discovered something important along the way that you now wish to share with others,
for example in a publication. You should be able to easily share every last detail about your pipeline
design and the specific instances of its execution that led to your interpretations and conclusions.
However, this is often prohibitively difficult, leading to methods details in many bioinformatics papers
that are insufficient to truly replicate someone else's work.
q supports simple and fully transparent pipeline-level sharing. The command 'q publish' is, like all q
commands, directed at a master instructions file. It creates a static zipped HTML report of many
categories of information defined by that master file for sharing with anyone via standard web browsers.
This information includes (i) the code that defined the pipeline, organized in a hierarchy that reflects the
flow of pipeline execution, (ii) the status, log, script and environment files of all jobs queued by the
pipeline, and (iii) versions and help for system commands and target programs called by the pipeline.
Together, this information provides everything except the raw input data used by your pipeline to allow
the possibility of precise replication of your work, including exact copies of all scripts.
Once again, many options that alter the publication report are specified within the q instructions files
that define the pipeline, where relevant information is naturally available. The following example:
$ cat myMaster.q
$INPUT_FILE /path/to/my/input.file
$MATCH_PATTERN myPattern
qsub myScript.sh $INPUT_FILE $MATCH_PATTERN
publishDir /path/to/my/publication
publishTitle My Pipeline
publishMask patientID,/path/to/my/pipeline
shows a master file specifying the title of its publication HTML report (publishTitle) as well as the server
directory into which the report should be placed (publishDir). Here, it is important that multiple
pipeline master files can be published into the same directory. This allows for a single concerted HTML
report to be generated even when a pipeline's work was distributed across multiple masters (e.g.
Run_440 and Run_459 in the screenshot above). Finally, the instruction publishMask specifies a comma-delimited
list of strings that should be masked to the symbol '~' in the final publication report. This allows
the public distribution to be cleaner and free of potentially sensitive information about your system or
the samples under study.
Summary
This tutorial has covered a lot of ground in describing the many functions of q. Much of what q does is
harder to explain than it is to implement. You should go back to the beginning steps of this tutorial and
use them to start building your own pipeline. Start simple and most things will quickly become obvious
to you. Soon enough you will become addicted to how easy q makes many pipeline management
tasks.
The big picture to keep in mind is that q itself makes absolutely no assumptions about the kind of work
you want to do, what your application is, what tools you might be using, how you like to write code, etc.
What q provides is a consistent framework around whatever work you do that allows you to define,
manage and communicate that work in logical chunks using a streamlined interface. q helps you to
maximize your use of the job scheduler to get your work done as quickly as the resources of your
system will allow, in part by promoting many modes of parallelization and optimization. It also allows
you to think in larger units of work than you previously could when you were trying to manage that work
on a job-by-job basis. It remembers everything you have done for record keeping and to help you
share your work with others. If you have ever found yourself repeatedly submitting variations of the
same job but not being completely sure that you were submitting it the same way twice or how to report
your activities in a publication – q was designed for you!
Inevitably, q must have its own scripting language to control what your pipeline wrappers do, but this
language is intentionally very minimal with only a few keywords to learn. The concept is that q should
not get in the way of the work you want to do, or redefine how you choose to do it. Instead, q files
should provide you with a simple set of commands collected in one place that integrate and organize
your many data analysis tasks.
Index
In addition to this tutorial and index, use 'q --help' and 'q <command> --help' from the command line to
get help on q and its commands.
archive
arrays
backup
backupDir
checkForData
checkPipe
checkTaskID
clear
daemon mode
delete
dependency chains
dieIf
dieUnless
directives
dynamic variables
embedded files
environment
error handling
exit
exit status
exitIf
exitUnless
extend
file schema
<file> tag
history stack
installation
introduction
invoke
lists
lock
master files
matrices
modular design
move
non-shell scripts
parallel jobs
prerequisites
preserve
protect
publish
publishDir
publishIntro
publishMask
publishTitle
purge
q remote
qsub
report
require
restore
resubmit
rollback
run
script
serial jobs
server mode
shebang line
slave files
snipFile
snipStream
status
submission time
submit
summary
system commands
$TASK_ID
thread
unlock
unprotect
variable assignment
waitForFile
web interface
write-protection