User Manual - satrap

SATRAP v0.1 - Solid Assembly TRAnslation Program
User Manual
Introduction
A color space assembly must be translated into bases before bioinformatics
analyses can be applied. SATRAP is designed to accomplish this task
efficiently. The package integrates the Oases pipeline with several
optimizations specifically designed for color space management. Together, the
steps of the pipeline produce a SOLiD de novo transcriptome assembly and its
subsequent color space translation. Alternatively, SATRAP can be used as a
stand-alone program to perform color space translation of either RNA-seq or
DNA-seq SOLiD assemblies.
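To make the translation problem concrete, the following minimal sketch decodes a color space string with the general SOLiD two-base encoding, where each color is the XOR of the 2-bit codes of two adjacent bases (A=0, C=1, G=2, T=3). It is not SATRAP's algorithm; the function name decode_color_space is hypothetical and the sketch handles only the digits 0 to 3.

```python
BASES = "ACGT"  # 2-bit codes: A=0, C=1, G=2, T=3


def decode_color_space(first_base, colors):
    """Decode a string of colors (digits 0-3) starting from a known base.

    Each color is the XOR of the codes of two adjacent bases, so the next
    base is obtained by XOR-ing the current base code with the color.
    """
    code = BASES.index(first_base)
    decoded = [first_base]
    for color in colors:
        code ^= int(color)
        decoded.append(BASES[code])
    return "".join(decoded)


# A single wrong color changes every base that follows it.
print(decode_color_space("T", "0120"))  # TTGAA
print(decode_color_space("T", "0220"))  # TTCTT
```

The second call differs from the first by one color only, yet all bases after that position change. This is why a naive base-by-base decoding of an assembly is fragile and why the translation relies on the original reads.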
Installing
Programs are supported only on UNIX systems and are written in C++. If you want
to install them on a Windows system you may use the GCC cross compiler (not
tested). To compile the binaries, open a shell and type the command "make". All
binaries will be compiled into the "bin" directory. On 32-bit CPU systems you
can type "make -f Makefile32".
Please see the respective manuals to install the Oases and Velvet programs.
Remember that Oases and Velvet must be compiled using the “color” option.
This is an example for Oases:
make color 'VELVET_DIR=/full_path_of_velvet_dir/' 'MAXKMERLENGTH=63' \
'LONGSEQUENCES=1'
and this is the corresponding example for Velvet:
make color 'MAXKMERLENGTH=63' 'LONGSEQUENCES=1'
Important requirements
SOLiD DNA-seq or RNA-seq reads must first be converted into CSFASTQ format.
This can easily be done using the program csfasta_to_fastq included in the
SATRAP package. Alternatively, you can download the "CONVERSION TOOLS"
available at http://pass.cribi.unipd.it, which allow you to manipulate read
files and obtain the CSFASTQ format.
Easy setting for impatient users
Preparing your data
(1) Create the directory that will contain the SOLiD RNA-seq data.
    For instance: mkdir MY_DATASET
(2) If you have the native SOLiD data, convert them into csfastq format.
    For instance:
    ./csfasta_fastq -csfasta reads_file.fasta -qual quality_file.qual \
    > MY_DATASET/New_name.csfastq
    Repeat the same operation for all files.
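When many native files must be converted, the repeated command can be generated by a small script. The sketch below only builds the command lines, using the -csfasta and -qual options shown above. It assumes, purely for illustration, that each read file ends in ".csfasta" and has a quality file with the same name ending in ".qual"; the function name conversion_commands is hypothetical.

```python
from pathlib import Path


def conversion_commands(native_dir, out_dir, tool="./csfasta_fastq"):
    """Return (command, output_path) pairs, one per read/quality file pair.

    Assumption: reads are named NAME.csfasta and qualities NAME.qual.
    Read files without a matching quality file are skipped.
    """
    jobs = []
    for csfasta in sorted(Path(native_dir).glob("*.csfasta")):
        qual = csfasta.with_suffix(".qual")
        if not qual.exists():
            continue
        output = Path(out_dir) / (csfasta.stem + ".csfastq")
        command = [tool, "-csfasta", str(csfasta), "-qual", str(qual)]
        jobs.append((command, output))
    return jobs
```

Each command can then be run with subprocess.run(command, stdout=handle, check=True), where handle is the output file opened for writing, which reproduces the ">" redirection of the manual command.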
Important notice for paired-end assembling
If you run only the translation of a SOLiD assembly (Step 3b below), the
names of the two paired csfastq files must be identical except for the last
character before the suffix, for instance:
brain-replicaA-1.csfastq and brain-replicaA-2.csfastq
This makes it possible to distinguish the files associated with the two
sequenced ends. Note that the differing character must be at the end of the
file name and that the suffix remains unmodified. Please check the logs of
steps 1 and 3 to make sure that the files are correctly associated.
SATRAP execution
(3a) Execute the entire analysis. For paired-end libraries, make sure that
the read tag names are those indicated in the “-tags” parameter. For SOLiD
RNA-seq reads they are typically “_F3” and “_F5-RNA”.
For instance you can run:
bin/satrap -step 1 2 3 4 -reads_path DATASET/ -file_esten .csfastq \
-tags _F3 _F5-RNA -velvet_path bin/velvet_1.2.10 -oases_path \
bin/oases_0.2.8/ -q 18 -t1 5 -t2 0
-step 1 2 3 4   Enables all steps of the analysis
-reads_path     The directory containing the csfastq files
-file_esten     The extension of the csfastq files (“.csfastq”)
-tags           Sets the 2 tags of the sequenced ends
-velvet_path    The path of the installed Velvet program
-oases_path     The path of the installed Oases program
-q              Parameter of Step 1: read quality threshold
-t1             Parameter of Step 1: trims the reads of the first file at the 3' end
-t2             Parameter of Step 1: trims the reads of the second file at the 3' end
If the reads are not paired-end, do not specify the -tags parameter.
(3b) Execute only the translation of a SOLiD assembly.
You can run the following command for RNA assembly:
bin/satrap -step 3 4 -reads_path DATASET/ -file_esten .csfastq \
-fasta transcripts.fa -Q 9 -T2 5
-fasta          Parameter of Steps 3 and 4: a SOLiD color space assembly
-step 3 4       Enables steps 3 and 4 of the analysis
-reads_path     The directory containing the csfastq files
-file_esten     The extension of the csfastq files (“.csfastq”)
-Q              Parameter of Step 3: read quality threshold
-T2             Parameter of Step 3: trims the reads at the 3' end
Note that the -step parameter enables both steps 3 and 4, and that the reads
are treated as not paired-end because the -tags parameter is omitted.
Furthermore, the -fasta parameter indicates that the user intends to translate
their own color space assembly, generated with another pipeline.
For a SOLiD DNA-seq assembly the procedure is the same, but the option
“-no_clustering” should be specified.
Detailed general explanation
Memory requirements
The memory requirement for the color space translation is approximately 10
times the length of the color space assembly. However, depending on the
settings, both the assembly and the mapper programs could require more RAM.
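As a rough aid, the rule above can be applied to a FASTA file before starting the translation. The sketch below sums the sequence lengths and multiplies them by 10. The manual does not state the unit, so reading the result as bytes is an assumption; the function names are hypothetical.

```python
def assembly_length(fasta_lines):
    """Total sequence length of a FASTA assembly, ignoring header lines."""
    return sum(len(line.strip()) for line in fasta_lines
               if not line.startswith(">"))


def estimated_translation_memory(fasta_lines, factor=10):
    """Apply the 'assembly length x 10' rule (unit assumed to be bytes)."""
    return assembly_length(fasta_lines) * factor


example = [">contig1\n", "ACGT\n", "AC\n", ">contig2\n", "GGG\n"]
print(estimated_translation_memory(example))  # 90
```

For a real assembly the lines can be read with open("transcripts.fa"), the file name used in the example command above.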
Parameter setting
The entire pipeline includes four consecutive steps:
(1) Making double encoded reads for assembling
(2) Oases transcriptome assembly
(3) Double encoding for translation
(4) Color space translation
If the SOLiD raw data must be processed to produce a de novo transcriptome
assembly, all four steps are required (parameter -step 1 2 3 4).
Alternatively, if a color space assembly is already available, only steps 3 and
4 are required (parameter -step 3 4).
In either case you need to create a directory containing the csfastq files, or
symbolic links to them, and then define its path with the "-reads_path"
parameter. If your data are in the native color space format you can use the
program "csfasta_to_fastq" included in this package to convert them into the
csfastq format.
Example of conversion:
csfasta_fastq -csfasta reads_file -qual quality_file > DIR/reads.csfastq
The names of the csfastq files must be chosen carefully, especially if you want
to assemble SOLiD paired-end data. For instance, the files of sample 1 could be
named:
brain-replicaA-1.csfastq and brain-replicaA-2.csfastq
so that the two sequenced-end files differ only by the characters “1” and “2”
in their names (note that the differing character must be at the end of the
file name, while the suffix remains unmodified). In the same way the files of a
second replica could be named:
brain-replicaB-1.csfastq and brain-replicaB-2.csfastq etc.
This allows the alphanumerical sorting of the paired-end reads. Check the logs
of steps 1 and 3 to verify that the files were correctly associated. If the
reads are not paired-end, there are no restrictions on the file names.
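The naming rule can be checked before running the pipeline. The following sketch groups file names that are identical except for the last character before the suffix. It only mimics the rule described above and is not SATRAP's own association code; the function name pair_csfastq_files is hypothetical.

```python
from collections import defaultdict


def pair_csfastq_files(filenames, suffix=".csfastq"):
    """Group files whose names differ only in the last character before
    the suffix. Returns (pairs, unpaired), both sorted alphabetically."""
    groups = defaultdict(list)
    for name in sorted(filenames):
        if not name.endswith(suffix):
            continue
        stem = name[:-len(suffix)]
        if stem:
            groups[stem[:-1]].append(name)
    pairs = [tuple(names) for _, names in sorted(groups.items())
             if len(names) == 2]
    unpaired = sorted(name for names in groups.values()
                      if len(names) != 2 for name in names)
    return pairs, unpaired


files = ["brain-replicaB-2.csfastq", "brain-replicaA-1.csfastq",
         "brain-replicaA-2.csfastq", "brain-replicaB-1.csfastq"]
for first, second in pair_csfastq_files(files)[0]:
    print(first, second)
```

Files reported as unpaired are a hint that a name needs fixing; the logs of steps 1 and 3 remain the authoritative check.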
Parameters for general setting
-step        (vector<int>)  Set the steps to be performed
-bin         (string)       Set the directory path where binaries are located [bin/]
-tmp_dir     (string)       Set the temporary directory where results will be saved [tmp/]
-file_esten  (string)       Set the extension of input read files. Example: "-file_esten .csfastq"
-reads_path  (string)       Set the directory containing the SOLiD reads in CSFASTQ format
Step 1 – Making double encoded reads for assembling
(1) Skip this step if you already have a double encoded assembly to be
translated. The only purpose of this step is to produce the double encoded
reads to be assembled by an assembly pipeline (for instance Oases).
Depending on the hardware, users can tune the number of reads to be
assembled, also taking quality and trimming into account. The file 1.de,
saved in the STEP1 directory (inside the temporary directory), will contain
the double encoded reads. This file is not used for translation
(see Step 3).
(2) Create the directory containing the csfastq files, or symbolic links to
these files, and pass the directory path with the "-reads_path" parameter.
(3) Indicate the extension of the csfastq files with the "-file_esten"
parameter, usually “-file_esten .csfastq”.
Parameters for Step 1
(*) -reads_path  (string)         directory containing the SOLiD reads in CSFASTQ format
(*) -file_esten  (string)         extension of read files. Example: "-file_esten .csfastq"
    -max_reads   (float)          max number of reads per analyzed file or pair of files [10]
    -tags        (string,string)  paired-end tag names for assembling purposes. It enables
                                  paired-end management (-t1) (tag examples: F3, F5-RNA ...)
    -t1          (int)            trims the first sequenced end at 3' (if paired-end) [0]
    -t2          (int)            trims the second sequenced end at 3' [0]
    -q           (int)            minimum mean quality tolerated for paired-end sequences [15]
    -len         (int)            minimum read size after trimming [30]
    -mate-pair                    sequences coming from mate-pair libraries will be managed
                                  as paired-end (for assembling purposes) [disabled]
Important notice: for paired-end libraries the trimming function needs both
parameters -t1 and -t2 to be set. Non paired-end libraries require only the
parameter -t2, and the same trimming will be applied to all files.
(*) required input.
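The interplay of the trimming, quality and length parameters can be illustrated with a small filter. The sketch below trims a read at the 3' end, then applies the minimum length and the minimum mean quality. The order of these operations, and computing the mean on the trimmed read, are assumptions; the manual does not describe SATRAP's exact procedure, and the function name filter_read is hypothetical.

```python
def filter_read(colors, quals, trim3=0, min_mean_q=15, min_len=30):
    """Return the trimmed (colors, quals) pair, or None if the read fails.

    colors: color string; quals: list of per-position quality values of
    the same length. Defaults follow the -t2, -q and -len defaults.
    """
    if trim3 > 0:
        colors = colors[:-trim3]
        quals = quals[:-trim3]
    if len(colors) < min_len or not quals:
        return None
    if sum(quals) / len(quals) < min_mean_q:
        return None
    return colors, quals


read = filter_read("0" * 40, [20] * 40, trim3=5)
print(len(read[0]))  # 35
```

With these defaults a 32-color read trimmed by 5 is discarded, because only 27 colors remain.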
Step 2 – Oases pipeline processing
(1) This step executes the Oases pipeline. The bin/CONFIG file contains the
basic settings for Velvet and Oases. You can edit this file to change the
settings, but some parameters and the output are managed automatically by
the pipeline. Before running this step you must set the paths of the Velvet
and Oases binaries using the -velvet_path and -oases_path parameters.
Remember that Oases and Velvet must be compiled using the color option.
This is an example for Oases:
make color 'VELVET_DIR=/full_path_of_velvet_dir/' \
'MAXKMERLENGTH=63' 'LONGSEQUENCES=1'
and this is the corresponding example for Velvet:
make color 'MAXKMERLENGTH=63' 'LONGSEQUENCES=1'
Parameters for Step 2
-velvet_path      (string)       path to the Velvet binaries. Example: path/velvet/
-oases_path       (string)       path to the Oases binary. Example: path/oases/
-strand_specific                 Velvet will be set for strand-specific data
-kmer_set         (vector<int>)  set of kmers to be considered [23 25 27 29 31]
-oases_kmer       (int)          Oases kmer parameter [27]
Step 3 – Double encoding for translation
(1) First create the directory containing the csfastq files, or symbolic
links to these files, and pass the directory path to the "-reads_path"
parameter. If your data are in the native color space format you can use the
program "csfasta_to_fastq" included in this package to convert them into the
csfastq format. Example of conversion:
"csfasta_fastq -csfasta reads_file -qual quality_file > DIR/reads.csfastq"
(2) Indicate the extension of the csfastq files with the "-file_esten"
parameter.
Parameters for Step 3
(*) -reads_path  (string)  directory containing the SOLiD reads in CSFASTQ format
(*) -file_esten  (string)  extension of read files. Example: "-file_esten .csfastq"
    -T2          (int)     trims sequences at the 3' end [0]
    -Q           (int)     minimum mean quality for reads [9]
    -len         (int)     minimum read size after trimming [30]
Important note: if the -fasta parameter is not specified, steps 1 and 2 are
required.
(*) required input.
Step 4 – Color space translation
(1) This step executes the color space translation and requires two main
inputs: the output of Step 3 and the path of the color space assembly in
FASTA format. The latter can be set using the -fasta parameter.
Parameters for Step 4
(*) -fasta    (string)  Double encoded color space assembly in FASTA format.
    -l        (int)     Minimum contig length [100].
    -n        (float)   Maximum tolerated fraction of Ns for each translated contig [1].
    -c        (int)     Minimum coverage required to apply the assembly correction.
                        If this parameter is used, -z is not considered.
    -erode    (int)     Minimum coverage considered to erode contig ends [2].
    -z        (float)   z-score used to calculate the coverage threshold, based on
                        the statistical analysis of the sequence coverage [3]. Low
                        values are more conservative when the error correction is
                        applied: as a consequence, Ns will be introduced around
                        color incoherences not supported by enough sequence
                        coverage.
    -erosion            Does not erode contig ends in any way.
(*) required input if steps 1 and 2 are not executed.
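The effect of the -l and -n parameters can be pictured as a simple filter on translated contigs. The sketch below keeps a contig only if it reaches the minimum length and its fraction of Ns does not exceed the maximum. It illustrates the parameter meanings only; how and when SATRAP applies them internally is an assumption, and the function name keep_contig is hypothetical.

```python
def keep_contig(sequence, min_len=100, max_n_fraction=1.0):
    """Apply a minimum length (-l) and a maximum fraction of Ns (-n)."""
    if not sequence or len(sequence) < min_len:
        return False
    n_fraction = sequence.upper().count("N") / len(sequence)
    return n_fraction <= max_n_fraction


print(keep_contig("ACGT" * 30))                                 # True
print(keep_contig("N" * 60 + "A" * 60, max_n_fraction=0.4))     # False
```

With the default value of 1 for the maximum fraction, no contig is rejected because of its Ns.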
Output
A temporary directory will be created in the directory where the pipeline is
running, and all results will be saved in the "-tmp_dir" path. Inside this
directory the STEP* directories will be created, and the translated assembly
will be saved in the file STEP4/translated.fa. The files STEP4/clusters.* are
the output of the transcript clustering process.
The file STEP4/translated_clustered_transcrips.fa is the final output, and the
file STEP4/STEP4.log contains some statistics about the color space
translation.
CONFIG file description
It is possible to tune the settings of each program by modifying the bin/CONFIG
file.
Important notice: the row order follows the order of execution! If you erase a
row, the associated command will not be executed. We strongly suggest modifying
only field 4.
Field meaning
Field 1: Referred analysis step
Field 2: Referred loop or sub-step inside the analysis
Field 3: binary name to be executed
Field 4: base setting that doesn't need changes during the analysis or loops
Default configuration
# Multi-kmrs loop
STEP2  1  velveth_de      -
STEP2  1  velvetg_de      -read_trkg yes -min_contig_lgth 100
STEP2  1  oases_de        -min_trans_lgth 100
# Merging kmer assembly
STEP2  2  velveth_de      -
STEP2  2  velvetg_de      -read_trkg yes -conserveLong yes -min_contig_lgth 100
STEP2  2  oases_de        -merge yes -min_trans_lgth 100
# color space translation
STEP4  3  pass            -double_encoded -fid 90 -sam -query_size 300 -b -g 3 -pst_word_range 6 6
STEP4  3  cs2bs_assembly  -clean 30
# Remove the following rows for DNA-seq translation
STEP4  3  cd-hit-est      -T 0 -M 4000 -g 1
STEP4  3  fasta_remove    -l 50 -f 0.1 -oases
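The four-field layout described under "Field meaning" can be inspected with a few lines of code before editing the file. The sketch below assumes that fields are separated by whitespace, that field 4 is the remainder of the row, and that rows starting with "#" are comments. The real delimiter used by bin/CONFIG is not stated in this manual, the sample rows are one reading of the default configuration, and the function name parse_config is hypothetical.

```python
def parse_config(lines):
    """Split CONFIG rows into step, sub-step, binary and settings.

    Assumption: whitespace-separated fields; field 4 is the rest of the
    row; blank rows and rows starting with '#' are skipped.
    """
    rows = []
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        parts = line.split(None, 3)
        if len(parts) < 3:
            raise ValueError("row needs at least 3 fields: " + line)
        rows.append({
            "step": parts[0],
            "sub_step": parts[1],
            "binary": parts[2],
            "settings": parts[3] if len(parts) == 4 else "",
        })
    return rows


sample = [
    "# Multi-kmrs loop",
    "STEP2 1 velvetg_de -read_trkg yes -min_contig_lgth 100",
    "STEP4 3 cd-hit-est -T 0 -M 4000 -g 1",
]
for row in parse_config(sample):
    print(row["binary"], "->", row["settings"])
```

Because the row order is the execution order, the list returned by the sketch preserves it. Listing the binaries this way is a quick check that no row was removed by mistake.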