Processing public plant RNA-seq data: challenges and pitfalls

A scavenger hunt for expression data
Dries Vaneechoutte
Prof. Klaas Vandepoele
The goal
• Build a compendium of all publicly available RNA-Seq experiments for a given species
The plan
I. Obtain RNA-Seq data from NCBI
• Some easy queries, right?
II. Curate the metadata
• Select interesting experiments
III. Process RNA-Seq data in bulk
• Finding the best tool for the job
X. Cool bioinformatics stuff
• Co-expression analysis
• Study alternative splicing
• Gene function prediction
• …
I. Obtain RNA-Seq data from NCBI
http://nedroid.com/2012/05/honk-the-databus/
• Strains?
• Mutants?
• Replicates?
=> Manual curation required
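The "easy queries" step can be sketched as a call to NCBI's E-utilities `esearch` endpoint. This is a minimal sketch, not the query used for the compendium: the species, the search term with its `[Organism]` and `[Strategy]` field tags, and the helper name `build_sra_search_url` are assumptions for illustration. The code only builds the URL and does not contact NCBI.

```python
from urllib.parse import urlencode

# Public E-utilities search endpoint.
ESEARCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def build_sra_search_url(species, retmax=100):
    """Build an esearch URL for RNA-Seq entries of one species in the SRA.

    Hypothetical helper: the search term is an assumed starting point and
    will likely need tuning for each species.
    """
    term = f'"{species}"[Organism] AND "rna seq"[Strategy]'
    params = {"db": "sra", "term": term, "retmax": retmax}
    return f"{ESEARCH_URL}?{urlencode(params)}"


# Example species chosen for illustration only.
url = build_sra_search_url("Arabidopsis thaliana")
print(url)
```

Even when such a query returns the right runs, the strain, mutant, and replicate questions above remain unanswered by the search itself, which is why manual curation follows.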
Metadata example: the good
RNA-Seq study with 6 experiments
Subject: cytokinin response
Strain: Columbia-0
Tissue: seedling
Age: 10 day
Replicates: replicate 3
Condition: cytokinin treated
Mutant: no
Metadata example: the bad
RNA-Seq study with 16 experiments
Subject: “BL” response (blue light?)
Strain: Columbia-0
Tissue: ???
Age: ???
Replicates: ???
Condition: ???
Mutant: ???
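The difference between the good and the bad example can be checked mechanically before any manual curation starts. The following is a minimal sketch under assumptions: the field names mirror the labels on the slides, while the record layout (a plain dict per experiment) and the helper `missing_fields` are hypothetical.

```python
# Fields shown on the metadata slides; the dict layout is an assumption.
REQUIRED_FIELDS = ["strain", "tissue", "age", "replicate", "condition", "mutant"]


def missing_fields(record):
    """Return the required fields that are absent or empty in one record."""
    missing = []
    for field in REQUIRED_FIELDS:
        value = record.get(field)
        if value is None or not str(value).strip():
            missing.append(field)
    return missing


good = {
    "strain": "Columbia-0",
    "tissue": "seedling",
    "age": "10 day",
    "replicate": "replicate 3",
    "condition": "cytokinin treated",
    "mutant": "no",
}
bad = {"strain": "Columbia-0"}

print(missing_fields(good))
print(missing_fields(bad))
```

A study whose experiments all report many missing fields is a candidate for exclusion or for a closer look at the linked publication.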
• Metadata on NCBI’s Sequence Read Archive
• Important information is missing or is scattered across pages
• Manual curation on NCBI is a slow process
• Solution: Curation tool with graphical interface
• All metadata per study in a single window
• Automatic highlights and colored buttons to identify replicates
II. Curate the metadata
Curation tool
• Curation tool
  • Speeds up curation (~250 experiments per hour)
  • Reduces mistakes
  • Not limited to RNA-Seq
  • Not publicly released yet
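Since the curation tool is not publicly released, its replicate highlighting cannot be shown here. As a rough illustration of the idea only, the sketch below groups experiments whose metadata is identical apart from the replicate label. The function `group_replicates`, the experiment IDs, and the "control" condition are hypothetical and do not come from the tool.

```python
from collections import defaultdict


def group_replicates(experiments, ignore=("id", "replicate")):
    """Group experiment IDs whose metadata matches on all non-ignored fields."""
    groups = defaultdict(list)
    for exp in experiments:
        # The key is every metadata field except the ID and replicate label.
        key = tuple(sorted((k, v) for k, v in exp.items() if k not in ignore))
        groups[key].append(exp["id"])
    # Only groups with more than one member are candidate replicate sets.
    return [ids for ids in groups.values() if len(ids) > 1]


# Hypothetical records for illustration.
experiments = [
    {"id": "exp1", "condition": "cytokinin treated", "tissue": "seedling", "replicate": "1"},
    {"id": "exp2", "condition": "cytokinin treated", "tissue": "seedling", "replicate": "2"},
    {"id": "exp3", "condition": "control", "tissue": "seedling", "replicate": "1"},
]

print(group_replicates(experiments))
```

Exact matching like this only works on clean metadata; with free-text fields, a curator still has to confirm each suggested group.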
III. Process RNA-Seq data in bulk
Finding the right tool for the job
Pipeline: Download → Quality control → Read mapping → Read counting
(figure labels: 18, 125, 72, 3)
Quality Control
• Do you need quality trimming/filtering?
  • Modern sequencers produce high-quality reads
  • Software can compensate
  • Editing your reads might introduce bias
• Choose the workflow that best suits your data
Read mapping:
• Some tools are easier to use than others
• TopHat
  tophat -G <gtf_file> -o <output_dir> <tophat_index> <fastq_file>
• STAR
  STAR --genomeDir <star_index> --readFilesIn <fastq_file> --outSAMstrandField intronMotif --outFilterIntronMotifs RemoveNoncanonicalUnannotated --outFilterMultimapNmax <mismatches> --outFilterMismatchNoverLmax <percent_mismatch> --alignIntronMin <min_intron> --alignIntronMax <max_intron> --outFileNamePrefix <output_file>
• Take ease of use into account: more parameters offer more ways to screw up.
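When commands like these are run in bulk, one way to limit mistakes is to assemble them in a single function rather than retyping them for each experiment. The sketch below wraps the TopHat call shown above; the function name `tophat_command` and the file names are hypothetical, and only the flags from the slide are used.

```python
import shlex


def tophat_command(gtf_file, output_dir, tophat_index, fastq_file):
    """Return the TopHat call from the slide as an argument list."""
    return ["tophat", "-G", gtf_file, "-o", output_dir, tophat_index, fastq_file]


# Placeholder file names for illustration only.
cmd = tophat_command("genes.gtf", "tophat_out", "genome_index", "reads.fastq")

# Print the command for inspection before running anything.
print(shlex.join(cmd))
```

Passing the list to `subprocess.run(cmd, check=True)` would execute it, assuming TopHat is installed and on the `PATH`. Keeping the arguments in a list avoids shell quoting problems with unusual file names.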
Read counting:
• Too many tools to test yourself (Kanitz et al., 2015)
• Search for benchmark papers to pick your tools
Pipeline: Download (50 million reads) → Quality control → Read mapping → Read counting
(running times shown: 10-15 minutes, 15-60 minutes, 5-20 minutes)
• Tools can vary in
• Running time
• Memory requirements
• Stability
• Take computational cost into account
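To see why computational cost matters at compendium scale, the three running times quoted on the slide (10-15, 15-60 and 5-20 minutes for a 50 million read experiment) can be added up and multiplied by the number of experiments. This is a back-of-the-envelope sketch: the count of 1000 experiments is a hypothetical figure, and the steps are assumed to run one after another on a single machine.

```python
# (min, max) running times in minutes, taken from the slide.
STEP_MINUTES = [(10, 15), (15, 60), (5, 20)]


def total_hours(n_experiments, step_minutes=STEP_MINUTES):
    """Return (low, high) total running time in hours for sequential processing."""
    low = sum(lo for lo, _ in step_minutes) * n_experiments / 60
    high = sum(hi for _, hi in step_minutes) * n_experiments / 60
    return low, high


# 1000 experiments is an assumed compendium size, not a number from the talk.
low, high = total_hours(1000)
print(f"{low:.0f} to {high:.0f} hours")
```

Under these assumptions the total lies between roughly 500 and 1583 hours, so the choice between a fast and a slow tool changes the total by weeks of computing time.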
Alignment-free (5-10 minutes), shown next to the pipeline steps (10-15, 15-60 and 5-20 minutes)
• Groundbreaking new techniques can pop up any time
• Keep your pipelines up-to-date
Conclusion
• Metadata on NCBI’s Sequence Read Archive
• Querying for RNA-Seq data is harder than it should be
• Don’t be lazy when uploading your data
• The curation tool speeds up curation
• Choosing the right tool for the job
  • Choose the workflow that best suits your data
  • Take ease of use into account
  • Search for benchmark papers
  • Take computational cost into account
  • Keep your pipelines up-to-date