Processing public plant RNA-seq data: challenges and pitfalls
A scavenger hunt for expression data
Dries Vaneechoutte
Prof. Klaas Vandepoele

The goal
Build a compendium of all publicly available RNA-Seq experiments for a given species

The plan
I. Obtain RNA-Seq data from NCBI
  • Some easy queries, right?
II. Curate the metadata
  • Select interesting experiments
III. Process RNA-Seq data in bulk
  • Finding the best tool for the job
X. Cool bioinformatics stuff
  • Co-expression analysis
  • Study alternative splicing
  • Gene function prediction
  • …

I. Obtain RNA-Seq data from NCBI
[Image: http://nedroid.com/2012/05/honk-the-databus/]
Strains? Mutants? Replicates?
=> Manual curation required

Metadata example: the good
RNA-Seq study with 6 experiments
Subject: cytokinin response
Strain: Columbia-0
Tissue: seedling
Age: 10 day
Replicates: replicate 3
Condition: cytokinin treated
Mutant: no

Metadata example: the bad
RNA-Seq study with 16 experiments
Subject: “BL” response (blue light?)
Strain: Columbia-0
Tissue: ???
Age: ???
Replicates: ???
Condition: ???
Mutant: ???
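The contrast between the good and the bad example can be expressed as a simple completeness check. The following is a minimal sketch, assuming the metadata of one experiment has already been collected into a dictionary. The field names mirror the slides; `REQUIRED_FIELDS`, `PLACEHOLDERS`, and `missing_fields` are hypothetical names chosen for illustration and are not part of any NCBI interface or of the curation tool.

```python
# Fields a curator needs before an experiment can be used (taken from the slides).
REQUIRED_FIELDS = ("strain", "tissue", "age", "replicate", "condition", "mutant")

# Values treated as "no information"; this set is an assumption for illustration.
PLACEHOLDERS = {"", "???", "…", "n/a", "unknown"}


def missing_fields(metadata):
    """Return the required fields that are absent or hold only a placeholder."""
    missing = []
    for field in REQUIRED_FIELDS:
        value = str(metadata.get(field) or "").strip().lower()
        if value in PLACEHOLDERS:
            missing.append(field)
    return missing


good = {
    "subject": "cytokinin response",
    "strain": "Columbia-0",
    "tissue": "seedling",
    "age": "10 day",
    "replicate": "replicate 3",
    "condition": "cytokinin treated",
    "mutant": "no",
}
bad = {
    "subject": "BL response (blue light?)",
    "strain": "Columbia-0",
    "tissue": "???",
    "age": "???",
    "replicate": "???",
    "condition": "???",
    "mutant": "???",
}

print(missing_fields(good))  # []
print(missing_fields(bad))   # ['tissue', 'age', 'replicate', 'condition', 'mutant']
```

A check like this only flags gaps; deciding what the missing values should be still requires manual curation.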
• Metadata on NCBI’s Sequence Read Archive
  • Important information is missing or is scattered across pages
  • Manual curation on NCBI is a slow process
• Solution: Curation tool with graphical interface
  • All metadata per study in a single window
  • Automatic highlights and colored buttons to identify replicates

II. Curate the metadata
• Curation tool
  • Speeds up curation (~250 experiments per hour)
  • Reduces mistakes
  • Not limited to RNA-Seq
  • Not publicly released yet

III. Process RNA-Seq data in bulk
Finding the right tool for the job
[Figure: pipeline steps Download, Quality Control, Read mapping, Read counting; figure labels: 18, 125, 72, 3]

Quality Control
• Do you need quality trimming/filtering?
  • Modern sequencers produce high-quality reads
  • Software can compensate
  • Editing your reads might introduce bias
• Choose the workflow that best suits your data

Read mapping
• Some tools are easier to use than others
• TopHat
  tophat -G <gtf_file> -o <output_dir> <tophat_index> <fastq_file>
• STAR
  STAR --genomeDir <star_index> --readFilesIn <fastq_file> --outSAMstrandField intronMotif --outFilterIntronMotifs RemoveNoncanonicalUnannotated --outFilterMultimapNmax <mismatches> --outFilterMismatchNoverLmax <percent_mismatch> --alignIntronMin <min_intron> --alignIntronMax <max_intron> --outFileNamePrefix <output_file>
• Take ease of use into account: more parameters offer more ways to screw up.
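When running these mappers in bulk, one way to limit parameter mistakes is to build each command line in one place instead of retyping it per sample. The sketch below is illustrative only: it uses just the flags shown on the slide, keeps the slide's placeholder names as argument names, and all file paths and numeric values in the usage part are made-up examples, not recommended settings. It only builds and prints the commands; it does not run either tool.

```python
import shlex


def tophat_command(gtf_file, output_dir, tophat_index, fastq_file):
    """Build the TopHat call from the slide as an argument list."""
    return ["tophat", "-G", gtf_file, "-o", output_dir, tophat_index, fastq_file]


def star_command(star_index, fastq_file, mismatches, percent_mismatch,
                 min_intron, max_intron, output_file):
    """Build the STAR call from the slide as an argument list.

    Argument names are copied from the slide's placeholders.
    """
    return [
        "STAR",
        "--genomeDir", star_index,
        "--readFilesIn", fastq_file,
        "--outSAMstrandField", "intronMotif",
        "--outFilterIntronMotifs", "RemoveNoncanonicalUnannotated",
        "--outFilterMultimapNmax", str(mismatches),
        "--outFilterMismatchNoverLmax", str(percent_mismatch),
        "--alignIntronMin", str(min_intron),
        "--alignIntronMax", str(max_intron),
        "--outFileNamePrefix", output_file,
    ]


# Hypothetical inputs, chosen only to make the example concrete.
tophat = tophat_command("genes.gtf", "tophat_out", "tophat_index/genome",
                        "sample.fastq")
star = star_command("star_index", "sample.fastq", 10, 0.02, 20, 10000,
                    "star_out/sample_")

# shlex.join quotes arguments safely for display or logging (Python 3.8+).
print(shlex.join(tophat))
print(shlex.join(star))
print(len(tophat), "arguments for TopHat versus", len(star), "for STAR")
```

Keeping the command as an argument list also makes it straightforward to pass to `subprocess.run` later without shell quoting issues.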
Read counting
• Too many tools to test yourself (Kanitz et al., 2015)
• Search for benchmark papers to pick your tools

[Figure: pipeline for a sample of 50 million reads. Steps: Download, Quality Control, Read mapping, Read counting. Timings shown: 10-15 minutes, 15-60 minutes, 5-20 minutes]
• Tools can vary in
  • Running time
  • Memory requirements
  • Stability
• Take computational cost into account

[Figure: the same pipeline with an added alternative: Alignment-free (5-10 minutes)]
• Groundbreaking new techniques can pop up any time
• Keep your pipelines up-to-date

Conclusion
• Metadata on NCBI’s Sequence Read Archive
  • Querying for RNA-Seq data is harder than it should be
  • Don’t be lazy when uploading your data
  • The curation tool speeds up curation
• Choosing the right tool for the job
  • Choose the workflow that best suits your data
  • Take ease of use into account
  • Search for benchmark papers
  • Take computational cost into account
  • Keep your pipelines up-to-date