The iPlant Collaborative
Community Cyberinfrastructure for Life Science
Tools and Services Workshop
Intro to RNA-Seq with the Tuxedo Suite
Experiment Overview
Goals
Determine the differential expression (abundance) of transcripts
between a WT and a mutant organism
RNA-Seq Overview
Basic concept
Image source: http://www.bgisequence.com
Experiment Overview
Example experiment
• LONG HYPOCOTYL 5 (HY5) is a basic leucine zipper transcription factor (TF).
• Mutations cause aberrant phenotypes in Arabidopsis morphology, pigmentation and
hormonal response.
• We will use RNA-Seq to compare WT and hy5 to identify HY5-regulated genes.
Source: http://www.gla.ac.uk/media/media_73736_en.jpg
Experiment Overview
Read statistics
• Genome alignments from TopHat were saved as BAM files, the binary version of SAM
(samtools.sourceforge.net/).
• Reads retained by TopHat are shown below
Now what?
@SRR070570.4 HWUSI-EAS455:3:1:1:1096 length=41
CAAGGCCCGGGAACGAATTCACCGCCGTATGGCTGACCGGC
+
BA?39AAA933BA05>A@A=?4,9#################
@SRR070570.12 HWUSI-EAS455:3:1:2:1592 length=41
GAGGCGTTGACGGGAAAAGGGATATTAGCTCAGCTGAATCT
+
@=:9>5+.5=?@<6>A?@6+2?:</7>,%1/=0/7/>48##
@SRR070570.13 HWUSI-EAS455:3:1:2:869 length=41
TGCCAGTAGTCATATGCTTGTCTCAAAGATTAAGCCATGCA
+
A;BAA6=A3=ABBBA84B<&78A@BA=(@B>AB2@>B@/9?
@SRR070570.32 HWUSI-EAS455:3:1:4:1075 length=41
CAGTAGTTGAGCTCCATGCGAAATAGACTAGTTGGTACCAC
+
BB9?A@>AABBBB@BCA?A8BBBAB4B@BC71=?9;B:3B?
@SRR070570.40 HWUSI-EAS455:3:1:5:238 length=41
AAAAGGGTAAAAGCTCGTTTGATTCTTATTTTCAGTACGAA
+
BBB?06-8BB@B17>9)=A91?>>8>*@<A<>>@1:B>(B@
@SRR070570.44 HWUSI-EAS455:3:1:5:1871 length=41
GTCATATGCTTGTCTCAAAGATTAAGCCATGCATGTGTAAG
+
BBBCBCCBBBBBA@BBCCB+ABBCB@B@BB@:BAA@B@BB>
@SRR070570.46 HWUSI-EAS455:3:1:5:1981 length=41
GAACAACAAAACCTATCCTTAACGGGATGGTACTCACTTTC
+
?A>-?B;BCBBB@BC@/>A<BB:?<?B?=75?:9@@@3=>:
Getting a feel for the data
FASTQ format
@SRR070570.4 HWUSI-EAS455:3:1:1:1096 length=41
CAAGGCCCGGGAACGAATTCACCGCCGTATGGCTGACCGGC
+
BA?39AAA933BA05>A@A=?4,9#################
@SRR070570.12 HWUSI-EAS455:3:1:2:1592 length=41
GAGGCGTTGACGGGAAAAGGGATATTAGCTCAGCTGAATCT
+
@=:9>5+.5=?@<6>A?@6+2?:</7>,%1/=0/7/>48##
@SRR070570.13 HWUSI-EAS455:3:1:2:869 length=41
TGCCAGTAGTCATATGCTTGTCTCAAAGATTAAGCCATGCA
+
A;BAA6=A3=ABBBA84B<&78A@BA=(@B>AB2@>B@/9?
@SRR070570.32 HWUSI-EAS455:3:1:4:1075 length=41
CAGTAGTTGAGCTCCATGCGAAATAGACTAGTTGGTACCAC
+
BB9?A@>AABBBB@BCA?A8BBBAB4B@BC71=?9;B:3B?
@SRR070570.40 HWUSI-EAS455:3:1:5:238 length=41
AAAAGGGTAAAAGCTCGTTTGATTCTTATTTTCAGTACGAA
+
BBB?06-8BB@B17>9)=A91?>>8>*@<A<>>@1:B>(B@
@SRR070570.44 HWUSI-EAS455:3:1:5:1871 length=41
GTCATATGCTTGTCTCAAAGATTAAGCCATGCATGTGTAAG
+
BBBCBCCBBBBBA@BBCCB+ABBCB@B@BB@:BAA@B@BB>
@SRR070570.46 HWUSI-EAS455:3:1:5:1981 length=41
GAACAACAAAACCTATCCTTAACGGGATGGTACTCACTTTC
+
?A>-?B;BCBBB@BC@/>A<BB:?<?B?=75?:9@@@3=>:
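Each FASTQ record spans four lines: a header starting with @, the read sequence, a + separator, and a quality string with one character per base. The following is a minimal sketch of how to read those quality strings, assuming Phred+33 encoding (quality = ASCII code minus 33), which is consistent with characters such as # and , in the reads above. The helper name mean_quality is hypothetical and is not part of any Tuxedo tool.

```shell
# Report read name, read length and mean Phred quality for each FASTQ record.
# ASSUMPTION: Phred+33 encoding (quality score = ASCII code - 33).
mean_quality() {
  awk '
    BEGIN { for (i = 33; i < 127; i++) ord[sprintf("%c", i)] = i }  # char -> ASCII code
    NR % 4 == 1 { name = substr($1, 2) }      # header line: drop the leading "@"
    NR % 4 == 0 {                             # quality line: one character per base
      total = 0
      n = length($0)
      for (i = 1; i <= n; i++) total += ord[substr($0, i, 1)] - 33
      printf "%s\t%d\t%.1f\n", name, n, total / n
    }
  ' "$@"
}

# Try it on the first record shown above.
printf '%s\n' \
  '@SRR070570.4 HWUSI-EAS455:3:1:1:1096 length=41' \
  'CAAGGCCCGGGAACGAATTCACCGCCGTATGGCTGACCGGC' \
  '+' \
  'BA?39AAA933BA05>A@A=?4,9#################' | mean_quality
```

On a real file the call would be `mean_quality reads.fq`, where reads.fq is a placeholder name. The trailing run of # characters in this read corresponds to Q2, the lowest score present, which is why the read ends look poor.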
Papers and Background
Read these first!
Tuxedo Workflow
Differential expression
*TopHat and Cufflinks require a sequenced genome
No reference genome?
Resources
ENCODE Standards
Suggestions before you sequence
http://encodeproject.org/ENCODE/protocols/dataStandards/ENCODE_RNAseq_Standards_V1.0.pdf
$ tophat -p 8 -G genes.gtf -o C1_R1_thout genome C1_R1_1.fq C1_R1_2.fq
$ tophat -p 8 -G genes.gtf -o C1_R2_thout genome C1_R2_1.fq C1_R2_2.fq
$ tophat -p 8 -G genes.gtf -o C1_R3_thout genome C1_R3_1.fq C1_R3_2.fq
$ tophat -p 8 -G genes.gtf -o C2_R1_thout genome C2_R1_1.fq C2_R1_2.fq
$ tophat -p 8 -G genes.gtf -o C2_R2_thout genome C2_R2_1.fq C2_R2_2.fq
$ tophat -p 8 -G genes.gtf -o C2_R3_thout genome C2_R3_1.fq C2_R3_2.fq
$ cufflinks -p 8 -o C1_R1_clout C1_R1_thout/accepted_hits.bam
$ cufflinks -p 8 -o C1_R2_clout C1_R2_thout/accepted_hits.bam
$ cufflinks -p 8 -o C1_R3_clout C1_R3_thout/accepted_hits.bam
$ cufflinks -p 8 -o C2_R1_clout C2_R1_thout/accepted_hits.bam
$ cufflinks -p 8 -o C2_R2_clout C2_R2_thout/accepted_hits.bam
$ cufflinks -p 8 -o C2_R3_clout C2_R3_thout/accepted_hits.bam
$ cuffmerge -g genes.gtf -s genome.fa -p 8 assemblies.txt
$ cuffdiff -o diff_out -b genome.fa -p 8 -L C1,C2 -u merged_asm/merged.gtf \
./C1_R1_thout/accepted_hits.bam,./C1_R2_thout/accepted_hits.bam,\
./C1_R3_thout/accepted_hits.bam \
./C2_R1_thout/accepted_hits.bam,./C2_R3_thout/accepted_hits.bam,\
./C2_R2_thout/accepted_hits.bam
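The per-sample TopHat and Cufflinks commands on this slide all follow one pattern, so they can be generated from a list of sample names instead of being typed out. The sketch below only prints the commands (a dry run) and also prints the list of Cufflinks assemblies that cuffmerge reads from assemblies.txt, one GTF path per line. The function names print_commands and list_assemblies are hypothetical; the file names are the deck's own placeholders.

```shell
# Sample names used on the slide: two conditions (C1, C2), three replicates each.
samples="C1_R1 C1_R2 C1_R3 C2_R1 C2_R2 C2_R3"

# Print (do not run) the per-sample TopHat and Cufflinks commands.
print_commands() {
  for s in $samples; do
    echo "tophat -p 8 -G genes.gtf -o ${s}_thout genome ${s}_1.fq ${s}_2.fq"
  done
  for s in $samples; do
    echo "cufflinks -p 8 -o ${s}_clout ${s}_thout/accepted_hits.bam"
  done
}

# Print the Cufflinks assemblies, one transcripts.gtf path per line.
list_assemblies() {
  for s in $samples; do
    echo "./${s}_clout/transcripts.gtf"
  done
}

print_commands
list_assemblies
```

To use this for real, redirect the second function into the file cuffmerge expects (`list_assemblies > assemblies.txt`) and pipe the printed commands to a shell once they look right (`print_commands | sh`).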
Your RNA-Seq Data
Command line version
Using a GUI
[Workflow diagram: Your Data (FASTQ) in the iPlant Data Store; TopHat (Bowtie), Cufflinks, Cuffmerge and Cuffdiff in the Discovery Environment; CummeRbund in Atmosphere; result: Your transformed RNA-Seq Data]
Moving your data in
Complete documentation
www.iplantc.org/ds1
iDrop Desktop
Easy to use!
Discovery Environment
Easy to use!
Decompress your data
Know what files you have
Remove barcodes?
Demultiplexing and adapter trimming
Image from: http://www.westburg.eu/lp/rna-seq-library-preparation
Pre-process sequences if needed (e.g., Sabre for de-multiplexing reads, and Scythe for
removing primer/adapter sequences)
Quality Control
FastQC
http://www.bioinformatics.babraham.ac.uk/projects/fastqc/
Quality Control
Per base sequence quality
BAD
GOOD
• The central red line is the median value
• The yellow box represents the inter-quartile range (25-75%)
• The upper and lower whiskers represent the 10% and 90% points
• The blue line represents the mean quality
Quality Control
Per sequence quality
BAD
GOOD
Fail: most frequently observed mean quality is below 20 (1% error rate)
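The 1% figure follows from the Phred scale, which relates a quality score Q to the probability P that a base call is wrong. A short worked check:

```latex
Q = -10 \log_{10} P
\quad\Longleftrightarrow\quad
P = 10^{-Q/10},
\qquad
Q = 20 \;\Rightarrow\; P = 10^{-2} = 0.01 \ (1\%)
```

By the same relation, Q30 corresponds to a 0.1% error rate and Q10 to a 10% error rate.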
Quality Control
Sequence length distribution
GOOD
Fail: error if any of the sequences have zero length.
Quality Control
Overrepresented sequences
BAD
Fail: module will issue an error if any sequence is found to represent
more than 1% of the total
TopHat
Maps reads to reference genome
• TopHat is one of many applications for aligning
short sequence reads to a reference genome.
• It uses the Bowtie aligner internally.
• Alternatives include BWA, MAQ, STAR, OLego,
Stampy, Novoalign, etc.
TopHat
Maps reads to reference genome
• TopHat has a number of parameters and options, and their
default values are tuned for processing mammalian RNA-Seq reads.
• If you would like to use TopHat for another class of organism,
we recommend setting some of the parameters with more
strict, conservative values than their defaults.
• Usually, setting the maximum intron size to 4 or 5 Kb is
sufficient to discover most junctions while keeping the
number of false positives low.
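In TopHat the maximum intron size is set with -I (long form --max-intron-length); the documented default is 500000 bp, which suits mammalian genomes. As a sketch of how the recommendation above might look on the command line, the helper below prints a TopHat command with an explicit limit. The function name tophat_cmd is hypothetical, the file names reuse this deck's placeholders, and 5000 is simply the 5 Kb value from the manual quote, not a value tuned for any particular organism.

```shell
# Print (do not run) a TopHat command with an explicit maximum intron length in bp.
# Usage: tophat_cmd MAX_INTRON SAMPLE
tophat_cmd() {
  echo "tophat -p 8 -I $1 -G genes.gtf -o ${2}_thout genome ${2}_1.fq ${2}_2.fq"
}

tophat_cmd 5000 C1_R1
```

Printing the command first makes it easy to review the parameters before committing hours of compute to an alignment.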
- TopHat User Manual
IGV
Visualize mapped reads
Cufflinks
Assemble transcripts
Hint: Provide a mask file (gtf/gff)
• Tells Cufflinks to ignore all reads that
could have come from transcripts in this
GTF file.
• Include annotated rRNA, mitochondrial
transcripts, and other abundant transcripts
you wish to ignore.
- Cufflinks User Manual
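Cufflinks takes the mask file through its -M (--mask-file) option. One way to build such a file is to filter the reference annotation, as sketched below. This is an illustration under stated assumptions: the sequence name "Mt" and the attribute gene_biotype "rRNA" follow Ensembl-style annotations and may differ in your GTF, and the names make_mask and mask.gtf are hypothetical.

```shell
# Keep GTF records on the mitochondrial sequence or annotated as rRNA.
# ASSUMPTIONS: seqname "Mt" and attribute gene_biotype "rRNA" (Ensembl-style);
# adjust both patterns to match your own annotation.
make_mask() {
  awk -F '\t' '$1 == "Mt" || $9 ~ /gene_biotype "rRNA"/' "$@"
}

# Demo on three made-up GTF records: only the rRNA and Mt records are printed.
printf '1\tsrc\texon\t1\t100\t.\t+\t.\tgene_id "g1"; gene_biotype "protein_coding";\n2\tsrc\texon\t1\t100\t.\t+\t.\tgene_id "g2"; gene_biotype "rRNA";\nMt\tsrc\texon\t1\t100\t.\t+\t.\tgene_id "g3"; gene_biotype "protein_coding";\n' | make_mask

# With real data (not run here):
#   make_mask genes.gtf > mask.gtf
#   cufflinks -p 8 -M mask.gtf -o C1_R1_clout C1_R1_thout/accepted_hits.bam
```

Masking these abundant transcripts keeps them from dominating the abundance estimates of the remaining transcripts.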
Cufflinks
Assemble transcripts
1) transcripts.gtf
This GTF file contains Cufflinks' assembled isoforms. The first 7 columns are
standard GTF, and the last column contains attributes, some of which are
also standardized ("gene_id" and "transcript_id"). There is one GTF record per
row, and each record represents either a transcript or an exon within a
transcript.
2) isoforms.fpkm_tracking
This file contains the estimated isoform-level expression values (FPKM).
3) genes.fpkm_tracking
This file contains the estimated gene-level expression values (FPKM).
- Cufflinks User Manual
Cufflinks
Assemble transcripts
Cuffmerge
Assemble transcriptome from RABT and Cufflinks
Cuffmerge is a meta-assembler: it combines the assembled Cufflinks
transcripts with a reference-based assembly (RABT).
Cuffdiff
Determine sample differences
• Cuffdiff evaluates the variation in read counts for each
gene across the replicates; this estimate is used to
calculate the significance of expression changes.
• Cuffdiff can identify genes that are differentially
spliced or differentially regulated via promoter
switching. Isoforms of a gene that share the same
transcription start site (TSS) are grouped.
• The detection rate of differentially expressed
genes/transcripts depends strongly on
sequencing depth.
Cuffdiff
Determine sample differences
Changes in fragment counts ≠ changes in expression
True expression is estimated by the sum of the length-normalized
isoform read counts, so the entire transcript must be taken into account.
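One way to see why counts and expression differ is the simplified definition of FPKM (fragments per kilobase of transcript per million mapped fragments). The symbols below are introduced here for illustration: C_t is the number of fragments assigned to transcript t, L_t its length in bp, and N the total number of mapped fragments. Cufflinks itself fits a statistical model rather than dividing raw counts, so treat this as a sketch of the idea.

```latex
\mathrm{FPKM}_t = \frac{10^{9}\, C_t}{N \, L_t},
\qquad
\mathrm{FPKM}_{\mathrm{gene}} = \sum_{t \in \mathrm{gene}} \mathrm{FPKM}_t

% Worked example with N = 10^{7} mapped fragments:
% isoform A: L_A = 1000 bp, C_A = 100  =>  FPKM_A = 10^{9} \cdot 100 / (10^{7} \cdot 1000) = 10
% isoform B: L_B = 4000 bp, C_B = 400  =>  FPKM_B = 10^{9} \cdot 400 / (10^{7} \cdot 4000) = 10
```

In this example a gene that switches from isoform A to isoform B yields four times as many fragments at the same expression level, which is why fragment counts alone can mislead.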
Cuffdiff
Determine sample differences
1) FPKM tracking files
Cuffdiff calculates the FPKM of each transcript, primary transcript, and gene in each sample. Primary transcript and
gene FPKMs are computed by summing the FPKMs of transcripts in each primary transcript group or gene group.
(tss_groups.fpkm_tracking tracks summed FPKM of transcripts sharing tss_ids)
2) Count tracking files
Estimate of the number of fragments that originated from each transcript, primary transcript, and gene in each
sample.
3) Read group tracking files
Expression and fragment count for each transcript, primary transcript, and gene in each replicate.
4) Differential expression tests
A tab-delimited file that lists the results of differential expression testing between samples for spliced transcripts, primary
transcripts, genes, and coding sequences.
Plus several other outputs (differential splicing, CDS, promoter use, etc.)
Cuffdiff
Determine sample differences
Example filtered Cuffdiff results generated in the Discovery Environment.
Cuffdiff
Density plot
Cuffdiff
Scatter plot
Cuffdiff
Volcano plot
CummeRbund
Using R in Atmosphere (tomorrow)
Keep asking: ask.iplantcollaborative.org
The iPlant Collaborative is funded by a grant from the National Science
Foundation Plant Cyberinfrastructure Program (#DBI-0735191).