Burt, Dave - The Roslin Institute

Next Generation Sequencing
Current Status and Prospects
Avian Genomics in the 21st Century
The Roslin Institute and Royal (Dick)
School of Veterinary Studies
University of Edinburgh
[email protected]
Sequencing Technologies
Generation      NGS technology   Sequencing principle             Read length (bases)  Raw read accuracy (%)  Reads per run  Gbases per run
1st generation  Sanger           Dideoxy sequencing               ~1,000               ≥99.999                96             0.0003
2nd generation  Roche/454        Pyrosequencing                   350–450              ≥99                    8.00E+05       0.4
2nd generation  Illumina/Solexa  Reversible terminator chemistry  36–100               ≥98–99                 6.00E+09       600
2nd generation  ABI/SOLiD        Sequencing by ligation           35–60                ≥99.99                 1.00E+08       50–120
3rd generation  PacBio           Single-molecule sequencing       1000–4500            ≥80                    4.80E+04       0.05
3rd generation  Helicos          Single-molecule sequencing       25–55                ≥97                    6.00E+08       21–35
First Generation Sequencing
Frederick Sanger
Awarded the 1958 Nobel Prize in Chemistry "for his work on the structure of proteins". In 1980, Gilbert and Sanger shared the Chemistry prize "for their contributions concerning the determination of base sequences in nucleic acids".
Genetic Maps
• Genetic markers (e.g. Microsatellites)
• Mapping populations (e.g. East Lansing)
• Comparative maps (e.g. Chicken/Human)
• Resource populations (e.g. B X L cross)
• QTL mapping
• Marker-Assisted-Selection
QTL Mapping
• QTL not mapped precisely
• Confidence intervals for QTL large
• Markers account for limited genetic variation (~4%)
Genomic Tools
• Expressed sequence tags (ESTs)
• Chicken genome sequence
• Gene expression chips
– Affymetrix/Chicken Genome Consortium
• 3M SNPs between RJF, Broiler,
Layer and Silkie lines
• 3, 20, 42, 60K SNP panels
• ARK-Genomics facility
• Genome Browsers
– Ensembl and NCBI
ARK-Genomics facility: www.ark-genomics.org
Further BBSRC support 2011-2014 (Gallus 4, RNAseq, SNPs...)
NGS: Illumina/Solexa
14
Clonal Single Molecule Arrays
Attach single molecules to surface
Amplify to form clusters
Random array of clusters
15
Sequencing By Synthesis
Cycle 1: Add sequencing reagents; first base incorporated; remove unincorporated bases; detect signal; deblock and defluor
Cycle 2-n: Add sequencing reagents and repeat
• All four labeled nucleotides in one reaction
• High accuracy
• Base-by-base sequencing
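The cycle described on this slide can be sketched in a few lines of Python. This is only a toy illustration, not the instrument's chemistry or software: the function name and the example template are hypothetical, strand orientation is ignored, and every cycle is assumed to incorporate exactly one correct base.

```python
# Toy model of sequencing by synthesis (hypothetical names, idealised chemistry).
COMPLEMENT = {"A": "T", "T": "A", "G": "C", "C": "G"}


def sequence_by_synthesis(template, n_cycles):
    """Return the bases incorporated against `template`, one per cycle."""
    incorporated = []
    for cycle in range(min(n_cycles, len(template))):
        # Add reagents: all four labeled terminators are present, but only
        # the one complementary to the template base is incorporated.
        base = COMPLEMENT[template[cycle]]
        # Wash away unincorporated bases, image the cluster, record the fluor.
        incorporated.append(base)
        # Deblock and remove the fluor so the next cycle can extend the strand.
    return "".join(incorporated)


print(sequence_by_synthesis("TGCTACGAT", 5))  # ACGAT
```

Because the terminator blocks further extension until it is removed, the read grows by exactly one base per cycle, which is what "base-by-base sequencing" refers to.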
Base Calling from Raw Data
Identity of each base of a cluster is read off from sequential images
[Figure: images from cycles 1-9 for two example clusters, read as TGCTACGAT… and TTTTTTTGT…]
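As a rough illustration of reading a base off each image, the sketch below calls the brightest of four channels at every cycle. The intensity values and the function name are invented for the example, and real base callers also correct for channel cross-talk and phasing, which is not modelled here.

```python
# Minimal base-calling sketch: pick the brightest channel per cycle.
# Intensities are made-up numbers for one hypothetical cluster.
CHANNELS = "ACGT"


def call_bases(intensities):
    """intensities: one (A, C, G, T) signal tuple per sequencing cycle."""
    calls = []
    for cycle_signal in intensities:
        brightest = max(range(4), key=lambda i: cycle_signal[i])
        calls.append(CHANNELS[brightest])
    return "".join(calls)


cluster = [
    (0.1, 0.0, 0.2, 0.9),  # cycle 1: T channel brightest
    (0.1, 0.1, 0.8, 0.0),  # cycle 2: G channel brightest
    (0.0, 0.7, 0.1, 0.1),  # cycle 3: C channel brightest
]
print(call_bases(cluster))  # TGC
```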
Applications
Nowrousian M. 2010 Eukaryotic Cell 9:1300-1310
18
Applications
19
Avian Genomes
Ning Li, Yao Feng Zhao, China Agricultural University, Beijing, China
Wubin Qian, Ju Wang, Beijing Genome Institute, Shenzhen, China
David W Burt, Jacqueline Smith, Yinhua Huang, University of Edinburgh, UK
20
Avian Genomes
Flight
Small genome
Unique karyotype
Immune system
Learning
Migration
Lifespan …
21
Phylogenomics
• Clade and species-specific biology
• Gene diversification
– Gene innovation, duplication and expansion
– Gene deletion, contraction and extinction
– Selection constraints on protein coding
sequences (negative, neutral, positive)
22
Computational Pipeline for
Ensembl/Compara Process
WUBlastp + SmithWaterman
hcluster_sg
multiple aligners
consensified by M-Coffee
TreeBeST
Javier Herrero, Leo Gordon,
Steve Searle, European
Bioinformatics Institute, UK
23
Gene Family Expansion and
Contraction of Adaptive Significance?
• CAFE “Computational Analysis of gene Family
Evolution” (Hahn et al, 2007) was used to predict gene
family expansions and contractions of putative adaptive
value
• CAFE models gene expansion/contraction as a
“birth/death” process with a specific probability
• This value may be the same for all lineages or may vary
in two or more lineages
• The likelihood of each model can be calculated and used to compare
different models
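One common way to compare two nested models by their likelihoods is a likelihood ratio test. The sketch below assumes the models differ by a single rate parameter and that a chi-square approximation with one degree of freedom is acceptable; CAFE itself may assess significance differently, and the log-likelihood values are invented.

```python
import math


def likelihood_ratio_test(lnl_null, lnl_alt):
    """Compare a single-rate model (null) with a model that has one extra rate.

    Returns the test statistic and an approximate p-value from the
    chi-square distribution with 1 degree of freedom.
    """
    stat = 2.0 * (lnl_alt - lnl_null)
    # For 1 degree of freedom the chi-square survival function is erfc(sqrt(x/2)).
    p_value = math.erfc(math.sqrt(max(stat, 0.0) / 2.0))
    return stat, p_value


# Hypothetical log-likelihoods for the two models.
stat, p_value = likelihood_ratio_test(-1000.0, -995.0)
print(stat, p_value)
```

A small p-value would favour the model in which the birth/death rate differs between lineages.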
24
Changes in gene family size along each branch
Average expansion = (total genes gained – total genes lost)/n
[Figure: phylogenetic tree rooted at the MRCA with the average expansion marked on each branch (values range from -0.336 to +0.202); x-axis: Million years before present]
Accelerated Evolution of Genes
Selection constraints on protein
coding sequences (negative,
neutral, positive)
Heebal Kim, Taehun Kim, Seoul National University, Korea,
and Rasmus Nielsen, University of California, Berkeley, USA
ω= dN/dS
Birds vs. Mammals
Adaptive evolution or relaxed selective constraint,
during last ~100 million years?
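The three selection regimes map onto ω as follows: ω < 1 indicates negative (purifying) selection, ω ≈ 1 neutral evolution, and ω > 1 positive selection. A minimal sketch of that classification is below; the tolerance around 1 and the example rates are assumptions for illustration only, and an elevated ω can also reflect relaxed constraint rather than adaptation, as the slide notes.

```python
def omega(dn, ds):
    """Ratio of non-synonymous (dN) to synonymous (dS) substitution rates."""
    if ds <= 0:
        raise ValueError("dS must be positive to compute dN/dS")
    return dn / ds


def classify_selection(omega_value, tolerance=0.05):
    """Label the selection regime; `tolerance` is an arbitrary illustrative band."""
    if omega_value < 1 - tolerance:
        return "negative"
    if omega_value > 1 + tolerance:
        return "positive"
    return "neutral"


print(classify_selection(omega(0.02, 0.10)))  # negative
print(classify_selection(omega(0.30, 0.10)))  # positive
```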
26
Compare Rates of Evolution
ω = dN/dS, Birds vs. Mammals
Of 4,224 orthologs shared between eight species, 766 showed
accelerated evolution in birds and 762 in mammals
Rates Birds > Mammals
proliferation of B cells
activation and migration of leukocytes and T-cells
cardiovascular system (metabolic demands of flight, running, swimming and diving)
beak shape and size
nervous system and behavior (birds have around three times the visual acuity of humans)
hepatic function (migratory birds)
Rates Mammals > Birds
movement of B cells
lymphoid tissue structure and development (no lymph nodes in birds)
reproduction
endocrine system development
embryonic development
visual system
Species                      Latin name                       Sequence depth  Number of genes
Adelie penguin               Pygoscelis adeliae               60X             15,300
American Crow                Corvus brachyrhynchos            90X             16,742
Angola turaco                Tauraco erythrolophus            30X             14,667
Anna's hummingbird           Calypte anna                     110X            16,750
Barn owl                     Tyto alba                        27X             14,048
Bar-tailed trogon            Apaloderma vittatum              28X             14,917
Brown mesite                 Mesitornis unicolor              29X             15,275
Budgerigar                   Melopsittacus undulatus          30X             16,368
Caribbean flamingo           Phoenicopterus ruber             33X             13,811
Chicken                      Gallus gallus                    7X Sanger       16,516
Chimney swift                Chaetura pelagica                106X            15,608
Common Cuckoo                Cuculus canorus                  100X            15,681
Crested Ibis                 Nipponia nippon                  105X            16,434
Crowned crane                Balearica regulorum gibbericeps  33X             14,821
Cuckoo roller                Leptosomus discolor              32X             14,719
Dalmatian pelican            Pelecanus crispus                34X             14,353
Domestic pigeon              Columba livia                    64X             17,300
Downy Woodpecker             Picoides pubescens               105X            16,396
Emperor penguin              Aptenodytes forsteri             60X             16,470
Golden-collared Manakin      Manacus vitellinus               110X            16,103
Great black cormorant        Phalacrocorax carbo              24X             13,909
Great tinamou                Tinamus major                    100X            15,504
Great-crested grebe          Podiceps cristatus               30X             13,957
Hoatzin                      Opisthocomus hoazin              100X            14,937
Houbara Bustard              Chlamydotis undulata             27X             14,090
Javan rhinoceros hornbill    Buceros rhinoceros silvestris    35X             13,835
Kea                          Nestor notabilis                 32X             14,736
Killdeer                     Charadrius vociferus             100X            16,146
Little egret                 Egretta garzetta                 74X             15,814
Medium ground finch          Geospiza fortis                  115X            16,780
Nightjar                     Caprimulgus carolinensis         30X             14,502
Northern Carmine bee-eater   Merops nubicus                   37X             14,019
Northern Fulmar              Fulmarus glacialis               33X             14,186
Ostrich                      Struthio camelus                 85X             15,417
Peking duck                  Anas platyrhynchos domestica     50X             19,144
Peregrine falcon             Falco peregrinus                 105X            16,262
Red-throated loon            Gavia stellata                   33X             13,933
Red-legged seriema           Cariama cristata                 24X             15,329
Rifleman                     Acanthisitta chloris             29X             16,034
Speckled mousebird           Colius striatus                  27X             14,807
Sunbittern                   Eurypyga helias                  33X             13,582
Turkey                       Meleagris gallopavo              30X             14,108
Turkey vulture               Cathartes aura                   25X             13,600
White-tailed eagle           Haliaeetus albicilla             26X             13,793
White-tailed tropicbird      Phaethon lepturus                39X             14,667
Yellow-throated Sandgrouse   Pterocles gutturalis             25X             14,897
Zebra finch                  Taeniopygia guttata              6X Sanger       17,471
Avian Phylogenomics Group: BGI, Duke, Univ Copenhagen, Am Museum Nat His, Bowdoin, Cal Academy
Sci, Cardiff, CNPq Brazil, Copenhagen Zoo, Florida, Griffith, Harvard, Heidelberg Institute Theoretical
Physics, Mississippi State Univ, Montpellier Univ, Murdoch Univ, New Mex State Univ, NIEHS, NIH, OHSU,
San Diego Zoo, Smithsonian, U Texas Austin, UCSC, Univ Delaware, Univ Maryland, Univ Minnesota, Univ
Sydney, Utah, Wash Univ, Roslin/Univ Edinburgh
Applications
31
QTL Mapping
• QTL not mapped precisely
• Confidence intervals for QTL large
• Markers account for limited genetic
variation (~4%)
32
Genome Wide Selection
• Genotype 1000’s of markers to
predict breeding values
• High density SNP panel for
whole genome (e.g. 600K)
• QTL close to one or more
markers
• Allows SNP with smaller
effects to be used effectively
• GWS will account for all QTL
and all genetic variation
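To make "predict breeding values" concrete: in its simplest additive form, a genomic breeding value is the sum over markers of allele count times estimated marker effect. The sketch below assumes such an additive model; the marker effects, bird identifiers, and genotypes are invented, and estimating the effects from a training population is not shown.

```python
def genomic_breeding_value(genotypes, effects):
    """Sum of allele counts (0, 1 or 2 per SNP) times estimated SNP effects."""
    if len(genotypes) != len(effects):
        raise ValueError("one effect is needed per genotyped SNP")
    return sum(g * e for g, e in zip(genotypes, effects))


# Hypothetical effects for three SNPs and genotypes for two birds.
effects = [0.5, -0.2, 0.1]
birds = {"bird_1": [2, 0, 1], "bird_2": [0, 2, 1]}

ranked = sorted(
    birds,
    key=lambda name: genomic_breeding_value(birds[name], effects),
    reverse=True,
)
print(ranked)  # ['bird_1', 'bird_2']
```

With a dense panel, many small-effect SNPs each contribute a term to this sum, which is why markers with smaller effects can still be used effectively.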
33
SNP Discovery for Array
Illumina ultra-high-throughput sequencing
  243 chickens from 24 lines, samples in pools of 10-15 individuals; av. coverage 7-17X per line
Sequence alignment to reference genome
  Used new GGA genome assembly (still unpublished)
SNP detection: 78M SNPs (segregating in one or more lines)
  Criteria: Samtools Phred score ≥ 20, MAF ≥ 5, coverage ≥ 5, variant present in at least 5% of the reads
SNP selection (stage 1): 24M
  Criteria: Phred quality score ≥ 60
SNP selection (stage 2): 10M
  Criteria: No other SNPs within 10bp on at least one side, uniformly spaced out according to genetic distance
SNP selection (stage 3): 2M
  Criteria: Predicted reproducibility in array, 50:50 broiler and layer SNPs
SNP selection (stage 4): 650K
  Criteria: True reproducibility, Mendelian inheritance, HWE, LD
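Two of the selection criteria above are simple enough to sketch: the stage 1 quality threshold (Phred ≥ 60) and the stage 2 rule that a SNP must have no other SNP within 10bp on at least one side. The record layout, function names, and example SNPs are hypothetical, neighbours are taken from whatever list is passed in, and the spacing by genetic distance is not modelled.

```python
def stage1_quality(snps, min_quality=60):
    """Keep SNPs whose Phred quality score meets the threshold."""
    return [s for s in snps if s["qual"] >= min_quality]


def stage2_flanks(snps, flank=10):
    """Keep SNPs with at least one side free of other SNPs within `flank` bp."""
    ordered = sorted(snps, key=lambda s: (s["chrom"], s["pos"]))
    kept = []
    for i, snp in enumerate(ordered):
        prev = ordered[i - 1] if i > 0 else None
        nxt = ordered[i + 1] if i + 1 < len(ordered) else None
        left_clear = (prev is None or prev["chrom"] != snp["chrom"]
                      or snp["pos"] - prev["pos"] > flank)
        right_clear = (nxt is None or nxt["chrom"] != snp["chrom"]
                       or nxt["pos"] - snp["pos"] > flank)
        if left_clear or right_clear:
            kept.append(snp)
    return kept


# Invented example records.
snps = [
    {"chrom": "1", "pos": 100, "qual": 80},
    {"chrom": "1", "pos": 105, "qual": 75},
    {"chrom": "1", "pos": 108, "qual": 90},
    {"chrom": "1", "pos": 500, "qual": 40},
    {"chrom": "2", "pos": 50, "qual": 65},
]
selected = stage2_flanks(stage1_quality(snps))
print([(s["chrom"], s["pos"]) for s in selected])
```

In this example the SNP at position 105 is dropped because it has neighbours within 10bp on both sides, which would leave no clean flank for a probe.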
Distribution of SNPs
[Figure: three bar charts by chromosome (1-28 and Z) for the 78M, 24M, 10M and 2M SNP sets: % of SNPs; Number of SNPs/Kb; Number of SNPs/cM]
Distribution of Minor Allele Frequency
[Figure: % of SNPs in each MAF bin (bins of width 0.05) for the 78M, 24M, 10M and 2M SNP sets]
Annotation of SNPs
[Figure: % of SNPs by location: Intergenic, Intronic, Exonic, Upstream, Downstream]
Synonymous 48%; Non-synonymous 51%; Stop gain/loss 1%
SNP Genotyping Panels
• 3K
• 6K
• 20K
• 42K
• 60K
• 600K
192 samples/run
125 million genotypes/run
Final Panel Selection
• 600K panel to be selected based on
– Call rate of markers
– Mendelian inheritance (MI)
– Minor allele frequency (MAF)
– Linkage disequilibrium (LD)
– Prediction of SNP effects on coding sequence
39
Criteria for Passing SNPs
• Polymorphic, with at
least 3 examples of
the minor allele
• Robust assay:
– Genotype call rate
(≥98%)
– Cluster separation
– Reproducibility
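Two of these criteria can be checked directly from genotype calls: the call rate (≥98%) and the presence of at least 3 examples of the minor allele. The sketch below assumes "examples" means allele copies and uses a hypothetical AA/AB/BB encoding with None for a no-call; cluster separation and reproducibility need intensity data and replicate runs and are not modelled.

```python
def passes_snp_criteria(genotypes, min_call_rate=0.98, min_minor_count=3):
    """genotypes: list of 'AA', 'AB', 'BB', or None for a failed call."""
    if not genotypes:
        return False
    called = [g for g in genotypes if g is not None]
    call_rate = len(called) / len(genotypes)
    count_a = sum(g.count("A") for g in called)
    count_b = sum(g.count("B") for g in called)
    minor_count = min(count_a, count_b)
    return call_rate >= min_call_rate and minor_count >= min_minor_count


# Invented example: 100 samples, one no-call, three heterozygotes.
example = ["AA"] * 96 + ["AB"] * 3 + [None]
print(passes_snp_criteria(example))  # True
```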
40
Applications of SNP Panel
• Genomic selection: broilers and layers
• Genome wide association studies
• High resolution genetic mapping
• Selection signature analysis
• SNP annotations, phenotypic effects and functional studies
Structural Variants
42
Acknowledgements
Funding
BBSRC/Defra LINK; Aviagen Ltd, Affymetrix Ltd;
German Federal Ministry of Education and Research
Roslin Institute
David Burt
John A. Woolliams
Chris Haley
Almas Gheyas
Clarissa Boschiero
Andy Law
Le Yu
Peter Kaiser
Paul Hocking
Aviagen
Kellie A. Watson
Andreas Kranis
Hyline
Janet E. Fulton
ARK genomics
Richard Talbot
Frances Turner
Sarah Smith
Alison Downing
Mark Fell
Affymetrix
Fiona Brew
Lucy Raynold
Ali Pirani
Synbreed
Henner Simianer
Ruedi Fries
Rudolf Preisinger
Steffen Weigend
Klaus Meyer
George Haberer
Saber Qanbari
43
Applications
44
RNA-Seq
45
Gene Models: CRY1
46
Gene Models: TEF
47
Infectious Bursal Disease
• Also known as Gumboro disease
• Caused by a Birnavirus (dsRNA)
• Usually diagnosed at 3-6 weeks old
• Spread through contaminated feed and water
• Infects B-cells
• Mortality can be up to 90% (usually around 20%)
• Symptoms: anorexia, depression, diarrhoea, ruffled feathers, bursal lesions, immuno-suppression
• Vaccination program (but different serotypes)
Experimental Design
• 3 spleen samples from control birds (line
BrL)
• 3 spleen samples from IBDV-infected birds
(4dpi) (line BrL)
• Compared Affymetrix whole genome
expression arrays with RNA-Seq
49
RNA-Seq Bioinformatics
Pipeline: FastQC → fastx → SOAP2 → our own database → counts of RNA-Seq tags for each gene → edgeR
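To show how per-gene tag counts become a fold change, the sketch below scales counts to counts per million and takes a log2 ratio between conditions. This is deliberately simpler than edgeR, which fits a negative binomial model and tests for significance; the gene names, counts, and pseudocount here are invented.

```python
import math


def counts_per_million(counts):
    """Scale raw tag counts by library size."""
    total = sum(counts.values())
    return {gene: count * 1e6 / total for gene, count in counts.items()}


def log2_fold_change(control, infected, pseudocount=1.0):
    """log2(infected / control) on CPM values; pseudocount avoids log of zero."""
    cpm_control = counts_per_million(control)
    cpm_infected = counts_per_million(infected)
    return {
        gene: math.log2((cpm_infected[gene] + pseudocount)
                        / (cpm_control[gene] + pseudocount))
        for gene in control
    }


# Hypothetical tag counts for two genes in one control and one infected sample.
control = {"geneA": 10, "geneB": 990}
infected = {"geneA": 160, "geneB": 840}
changes = log2_fold_change(control, infected)
print(changes)
```

Here geneA rises about 16-fold (log2 close to 4) after scaling for library size, while geneB falls slightly.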
Differential Gene Expression
Gene Symbol  adjPVal    FC    Gene Description (all Gallus gallus)
ART1         2.13E-235  166   ADP-ribosyltransferase 1
IL28B        5.60E-09   159   Interferon lambda, Interleukin 28
PTX3         1.63E-17   89    pentraxin-related gene, rapidly induced by IL-1 beta
IFNB1        4.69E-07   71    Interferon type B Precursor
VEPH1        2.70E-17   57    ventricular zone expressed PH domain homolog 1 (zebrafish)
MX2          1.21E-66   52    myxovirus (influenza virus) resistance 2 (mouse)
RSAD2        5.22E-61   48    radical S-adenosyl methionine domain containing 2
IFIT5        4.62E-15   42    interferon-induced protein with tetratricopeptide repeats 5
TMPRSS2      9.01E-26   42    transmembrane protease, serine 2
LYG1         2.58E-48   41    lysozyme G-like 1
LOC768689    1.72E-10   -15   hypothetical protein LOC768689
TNFRSF13C    1.18E-10   -15   tumor necrosis factor receptor superfamily, member 13C
DAAM1L       2.89E-13   -16   dishevelled-associated activator of morphogenesis 1-like
LOC424146    6.77E-20   -16   hypothetical LOC424146
FAM5B        2.32E-45   -17   family with sequence similarity 5, member B
PTPN5        3.45E-38   -17   protein tyrosine phosphatase, non-receptor type 5 (striatum-enriched)
DCLK1        1.05E-15   -23   doublecortin-like kinase 1
PROKR2       2.55E-18   -37   Prokineticin receptor 2
AMY1C        9.82E-09   -40   amylase, alpha 1C (salivary)
CLRN3        3.60E-106  -49   clarin 3
Annotated Genes
• Microarrays
– 693 / 828 (84%) annotated Affymetrix probes
• RNA-Seq
– 1509 /1867 (81%) annotated RNA tags
– 1082 (72%) unique to RNA-Seq
52
Enrichment Analysis: Microarray
Microarrays: Genes up: 330 Genes down: 223
GO enrichment:
Up-regulated genes
Immune response; cytokine
activity; chemokine activity;
regulation of IL-6 etc.
Down-regulated genes
Protein binding
Enriched locations: None
TFBS enrichment:
Up-regulated genes
ISRE, IRF7, Ovo
Down-regulated genes
None
53
Enrichment Analysis: RNA-Seq
RNA-Seq: Genes up: 733 Genes down: 822
GO enrichment:
Up-regulated genes
As for array data
Down-regulated genes
Carbohydrate binding; structure of
ribosome; biological adhesion;
multicellular organismal development
Enriched locations:
Up-regulated genes
chr1, chr20
Down-regulated genes
chrZ, chr4
TFBS enrichment:
Up-regulated genes
ISRE, IRF7, ZNF42
Down-regulated genes
CdxA, Nkx6_2, RSRFC4,
Prrx2, FOXP1
54
Advantages of RNA-Seq
55
Alternate Transcripts
56
Acknowledgements
Funding
Biotechnology and Biological
Sciences Research Council
Roslin Institute
Dave Burt
Bob Paton
Ark-Genomics
Le Yu
Institute for Animal Health
Pete Kaiser
Jean-Remy Sadeyen
Centre for Genomic Regulation
(Barcelona)
Darek Kedra
Cedric Notredame
57
Avian RNA-Seq Consortium
• 37+ labs world-wide,
agreed to pool RNA-Seq
data
• Multiple tissues,
treatments, embryo and
adults
• Build gene models within
Ensembl
• Return for data analysis
of gene expression
58
Applications
59
60
DNA Methylation
61
MeDIP-Seq: NPAS4
62
Data Integration
63
PacBio Real-Time Sequencing