From Phenotype to Genotype and Back Again —
Animal Genomics Enabling Prediction
Alan L. Archibald
The Roslin Institute and Royal (Dick) School of Veterinary Studies
University of Edinburgh
Genotype - phenotype
• Aim
– To predict outcomes
• Efficacy of drug
• Susceptibility to cancer
• Performance of daughters of elite dairy bull
• Susceptibility to nematode infections
• Discovery
– From phenotype to genotype (gene)
• Prediction
– From phenotype to genotype (breeding value)
– From genotype to phenotype
– From sequence to consequence
1920s and 30s: Fisher, Lush and others. Population genetics
1953: Watson and Crick
1970s +: Advances in quantitative analysis
1977: DNA sequenced. ΦX174, 5,386 nt
1990: Human Genome Project launched
1990s +: Quantitative trait locus (QTL) mapping
1991: PiGMaP project starts
2001: Draft human genome sequence
2001: Genomic selection proposed
Animal model, infinitesimal model
‘Halothane’ gene test
Marker Assisted Selection (MAS)
PiGMaP – 25 years old
€1.2 million
Linkage / recombination map
Physical / cytogenetic map
Comparative map
cDNA (ESTs)
microsatellites
1962
2002
Prediction success
• Selective animal breeding
• Animal model
• Phenotypic selection
• Prediction of breeding value (genotype) from phenotype
• Successful EU companies
– e.g. AquaGen, Aviagen, Cherry Valley, Cogent, CRV, Genus, JSR Genetics, Hendrix-Genetics, Landcatch Natural Selection, Topigs Norsvin
50% more pigs: 14 pigs/yr to 21 pigs/yr
33% less feed: 410 kg feed / pig to 273 kg feed / pig
33% more lean: 34 kg lean / pig to 45 kg lean / pig
Modern intensive agriculture is efficient
“Why Industrial Farms Are Good for the Environment”
Jayson Lusk, New York Times, 23 Sept 2016
Selection works
• Age-matched
• Seven rounds of selection per annum
• Black box, but…
Successes – from association to causation
• DGAT1 – dairy cattle, milk yield
• Callipyge – sheep, muscling
• MSTN – sheep, muscling
• IGF2 – pigs, muscling
• Noteworthy
– Regulatory sequences, epigenetics
• One gene at a time: slow, inefficient
Knowledge of causation enabled more sophisticated selective breeding
2001: Genomic selection proposed
2002: Mouse draft genome sequence
2003: Human genome sequence “finished” ($3 billion)
2003: ENCODE (1%) launched
2004: Chicken genome sequenced
2005: Dog genome sequenced
2007: Cat genome sequenced
2007: ENCODE genome-wide
2008: Human 1000 Genomes Project launched
2008: Bovine 50K SNP chip
2009: Cattle genome sequenced; horse genome sequenced; mouse genome “finished”
2009, 2010: Pig 60K SNP chip; 750K bovine SNP chips; sheep 60K SNP chip
2010: Turkey genome sequenced
From Marker-Assisted to Genomic Selection
“…This type of approach combined with cheap and high
density markers, could allow a move from selection based
on a combination of “infinitesimal” effects plus individual
loci to effective total genomic selection…..”
Genomics already delivering socioeconomic impact in agriculture
• Genomic selection (GS)
– GS theory developed in 2001 before technology available
– First 50K SNP chip (cattle) 2008; 650K in 2010
– GS implemented in all major livestock sectors in developed world
– GS is underpinning faster, more accurate and sustainable genetic improvement
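To make the idea of predicting breeding value from genome-wide markers concrete, here is a minimal sketch of one common formulation, ridge regression on SNP genotypes (often called SNP-BLUP). It is an illustration only and is not claimed to be the method used by any programme mentioned in this talk. The function names, the toy genotype coding (0/1/2 copies of an allele) and the shrinkage parameter `lam` are assumptions made for the example.

```python
from typing import List

Matrix = List[List[float]]


def solve_linear(a: Matrix, b: List[float]) -> List[float]:
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    # Augment the matrix so row operations act on both sides at once.
    aug = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= factor * aug[col][c]
    # Back substitution on the upper-triangular system.
    x = [0.0] * n
    for i in reversed(range(n)):
        s = sum(aug[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (aug[i][n] - s) / aug[i][i]
    return x


def estimate_snp_effects(genotypes: Matrix, phenotypes: List[float],
                         lam: float) -> List[float]:
    """Ridge estimates: solve (Z'Z + lam*I) beta = Z'y on centred data.

    lam (hypothetical name) controls shrinkage; lam > 0 keeps the system
    solvable even when there are more SNPs than animals.
    """
    n, m = len(genotypes), len(genotypes[0])
    col_means = [sum(row[j] for row in genotypes) / n for j in range(m)]
    z = [[row[j] - col_means[j] for j in range(m)] for row in genotypes]
    y_mean = sum(phenotypes) / n
    y = [p - y_mean for p in phenotypes]
    ztz = [[sum(z[i][j] * z[i][k] for i in range(n)) + (lam if j == k else 0.0)
            for k in range(m)] for j in range(m)]
    zty = [sum(z[i][j] * y[i] for i in range(n)) for j in range(m)]
    return solve_linear(ztz, zty)


def predict_gebv(genotype: List[float], effects: List[float]) -> float:
    """Genomic estimated breeding value as the sum of genotype x SNP effect.

    The genotype is not centred here, so values differ from a centred GEBV
    by a constant; the ranking of candidates is unchanged.
    """
    return sum(g * e for g, e in zip(genotype, effects))


# Toy reference population: phenotype depends only on the first SNP.
reference_genotypes = [[0, 0], [1, 2], [2, 1], [1, 0], [0, 1], [2, 2]]
reference_phenotypes = [2.0 * g[0] for g in reference_genotypes]
effects = estimate_snp_effects(reference_genotypes, reference_phenotypes, lam=1.0)
# A selection candidate with no phenotype record can now be scored.
candidate_score = predict_gebv([2, 0], effects)
```

The point of the sketch is the two-step structure: effects are estimated once in a genotyped and phenotyped reference population, then applied to candidates that have genotypes only.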
Accuracy – what has been achieved?
USDA dairy cattle genomic evaluation
Courtesy of George Wiggans (USDA, Beltsville)
Milk yield accuracy: pedigree 0.51, genomic 0.86
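Why the gain in accuracy matters can be seen from the standard textbook expression for response to selection. This is not shown on the slide; the symbols are the usual ones and are introduced here for illustration: i is selection intensity, r is accuracy, σ_A is the additive genetic standard deviation and L is the generation interval. Assuming i, σ_A and L are held equal, the ratio of expected responses reduces to the ratio of accuracies:

```latex
\Delta G = \frac{i \, r \, \sigma_A}{L},
\qquad
\frac{\Delta G_{\text{genomic}}}{\Delta G_{\text{pedigree}}}
  = \frac{r_{\text{genomic}}}{r_{\text{pedigree}}}
  = \frac{0.86}{0.51} \approx 1.69
```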
Evolution of Genomic selection
• GS0.0
– The original model
– Linkage disequilibrium based
• GS1.0
– What has happened in practice
– Linkage based
• GS2.0
– The future
– LD and QTN based
– Requires lots of data
Goddard & Hayes
Nat. Rev. Genet. 2009
GS accuracy
• Accurate really only for close relatives
[Figure: accuracy (y-axis, 0 to 0.9) plotted against the mean of the top ten relationships (x-axis, 0 to 0.6); R² = 0.962. Clark et al. (2012)]
From SNPs to sequence
• In the next five years sequence data will supplant SNP genotypes
• Two approaches
– Sequencing individuals (e.g. 1000 Bull genomes project)
• Expensive even at $1000 per genome
• Alternatively genotyping-by-sequencing on new platforms (e.g. Illumina HiSeqX), then impute
– Sequencing populations
• Aiming for $10 per genome
Multiple (aligned) animal genomes
• Pigs
– Groenen (Wageningen) ~300 individual pigs
– Korean ~60 individual pigs
– China ?? pigs
– 96 pig exomes (Roslin)
• Sheep
– 453 genomes in SheepGenomesDB http://sheepgenomesdb.org/home
• Chickens
– 10s of individuals (e.g. 10 individual J line brown egg layers)
• Cattle
– ~3,000 genomes (Taylor estimate)
• 1000 Bull Genomes Project
– Collaborative, cloud data repository
– 1500+ bulls, average coverage ~11x
– Data analysis cycles for genomic prediction
• NextGen
– >400 sheep, goat, cattle genomes
Sequencing populations
• Aim: sequence data for 100K to 1M individuals at $10 per individual
• Exploits:
– pedigree structures in managed population
– imputation from low sequence coverage
• Assemble shared haplotypes from partial low coverage sequence of 100s of related individuals
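The idea of assembling a shared haplotype from partial, low coverage sequence can be sketched in a few lines. This is a deliberately simplified illustration, not the LCSeq method itself: it assumes every individual in the group carries the same haplotype over the region, that alleles are coded 0/1, and that there is no recombination or phasing to resolve. The function names are hypothetical.

```python
from collections import Counter
from typing import List, Optional

# One allele (0 or 1) per site; None where no read covers the site.
Observation = List[Optional[int]]


def consensus_haplotype(observations: List[Observation]) -> Observation:
    """Majority vote per site across relatives assumed to share a haplotype."""
    n_sites = len(observations[0])
    consensus: Observation = []
    for site in range(n_sites):
        alleles = [obs[site] for obs in observations if obs[site] is not None]
        if not alleles:
            # No relative was sequenced at this site, so it stays unknown.
            consensus.append(None)
        else:
            consensus.append(Counter(alleles).most_common(1)[0][0])
    return consensus


def impute(observation: Observation, consensus: Observation) -> Observation:
    """Fill an individual's uncovered sites from the consensus haplotype."""
    return [c if o is None else o for o, c in zip(observation, consensus)]


# Three relatives, each with patchy coverage of the same five sites.
progeny = [
    [1, None, 0, None, 1],
    [None, 0, 0, None, 1],
    [1, 0, None, None, None],
]
shared = consensus_haplotype(progeny)
completed = [impute(p, shared) for p in progeny]
```

Each individual contributes only a few sites, yet pooling across relatives recovers most of the haplotype. A site that nobody covers remains missing, which is why the approach relies on sequencing many related individuals.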
LCSeq for whole genome sequencing
[Figure, panels A to C. Recoverable labels: True haplotypes; 2x sequencing (10 individuals): Sire, Progeny 1, Progeny 2, …, Progeny 10; Derive consensus haplotypes; Impute sequence for all individuals; 10x sequencing (5 individuals)]
• Sequencing few individuals not that useful
• Sequence everybody at low-x & impute
• Make the population the target not the individual
– ~250K pigs, Genus
– ~250K chickens, Aviagen
Genomic selection
• GS theory developed in 2001 before technology available
• First 50K SNP chip (cattle) 2008; 650K in 2010
• GS implemented in all major livestock sectors in developed world
• GS is underpinning faster, more accurate and sustainable genetic improvement
• From SNPs to sequence (via imputation)
• Adding knowledge of SNP effects
– Coding/non-coding; known/predicted
2012: ENCODE
2012: Pig genome sequenced
2012: Chicken 600K SNP chip
2013: Goat genome sequenced
2013: Duck genome sequenced
2013 onwards: Genotype-by-sequence
2014: Sheep genome sequenced
2014: Salmon SNP chip
2015: Functional Annotation of Animal Genomes (FAANG) launched
2015: Pig 650K SNP chip
2015 onwards: LCseq for genomic selection; SNPs impute to sequence
2016: Improved reference genomes: goat, pig, sheep, cattle, chicken
2016: FAANG-Europe COST Action
Fish: Tilapia, Cod, Salmon, …
From sequence to consequence
Phenome: growth, feed efficiency, body composition, disease resistance
Adapted from Ritchie et al. 2015 Nature Reviews Genetics 16: 85
Reference genome improvement
• PacBio long read technology, de novo assembly
– Goat, pig, sheep, cattle
• Sscrofa10.2: 73,500 contigs; contig N50 ~80 kbp
• Sscrofa11: <200 contigs; contig N50 ~35 Mbp
• Disruptive technology, multiple genome(s) assemblies
– Annotation - “Best in genome”
– Graph visualization, alignment tools under development
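The contig N50 figures quoted above are a summary of assembly contiguity: N50 is the contig length such that contigs of that length or longer hold at least half of the assembled bases. A minimal sketch of the calculation follows; the function name and the toy lengths are illustrative and are not taken from the Sscrofa assemblies.

```python
from typing import Iterable


def n50(lengths: Iterable[int]) -> int:
    """Return the contig N50 for a collection of contig lengths."""
    ordered = sorted(lengths, reverse=True)
    half = sum(ordered) / 2
    running = 0
    # Walk from the longest contig down until half the assembly is covered.
    for length in ordered:
        running += length
        if running >= half:
            return length
    raise ValueError("no contig lengths given")


# Toy example: six contigs totalling 290 units.
toy_n50 = n50([80, 70, 50, 40, 30, 20])
```

Fewer, longer contigs raise N50, which is why the move from about 73,500 contigs to fewer than 200 goes together with the jump from roughly 80 kbp to roughly 35 Mbp.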
Discovering functional sequences
• Evolutionary
– Sequence comparison, conservation
– 1000G, G10K,…
– Genome sequence sufficient
– Conserved, but what is it?
– Highly variable ≠ nonfunctional
• Functional, biochemical
– Assay-by-sequence
– ENCODE, iHEC, Epigenome roadmap, FAANG
– Expensive
– Exploring 4-dimensional space (location + time)
– Noise or biologically meaningful?
• 80.4% of the human genome participates in at least one biochemical RNA- and/or chromatin-associated event in at least one cell type
• Promoter functionality can explain most of the variation in RNA expression
• SNPs associated with disease by GWAS are enriched within non-coding functional elements
>$250 million
Richly annotated reference genomes
• A key shared (open access) resource
– for 21st century biological research
• For effective exploitation
– for genomics enabled prediction
• e.g. selective animal breeding
• Expensive shared resources
– International collaborative consortia