International HapMap Project Resource For Association Studies of

International HapMap Project
Resource For Association
Studies of The Future
Gonçalo Abecasis
Center for Statistical Genetics
University of Michigan
Genetic Association Studies
¾
Identify genetic variants with relatively small
individual contributions to disease risk
¾
Require detail measurement of genetic variation
z
z
> 10,000,000 catalogued genetic variants, so …
Until recently, limited to candidate genes or regions
• A hit-and-miss approach…
z
¾
Decreasing assay costs allow comprehensive studies
A few variants could represent all common variants
z
But identifying this subset could be challenging …
Linkage Disequilibrium
¾
Chromosomes are
mosaics
Ancestor
Present-day
¾
Shaped by
z
¾
Recombination, Mutation, Drift
Tightly linked markers
z
z
Reflect ancestral haplotypes
Alleles associated
Tagging SNPs
¾
¾
In a typical short chromosome segment, there are
only a few distinct haplotypes
Carefully selected SNPs can determine status of
other SNPs
Haplo1
T T
T
30%
Haplo2
T T
T
20%
Haplo3
T T
T
20%
Haplo4
T T
T
20%
Haplo5
T T
T
10%
HapMap Project
¾
High-density SNP genotyping across the
genome will provide information about
z
z
SNP validation, frequency, assay conditions
Relationship between alleles in the genome
¾
Public resource to increase efficiency of
association for medically relevant traits
¾
Data is freely available, www.hapmap.org
International Effort
¾
Genotyping being carried out in:
z
z
z
z
z
¾
Canada (McGill)
China (Beijing, Hong Kong, Shanghai)
Japan (Riken)
United Kingdom (Sanger)
United States (Baylor, Illumina, UCSF/WashU, Broad)
International Working Groups:
z
Analysis, QA/QC, Dataflow, Ethical Issues, …
The Hapmap Samples
¾
European Ancestry
z
¾
African Ancestry
z
¾
90 CEPH family members (CEU, 30 trios)
90 Yoruba from Ibadan, Nigeria (YRI, 30 trios)
East Asian
z
z
45 Japanese from Tokyo, Japan (JAT)
45 Han Chinese from Beijing, China (HCB)
Current Status
¾ Data publicly available for >1,000,000
SNPs assayed in 4 populations
z
Including about 700,000 common SNPs
¾ The first phase provides information on
one polymorphic marker per 5kb in each
population
Ongoing Quality Assessment
¾ Project includes exercises to evaluate and
ensure quality of generated data
z
Including evaluating agreement across platforms
¾ Latest completed round:
z
z
Data for 1496 markers from public website
Repeat data collected with 11 different protocols
Consensus Summary
¾
¾
116,536 genotypes
1,295 polymorphic markers
z
Each Genotyped at 8.8 Centers on Average
¾
99.99% completion rate (14 missing genotypes)
¾
5 Mendelian inconsistencies
z
z
z
In 38,836 trios and 9 parent-offspring pairs
Corresponds to "error rate" of 0.00016
(But these are probably not errors!)
Quality Assessment Summary
¾ Error rate is 0.00228 vs original submission
z
z
Most markers show zero or one differences
3 markers account for most differences
¾ Error rate seems lower with “internal” checks
z
z
0.0006 based on same protocol duplicates
0.0006 based on Mendelian inconsistencies
Empirical of LD (chromosome 2, R²)
0.6
Statistic
0.4
Han, Japanese
Yoruba
CEPH
0.2
0
0
50
100
150
200
250
Distance (kb)
LD extends further in CEPH and the Han/Japanese than in the Yoruba
Most chromosomes exhibit regions of extended LD on the megabase scale
Genomic Distribution of LD
(CEPH)
Expected r2 at 30kb
Bright Red > 0.88
Dark Blue < 0.12
Genomic Distribution of LD
(JPT + HCB)
Expected r2 at 30kb
Bright Red > 0.88
Dark Blue < 0.12
Genomic Distribution of LD
(YRI)
Expected r2 at 10kb
Bright Red > 0.88
Dark Blue < 0.12
Genomic Distribution of LD
(YRI)
Expected r2 at 30kb
Bright Red > 0.88
Dark Blue < 0.12
Big Picture
¾
Substantial variation in genomic LD
¾
Global features:
z
z
z
z
¾
More LD on larger chromosomes and X
LD is generally low near telomeres
LD is generally high near centromeres
Relative LD is usually similar across samples
Best predictor of disequilibrium in a region is LD in
another population for that region
Tagging SNPs
¾ Fine description of linkage disequilibrium
allows identification of SNPs that capture
variation in each region
¾ To illustrate the possibilities, selected set
of tagging SNPs using pairwise r2 criterion
z
SNPs are good surrogates if:
•
•
Nearly identical frequency
Nearly always appear on the same haplotype
Example:
Chromosome 3 Tags
Gene Density
Genotyped SNPs
Available SNPs
¾
45,727 SNPs genotyped (December 2004)
z
Exhaustive tagging:
• 11,366 tags (r2 threshold of 0.50)
• 18,638 tags (r2 threshold of 0.80)
z
But even fewer tags can be quite useful …
Example:
Chromosome 3 Tags
Gene Density
Genotyped SNPs
Available SNPs
¾
45,727 SNPs genotyped (December 2004)
z
Exhaustive tagging:
• 11,366 tags (r2 threshold of 0.50)
• 18,638 tags (r2 threshold of 0.80)
z
Low hanging fruit (the 10% best tags):
• 1,136 tags capture 21,491 markers (r2 threshold of 0.50)
• 1,863 tags capture 18,994 markers (r2 threshold of 0.80)
Chromosome 2 Tagging…
(r2 threshold of 0.50)
¾
Exhaustive tagging:
z
z
z
¾
13,706 tags for 57,503 common SNPs
23,710 tags for 53,534 common SNPs
11,045 tags for 43,950 common SNPs
Low hanging fruit (top 10% of all tags)
z
z
z
¾
CEPH:
YRI:
HBJT:
CEPH:
YRI:
HBJT:
1,370 tags for 27,151 common SNPs
2,371 tags for 21,724 common SNPs
1,104 tags for 20,202 common SNPs
10 scans “half-genome scans” focused on low hanging
fruit cost about the same as one comprehensive scan
z
But genotype cost may fall too fast to make this necessary!
So far …
¾
Description of extensive genomic variation in
linkage disequilibrium
z
z
¾
Reflects the main purpose of the project,
z
¾
Similarities and differences between populations
Ability to find SNPs that capture variation in any region
Aid association studies for medically important traits
However, data also contain information about
interesting and unusual patterns of variation …
Does Sequence Variation
Predict Linkage Disequilibrium?
Some Very Hot Spots
¾ High recombination regions
z
z
z
z
Pair of markers within 10kb
5 markers within 40kb to the left
5 markers within 40kb to the right
None of 25 pairs shows r2 > 0.10
¾ Compared to unselected regions
z
As above, no constraint on LD
10
5
Chromosome
15
20
Some Very Hot Spots
0
50
100
150
Position (Mb)
200
250
Sequence in Very Hot Intervals
Feature Candidates
CEU
YRI
JPT, HCB
N of Intervals
*varies
10155
17144
12224
Centromere (5Mb)
Telomere (5Mb)
GC
CpG Island
Known Exons
0.047
0.036
0.407
0.006
0.026
0.033
0.079
0.437
0.008
0.021
0.032
0.091
0.439
0.008
0.022
0.031
0.078
0.437
0.008
0.021
--+++
+++
+++
---
Repeat -- LINE
Repeat -- SINE
Repeat -- Simple
0.168
0.117
0.008
0.136
0.144
0.010
0.135
0.143
0.010
0.134
0.143
0.010
--+++
+++
Segmental Duplications
0.014
0.010
0.011 --
0.010
---
* Intervals evaluated: 611255 (HCB, JPT), 670675 (CEPH) 696575 (YRI)
Association
Big Picture
¾ Best predictor of LD is information about a
different sample
¾ Multiple genomic features appear to be
associated with LD
z
GC, repeat types, exons, conserved sequences
¾ We are in the process of comparing the
relationship between gene function and LD
Linkage Analysis with Markers in
Linkage Disequilibrium
Gonçalo Abecasis
University of Michigan
SNPs
¾ Abundant diallelic genetic markers
¾ Amenable to automated genotyping
z
Fast, cheap genotyping with low error rates
¾ Rapidly replacing microsatellites in many
linkage studies
The Problem
¾ Linkage analysis methods assume that
markers are in linkage equilibrium
z
Violation of this assumption can produce large
biases
¾ This assumption affects ...
z
z
z
Parametric and nonparametric linkage
Variance components analysis
Haplotype estimation
Standard Hidden Markov Model
G1
G3
G2
P(G1 | I1 )
P(G2 | I 2 )
P ( I 2 | I1 )
P(G3 | I 3 )
I3
I2
I1
GM
P( I 3 | I 2 )
P(GM | I M )
IM
P (...)
Observed Genotypes Are Connected Only Through IBD States …
Our Approach
¾ Cluster groups of SNPs in LD
z
z
z
Assume no recombination within clusters
Estimate haplotype frequencies
Sum over possible haplotypes for each founder
¾ Two pass computation …
z
z
Group inheritance vectors that produce
identical sets of founder haplotypes
Calculate probability of each distinct set
Hidden Markov Model
G1
G3
G2
P(G1 , G2 | I cluster1 )
I cluster1
GM −1 GM
G4
I cluster 2
P( I cluster 2 | I cluster1 )
P (GM −1,GM | I clusterN )
P(G3 , G4 | I cluster 2 )
I clusterN
P (...)
Example With Clusters of Two Markers …
Practically …
P(G1...GC | f1... f h , v) =
h
∑ ... ∑ Pr(G ...G
H1 =1
=
¾
H 2 f =1
h
h
H1 =1
H 2 f =1
∑ ... ∑
1
C
| H1...H 2 f , v) Pr( H1...H 2 f | f1... f h )
2f
Pr(G1...GC | H1...H 2 f , v)∏ Pr( H i | f1... f h )
i =1
Probability of observed genotypes G1…GC
z
z
¾
h
Conditional on haplotype frequencies f1 .. fh
Conditional on a specific inheritance vector v
Calculated by iterating over founder haplotypes
Computationally …
¾ Avoid iteration over h2f
z
z
founder haplotypes
List possible haplotype sets for each cluster
List is product of allele graphs for each marker
¾ Group inheritance vectors with identical lists
z
z
z
First, generate lists for each vector
Second, find equivalence groups
Finally, evaluate nested sum once per group
Simulations …
¾ 2000 genotyped individuals per dataset
z
z
0, 1, 2 genotyped parents per sibship
2, 3, 4 genotyped affected siblings
¾ Clusters of 3 markers, centered 5 cM apart
z
Used Hapmap to generate haplotype frequencies
• Clusters of 3 SNPs in 100kb windows
• Windows are 5 Mb apart along chromosome 13
• All SNPs had minor allele frequency > 5%
z
Simulations assumed 1 cM / Mb
Average LOD Scores
(Null Hypothesis)
Analysis
Strategy
Ignore
LD
Average LOD
Model
Independent
LD
SNPs
No parents genotyped
… 2 sibs per family
… 3 sibs per family
… 4 sibs per family
1.516
2.695
2.412
-0.010
-0.002
-0.002
-0.005
-0.002
-0.005
One parent genotyped
… 2 sibs per family
… 3 sibs per family
… 4 sibs per family
0.571
0.660
0.522
0.008
-0.021
0.002
0.005
-0.028
0.007
Two parents genotyped
… 2 sibs per family
… 3 sibs per family
… 4 sibs per family
0.030
0.014
0.014
-0.003
-0.016
-0.007
-0.005
-0.021
-0.006
5% Significance Thresholds
(based on peak LODs under null)
Analysis
Strategy
Significance Threshold
Ignore
Model
Independent
LD
LD
SNPs
No parents genotyped
… 2 sibs per family
… 3 sibs per family
… 4 sibs per family
11.90
18.06
15.52
1.18
1.38
1.24
1.18
1.29
1.22
One parent genotyped
… 2 sibs per family
… 3 sibs per family
… 4 sibs per family
5.52
5.84
4.79
1.54
1.25
1.42
1.36
1.20
1.36
Two parents genotyped
… 2 sibs per family
… 3 sibs per family
… 4 sibs per family
1.64
1.51
1.41
1.47
1.41
1.34
1.36
1.31
1.26
Empirical Power
Analysis
Strategy
Ignore
LD
Power
Model
Independent
LD
SNPs
No parents genotyped
… 2 sibs per family
… 3 sibs per family
… 4 sibs per family
0.097
0.133
0.196
0.304
0.540
0.874
0.247
0.405
0.680
One parent genotyped
… 2 sibs per family
… 3 sibs per family
… 4 sibs per family
0.088
0.213
0.397
0.158
0.562
0.782
0.161
0.430
0.608
Two parents genotyped
… 2 sibs per family
… 3 sibs per family
… 4 sibs per family
0.141
0.436
0.779
0.163
0.435
0.779
0.145
0.401
0.701
Disease Model, p = 0.10, f11 = 0.01, f12 = 0.02, f22 = 0.04
Conclusions from Simulations
¾ Modeling linkage disequilibrium crucial
z
Especially when parental genotypes missing
¾ Ignoring linkage disequilibrium
z
z
z
Inflates LOD scores
Both small and large sibships are affected
Loses ability to discriminate true linkage
Example
¾ Psoriasis data of Stuart et al (2005)
¾ 274 families
z
z
2 or more affected individuals
Up to 21 genotyped individuals per family
¾ Initial microsatellite scan, additional fine-
mapping markers added to chromosome
17 candidate region
0
1
2
3
Kong and Cox LOD Score
4
D17S928
rs869190
rs2019154
rs1564864
AAT09
D17S784
D17S1830
D17S1806
D17S836
D17S1847
D17S674
D17S802
TTCA006M
D17S939
D17S937
D17S785
rs895691
rs734232
rs745318
D17S1301
D17S2059
D17S1786
D17S2193
D17S1290
AAT245
D17S787
D17S800
D17S1299
D17S1294
D17S783
D17S2196
D17S799
ATA78D02N
D17S974
GATA158H0
D17S796
D17S1308
D17S849
Position (cM)
150
100
50
0