International HapMap Project Resource For Association Studies of The Future Gonçalo Abecasis Center for Statistical Genetics University of Michigan Genetic Association Studies ¾ Identify genetic variants with relatively small individual contributions to disease risk ¾ Require detail measurement of genetic variation z z > 10,000,000 catalogued genetic variants, so … Until recently, limited to candidate genes or regions • A hit-and-miss approach… z ¾ Decreasing assay costs allow comprehensive studies A few variants could represent all common variants z But identifying this subset could be challenging … Linkage Disequilibrium ¾ Chromosomes are mosaics Ancestor Present-day ¾ Shaped by z ¾ Recombination, Mutation, Drift Tightly linked markers z z Reflect ancestral haplotypes Alleles associated Tagging SNPs ¾ ¾ In a typical short chromosome segment, there are only a few distinct haplotypes Carefully selected SNPs can determine status of other SNPs Haplo1 T T T 30% Haplo2 T T T 20% Haplo3 T T T 20% Haplo4 T T T 20% Haplo5 T T T 10% HapMap Project ¾ High-density SNP genotyping across the genome will provide information about z z SNP validation, frequency, assay conditions Relationship between alleles in the genome ¾ Public resource to increase efficiency of association for medically relevant traits ¾ Data is freely available, www.hapmap.org International Effort ¾ Genotyping being carried out in: z z z z z ¾ Canada (McGill) China (Beijing, Hong Kong, Shanghai) Japan (Riken) United Kingdom (Sanger) United States (Baylor, Illumina, UCSF/WashU, Broad) International Working Groups: z Analysis, QA/QC, Dataflow, Ethical Issues, … The Hapmap Samples ¾ European Ancestry z ¾ African Ancestry z ¾ 90 CEPH family members (CEU, 30 trios) 90 Yoruba from Ibadan, Nigeria (YRI, 30 trios) East Asian z z 45 Japanese from Tokyo, Japan (JAT) 45 Han Chinese from Beijing, China (HCB) Current Status ¾ Data publicly available for >1,000,000 SNPs assayed in 4 populations z Including about 700,000 common SNPs ¾ The first phase provides information on one polymorphic marker per 5kb in each population Ongoing Quality Assessment ¾ Project includes exercises to evaluate and ensure quality of generated data z Including evaluating agreement across platforms ¾ Latest completed round: z z Data for 1496 markers from public website Repeat data collected with 11 different protocols Consensus Summary ¾ ¾ 116,536 genotypes 1,295 polymorphic markers z Each Genotyped at 8.8 Centers on Average ¾ 99.99% completion rate (14 missing genotypes) ¾ 5 Mendelian inconsistencies z z z In 38,836 trios and 9 parent-offspring pairs Corresponds to "error rate" of 0.00016 (But these are probably not errors!) Quality Assessment Summary ¾ Error rate is 0.00228 vs original submission z z Most markers show zero or one differences 3 markers account for most differences ¾ Error rate seems lower with “internal” checks z z 0.0006 based on same protocol duplicates 0.0006 based on Mendelian inconsistencies Empirical of LD (chromosome 2, R²) 0.6 Statistic 0.4 Han, Japanese Yoruba CEPH 0.2 0 0 50 100 150 200 250 Distance (kb) LD extends further in CEPH and the Han/Japanese than in the Yoruba Most chromosomes exhibit regions of extended LD on the megabase scale Genomic Distribution of LD (CEPH) Expected r2 at 30kb Bright Red > 0.88 Dark Blue < 0.12 Genomic Distribution of LD (JPT + HCB) Expected r2 at 30kb Bright Red > 0.88 Dark Blue < 0.12 Genomic Distribution of LD (YRI) Expected r2 at 10kb Bright Red > 0.88 Dark Blue < 0.12 Genomic Distribution of LD (YRI) Expected r2 at 30kb Bright Red > 0.88 Dark Blue < 0.12 Big Picture ¾ Substantial variation in genomic LD ¾ Global features: z z z z ¾ More LD on larger chromosomes and X LD is generally low near telomeres LD is generally high near centromeres Relative LD is usually similar across samples Best predictor of disequilibrium in a region is LD in another population for that region Tagging SNPs ¾ Fine description of linkage disequilibrium allows identification of SNPs that capture variation in each region ¾ To illustrate the possibilities, selected set of tagging SNPs using pairwise r2 criterion z SNPs are good surrogates if: • • Nearly identical frequency Nearly always appear on the same haplotype Example: Chromosome 3 Tags Gene Density Genotyped SNPs Available SNPs ¾ 45,727 SNPs genotyped (December 2004) z Exhaustive tagging: • 11,366 tags (r2 threshold of 0.50) • 18,638 tags (r2 threshold of 0.80) z But even fewer tags can be quite useful … Example: Chromosome 3 Tags Gene Density Genotyped SNPs Available SNPs ¾ 45,727 SNPs genotyped (December 2004) z Exhaustive tagging: • 11,366 tags (r2 threshold of 0.50) • 18,638 tags (r2 threshold of 0.80) z Low hanging fruit (the 10% best tags): • 1,136 tags capture 21,491 markers (r2 threshold of 0.50) • 1,863 tags capture 18,994 markers (r2 threshold of 0.80) Chromosome 2 Tagging… (r2 threshold of 0.50) ¾ Exhaustive tagging: z z z ¾ 13,706 tags for 57,503 common SNPs 23,710 tags for 53,534 common SNPs 11,045 tags for 43,950 common SNPs Low hanging fruit (top 10% of all tags) z z z ¾ CEPH: YRI: HBJT: CEPH: YRI: HBJT: 1,370 tags for 27,151 common SNPs 2,371 tags for 21,724 common SNPs 1,104 tags for 20,202 common SNPs 10 scans “half-genome scans” focused on low hanging fruit cost about the same as one comprehensive scan z But genotype cost may fall too fast to make this necessary! So far … ¾ Description of extensive genomic variation in linkage disequilibrium z z ¾ Reflects the main purpose of the project, z ¾ Similarities and differences between populations Ability to find SNPs that capture variation in any region Aid association studies for medically important traits However, data also contain information about interesting and unusual patterns of variation … Does Sequence Variation Predict Linkage Disequilibrium? Some Very Hot Spots ¾ High recombination regions z z z z Pair of markers within 10kb 5 markers within 40kb to the left 5 markers within 40kb to the right None of 25 pairs shows r2 > 0.10 ¾ Compared to unselected regions z As above, no constraint on LD 10 5 Chromosome 15 20 Some Very Hot Spots 0 50 100 150 Position (Mb) 200 250 Sequence in Very Hot Intervals Feature Candidates CEU YRI JPT, HCB N of Intervals *varies 10155 17144 12224 Centromere (5Mb) Telomere (5Mb) GC CpG Island Known Exons 0.047 0.036 0.407 0.006 0.026 0.033 0.079 0.437 0.008 0.021 0.032 0.091 0.439 0.008 0.022 0.031 0.078 0.437 0.008 0.021 --+++ +++ +++ --- Repeat -- LINE Repeat -- SINE Repeat -- Simple 0.168 0.117 0.008 0.136 0.144 0.010 0.135 0.143 0.010 0.134 0.143 0.010 --+++ +++ Segmental Duplications 0.014 0.010 0.011 -- 0.010 --- * Intervals evaluated: 611255 (HCB, JPT), 670675 (CEPH) 696575 (YRI) Association Big Picture ¾ Best predictor of LD is information about a different sample ¾ Multiple genomic features appear to be associated with LD z GC, repeat types, exons, conserved sequences ¾ We are in the process of comparing the relationship between gene function and LD Linkage Analysis with Markers in Linkage Disequilibrium Gonçalo Abecasis University of Michigan SNPs ¾ Abundant diallelic genetic markers ¾ Amenable to automated genotyping z Fast, cheap genotyping with low error rates ¾ Rapidly replacing microsatellites in many linkage studies The Problem ¾ Linkage analysis methods assume that markers are in linkage equilibrium z Violation of this assumption can produce large biases ¾ This assumption affects ... z z z Parametric and nonparametric linkage Variance components analysis Haplotype estimation Standard Hidden Markov Model G1 G3 G2 P(G1 | I1 ) P(G2 | I 2 ) P ( I 2 | I1 ) P(G3 | I 3 ) I3 I2 I1 GM P( I 3 | I 2 ) P(GM | I M ) IM P (...) Observed Genotypes Are Connected Only Through IBD States … Our Approach ¾ Cluster groups of SNPs in LD z z z Assume no recombination within clusters Estimate haplotype frequencies Sum over possible haplotypes for each founder ¾ Two pass computation … z z Group inheritance vectors that produce identical sets of founder haplotypes Calculate probability of each distinct set Hidden Markov Model G1 G3 G2 P(G1 , G2 | I cluster1 ) I cluster1 GM −1 GM G4 I cluster 2 P( I cluster 2 | I cluster1 ) P (GM −1,GM | I clusterN ) P(G3 , G4 | I cluster 2 ) I clusterN P (...) Example With Clusters of Two Markers … Practically … P(G1...GC | f1... f h , v) = h ∑ ... ∑ Pr(G ...G H1 =1 = ¾ H 2 f =1 h h H1 =1 H 2 f =1 ∑ ... ∑ 1 C | H1...H 2 f , v) Pr( H1...H 2 f | f1... f h ) 2f Pr(G1...GC | H1...H 2 f , v)∏ Pr( H i | f1... f h ) i =1 Probability of observed genotypes G1…GC z z ¾ h Conditional on haplotype frequencies f1 .. fh Conditional on a specific inheritance vector v Calculated by iterating over founder haplotypes Computationally … ¾ Avoid iteration over h2f z z founder haplotypes List possible haplotype sets for each cluster List is product of allele graphs for each marker ¾ Group inheritance vectors with identical lists z z z First, generate lists for each vector Second, find equivalence groups Finally, evaluate nested sum once per group Simulations … ¾ 2000 genotyped individuals per dataset z z 0, 1, 2 genotyped parents per sibship 2, 3, 4 genotyped affected siblings ¾ Clusters of 3 markers, centered 5 cM apart z Used Hapmap to generate haplotype frequencies • Clusters of 3 SNPs in 100kb windows • Windows are 5 Mb apart along chromosome 13 • All SNPs had minor allele frequency > 5% z Simulations assumed 1 cM / Mb Average LOD Scores (Null Hypothesis) Analysis Strategy Ignore LD Average LOD Model Independent LD SNPs No parents genotyped … 2 sibs per family … 3 sibs per family … 4 sibs per family 1.516 2.695 2.412 -0.010 -0.002 -0.002 -0.005 -0.002 -0.005 One parent genotyped … 2 sibs per family … 3 sibs per family … 4 sibs per family 0.571 0.660 0.522 0.008 -0.021 0.002 0.005 -0.028 0.007 Two parents genotyped … 2 sibs per family … 3 sibs per family … 4 sibs per family 0.030 0.014 0.014 -0.003 -0.016 -0.007 -0.005 -0.021 -0.006 5% Significance Thresholds (based on peak LODs under null) Analysis Strategy Significance Threshold Ignore Model Independent LD LD SNPs No parents genotyped … 2 sibs per family … 3 sibs per family … 4 sibs per family 11.90 18.06 15.52 1.18 1.38 1.24 1.18 1.29 1.22 One parent genotyped … 2 sibs per family … 3 sibs per family … 4 sibs per family 5.52 5.84 4.79 1.54 1.25 1.42 1.36 1.20 1.36 Two parents genotyped … 2 sibs per family … 3 sibs per family … 4 sibs per family 1.64 1.51 1.41 1.47 1.41 1.34 1.36 1.31 1.26 Empirical Power Analysis Strategy Ignore LD Power Model Independent LD SNPs No parents genotyped … 2 sibs per family … 3 sibs per family … 4 sibs per family 0.097 0.133 0.196 0.304 0.540 0.874 0.247 0.405 0.680 One parent genotyped … 2 sibs per family … 3 sibs per family … 4 sibs per family 0.088 0.213 0.397 0.158 0.562 0.782 0.161 0.430 0.608 Two parents genotyped … 2 sibs per family … 3 sibs per family … 4 sibs per family 0.141 0.436 0.779 0.163 0.435 0.779 0.145 0.401 0.701 Disease Model, p = 0.10, f11 = 0.01, f12 = 0.02, f22 = 0.04 Conclusions from Simulations ¾ Modeling linkage disequilibrium crucial z Especially when parental genotypes missing ¾ Ignoring linkage disequilibrium z z z Inflates LOD scores Both small and large sibships are affected Loses ability to discriminate true linkage Example ¾ Psoriasis data of Stuart et al (2005) ¾ 274 families z z 2 or more affected individuals Up to 21 genotyped individuals per family ¾ Initial microsatellite scan, additional fine- mapping markers added to chromosome 17 candidate region 0 1 2 3 Kong and Cox LOD Score 4 D17S928 rs869190 rs2019154 rs1564864 AAT09 D17S784 D17S1830 D17S1806 D17S836 D17S1847 D17S674 D17S802 TTCA006M D17S939 D17S937 D17S785 rs895691 rs734232 rs745318 D17S1301 D17S2059 D17S1786 D17S2193 D17S1290 AAT245 D17S787 D17S800 D17S1299 D17S1294 D17S783 D17S2196 D17S799 ATA78D02N D17S974 GATA158H0 D17S796 D17S1308 D17S849 Position (cM) 150 100 50 0
© Copyright 2026 Paperzz