SNP Resources and Applications
SeattleSNPs PGA
Debbie Nickerson
Department of Genome Sciences
http://pga.gs.washington.edu

Strategies for Genetic Analysis
Families: Linkage Studies
• Simple inheritance
• Single gene
• Rare variants
• Polymorphic markers: ~1,000 short tandem repeat markers, and now 3,000 SNPs
Populations: Association Studies
• Complex inheritance
• Multiple genes
• Common variants
• Polymorphic markers: >500,000 to 1,000,000 single nucleotide polymorphisms (SNPs)
[Figure: C/C and C/T genotypes at one SNP in cases and controls; allele frequencies shown are 40% T, 60% C and 15% T, 85% C]

Complex inheritance/disease
[Figure: Variant Gene, Many Other Genes, and Environment contributing to Disease]
Diseases: Diabetes, Obesity, Cancer, Heart Disease, Multiple Sclerosis, Asthma, Schizophrenia, Celiac Disease, Autism
Two hypotheses:
1- common disease/common variants
2- common disease/many rare variants

Genetic Strategy - New Insights
[Figure: effect size (WEAK to STRONG) plotted against allele frequency (LOW to HIGH), with regions labeled LINKAGE, ASSOCIATION, and Genome-wide Sequencing]
Ardlie, Kruglyak & Seielstad (2002) Nat. Rev. Genet. 3: 299-309
Zondervan & Cardon (2004) Nat. Rev. Genet. 5: 89-100

Finding SNPs: Strategies

Total sequence variation in humans
• Population size: 6 x 10^9 (diploid)
• Mutation rate: 2 x 10^-8 per bp per generation
• Expected “hits”: 240 for each bp
- Every variant compatible with life exists in the population, BUT most are vanishingly rare in the population!
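The "240 hits" figure can be checked with a few lines of arithmetic. The sketch below only multiplies the numbers given on the slide (population size, two genome copies per person, mutation rate); the function and variable names are illustrative and not part of the talk.

```python
# Back-of-the-envelope check of the "expected hits" figure on the slide.
POPULATION_SIZE = 6e9   # people (value from the slide)
PLOIDY = 2              # diploid: two genome copies per person
MUTATION_RATE = 2e-8    # new mutations per bp per generation (value from the slide)


def expected_hits_per_bp(population, ploidy, mutation_rate):
    """Expected number of new mutations hitting one given base pair,
    summed over every genome copy in the population, in one generation."""
    return population * ploidy * mutation_rate


hits = expected_hits_per_bp(POPULATION_SIZE, PLOIDY, MUTATION_RATE)

# A brand-new variant sits on one chromosome copy out of all copies,
# which is why most variants are vanishingly rare.
singleton_frequency = 1 / (POPULATION_SIZE * PLOIDY)

print(f"expected hits per bp per generation: {hits:.0f}")
print(f"frequency of a brand-new variant: {singleton_frequency:.1e}")
```

The second number is the point of the slide's caveat: each new variant starts on a single chromosome copy, so its population frequency is far below anything a modest resequencing panel would detect.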
Compare 2 haploid genomes: 1 SNP per 1331 bp*
*The International SNP Map Working Group, Nature 409:928-933 (2001)

SNP Discovery: HapMap and others
Generate more SNPs: Random Shotgun Sequencing
• Genomic DNA (multiple individuals)
• Sequence and align (reference sequence)
Sources of SNPs:
• Perlegen SNP data
• Sequence chromatograms from Celera project
• HapMap Random Shotgun
[Figure: shotgun reads TACGCCTATA and TCAAGGAGAT aligned to the Draft Human Genome reference GTTACGCCAATACAGGATCCAGGAGATTACC]
dbSNP 127 - 11.8 Million SNPs and 5.7 Million SNPs Validated

Finding SNPs: Sequence-based SNP Mining
[Figure: flow from mRNA (cDNA Library, EST Overlap; error sources: RT errors, Sequencing Quality) and Genomic DNA (BAC Library, BAC Overlap; RRS Library, Shotgun Overlap) into DNA SEQUENCING]
Sequence Overlap - SNP Discovery
GTTACGCCAATACAGGATCCAGGAGATTACC  (G)
GTTACGCCAATACAGCATCCAGGAGATTACC  (C)
Validated SNPs - two independent discoveries

SNP discovery is dependent on your sample population size
[Figure: Fraction of SNPs Discovered (0.0 to 1.0) versus Minor Allele Frequency (MAF, 0.0 to 0.5), with curves for 2 and 88 chromosomes; inset shows two chromosomes differing at one site:
GTTACGCCAATACAGGATCCAGGAGATTACC
GTTACGCCAATACAGCATCCAGGAGATTACC]

Candidate Gene Resource
SNP Discovery in SeattleSNPs
Complete analysis: cSNPs, Linkage Disequilibrium and Haplotype Data
[Figure: gene from 5’ to 3’ covered by PCR amplicons, with coding SNPs labeled Arg-Cys and Val-Val]
• Generate SNP data from complete genomic resequencing (i.e., 5’ regulatory, exon, intron, 3’ regulatory sequence)

Increasing Sample Size Improves SNP Discovery
[Figure: Fraction of SNPs Discovered (0.0 to 1.0) versus Minor Allele Frequency (MAF, 0.0 to 0.5), with curves for 2, 16, 24, 48, 88, and 96 chromosomes; labels: SeattleSNPs; HapMap, based on ~6 chromosomes]

SNPs in the Average Gene
Average Gene Size - 25 kb
• Compare 2 haploid - 1 in 1,000 bp
• ~150 SNPs (1 per 200 bp) - 15,000,000 SNPs
• ~50 SNPs > 0.05 MAF (1 per 600 bp) - 6,000,000 SNPs (33-40%)
• ~5 coding SNPs (half change the amino acid sequence)
Crawford et al Ann Rev Genomics Hum Genet 2005;6:287-312

SeattleSNPs panel
HapMap Integration (~4 million SNPs)
High Density Genic Coverage (SeattleSNPs)
Low Density Genome Coverage (HapMap)
= SeattleSNPs discovery (1/188 bp)
= HapMap SNPs (~1/1000 bp)

Sequence Variation and the HapMap

Summary: The Current State of SNP Resources
• Random SNP discovery generates many SNPs (HapMap)
• Random approaches to SNP discovery have reached limits of discovery and validation (~50% of the common SNPs)
• Resequencing approaches continue to catalog important variants (rare and common not captured by the HapMap)
• SeattleSNPs has generated SNP data across >300 key candidate genes

NHLBI - Candidate Genes and Medical Resequencing
http://rsng.nhlbi.nih.gov/scripts/index.cfm

Typing SNPs: Approaches

HapMap Project: Genotype validated SNPs in the dbSNP
Genotype SNPs in Four populations: Initially 1 Million -> Now 4 Million
• CEPH (CEU) (Europe - n = 90, trios)
• Yoruban (YRI) (Africa - n = 90, trios)
• Japanese (JPT) (Asian - n = 45)
• Chinese (HCB) (Asian - n = 45)
To produce a genome-wide map of common variation

Genotyping Adds Value to SNPs
HapMap Genotyping
• Confirms a SNP as “real” and “informative”
• Determines Minor Allele Frequency (MAF) - common or rare
• Determines MAF in different populations
• Detection of SNP correlations (Linkage Disequilibrium and Haplotypes)
Genotype correlations among SNPs decrease the number of SNPs that need to be genotyped

An Example of SNP Correlation in the Human IL1A Gene
IL1A in Europeans
• 18.5 kb
• 50 SNPs
• 46 common SNPs (> 10% MAF)
Carlson et al. (2004) Am J Hum Genet. 74: 106-120.
[Figure: genotype matrix; legend: Homozygote common, Heterozygote, Homozygote alternative allele, Missing Data]

46 Common SNPs reduce to 3 SNPs
Select one SNP per bin using LDSelect
• Threshold LD: r2
– Bin 1: 22 sites
– Bin 2: 18 sites
– Bin 3: 5 sites
• Genotype 1 SNP from each bin
• TagSNP, chosen for biological intuition or ease of assay design

Common Variants - LD (Association) Patterns
Not the same in all genes for all populations
[Figure: LD patterns for All SNPs and for SNPs > 10% MAF, in African-American and European-American samples]

How do I pick TagSNPs?
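The bin-based idea behind the IL1A example can be sketched in a few lines. This is a simplified greedy binning, not the LDSelect program itself: r2 is approximated as the squared correlation of 0/1/2 genotype codes, the 0.8 threshold is an illustrative assumption (the slide does not give a value), and the SNP names and genotypes are made-up toy data.

```python
from itertools import combinations


def r_squared(x, y):
    """Squared Pearson correlation between two genotype vectors
    coded as 0/1/2 copies of the minor allele (an approximation of LD r2)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0  # monomorphic site: treat as uncorrelated
    return cov * cov / (var_x * var_y)


def greedy_ld_bins(genotypes, threshold=0.8):
    """Greedily group SNPs into bins. Each bin is seeded by the SNP that
    exceeds the r2 threshold with the most still-unbinned SNPs; the seed
    serves as the bin's tagSNP. Returns a list of (tag, members) tuples."""
    names = list(genotypes)
    partners = {name: set() for name in names}
    for a, b in combinations(names, 2):
        if r_squared(genotypes[a], genotypes[b]) >= threshold:
            partners[a].add(b)
            partners[b].add(a)

    unbinned = set(names)
    bins = []
    while unbinned:
        # Ties are broken by name order so the result is deterministic.
        seed = max(sorted(unbinned), key=lambda s: len(partners[s] & unbinned))
        members = {seed} | (partners[seed] & unbinned)
        bins.append((seed, sorted(members)))
        unbinned -= members
    return bins


# Hypothetical toy genotypes for six individuals at five SNPs.
toy_genotypes = {
    "snp1": [0, 1, 2, 0, 1, 2],
    "snp2": [0, 1, 2, 0, 1, 2],  # identical to snp1
    "snp3": [2, 1, 0, 2, 1, 0],  # mirror image of snp1, so r2 = 1
    "snp4": [0, 0, 1, 2, 2, 1],
    "snp5": [0, 0, 1, 2, 2, 1],  # identical to snp4
}
toy_bins = greedy_ld_bins(toy_genotypes)
for tag, members in toy_bins:
    print(f"tag {tag} covers {members}")
```

In this sketch the seed is always used as the tag. In practice any SNP that is above threshold with every other member of its bin could be typed instead, which is where biological intuition or ease of assay design comes in.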
TagSNPs for any gene - Use GVS
http://gvs.gs.washington.edu/GVS/

TagSNPs in any Gene

TagSNPs for a gene for typing multiple populations

TagSNPs in a pathway of genes

Human Association Studies

C-Reactive Protein (CRP)
• Pentamer belonging to pentraxin family
• Acute-phase protein produced by the liver in response to cytokine production (IL-6, IL-1, tumor necrosis factor)
• Non-specific response to inflammation, infection, tissue damage

Well-designed candidate gene studies have provided significant insights, and these have been replicated in genome-wide association studies

CRP Analysis
• CRP is an independent risk factor for CVD
• CRP levels are heritable (~40% in FHS)
• Several reported SNPs alter CRP levels

tagSNP selection for CRP
6 “cosmopolitan” tagSNPs
1 rare synonymous SNP
• “Promoter” SNPs (790, 1440)
• Intron SNP (1919)
• Synonymous SNP (2667)
• 3’ UTR SNP (3006)
• Downstream SNPs (3872, 5237)

Association between CRP SNPs and Serum CRP Levels
• CARDIA - Carlson et al Am J Hum Genet 77: 64-77, 2005
• NHANES - Crawford et al Circulation 114: 2458-65, 2006
• CHS - Lange et al JAMA 296: 2703-11, 2006
• Framingham - Larson et al Circulation 113: 1415-23, 2006
• Other - Szalai et al J. Mol Med 83: 440-7, 2005

High CRP Associated with SNPs in USF1 Binding Site
• USF1 (Upstream Stimulating Factor)
– Polymorphism at 1421 alters another USF1 binding site
• SNP Alters Expression In Vitro
• Altered Gel Shift in Vitro
Haplotype sequences (positions 1420, 1430, 1440 marked on the slide):
H1-4  gcagctacCACGTGcacccagatggcCACTCGtt
H7-8  gcagctacCACGTGcacccagatggcCACTAGtt
H5    gcagctacCACGTGcacccagatggcCACTTGtt
H6    gcagctacCACATGcacccagatggcCACTTGtt

Genome-wide studies lead to regional and candidate gene studies

Genome-Wide Association Studies

Genome-Wide Platforms
• Affymetrix: Quasi-Random SNPs - 100,000 or 500,000 SNPs
• Illumina: TagSNPs - 100,000, 317,000, 550,000, 650,000Y SNPs
1 Million Products are here!

Genome-wide Tour de force
Nature 447: 661-678
Read all the supplemental materials too!
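To give a feel for what a single-SNP test on one of these platforms involves, here is a minimal sketch of an allelic chi-square test on a 2x2 table of allele counts, followed by a Bonferroni threshold for 500,000 SNPs. The allele counts are hypothetical, and the choice of test and of Bonferroni correction is an illustrative assumption rather than the method used in any study cited in the talk.

```python
import math


def allelic_chi_square(case_minor, case_major, control_minor, control_major):
    """Pearson chi-square test (1 degree of freedom) on a 2x2 table of
    allele counts. Returns (chi2, p_value)."""
    table = [[case_minor, case_major], [control_minor, control_major]]
    row_totals = [sum(row) for row in table]
    col_totals = [table[0][j] + table[1][j] for j in range(2)]
    total = sum(row_totals)

    chi2 = 0.0
    for i in range(2):
        for j in range(2):
            expected = row_totals[i] * col_totals[j] / total
            chi2 += (table[i][j] - expected) ** 2 / expected

    # Upper-tail probability of a chi-square with 1 degree of freedom.
    p_value = math.erfc(math.sqrt(chi2 / 2))
    return chi2, p_value


# Hypothetical counts: 1,000 cases and 1,000 controls (2,000 alleles each),
# minor allele frequency 23% in cases versus 20% in controls.
chi2, p_value = allelic_chi_square(460, 1540, 400, 1600)

N_SNPS = 500_000                      # platform size taken from the slide
bonferroni_threshold = 0.05 / N_SNPS  # 1e-7

print(f"chi2 = {chi2:.2f}, p = {p_value:.4f}")
print(f"genome-wide threshold = {bonferroni_threshold:.0e}")
print("passes genome-wide threshold:", p_value < bonferroni_threshold)
```

With these made-up counts the SNP looks nominally significant on its own but falls far short of the corrected threshold, which is one reason sample size matters so much in genome-wide scans.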
Applying HapMap - Will it work? YES!!
Hits: Macular Degeneration, Obesity, Cardiac Repolarization, Inflammatory Bowel Disease, Diabetes T1 and T2, Coronary Artery Disease, Rheumatoid Arthritis, Breast Cancer, Colon Cancer ...
- There are misses as well, and it is unclear why - Phenotype, Coverage, Environmental Contexts? Example of a miss - Hypertension
- There are lots more hits in these data sets - sample size, low proxy coverage with other SNPs ...
- Analysis of associations between phenotype(s) and even individual sites is daunting. This will just be the first stage, and it does not even consider multi-site interactions

How robust are the new genome-wide platforms?
How well do they capture common SNPs?

LD-based coverage of Sequence Variation
MAF > 0.05
Bhangale et al, unpublished

How can I get more information about a reference SNP (rs) identified from an association study?

Searching for Genomic Information with an RS number
http://gvs.gs.washington.edu/GVS/

Structural Variation

Structural Variation - Large Insertion-Deletion Events
Structural Variants Identified in the HapMap
• Conrad, et al. (Nature Genetics 38:75-81, 2006)
• Hinds, et al. (Nature Genetics 38:82-85, 2006)
• McCarroll, et al.
(Nature Genetics 38:86-92, 2006)
~1,500 indels
Lots more of them - this was only a start

New Variation to Consider - Structural Variation
Types of Structural Variants
• Insertions/Deletions
• Inversions
• Duplications
• Translocations
Size:
• Large-scale (>100 kb)
• Intermediate-scale (500 bp–100 kb)
• Fine-scale (1–500 bp)
More than 10% of the genome sequence
Nature 447: 161-165, 2007

A Human Genome Structural Variation Project
Goal: Complete characterization of normal pattern of structural variation in 62 human genomes
• Genomes have dense SNP maps (HapMap)
• 62 additional human genome projects underway
• Select most genetically diverse individuals
[Figure: tree of the selected HapMap samples, grouped as CEPH (CEU), Japanese & Chinese (JPT, HCB), and Yoruba (YRI)]
Nature 447:161-165, 2007

Sequence-Based Resolution of Structural Variation
• Human Genomic DNA
• Genomic Library (1 million clones)
• Sequence ends of genomic inserts & Map to human genome
[Figure: fosmid end-pair orientations against Build35 for Concordant, Inversions, Deletion, and Insertion cases]
Dataset:
• 1,122,408 fosmid pairs preprocessed (15.5X genome coverage)
• 639,204 fosmid pairs BEST pairs (8.8X genome coverage)
Kidd, Cooper, and Eichler - unpublished

Detection of Indels in Genotype Data
[Figure: genotype cluster plots labeled X-linked SNP and Unknown indel]
Carlson et al, Hum. Mol. Genet.
15: 1931-1937, 2006

Searching for Genomic Information with an RS number
http://gvs.gs.washington.edu/GVS/

DNA Sequencing: the ultimate genotyping platform?

Rare Variant Versus Common Variant
Both could play a role
• Rare Variant - Sequence Individuals
• Common Variants - Genotype a Smaller Set of Variants to Explore Correlations
[Figure label: Individuals]

Sequencing Known Candidate Genes for Functional Variation From Individuals at the Tails of the Trait Distribution
[Figure: distribution of High Density Lipoprotein (HDL), with tails labeled Low HDL and High HDL]
ABCA1 and HDL-C - Cohen et al, Science 305, 869-872, 2004
Many examples emerging

Common Disease Rare Variants
• Observed excess of rare, nonsynonymous variants in low HDL-C samples at ABCA1
• Demonstrated functional relevance in cell culture

Personalized Human Genome Sequencing

Solexa - an example
New Technologies
• 1 Gigabyte of Sequence
• Problem is to Target - Genes or Regions
• Short reads - 30-35 bp - quality?
• Variation discovery needs ~20-fold coverage
• Needs to be fairly uniform
• Provide 30-50 Mb of baseline

Human Genome Variation - Summary
• SeattleSNPs and HapMap - Common variation sources - SeattleSNPs offers insights into coverage
• New Genotyping Platforms - Very successful but more coverage will be coming
• Many genome associations are being identified - regions
• Other variants of interest emerging - structural variation
• Paradigm Shift in Sequencing Technology

Acknowledgements
UW
• Mark Rieder
• Alex Reiner
• Greg Cooper
• Peggy Robertson
• Tushar Bhangale
FHCRC
• Chris Carlson
Vanderbilt
• Dana Crawford
Stanford
• Shelley Force-Aldred
• Rick Myers
CARDIA
• David Siscovick
• Dale Williams
• Beth Lewis
• Kiang Liu
• Carlos Irribaren
• Myriam Fornage
• Cashell Jaquish
• Eric Boerwinkle
NHLBI - SeattleSNPs