
SNP Resources and Applications
SeattleSNPs PGA
Debbie Nickerson
Department of Genome Sciences
http://pga.gs.washington.edu
Strategies for Genetic Analysis
Populations
Association Studies
Families
Linkage Studies
[Figure: example genotypes (C/C and C/T) at one SNP, shown for family members and for a population sample. Groups labeled Cases and Controls, with allele frequencies 40% T, 60% C and 15% T, 85% C.]
Simple Inheritance: Single Gene, Rare Variants
~1,000 Short Tandem Repeat Markers and now 3,000 SNPs
Complex Inheritance: Multiple Genes, Common Variants
Polymorphic Markers > 500,000-1,000,000 Single Nucleotide Polymorphisms (SNPs)
Complex inheritance/disease
[Diagram: a variant gene, many other genes, and environment all contribute to disease]
Diabetes
Obesity
Cancer
Heart Disease
Multiple Sclerosis
Asthma
Schizophrenia
Celiac Disease
Autism
Two hypotheses:
1- common disease/common variants
2- common disease/many rare variants
Genetic Strategy - New Insights
[Figure: effect size (WEAK to STRONG) plotted against allele frequency (LOW to HIGH), with regions labeled LINKAGE, ASSOCIATION, and Genome-wide Sequencing]
Ardlie, Kruglyak & Seielstad (2002) Nat. Rev. Genet. 3: 299-309
Zondervan & Cardon (2004) Nat. Rev. Genet. 5: 89-100
Finding SNPs: Strategies
Total sequence variation in humans
Population size: 6 x 10^9 (diploid)
Mutation rate: 2 x 10^-8 per bp per generation
Expected “hits”: 240 for each bp
- Every variant compatible with life exists in the population
BUT most are vanishingly rare in the population!
Compare 2 haploid genomes: 1 SNP per 1331 bp*
*The International SNP Map Working Group, Nature 409:928 - 933 (2001)
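The "240 hits per bp" figure follows directly from the two numbers on the slide. A minimal sketch of the arithmetic, assuming each diploid individual carries two copies of every site (the function name is illustrative, not from the original):

```python
def expected_hits_per_site(population_size, mutation_rate_per_bp, ploidy=2):
    """Expected number of new mutations at a single base pair, per generation,
    summed over every chromosome copy in the population."""
    chromosome_copies = ploidy * population_size
    return chromosome_copies * mutation_rate_per_bp


# Values taken from the slide: 6 x 10^9 people, 2 x 10^-8 per bp per generation.
hits = expected_hits_per_site(6e9, 2e-8)
print(f"Expected new mutations per bp per generation: {hits:.0f}")
```

Most of these variants stay vanishingly rare, which is why comparing only two haploid genomes turns up roughly one SNP per kilobase rather than one at every site.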
SNP Discovery: HapMap and others
Generate more SNPs:
Random Shotgun Sequencing
Genomic DNA
(multiple individuals)
Sources of SNPs:
Perlegen SNP data
Sequence chromatograms from Celera project
HapMap Random Shotgun
Sequence and align
(reference sequence)
TACGCCTATA
TCAAGGAGAT
GTTACGCCAATACAGGATCCAGGAGATTACC
Draft Human Genome
dbSNP build 127: 11.8 million SNPs, 5.7 million validated
Finding SNPs: Sequence-based SNP Mining
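[Diagram: sources of sequence overlap for SNP mining. From mRNA: cDNA library, EST overlap (noted error sources: RT errors, sequencing quality). From genomic DNA: BAC library, BAC overlap; RRS library, shotgun overlap. All routes go through DNA sequencing.]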
Sequence Overlap - SNP Discovery
GTTACGCCAATACAGGATCCAGGAGATTACC (G allele)
GTTACGCCAATACAGCATCCAGGAGATTACC (C allele)
Validated SNPs - two independent discoveries
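Sequence-overlap SNP discovery comes down to comparing aligned reads position by position. A minimal sketch, assuming the two sequences are already aligned and of equal length (real pipelines also weigh base quality, which is not modeled here; the function name is illustrative):

```python
def find_candidate_snps(seq_a, seq_b):
    """Return (1-based position, base in seq_a, base in seq_b) for each mismatch
    between two pre-aligned sequences of equal length."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    return [
        (index + 1, base_a, base_b)
        for index, (base_a, base_b) in enumerate(zip(seq_a, seq_b))
        if base_a != base_b
    ]


# The two overlapping sequences shown on the slide.
read_1 = "GTTACGCCAATACAGGATCCAGGAGATTACC"
read_2 = "GTTACGCCAATACAGCATCCAGGAGATTACC"
print(find_candidate_snps(read_1, read_2))
```

A mismatch seen only once may be a sequencing error; the slide's "two independent discoveries" criterion is what promotes a candidate to a validated SNP.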
SNP discovery is dependent on your sample population size
[Plot: fraction of SNPs discovered (0.0 to 1.0) versus minor allele frequency (MAF, 0.0 to 0.5), with curves for samples of 2 and 88 chromosomes. Inset: 2 chromosomes, GTTACGCCAATACAGGATCCAGGAGATTACC and GTTACGCCAATACAGCATCCAGGAGATTACC.]
Candidate Gene Resource
SNP Discovery in SeattleSNPs
Complete analysis: cSNPs, Linkage Disequilibrium and Haplotype Data
[Diagram: gene from 5’ to 3’ tiled with PCR amplicons; example coding SNPs labeled Arg-Cys and Val-Val]
• Generate SNP data from complete genomic resequencing (i.e., 5’ regulatory, exon, intron, 3’ regulatory sequence)
Increasing Sample Size Improves SNP Discovery
[Plot: fraction of SNPs discovered (0.0 to 1.0) versus minor allele frequency (MAF, 0.0 to 0.5), with curves for samples of 2, 16, 24, 48, 88, and 96 chromosomes. Labels: SeattleSNPs; HapMap, based on ~6 chromosomes. Inset: 2 chromosomes, GTTACGCCAATACAGGATCCAGGAGATTACC and GTTACGCCAATACAGCATCCAGGAGATTACC.]
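The shape of these discovery curves can be reproduced with a simple sampling argument. This is a sketch under stated assumptions (chromosomes sampled independently from one population, no sequencing error), not necessarily the calculation used to draw the original plot: a SNP is discovered when both alleles appear at least once among the sampled chromosomes.

```python
def discovery_probability(maf, n_chromosomes):
    """Probability that both alleles of a SNP with the given minor allele
    frequency are seen at least once among n independently sampled chromosomes."""
    if not 0.0 <= maf <= 0.5:
        raise ValueError("minor allele frequency must be between 0 and 0.5")
    all_major = (1.0 - maf) ** n_chromosomes  # only the major allele sampled
    all_minor = maf ** n_chromosomes          # only the minor allele sampled
    return 1.0 - all_major - all_minor


# Sample sizes named on the slide: 2, ~6 (HapMap discovery), and 16 to 96 (SeattleSNPs).
for n in (2, 6, 16, 24, 48, 96):
    row = ", ".join(f"MAF {maf:.2f}: {discovery_probability(maf, n):.2f}"
                    for maf in (0.01, 0.05, 0.10, 0.50))
    print(f"{n:>3} chromosomes -> {row}")
```

Under this model, two chromosomes find half of the SNPs at MAF 0.5 and very few rare ones, while deeper samples recover most SNPs above a few percent MAF.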
SNPs in the Average Gene
Average gene size: ~25 kb. Comparing 2 haploid genomes: ~1 SNP per 1,000 bp
~150 SNPs per gene (1 per ~200 bp) - 15,000,000 SNPs genome-wide
~50 SNPs with MAF > 0.05 (1 per ~600 bp) - 6,000,000 SNPs genome-wide (33-40%)
~5 coding SNPs (half change the amino acid sequence)
Crawford et al Annu Rev Genomics Hum Genet 2005;6:287-312
SeattleSNPs panel
HapMap Integration (~4 million SNPs)
High Density Genic Coverage (SeattleSNPs)
Low Density Genome Coverage (HapMap)
Legend: SeattleSNPs discovery (1/188 bp); HapMap SNPs (~1/1000 bp)
Sequence Variation and the HapMap
Summary: The Current State of SNP Resources
• Random SNP discovery generates many SNPs (HapMap)
• Random approaches to SNP discovery have reached limits of discovery and validation (~50% of the common SNPs)
• Resequencing approaches continue to catalog important variants (rare and common not captured by the HapMap)
• SeattleSNPs has generated SNP data across >300 key candidate genes
NHLBI - Candidate Genes and Medical Resequencing
http://rsng.nhlbi.nih.gov/scripts/index.cfm
Typing SNPs:
Approaches
HapMap Project: Genotype validated SNPs in dbSNP
Genotype SNPs in four populations: initially 1 million -> now 4 million
• CEPH (CEU) (Europe - n = 90, trios)
• Yoruban (YRI) (Africa - n = 90, trios)
• Japanese (JPT) (Asian - n = 45)
• Chinese (HCB) (Asian - n = 45)
To produce a genome-wide map of common variation
Genotyping Adds Value to SNPs
HapMap Genotyping
• Confirms a SNP as “real” and “informative”
• Determines Minor Allele Frequency (MAF) - common or rare
• Determines MAF in different populations
• Detection of SNP correlations (Linkage Disequilibrium and Haplotypes)
Genotype correlation among SNPs decreases the number of SNPs that need to be genotyped
An Example of SNP Correlation in the Human IL1A Gene
IL1A in Europeans
• 18.5 kb
• 50 SNPs
• 46 common SNPs (> 10% MAF)
Carlson et al. (2004) Am J Hum Genet. 74: 106-120.
[Figure legend: homozygote common allele, heterozygote, homozygote alternative allele, missing data]
46 Common SNPs Reduce to 3 SNPs
Select one SNP per bin using LDSelect
• Threshold LD: r^2
– Bin 1: 22 sites
– Bin 2: 18 sites
– Bin 3: 5 sites
• Genotype 1 SNP from each bin
• TagSNP, chosen for biological intuition or ease of assay design
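The binning idea can be illustrated with a simplified greedy procedure. This is a sketch in the spirit of LD-based binning, not the LDSelect implementation itself: r^2 is approximated by the squared Pearson correlation of genotypes coded 0/1/2, the 0.8 threshold is an assumed value (the slide does not state one), and the genotype data are made up for illustration.

```python
def r_squared(x, y):
    """Squared Pearson correlation between two genotype vectors coded 0/1/2."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0  # a monomorphic site is uncorrelated with everything
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    return sxy * sxy / (sxx * syy)


def greedy_ld_bins(genotypes, threshold=0.8):
    """Repeatedly pick the SNP that exceeds the r^2 threshold with the most
    remaining SNPs, report it as the tag, and remove its whole bin."""
    remaining = set(genotypes)
    bins = []
    while remaining:
        best_tag, best_bin = None, set()
        for snp in sorted(remaining):  # sorted for a deterministic tie-break
            partners = {other for other in remaining
                        if other == snp
                        or r_squared(genotypes[snp], genotypes[other]) >= threshold}
            if len(partners) > len(best_bin):
                best_tag, best_bin = snp, partners
        bins.append((best_tag, sorted(best_bin)))
        remaining -= best_bin
    return bins


# Hypothetical genotypes for five SNPs in six individuals (illustrative only).
example = {
    "snpA": [0, 1, 2, 0, 1, 2],
    "snpB": [0, 1, 2, 0, 1, 2],
    "snpC": [2, 1, 0, 2, 1, 0],
    "snpD": [0, 0, 1, 1, 2, 2],
    "snpE": [0, 0, 1, 1, 2, 2],
}
for tag, members in greedy_ld_bins(example):
    print(f"tag {tag}: bin {members}")
```

Here the first listed SNP is reported as the tag; in practice any SNP in a bin that passes the threshold with all other members can serve, which is what leaves room to choose by biological intuition or ease of assay design.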
Common Variants - LD (Association) Patterns Not the same in all genes for all populations
All SNPs
SNPs > 10% MAF
African American
European American
How do I pick TagSNPs?
TagSNPs for any gene - Use GVS
http://gvs.gs.washington.edu/GVS/
TagSNPs in any Gene
TagSNPs for a gene for typing multiple populations
TagSNPs in a pathway of genes
Human Association Studies
C-Reactive Protein (CRP)
• Pentamer belonging to the pentraxin family
• Acute-phase protein produced by the liver in response to cytokine production (IL-6, IL-1, tumor necrosis factor)
• Non-specific response to inflammation, infection, tissue damage
Well designed candidate gene studies have provided significant insights and these
have been replicated in genome-wide association studies
CRP Analysis
• CRP is an independent risk factor for CVD
• CRP levels are heritable (~40% in FHS)
• Several reported SNPs alter CRP levels
tagSNP selection for CRP
6 “cosmopolitan” tagSNPs
1 rare synonymous SNP
“Promoter” SNPs (790, 1440)
Intron SNP (1919)
Synonymous SNP (2667)
3’ UTR SNP (3006)
Downstream SNPs (3872, 5237)
Association between CRP SNPs and
Serum CRP Levels
CARDIA - Carlson et al Am J Hum Genet 77: 64-77, 2005
NHANES- Crawford et al Circulation 114: 2458-65, 2006
CHS - Lange et al JAMA 296: 2703-11, 2006
Framingham - Larson et al Circulation 113: 1415-23, 2006
Other - Szalai et al J. Mol Med 83: 440-7, 2005
High CRP Associated with SNPs in USF1 Binding Site
• USF1 (Upstream Stimulating Factor)
– Polymorphism at 1421 alters another USF1 binding site
Positions shown: 1420 to 1440
H1-4 gcagctacCACGTGcacccagatggcCACTCGtt
H7-8 gcagctacCACGTGcacccagatggcCACTAGtt
H5 gcagctacCACGTGcacccagatggcCACTTGtt
H6 gcagctacCACATGcacccagatggcCACTTGtt
SNP Alters Expression In Vitro
Altered Gel Shift in Vitro
Genome-wide studies lead to regional and candidate genes studies
Genome-Wide Association Studies
Genome-Wide Platforms
Affymetrix: Random SNPs - 100,000 or 500,000 Quasi-Random SNPs
Illumina: TagSNPs - 100,000, 317,000, 550,000, 650,000Y SNPs
1 Million Products are here!
Genome-wide Tour de force: Nature 447: 661-678
Read all the supplemental materials too!
Applying HapMap - Will it work? YES!!
Hits:
Macular Degeneration, Obesity, Cardiac Repolarization,
Inflammatory Bowel Disease, Diabetes T1 and T2, Coronary
Artery Disease, Rheumatoid Arthritis, Breast Cancer,
Colon Cancer …..
- There are misses as well, and it is unclear why: phenotype, coverage, environmental contexts?
Example of a miss - Hypertension
- There are lots more hits in these data sets (missed so far because of sample size, low proxy coverage with other SNPs ...)
- Analysis of associations between phenotype(s) and even individual sites is daunting, and this will just be the first stage; it does not even consider multi-site interactions
How robust are the new genome-wide platforms?
How well do they capture common SNPs?
LD-based coverage of Sequence Variation
MAF > 0.05
Bhangale et al, unpublished
How can I get more information about a reference SNP (rs)
identified from an association study?
Searching for Genomic Information with an RS number
http://gvs.gs.washington.edu/GVS/
Structural Variation
Structural Variation - Large Insertion-Deletion Events
Structural Variants Identified in the HapMap
• Conrad, et al. (Nature Genetics 38:75-81, 2006)
• Hinds, et al. (Nature Genetics 38:82-85, 2006)
• McCarroll, et al. (Nature Genetics 38:86-92, 2006)
~ 1,500 indels
Lots more of them - this was only a start
New Variation to Consider - Structural Variation
Types of Structural Variants
Insertions/Deletions
Inversions
Duplications
Translocations
Size:
Large-scale (>100 kb)
Intermediate-scale (500 bp–100 kb)
Fine-scale (1–500 bp)
More than 10% of the genome sequence
Nature 447: 161-165, 2007
A Human Genome Structural Variation Project
Goal: Complete characterization of the normal pattern of structural variation in 62 human genomes
• Genomes have dense SNP maps (HapMap)
• 62 additional human genome projects underway
• Select most genetically diverse individuals
[Figure: tree of the selected HapMap samples, grouped as CEPH (CEU), Japanese & Chinese (JPT, HCB), and Yoruba (YRI); most individual sample labels are not recoverable from the extraction]
Nature 447:161-165, 2007
Sequence-Based Resolution of Structural Variation
Human Genomic DNA
Genomic Library (1 million clones)
Sequence ends of genomic inserts &
Map to human genome
[Diagram: fosmid end-sequence pairs (> <) mapped to Build35 and classified as Concordant, Deletion, Insertion, or Inversions]
Dataset: 1,122,408 fosmid pairs preprocessed (15.5X genome coverage)
639,204 fosmid BEST pairs (8.8X genome coverage)
Kidd, Cooper, and Eichler - unpublished
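The classification of mapped end pairs can be sketched as follows. This is a minimal illustration of the general paired-end logic, not the project's actual pipeline: the record layout, the strand convention, and the 32-48 kb concordant window are all assumptions chosen for the example.

```python
from dataclasses import dataclass


@dataclass
class EndPair:
    """Both end reads of one fosmid clone, as placed on the reference."""
    left_pos: int      # reference coordinate of the leftmost end read
    left_strand: str   # "+" or "-"
    right_pos: int     # reference coordinate of the rightmost end read
    right_strand: str  # "+" or "-"


def classify_pair(pair, min_span=32_000, max_span=48_000):
    """Classify one clone by comparing its span and orientation on the
    reference with what a normal fosmid insert should look like.
    min_span and max_span are hypothetical thresholds."""
    # Ends of an unrearranged clone point toward each other: "+" then "-".
    if (pair.left_strand, pair.right_strand) != ("+", "-"):
        return "inversion"
    span = pair.right_pos - pair.left_pos
    if span > max_span:
        # The reference holds sequence the clone lacks: deletion in the sample.
        return "deletion"
    if span < min_span:
        # The clone holds sequence the reference lacks: insertion in the sample.
        return "insertion"
    return "concordant"


pairs = [
    EndPair(100_000, "+", 140_000, "-"),
    EndPair(100_000, "+", 175_000, "-"),
    EndPair(100_000, "+", 118_000, "-"),
    EndPair(100_000, "+", 140_000, "+"),
]
for p in pairs:
    print(p, "->", classify_pair(p))
```

A single discordant clone is weak evidence; calls are normally made only where several independent clones agree on the same event.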
Detection of Indels in Genotype Data
X-linked SNP
Unknown indel
Carlson et al, Hum. Mol. Genet. 15: 1931-1937, 2006
DNA Sequencing - the ultimate genotyping platform?
Rare Variant Versus Common Variant
Both could play a role
Rare Variant - Sequence Individuals
Common Variants - Genotype a Smaller Set of Variants to Explore Correlations
Individuals
Sequencing Known Candidate Genes for Functional Variation
From Individuals at the Tails of the Trait Distribution
Low HDL
High HDL
High Density Lipoprotein (HDL)
ABCA1 and HDL-C
– Cohen et al, Science 305, 869-872, 2004
Many examples emerging
Common Disease
Rare Variants
• Observed excess of rare, nonsynonymous variants in low HDL-C
samples at ABCA1
• Demonstrated functional relevance in cell culture
Personalized Human Genome Sequencing
Solexa - an example
New Technologies
1 Gigabyte of Sequence
Problem is to Target - Genes or Regions
Short reads - 30-35bp - quality?
Variation discovery needs ~ 20-fold coverage
Needs to be fairly uniform
Provide 30-50 Mb of baseline
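The 30-50 Mb figure is consistent with dividing the run yield by the required depth. A minimal sketch, assuming the slide's "1 Gigabyte" means about 10^9 bases per run; the `usable_fraction` parameter is a hypothetical allowance for non-uniform coverage and low-quality short reads, not a number from the original:

```python
def targetable_bases(run_yield_bases, fold_coverage, usable_fraction=1.0):
    """Size of the region one run can cover at the requested average depth."""
    if fold_coverage <= 0:
        raise ValueError("fold_coverage must be positive")
    if not 0.0 < usable_fraction <= 1.0:
        raise ValueError("usable_fraction must be in (0, 1]")
    return run_yield_bases * usable_fraction / fold_coverage


run_yield = 1e9   # assumed: ~1 Gb of sequence per run
depth = 20        # from the slide: ~20-fold coverage for variation discovery

for fraction in (1.0, 0.6):
    megabases = targetable_bases(run_yield, depth, fraction) / 1e6
    print(f"usable fraction {fraction:.1f}: about {megabases:.0f} Mb of target")
```

Either way the result is far smaller than a whole genome, which is why the slide frames the problem as targeting genes or regions.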
Human Genome Variation - Summary
• SeattleSNPs and HapMap - Common variation sources - SeattleSNPs offers insights into coverage
• New Genotyping Platforms - Very successful but more coverage will be coming
• Many genome-wide associations are being identified (regions)
• Other variants of interest emerging - structural variation
• Paradigm Shift in Sequencing Technology
Acknowledgements
UW
Stanford
FHCRC
CARDIA
• David Siscovick
• Dale Williams
• Beth Lewis
• Kiang Liu
• Carlos Irribaren
• Myriam Fornage
• Cashell Jaquish
• Eric Boerwinkle
• Mark Rieder
• Alex Reiner
• Greg Cooper
• Peggy Robertson
• Tushar Bhangale
• Chris Carlson
Vanderbilt
• Dana Crawford
• Shelley Force-Aldred
• Rick Myers
NHLBI - SeattleSNPs