Resources at HapMap.Org

HapMap Phase II Dataset
Release #21a, January 2007 (NCBI build 35)
3.8 M genotyped SNPs => 1 SNP/700 bp
[Figure: number of polymorphic SNPs per kb in the consensus dataset]
International HapMap Consortium
(2007). Nature 449:851-861
Goals of this segment
• Briefly summarize HapMap design and
current status
• Discuss the application of HapMap
HapMap Project
A freely available public resource
to increase the power and efficiency
of genetic association studies of medical traits
High-density SNP genotyping across the
genome provides information about
– SNP validation, frequency, assay conditions
– correlation structure of alleles in the genome
All data is freely available on the web for application
in study design and analyses as researchers see fit
HapMap Samples
• 90 Yoruba individuals (30 parent-parent-offspring
trios) from Ibadan, Nigeria (YRI)
• 90 individuals (30 trios) of European descent from
Utah (CEU)
• 45 Han Chinese individuals from Beijing (CHB)
• 45 Japanese individuals from Tokyo (JPT)
Will HapMap apply to other
population samples?
[Figure: three panels, each labeled CEU, comparing Utah residents with European ancestry (CEPH), Whites from Los Angeles, CA, and Botnia, Finland]
Population differences add very little inefficiency
From Paul de Bakker
HapMap progress
PHASE I – completed, described in Nature paper
* 1,000,000 SNPs successfully typed in all 270
HapMap samples
* ENCODE variation reference resource available
PHASE II – complete, data released in 2007, described
in Nature paper
* >3,500,000 SNPs typed in total !!!
PHASE III – complete, data released April 2009
ENCODE-HAPMAP variation project
• Ten “typical” 500kb regions
• 48 samples sequenced
• All discovered SNPs (and any others in dbSNP) typed in
all 270 HapMap samples
• Current data set – 1 SNP every 279 bp
A much more complete variation resource by which
the genome-wide map can be evaluated
Completeness of dbSNP
Vast majority of common SNPs are contained in or highly
correlated with a SNP in dbSNP
Recombination hotspots are widespread
and account for LD structure
7q21
Utility of LD in association study
• “If I’m a causal variant, what is relevant
to my detection in association studies is
how well correlated I am with one of the
SNPs or haplotypes examined in the
study.”
Coverage of Phase II HapMap
(estimated from ENCODE data)
Panel      % r2 > 0.8   max r2
YRI        81%          0.90
CEU        94%          0.97
CHB+JPT    94%          0.97
• % r2 > 0.8: Percentage of deeply ascertained common variants
highly correlated with a HapMap SNP
• max r2: Average maximum correlation between a deeply
ascertained variant and a neighboring HapMap SNP
From Table 6 –
“A Haplotype Map of the Human Genome”, Nature
Vast majority of common variation (MAF > .05)
captured by Phase II HapMap
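The two coverage columns can be computed from one number per deeply ascertained variant: its best r2 with any nearby HapMap SNP. The following is a minimal sketch under that assumption; the function name `coverage` and the toy values are illustrative only and are not ENCODE data.

```python
def coverage(max_r2_values, threshold=0.8):
    """Summarize coverage from per-variant maximum r2 values.

    max_r2_values: for each deeply ascertained variant, the highest r2
    observed with any neighboring SNP in the map being evaluated.
    Returns (fraction of variants with max r2 > threshold, mean max r2).
    """
    if not max_r2_values:
        raise ValueError("need at least one variant")
    n = len(max_r2_values)
    captured = sum(1 for r2 in max_r2_values if r2 > threshold)
    return captured / n, sum(max_r2_values) / n


# Made-up values for illustration only
toy_max_r2 = [1.0, 0.95, 0.85, 0.6, 1.0]
fraction_captured, mean_max_r2 = coverage(toy_max_r2)
print(f"{fraction_captured:.0%} captured, mean max r2 = {mean_max_r2:.2f}")
```

The first returned value corresponds to the "% r2 > 0.8" column and the second to the "max r2" column.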
HapMap Project
                       Phase 1                           Phase 2                  Phase 3
Samples & POP panels   269 samples (4 panels)            270 samples (4 panels)   1,115 samples (11 panels)
Genotyping centers     HapMap International Consortium   Perlegen                 Broad & Sanger
Unique QC+ SNPs        1.1 M                             3.8 M (phase I+II)       1.6 M (Affy 6.0 & Illumina 1M)
Reference              Nature (2005) 437:p1299           Nature (2007) 449:p851   Draft Rel. 1 (May 2008)
Phase 3 Samples
label   population sample                                                                     # samples   QC+ Draft 1
ASW*    African ancestry in Southwest USA                                                     90          71
CEU*    Utah residents with Northern and Western European ancestry from the CEPH collection   180         162
CHB     Han Chinese in Beijing, China                                                         90          82
CHD     Chinese in Metropolitan Denver, Colorado                                              100         70
GIH     Gujarati Indians in Houston, Texas                                                    100         83
JPT     Japanese in Tokyo, Japan                                                              91          82
LWK     Luhya in Webuye, Kenya                                                                100         83
MEX*    Mexican ancestry in Los Angeles, California                                           90          71
MKK*    Maasai in Kinyawa, Kenya                                                              180         171
TSI     Toscans in Italy                                                                      100         77
YRI*    Yoruba in Ibadan, Nigeria                                                             180         163
Total                                                                                         1,301       1,115
* Population is made of family trios
Phase 3
• 11 panels & 1,115 samples
– 558/557 males/females
– 924/191 founders/non-founders
• Platforms:
– Illumina Human 1M (Sanger)
– Affymetrix SNP 6.0 (Broad)
• EXCLUDED from QC+ data set:
– Samples with low completeness, and SNPs with low call rate in
each pop (< 80%) and not in HWE (p < 0.001)
– Overall false positive rate: ~3.2%
• Data merged with PLINK (concordance over
249,889 overlapping SNPs = 0.9931)
• Alleles on the (+/fwd) strand of NCBI b36
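The SNP exclusion rule above (call rate below 80% in a population, or HWE p < 0.001) can be sketched as follows. This is an illustration only: the slides do not say which HWE test was used, so a 1-df chi-square test is swapped in here, and the genotype coding ("AA", "AB", "BB", None for a missing call) and the function names are assumptions.

```python
import math
from collections import Counter


def hwe_chi2_pvalue(n_aa, n_ab, n_bb):
    """P-value of a 1-df chi-square test of Hardy-Weinberg proportions."""
    n = n_aa + n_ab + n_bb
    p = (2 * n_aa + n_ab) / (2 * n)   # frequency of allele A
    q = 1.0 - p
    if p == 0.0 or q == 0.0:
        return 1.0                    # monomorphic: nothing to test
    observed = (n_aa, n_ab, n_bb)
    expected = (n * p * p, 2 * n * p * q, n * q * q)
    chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    # Survival function of a chi-square variable with 1 degree of freedom
    return math.erfc(math.sqrt(chi2 / 2.0))


def passes_qc(genotypes, min_call_rate=0.80, min_hwe_p=0.001):
    """Apply the call-rate and HWE thresholds to one SNP in one population."""
    called = [g for g in genotypes if g is not None]
    if not genotypes or len(called) / len(genotypes) < min_call_rate:
        return False
    counts = Counter(called)
    return hwe_chi2_pvalue(counts["AA"], counts["AB"], counts["BB"]) >= min_hwe_p


in_hwe = passes_qc(["AA"] * 25 + ["AB"] * 50 + ["BB"] * 25)
no_hets = passes_qc(["AA"] * 50 + ["BB"] * 50)
low_call_rate = passes_qc(["AB"] * 7 + [None] * 3)
print(in_hwe, no_hets, low_call_rate)
```

An exact test is usually preferred over the chi-square approximation when genotype counts are small.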
Goals of This Tutorial
This tutorial will show you how to:
• Find HapMap SNPs near a gene or region of
interest (ROI)
– View patterns of LD in the ROI
– Select tag SNPs in the ROI
– Download information on the SNPs in ROI for use in
Haploview
– Add custom tracks of association data
– Create publication-quality images
• Generate customized extracts of the entire data
set
• Download the entire data set in bulk
Finding HapMap SNPs in a
Region of Interest
• Find the TCF7L2 gene
• Identify the characterized SNPs in the region
• View the patterns of LD (NCBI b35)
• Pick tag SNPs (NCBI b35)
• Download the region in Haploview format
• Upload your own annotations & superimpose on the
HapMap
• Make a customized image for publication
• View GWA hits & OMIM annotations in the region
(NCBI b36)
HapMap Glossary
• LD (linkage disequilibrium): For a pair of SNPs,
a measure of how far their alleles deviate from random
(independent) association; recombination erodes LD
over generations. Measured by D’, r2, LOD
• Phased haplotypes: Estimated assignment of SNP
alleles to each chromosome. Alleles transmitted from Mom
form the maternal haplotype, while Dad’s form the paternal
haplotype.
• Tag SNPs: Minimum SNP set needed to identify a
haplotype. r2 = 1 indicates two SNPs are redundant, so
either one “tags” the other.
• Questions?
[email protected]
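To make the LD measures in the glossary concrete, here is a minimal sketch that computes D, |D’| and r2 for two biallelic SNPs from phased haplotypes. The 0/1 allele coding and the function name `ld_stats` are assumptions for illustration; LOD is not computed.

```python
def ld_stats(haplotypes):
    """Compute (D, |D'|, r2) for two biallelic SNPs.

    haplotypes: list of (allele_at_snp1, allele_at_snp2) pairs, one per
    phased chromosome, with alleles coded 0 or 1.
    """
    n = len(haplotypes)
    p_a = sum(a for a, _ in haplotypes) / n          # freq. of allele 1 at SNP 1
    p_b = sum(b for _, b in haplotypes) / n          # freq. of allele 1 at SNP 2
    p_ab = sum(1 for a, b in haplotypes if a == 1 and b == 1) / n
    d = p_ab - p_a * p_b                             # deviation from independence
    denom = p_a * (1 - p_a) * p_b * (1 - p_b)
    if denom == 0:
        raise ValueError("both SNPs must be polymorphic")
    r2 = d * d / denom
    # D' rescales D by its maximum possible magnitude given the allele frequencies
    if d >= 0:
        d_max = min(p_a * (1 - p_b), (1 - p_a) * p_b)
    else:
        d_max = min(p_a * p_b, (1 - p_a) * (1 - p_b))
    return d, abs(d) / d_max, r2


# Two redundant SNPs: the alleles always travel together
d1, dprime1, r2_1 = ld_stats([(1, 1), (1, 1), (0, 0), (0, 0)])
# Different allele frequencies: only three of the four haplotypes occur
d2, dprime2, r2_2 = ld_stats([(1, 1), (0, 0), (0, 0), (0, 1)])
print(dprime1, r2_1, dprime2, round(r2_2, 3))
```

The second pair shows why r2 rather than D’ is used for tagging: D’ can equal 1 while the SNPs are far from interchangeable.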
1: Surf to the HapMap Browser
1a. Go to
www.hapmap.org
1b. Select
“HapMap Genome
Browser B35”
ncbi B35: full dataset
(includes LD patterns)
ncbi B36: latest, new
tracks (e.g., GWA hits)
2: Search for TCF7L2
2. Type search
term – “TCF7L2”
Search for a gene
name, a chromosome
band, or a phrase like
“insulin receptor”
3: Examine Region
Chromosome-wide
summary data is shown
in overview
Default tracks show
HapMap genotyped SNPs,
refGenes with exon/intron
splicing patterns, etc.
3: This exonic region has
many typed SNPs. Click
on ruler to re-center
image.
Region view puts
your ROI in
genomic context
3: Examine Region (cont)
Use the
Scroll/Zoom
buttons and menu
to change position &
magnification
As you zoom in
further, the display
changes to include
more detail
3: Examine Region (cont) Phase III
Use the
Scroll/Zoom
buttons and menu
to change position &
magnification
3: Mouse over a SNP to
see allele frequency table
Click to go to SNP
details page
As you zoom in
further, the display
changes to include
more detail
4: Turn on LD & Haplotype Tracks
4a: Scroll down to the
“Tracks” section. Turn on
the LD Plot and Haplotype
Display tracks.
4b: Press
“Update Image”
These sections allow
you to adjust the
display and to
superimpose your own
data on the HapMap
5: View variation patterns
Triangle plot shows LD
values using r2 or D’/LOD
scores in one or more
HapMap populations
Phased haplotype
track shows all 120
chromosomes with
alleles colored yellow
and blue
7: Adjust Track Settings (on the spot)
7a. Click on question
mark preceding
track name
7b. Adjust population
and display settings &
press “Configure”
7: Adjust Track Settings (cont)
Select the analysis
track to adjust and
press “Configure”
8: Turn on Tag SNP Track
8: Activate the “tag
SNP Picker” and press
“Update Image”
9: Adjust tag SNP picker
Tag SNPs are selected
on the fly as you navigate
around the genome
9a: Click on question mark
behind “tag SNP Picker”
Alternatively, you may
select “Annotate tag
SNP Picker” and press
“Configure…”
9: Adjust tag SNP picker (cont)
Select population
Select tagging
algorithm and
parameters
9b: Press “Configure” to
save changes
[optional] upload list
of SNPs to be
included, excluded, or
design scores
10: Generate Reports
10: Select the desired
“Download” option and
press “Go” or “Configure”
Available Downloads:
• Individual Genotypes
• Population Allele & Genotype
frequencies
• Pairwise LD values
• Tag SNPs
10: Generate Reports (cont)
The Genotype download
format can be saved to
disk or loaded directly
into Haploview
10: Generate Reports (cont)
The tag SNP download
is the same as you get
from TAGGER
…
11: Create your own tracks
Example:
• Interested in T2DM genetics
• Create file with custom annotations from
http://www.broad.mit.edu/diabetes and
superimpose on the HapMap
11: Upload example file:
TCF7L2_annotations.txt
Detailed help on the
format is under the
“Help” link
11: Create your own tracks (cont)
Formatted data for
the T2DM association
results (score is
-LOG10 of p-value)
Some SNPs were typed
(known platform) and
others were imputed.
Format data for both
typed & imputed SNPs.
Save as a text file!
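The score conversion described above can be sketched as follows. Only the -log10 transform comes from the slide; the tab-delimited column layout, the SNP identifiers, the positions and the p-values are hypothetical placeholders, and the real upload format is the one documented under the browser's “Help” link.

```python
import math


def association_score(p_value):
    """Convert an association p-value into a -log10 score."""
    if not 0 < p_value <= 1:
        raise ValueError("p-value must be in (0, 1]")
    return -math.log10(p_value)


def format_rows(results):
    """Return one tab-delimited text line per SNP.

    results: iterable of (snp_id, position, p_value, source) tuples, where
    source is "typed" or "imputed". This column layout is hypothetical.
    """
    return [
        f"{snp_id}\t{position}\t{association_score(p):.2f}\t{source}"
        for snp_id, position, p, source in results
    ]


# Placeholder records, not real association results
example_results = [
    ("rsEXAMPLE1", 1000, 1e-8, "typed"),
    ("rsEXAMPLE2", 2500, 0.001, "imputed"),
]
lines = format_rows(example_results)
print("\n".join(lines))
```

Keeping typed and imputed SNPs in separate groups makes it possible to display them as separate tracks.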
11: Create your own tracks (cont)
Make edits on
your own browser
window by clicking
on “Edit File…”
12: Create Image for Publication
Click on the
+/- sign to
hide/show a
section
12a. Click on “Highres Image”
Mouse over a track
until a cross appears.
Click on track name to
drag track up or down.
12: Image for Publication (cont)
12b. Click on “View SVG Image in
new browser window”
12c. Save generated file
with “.svg” extension
Can view file in Firefox, but
use other programs (Adobe
Illustrator or Inkscape) to
convert to other formats
and/or edit
12: Image for Publication (cont)
Inkscape is free and
lets you edit and
convert to other
formats (many
journals prefer EPS)
13: View GWA hits
13a. Go to
www.hapmap.org
13b. Select
“HapMap Genome
Browser B36”
13: View GWA hits (cont)
13c. Type search
term - “FTO”
Default tracks for B36
include GWA hits, OMIM
predicted associations, and
Reactome pathways
14: Read PubMed abstracts for GWA
hits
14a: Mouse over a GWA hit
to learn more about the
association
14b: Click on the GWA hit
to see the study’s PubMed
abstract
Use HapMart to Generate Extracts of
the HapMap Dataset
Find all HapMap characterized SNPs
that:
1. Have a MAF > 0.20 in the Yoruban
population panel (YRI)
2. Cause a nonsynonymous amino acid
change
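The same two filters can be applied to data you have already downloaded. The following is a minimal sketch, not HapMart itself: the record fields (`rs_id`, `yri_maf`, `consequence`) and the example records are hypothetical, and the cutoff is inclusive to match the “MAF >= 0.2” option used in the HapMart filter step.

```python
from dataclasses import dataclass


@dataclass
class SnpRecord:
    rs_id: str          # hypothetical identifier
    yri_maf: float      # minor allele frequency in the YRI panel
    consequence: str    # e.g. "nonsynonymous", "synonymous", "intronic"


def select_snps(records, min_maf=0.20, consequence="nonsynonymous"):
    """Keep SNPs that pass both the frequency and the consequence filter."""
    return [
        r for r in records
        if r.yri_maf >= min_maf and r.consequence == consequence
    ]


# Placeholder records for illustration only
records = [
    SnpRecord("rsEXAMPLE1", 0.35, "nonsynonymous"),
    SnpRecord("rsEXAMPLE2", 0.05, "nonsynonymous"),
    SnpRecord("rsEXAMPLE3", 0.40, "synonymous"),
]
selected = select_snps(records)
print([r.rs_id for r in selected])
```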
1. Go to hapmart.hapmap.org
1. From www.hapmap.org
click on “HapMart”
2. Select data source and population
of interest
2b. Press “Next”
Use schema menu
to select dataset
2a. Choose Yoruba population
or “All Populations”
3. Select the desired filters
3c. Press “Next”
3a. Check “Allele
Frequency Filter” and
select MAF >= 0.2
3b. Select “SNPs found in
Exons – non synonymous
coding SNPs”
4. Select output fields
4c. Press “Export”
4a. Choose among several
pages of fields
4b. Select the fields to
include in the report.
The summary
shows active
filters and # SNPs
to be output
Options at the
bottom let you
select text or
Excel format
5. Download report
Bulk downloads:
Download the Complete Data
• Download the entire HapMap data set
to your own computer
1. Surf to www.hapmap.org
Or directly
click on “Data”
1. From www.hapmap.org,
click on “Bulk Data
Download”
2. Choose the Data Type
2. Select
“Genotypes”
Raw genotypes
& frequencies
Analytic results
Protocols &
assay design
Your own copy of
the HapMap
Browser
* Data also available via FTP
ftp://www.hapmap.org
HapMap
Samples
3. Choose the dataset of interest
3. Select latest build,
fwd_strand orientation,
and “non-redundant”
fwd_strand => same as NCBI
reference assembly
rs_strand => same as in dbSNP
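The difference between the two orientations is only which DNA strand the alleles are reported on. A minimal sketch of converting a genotype between strands is shown below; the "+"/"-" orientation flag and the function names are assumptions for illustration and do not describe the layout of the HapMap files.

```python
COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A", "N": "N"}


def flip_strand(genotype):
    """Complement each allele of a genotype string such as "AG"."""
    return "".join(COMPLEMENT[base] for base in genotype)


def to_forward(genotype, rs_orientation):
    """Report a dbSNP-oriented genotype on the reference forward strand.

    rs_orientation: "+" if the dbSNP rs entry is on the same strand as
    the reference assembly, "-" if it is on the opposite strand.
    """
    if rs_orientation == "+":
        return genotype
    if rs_orientation == "-":
        return flip_strand(genotype)
    raise ValueError("orientation must be '+' or '-'")


same_strand = to_forward("AG", "+")
opposite_strand = to_forward("AG", "-")
print(same_strand, opposite_strand)
```

A/T and C/G SNPs look identical after flipping, so strand errors at those sites cannot be detected from the alleles alone.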
Available Genotype Datasets:
• Non-redundant: QC+ filtered &
redundant data removed
• Filtered-redundant: QC+ filtered;
duplicated data not removed
• Unfiltered-redundant: Includes
assays that failed QC
Applying the HapMap
• Study design - tagging
• Study coverage evaluation
• Study analysis - improving association
testing
• Study interpretation
– Comparison of multiple studies
– Connection to genes/genomic features
– Integration with expression and other functional
data
• Other uses of HapMap data
– Admixture, LOH, selection
Tagging from HapMap
• Since HapMap describes the majority of
common variation in the genome,
choosing non-redundant sets of SNPs
from HapMap offers considerable
efficiency without power loss in
association studies
Pairwise tagging
SNP           1     2     3     4     5     6
Alleles       A/T   G/A   G/C   T/C   G/C   A/C
Haplotype 1   A     G     G     T     G     A
Haplotype 2   A     G     C     C     C     C
Haplotype 3   T     A     G     C     G     C
Haplotype 4   T     A     C     C     C     C
high r2: SNPs 1 and 2; SNPs 3 and 5; SNPs 4 and 6
After Carlson et al. (2004) AJHG 74:106
Tags: SNP 1, SNP 3, SNP 6 (3 in total)
Test for association: SNP 1, SNP 3, SNP 6
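Pairwise tagging on this six-SNP example can be sketched with a simple greedy procedure: repeatedly pick the SNP that captures the most still-untagged SNPs at r2 above a threshold. This is an illustration in the spirit of greedy binning, not necessarily the exact published algorithm; the four haplotype strings are read off the slide's example and are assumed to be equally frequent.

```python
# One string per haplotype; character i is the allele at SNP i+1
HAPLOTYPES = ["AGGTGA", "AGCCCC", "TAGCGC", "TACCCC"]


def r_squared(col_a, col_b):
    """r2 between two biallelic SNPs given their alleles on each haplotype."""
    n = len(col_a)
    ref_a, ref_b = col_a[0], col_b[0]            # arbitrary reference alleles
    p_a = sum(x == ref_a for x in col_a) / n
    p_b = sum(y == ref_b for y in col_b) / n
    p_ab = sum(x == ref_a and y == ref_b for x, y in zip(col_a, col_b)) / n
    d = p_ab - p_a * p_b
    denom = p_a * (1 - p_a) * p_b * (1 - p_b)
    return d * d / denom if denom > 0 else 0.0


def greedy_tags(haplotypes, threshold=0.8):
    """Return {tag SNP number: SNP numbers it captures}, numbered from 1."""
    n_snps = len(haplotypes[0])
    cols = [[h[i] for h in haplotypes] for i in range(n_snps)]
    captures = {
        i: {i} | {j for j in range(n_snps)
                  if r_squared(cols[i], cols[j]) >= threshold}
        for i in range(n_snps)
    }
    untagged = set(range(n_snps))
    tags = {}
    while untagged:
        # Ties are broken in favor of the lowest-numbered SNP
        best = max(sorted(untagged), key=lambda i: len(captures[i] & untagged))
        tags[best + 1] = sorted(j + 1 for j in captures[best] & untagged)
        untagged -= captures[best]
    return tags


tags = greedy_tags(HAPLOTYPES)
print(tags)
```

The sketch finds three bins, as on the slide. It happens to choose SNP 4 rather than SNP 6 for the last bin, which is equivalent because the two SNPs are perfectly correlated.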
Pairwise Tagging Efficiency
Table 7 Number of selected tag SNPs to capture all
observed common SNPs in the Phase I HapMap for the
three analysis panels using pairwise tagging at different
r2 thresholds
Pairwise    YRI       CEU       CHB+JPT
r2 ≥ 0.5    324,865   178,501   159,029
r2 ≥ 0.8    474,409   293,835   259,779
r2 = 1      604,886   447,579   434,476
Tag SNPs were picked to capture common SNPs in release 16c.1 for every
7,000 SNP bin using Haploview.
Tagging Phase I HapMap offers 2-5x gains in efficiency
Use of haplotypes can
improve genotyping efficiency
SNP           1     2     3     4     5     6
Alleles       A/T   G/A   G/C   T/C   G/C   A/C
Haplotype 1   A     G     G     T     G     A
Haplotype 2   A     G     C     C     C     C
Haplotype 3   T     A     G     C     G     C
Haplotype 4   T     A     C     C     C     C
tags in multi-marker test should
be conditional on significance
of LD in order to avoid
overfitting
Tags: SNP 1, SNP 3 (2 in total, instead of 3 in total with SNP 6)
Test for association:
SNP 1 captures 1+2
SNP 3 captures 3+5
“AG” haplotype captures 4+6
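The multi-marker claim can be checked directly on the example: the “AG” haplotype formed by the alleles at SNP 1 and SNP 3 occurs on exactly the chromosomes that carry the T allele at SNP 4 and the A allele at SNP 6. The sketch below assumes the four haplotypes from the slide are equally frequent; the helper names are illustrative.

```python
# One string per haplotype; character i is the allele at SNP i+1
HAPLOTYPES = ["AGGTGA", "AGCCCC", "TAGCGC", "TACCCC"]


def r_squared(x, y):
    """r2 between two 0/1 indicator vectors of equal length."""
    n = len(x)
    p_x = sum(x) / n
    p_y = sum(y) / n
    p_xy = sum(a * b for a, b in zip(x, y)) / n
    d = p_xy - p_x * p_y
    denom = p_x * (1 - p_x) * p_y * (1 - p_y)
    return d * d / denom if denom > 0 else 0.0


def allele_indicator(haplotypes, snp, allele):
    """1 where the haplotype carries `allele` at SNP number `snp` (from 1)."""
    return [int(h[snp - 1] == allele) for h in haplotypes]


def haplotype_indicator(haplotypes, snps, alleles):
    """1 where the haplotype carries all the given alleles at the given SNPs."""
    return [
        int(all(h[s - 1] == a for s, a in zip(snps, alleles)))
        for h in haplotypes
    ]


ag = haplotype_indicator(HAPLOTYPES, snps=(1, 3), alleles="AG")
r2_snp4 = r_squared(ag, allele_indicator(HAPLOTYPES, 4, "T"))
r2_snp6 = r_squared(ag, allele_indicator(HAPLOTYPES, 6, "A"))
print(ag, r2_snp4, r2_snp6)
```

Because the haplotype test stands in for SNPs 4 and 6, only SNP 1 and SNP 3 need to be genotyped.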
Efficiency and power
[Figure: relative power (%) versus average marker density (per kb) for tag SNPs and random SNPs]
~300,000 tag SNPs
needed to cover common
variation in whole genome
in CEU
P.I.W. de Bakker et al. (2005) Nat Genet Advance Online Publication 23 Oct 2005
How to pick tag SNPs?
• What is the genetic hypothesis? Which variants do
you want to test for a role in disease?
– functional annotation (coding SNPs)
– allele frequency (HapMap ascertainment)
– previously implicated associations
• Go to http://www.hapmap.org – DCC supported
interactive tagging
• Export HapMap data into tools such as Tagger,
Haploview (www.broad.mit.edu/mpg)
Will tag SNPs picked from
HapMap apply to other
population samples?
[Figure: three panels, each labeled CEU, comparing Utah residents with European ancestry (CEPH), Whites from Los Angeles, CA, and Botnia, Finland]
Population differences add very little inefficiency
Platform presentation: Paul de Bakker (#223: Sat 9.30)
Genome-wide association coverage
• If a genome-wide genotyping product is typed on the
HapMap samples, the HapMap SNPs not included
on the product provide an evaluation of the
product's coverage
– ENCODE (deep ascertainment)
– Phase II (dense, genome-wide)
Further Information
• HapMap Publications & Guidelines
http://hapmap.cshl.org/publications.html.en
• Past tutorials & user’s guide to HapMap.org
http://www.hapmap.org/tutorials.html.en
• Questions?
[email protected]