Genotyping Technology
02-223 How to Analyze Your Own Genome, Fall 2013

HapMap Project

                      Phase 1                          Phase 2                      Phase 3
Samples & POP panels  269 samples (4 populations)      270 samples (4 populations)  1,115 samples (11 populations)
Genotyping centers    HapMap International Consortium  Perlegen                     Broad & Sanger
Unique QC+ SNPs       1.1 M                            3.8 M (phase I+II)           1.6 M (Affy 6.0 & Illumina 1M)
Reference             Nature (2005) 437:p1299          Nature (2007) 449:p851       Draft Rel. 1 (May 2008)

Phase 3 Samples

label   population sample                            # samples   QC+ Draft 1
ASW*    African ancestry in Southwest USA            90          71
CEU*    Utah residents with Northern and Western     180         162
        European ancestry from the CEPH collection
CHB     Han Chinese in Beijing, China                90          82
CHD     Chinese in Metropolitan Denver, Colorado     100         70
GIH     Gujarati Indians in Houston, Texas           100         83
JPT     Japanese in Tokyo, Japan                     91          82
LWK     Luhya in Webuye, Kenya                       100         83
MEX*    Mexican ancestry in Los Angeles, California  90          71
MKK*    Maasai in Kinyawa, Kenya                     180         171
TSI     Tuscans in Italy                             100         77
YRI*    Yoruba in Ibadan, Nigeria                    180         163
Total                                                1,301       1,115

* Population is made of family trios

HapMap Browser
1a. Go to www.hapmap.org
1b. Select “HapMap phase 3”

Overview
•  Genotyping technology
   –  SNPs and copy number variations
•  Processing the data from genotyping assays
   –  Linkage disequilibrium
   –  Haplotype inference, phasing
   –  Tag SNP selection
•  Preliminary analysis of HapMap data

SNP Genotyping with SNP Array
•  SNP arrays make use of the biochemical principle that nucleotide bases bind to their complementary partners (A binds to T, C binds to G)
   –  An array of oligonucleotide sequences is laid across the surface of the chip.
   –  The sample’s DNA is amplified and hybridized to the array.
   –  The array is scanned to quantify the relative amount of sample bound to each feature.
•  For SNPs, there is a pair of probes: one for each of the alleles.
•  Widely used SNP array technology
   –  Affymetrix vs. Illumina SNP arrays

Affymetrix GeneChip Probe Array

SNP Array Technology: Affymetrix Array
•  The fragment of DNA harboring an A/C SNP to be interrogated by the probes
•  25-mer probes for both alleles (25-mer = 25 nucleotides)
•  The location of the SNP locus varies from probe to probe
•  The DNA binds to both probes regardless of the allele it carries, but it does so more efficiently when it is complementary to all 25 bases (bright yellow) rather than mismatching the SNP site (dimmer yellow).
•  This impeded binding manifests itself in a dimmer signal.

SNP Array Technology: Illumina Array
•  The fragment of DNA harboring an A/C SNP to be interrogated by the probes
•  Attached to each Illumina bead is a 50-mer sequence complementary to the sequence adjacent to the SNP site.
•  The single-base extension (T or G) that is complementary to the allele carried by the DNA (A or C, respectively) then binds and results in the appropriately colored signal (red or green, respectively).

Calling Genotypes
•  The raw signal intensities from the SNP array can be noisy
•  How to cope with the noise
   –  Pool the raw signal intensities from multiple individuals for each SNP and perform a cluster analysis
   –  Three clusters, one for each of the three possible genotypes (AA, AB, BB)
[Figure: each dot represents the raw signal intensity for a SNP for each individual]

CNV Genotyping with Array CGH
•  Genomic DNA from two cell populations is differentially labeled (red and green) and hybridized to a microarray
[Figure annotations: copy numbers are mostly the same across the chromosome between Test and Ref samples; one region shows a reduction by a factor of two in copy numbers; y-axis: log2(red/green) = log2(red) - log2(green)]

Overview
•  Genotyping technology
   –  SNPs and copy number variations
•  Processing the data from genotyping assays
   –  Linkage disequilibrium
   –  Haplotype inference, phasing
   –  Tag SNP selection
•  Preliminary analysis of HapMap data

Linkage Disequilibrium (LD)
•  LD reflects the relationship between alleles at different loci.
•  Often, r2 (the squared correlation coefficient) is used as a measure of LD.
[Figure labels: Locus A, Locus B]

Basic Concepts
[Figure: Parent 1 x Parent 2, with haplotypes A-B and a-b shown at Locus A and Locus B]
•  High LD -> No Recombination (r2 = 1): offspring carry only the haplotypes A-B or a-b, so SNP1 “tags” SNP2
•  Low LD -> Recombination: many possibilities (A-b, A-B, a-B, etc.)
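
The high-LD and low-LD cases above can be made concrete with the standard haplotype-frequency definition of LD, which the slides do not spell out: D = p(AB) - p(A)p(B) and r2 = D^2 / (p(A)(1 - p(A)) p(B)(1 - p(B))). The sketch below is only an illustration under that assumption. The function name `ld_r2` and the two toy haplotype lists are hypothetical and are not taken from HapMap data.

```python
from collections import Counter


def ld_r2(haplotypes):
    """Return (D, r2) for two biallelic loci.

    `haplotypes` is a list of two-character strings such as "AB" or "ab":
    the first character is the allele at Locus A, the second the allele
    at Locus B. Treating the uppercase letter as the counted allele is an
    assumption of this sketch.
    """
    n = len(haplotypes)
    counts = Counter(haplotypes)
    p_a = sum(c for h, c in counts.items() if h[0] == "A") / n
    p_b = sum(c for h, c in counts.items() if h[1] == "B") / n
    p_ab = counts["AB"] / n

    d = p_ab - p_a * p_b  # deviation from independence
    denom = p_a * (1 - p_a) * p_b * (1 - p_b)
    if denom == 0:
        raise ValueError("r2 is undefined when a locus has only one allele")
    return d, d * d / denom


# Toy populations (hypothetical): no recombinant haplotypes vs. all four
# haplotypes at equal frequency.
high_ld = ["AB", "AB", "ab", "ab"]
low_ld = ["AB", "Ab", "aB", "ab"]

print(ld_r2(high_ld))  # (0.25, 1.0)
print(ld_r2(low_ld))   # (0.0, 0.0)
```

With only the A-B and a-b haplotypes present, knowing the allele at Locus A fully determines the allele at Locus B, which is why r2 reaches 1 and one SNP can tag the other.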

How to Compute r2

Individuals   SNP1   SNP2   SNP3
1             1      1      0
2             1      1      1
3             1      1      0
4             1      1      1
5             0      0      0
6             0      0      1
7             0      0      0
8             0      0      1

SNP1 vs. SNP2: r2 = 1.0
SNP1 vs. SNP3: r2 = 0.0
SNP2 vs. SNP3: r2 = 0.0

r2 matrix
        SNP1   SNP2   SNP3
SNP1    1.0    1.0    0.0
SNP2    1.0    1.0    0.0
SNP3    0.0    0.0    1.0

Linkage Disequilibrium in SNP Data
•  r2 in SNP data from a population of individuals (Black: r2 = 1, white: r2 = 0)
[Figure: r2 heat maps along the genome for Population 1 and Population 2]

Summary
•  SNP/CNV genotyping technology and genotype-calling methods
•  Linkage disequilibrium between neighboring loci is due to non-random recombination sites across the genome
•  The level of linkage disequilibrium can be quantified by r2
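
As a rough check of the "How to Compute r2" example, the r2 matrix can be reproduced by squaring the Pearson correlation between each pair of SNP columns. The sketch below uses only the Python standard library. The helper name `r_squared` is hypothetical, and the three 0/1 columns are read off the eight individuals in that example.

```python
# 0/1 allele codes for the eight individuals in the r2 example.
snps = {
    "SNP1": [1, 1, 1, 1, 0, 0, 0, 0],
    "SNP2": [1, 1, 1, 1, 0, 0, 0, 0],
    "SNP3": [0, 1, 0, 1, 0, 1, 0, 1],
}


def r_squared(x, y):
    """Squared Pearson correlation between two equal-length 0/1 vectors."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("r2 is undefined for a SNP with no variation")
    return cov * cov / (var_x * var_y)


names = list(snps)
r2_matrix = [[r_squared(snps[a], snps[b]) for b in names] for a in names]

for name, row in zip(names, r2_matrix):
    print(name, row)
# SNP1 [1.0, 1.0, 0.0]
# SNP2 [1.0, 1.0, 0.0]
# SNP3 [0.0, 0.0, 1.0]
```

SNP1 and SNP2 carry identical columns, so either one predicts the other perfectly (r2 = 1), while SNP3 alternates independently of both (r2 = 0).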