Slide 1

Introduction to SNP resources
and a tool for variation
functional analysis
Chuan-Kun Liu
National Genotyping Center
Academia Sinica
Taiwan
2010/07/26
Training p.2
Outline
• Basic Concepts
• NCBI dbSNP
• International HapMap Project
• VarioWatch
Training p.4
Basic Concepts
The Human Genome
• 3 × 10^9 base pairs
• The Human Genome Project was completed in 2003
(telomeres, centromeres, contig order?)
• Approximately 20,000-25,000 genes
(alternative splicing?)
• Sequence and structural variations are
seen among different genomes
Training p.5
Basic Concepts
• Allele:
Original definition: one of the
different forms of a gene that can
exist at a single locus
Extended definition: a site at
which DNA, either coding or noncoding, differs among genomes
Training p.6
Basic Concepts
• Homozygous : Having two identical
alleles at corresponding loci on
homologous chromosomes
• Heterozygous : Having two different
alleles at corresponding loci on
homologous chromosomes
Training p.7
Basic Concepts
• Single Nucleotide Polymorphism (SNP):
SNPs are DNA sequence variations that
occur when a single nucleotide (A, T, C, or
G) in the genome sequence is changed
Example (SNP1 at position 7, SNP2 at position 20):
TGTAGTTGTGCAGGCCTGTAGTCCCAG
TGTAGTAGTGCAGGCCTGTCGTCCCAG
TGTAGTAGTGCAGGCCTGTGGTCCCAG
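A minimal Python sketch of how the variable sites in the three example sequences can be located. It assumes the sequences are already aligned and of equal length; the function name `find_variable_sites` is made up for illustration.

```python
# The three aligned example sequences from the slide.
sequences = [
    "TGTAGTTGTGCAGGCCTGTAGTCCCAG",
    "TGTAGTAGTGCAGGCCTGTCGTCCCAG",
    "TGTAGTAGTGCAGGCCTGTGGTCCCAG",
]


def find_variable_sites(seqs):
    """Return (1-based position, sorted alleles) for every column that varies."""
    if len({len(s) for s in seqs}) != 1:
        raise ValueError("sequences must be aligned to the same length")
    sites = []
    for index, column in enumerate(zip(*seqs)):
        alleles = sorted(set(column))
        if len(alleles) > 1:
            sites.append((index + 1, alleles))  # report 1-based positions
    return sites


variable_sites = find_variable_sites(sequences)
for position, alleles in variable_sites:
    print(f"position {position}: [{'/'.join(alleles)}]")
```

The first site is biallelic and the second is triallelic, which matches the allele types listed under "Properties of SNPs".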
Training p.8
Properties of SNPs
• More than 23 million SNPs have been
found in the human genome (dbSNP 131)
• About 3 million SNP differences between any 2
individuals (3 billion × 1/1000)
• Minor allele frequency ≥ 1%
• Allele types: biallelic [A/C], triallelic [A/C/G],
N [A/T/C/G], deletion [-/C], insertion [G/-],
and others
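The arithmetic and the allele notation above can be sketched in Python. The helper `classify_alleles` and its labels are hypothetical and only mirror the bracket notation used on this slide; they are not dbSNP's own classification rules.

```python
GENOME_SIZE_BP = 3_000_000_000   # 3 x 10^9 base pairs
BASES_PER_DIFFERENCE = 1000      # about one difference per 1,000 bases

# 3 billion * 1/1000 = about 3 million differences between two individuals
expected_differences = GENOME_SIZE_BP // BASES_PER_DIFFERENCE


def classify_alleles(allele_string):
    """Label an allele string written in the slide's notation, e.g. '[A/C]'."""
    alleles = allele_string.strip("[]").split("/")
    if "-" in alleles:
        return "insertion/deletion"   # one allele is the absence of sequence
    if len(alleles) == 2:
        return "biallelic"
    if len(alleles) == 3:
        return "triallelic"
    if set(alleles) == {"A", "C", "G", "T"}:
        return "N"
    return "other"


print(expected_differences)
for example in ["[A/C]", "[A/C/G]", "[A/T/C/G]", "[-/C]", "[G/-]"]:
    print(example, classify_alleles(example))
```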
Training p.9
Outline
• Basic Concepts
• NCBI dbSNP
• International HapMap Project
• VarioWatch
Training p.10
NCBI dbSNP
http://www.ncbi.nlm.nih.gov/projects/SNP
Training p.11
NCBI dbSNP
Release timeline (2005-2010):
2005/08 | Human genome assembly 36
2005/10 | dbSNP125 (Build 35.1)
2006/05 | dbSNP126 (Build 36.1)
2007/03 | dbSNP127 (Build 36.2)
2007/10 | dbSNP128 (Build 36.2)
2008/04 | dbSNP129 (Build 36.3)
2009/02 | GRCh37
2009/05 | dbSNP130 (Build 36.3)
2010/04 | dbSNP131 (Build 37.1)
Training p.12
NCBI dbSNP 126
Updated 2006/11/15
Variation Class | Number of Variations | Percentage (%)
SNP | 9,636,322 | 80.82
DIP | 2,202,926 | 18.48
MIXED | 51,775 | 0.43
MNP | 16,329 | 0.14
NAMED | 9,815 | 0.08
STR | 5,096 | 0.04
NO VARIATION | 424 | 0.004
HETEROZYGOUS | 4 | 0.00003
Training p.13
NCBI dbSNP 131
Updated 2010/04/01
Variation Class | Number of Variations | Percentage (%)
SNP | 18,603,484 | 78.66
DIP | 4,833,619 | 20.44
MIXED | 111,366 | 0.47
MNP | N/A | N/A
NAMED | N/A | N/A
STR | 5,195 | 0.02
NO VARIATION | 3,332 | 0.01
HETEROZYGOUS | N/A | N/A
Training p.14
SNP Submission
Training p.15
SNP Submission Info
Notable Human SNP Submitters in dbSNP Build 131
Handle | Submitter Name | Institution
1000GENOMES | 1000 Genomes Project Data coordinated by EBI and NCBI | European Bioinformatics Institute / National Center for Biotechnology Information
COMPLETE_GENOMICS | Radoje Drmanac | Complete Genomics, Inc.
ILLUMINA | Cindy Taylor Lawley | Illumina Inc.
ENSEMBL | Ewan Birney | EMBL Outstation Hinxton
GMI | Jeongsun Seo | Seoul National University Medical Research Center
AFFY | Affymetrix Technical Support | Affymetrix
Training p.16
How to collect SNP data
1. dbSNP homepage
2. NCBI Entrez web query interface
3. NCBI Entrez Programming Utilities
4. Localized dbSNP
Training p.17
dbSNP homepage
http://www.ncbi.nlm.nih.gov/projects/SNP/
Training p.18
dbSNP homepage
Training p.19
NCBI Entrez web query interface
http://www.ncbi.nlm.nih.gov/snp/limits
Note: positions are 1-based, NOT 0-based
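A short Python sketch of why the 1-based convention matters when a reported position is used to index a sequence in code. The function names are illustrative, and the 0-based half-open interval is simply the convention Python slicing uses.

```python
def one_based_to_zero_based_interval(position):
    """Convert a 1-based position to a 0-based, half-open (start, end) interval."""
    if position < 1:
        raise ValueError("1-based positions start at 1")
    return position - 1, position


def base_at(sequence, position):
    """Return the base at a 1-based position of a Python string."""
    start, end = one_based_to_zero_based_interval(position)
    return sequence[start:end]


# First example sequence from the SNP definition slide.
example = "TGTAGTTGTGCAGGCCTGTAGTCCCAG"
print(one_based_to_zero_based_interval(7))
print(base_at(example, 7))
```

Using `example[7]` directly would return the base at 1-based position 8, one base off from the intended site.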
Training p.20
NCBI Entrez web query interface
Training p.21
NCBI Entrez web query interface
Training p.22
NCBI Entrez web query interface
http://www.ncbi.nlm.nih.gov/snp?Db=snp&Cmd=DetailsSearch&Term=%233+AND+%234+AND+"indchineseyh1"[Filter]
Training p.23
NCBI Entrez web query interface
Training p.24
SNP info
Cannot represent the version
of the SNP info
SNP: GeneView
Training p.25
SNP info
Not always the same as the reference
sequence
Training p.26
SNP: GeneView
Training p.27
NCBI Map Viewer
Maps & options -> Sequence Maps -> Variation
Training p.28
NCBI Entrez Programming Utilities
http://eutils.ncbi.nlm.nih.gov
Training p.29
Eutils for EntrezSNP
http://www.ncbi.nlm.nih.gov/projects/SNP/SNPeutils.htm
Training p.30
NCBI Entrez Programming Utilities
http://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi?db=snp&term=59980974
Training p.31
NCBI Entrez Programming Utilities
http://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=snp&id=59980974&report=DocSet
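The two E-utilities URLs shown on these slides can be assembled programmatically. This is a sketch only: it reproduces the base URL and parameters exactly as they appear here, sends no request, and the helper `build_eutils_url` is a made-up name. The live service may differ today (for example, it may require https or other parameter values).

```python
from urllib.parse import urlencode

# Base URL as written on the slides (an assumption about the current service).
EUTILS_BASE = "http://eutils.ncbi.nlm.nih.gov/entrez/eutils"


def build_eutils_url(tool, **params):
    """Build the URL for an E-utilities endpoint such as 'esearch' or 'efetch'."""
    return f"{EUTILS_BASE}/{tool}.fcgi?{urlencode(params)}"


# Search dbSNP for the identifier, then fetch the matching record.
esearch_url = build_eutils_url("esearch", db="snp", term="59980974")
efetch_url = build_eutils_url("efetch", db="snp", id="59980974", report="DocSet")

print(esearch_url)
print(efetch_url)
```

Building the query string with `urlencode` keeps special characters in search terms properly escaped, which matters once the term is more than a bare identifier.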
Training p.32
Create a local copy of dbSNP
ftp://ftp.ncbi.nih.gov/snp/database/
README.create_local_dbSNP.txt
erd_dbSNP.pdf
/organism_data : Data for each organism-specific database
/organism_schema : Schema DDL (Data Definition Language in SQL) for each organism-specific database
/shared_data : Data in dbSNP_main that is shared by all organism databases
/shared_schema : Schema DDL for dbSNP_main
Training p.33
Practice
• What is the physical position of rs2770 in
dbSNP build 131 with assembly GRCh37?
• Is rs60817388 a valid snp_id?
• How many human SNPs in dbSNP build
131 are triallelic?
• Which genome assembly is used when
SNP info is queried through E-utilities?
Training p.34
Outline
• Basic Concepts
• NCBI dbSNP
• International HapMap Project
• VarioWatch
Training p.35
DNA recombination
http://www.nature.com/nrm/journal/v6/n1/fig_tab/nrm1546_F4.html
Training p.36
Haplotype / Tag SNP
LD (linkage disequilibrium): for a pair of SNP alleles, a measure
of deviation from random association (complete LD implies no recombination between the two sites).
Measured by D', r², LOD
High LD -> little or no historical recombination between the SNPs
http://hapmap.ncbi.nlm.nih.gov/originhaplotype.html.en
http://hapmap.ncbi.nlm.nih.gov/whatishapmap.html
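A small worked example of two of the LD measures named above, D' and r², using the standard definitions D = p(AB) - p(A)p(B), D' = |D| / Dmax and r² = D² / (p(A)(1 - p(A))p(B)(1 - p(B))). The haplotype and allele frequencies in the example are invented for illustration, and LOD is not covered in this sketch.

```python
def ld_statistics(p_ab, p_a, p_b):
    """Return (D, D', r^2) for two biallelic SNPs.

    p_ab: frequency of the haplotype carrying allele A at SNP1 and allele B at SNP2
    p_a, p_b: frequencies of allele A (SNP1) and allele B (SNP2)
    """
    d = p_ab - p_a * p_b
    # D' scales |D| by the largest value it could take for these allele frequencies.
    if d >= 0:
        d_max = min(p_a * (1 - p_b), (1 - p_a) * p_b)
    else:
        d_max = min(p_a * p_b, (1 - p_a) * (1 - p_b))
    d_prime = abs(d) / d_max if d_max > 0 else 0.0
    variance_product = p_a * (1 - p_a) * p_b * (1 - p_b)
    r_squared = d * d / variance_product if variance_product > 0 else 0.0
    return d, d_prime, r_squared


# Made-up frequencies: allele B only ever occurs together with allele A.
d, d_prime, r_squared = ld_statistics(p_ab=0.25, p_a=0.5, p_b=0.25)
print(d, d_prime, r_squared)
```

In this example D' equals 1 while r² is about 0.33, which shows that D' can signal no observed recombination even when one SNP predicts the other only partially. That distinction matters when choosing tag SNPs.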
Training p.37
HapMap Project
Launched in October 2002
Item | Phase 1 | Phase 2 | Phase 3
Samples & POP panels | 269 samples (4 panels) | 270 samples (4 panels) | 1,397 samples (11 panels)
Genotyping centers | HapMap International Consortium | Perlegen | Broad & Sanger
Unique QC+ SNPs | 1.1 M | 3.8 M (phase I+II) | 1.6 M (Affy 6.0 & Illumina 1M)
Reference | Nature (2005) 437:p1299 | Nature (2007) 449:p851 | Draft Rel. 3 (May 2010)
Training p.38
Phase 3 Samples
Label | Population Sample | Samples | QC+ Draft 3
ASW (A) | African ancestry in Southwest USA | 98 | 87
CEU (C) | Utah residents with Northern and Western European ancestry from the CEPH collection | 180 | 165
CHB (H) | Han Chinese in Beijing, China | 162 | 137
CHD (D) | Chinese in Metropolitan Denver, Colorado | 129 | 109
GIH (G) | Gujarati Indians in Houston, Texas | 117 | 101
JPT (J) | Japanese in Tokyo, Japan | 131 | 113
LWK (L) | Luhya in Webuye, Kenya | 122 | 110
MEX (M) | Mexican ancestry in Los Angeles, California | 104 | 86
MKK (K) | Maasai in Kinyawa, Kenya | 205 | 184
TSI (T) | Tuscan in Italy | 114 | 102
YRI (Y) | Yoruban in Ibadan, Nigeria (West Africa) | 220 | 203
Total | | 1,582 | 1,397
http://ccr.coriell.org/Sections/Collections/NHGRI/hapmap.aspx?PgId=266&coll=HG
ftp://ftp.ncbi.nlm.nih.gov/hapmap/genotypes/2010-05_phaseIII/relationships_w_pops_041510.txt
Training p.39
Phase 3: Draft Release 1
Population | Samples | QC+ SNPs | Polymorphic QC+ SNPs
ASW | 71 | 1,632,186 | 1,536,247
CEU | 162 | 1,634,020 | 1,403,896
CHB | 82 | 1,637,672 | 1,311,113
CHD | 70 | 1,619,203 | 1,270,600
GIH | 83 | 1,631,060 | 1,391,578
JPT | 82 | 1,637,610 | 1,272,736
LWK | 83 | 1,631,688 | 1,507,520
MEX | 71 | 1,614,892 | 1,430,334
MKK | 171 | 1,621,427 | 1,525,239
TSI | 77 | 1,629,957 | 1,393,925
YRI | 163 | 1,634,666 | 1,484,416
Training p.40
Phase 3 Data
ftp://ftp.ncbi.nlm.nih.gov/hapmap/genotypes/
Training p.41
HapMap Genome Browser
http://www.hapmap.org
Training p.42
HapMap Genome Browser
Chr20
Chr9:660000..760000
SNP:rs6870660
NM_153254
BRCA2
5q31
ENm010
gwa*
PARK3
Training p.43
HapMap Genome Browser
Training p.44
HapMap Genome Browser
Click on ruler to re-center image
Training p.51
Impute genotypes using
HapMap Data
• Interested in the VAV1 gene
• Commercially available platforms have few
overlapping SNPs in this region
• HapMap genotyped many SNPs in the region
• Use genotypes for HapMap SNPs to
impute genotypes & compare non-overlapping SNP sets
Training p.52
Impute genotypes using
HapMap Data
Training p.53
Impute genotypes using
HapMap Data
Training p.54
Impute genotypes using
HapMap Data
example.dat
(20 user-provided SNPs; all should be part of the HapMap) :
example.ped
(genotypes for 336 unrelated inds) :
Training p.55
Impute genotypes using
HapMap Data
Training p.56
Impute genotypes using
HapMap Data
• Info (143 provided & imputed HapMap SNPs)
SNP | Al1 | Al2 | Freq1 | MAF | Quality | Rsq
rs10419572 | T | A | 0.9041 | 0.0959 | 0.8179 | 0.1069
rs415218 | T | A | 0.9709 | 0.0291 | 0.9427 | 0.0313
rs4807100 | A | G | 0.4713 | 0.4713 | 0.9790 | 0.9625
rs4807101 | T | C | 0.4714 | 0.4714 | 0.9803 | 0.9649
rs1651876 | T | C | 0.9631 | 0.0369 | 0.9277 | 0.0216
…
• Geno (143 SNPs x 336 inds)
PED00001->IND00001 ML_GENO T/T T/T G/G C/C T/T T/T A/T G/G A/A T/T T/C …
PED00002->IND00002 ML_GENO T/T T/T A/G T/C T/T T/T A/T G/G A/A T/T T/C …
PED00003->IND00003 ML_GENO T/T T/T A/A T/T T/T T/T A/T G/G A/A T/T T/T …
…
• Dose (allele dosage)
PED00001->IND00001 ML_DOSE 1.719 1.911 0.004 0.003 1.913 1.980 1.246 1.884 1.949 1.948 1.302 …
PED00002->IND00002 ML_DOSE 1.861 1.957 1.000 1.000 1.952 1.892 1.086 1.909 1.949 1.948 1.096 …
PED00003->IND00003 ML_DOSE 1.994 1.999 1.993 1.995 1.955 1.656 1.297 1.863 1.987 1.988 1.374…
…
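A minimal Python sketch of reading one dosage row like those above. It assumes whitespace-separated fields with the sample identifier first and the record type (`ML_DOSE`) second, and it treats each dosage as the expected number of copies of one allele (0 to 2). The function name is illustrative and does not belong to any particular imputation package.

```python
def parse_dose_line(line):
    """Split one ML_DOSE row into (sample id, list of allele dosages)."""
    fields = line.split()
    sample_id, record_type = fields[0], fields[1]
    if record_type != "ML_DOSE":
        raise ValueError(f"not a dosage row: {record_type}")
    return sample_id, [float(value) for value in fields[2:]]


# First five dosages of the first row shown above.
dose_line = "PED00001->IND00001 ML_DOSE 1.719 1.911 0.004 0.003 1.913"
sample_id, dosages = parse_dose_line(dose_line)

# Rounding a dosage gives the most likely whole-number allele count.
rounded_counts = [round(dosage) for dosage in dosages]

print(sample_id)
print(dosages)
print(rounded_counts)
```

Rounding discards the uncertainty that the dosage carries (1.719 is less certain than 1.911), so association tests are often run on the dosages themselves rather than on the rounded counts.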
Training p.57
Outline
• Basic Concepts
• NCBI dbSNP
• International HapMap Project
• VarioWatch
Training p.58
VarioWatch
Training p.59
VarioWatch
rs5934
ADSSL1
APOE
chr8:19856718-19858718
chr14:104317518-104317518
chr11:234234+C
chr15:234343
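The example queries above come in several shapes: an rs identifier, a gene symbol, a chromosome region, and a single position with or without an allele. The following Python sketch shows one way such inputs could be told apart before submission. The patterns and labels are assumptions made for illustration, in particular the reading of `chr11:234234+C` as a position plus an allele. This is not VarioWatch's actual input parser.

```python
import re

# Hypothetical patterns, checked in order; the first match wins.
QUERY_PATTERNS = [
    ("rs_id", re.compile(r"^rs\d+$")),
    ("region", re.compile(r"^chr(\w+):(\d+)-(\d+)$")),
    ("position_with_allele", re.compile(r"^chr(\w+):(\d+)\+([ACGT]+)$")),
    ("position", re.compile(r"^chr(\w+):(\d+)$")),
    ("gene_symbol", re.compile(r"^[A-Za-z][A-Za-z0-9-]*$")),
]


def classify_query(query):
    """Return a label describing the shape of a query string."""
    text = query.strip()
    for label, pattern in QUERY_PATTERNS:
        if pattern.match(text):
            return label
    return "unknown"


examples = [
    "rs5934",
    "ADSSL1",
    "APOE",
    "chr8:19856718-19858718",
    "chr14:104317518-104317518",
    "chr11:234234+C",
    "chr15:234343",
]
labels = {query: classify_query(query) for query in examples}
for query, label in labels.items():
    print(f"{query}: {label}")
```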
Training p.65
Practice
• Find tag SNPs in a region of the human
genome and download those SNPs from
HapMap
• Submit SNPs or a gene symbol to
VarioWatch and find SNPs with high risk
Training p.66
Acknowledgement
Training p.67
Thanks for your attention!