structure

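Graduate course, Shanghai Institutes for Biological Sciences, Chinese Academy of Sciences: Human Population Genetics
Human Population Genetics
Basic Principles and Analytical Methods
Xu Shuhua (徐书华)
Jin Li (金力)
CAS-MPG Partner Institute for Computational Biology
Lecture 8
Analysis of Population Genetic Structure (II)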
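Lecture 8
► Population differentiation and genetic diversity
► STRUCTURE analysis
 File format
 Parameter settings
 Interpretation of results
► Software demonstration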
 STRUCTURE 2.2.3
Analysis of population genetic structure
► Analysis of population genetic structure
 Gene-tree-based methods
 AMOVA (hierarchical F statistics)
 Factor analysis
 Principal component analysis
 STRUCTURE analysis
Geographical distribution of HGDP samples (52 populations)
Previous genome-wide data in HGDP panel
► Science 2002
 52 populations, 1,056 individuals
 377 autosomal STRs
► PLoS Genet 2005
 52 populations, 1,048 individuals
 783 STRs, 210 indels
► Nature Genetics 2006
 52 populations, 927 individuals
 3,024 SNPs in 36 genomic regions
NIH & University of Michigan
Stanford University
Genotype, haplotype and copy-number
variation in worldwide human populations
► Study design:
 Genome-wide patterns of variation;
 Fine-scale population structure.
► Data structure:
 29 HGDP populations, 485 individuals.
 4 HapMap populations, 112 individuals.
 525,910 SNPs, 396 CNVs (Illumina HumanHap550K).
► New findings:
 Increasing linkage disequilibrium is observed with increasing geographic distance from Africa (a serial founder effect).
 The global distribution of CNVs largely accords with population structure analyses for SNP data sets of similar size.
► Conclusions:
 Support the utility of CNVs in human population-genetic research.
Worldwide Human Relationships Inferred
from Genome-Wide Patterns of Variation
► Study design:
 Human genetic diversity;
 Fine-scale population structure.
► Data structure:
 51 populations; 938 individuals.
 650,000 SNPs (Illumina HumanHap650K).
► New findings:
 The relationship between haplotype heterozygosity and geography was consistent with the hypothesis of a serial founder effect with a single origin in sub-Saharan Africa.
 Observed a pattern of ancestral allele frequency distributions that reflects variation in population dynamics among geographic regions.
► Conclusions:
 This data set allows the most comprehensive characterization to date of human genetic variation. Individual ancestry and population substructure are detectable with very high resolution.
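The heterozygosity referred to in the findings above can be illustrated with the standard gene-diversity formula H = 1 - sum(p_i^2). The following is a minimal sketch only: the function name and the two frequency vectors are hypothetical and are not taken from the study.

```python
def expected_heterozygosity(freqs):
    """Gene diversity H = 1 - sum(p_i^2) for one locus or one haplotype window."""
    if abs(sum(freqs) - 1.0) > 1e-9:
        raise ValueError("frequencies must sum to 1")
    return 1.0 - sum(p * p for p in freqs)


# Hypothetical haplotype frequencies, invented only to illustrate the trend
# expected under a serial founder effect (diversity lost at each bottleneck).
near_origin = [0.25, 0.25, 0.25, 0.25]
after_bottlenecks = [0.7, 0.2, 0.1]

print(expected_heterozygosity(near_origin))
print(expected_heterozygosity(after_bottlenecks))
```

Under a serial founder effect, populations farther from the origin would be expected to show lower values of H, as in the second vector.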
NJ tree based on SNP genotypes
Population structure inferred by STRUCTURE
Maximum likelihood tree of 51 populations
[Figure labels: Oceania, America, East Asia, South/Central Asia, Europe, Middle East, North Africa; 150,000 SNPs]
MDS plots
MDS plots of individuals
[Figure panels: SNP, Haplotype, CNV]
[Figure labels: MDS; Chrom 21; 220 SNPs; Nei’s DA]
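Nei's DA, named in the MDS labels above, is a population distance computed from allele frequencies: DA = 1 - (1/L) * sum over loci of sum over alleles of sqrt(x * y). A minimal sketch follows; the function name and the input layout (one frequency list per locus, with alleles in the same order in both populations) are assumptions made for illustration.

```python
import math


def nei_da(pop_x, pop_y):
    """Nei's DA distance between two populations.

    pop_x, pop_y: one list per locus of allele frequencies, with alleles
    listed in the same order in both populations (hypothetical layout).
    """
    if len(pop_x) != len(pop_y) or not pop_x:
        raise ValueError("both populations need the same, non-empty set of loci")
    total = 0.0
    for fx, fy in zip(pop_x, pop_y):
        total += sum(math.sqrt(x * y) for x, y in zip(fx, fy))
    return 1.0 - total / len(pop_x)


# Two hypothetical populations typed at two loci.
pop_a = [[0.5, 0.5], [0.9, 0.1]]
pop_b = [[0.5, 0.5], [0.1, 0.9]]
print(nei_da(pop_a, pop_b))
```

A matrix of such pairwise distances is the kind of input an MDS plot of populations is typically built from.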
PCA plots
PCA of populations
PCA of individuals
STRs cannot, SNPs can
Europe
Middle East
Han and Northern Han
56 ethnic groups in China
Genetic structure of language
families
Two types of genetic structure
All other Han
Chinese
Sky blue: CN-GA
CN-PH
Olive green: TW-HA
TW-HB
Brown: SG-CH
Inference on population structure
using multi-locus genotype data
STRUCTURE V2.2.3
► Pritchard, Stephens, and Donnelly (2000)
► Falush, Stephens, and Pritchard (2003)
Main objective
► Assign individuals to populations on the basis of their genotypes, while simultaneously estimating population allele frequencies
Other objectives
► Begin with a set of predefined populations and classify individuals of unknown origin
► Identify the extent of admixture of individuals
► Infer the origin of particular loci in the sampled individuals
STRUCTURE is a model-based method of clustering
(we must make assumptions about many parameters and distributions)
Four basic models
1. Model without admixture
each individual is assumed to originate in one (and only one) of K populations
2. Model with admixture
each individual is assumed to have inherited some proportion of its ancestry from each of K populations
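The two models above differ in how a genotype's probability is built from population allele frequencies. The sketch below assumes the allele frequencies are already known and strictly positive, and it uses a uniform prior over populations. STRUCTURE itself estimates frequencies and assignments jointly by MCMC, so this is an illustration of the likelihood terms only. All names and numbers are hypothetical.

```python
import math


def no_admixture_posterior(genotype, freqs):
    """Posterior Pr(individual comes from population k) under a uniform prior.

    genotype: list of (allele, allele) tuples, one per locus.
    freqs[k][l][allele]: frequency of the allele at locus l in population k.
    """
    log_liks = []
    for pop in freqs:
        ll = 0.0
        for locus, (a1, a2) in enumerate(genotype):
            # HWE within populations: the two allele copies are independent draws.
            ll += math.log(pop[locus][a1]) + math.log(pop[locus][a2])
        log_liks.append(ll)
    top = max(log_liks)
    weights = [math.exp(ll - top) for ll in log_liks]
    total = sum(weights)
    return [w / total for w in weights]


def admixture_log_likelihood(genotype, freqs, q):
    """Log-likelihood when each allele copy comes from population k with probability q[k]."""
    ll = 0.0
    for locus, alleles in enumerate(genotype):
        for allele in alleles:
            ll += math.log(sum(qk * pop[locus][allele] for qk, pop in zip(q, freqs)))
    return ll


# Hypothetical frequencies for K = 2 populations at 2 loci.
freqs = [
    [{"A": 0.9, "B": 0.1}, {"A": 0.8, "B": 0.2}],
    [{"A": 0.2, "B": 0.8}, {"A": 0.3, "B": 0.7}],
]
genotype = [("A", "A"), ("A", "B")]

print(no_admixture_posterior(genotype, freqs))
print(admixture_log_likelihood(genotype, freqs, [0.5, 0.5]))
```

Setting q to (1, 0) in the admixture function recovers the no-admixture likelihood for the first population, which shows that model 1 is a special case of model 2.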
Four basic models (continued)
3. Linkage model
“Chunks” of chromosomes are derived as intact units from one or another of the K populations, and all allele copies on the same “chunk” derive from the same population.
The model accounts for the resulting correlations in ancestry between linked loci
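The chunk idea above can be sketched as a Markov chain along the chromosome, in the spirit of Falush, Stephens, and Pritchard (2003). With probability exp(-d*r) no chunk boundary falls between two adjacent loci and the ancestry is copied. Otherwise the ancestry is redrawn from the individual's admixture proportions q. The names `q`, `r` and `d` and the numeric values are illustrative assumptions, not values from the lecture.

```python
import math


def ancestry_transition_matrix(q, r, d):
    """Transition probabilities for the population of origin between adjacent loci.

    q: admixture proportions of the individual (sum to 1).
    r: rate of chunk breakpoints per unit of map distance (hypothetical value).
    d: map distance between the two loci.
    """
    stay = math.exp(-d * r)
    k = len(q)
    return [
        [(stay if j == i else 0.0) + (1.0 - stay) * q[j] for j in range(k)]
        for i in range(k)
    ]


for row in ancestry_transition_matrix([0.7, 0.3], r=2.0, d=0.1):
    print(row)
```

At d = 0 the matrix is the identity (fully linked loci share ancestry). As d grows, every row approaches q, which is the admixture model with independent loci.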
Four basic models (continued)
4. F model
Assumes that the populations all diverged from a common ancestral population at the same time, but allows the populations to have experienced different amounts of drift since the divergence event
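One common way to formalize the F model is to draw each population's allele frequencies from a Dirichlet distribution centred on the ancestral frequencies, with parameters p_A * (1 - F_k) / F_k. A larger F_k then means more drift. The sketch below simulates such a draw using only the standard library. The function name, the seed, and the ancestral frequencies are hypothetical.

```python
import random


def drifted_frequencies(ancestral, f_k, rng):
    """Draw allele frequencies for one population under Dirichlet drift.

    ancestral: ancestral allele frequencies (strictly positive, sum to 1).
    f_k: drift parameter of the population, 0 < f_k < 1.
    rng: a random.Random instance.
    """
    scale = (1.0 - f_k) / f_k
    # A Dirichlet draw is a vector of independent gamma draws, normalized.
    draws = [rng.gammavariate(p * scale, 1.0) for p in ancestral]
    total = sum(draws)
    return [g / total for g in draws]


rng = random.Random(0)
ancestral = [0.5, 0.3, 0.2]
print(drifted_frequencies(ancestral, 0.01, rng))  # little drift
print(drifted_frequencies(ancestral, 0.50, rng))  # strong drift
```

Under this parameterization the variance of a frequency around its ancestral value p is F_k * p * (1 - p), so different populations can carry different amounts of drift from the same ancestor.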
Assumptions
• “Our main modeling assumptions are Hardy-Weinberg equilibrium within populations and complete linkage equilibrium between loci within populations”
• “Loosely speaking, the idea here is that the model accounts for the presence of HWD or LD by introducing population structure and attempts to find population groupings that (as far as possible) are not in disequilibrium”
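The two disequilibria named above (HWD and LD) can be quantified with textbook statistics. The following is a minimal sketch for one biallelic locus and for one pair of biallelic loci. The function names and counts are hypothetical. STRUCTURE does not run these tests explicitly; it searches for groupings within which such departures vanish.

```python
def hwe_chi_square(n_aa, n_ab, n_bb):
    """Chi-square statistic for departure from Hardy-Weinberg proportions."""
    n = n_aa + n_ab + n_bb
    p = (2 * n_aa + n_ab) / (2 * n)  # frequency of allele A
    q = 1.0 - p
    expected = [n * p * p, 2 * n * p * q, n * q * q]
    observed = [n_aa, n_ab, n_bb]
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected) if e > 0)


def ld_coefficient(p_ab, p_a, p_b):
    """Linkage disequilibrium D = P(AB) - P(A) * P(B) for two loci."""
    return p_ab - p_a * p_b


# A single population in HWE, then a pooled sample of two populations
# fixed for different alleles (no heterozygotes at all).
print(hwe_chi_square(25, 50, 25))
print(hwe_chi_square(50, 0, 50))
print(ld_coefficient(0.5, 0.5, 0.5))
```

The second call shows why hidden structure produces HWD: pooling two differentiated populations yields a heterozygote deficit even if each population is in equilibrium on its own.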
Data
► Consider a sample of N individuals, each one genotyped at L loci
► Assume that the individuals represent a mixture of K unobserved populations (K unknown)
► If diploid, we have an N×2L data matrix X
► If n-ploid, X is $N \times \sum_{l=1}^{L} J_l$, where $J_l$ is the number of alleles at the lth locus
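The two matrix shapes above can be made concrete with a small example. The sketch assumes that the second form is an allele-count encoding, with one column per allele per locus. That reading is an interpretation of the slide. The function names and the toy genotypes are hypothetical.

```python
def to_allele_matrix(genotypes):
    """N x 2L layout: the two allele copies of every locus, side by side."""
    return [[allele for locus in ind for allele in locus] for ind in genotypes]


def to_count_matrix(genotypes):
    """N x sum(J_l) layout: copies of each allele carried at each locus."""
    n_loci = len(genotypes[0])
    alleles = [
        sorted({a for ind in genotypes for a in ind[l]}) for l in range(n_loci)
    ]
    rows = []
    for ind in genotypes:
        row = []
        for l in range(n_loci):
            row.extend(ind[l].count(a) for a in alleles[l])
        rows.append(row)
    return rows, alleles


# N = 3 diploid individuals, L = 2 loci; J_1 = 2 and J_2 = 3 alleles.
genotypes = [
    [(1, 2), (3, 3)],
    [(2, 2), (3, 4)],
    [(1, 1), (4, 5)],
]
print(to_allele_matrix(genotypes))
print(to_count_matrix(genotypes)[0])
```

The count layout does not depend on ploidy: an n-ploid individual simply has row entries that sum to n within each locus.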
Input file format
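As a rough illustration of one common input layout, the sketch below writes two rows per diploid individual, each starting with an individual label and a population identifier, and codes missing alleles as -9. This layout is assumed to correspond to settings such as LABEL=1, POPDATA=1, ONEROWPERIND=0 and MISSING=-9. Both the layout and those option names should be checked against the STRUCTURE documentation. The helper name and the sample data are hypothetical.

```python
def structure_rows(individuals, missing=-9):
    """Format genotypes as text rows, two rows per diploid individual.

    individuals: list of (label, pop_id, [(allele1, allele2), ...]);
    None marks a missing allele.
    """
    lines = []
    for label, pop_id, genotype in individuals:
        for copy in (0, 1):
            alleles = [
                missing if locus[copy] is None else locus[copy]
                for locus in genotype
            ]
            lines.append(" ".join(str(v) for v in [label, pop_id] + alleles))
    return lines


# Hypothetical microsatellite genotypes at two loci.
sample = [
    ("ind1", 1, [(145, 147), (92, None)]),
    ("ind2", 2, [(145, 145), (94, 96)]),
]
print("\n".join(structure_rows(sample)))
```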
Parameter setting
► Main parameters (mainparams.txt)
► Extra parameters (extraparams.txt)
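Parameter files of this kind are plain text made of `#define NAME value` lines. The sketch below renders such lines from a dictionary. The parameter names are given as recalled from STRUCTURE 2.x and the values are purely illustrative, so both should be verified against the parameter files shipped with the program.

```python
def render_params(params):
    """Render a parameter dictionary as '#define NAME value' lines."""
    return "\n".join(f"#define {name} {value}" for name, value in params.items())


# Illustrative values only; names are assumptions to be checked
# against the distributed mainparams file.
main_params = {
    "MAXPOPS": 3,        # number of clusters K
    "BURNIN": 10000,     # burn-in iterations
    "NUMREPS": 20000,    # MCMC iterations after burn-in
    "INFILE": "infile",
    "OUTFILE": "outfile",
    "NUMINDS": 3,
    "NUMLOCI": 2,
    "PLOIDY": 2,
    "MISSING": -9,
    "ONEROWPERIND": 0,
    "LABEL": 1,
    "POPDATA": 1,
}

print(render_params(main_params))
```

Model choices such as admixture versus no admixture are assumed to be set in the extra parameters file rather than here.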
Software demonstration (STRUCTURE)
Summary plot of estimates of individual membership fraction
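A summary plot of this kind shows each individual as a bar divided into K segments, with segment lengths proportional to the estimated membership fractions. The sketch below renders a text version of the same idea. The function name, the symbols, and the membership values are hypothetical and are not output of an actual run.

```python
def summary_bars(q_matrix, width=20, symbols="#=+."):
    """Render each individual's membership fractions as one text bar.

    q_matrix: one list of K membership fractions per individual (each sums to 1).
    Requires K <= len(symbols).
    """
    lines = []
    for q in q_matrix:
        bar, start, cumulative = [], 0, 0.0
        for k, fraction in enumerate(q):
            cumulative += fraction
            end = round(cumulative * width)  # segment boundary on the bar
            bar.append(symbols[k] * (end - start))
            start = end
        lines.append("".join(bar))
    return lines


# Hypothetical membership fractions for four individuals, K = 3.
q_matrix = [
    [0.95, 0.03, 0.02],
    [0.10, 0.85, 0.05],
    [0.45, 0.50, 0.05],
    [0.02, 0.03, 0.95],
]
for line in summary_bars(q_matrix):
    print(line)
```

Bars dominated by one symbol correspond to individuals assigned almost entirely to one cluster. Mixed bars correspond to admixed individuals.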
Commonly used software
► STRUCTURE
 http://pritch.bsd.uchicago.edu/software/structure2_2.html
► EIGENSOFT
 http://genepath.med.harvard.edu/~reich/Software.htm
► SPSS
Exercise
► Run a STRUCTURE analysis using HapMap data;
 http://www.hapmap.org