Slides - Brown CS

Outline
• Review: SNPs, Haplotypes
• Hardy-Weinberg Law
• Deviations from Hardy-Weinberg: FST
• Linkage Disequilibrium
Genetics 101
• Humans are diploid: two copies of each
chromosome, maternal and paternal
• Locus: Region on a chromosome (gene,
nucleotide, etc.)
• Allele: “Value” at a locus
• Genotype: Pair of alleles (maternal and paternal)
at loci on a chromosome (homozygous,
heterozygous)
• Haplotype: Alleles of loci on same chromosome
(maternal or paternal)
Single Nucleotide Polymorphisms
Infinite Sites Assumption:
Each site mutates at most
once
00000101011
10001101001
01000101010
01000000011
00011110000
00101100110
By convention, SNPs are biallelic: only two of
four possible nucleotides present in population
Infinite Sites Assumption
[Figure: tree rooted at 00000000. A mutation at site 3 gives 00100000; from there, a mutation at site 5 gives 00101000 and a mutation at site 8 gives 00100001.]
• The different sites are linked. A 1 in position 8 implies 0
in position 5, and vice versa.
• Each sequence has a single parent.
• The history of a population can be expressed as a tree.
• The tree can be constructed efficiently.
Infinite sites Assumption and Perfect Phylogeny
• Each site is mutated at
most once in the history.
• All descendants must
carry the mutated value,
and all others must carry
the ancestral value
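[Figure: the edge carrying the mutation at site i splits the tree. Sequences below that edge have 1 in position i; all other sequences have 0 in position i.]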
Perfect Phylogeny
• Assume an evolutionary model in which only mutation takes place.
• The evolutionary history is explained by a tree in which every
mutation is on an edge of the tree. All the species in one sub-tree
contain a 0, and all species in the other contain a 1. Such a tree is
called a perfect phylogeny.
• How can one reconstruct such a tree?
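One possible way to approach the reconstruction is sketched below. It assumes the ancestral state is 0 at every site (as in the tree example rooted at 00000000) and that haplotypes are given as strings of 0s and 1s. Under those assumptions, a perfect phylogeny exists exactly when the sets of sequences carrying each mutation are pairwise nested or disjoint, so each mutation can be hung below the smallest mutation that contains it. The function name `mutation_parents` and the 1-based site numbering are illustrative choices, not something taken from the slides.

```python
def mutation_parents(haplotypes):
    """Map each mutated site (1-based) to the site on the edge above it.

    A value of None means the mutation sits directly below the root.
    Assumes the ancestral allele is 0 at every site.
    """
    n_sites = len(haplotypes[0])

    # For every site, record which sequences carry the derived allele 1.
    carriers = {}
    for j in range(n_sites):
        rows = frozenset(i for i, h in enumerate(haplotypes) if h[j] == "1")
        if rows:
            carriers[j + 1] = rows

    # Larger carrier sets must sit higher in the tree, so handle them first.
    order = sorted(carriers, key=lambda site: -len(carriers[site]))

    parent = {}
    for k, site in enumerate(order):
        parent[site] = None
        # Scan earlier sites from smallest to largest carrier set.
        for earlier in reversed(order[:k]):
            common = carriers[site] & carriers[earlier]
            if common == carriers[site]:
                parent[site] = earlier  # smallest set that contains this one
                break
            if common:
                # Partial overlap: no tree can explain both sites.
                raise ValueError(f"sites {earlier} and {site} conflict")
    return parent


tree = mutation_parents(["00000000", "00100000", "00100001", "00101000"])
print(tree)  # {3: None, 5: 3, 8: 3}
```

On the four sequences of the earlier example, the mutations at sites 5 and 8 both hang below the mutation at site 3, which matches the tree drawn there.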
The 4-gamete condition
• A column i partitions the set
of species into two sets i0,
and i1
• A column is homogeneous
w.r.t. a set of species if it has
the same value for all
species. Otherwise, it is
heterogeneous.
• Ex: i is heterogeneous w.r.t.
{A,D,E}
Species  i  Set
A        0  i0
B        0  i0
C        0  i0
D        1  i1
E        1  i1
F        1  i1
4 Gamete Condition
• 4 Gamete Condition
  - There exists a perfect phylogeny if and only if no pair of columns (i,j)
contains all 4 of the following rows:
(0,0), (0,1), (1,0), (1,1)
Summary: No recombination leads to
correlation between sites
[Figure: tree rooted at 00000000, with mutations at sites 3, 5, and 8 giving 00100000, 00101000, and 00100001.]
• The different sites are linked. A 1 in position 8 implies 0
in position 5, and vice versa.
• The history of a population can be expressed as a tree.
• The tree can be constructed efficiently.
Haplotype Phasing Problem
• Most sequencing technologies measure
genotypes not haplotypes
Pair of haplotypes: 0101110
                    1100010
Genotype:           2102210   (2 = heterozygous)
• Given a set of genotypes, infer the haplotypes.
• Use parsimony assumption
  - Haplotypes satisfy perfect phylogeny (Gusfield)
  - Find minimum number of haplotypes that explain
observed genotypes
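The genotype encoding above can be sketched as follows, assuming haplotypes are strings of 0s and 1s and that 2 marks a heterozygous site. The helper names `genotype` and `count_phasings` are illustrative. The second helper shows why phasing is ambiguous: a genotype with k heterozygous sites is consistent with 2^(k-1) unordered haplotype pairs (one pair when k is 0 or 1).

```python
def genotype(h1, h2):
    """Combine two haplotypes into a genotype string (2 = heterozygous)."""
    if len(h1) != len(h2):
        raise ValueError("haplotypes must have the same length")
    return "".join(a if a == b else "2" for a, b in zip(h1, h2))


def count_phasings(g):
    """Number of unordered haplotype pairs consistent with genotype g."""
    heterozygous_sites = g.count("2")
    return 2 ** max(heterozygous_sites - 1, 0)


g = genotype("0101110", "1100010")
print(g)                  # 2102210
print(count_phasings(g))  # 4
```

Because several haplotype pairs explain the same genotype, extra assumptions such as perfect phylogeny or parsimony are needed to pick one.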
Linkage (Dis)-equilibrium (LD)
No recombination
A  B
0  1
0  1
0  0
0  0
1  0
1  0
1  0
1  0
Pr[A=0,B=1] = 0.25
• Linkage disequilibrium

Extensive recombination
A  B
0  0
0  1
0  0
0  0
1  1
1  0
1  0
1  0
Pr[A=0,B=1] = 0.125 = Pr[A=0] Pr[B=1]
• Linkage equilibrium

In both tables, Pr[A=0] = 0.5, Pr[B=1] = 0.25
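The comparison above can be reproduced with a short calculation. The sketch below assumes the usual LD coefficient D = Pr[A=a, B=b] - Pr[A=a] Pr[B=b], with the two loci given as parallel lists of alleles, one entry per haplotype. The function name `ld_coefficient` is a hypothetical choice.

```python
def ld_coefficient(locus_a, locus_b, allele_a=0, allele_b=1):
    """D = Pr[A=allele_a, B=allele_b] - Pr[A=allele_a] * Pr[B=allele_b]."""
    n = len(locus_a)
    p_a = sum(x == allele_a for x in locus_a) / n
    p_b = sum(y == allele_b for y in locus_b) / n
    p_ab = sum(x == allele_a and y == allele_b
               for x, y in zip(locus_a, locus_b)) / n
    return p_ab - p_a * p_b


a = [0, 0, 0, 0, 1, 1, 1, 1]
b_no_recombination = [1, 1, 0, 0, 0, 0, 0, 0]
b_extensive_recombination = [0, 1, 0, 0, 1, 0, 0, 0]

print(ld_coefficient(a, b_no_recombination))         # 0.125
print(ld_coefficient(a, b_extensive_recombination))  # 0.0
```

D = 0 corresponds to linkage equilibrium (the loci look independent), while a nonzero D indicates linkage disequilibrium.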
Recombination
[Figure: parent haplotypes 00000000 and 11111111 recombine to give 00011111.]
• A tree is not sufficient
as a sequence may
have 2 parents
• Recombination leads
to violation of 4
gamete property.
• Recombination leads
to loss of correlation
between columns
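A single crossover can be sketched as taking a prefix from one parent and the suffix from the other. This is a simplified model with one breakpoint, and the names `recombine` and `gametes` are illustrative. The second part uses a small hypothetical population (not from the slides) to show how one recombinant introduces the fourth gamete.

```python
def recombine(left_parent, right_parent, breakpoint):
    """Child takes sites before `breakpoint` from one parent, the rest from the other."""
    return left_parent[:breakpoint] + right_parent[breakpoint:]


def gametes(haplotypes, i, j):
    """Distinct allele pairs observed at 0-based sites i and j."""
    return {(h[i], h[j]) for h in haplotypes}


# The example from the figure: breakpoint after the third site.
print(recombine("00000000", "11111111", 3))  # 00011111

# Hypothetical two-site population with only three gametes (tree-like).
population = ["00", "10", "01"]
print(len(gametes(population, 0, 1)))        # 3

# A crossover between 10 and 01 creates 11, the fourth gamete.
child = recombine("10", "01", 1)
print(child)                                 # 11
print(len(gametes(population + [child], 0, 1)))  # 4
```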
LD can be used to map disease genes
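[Figure: six samples with disease status D, N, N, D, D, N carry alleles 0, 1, 1, 0, 0, 1 at a nearby marker, so the marker allele tracks disease status (LD).]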
• LD decays with distance from the disease allele.
• By plotting LD, one can shortlist the region
containing the disease gene.
Population sub-structure can cause
problems in disease gene mapping
Population sub-structure can increase
LD
• Consider two populations
that were isolated and
evolving independently.
• They might have different
allele frequencies in some
regions.
• Pick two regions that are
far apart (LD is very low,
close to 0)
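[Figure: 9 haplotypes per population, showing the alleles at the two distant loci]
Pop. A   locus 1: 0 0 0 1 0 0 0 0 0
         locus 2: 1 1 0 1 1 1 1 1 1
Pop. B   locus 1: 1 1 0 1 1 1 1 1 1
         locus 2: 0 0 0 1 0 0 0 0 0
Pop. A: p1=0.1, q1=0.9, P11=0.1, D = P11 - p1 q1 = 0.01
Pop. B: p1=0.9, q1=0.1, P11=0.1, D = P11 - p1 q1 = 0.01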
Recent admixing of populations
• If the populations came
together recently (e.g., African
and European populations),
artificial LD might be created.
• D = 0.1 - 0.25 = -0.15, so |D| = 0.15 (instead of 0.01),
a 15-fold increase
• This spurious LD might lead
to false associations
• Other genetic events can
cause LD to arise, and one
needs to be careful
[Figure: the haplotypes of Pop. A and Pop. B pooled into a single population]
Pop. A+B: p1=0.5, q1=0.5, P11=0.1, D = P11 - p1 q1 = 0.1 - 0.25 = -0.15
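The pooled figures can be checked with a short calculation. The sketch below assumes the two populations are mixed in equal proportions (consistent with the pooled p1 = 0.5) and uses D = P11 - p1 q1. The function name `pooled_ld` and the tuple layout (p1, q1, P11) are hypothetical.

```python
def pooled_ld(pop_a, pop_b, weight_a=0.5):
    """Pool two populations given as (p1, q1, P11) and return (p1, q1, P11, D).

    weight_a is the fraction of the pooled sample drawn from pop_a;
    equal mixing (0.5) is an assumption, not something the slides state.
    """
    w = weight_a
    p1 = w * pop_a[0] + (1 - w) * pop_b[0]
    q1 = w * pop_a[1] + (1 - w) * pop_b[1]
    p11 = w * pop_a[2] + (1 - w) * pop_b[2]
    return p1, q1, p11, p11 - p1 * q1


pop_a = (0.1, 0.9, 0.1)  # within-population D = 0.1 - 0.09 = 0.01
pop_b = (0.9, 0.1, 0.1)  # within-population D = 0.1 - 0.09 = 0.01

p1, q1, p11, d = pooled_ld(pop_a, pop_b)
print(round(p1, 2), round(q1, 2), round(p11, 2), round(d, 2))  # 0.5 0.5 0.1 -0.15
```

Each population on its own has D = 0.01, yet the pooled sample has |D| = 0.15. The extra LD comes only from the difference in allele frequencies between the two populations, not from physical linkage between the loci.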