p j1 - PhD Alumni from The Computer Science Dept at UC Riverside

Some Algorithmic Problems Concerning the
Inference and Analysis of TagSNPs,
Haplotypes and Pedigrees
Ph.D. candidate: Lan Liu
Advisor: Tao Jiang
Outline
The haplotype inference problem
The tagSNP selection problem
The minimum common integer partition problem
Outline
The haplotype inference problem
 Biological background
 Approximation and complexity of MRHC
 Efficient algorithms for ZRHC
 A linear-time algorithm for loop-free ZRHC
The tagSNP selection problem
The minimum common integer partition problem
Introduction

Basic concepts

Mendelian Law: one haplotype comes from
the mother and the other comes from the father.
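[Figure: a genotype and its two haplotypes (paternal and maternal) across several loci. A locus is homozygous if its two alleles are equal (2 2 or 1 1) and heterozygous otherwise. The heterozygous locus 2 1 has PS value = 1; the heterozygous locus 1 2 has PS value = 0.]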
Example: Mendelian experiment
Notations and Recombinant
[Figure: genotypes and haplotype configurations of a father, a mother and their child. In the first configuration the child inherits its haplotypes with 0 recombinants; in the second, with 1 recombinant.]
Pedigree

An example: British Royal Family
[Figure: pedigree of the British Royal Family. Founders: Elizabeth II of the United Kingdom and Prince Philip, Duke of Edinburgh. Children: Prince Charles, Prince of Wales; Princess Anne, Princess Royal; Prince Andrew, Duke of York; Prince Edward, Earl of Wessex. Spouses: Diana, Princess of Wales; Camilla, Duchess of Cornwall; Captain Mark Phillips; Commander Timothy Laurence; Sarah Margaret Ferguson; Sophie Rhys-Jones. Grandchildren: Prince William of Wales; Prince Henry of Wales; Peter Phillips; Zara Phillips; Princess Beatrice of York; Princess Eugenie of York; Lady Louise Windsor.]
A mating loop: a cycle inside the pedigree.
Haplotype Reconstruction


- Haplotype: useful, expensive
- Genotype: cheaper to obtain
Reconstruct haplotypes from genotypes
[Figure: two examples, (a) and (b), of reconstructing the haplotypes of two members M and C from their genotypes.]
Problem Definitions

MRHC:
Given a pedigree and the genotype information for
each member, find a haplotype configuration for
each member that obeys the Mendelian law, such that the
number of recombinants is minimized.

ZRHC: zero-recombinant

Loop-free-ZRHC: zero-recombinant, pedigree with no
mating loops
Approximation and Complexity of MRHC
The known hardness results for MRHC

Problem                            Hardness
2-locus-MRHC                       NP-hard [LJ03]
Tree-MRHC with bounded #members    P [LJ03]
Tree-MRHC with bounded #loci       P [DLJ03]
Tree-MRHC                          NP-hard [DLJ03]

2-locus-MRHC: 2 loci
Tree-MRHC: pedigree having no mating loops
Our Hardness and Approximation Results
Binary-treeMRHC
Lower bound
Hardness of approx.
ratio
Assumption
Any f(n)
P≠NP
Binary-treeMRHC*
Any f(n)
P≠NP
Tree-MRHC
Upper bound
of approx.
ratio
NP
2-locus-MRHC*
2-locus-MRHC
The lower bound
holds for
2-locus-MRHC*
(4,1)
Binary-treeMRHC*(1,1)
P≠NP
2-locus-MRHC
Any constant the Unique Games
(16,15)
Conjecture[Khot02]
P≠NP
Tree-MRHC(1,u)
Any constant the Unique Games
Tree-MRHC(u,1)
Conjecture
Tree-MRHC: no mating loop
Binary-tree-MRHC: 1 mate, 1 child
Binary-tree-MRHC*: 1 mate, 1 child, missing data
O ( log(n))
2-locus-MRHC: 2 loci
2-locus-MRHC*: 2 loci with missing data
The ZRHC problem

Problem definition
Given a pedigree and the genotype information
for each member, find a recombination-free
haplotype configuration for each member that
obeys the Mendelian law of inheritance.
Previous work




Li and Jiang introduced a system of linear equations over
F_2 and presented an O(m^3 n^3)-time algorithm for ZRHC
[LJ03], where m is #loci and n is #members in the pedigree.
Recently, Chan et al. proposed a linear-time algorithm
[CCC+06], which only works for pedigrees without mating
loops.
Methods based on fast matrix multiplication
can solve k equations in k unknowns in O(k^2.376) time.
The Lanczos and conjugate gradient algorithms are only
heuristics [GV96]. The Wiedemann algorithm has expected
quadratic running time [W86].
Our Result

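We present a much faster algorithm for ZRHC with running
time O(mn^2 + n^3 log^2 n log log n).

[Figure: a system Ax = b with O(mn) equations on O(mn) unknowns is transformed into a system with O(mn) equations on O(n) unknowns; redundancy elimination then reduces it to O(n log^2 n log log n) equations on O(n) unknowns.]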
The New Linear System

m: #loci
n: #members in the pedigree
Unknowns
p_j: the paternal haplotype vector of a member j.
h_{j1,j}: the scalar representing the inheritance information between a
parent j1 and a child j.
The New Linear System
[Figure: a father j1, a mother j2 and their child j over four loci. Each member x is annotated with its vector p_x = (p_x,1, p_x,2, p_x,3, p_x,4) and with the vector p_x + w_x; the two parent-child edges carry the scalars h_{j1,j} and h_{j2,j}. In the example, p_j1,2 = 1 and p_j1,3 = 0.]
The Linear System
 O(mn) equations on O(mn) unknowns.
 Given a homozygous locus i on a member j (with a
child j1), pj[i] and pj1[i] are pre-determined.
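For reference, the traditional way to handle such a system is Gaussian elimination over F_2. The block below is a minimal sketch of that baseline, not the faster method of this talk. The function name `solve_gf2` and the row/right-hand-side encoding are assumptions made for illustration.

```python
def solve_gf2(rows, rhs):
    """Solve A x = b over F_2 by Gauss-Jordan elimination.

    rows: list of equations, each a list of 0/1 coefficients.
    rhs:  list of 0/1 right-hand sides.
    Returns one solution (free variables set to 0), or None if inconsistent.
    """
    n_vars = len(rows[0]) if rows else 0
    # Pack each equation into an integer; bit n_vars stores the right-hand side.
    eqs = []
    for coeffs, b in zip(rows, rhs):
        mask = (b & 1) << n_vars
        for col, value in enumerate(coeffs):
            if value & 1:
                mask |= 1 << col
        eqs.append(mask)

    pivots = []  # (column, row index) pairs
    row = 0
    for col in range(n_vars):
        pivot = next((i for i in range(row, len(eqs)) if eqs[i] >> col & 1), None)
        if pivot is None:
            continue  # free variable
        eqs[row], eqs[pivot] = eqs[pivot], eqs[row]
        for i in range(len(eqs)):
            if i != row and eqs[i] >> col & 1:
                eqs[i] ^= eqs[row]  # addition over F_2 is XOR
        pivots.append((col, row))
        row += 1

    # A row reading "0 = 1" means the system has no solution.
    if any(e == 1 << n_vars for e in eqs):
        return None
    x = [0] * n_vars
    for col, r in pivots:
        x[col] = eqs[r] >> n_vars & 1
    return x


# x0 + x1 = 1, x1 + x2 = 0, x0 + x2 = 1 over F_2
solution = solve_gf2([[1, 1, 0], [0, 1, 1], [1, 0, 1]], [1, 0, 1])
print(solution)
```

With k equations on k unknowns this baseline costs O(k^3) bit operations, which is why reducing the number of equations and unknowns matters.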
Pedigree Graph

[Figure: a pedigree with genotypes on members 1 to 9, and the corresponding pedigree graph G.]
#edges ≤ 2n
Locus Graph
 Locus graph Gi
Gi = (V, Ei), where Ei= {(k,j)| k is a parent of j, wk[i]=1}
[Figure: (a) genotype information of the pedigree on members 1 to 9; (b) the locus graph, where pre-determined p-values are shown on vertices, unknown ones are marked "?", h-variables such as h1,4, h6,8, h4,9 and h8,9 label edges, and zero-weight edges are marked.]
Example: Locus graph for the 3rd locus
p-variables: variables on vertices.
h-variables: variables on edges, shared by all locus graphs.
An Observation
For any cycle, or any path connecting two pre-determined
vertices, in a locus graph, the sum of the h-variables along the
path is a constant.
We can use paths to denote constraints!
(proof sketch)
Assume the path j0, j1, …, jk in locus graph Gi
connects two pre-determined vertices j0 and jk, where each edge (jt-1, jt)
carries the variable h_{jt-1,jt} and a constant d_{jt-1,jt}.
p_{j0}[i] + h_{j0,j1} = p_{j1}[i] + d_{j0,j1}
p_{j1}[i] + h_{j1,j2} = p_{j2}[i] + d_{j1,j2}
p_{j2}[i] + h_{j2,j3} = p_{j3}[i] + d_{j2,j3}
…
p_{jk-1}[i] + h_{jk-1,jk} = p_{jk}[i] + d_{jk-1,jk}
Adding all equations over F_2 cancels the intermediate p-values:
h_{j0,j1} + h_{j1,j2} + … + h_{jk-1,jk} = p_{j0}[i] + p_{jk}[i] + d_{j0,j1} + … + d_{jk-1,jk}, a constant.
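The cancellation in the proof sketch can be checked by brute force. The sketch below assumes each edge equation has the form p_t + h_t = p_{t+1} + d_t over F_2, as in the equations above. The function names are hypothetical.

```python
from itertools import product


def h_sum_along_path(p_values, d_values):
    """Sum of the h-variables on a path, given all p-values and edge constants.

    Each edge equation p_t + h_t = p_{t+1} + d_t over F_2 gives
    h_t = p_t + p_{t+1} + d_t, where addition is XOR.
    """
    total = 0
    for t, d in enumerate(d_values):
        total ^= p_values[t] ^ p_values[t + 1] ^ d
    return total


def possible_h_sums(p_start, p_end, d_values):
    """Enumerate every choice of the intermediate p-values and collect the sums."""
    k = len(d_values)
    sums = set()
    for middle in product((0, 1), repeat=k - 1):
        sums.add(h_sum_along_path((p_start, *middle, p_end), d_values))
    return sums


# A path with 3 edges whose endpoints are pre-determined as 1 and 0.
print(possible_h_sums(1, 0, [1, 0, 1]))
```

Whatever the intermediate p-values are, the set contains a single value, so the path yields one linear constraint on the h-variables alone.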
Examples of Linear Constraints
[Figure: three locus graphs on the pedigree with members 1 to 9, with pre-determined vertices labelled 0 or 1 and undetermined vertices marked "?".]
(a) 1st locus graph: h6,8 + h8,9 = 1
(b) 2nd locus graph: h3,5 + h3,6 + h2,5 + h2,6 = 0
(c) 3rd locus graph: h4,9 + h2,4 + h2,5 + h3,5 + h3,6 + h6,8 = 0
Linear Constraints
Obviously, the linear constraints are necessary. We
can also show that these constraints are sufficient.
Moreover, we can upper bound #constraints in each
locus graph by O(n), while the trivial analysis gives an
upper bound of O(n^2).
Total #constraints = O(mn).

The ZRHC-PHASE algorithm
Algorithm ZRHC_PHASE
input: a pedigree G=(V,E) and genotype {gj}
output: a general solution of {pj}
begin
Step 1. Preprocessing
Step 2. Linear constraint generation on h-variables
Step 3. Solve h-variables by Gaussian Elimination
Step 4. Solve the p-variables by propagation from
pre-determined p-variables to others.
end

Traditional method
 Solve h-variables and p-variables together
 O(mn) equations on O(mn) unknowns: O(mn) p-variables
 and O(n) h-variables.
Our method
 Solve h-variables and p-variables separately
 O(mn) linear equations on O(n) h-variables.
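Our Method
[Figure: a system Ax = b with O(mn) equations on O(mn) unknowns is transformed into a system with O(mn) equations on O(n) unknowns; redundancy elimination then reduces it to O(n log^2 n log log n) equations on O(n) unknowns.]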
Redundant Equation Elimination

An observation
Given a cycle j0, j1, j2, …, jk-1, jk, assume that
there are constraints among each pair of vertices
(for example j0 ~ j2, j2 ~ jk-1 and j0 ~ jk-1).

Key lemma
Originally, there are O(k^2) constraints. Notice that
they are not independent.

We can replace the original constraints by an
equivalent set of constraints of size O(k).

Remove the redundant equations
without solving them!
Redundant Equation Elimination
Given a spanning tree, the stretch of an edge
(k, j) is defined as the length of the unique path
between k and j on the tree.

Elkin, Emek, Spielman and Teng show that every graph
has a low-stretch spanning
tree with average stretch O(log^2 n log log n).

The number of irredundant constraints can be
bounded by the sum of cycle lengths, which is
further bounded by the sum of stretches, O(n log^2 n
log log n).
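To make the definition of stretch concrete, the sketch below computes the stretch of every graph edge with respect to a given spanning tree. It does not construct a low-stretch tree; the tree is taken as input. The function names and the edge-list encoding are assumptions for illustration.

```python
from collections import deque


def tree_distance(tree_adj, src, dst):
    """Length of the unique tree path between src and dst, found by BFS."""
    dist = {src: 0}
    queue = deque([src])
    while queue:
        u = queue.popleft()
        if u == dst:
            return dist[u]
        for v in tree_adj.get(u, ()):
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    raise ValueError("vertices are not connected in the tree")


def edge_stretches(tree_edges, graph_edges):
    """Return the stretch of each graph edge and the average stretch."""
    tree_adj = {}
    for u, v in tree_edges:
        tree_adj.setdefault(u, []).append(v)
        tree_adj.setdefault(v, []).append(u)
    stretches = [tree_distance(tree_adj, u, v) for u, v in graph_edges]
    return stretches, sum(stretches) / len(stretches)


# Spanning tree: the path 1-2-3-4. The graph has two extra edges.
tree = [(1, 2), (2, 3), (3, 4)]
graph = tree + [(1, 4), (1, 3)]
print(edge_stretches(tree, graph))
```

A tree edge always has stretch 1, and a non-tree edge closes a cycle whose length is its stretch plus one, which is how the sum of stretches bounds the sum of cycle lengths.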

The Loop-Free ZRHC problem

Problem definition
Given a pedigree without mating loops and the
genotype information for each member, find a
recombination-free haplotype configuration for
each member that obeys the Mendelian law of
inheritance.
Constraint Graphs

Given the constraints in a pedigree graph, we can construct
the corresponding constraint graph.
Pedigree graph                              Constraint graph
vertex v                                    vertex v
a constraint for the path connecting        an edge (j, k) with weight b
vertices j and k, with the sum of
h-variables along the path being b
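An example
(a) A pedigree graph on vertices 1 to 5 with constraints:

Path     Sum of h-variables
(1,5)    1
(1,2)    1
(2,4)    0
(2,5)    0

(b) Corresponding constraint graph: edges (1,5) and (1,2) have weight 1; edges (2,4) and (2,5) have weight 0.
[Figure: the pedigree graph and its constraint graph.]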
A Key Lemma

There exists a solution to the loop-free ZRHC problem
if and only if the weight sum of every cycle C is 0 in
the corresponding constraint graph.
(proof sketch)
"=>"
Each h-variable occurs an even number of times in the constraint
set S corresponding to C.
The sum of the h-variables in S is equal to the weight sum of C.
Hence the weight sum of C is 0.
The constraints in S are not independent!
[Figure: (a) the pedigree graph; (b) the corresponding constraint graph.]
"<=" Done by a construction later.
A Mapping from Constraints to Edges
 The constraints forming a spanning forest in the
constraint graph are sufficient to represent all constraints.
 There are at most n-1 independent constraints.
 We can construct an injective mapping f from the
independent constraints to edges in the pedigree graph
(a) A spanning forest for the constraint graph:

Constraint (path)    Sum of h-variables
(1,2)                1
(2,4)                0
(2,5)                0

(b) The pedigree graph, with the mapping f:

Constraint    Edge
(1,2)         (2,3)
(2,4)         (3,4)
(2,5)         (4,5)

Each constraint is mapped to an
edge on the path corresponding to
the constraint.
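One way to see why a spanning forest of the constraint graph suffices is to grow the forest by BFS and check every remaining edge against it. The sketch below does this for a weighted constraint graph, treating weights as elements of F_2. It reports whether every cycle has weight sum 0. The function name and the edge-triple encoding are assumptions for illustration.

```python
from collections import deque


def all_cycles_have_zero_weight(vertices, weighted_edges):
    """Check that every cycle of the constraint graph has weight sum 0 over F_2.

    weighted_edges: iterable of (j, k, b) with b in {0, 1}.
    A BFS labels each vertex with the weight of its forest path from the root;
    an edge (j, k, b) is consistent iff label[j] XOR label[k] == b.
    """
    adj = {v: [] for v in vertices}
    for j, k, b in weighted_edges:
        adj[j].append((k, b))
        adj[k].append((j, b))

    label = {}
    for root in vertices:
        if root in label:
            continue
        label[root] = 0
        queue = deque([root])
        while queue:
            u = queue.popleft()
            for v, b in adj[u]:
                if v not in label:
                    label[v] = label[u] ^ b  # forest edge: defines the label
                    queue.append(v)
                elif label[v] != label[u] ^ b:
                    return False  # this edge closes a cycle of weight 1
    return True


constraints = [(1, 5, 1), (1, 2, 1), (2, 4, 0), (2, 5, 0)]
print(all_cycles_have_zero_weight([1, 2, 3, 4, 5], constraints))
```

On the example constraints (1,5)=1, (1,2)=1, (2,4)=0, (2,5)=0, the only cycle is 1-2-5 with weight 1 + 0 + 1 = 0 over F_2, so the check succeeds and the non-forest edge adds no new information.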
The ZRHC-PHASE algorithm
Algorithm ZRHC_PHASE
input: a pedigree G=(V,E) and genotype {gj}
output: a general solution of {pj}
begin
Step 1. Preprocessing
Step 2. Linear constraint generation on h-variables
Step 3. Solve h-variables by Gaussian Elimination
Step 4. Solve the p-variables by propagation from
pre-determined p-variables to others.
end
Step 3 takes O(n^3) time!
Solving h-variables
In order to obtain a linear-time algorithm, we want to
avoid the Gaussian elimination method.
An observation
Given a constraint along a path j0, j1, …, jk-1, jk:

h_{j0,j1} + h_{j1,j2} + … + h_{jk-1,jk} = b

We can solve the constraint in the following way:

Assign the h-variables on edges (j0, j1), (j1, j2), …, (jk-2, jk-1) arbitrarily.
Assign the h-variable on the last edge (jk-1, jk) the fixed value that
satisfies the constraint: h_{jk-1,jk} = h_{j0,j1} + … + h_{jk-2,jk-1} + b.
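The observation translates directly into a few lines of code. The sketch below solves a single path constraint over F_2: the first k-1 variables are chosen freely and the last one is forced. The function name and the use of a random generator for the "arbitrary" choices are assumptions for illustration.

```python
import random


def solve_path_constraint(k, b, rng=random):
    """Return values h_1..h_k over F_2 with h_1 + ... + h_k = b.

    The first k-1 values are arbitrary; the last is the XOR of the others and b.
    """
    free = [rng.randint(0, 1) for _ in range(k - 1)]
    last = b
    for h in free:
        last ^= h
    return free + [last]


h = solve_path_constraint(5, 1, random.Random(0))
print(h)
```

No elimination is involved: each constraint is satisfied by fixing exactly one of its variables, which is what makes a linear-time algorithm possible once every constraint owns a distinct edge.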

Solving h-variables Based on the Mapping f


We have constructed the injective mapping f : S -> E,
where S is the constraint set and E is the edge set.
We solve h-variables as follows:

For each h-variable corresponding to an edge e not in f(S),
assign an arbitrary value.
For each h-variable corresponding to an edge e in f(S),
assign a fixed value based on the constraint f^-1(e), such
that the constraint is satisfied.
[Figure: the pedigree graph on vertices 1 to 5, with each edge marked as in f(S) or not in f(S).]

Constraint    Sum of h-variables    Mapped edge
(1,2)         1                     (2,3)
(2,4)         0                     (3,4)
(2,5)         0                     (4,5)

h-variables can be
solved by a single
BFS traversal.
Motivation


With the rapid development of genotyping technologies,
there are more than 10 million verified single-nucleotide
polymorphisms (SNPs) in dbSNP.
We aim to select a subset of informative SNPs (i.e.
tagSNPs) to avoid the cost of genotyping all SNPs when
performing disease association mapping.
r^2 Linkage Disequilibrium Statistics

Given a pair of genetic
markers 1 and 2, with alleles A/a at marker 1, alleles B/b at marker 2, and the following frequencies:

                 marker 2
                 B       b
marker 1   A     pAB     pAb     pA.
           a     paB     pab     pa.
                 p.B     p.b

r^2 statistic:

r^2 = (pAB - pA. p.B)^2 / (pA. (1 - pA.) p.B (1 - p.B))

If r^2 is no less than a given threshold r0, marker 1
(or marker 2) can tag marker 2 (or marker 1,
respectively).
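The statistic is easy to evaluate from the three frequencies pAB, pA. and p.B. The sketch below follows the formula on this slide. The function names and the sample frequencies are illustrative assumptions, not HapMap data.

```python
def r_squared(p_AB, p_A, p_B):
    """r^2 = (pAB - pA. p.B)^2 / (pA. (1 - pA.) p.B (1 - p.B))."""
    denom = p_A * (1 - p_A) * p_B * (1 - p_B)
    if denom == 0:
        raise ValueError("r^2 is undefined when a marker has only one allele")
    return (p_AB - p_A * p_B) ** 2 / denom


def can_tag(p_AB, p_A, p_B, r0):
    """One marker can tag the other iff r^2 is no less than the threshold r0."""
    return r_squared(p_AB, p_A, p_B) >= r0


# Perfect LD: allele A always occurs together with allele B.
print(r_squared(0.5, 0.5, 0.5))
# Independent markers: pAB equals pA. * p.B.
print(r_squared(0.25, 0.5, 0.5))
```

Because r^2 is symmetric in the two markers, the tagging relation is symmetric as well, which is why it can be drawn as an undirected edge.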
The TagSNP Selection Problem
Given a set V of SNP markers and LD patterns
E = {(vj1, vj2) | r^2(vj1, vj2) is no less than a given threshold r0; vj1 and vj2 are in V},
we want to select a subset V' of minimum cardinality
such that for any v in V there exists a v' in V'
with r^2(v, v') no less than r0.

[Figure: (a) SNP markers 1 to 6 and their LD patterns in a population; (b) tagSNPs for the population.]
If we define G=(V, E), a
tagSNP set is equivalent
to a dominating set on G.
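Since a tagSNP set is a dominating set, any dominating-set heuristic gives a feasible tagSNP set. The sketch below is the textbook greedy heuristic for a single population. It is shown only to illustrate the reduction and is not claimed to be the GreedyTag algorithm of this talk. The function name and the adjacency-dict encoding are assumptions.

```python
def greedy_dominating_set(adjacency):
    """Greedy dominating set on an undirected graph.

    adjacency: dict mapping each SNP to the SNPs it is in strong LD with.
    Repeatedly picks the vertex whose closed neighbourhood covers the most
    still-untagged vertices (ties broken by smallest label).
    """
    closed = {v: set(nbrs) | {v} for v, nbrs in adjacency.items()}
    untagged = set(adjacency)
    chosen = []
    while untagged:
        best = max(sorted(closed), key=lambda v: len(closed[v] & untagged))
        chosen.append(best)
        untagged -= closed[best]
    return chosen


# A path of five SNPs: 1 - 2 - 3 - 4 - 5
ld_graph = {1: [2], 2: [1, 3], 3: [2, 4], 4: [3, 5], 5: [4]}
print(greedy_dominating_set(ld_graph))
```

The greedy choice gives a feasible set but not necessarily a minimum one, which is why a lower bound is needed to judge solution quality.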
TagSNP Selection across Populations


In two populations with different evolutionary histories, a
pair of SNPs having remarkably different marker
frequencies and very weak LD may show strong LD in the
admixed population.
Therefore, tagSNPs picked from the combined populations
or one of the populations might not be sufficient to capture
the variations in all populations.
Problem Definition


Given a set of SNP markers and LD patterns in
multiple populations, we want to find a minimum
common tagSNP set for each of the populations.
The above problem is called the minimum
common tagSNP selection problem (MCTS).
[Figure: (a) SNP markers 1 to 6 and their LD patterns in two populations, Population 1 and Population 2; (b) the minimum tagSNP set for these two populations.]
Our Algorithms


The MCTS problem can be easily formulated as an
integer linear program.
We first apply some data reduction rules, then use one of
the following algorithms:
 A greedy algorithm: GreedyTag
 A Lagrangian relaxation algorithm: LRTag
We calculate both the upper bound (i.e. the number of
tagSNPs obtained by our algorithms) and the lower bound
(i.e. the minimum number of tagSNPs needed).
 Lower bound: GreedyTag_lb and LRTag_lb
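One plausible form of the integer linear program is sketched below. The notation is an assumption, not taken from the talk: x_v indicates whether SNP v is selected, P is the number of populations, r^2_p is the statistic measured in population p, and N_p[v] is the closed neighbourhood of v in the LD graph of population p. The sketch also assumes that every marker must be tagged in every population.

```latex
\begin{align*}
\text{minimize}   \quad & \sum_{v \in V} x_v \\
\text{subject to} \quad & \sum_{u \in N_p[v]} x_u \ge 1
                          && \text{for all } v \in V,\ p = 1, \dots, P, \\
                        & x_v \in \{0, 1\} && \text{for all } v \in V,
\end{align*}
\qquad \text{where } N_p[v] = \{v\} \cup \{\, u \in V : r^2_p(u, v) \ge r_0 \,\}.
```

Relaxing the covering constraints with Lagrange multipliers yields a lower bound on the optimum, which is the general idea behind a Lagrangian relaxation approach.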
Experimental Result


We apply our algorithms to real HapMap data (release #19,
NCBI build 34, October 2005).
There are four populations in the HapMap data:
 CEU: European descendants.
 CHB: Chinese people from Beijing.
 JPT: Japanese people from Tokyo.
 YRI: Yoruba people of Ibadan, Nigeria.
We get tagSNPs for the following two datasets:
 ENCODE regions: all 10 ENCODE regions, with 10,859
 markers in total.
 Human genome: chromosomes 1-22, with 2,862,454
 markers in total.
Experiment Result for ENCODE Regions
We compare our GreedyTag and LRTag with MultiPop-TagSelect (MPS).
With the r^2 threshold being 0.5, the gap between LRTag_lb and LRTag is at most two for each ENCODE
region and six in total over all ENCODE regions.
There is no gap with the r^2 threshold being 0.8.
Experiment Result for Human Genome
The gap between our solution and the lower bound is 1061 SNPs with the r^2
threshold being 0.5, given the entire human genome with 2,862,454 SNPs.
The gap is 142 SNPs with the r^2 threshold being 0.8.
The numbers of tagSNPs selected by
our algorithms are almost optimal.
Problem Definitions


P(n): given an integer n, a partition is a multiset of
positive integers, say {n1, n2, …, nr}, s.t. n1 + n2 + … + nr = n.
Example: given n=4, {2,2} is a P(4);
given n=3, {3} is a P(3).
IP(S): given a multiset S = {x1, …, xm}, an integer
partition is a disjoint union of partitions P(x1), …, P(xm).
Example: given S = {3, 3, 4}, {2,2,3,3} is an
IP({3,3,4}).
Examples

CIP(S1, S2, …, Sk): given multisets S1, S2, …, Sk,
a common integer partition of all multisets.
Example: given S = {3, 3, 4}, T = {2,2,6},
{2,2,3,3} is a CIP(S,T); {1,1,2,2,4} is also a CIP(S,T).
MCIP(S1, S2, …, Sk): a common integer partition
with the minimum cardinality.
Example: {2,2,3,3} is an MCIP(S,T).
#P(100) = 190,569,292
MCIP is NP-hard
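The definitions can be checked on the examples with a short script. The sketch below counts the partitions of an integer with a standard dynamic program and tests whether a multiset of parts is a common integer partition by backtracking. The backtracking takes exponential time in the worst case, in line with the hardness of the problem. All function names are assumptions for illustration.

```python
def count_partitions(n):
    """Number of partitions of n into positive integers (order ignored)."""
    ways = [1] + [0] * n
    for part in range(1, n + 1):
        for total in range(part, n + 1):
            ways[total] += ways[total - part]
    return ways[n]


def is_integer_partition(multiset, parts):
    """True iff `parts` can be split into groups summing to each element of `multiset`."""
    if sum(multiset) != sum(parts):
        return False

    def assign(targets, remaining):
        if not targets:
            return not remaining
        return fill(targets[0], targets[1:], remaining, 0)

    def fill(need, rest, remaining, start):
        if need == 0:
            return assign(rest, remaining)
        for i in range(start, len(remaining)):
            if i > start and remaining[i] == remaining[i - 1]:
                continue  # skip equal parts to avoid repeated work
            if remaining[i] <= need:
                if fill(need - remaining[i], rest, remaining[:i] + remaining[i + 1:], i):
                    return True
        return False

    return assign(sorted(multiset), sorted(parts))


def is_common_integer_partition(multisets, parts):
    """True iff `parts` is an integer partition of every multiset."""
    return all(is_integer_partition(m, parts) for m in multisets)


S, T = [3, 3, 4], [2, 2, 6]
print(count_partitions(100))
print(is_common_integer_partition([S, T], [2, 2, 3, 3]))
print(is_common_integer_partition([S, T], [1, 1, 2, 2, 4]))
```

The large partition count for n = 100 illustrates why enumerating candidate partitions is hopeless even for moderate inputs.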
Biological Applications(1)

The distance between
two strings
abcdefghijkh
hijkhefgabcd
Minimum Common
Substring Partition
abcdefghijkh
hijkhefgabcd
 Genetic distance between
two genomes
Biological Applications(2)

MCIP is a special case of Minimum Common
Substring Partition(MCSP)
MCSP(S,T)
S=
T=
MCIP(S',T')
S' = {x1, x2, …, xm}
T' = {y1, y2, …, yn}
Our Result

2-MCIP: MCIP on two input multisets
k-MCIP: MCIP on k input multisets

                              2-MCIP      k-MCIP (k>2)
Approximation upper bound     5/4         {3k(k-1)}/(3k-2)
Approximation lower bound     APX-hard    APX-hard

APX-hard: there is a constant c > 1 such that the problem
cannot be approximated within ratio c unless P = NP.
Conclusion and Future Work

The haplotype inference problem




Biological background
Approximation and complexity of MRHC
Efficient algorithms for ZRHC
A linear-time algorithm for loop-free ZRHC
 The tagSNP selection problem
 The minimum common integer partition
problem
References


L. Liu and T. Jiang. Linear-Time Reconstruction of Zero-Recombinant Mendelian
Inheritance on Pedigrees without Mating Loops. In submission.
L. Liu, Y. Wu, S. Lonardi and T. Jiang. Efficient Algorithms for Genome-wide TagSNP
Selection across Populations via Linkage Disequilibrium Criterion. To appear in
Proc. of the 6th Annual International Conference on Computational Systems
Bioinformatics (CSB'2007).
Y. Wu, L. Liu, T. Close and S. Lonardi. Deconvoluting the BAC-gene Relationship
Using a Physical Map. To appear in Proc. of the 6th Annual International Conference on
Computational Systems Bioinformatics (CSB'2007).
J. Xiao, L. Liu, L. Xia and T. Jiang. Fast Elimination of Redundant Linear Equations
and Reconstruction of Recombination-Free Mendelian Inheritance on a Pedigree.
In Proc. of the ACM-SIAM Symposium on Discrete Algorithms (SODA'2007), pp. 655-664.
X. Chen, L. Liu, Z. Liu and T. Jiang. On the Minimum Common Integer Partition
Problem. In Proc. of the 6th Conference on Algorithms and Complexity, Rome, Italy, pp.
236-247.
L. Liu, X. Chen, J. Xiao and T. Jiang. Complexity and Approximation of the Minimum
Recombination Haplotype Configuration Problem. In Proc. of the 16th Annual
International Symposium on Algorithms and Computation (ISAAC'05), pp. 370-379. [Best
paper nominations: 5.35%]. To appear in Theoretical Computer Science.
Thanks for your time
and attention!