Pedigree Reconstruction using Integer Linear Programming

Mark Bartlett
University of York
James Cussens
University of York
Nuala Sheehan
University of Leicester
Pedigree Reconstruction
• Given the genomes for a set of individuals, find a
maximum likelihood pedigree featuring those
individuals
• Must find a maximum likelihood pedigree (not
just a high likelihood one) and must guarantee
that this is a maximum likelihood pedigree
• Use simulated datasets with 15 standard forensic
microsatellite markers
Assumptions
• All individuals are known
– Any unobserved individual can be related to only one observed
individual
• All genetic data is known
– Including population allele frequencies
– No transcription errors
• Pure Mendelian inheritance
– No mutation, no null alleles, no linkage or linkage disequilibrium
• No priors on the structure
Extra-Genetic Data
• Extra-genetic data may be available
• Need to incorporate it when it is available,
and still work sensibly when it is not
• Typically age, sex and sexual maturity
Integer Linear Programming
• Form of constrained optimisation
• Quantity to optimise and the constraints must
all be expressed as linear functions of a set of
variables, some of which are integers
• Can write a problem declaratively and then
use highly optimised off-the-shelf solver to
find solution without knowing how it works
• Can add any additional constraints without
changing the algorithms
Linear Programming
[Figure, built up over several slides: the feasible region bounded by the constraints x>0, y>0, x<y-2, x+y<7 and x<5, with level lines of the objective x+2y = 1, 5, 9 and 13]
Integer Linear Programming
[Figure, repeated over three slides: the same constraints and objective level lines for the integer case]
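The toy problem on these slides is small enough to solve by enumeration. The sketch below is only an illustration: it assumes the objective x+2y is being maximised (the slides show level lines but do not state the direction), reads the inequalities as strict exactly as written, and uses brute force in place of a real ILP solver. The names `feasible` and `solve_toy_ilp` are hypothetical.

```python
from itertools import product

def feasible(x, y):
    """Constraints from the slides, read as strict inequalities."""
    return x > 0 and y > 0 and x < y - 2 and x + y < 7 and x < 5

def solve_toy_ilp(bound=10):
    """Enumerate integer points in [0, bound]^2 and maximise x + 2y (assumed direction)."""
    candidates = [(x, y) for x, y in product(range(bound + 1), repeat=2) if feasible(x, y)]
    return max(candidates, key=lambda p: p[0] + 2 * p[1])

best = solve_toy_ilp()
print(best, best[0] + 2 * best[1])
```

Only two integer points satisfy every constraint here, which is why enumeration is enough; a real instance would be handed to an off-the-shelf solver instead.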
Parent Set Representation
• Can fully specify a pedigree by specifying the
parents of each individual
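• Create binary indicator variables $I(W \to v)$ for
each possible parent set $W$ of each individual $v$
• Let $I(W \to v) = 1$ iff $W$ is the full set of
parents for individual $v$, and 0 otherwise
• $\forall v \in V,\ W \subseteq V,\ |W| \le 2$
[Figure: example pedigree on individuals 1, 2, 3 and 4, for which
$I(\emptyset \to 1) = 1$, $I(\{1,2\} \to 4) = 1$, $I(\{1\} \to 4) = 0$, $I(\{2\} \to 1) = 0$]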
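As a rough illustration of the parent set representation, the sketch below enumerates every candidate parent set of size at most two and evaluates the indicator for a partial pedigree matching the values quoted on the slide. The function names are hypothetical, and excluding an individual from its own parent set is an assumption made here for convenience.

```python
from itertools import combinations

def candidate_parent_sets(individuals, v, max_parents=2):
    """All subsets of the other individuals with at most max_parents members."""
    others = [u for u in individuals if u != v]
    sets = []
    for size in range(max_parents + 1):
        sets.extend(frozenset(c) for c in combinations(others, size))
    return sets

def indicator(pedigree, W, v):
    """I(W -> v): 1 iff W is the full parent set of v in this pedigree."""
    return 1 if pedigree[v] == frozenset(W) else 0

individuals = [1, 2, 3, 4]
variables = [(W, v) for v in individuals for W in candidate_parent_sets(individuals, v)]
print(len(variables))  # 4 individuals * (1 + 3 + 3) parent sets each

# Partial pedigree consistent with the slide: 1 is a founder, 4 has parents 1 and 2.
pedigree = {1: frozenset(), 4: frozenset({1, 2})}
print(indicator(pedigree, {1, 2}, 4), indicator(pedigree, {1}, 4))
```

With n individuals there are 1 + (n-1) + (n-1)(n-2)/2 variables per individual, so the number of indicator variables grows cubically in n.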
Objective Function
• Find the parent sets for each node such that this gives the
pedigree with maximum likelihood
• Find values for each of the binary 𝐼(𝑊 → 𝑣) variables such
that they encode the maximum likelihood pedigree
$L = \prod_{v} \tau(v, Pa(v))$
$\log(L) = \sum_{v} \log(\tau(v, Pa(v)))$
$\log(L) = \sum_{v,W} \log(\tau(v, W)) \, I(W \to v)$
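The point of the last form is that the log-likelihood becomes linear in the indicator variables. A minimal sketch, with invented local likelihoods (the values of `local_scores` below are made up for illustration and do not come from the slides):

```python
import math

def log_likelihood(local_scores, indicators):
    """Sum over (v, W) of log tau(v, W) * I(W -> v); missing indicators count as 0."""
    return sum(math.log(tau) * indicators.get(key, 0) for key, tau in local_scores.items())

# Hypothetical local likelihoods tau(v, W), keyed by (v, W).
local_scores = {
    ("a", frozenset()): 0.2,
    ("b", frozenset()): 0.1,
    ("b", frozenset({"a"})): 0.4,
}
# Pedigree in which a is a founder and b has the single parent a.
indicators = {("a", frozenset()): 1, ("b", frozenset({"a"})): 1}

value = log_likelihood(local_scores, indicators)
print(value, math.log(0.2 * 0.4))
```

Each coefficient log tau(v, W) can be computed once before the solver starts, so the solver only ever sees a linear objective.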
Constraints
A pedigree is valid iff:
• Each known individual is included exactly once
– $\forall v:\ \sum_{W} I(W \to v) = 1$
• No-one is their own ancestor
– The pedigree is acyclic
• The pedigree is sex consistent
Sex Consistency
• Introduce binary variables $I_f(v)$ which are 1 iff $v$
is female and 0 otherwise
• $\forall u, v, w:\ I(\{u,w\} \to v) + I_f(u) + I_f(w) \le 2$
• $\forall u, v, w:\ I(\{u,w\} \to v) - I_f(u) - I_f(w) \le 0$
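The two inequalities together say that a selected pair of parents cannot both be female (the first) and cannot both be male (the second). The sketch below checks them for a fixed assignment; `sex_consistent` and the example names are hypothetical.

```python
def sex_consistent(parents, is_female):
    """Check both sex-consistency inequalities for every selected two-parent set.

    parents maps each individual to a frozenset of parents;
    is_female maps each individual to 0 or 1 (the I_f values).
    """
    for v, W in parents.items():
        if len(W) == 2:
            u, w = tuple(W)
            females = is_female[u] + is_female[w]
            # With I({u,w} -> v) = 1 the constraints read 1 + females <= 2 and 1 - females <= 0.
            if not (1 + females <= 2 and 1 - females <= 0):
                return False
    return True

parents = {"m": frozenset(), "f": frozenset(), "c": frozenset({"m", "f"})}
print(sex_consistent(parents, {"m": 1, "f": 0, "c": 0}))  # one female parent, one male parent
print(sex_consistent(parents, {"m": 1, "f": 1, "c": 0}))  # two female parents
```

When the indicator is 0 both inequalities hold for any sexes, so unselected parent sets never restrict the sex variables.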
Acyclicity Constraints
• Tried several ways to enforce this constraint
– Generation variables, e.g. $I_g(v) = 3$
– Ordering variables, e.g. $I(u < v) = 1$
• Experiments have shown cluster constraints to
perform best
• If a graph is acyclic then any group of nodes (a
cluster) contains at least one node which has
no parents in that cluster
• $\forall C \subseteq V:\ \sum_{v \in C} \sum_{W : W \cap C = \emptyset} I(W \to v) \ge 1$
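To make the cluster constraint concrete, the sketch below searches for a cluster that violates it in a given parent assignment. It enumerates all clusters, which is exponential and only sensible for tiny examples; the function name `violated_cluster` is hypothetical and nothing here is claimed about how the solver itself handles these constraints.

```python
from itertools import combinations

def violated_cluster(parents):
    """Return a cluster C in which every member has a parent inside C, or None.

    parents maps each individual to a frozenset of parents.
    """
    nodes = list(parents)
    for size in range(1, len(nodes) + 1):
        for cluster in combinations(nodes, size):
            C = set(cluster)
            # The constraint needs at least one v in C whose parent set misses C entirely.
            if all(parents[v] & C for v in C):
                return C
    return None

acyclic = {1: frozenset(), 2: frozenset(), 4: frozenset({1, 2})}
cyclic = {1: frozenset({2}), 2: frozenset({1})}
print(violated_cluster(acyclic))
print(violated_cluster(cyclic))
```

A pedigree is acyclic exactly when no such cluster exists, so a returned cluster is a certificate of a cycle.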
Full Formulation
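• Maximise
– $\sum_{v,W} \log(\tau(v, W)) \, I(W \to v)$
• Subject to
– $\forall v:\ \sum_{W} I(W \to v) = 1$
– $\forall C \subseteq V:\ \sum_{v \in C} \sum_{W : W \cap C = \emptyset} I(W \to v) \ge 1$
– $\forall u, v, w:\ I(\{u,w\} \to v) + I_f(u) + I_f(w) \le 2$
– $\forall u, v, w:\ I(\{u,w\} \to v) - I_f(u) - I_f(w) \le 0$
• Where
– $\forall v \in V,\ W \subseteq V,\ |W| \le 2:\ I(W \to v) \in \{0,1\}$
– $\forall v \in V:\ I_f(v) \in \{0,1\}$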
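For very small instances the whole formulation can be checked by exhaustive search, which is a useful sanity test even though it is not how an ILP solver works. The sketch below is a brute-force stand-in under stated assumptions: the local likelihoods in `tau` are invented, all function names are hypothetical, acyclicity is tested by repeatedly removing individuals with no remaining parents, and sex consistency is tested by trying every sex assignment.

```python
import math
from itertools import product

def is_acyclic(parents):
    """Peel off individuals that have no parent among those still remaining."""
    remaining = set(parents)
    while remaining:
        roots = {v for v in remaining if not (parents[v] & remaining)}
        if not roots:
            return False  # every remaining individual has a parent in the set
        remaining -= roots
    return True

def sex_consistent(parents, individuals):
    """True if some sex assignment gives every two-parent set one female and one male."""
    pairs = [tuple(W) for W in parents.values() if len(W) == 2]
    for bits in product((0, 1), repeat=len(individuals)):
        female = dict(zip(individuals, bits))
        if all(female[u] + female[w] == 1 for u, w in pairs):
            return True
    return False

def best_pedigree(individuals, tau):
    """tau maps (v, W) to a positive local likelihood, with W a frozenset."""
    options = {v: [W for (x, W) in tau if x == v] for v in individuals}
    best, best_score = None, -math.inf
    # Choosing exactly one W per v enforces sum_W I(W -> v) = 1.
    for choice in product(*(options[v] for v in individuals)):
        parents = dict(zip(individuals, choice))
        if not is_acyclic(parents) or not sex_consistent(parents, individuals):
            continue
        score = sum(math.log(tau[(v, W)]) for v, W in parents.items())
        if score > best_score:
            best, best_score = parents, score
    return best, best_score

individuals = ["a", "b", "c"]
tau = {  # hypothetical values
    ("a", frozenset()): 0.2, ("a", frozenset({"b"})): 0.4,
    ("b", frozenset()): 0.2, ("b", frozenset({"a"})): 0.5,
    ("c", frozenset()): 0.1, ("c", frozenset({"a", "b"})): 0.6,
}
best, score = best_pedigree(individuals, tau)
print(best, score)
```

In this example the highest-scoring unconstrained choice makes a and b each other's parent, so the acyclicity check is what forces the search to the next best combination.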
Finding k Most Likely Pedigrees
• Technique easily extended to find the k most
likely pedigrees
• Gives us some idea about the uncertainty
inherent in the pedigree
• After finding the kth pedigree, just add an
additional constraint saying that pedigree is
not possible, and restart the solver to find the
(k+1)th most likely pedigree.
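The slides do not give the form of the excluding constraint. One common choice, assumed here, is to require that the sum of the indicator variables selected in the found pedigree is at most |V| - 1, so that at least one individual must change its parent set. The sketch below applies that cut repeatedly over an explicit list of candidate pedigrees; re-solving from a list stands in for restarting the solver, and all names and scores are hypothetical.

```python
def satisfies_cut(found, candidate):
    """Cut built from `found`: sum over v of I(W_v -> v) <= |V| - 1."""
    agreement = sum(1 for v, W in found.items() if candidate[v] == W)
    return agreement <= len(found) - 1

def k_best(candidates, score, k):
    """Repeatedly take the best candidate that satisfies every cut added so far."""
    cuts, results = [], []
    for _ in range(k):
        allowed = [c for c in candidates if all(satisfies_cut(f, c) for f in cuts)]
        if not allowed:
            break
        best = max(allowed, key=score)
        results.append(best)
        cuts.append(best)  # exclude this pedigree from all later rounds
    return results

# Hypothetical log-scores for each (individual, parent set) choice.
local = {
    ("a", frozenset()): -1.0, ("a", frozenset({"b"})): -0.5,
    ("b", frozenset()): -1.0, ("b", frozenset({"a"})): -0.2,
}
score = lambda p: sum(local[(v, W)] for v, W in p.items())
candidates = [
    {"a": frozenset(), "b": frozenset({"a"})},
    {"a": frozenset({"b"}), "b": frozenset()},
    {"a": frozenset(), "b": frozenset()},
]
print(k_best(candidates, score, 2))
```

Each cut removes exactly one pedigree, so the pedigrees come out in non-increasing order of score.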
Adding Extra-Pedigree Data
• If we have ages or sexual maturity data, fix
$I(W \to v)$ variables that are inconsistent with
it to 0
• If we have sex information, fix $I_f(v)$ variables
to be consistent with it, and fix any
$I(\{u,w\} \to v)$ for which $u$ and $w$ are the same
sex to 0
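Fixing variables to 0 amounts to filtering the candidate parent sets before the problem is built. The sketch below assumes one particular age rule, that every parent must be strictly older than the child; the slides do not say which rule was used. The function name `is_allowed` and the example data are hypothetical.

```python
def is_allowed(W, v, age, sex):
    """Return False when I(W -> v) should be fixed to 0.

    age maps individuals to ages; sex maps individuals to "F" or "M" when known.
    """
    # Age rule (an assumption): every parent must be strictly older than the child.
    if any(age[u] <= age[v] for u in W):
        return False
    # Sex rule from the slide: two parents of the same known sex are impossible.
    if len(W) == 2:
        known = [sex.get(u) for u in W]
        if None not in known and known[0] == known[1]:
            return False
    return True

age = {"g": 70, "m": 40, "f": 42, "c": 10}
sex = {"g": "F", "m": "F", "f": "M"}  # sex of c is unknown
print(is_allowed(frozenset({"m", "f"}), "c", age, sex))
print(is_allowed(frozenset({"m", "g"}), "c", age, sex))
print(is_allowed(frozenset({"c"}), "m", age, sex))
```

Variables removed this way never reach the solver, so extra-genetic data shrinks the problem rather than adding constraints to it.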
Adding Priors on Pedigree Structure
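$p(P \mid G) \propto p(G \mid P) \times p(P)$
$\log p(P \mid G) = \log(p(G \mid P) \times p(P)) + \text{const}$
$\log p(P \mid G) = \log p(G \mid P) + \log p(P) + \text{const}$
$\log p(P \mid G) = \sum_{v,W} \log(\tau(v, W)) \, I(W \to v) + \log p(P) + \text{const}$
The objective function is easily extended to
include a prior whenever $\log p(P)$ can be written as a
linear expression in some variables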
Adding Hard Priors
• If priors assign probability of 0 to some
pedigree, we can add this prior as an
additional constraint
• For example, to prevent any pedigree with a
sibling age gap of more than k years being
considered, we can add the constraint
$\forall u, v, W:\ I(W \to v)\,a(v) - a(u) - (1 - I(W \to u))\,a(v) \le k$
where $a(v)$ is the age of individual $v$
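Reading the constraint as I(W→v)·a(v) − a(u) − (1 − I(W→u))·a(v) ≤ k, its left-hand side reduces to the age gap a(v) − a(u) only when both individuals are assigned the same parent set W, and is non-positive otherwise provided ages are non-negative. The sketch below tabulates the four indicator combinations for made-up ages; the function name is hypothetical.

```python
def age_gap_lhs(i_v, i_u, a_v, a_u):
    """Left-hand side of the sibling age-gap constraint for one (u, v, W) triple."""
    return i_v * a_v - a_u - (1 - i_u) * a_v

k = 5                # maximum allowed sibling age gap (example value)
a_v, a_u = 30, 20    # hypothetical ages
for i_v in (0, 1):
    for i_u in (0, 1):
        lhs = age_gap_lhs(i_v, i_u, a_v, a_u)
        print(i_v, i_u, lhs, lhs <= k)
```

Only the row with both indicators equal to 1 violates the bound, which is the case where u and v are siblings with an age gap of 10 years.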
Evaluation
• How well does it scale? (How fast is it?)
– In absolute terms
– Compared to other approaches
• How accurate is it?
– Guaranteed to find the most likely pedigree, but is
this the real one?
– How similar is the found pedigree to the real one?
Eskimo Pedigree
• Produce synthetic data based on an isolated
Eskimo pedigree
• 1614 individuals over 7 generations, with 225
founders
• Complex multi-generational pedigree with lots
of marriage chains and inbreeding
[Figure: cumulative frequency (0% to 100%) against time (min), 0 to 50]
Evaluating Against FRANz
Measure                            Real     GOBNILP   FRANz
Number of Founders                 225      171.0     159.1
True Founders                      225      171.0     158.3
Number of Components               11       7.1       7.4
Size of Largest Component          1581     1599.5    1597.5
No Marriages                       563      554.4     554.8
One Marriage                       129      159.7     157.2
Two Marriages                      21       35.7      35.4
Three Marriages                    8        10.6      10.3
Over Three Marriages               0        0.7       0.8
Children of 141                    10       9.7       9.3
Marriages of 49                    4        3.9       3.8
Descendants of 49                  452      1005.9    962.0
Sibling Pairs in Marriage Chain    3        2.8       2.7
Individuals in Marriage Chain      23       22.6      22.5
Depth                              7        47.8      50.8
Number of Edges                    2778     2826.4    2815.2
Recall (%)                         100      96.8      95.4
Precision (%)                      100      95.2      94.1
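The recall and precision figures in this comparison can be illustrated with a small calculation. The sketch below assumes they are computed over parent-to-child edges, which the slides do not state explicitly, and also returns the F1 score. The function names and the two tiny pedigrees are hypothetical.

```python
def edge_set(parents):
    """Parent -> child edges of a pedigree given as {child: set of parents}."""
    return {(u, v) for v, W in parents.items() for u in W}

def precision_recall_f1(true_parents, found_parents):
    true_edges, found_edges = edge_set(true_parents), edge_set(found_parents)
    hits = len(true_edges & found_edges)
    precision = hits / len(found_edges) if found_edges else 1.0
    recall = hits / len(true_edges) if true_edges else 1.0
    f1 = 2 * precision * recall / (precision + recall) if hits else 0.0
    return precision, recall, f1

true_parents = {"c": {"a", "b"}, "d": {"a"}}
found_parents = {"c": {"a", "b"}, "d": {"b"}}
print(precision_recall_f1(true_parents, found_parents))
```

Here two of the three reconstructed edges are correct and two of the three true edges are recovered, so all three measures equal 2/3.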
Adding A Hard Prior
• When collected, the pedigree was constructed
so that no individual had a single parent
• This information can be used as a hard prior to
limit the search
$\forall u, v:\ I(\{u\} \to v) = 0$
[Figure: time to find pedigree (s), 0 to 50, against kth pedigree found, 0 to 100; series: Median, Mean]
Uncertainty
• Can consider the relative likelihood of the k
best pedigrees
• This suggests how confident we can be in the
most likely pedigree
– The more peaked the likelihood is, the more likely the
pedigree is to be similar to the true one
• Can also use this to perform ‘model
averaging’, but this takes no account of the long
tail
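A minimal sketch of how the k best log-likelihoods might be turned into relative likelihoods and normalised weights for this kind of averaging. The normalisation runs over the k pedigrees found only, so it ignores the long tail exactly as noted above; the function names and example values are hypothetical.

```python
import math

def relative_likelihoods(log_likelihoods):
    """Each likelihood as a fraction of the best one, computed stably in log space."""
    best = max(log_likelihoods)
    return [math.exp(value - best) for value in log_likelihoods]

def normalised_weights(log_likelihoods):
    """Weights summing to 1 over the pedigrees found (the tail is not included)."""
    rel = relative_likelihoods(log_likelihoods)
    total = sum(rel)
    return [r / total for r in rel]

logs = [math.log(0.5), math.log(0.25), math.log(0.25)]  # made-up likelihoods
print(relative_likelihoods(logs))
print(normalised_weights(logs))
```

Subtracting the best log-likelihood before exponentiating avoids underflow, since real pedigree likelihoods are typically far too small to exponentiate directly.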
Likelihood Landscape
[Figure: likelihood as a percentage of the most likely pedigree (0% to 100%) against Nth pedigree found (0 to 1000); series: Shrimp (GOBNILP), NIST (GOBNILP), Simulated (GOBNILP)]
Cumulative Likelihood Found
[Figure: cumulative likelihood accounted for (0% to 100%) against Nth pedigree found (0 to 1000); series: Shrimp (GOBNILP), NIST (GOBNILP), Simulated (GOBNILP)]
'Probability' of True Network
[Figure: 'probability' (0 to 1) against Nth pedigree found (1 to 1000, ticks at 1, 10, 100, 1000); series: Shrimp (GOBNILP), NIST (GOBNILP), Simulated (GOBNILP)]
Comparison Between Real and Highest Scoring Parent Sets
[Figure: F1 score (0.8 to 1) against Nth pedigree found (0 to 1000); series: Shrimp (GOBNILP), NIST (GOBNILP), Simulated (GOBNILP)]