minimum common tagSNP selection

Efficient Algorithms for Genome-wide
TagSNP Selection across Populations via
the Linkage Disequilibrium Criterion
Authors: Lan Liu, Yonghui Wu,
Stefano Lonardi and Tao Jiang
Outline




Introduction
The MCTS Model
Our Algorithms
Experimental Result
Motivation


With the rapid development of genotyping technologies,
more than 10 million verified single-nucleotide
polymorphisms (SNPs) are now in the dbSNP database.
We aim to select a subset of informative SNPs (i.e.,
tagSNPs) in order to


Avoid the cost of genotyping all SNPs.
Perform disease association mapping.
TagSNP Selection

Haplotype-based methods


Require phased multilocus haplotype information
Haplotype-free methods


Do not require haplotype information
TagSNP selection via r2 linkage disequilibrium statistics
r2 Linkage Disequilibrium Statistics



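Given a pair of genetic markers, marker 1 (alleles A, a) and
marker 2 (alleles B, b), with joint and marginal frequencies:

                    marker 1
                    A       a       total
marker 2    B       pAB     paB     p.B
            b       pAb     pab     p.b
            total   pA.     pa.

The r2 statistic is

$$ r^2 = \frac{(p_{AB} - p_{A\cdot}\, p_{\cdot B})^2}{p_{A\cdot}(1 - p_{A\cdot})\; p_{\cdot B}(1 - p_{\cdot B})} $$
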
If r2 is no less than a given threshold r0, marker 1
(or marker 2) can tag marker 2 (or marker 1,
respectively).
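
As a minimal sketch of the definition above, the statistic can be computed directly from the joint frequency pAB and the two marginals pA. and p.B. The function names `r_squared` and `can_tag` are chosen here for illustration and do not come from the slides.

```python
def r_squared(p_ab: float, p_a: float, p_b: float) -> float:
    """r^2 from the joint frequency pAB and the marginals pA. and p.B."""
    denom = p_a * (1.0 - p_a) * p_b * (1.0 - p_b)
    if denom == 0.0:
        # A monomorphic marker has no variation, so r^2 is undefined.
        raise ValueError("r^2 is undefined when a marker is monomorphic")
    d = p_ab - p_a * p_b  # deviation from independence
    return d * d / denom


def can_tag(p_ab: float, p_a: float, p_b: float, r0: float) -> bool:
    """True if the two markers can tag each other at threshold r0."""
    return r_squared(p_ab, p_a, p_b) >= r0
```

Because r2 is symmetric in the two markers, a single call decides tagging in both directions.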
The TagSNP Selection Problem

Instance: a set V of SNP markers and LD patterns
E={(vj1,vj2)| r2(vj1,vj2) is no less than a given threshold r0, vj1 and vj2 are in V}.
Feasible solution: a subset V' of V such that for every v in V there exists a
v' in V' with r2(v,v') no less than r0.
Objective: minimize |V'|.
[Figure: (a) SNP markers 1-6 and their LD patterns in a population; (b) tagSNPs for the population.]
If we define G=(V, E), a
tagSNP set is equivalent
to a dominating set on G.
This model was introduced by Carlson et al. (2004). It is
a simple and popular tagging method.
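
The dominating-set view can be sketched as follows. This is an exhaustive search meant only for tiny inputs, not the method used in the talk. The marker numbers and LD edges in the example are hypothetical, since the original figure is not recoverable.

```python
from itertools import combinations


def is_tag_set(markers, ld_edges, tags):
    """True if every marker is in `tags` or is in LD with some tag."""
    neighbors = {v: {v} for v in markers}  # a marker always tags itself
    for u, v in ld_edges:
        neighbors[u].add(v)
        neighbors[v].add(u)
    tags = set(tags)
    return all(neighbors[v] & tags for v in markers)


def minimum_tag_set(markers, ld_edges):
    """Smallest tag set by exhaustive search (exponential time)."""
    for size in range(len(markers) + 1):
        for candidate in combinations(markers, size):
            if is_tag_set(markers, ld_edges, candidate):
                return set(candidate)


# Hypothetical instance: edges join marker pairs with r2 >= r0.
markers = [1, 2, 3, 4, 5, 6]
ld_edges = [(1, 2), (2, 3), (2, 6), (4, 5)]
best = minimum_tag_set(markers, ld_edges)
```

Minimum dominating set is NP-hard in general, which is why the greedy and Lagrangian relaxation heuristics are of interest at genome scale.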
r2 Statistics in Single and Admixed Populations
SNP 1: A, a
SNP 2: B, b

Population 1 (r2 = 0)
          A        a        total
B         0.9025   0.0475   0.95
b         0.0475   0.0025   0.05
total     0.95     0.05

Population 2 (r2 = 0)
          A        a        total
B         0.0025   0.0475   0.05
b         0.0475   0.9025   0.95
total     0.05     0.95

Admixed population: 50% population 1, 50% population 2 (r2 = 0.6561)
          A        a        total
B         0.4525   0.0475   0.5
b         0.0475   0.4525   0.5
total     0.5      0.5
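
As a check on the admixed value, the r2 formula can be evaluated on the frequencies given above. Each admixed entry is the average of the corresponding entries of the two populations, and within population 1 the joint frequency equals the product of the marginals (0.9025 = 0.95 × 0.95), which is why r2 = 0 there.

```latex
\begin{align*}
p_{AB}^{\text{admixed}} &= \tfrac{1}{2}(0.9025) + \tfrac{1}{2}(0.0025) = 0.4525, \\
r^2_{\text{admixed}} &= \frac{(0.4525 - 0.5 \cdot 0.5)^2}{0.5\,(1 - 0.5) \cdot 0.5\,(1 - 0.5)}
  = \frac{0.2025^2}{0.0625}
  = \frac{0.04100625}{0.0625}
  = 0.6561 .
\end{align*}
```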
TagSNP Selection across Populations

A pair of SNPs

may have remarkably different allele frequencies and very weak LD
in two populations with different evolutionary histories,
yet show strong LD in the admixed population.
TagSNPs picked from the admixed population, or from only one of the
populations, might not be sufficient to capture the variations
in all populations.
The MCTS Model


Given a set of SNP markers and their LD patterns in
multiple populations, we want to find a minimum set of
tagSNPs that is a valid tagSNP set in each of the populations.
This problem is called the minimum
common tagSNP selection (MCTS) problem.
[Figure: (a) SNP markers 1-6 and their LD patterns in two populations (Population 1, Population 2); (b) the minimum tagSNP set for these two populations.]
Our Algorithms


The MCTS problem can be easily formulated as an
integer linear program.
We first apply some data reduction rules, then use one of
the following algorithms:

A greedy algorithm: GreedyTag
A Lagrangian relaxation algorithm: LRTag
We calculate

the upper bound: the number of tagSNPs obtained by our
algorithms
the lower bound: a lower bound on the minimum number of tagSNPs needed

GreedyTag_lb
LRTag_lb
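
One natural way to write the integer linear program is the set-cover-style formulation that follows from the dominating-set view. The notation is introduced here for illustration and is not taken from the slides: K is the number of populations, V_i is the set of markers occurring in population i, x_v = 1 if and only if marker v is selected, and N_i(v) is the set of markers whose r2 with v in population i is at least r0, including v itself.

```latex
\begin{align*}
\text{minimize}\quad & \sum_{v \in V} x_v \\
\text{subject to}\quad & \sum_{u \in N_i(v)} x_u \ge 1
  && \text{for each population } i = 1, \dots, K \text{ and each } v \in V_i, \\
& x_v \in \{0, 1\} && \text{for each } v \in V .
\end{align*}
```

Each covering constraint corresponds to one occurrence of a marker in one population. Any feasible solution gives an upper bound on the optimum, and any relaxation of the program gives a lower bound.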
Data Reduction Rules
Pick all irreplaceable markers

Example: marker 7

Remove less informative markers

Example: among markers 1, 2 and 6,
remove markers 1 and 2.

Remove less stringent occurrences

Example: between the occurrences of
markers 4 and 5 in population 2, remove
the occurrence of marker 4.

[Figure: LD patterns of markers 1-7 in Population 1 and Population 2, illustrating the three rules.]
A Greedy Algorithm
Apply data reduction rules.
While there is an un-tagged occurrence:
pick the marker which tags the most of the remaining
occurrences as a tagSNP.
When no un-tagged occurrence remains:
output the tagSNPs.
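
The greedy loop above can be sketched as follows, omitting the data reduction step. This is an illustration of the idea, not the authors' GreedyTag implementation. The input layout (a list of per-population marker lists and LD edge lists), the tie-breaking by smallest marker id, and the example data are all assumptions made here.

```python
def greedy_tag(populations):
    """populations: list of (markers, ld_edges) pairs, one per population.

    An occurrence is a pair (population index, marker).
    """
    covers = {}  # marker -> set of occurrences it tags
    for i, (markers, ld_edges) in enumerate(populations):
        for v in markers:
            covers.setdefault(v, set()).add((i, v))  # v tags itself
        for u, v in ld_edges:  # both endpoints must be markers of population i
            covers[u].add((i, v))
            covers[v].add((i, u))

    untagged = set().union(*covers.values())  # every occurrence
    tags = []
    while untagged:
        # Pick the marker tagging the most remaining occurrences;
        # ties go to the smallest marker id (an arbitrary choice).
        best = max(sorted(covers), key=lambda m: len(covers[m] & untagged))
        tags.append(best)
        untagged -= covers[best]
    return tags


# Hypothetical two-population instance on markers 1-4.
populations = [
    ([1, 2, 3, 4], [(1, 2), (2, 3), (3, 4)]),  # population 1
    ([1, 2, 3, 4], [(1, 3), (3, 4)]),          # population 2
]
tags = greedy_tag(populations)
```

In this instance marker 2 has no LD partner in the second population, so it has to be selected, and marker 3 tags the remaining occurrences.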
A Lagrangian Relaxation Algorithm
Introduce the Lagrangian multipliers λ.
Obtain the relaxed integer program.
Initialize λ; iteration := 0.
Obtain the tagSNP set based on λ.
While iteration < max_iter:
Update λ towards the subgradient direction.
Update the tagSNP set based on λ.
iteration++
When iteration reaches max_iter:
Output the tagSNPs.
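
The general scheme can be illustrated with a textbook subgradient method for the covering formulation. This is a sketch under stated assumptions, not the authors' LRTag. The `covers` input (each marker mapped to the occurrences it tags), the step size `step0 / (it + 1)`, the repair and pruning heuristics, and the example data are all choices made here for illustration.

```python
def lr_tag(covers, max_iter=100, step0=1.0):
    """covers: marker -> set of occurrences it tags.

    Returns (tags, lower_bound) for the covering problem.
    """
    rows = set().union(*covers.values())   # all occurrences
    lam = {r: 0.0 for r in rows}           # one multiplier per occurrence
    best_tags, best_lb = set(covers), 0.0  # all markers is always feasible
    for it in range(max_iter):
        # Reduced cost of each marker under the current multipliers.
        cost = {v: 1.0 - sum(lam[r] for r in covers[v]) for v in covers}
        chosen = {v for v in covers if cost[v] < 0.0}
        # Optimal value of the relaxed problem: a valid lower bound.
        lb = sum(lam.values()) + sum(cost[v] for v in chosen)
        best_lb = max(best_lb, lb)

        # Repair `chosen` into a feasible tag set, cheapest markers first.
        tags = set(chosen)
        covered = set().union(*(covers[v] for v in tags))
        for v in sorted(covers, key=lambda m: (cost[m], m)):
            if not covers[v] <= covered:
                tags.add(v)
                covered |= covers[v]
        # Drop tags that became redundant, most expensive first.
        for v in sorted(tags, key=lambda m: (-cost[m], m)):
            others = tags - {v}
            if rows <= set().union(*(covers[u] for u in others)):
                tags = others
        if len(tags) < len(best_tags):
            best_tags = tags

        # Subgradient step: raise multipliers of uncovered occurrences,
        # lower those of over-covered ones, and keep them non-negative.
        step = step0 / (it + 1)
        for r in rows:
            hits = sum(1 for v in chosen if r in covers[v])
            lam[r] = max(0.0, lam[r] + step * (1 - hits))
    return best_tags, best_lb


# Hypothetical instance; occurrences are (population index, marker) pairs.
covers = {
    1: {(0, 1), (0, 2), (1, 1), (1, 3)},
    2: {(0, 1), (0, 2), (0, 3), (1, 2)},
    3: {(0, 2), (0, 3), (0, 4), (1, 1), (1, 3), (1, 4)},
    4: {(0, 3), (0, 4), (1, 3), (1, 4)},
}
tags, lower_bound = lr_tag(covers)
```

The size of the returned tag set is an upper bound on the optimum and the best relaxed value is a lower bound, so the difference between the two measures how far from optimal the solution can be.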
Experimental Result


We apply our algorithms to real HapMap data (release #19,
NCBI build 34, October 2005).
There are four populations in the HapMap data.

CEU: European descendants.
CHB: Chinese, Beijing.
JPT: Japanese, Tokyo.
YRI: Yoruba people of Ibadan, Nigeria.
We select tagSNPs for the following two datasets:

ENCODE regions

all 10 ENCODE regions
10,859 markers

Human genome

chromosomes 1 – 22
2,862,454 markers
Experimental Result for ENCODE Regions
We compare our GreedyTag and LRTag with MultiPop-TagSelect (MPS).

MultiPop-TagSelect first generates the tagSNPs for each single population,
then combines the obtained tagSNPs for multiple populations.
The gap between LRTag_lb and LRTag:
r2 = 0.5: at most two tagSNPs for each region, six in total for all regions.
r2 = 0.8: there is no gap.
Experimental Result for Human Genome
The gap between LRTag_lb and LRTag for the whole genome
(2,862,454 SNPs in total):
r2 = 0.5: 1,061
r2 = 0.8: 142
The numbers of tagSNPs selected by
our algorithms are almost optimal.
Running Time of Our Algorithms

Running environment





a 32-processor SGI Altix 4700 supercomputer system
1.6 GHz CPUs
64 GB shared memory
15 threads in parallel
Running time

For r2 = 0.5:

ENCODE regions: < 7 seconds for each region, < 1 minute for all regions.
Human genome: < 12 minutes for each chromosome, < 1 hour for the
genome.
For r2 > 0.5, our algorithms run faster than the times above.
Thanks for your time
and attention!