Modeling the Noise of Solexa Sequences

Compressed Genotyping
Yaniv Erlich
Hannon Lab
Cold Spring Harbor Laboratory
3/23/09
Sequencing shRNA libraries with DNA Sudoku
[email protected]
Poster in a nutshell
• Genotyping is the process of determining the genetic
variation for a certain trait in an individual.
• It is one of the main diagnostic tools in medical genetics
- Finding carriers for rare genetic diseases such as Cystic Fibrosis
- Tissue matching in organ donation
- Forensic DNA analysis
• Until now, only serial genotyping has been possible. This is
expensive and tedious.
• Taking advantage of the ‘signal sparsity’, we developed
and tested a compressed genotyping framework.
Abstract
Significant volumes of knowledge have been accumulated in recent
years linking subtle genetic variations to a wide variety of medical
disorders from cystic fibrosis to mental retardation. Nevertheless, there
are still great challenges in applying this knowledge routinely in the
clinic, largely due to the relatively tedious and expensive process of
DNA sequencing. Since the genetic polymorphisms that underlie these
disorders are relatively rare in the human population, the presence or
absence of a disease-linked polymorphism can be thought of as a
sparse signal. Using methods and ideas from compressed sensing
and group testing, we have developed a cost-effective reconstruction
protocol, called "DNA Sudoku", to retrieve useful data. In particular, we
have adapted our scheme to a recently developed class of high
throughput DNA sequencing technologies, and assembled a
mathematical framework that has some important distinctions from
'traditional' compressed sensing ideas in order to address different
biological and technical constraints.
The genotyping problem
Input: Thousands of specimens
Output: Genotype of each specimen
Genotype
Genotyping as a sparse graph reconstruction
[Figure: bipartite graph with sample nodes and allele nodes]
Genotyping is equivalent to revealing the edges of the
bipartite graph.
An example of a carrier screen for Cystic
Fibrosis. There are two allele nodes: the
Wild Type (WT) and the Cystic Fibrosis
mutation. Specimens 1, 2, 3, and 5 are
WT, while specimen 4 is a carrier. The
specimen labeled with ’X’ is affected and
does not enter the screen. Genotyping
is equivalent to finding the edges in the
graph.
THE GRAPH IS SPARSE
1. Number of carriers is very
low
2. No affected individuals
3. The degree of every
sample node is always two
(human genome is diploid)
The main idea – pooled processing
One could reveal the graph edges by sequencing the DNA of
each sample
- expensive, tedious, and slow
Better:
Pool the samples and then sequence the pools
Mathematically speaking
[Figure: the pooling design multiplied by the biadjacency matrix gives the pooled results]
The pooling design (rows: pools, columns: specimens)
A binary matrix (‘1’ – in the pool, ‘0’ – otherwise)
The biadjacency matrix of the graph (rows: specimens, columns: alleles)
What the observer wants
The pooled results (rows: pools, columns: alleles)
What the observer sees
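As a rough illustration of how the pooling design, the biadjacency matrix, and what the observer sees relate to each other, the sketch below multiplies a small pooling matrix by the biadjacency matrix of the Cystic Fibrosis carrier-screen example (specimens 1, 2, 3, and 5 are WT; specimen 4 is a carrier). The pooling matrix is hypothetical and chosen only for illustration, and the product assumes ideal, noise-free allele counts, which real sequencing data would not provide.

```python
import numpy as np

# Biadjacency matrix of the carrier-screen example.
# Rows: specimens 1-5. Columns: alleles (WT, CF mutation).
# Every row sums to 2 because the human genome is diploid.
genotypes = np.array([
    [2, 0],
    [2, 0],
    [2, 0],
    [1, 1],  # specimen 4 is a carrier
    [2, 0],
])

# Hypothetical pooling design (an assumption, not the poster's matrix).
# Rows: pools. Columns: specimens. 1 = specimen is in the pool.
pooling = np.array([
    [1, 1, 0, 1, 0],
    [0, 1, 1, 0, 1],
    [1, 0, 1, 1, 0],
])

# What the observer sees: per-pool allele counts (pools x alleles).
observed = pooling @ genotypes
print(observed)
# [[5 1]
#  [6 0]
#  [5 1]]
```

The observer only has `observed` and `pooling`; recovering `genotypes` from them is the decoding problem.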
What is a good pooling design
Attribute                 Why
Trivial compressed sensing demands:
  Decodability
  Small number of pools   Fewer genotyping assays
Biologically oriented requirements:
  Constant column weight  The robot can pull several specimens every step
  Low column weight       Less robotics effort
  Low row weight          Reduces the chance of biological noise
We need a light-weight d-disjunct matrix
for
Light Chinese Design
Inputs: N (number of specimens)
        W (column weight; robotics effort)
Algorithm:
1. Find W numbers {x1, x2, …, xW} that are:
   (a) Bigger than √N
   (b) Pairwise coprime
2. Generate W modular equations:
   Specimen ≡ Pool (mod x1)
   ⋮
   Specimen ≡ Pool (mod xW)
3. Construct the pooling matrix from the modular equations
Output: Pooling matrix
The algorithm reaches the bound derived by Kautz &
Singleton (1964)
Example of a pooling matrix
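The steps of the Light Chinese Design can be sketched in a few lines of Python. This is a minimal sketch under stated assumptions, not the authors' implementation: the function names are hypothetical, and the greedy search for the moduli (take the smallest integers above √N that are pairwise coprime) is only one possible choice, so the total number of pools it yields need not match the figures reported on this poster.

```python
from math import gcd, isqrt

import numpy as np


def choose_windows(n_specimens, weight):
    """Greedily pick `weight` pairwise-coprime integers, each above sqrt(N).

    The greedy search is an assumption made for this sketch.
    """
    windows = []
    candidate = isqrt(n_specimens) + 1  # smallest integer strictly above sqrt(N)
    while len(windows) < weight:
        if all(gcd(candidate, x) == 1 for x in windows):
            windows.append(candidate)
        candidate += 1
    return windows


def light_chinese_design(n_specimens, weight):
    """Build a binary pooling matrix (pools x specimens) from modular equations."""
    windows = choose_windows(n_specimens, weight)
    design = np.zeros((sum(windows), n_specimens), dtype=np.uint8)
    offset = 0
    for x in windows:
        for specimen in range(n_specimens):
            # Within this window, the specimen goes to pool (specimen mod x).
            design[offset + specimen % x, specimen] = 1
        offset += x
    return design, windows


design, windows = light_chinese_design(n_specimens=100, weight=3)
print(windows)        # [11, 12, 13]
print(design.shape)   # (36, 100)
```

Each column has weight W by construction. Because the moduli are pairwise coprime and each exceeds √N, the product of any two exceeds N, so two different specimens can share at most one pool.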
Decoding the genotyping results by Belief Propagation
[Figure: graph connecting specimens to pools; labels: a-priori biological information, genotyping results]
The pooled results can be decoded using Belief Propagation
Example of Belief Propagation
[Figure: graph linking specimens #1–#7 to pools. Each specimen node lists its possible genotypes (A, B, C, D); an edge means the specimen is in a pool.]
Messages shown in the figure:
1. "You can be either A, C, or D"
2. "I can’t be B"
3. "Specimen #3, #6 and #7: One of you guys should be B"
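The elimination flavour of the messages in this example ("I can't be B", "one of you should be B") can be illustrated with a much simpler decoder than Belief Propagation. The sketch below is not Belief Propagation and not the poster's decoder. It is the basic elimination rule from group testing, restricted to a single mutant allele and noise-free pool results, and all names in it are hypothetical. The small design uses the moduli 4 and 5 for 12 specimens.

```python
import numpy as np


def eliminate(pooling, positive):
    """Simple elimination decoding (a stand-in for Belief Propagation).

    pooling:  binary matrix, pools x specimens.
    positive: one boolean per pool, True if the mutant allele was detected.
    Returns (candidates, definite) as boolean vectors over specimens.
    """
    pooling = np.asarray(pooling, dtype=bool)
    positive = np.asarray(positive, dtype=bool)
    # A specimen that sits in any negative pool cannot carry the mutation.
    cleared = pooling[~positive].any(axis=0)
    candidates = ~cleared
    # A positive pool with exactly one remaining candidate pins down a carrier.
    definite = np.zeros(pooling.shape[1], dtype=bool)
    for pool in pooling[positive]:
        members = pool & candidates
        if members.sum() == 1:
            definite |= members
    return candidates, definite


# Toy design: 12 specimens, moduli 4 and 5 (both above sqrt(12), coprime).
n = 12
pooling = np.vstack(
    [(np.arange(n) % x) == np.arange(x)[:, None] for x in (4, 5)]
)

truth = np.zeros(n, dtype=bool)
truth[7] = True  # specimen index 7 is the only carrier in this toy example

# Noise-free pool results: a pool is positive if it contains a carrier.
positive = (pooling & truth).any(axis=1)

candidates, definite = eliminate(pooling, positive)
print(np.flatnonzero(candidates))  # [7]
print(np.flatnonzero(definite))    # [7]
```

Belief Propagation goes further than this rule: it passes probabilistic messages between specimens and pools, so it can use a-priori biological information and tolerate noisy results.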
Simulation results
1000 specimens
W=5
Total pools = 180
[Plot: x-axis is the number of carriers]
Real results – biotechnology application
40,000 specimens
W=5
Total pools = 1900
Work in progress
References & Acknowledgments
• Compressed Genotyping. Yaniv Erlich, Assaf Gordon, Michael Brand,
Gregory J. Hannon & Partha P. Mitra. Submitted to IEEE Trans. Info. Theory.
2009.
• DNA Sudoku - harnessing high-throughput sequencing for
multiplexed specimen analysis. Yaniv Erlich, Kenneth Chang, Assaf
Gordon, Roy Ronen, Oron Navon, Michelle Rooks & Gregory J. Hannon.
Genome Research. 2009.
Lindsay-Goldberg Fellowship