Shotgun assembly of labeled graphs

Nathan Ross (University of Melbourne)
Joint work with: Elchanan Mossel (U of Penn, UC Berkeley, MIT)
DNA Shotgun Sequencing
Figure: From “Whole genome shotgun sequencing versus Hierarchical
shotgun sequencing” by Commins, Toft, and Fares (2009).
Theoretical Setup Q1: Deterministic
- Sequence of letters (A, C, G, T or other) of length N.
- All “reads” of length r are given.
Example: N = 14, r = 3:
ATGGGCACTGAGCC
Reads:
{ATG, TGG, GGG, GGC, GCA, CAC,
ACT, CTG, TGA, GAG, AGC, GCC}
Combinatorial Question:
When does this multi-set uniquely determine the sequence?
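As a minimal sketch of the setup above, the multiset of reads can be computed with a counter. The helper name `reads` is hypothetical and not from the talk.

```python
from collections import Counter

def reads(sequence, r):
    """Return the multiset of all length-r reads (substrings) of sequence."""
    return Counter(sequence[i:i + r] for i in range(len(sequence) - r + 1))

# The example from the slide: N = 14, r = 3 gives N - r + 1 = 12 reads.
example = reads("ATGGGCACTGAGCC", 3)
print(sum(example.values()))
print(sorted(example))
```

A Counter is used rather than a set because a read may occur more than once in general, and the question is about the multiset.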
Theoretical Setup Q1: Deterministic
Ans (Ukkonen-Pevzner):
Identifiability is possible if and only if none of the following
blocking patterns appear:
- Rotation:
  xαyβx ⇐⇒ yβxαy
- Triple repeat:
  · · · xαxβx · · · ⇐⇒ · · · xβxαx · · ·
- Interleaved repeat:
  · · · xαy · · · xβy · · · ⇐⇒ · · · xβy · · · xαy · · ·
[x, y are (r − 1)-tuples and α, β are non-equal strings]
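The interleaved repeat can be checked directly on the running example. The sketch below is illustrative only: the helper names `reads` and `swap_interleaved` and the chosen decomposition (x = TG, y = GC, α = G, β = A) are assumptions made for this example.

```python
from collections import Counter

def reads(sequence, r):
    """Multiset of all length-r reads of sequence."""
    return Counter(sequence[i:i + r] for i in range(len(sequence) - r + 1))

def swap_interleaved(prefix, x, alpha, y, middle, beta, suffix):
    """Build ..x alpha y..x beta y.. and its swap ..x beta y..x alpha y.. ."""
    first = prefix + x + alpha + y + middle + x + beta + y + suffix
    second = prefix + x + beta + y + middle + x + alpha + y + suffix
    return first, second

# r = 3, so x and y are 2-tuples.
first, second = swap_interleaved("A", "TG", "G", "GC", "AC", "A", "C")
print(first, second, reads(first, 3) == reads(second, 3))
```

The two sequences differ, yet their read multisets agree, so the reads of length 3 cannot distinguish them.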
Theoretical Setup Q1: Deterministic
Proof is based on creating a de Bruijn graph:
Figure: de Bruijn graph of the reads of ATGGGCACTGAGCC (vertices are
2-tuples, edges are reads). From “DNA Physical Mapping and
Alternating Eulerian Cycles in Colored Graphs” by Pevzner (1996).
Identifiability is possible if and only if the de Bruijn graph has a
unique Eulerian path (a path, though not necessarily a circuit).
Example: the de Bruijn graph of the reads of ATGGGCACTGAGCC has two
Eulerian paths, so the reads do not determine the sequence:
ATG G GC AC TG A GC C
ATG A GC AC TG G GC C
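The Eulerian-path criterion can be explored by brute force on small inputs. The sketch below treats each read as an edge from its (r − 1)-prefix to its (r − 1)-suffix and enumerates every way of using all reads exactly once. The function name `reconstructions` is hypothetical, and the backtracking search is exponential in the worst case, so it is meant for toy examples only, not as the method of the cited papers.

```python
from collections import Counter

def reconstructions(read_counts, r):
    """Enumerate every sequence whose multiset of length-r reads equals read_counts.

    Each read is an edge from its (r-1)-prefix to its (r-1)-suffix in the
    de Bruijn graph, so each sequence corresponds to an Eulerian path.
    """
    total = sum(read_counts.values())
    remaining = Counter(read_counts)
    found = set()

    def extend(sequence, used):
        if used == total:
            found.add(sequence)
            return
        tail = sequence[-(r - 1):]
        for read in list(remaining):
            if remaining[read] > 0 and read[:r - 1] == tail:
                remaining[read] -= 1          # use this edge
                extend(sequence + read[-1], used + 1)
                remaining[read] += 1          # backtrack

    for start in list(remaining):
        remaining[start] -= 1
        extend(start, 1)
        remaining[start] += 1
    return sorted(found)

example_reads = Counter(["ATG", "TGG", "GGG", "GGC", "GCA", "CAC",
                         "ACT", "CTG", "TGA", "GAG", "AGC", "GCC"])
print(reconstructions(example_reads, 3))
```

For the example reads the search returns two sequences, matching the two Eulerian paths; a read set with a single reconstruction is identifiable.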
Theoretical Setup Q2: Randomized
Random sequence, entries independent and uniform on q letters.
- What is the probability of identifiability?
- Criteria on the growth of r = r_N as N → ∞ such that the chance
  the sequence is identifiable tends to zero or one?
Ukkonen-Pevzner is useful here: it suffices to understand the
probability that the blocking patterns appear.
- If r / log(N) > 2 / log(q) eventually, then the probability of
  identifiability tends to one.
- If r / log(N) < 2 / log(q) eventually, then the probability of
  identifiability tends to zero.
These are matching bounds/critical behavior.
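A rough way to see where the constant 2 / log(q) comes from is a first-moment count of repeated (r − 1)-tuples: there are about N^2 / 2 pairs of positions, and two (r − 1)-tuples at distinct positions agree with probability q^{−(r−1)}. The sketch below computes this expectation and the critical read length. It is a heuristic under the talk's uniform i.i.d. model, not the proof, and the function names are hypothetical.

```python
import math

def critical_read_length(N, q):
    """Read length 2 log(N) / log(q) at which the two bounds meet."""
    return 2 * math.log(N) / math.log(q)

def expected_repeated_pairs(N, r, q):
    """Expected number of position pairs whose (r-1)-tuples coincide."""
    windows = N - r + 2                    # number of (r-1)-tuples in the sequence
    pairs = windows * (windows - 1) / 2
    return pairs * q ** (-(r - 1))

print(critical_read_length(10**6, 4))
print(expected_repeated_pairs(14, 3, 4))
```

The expectation shrinks when r is above the critical length and grows when r is below it, which is consistent with the two regimes stated above.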
Theoretical Setup Q3: Sampling
Assuming a sequence that is identifiable, sample reads randomly
with replacement. How many samples to reconstruct with good
probability?
- Need algorithms to reconstruct from samples.
- Random problem may have generic features to exploit.
- For DNA shotgun assembly, Q1 and Q2 are well understood,
  largely thanks to Ukkonen-Pevzner ( ⇐⇒ ).
- Further, in the regime where the chance of identifiability tends
  to one, the chance the reads are all distinct also tends to one,
  and so the sampling question Q3 is also straightforward for
  the random problem (greedy algorithms).
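One possible greedy algorithm, sketched under the assumption that every (r − 1)-tuple occurs at most once in the sequence (so all reads are distinct and each overlap has a unique continuation). The function name `greedy_assemble` is hypothetical; the talk does not specify an algorithm.

```python
def greedy_assemble(read_set, r):
    """Assemble a sequence from its distinct reads.

    Assumes every (r-1)-tuple occurs at most once in the sequence,
    so each read has at most one possible successor.
    """
    by_prefix = {read[:r - 1]: read for read in read_set}
    suffixes = {read[1:] for read in read_set}
    # The first read is the only one whose prefix is not another read's suffix.
    (start,) = [read for read in read_set if read[:r - 1] not in suffixes]
    sequence, current = start, start
    while current[1:] in by_prefix:
        current = by_prefix[current[1:]]   # the unique read overlapping by r-1 letters
        sequence += current[-1]
    return sequence

print(greedy_assemble({"ACG", "CGT", "GTT", "TTG", "TGC", "GCA"}, 3))
```

Because the input is a set, repeated samples of the same read are harmless once every read has been seen at least once.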
A general setup
1. G is a (fixed or random) graph,
2. with a (fixed or random) labeling of the vertices,
3. vertex v has a neighborhood N_r(v) of “radius” r.
DNA shotgun example:
1. G is the line graph (a path),
2. labels are from the alphabet {A, C, G, T},
3. the r-neighborhoods are the reads of length r.
Some wrinkles:
1. Geometry/symmetries,
2. Kelly/Harary reconstruction conjecture.
What is this talk about?
The general setup:
1. G is a (fixed or random) graph,
2. with a (fixed or random) labeling of the vertices,
3. vertex v has a neighborhood N_r(v) of “radius” r.
Address Q1, Q2, Q3 in general or in examples:
Q1. Conditions for identifiability of the graph G with labels from the
multiset of r-neighborhoods?
Q2. Upper and lower bounds on the probability of identifiability in
random examples? Asymptotically matching?
Q3. Given identifiability, algorithms and bounds on the chance of
reconstructing from samples.
Motivating Application
Graph shotgun assembly appears in the neuroscience literature for
reconstructing neural connectivity networks:
1. G is the (unknown) graph formed by a collection of neurons
with edges between “linked” neurons,
2. labels may be neuron types,
3. different types of stimulus show structure and labels of small
sub-networks.
Want to recover G with labels from the sub-networks.
Outline of the rest of the talk:
1. Some general results for Q1: Deterministic and Q3: Sampling.
2. Apply general results and look at Q2: Random in 2 examples.
Q1: Deterministic
Identifiability is impossible if there is a blocking configuration
This is one direction of Ukkonen-Pevzner; e.g., Interleaved repeat:
· · · xαy · · · xβy · · · ⇐⇒ · · · xβy · · · xαy · · ·
Q1: Deterministic
Identifiability is assured if there is uniqueness of overlaps:
- There is an efficient algorithm for recovering the graph from
  r-neighborhoods if
  N_{r−1}(v) ≠ N_{r−1}(w) for all vertices v ≠ w.
Q3: Sampling
If uniqueness of overlaps holds, then coupon collecting says that
N log(N) − N log(ε)
samples are enough to recover with probability at least 1 − ε.
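The sample size follows from a union bound: with m uniform samples drawn with replacement from N neighborhoods, the chance that some neighborhood is never drawn is at most N (1 − 1/N)^m ≤ N e^{−m/N}, which equals ε at m = N log(N) − N log(ε). A small numerical check, assuming uniform sampling; the function names are hypothetical.

```python
import math

def samples_needed(N, eps):
    """Sample size N log(N) - N log(eps), rounded up to an integer."""
    return math.ceil(N * math.log(N) - N * math.log(eps))

def miss_probability_bound(N, m):
    """Union bound on the chance that some neighborhood is never sampled."""
    return N * (1 - 1 / N) ** m

m = samples_needed(1000, 0.01)
print(m, miss_probability_bound(1000, m))
```

The bound is on collecting every neighborhood at least once; reconstruction then proceeds as in the full-information case.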
Example 1: Labeled lattice models
1. G is the n × n lattice,
2. labels are drawn uniformly at random from q letters,
3. the (n − r + 1)^2 neighborhoods are the r × r squares.
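A minimal sketch of this model, assuming labels are represented as integers 0, ..., q − 1 and each r × r square is stored as a tuple of row tuples. The function names are hypothetical.

```python
import random
from collections import Counter

def random_lattice(n, q, rng):
    """n x n grid of labels drawn independently and uniformly from q letters."""
    return [[rng.randrange(q) for _ in range(n)] for _ in range(n)]

def square_neighborhoods(labels, r):
    """Multiset of all r x r sub-squares of the labeled lattice."""
    n = len(labels)
    return Counter(
        tuple(tuple(labels[i + a][j:j + r]) for a in range(r))
        for i in range(n - r + 1)
        for j in range(n - r + 1)
    )

rng = random.Random(0)
lattice = random_lattice(5, 3, rng)
squares = square_neighborhoods(lattice, 2)
print(sum(squares.values()))  # (n - r + 1)^2 squares in total
```

Only the contents of each square are kept, not its position, which is the information available in the assembly problem.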
Example 1: Labeled lattice models
Blocking configuration
- If r^2 / log(n) < 1 / log(q) eventually, then the probability of
  identifiability tends to zero.
Uniqueness of overlaps
- If r^2 / log(n) > 4 / log(q) eventually, then the probability of
  identifiability tends to one.
Not matching bounds. No Ukkonen-Pevzner conditions to lean on.
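To make the gap concrete, the two conditions can be solved for r. The sketch below simply evaluates the stated bounds; the function name `lattice_bounds` is hypothetical.

```python
import math

def lattice_bounds(n, q):
    """Values of r at which the two stated conditions on r^2 / log(n) switch.

    Below the first value the probability of identifiability tends to zero,
    above the second it tends to one.
    """
    ratio = math.log(n) / math.log(q)
    return math.sqrt(ratio), math.sqrt(4 * ratio)

lower, upper = lattice_bounds(2**16, 2)
print(lower, upper)
```

The two values always differ by a factor of 2 in r, which is the regime the bounds leave open.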
Example 2: Erdős-Rényi random graph with no labels
1. G is the random graph on N vertices where edges appear
independently between vertices with probability p_N,
2. no labels (or all vertices have the same label),
3. r-neighborhoods N_r(v) are the subgraphs induced by vertices
at distance no greater than r from v.
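The r-neighborhood defined above can be computed with a breadth-first search. This is a sketch assuming the graph is given as a dictionary mapping each vertex to the set of its neighbors; the function name is hypothetical. Vertex names are kept here for readability, whereas in the assembly problem they would not be observed.

```python
from collections import deque

def r_neighborhood(adj, v, r):
    """Induced subgraph on the vertices within graph distance r of v.

    adj maps each vertex to the set of its neighbors (undirected graph).
    Returns (vertices, edges), with each edge a frozenset of two endpoints.
    """
    dist = {v: 0}
    queue = deque([v])
    while queue:
        u = queue.popleft()
        if dist[u] == r:
            continue                      # do not explore beyond radius r
        for w in adj[u]:
            if w not in dist:
                dist[w] = dist[u] + 1
                queue.append(w)
    ball = set(dist)
    edges = {frozenset((u, w)) for u in ball for w in adj[u] if w in ball}
    return ball, edges

path = {0: {1}, 1: {0, 2}, 2: {1, 3}, 3: {2, 4}, 4: {3}}
print(r_neighborhood(path, 2, 1))
```

Edges between two vertices at distance exactly r are included, since the neighborhood is the induced subgraph.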
Example 2a: Sparse Erdős-Rényi random graph
Structure of the Erdős-Rényi graph depends on the behavior of N × p_N.
1. Sparse Case
- Assume p_N = λ/N for some λ > 0.
- Here there are many isolated trees with order log(N) vertices.
Example 2a: Sparse Erdős-Rényi random graph
Blocking configuration for r-neighborhoods (the line graph has 2r + 1 vertices)
Since [figure: first configuration] has the same r-neighborhoods as [figure: second configuration]
Example 2a: Sparse Erdős-Rényi random graph
Blocking configuration
- If r / log(N) < [λ − log(λ)]^{−1} eventually, then the probability
  of identifiability tends to zero.
Diameter
- For λ ≠ 1, the diameter of the sparse Erdős-Rényi random
  graph is of order log(N) (different constants than that above).
Have order right again, but not matching bounds.
Example 2b: Not sparse Erdős-Rényi random graph
Structure of the Erdős-Rényi graph depends on the behavior of N × p_N.
2. Not sparse Case
- Assume N p_N / log(N)^2 → ∞.
- In this case the graph is highly connected.
Example 2b: Not sparse Erdős-Rényi random graph
Uniqueness of overlaps
- If r = 3, then the probability of identifiability tends to one.
Why?
- 2-neighborhoods become unique since each vertex has so
  many neighbors.
- Actually just the multiset of degrees of neighbors of each
  vertex becomes unique.
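The last point can be tested empirically on simulated graphs. The sketch below samples an Erdős-Rényi graph and checks whether the sorted degrees of each vertex's neighbors distinguish all vertices. The function names and the adjacency-dictionary representation are assumptions, and no claim is made about the outcome for any particular N and p.

```python
import random
from itertools import combinations

def erdos_renyi(N, p, rng):
    """Random graph on vertices 0..N-1 with each edge present independently w.p. p."""
    adj = {v: set() for v in range(N)}
    for u, w in combinations(range(N), 2):
        if rng.random() < p:
            adj[u].add(w)
            adj[w].add(u)
    return adj

def neighbor_degree_signatures(adj):
    """For each vertex, the sorted tuple of its neighbors' degrees."""
    return {v: tuple(sorted(len(adj[w]) for w in adj[v])) for v in adj}

def signatures_unique(adj):
    """True if no two vertices share the same multiset of neighbor degrees."""
    signatures = neighbor_degree_signatures(adj)
    return len(set(signatures.values())) == len(signatures)

graph = erdos_renyi(200, 0.5, random.Random(1))
print(signatures_unique(graph))
```

On a path with three vertices the two endpoints share a signature, so small or sparse graphs can easily fail the check.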
Another example: jigsaw puzzle
- A puzzle has n × n pieces that form a square.
- Adjacent pieces receive one of q cuts, uniformly at random
  and independently.
Given the n^2 pieces, what is the probability that the puzzle can be uniquely assembled?
- Framework and methods above give non-matching bounds.
- More detailed analysis by
1. Bordenave, Feige, Mossel show that if q ≥ n^{1+ε}, then the
probability of unique assembly tends to one,
2. Martinsson shows that if q ≤ (2/√e) n − ω(log(n)), then the probability
of multiple assemblies is bounded away from zero.
Open Problems
- Is there an analog of Ukkonen-Pevzner for the lattice
  problem? Or other examples? (Very difficult in general;
  Kelly/Harary “graph reconstruction conjecture”.)
- Efficient algorithms for reconstruction from full/statistical
  sample? For DNA shotgun assembly there are competing
  algorithms.
- Are there matching bounds in the examples above? (critical
  behavior)
- d-dimensional lattice: how do thresholds for identifiability rely
  on the dimension?
- Sparse Erdős-Rényi with labels. Get the threshold’s reliance
  on the labels correct.
- Random models where the labels and graph structure interact.
E. Mossel and N. Ross. Shotgun assembly of labeled graphs
(2015). http://arxiv.org/abs/1504.07682
Thank you!