III/ Genome Rearrangements

1/ Evolution of matchings (cuts and joins),
alternating permutations and labeled trees
2/ Evolution of permutations (reversals) and a
clicking game on F2
3/ Shuffling a deck of cards
[Figure: result of sequence searches, showing genes in genomes G1 and G2]
Genomes can be circular or linear, possibly with several chromosomes
Beware of orientations, which are deduced from the double-strand structure of genomes
Dotplot between 2 E. coli strains
Genomes can be
- signed strings
- signed permutations
- matchings on gene extremities
General reference for genome rearrangements
Fertin et al, Combinatorics of genome rearrangements, 2009
1/ Genomes as matchings
2 more recent references on this part:
Ouangraoua and Bergeron, JCB, 2010
Miklos and Tannier, arxiv, 2013
Modeling a genome
[Figure: Genome 1 and Genome 2 drawn as genes linked by adjacencies]
The comparison graph
[Figure: comparison graph of Genome 1 and Genome 2]
Components of the comparison graph: cycles, RedBlue paths, RedRed paths, BlueBlue paths
1.1 Single cut or join
Allowed operations: Remove an edge or add an edge
- How many operations are required (trivial)
- How to count or sample solutions
- The relation with alternating permutations
Alternating permutation (André, 1881):
c_1,...,c_n permutation of 1..n such that
c_{2i-1} < c_{2i} and c_{2i} > c_{2i+1}
For example, 1,3,2,4 but not 1,3,4,2
A_n = number of alternating permutations
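The definition above can be checked on small cases. The sketch below counts alternating permutations by brute force and compares the result with a classical recurrence for these numbers, 2*A_{m+1} = sum_k choose(m,k) A_k A_{m-k} for m >= 1. The function names are illustrative and do not come from the slides.

```python
from itertools import permutations
from math import comb


def is_alternating(c):
    """True if c_1 < c_2 > c_3 < c_4 ... (the pattern defined above)."""
    return all(
        (c[j] < c[j + 1]) if j % 2 == 0 else (c[j] > c[j + 1])
        for j in range(len(c) - 1)
    )


def count_alternating_brute(n):
    """Count alternating permutations of 1..n by full enumeration."""
    return sum(is_alternating(p) for p in permutations(range(1, n + 1)))


def zigzag(n):
    """A_n computed with 2*A_{m+1} = sum_k C(m,k) * A_k * A_{m-k}."""
    a = [1, 1]  # A_0 = A_1 = 1
    for m in range(1, n):
        a.append(sum(comb(m, k) * a[k] * a[m - k] for k in range(m + 1)) // 2)
    return a[n]


print([zigzag(n) for n in range(1, 8)])
```

The brute-force count is only practical for small n, while the recurrence runs in quadratic time.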
Transforming Red into Blue
The number of scenarios sorting an even (RedBlue) path with n edges is E(n) = A_n
The number of scenarios sorting an odd RedRed path with n edges is R(n) = A_n
The number of scenarios sorting an odd BlueBlue path with n edges is B(n) = A_n
The number of scenarios sorting a cycle with n edges is C(n) = (n/2) A_{n-1}
Mix the paths and cycles with a multinomial coefficient
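A minimal sketch of these counts, under two assumptions that are not stated on the slide: a component with n edges is sorted in exactly n SCJ (each red edge is cut, each blue edge is joined), and a blue edge can be joined only after its red neighbors have been cut. With these assumptions, a brute-force count on paths can be compared with A_n, and components are mixed by choosing which steps of the global scenario go to each component. All names are illustrative.

```python
from itertools import permutations
from math import comb


def zigzag(n):
    """A_n, the number of alternating permutations of 1..n."""
    a = [1, 1]
    for m in range(1, n):
        a.append(sum(comb(m, k) * a[k] * a[m - k] for k in range(m + 1)) // 2)
    return a[n]


def path_scenarios_brute(n, first_is_red=True):
    """Count the orders of the n SCJ on a path whose edges alternate colors.

    Edge j is red (to cut) or blue (to join); a blue edge is joined
    only after its red neighbors on the path have been cut.
    """
    red = [(j % 2 == 0) == first_is_red for j in range(n)]
    count = 0
    for order in permutations(range(n)):
        rank = {edge: t for t, edge in enumerate(order)}
        if all(
            rank[j] > rank[nb]
            for j in range(n) if not red[j]
            for nb in (j - 1, j + 1) if 0 <= nb < n
        ):
            count += 1
    return count


def mixed_scenarios(components):
    """components: list of ("path" or "cycle", number of edges).

    Assumes every listed component still needs sorting
    (cycles have at least 4 edges).
    """
    total_steps, count = 0, 1
    for kind, n in components:
        ways = zigzag(n) if kind == "path" else (n // 2) * zigzag(n - 1)
        # Choose which of the steps so far belong to this component.
        count *= ways * comb(total_steps + n, n)
        total_steps += n
    return count


print(mixed_scenarios([("path", 3), ("cycle", 4)]))
```

Multiplying successive binomial coefficients in this way is the same as using one multinomial coefficient over all components.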
An inversion is 4 SCJ
We would like to model more realistic operations
Non reciprocal translocation (DCJ)
Double cut-and-join (DCJ)
Reciprocal Translocation (DCJ)
Fission (SCJ and DCJ)
Fusion (SCJ and DCJ)
Inversion (DCJ)
1.2 Double cut-and-join
- How many operations are required (easy)
- How to count or sample solutions
- The relation with parking functions, labeled trees, cycles
in permutations
- How to estimate the number of steps given the distance
d(Red,Blue) = n – (c + i/2)
d(Red,Blue) = minimum number of DCJ
n = number of genes (2n = number of vertices)
c = number of cycles
i = number of paths with an odd number of vertices (including isolated vertices)
Two identical genomes only have size 2 cycles and size 1 paths
A size 2k cycle needs k-1 DCJ
A size 2k+1 path needs k DCJ
A size 2k path needs k DCJ
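The distance formula can be sketched for two genomes that each consist of one linear chromosome, given as lists of signed genes over the same gene set. The encoding is an assumption made for illustration: each gene g has a tail (g, "t") and a head (g, "h"), and a gene numbered 0 is treated as positive.

```python
def extremities(gene):
    """(left, right) extremities of a signed gene, read along the chromosome."""
    g = abs(gene)
    return ((g, "t"), (g, "h")) if gene >= 0 else ((g, "h"), (g, "t"))


def adjacencies(chromosome):
    """Matching on gene extremities induced by one linear chromosome."""
    match = {}
    for left, right in zip(chromosome, chromosome[1:]):
        a, b = extremities(left)[1], extremities(right)[0]
        match[a], match[b] = b, a
    return match


def dcj_distance(genome1, genome2):
    """n - (c + i/2), computed on the comparison graph of the two matchings."""
    red, blue = adjacencies(genome1), adjacencies(genome2)
    genes = sorted({abs(g) for g in genome1})
    seen, cycles, odd_paths = set(), 0, 0
    for start in ((g, e) for g in genes for e in "th"):
        if start in seen:
            continue
        component, stack = set(), [start]
        while stack:
            v = stack.pop()
            if v not in component:
                component.add(v)
                stack.extend(m[v] for m in (red, blue) if v in m)
        seen |= component
        if all(v in red and v in blue for v in component):
            cycles += 1      # every vertex has a red and a blue edge
        elif len(component) % 2 == 1:
            odd_paths += 1   # includes isolated vertices
    return len(genes) - (cycles + odd_paths // 2)


print(dcj_distance([0, 7, 5, 3, -1, -6, -2, 4, 8], list(range(9))))
```

On this pair the comparison graph has 4 cycles and 2 isolated vertices (the two chromosome ends), so the formula gives 9 - (4 + 1) = 4.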
Good DCJs are transforming
- 1 cycle into 2 cycles
- 1 even path into 1 cycle and one even path
- 1 odd path into 1 cycle and one odd path
- 2 even paths into 2 odd paths
Number of ways to sort a cycle with 2n edges by DCJ =
Number of ways to decompose the permutation cycle
(1..n) into transpositions =
Number of parking functions of length n-1 =
n^{n-2}
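The middle equality can be checked by brute force for small n. The sketch below enumerates all sequences of n-1 transpositions and counts those whose product is the cycle (1..n), written here on 0..n-1. This only illustrates the count on small cases and is not an efficient enumeration; the names are illustrative.

```python
from itertools import combinations, product


def compose(p, q):
    """Permutation 'apply q, then p' on 0..n-1 (tuples used as maps)."""
    return tuple(p[q[i]] for i in range(len(p)))


def transposition(n, a, b):
    """The transposition exchanging a and b, as a tuple map on 0..n-1."""
    t = list(range(n))
    t[a], t[b] = t[b], t[a]
    return tuple(t)


def count_minimal_factorizations(n):
    """Number of ways to write the n-cycle as a product of n-1 transpositions."""
    cycle = tuple((i + 1) % n for i in range(n))
    swaps = [transposition(n, a, b) for a, b in combinations(range(n), 2)]
    count = 0
    for seq in product(swaps, repeat=n - 1):
        perm = tuple(range(n))
        for t in seq:
            perm = compose(t, perm)
        count += perm == cycle
    return count


for n in range(2, 6):
    print(n, count_minimal_factorizations(n), n ** (n - 2))
```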
Slides from Richard Stanley
Identify a DCJ with a transposition (ab) a<b
Say a is the base and b is the top
1/ The sequence of bases of a DCJ scenario
sorting a cycle is a parking function
(a number after the base cannot be reused after a base)
2/ The sequence of bases uniquely determines the
DCJ scenario
(a number at the top cannot be reused as a base)
Parking function: 48122324
[Figure: illustration of this example]
(123456789)(45)(89)(18)(26)(27)(36)(28)(46)=(1)(2)(3)(4)(5)(6)(7)(8)(9)
Choose a top for each base, starting from the highest:
8->9, 4->5, 4->6, 3->6, 2->6, 2->7, 2->8, 1->8
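The example can be replayed in a few lines. The sketch below assumes the usual characterization of a parking function (once sorted, the i-th value is at most i) and one particular multiplication convention, under which applying a transposition (a b) exchanges the images of a and b in the current permutation. Under that convention the eight transpositions do reduce the cycle (123456789) to the identity. The function names are illustrative.

```python
from itertools import product


def is_parking_function(seq):
    """Sorted increasingly, the i-th value (1-based) must be at most i."""
    return all(b <= i for i, b in enumerate(sorted(seq), start=1))


def count_parking_functions(m):
    """Brute-force count of parking functions of length m."""
    return sum(is_parking_function(s) for s in product(range(1, m + 1), repeat=m))


def apply_transpositions(n, transpositions):
    """Start from the cycle (1 2 ... n) and multiply by each (a b) in turn."""
    perm = {x: x % n + 1 for x in range(1, n + 1)}
    for a, b in transpositions:
        perm[a], perm[b] = perm[b], perm[a]  # exchange the images of a and b
    return perm


scenario = [(4, 5), (8, 9), (1, 8), (2, 6), (2, 7), (3, 6), (2, 8), (4, 6)]
bases = [a for a, _ in scenario]
print(bases, is_parking_function(bases))
print(apply_transpositions(9, scenario))
```

The brute-force count can also be compared with (m+1)^(m-1) for small m, which for m = n-1 is the n^{n-2} of the slides.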
Labeled trees (Cayley, 1889)
Number of unrooted node-labeled
trees with n vertices =
Number of rooted edge-labeled
trees with n vertices =
n^(n-2)
Relations in Stanley, 1997
For odd paths, it is still possible to enumerate
Use a multinomial to mix cycles and odd paths
Not in presence of even paths (open problem)
Final remarks on genomes as matchings, and
rearrangements as SCJ or DCJ
SCJ saturates faster, is less precise but is
computationally feasible and close to the
substitution models.
Probabilistic models for DCJ need Monte Carlo
methods to explore solution spaces, while for SCJ
they could have analytical solutions.
Estimation of a number of events, given the shortest path
For random walks on graphs, the expected SCJ distance
after k SCJ is N/2(1-(1-2/N)^k) (N edges, n vertices)
[Plot: estimated number of SCJ against SCJ distance, for n=100, starting N=2000]
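The formula for random walks on graphs can be inverted to estimate the number of events from an observed distance. The sketch below is a direct transcription of the formula and its algebraic inverse; the function names and the value N=2000 used in the example are illustrative choices.

```python
import math


def expected_scj_distance(N, k):
    """Expected SCJ distance after k random SCJ, as N/2 * (1 - (1 - 2/N)^k)."""
    return N / 2 * (1 - (1 - 2 / N) ** k)


def estimate_steps(N, d):
    """Number of steps k whose expected distance is d (requires d < N/2)."""
    return math.log(1 - 2 * d / N) / math.log(1 - 2 / N)


N = 2000
for d in (10, 100, 500, 900):
    print(d, round(estimate_steps(N, d), 1))
```

The estimate is always at least the observed distance, and it grows quickly as d approaches N/2, where the walk saturates.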
For random walks on matchings, estimate the SCJ or DCJ
distance after k SCJ or DCJ...
2/ Genomes as permutations
Sorting by Reversals
0 7 5 3 -1 -6 -2 4 8
A permutation is a particular matching
An inversion is a particular DCJ
Sometimes it is reasonable to consider only permutations and inversions
0 1 2 3 4 5 6 7 8
Sorting by Reversals
0 7 5 3 -1 -6 -2 4 8
0 1 -3 -5 -7 -6 -2 4 8
0 1 -3 -5 -4 2 6 7 8
0 1 -3 -2 4 5 6 7 8
0 1 2 3 4 5 6 7 8
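The example can be replayed step by step. In the sketch below a reversal takes a segment of the signed permutation, reverses its order and flips every sign. The index pairs are read off from this example and the function name is illustrative.

```python
def reversal(perm, i, j):
    """Reverse the segment perm[i..j] (inclusive) and flip its signs."""
    return perm[:i] + [-x for x in reversed(perm[i:j + 1])] + perm[j + 1:]


start = [0, 7, 5, 3, -1, -6, -2, 4, 8]
steps = [(1, 4), (4, 7), (3, 5), (2, 3)]  # segments reversed at each step

states = [start]
for i, j in steps:
    states.append(reversal(states[-1], i, j))

for s in states:
    print(" ".join(str(x) for x in s))
```

Four reversals reach the identity, matching the four rows that follow the starting permutation.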
2/ Reversals
An example where the inversion distance is the DCJ
distance (the previous example)
An example where it is not
(reverse -3 -5 -7 -6 -2 in the previous example
as a second step)
Sorting by Reversals
0 7 5 3 -1 -6 -2 4 8
0 1 -3 -5 -7 -6 -2 4 8
0 1 2 6 7 5 3 4 8
0 1 2 3 4 5 6 7 8 (target)
The overlap graph of P1 and P2
A vertex is an edge of the comparison graph belonging to P2
2 vertices are linked if the edges cross when the graph is written under P1
A vertex is oriented if the edge spans an interval with an odd number of
vertices (note that a vertex is oriented iff it has odd degree)
The effect of a reversal on the overlap graph
"local complementation"
The effect of a reversal on the adjacency matrix
of the overlap graph
A =
0011000
0110101
1111101
1010101
0111010
0000101
0111010
A + v1v1^T =
0011000
0000000
1001000
1010101
0001111
0000101
0001111
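The two matrices can be related by a short computation over F2. The sketch below rests on two assumptions: that v1 is the row of A indexed by the clicked vertex, and that the diagonal of A records which vertices are oriented. With the second vertex clicked (index 1, whose diagonal entry is 1), A + v1v1^T reproduces the second matrix exactly. The function name click is illustrative.

```python
def to_matrix(rows):
    """Parse rows written as strings of 0/1 digits."""
    return [[int(ch) for ch in row] for row in rows]


A = to_matrix(["0011000", "0110101", "1111101", "1010101",
               "0111010", "0000101", "0111010"])
B = to_matrix(["0011000", "0000000", "1001000", "1010101",
               "0001111", "0000101", "0001111"])


def click(adj, v):
    """Return adj + row_v * row_v^T over F2 (local complementation at v)."""
    row = adj[v]
    n = len(adj)
    return [[adj[i][j] ^ (row[i] & row[j]) for j in range(n)] for i in range(n)]


print(click(A, 1) == B)
```

Only the entries indexed by the clicked vertex and its neighbors change: adjacency among them is complemented, their orientation flips, and the clicked vertex becomes isolated.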
A component is oriented if it has a black vertex, unoriented otherwise
Sorting an oriented component is done in as many reversals as the
rank of the adjacency matrix of the overlap graph over F2
Rank is n-c, where n is the size of the matrix and c the number of
cycles. (A cycle has rank n-1, as in a matroid)
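The rank over F2 can be computed by Gaussian elimination on rows encoded as integers. The sketch below does this and checks, on the two example matrices of the overlap graph, that adding v v^T for a vertex with diagonal entry 1 lowers the rank by exactly one. This is a property of such rank-one updates over any field, checked here only on that example. The function name is illustrative.

```python
def rank_f2(matrix):
    """Rank over F2 of a 0/1 matrix, by elimination on integer-encoded rows."""
    rows = [int("".join(map(str, r)), 2) for r in matrix]
    rank = 0
    while rows:
        pivot = rows.pop()
        if pivot:
            rank += 1
            low = pivot & -pivot  # lowest set bit, used as pivot column
            rows = [r ^ pivot if r & low else r for r in rows]
    return rank


before = [[int(ch) for ch in row] for row in
          ["0011000", "0110101", "1111101", "1010101",
           "0111010", "0000101", "0111010"]]
after = [[int(ch) for ch in row] for row in
         ["0011000", "0000000", "1001000", "1010101",
          "0001111", "0000101", "0001111"]]

print(rank_f2(before), rank_f2(after))
```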
Danger: creating unoriented components
There is always a black vertex such that clicking on it does not
create unoriented components.
Proof: Take the oriented vertex v which maximizes
Number of unoriented neighbors – Number of oriented neighbors
It does not create unoriented components. Indeed, if it does create
one unoriented component C, C has an oriented vertex w adjacent
to v. Calculate its score: an unoriented neighbor of v is a neighbor
of w, and an oriented neighbor of w is a neighbor of v.
Sorting unoriented components: hurdles and fortresses
Hurdles: minimal unoriented components.
One inversion cuts 2 hurdles (c-1, h-2) or 1 hurdle (c,h-1)
Fortress: odd number of hurdles but additional unoriented
components
d(Red,Blue) = n – c + h + f
d(Red,Blue) = minimum number of reversals
n = number of genes
c = number of cycles
h = number of hurdles
f = 1 if the permutation is a fortress, 0 otherwise
Counting, sampling or enumerating
sorting by reversals scenarios is almost open
Estimating the expected number of reversals
given the distance is a possible homework
subject.
The use of sorting by reversals in a controversial study: refuting a
"random breakage model".
Argument: any scenario has to break at least 2d times on n
breakpoints. If c is low, d is close to n and each breakpoint is used
twice.
Relating the breakpoint sizes to the total intergene size, the data are seen
to be too far from what a uniform random model predicts.
3/ Shuffling genomes or cards
One operation of shuffling a deck of cards:
Cut the deck into 2, then repeatedly take a card at random from the left or right subdeck
Tandem duplication and random loss in genomes:
Copy the genome in two copies, and randomly remove one copy of each gene
Riffle shuffle
(3 7 1 5 8 2 6 4)
(3 7 1 5 8 2 6 4 3 7 1 5 8 2 6 4)
(1 5 2 6 3 7 8 4)
(1 5 2 6 3 7 8 4 1 5 2 6 3 7 8 4)
(1 2 3 4 5 6 7 8)
Tandem duplication and losses
(3 7 1 5 8 2 6 4)
(1 5 2 6)(3 7 8 4)
(1 5 2 6 3 7 8 4)
(1 2 3 4)(5 6 7 8)
(1 2 3 4 5 6 7 8)
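The two steps of this example can be replayed with a small function. In this sketch a tandem duplication with losses is described by the set of genes kept in the first copy, and the remaining genes are kept in the second copy. The function name and its argument are illustrative.

```python
def tdl(perm, kept_in_first):
    """Duplicate perm in tandem, keep each gene in exactly one of the two copies."""
    first = [v for v in perm if v in kept_in_first]
    second = [v for v in perm if v not in kept_in_first]
    return first + second


p0 = [3, 7, 1, 5, 8, 2, 6, 4]
p1 = tdl(p0, {1, 5, 2, 6})
p2 = tdl(p1, {1, 2, 3, 4})
print(p1)
print(p2)
```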
A chain in a permutation is a maximal subsequence of
consecutive numbers
(3 7 1 5 8 2 6 4) has 4 chains 12,34,56,78
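A short sketch of how chains can be computed: a chain ends at value v whenever v+1 appears to the left of v in the permutation. It assumes a non-empty permutation of 1..n, and the function name is illustrative.

```python
def chains(perm):
    """Split 1..n into maximal runs v, v+1, ... that appear left to right in perm."""
    pos = {v: i for i, v in enumerate(perm)}
    result, current = [], [1]
    for v in range(2, len(perm) + 1):
        if pos[v] > pos[v - 1]:
            current.append(v)       # v comes after v-1: same chain
        else:
            result.append(current)  # v comes before v-1: start a new chain
            current = [v]
    result.append(current)
    return result


print(chains([3, 7, 1, 5, 8, 2, 6, 4]))
```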
A k-TDL is an operation copying a permutation k times
and applying losses (usual TDL is 2-TDL)
Observation: A permutation is sorted in 1 k-TDL iff k is
at least the number of chains
Theorem 1: if c<=k is the number of chains of a
permutation p, then there are choose(n+k-c,n) ways to
sort p with one k-TDL
Proof: (123456789...), the identity permutation, has to
be cut into k pieces, each piece corresponding to a
copy of p (we have to place k-1 cuts). c-1 cuts are
compulsory. Ex: if p=(371582649), then 2 and 3 cannot
go to the same copy, so Id is cut into (12|34|56|789).
There remains k-c cuts to place, and repetition is
allowed, yielding the result.
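Theorem 1 can be checked by brute force on small permutations. In the sketch below a k-TDL is encoded by the copy (0..k-1) in which each element is kept. All k^n choices are enumerated, and those that produce the identity are counted and compared with choose(n+k-c,n). The encoding and the function names are assumptions made for illustration.

```python
from itertools import permutations, product
from math import comb


def count_chains(perm):
    """Number of chains: one more than the number of v with v+1 left of v."""
    pos = {v: i for i, v in enumerate(perm)}
    return 1 + sum(pos[v + 1] < pos[v] for v in range(1, len(perm)))


def k_tdl(perm, copy_of, k):
    """Concatenate k copies of perm, keeping element v only in copy copy_of[v]."""
    return [v for c in range(k) for v in perm if copy_of[v] == c]


def count_sorting_tdl_brute(perm, k):
    """Number of k-TDL that sort perm, by enumerating every choice of copies."""
    n = len(perm)
    target = list(range(1, n + 1))
    return sum(
        k_tdl(perm, dict(zip(target, choice)), k) == target
        for choice in product(range(k), repeat=n)
    )


def count_sorting_tdl_formula(perm, k):
    """choose(n+k-c, n) when k is at least the number of chains c, else 0."""
    n, c = len(perm), count_chains(perm)
    return comb(n + k - c, n) if k >= c else 0


p = (2, 1, 3)
print([count_sorting_tdl_brute(p, k) for k in (1, 2, 3)])
```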
Theorem 2: There are as many ways to sort a
permutation with one k1*k2-TDL as ways to sort a
permutation with one k1-TDL followed by one k2-TDL.
Proof: (=>) k=k1*k2. Take a k-scenario, and label all k
copies by coordinates ab, a in (1,..,k2) and b in
(1,...,k1) in increasing lexicographic order. Each
element of p is labeled by (a,b). Make k1 copies and
sort according to the b coordinate. Then make k2
copies and sort according to the a coordinate. (<=) Any
k1-TDL scenario followed by a k2-TDL scenario produces a
coordinate system, which translates into a k-TDL scenario.
Corollaries:
The minimum number of TDL to sort a permutation with
c chains is ⌈log_2(c)⌉
The number of minimum size scenarios is
choose(n+2^⌈log_2(c)⌉-c,n)
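The two corollaries translate into a few lines of integer arithmetic. The sketch below computes the ceiling of log_2(c) from the bit length of c-1, which avoids floating-point rounding. The function names are illustrative.

```python
from math import comb


def min_tdl_steps(c):
    """Ceiling of log_2(c) for c >= 1, computed on integers."""
    return (c - 1).bit_length()


def count_min_scenarios(n, c):
    """choose(n + 2^ceil(log_2 c) - c, n)."""
    k = 2 ** min_tdl_steps(c)
    return comb(n + k - c, n)


# The permutation (3 7 1 5 8 2 6 4) has n = 8 elements and c = 4 chains.
print(min_tdl_steps(4), count_min_scenarios(8, 4))
```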
Note that the TDL distance is not symmetric.
It is the first non symmetric distance we have
encountered so far.
This forces us to go back to the evolutionary principle
to define a TDL problem when the two permutations are extant
genomes.
General notes on rearrangements in bioinformatics
Computational problems are far more complex than with
substitutions and indels (compare the distance computation
and estimation between two sequences)
Very often the evolutionary studies use the parsimony
principle, and are limited to few genomes with few genes,
while statistical models of sequence evolution with
substitutions can handle hundreds of complete genomes with
more accuracy (see part 3)
In evolutionary studies with real data, DCJ or SCJ are used
most often, and occasionally TDRL for some animal
mitochondria.
Computationally, there exist many variants of the
rearrangement problem: sort a permutation, or a sequence,
with an allowed operation, or a combination of operations
- transpositions
- pancakes
- gains, losses, duplications
- block interchanges
- whole genome duplications
- ...