Decisions, decisions: Circuits of a graph

DECISIONS, DECISIONS:
OVERLAPPING TO ORDER
By Richard Pham
Question : how can the correct
reassembled sequence be obtained via an
Eulerian cycle?
■ Relevance to computational biology : reassembly
– Complex de Bruijn graphs have numerous Eulerian cycles, which means there are
numerous candidate reassembled sequences, and hence a high probability of obtaining a
wrong sequence.
■ Definition : Eulerian cycle
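– A closed walk which covers each edge of the graph exactly once and ends at the node where it started
■ Required conditions for guaranteed Eulerian cycle (connected graph) :
– Undirected : zero odd-degree nodes (exactly two odd-degree nodes give an Eulerian path rather than a cycle)
– Directed : every node has equal in-degree and out-degree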
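As a minimal sketch of the degree conditions for the directed case, the snippet below builds de Bruijn style edges from a sequence and classifies the graph by degree balance alone. The function names (`de_bruijn_edges`, `degree_imbalance`, `eulerian_kind`) and the choice of k are assumptions made for illustration, and connectivity is assumed rather than checked.

```python
from collections import defaultdict


def de_bruijn_edges(sequence, k):
    """Return one directed edge (prefix, suffix) per k-mer of the sequence."""
    kmers = [sequence[i:i + k] for i in range(len(sequence) - k + 1)]
    return [(kmer[:-1], kmer[1:]) for kmer in kmers]


def degree_imbalance(edges):
    """Map each node to its out-degree minus its in-degree."""
    balance = defaultdict(int)
    for u, v in edges:
        balance[u] += 1
        balance[v] -= 1
    return dict(balance)


def eulerian_kind(edges):
    """Classify by degree only (hypothetical helper; connectivity is assumed)."""
    unbalanced = sorted(b for b in degree_imbalance(edges).values() if b != 0)
    if not unbalanced:
        return "cycle"   # every node balanced
    if unbalanced == [-1, 1]:
        return "path"    # one extra outgoing edge at the start, one extra incoming at the end
    return "none"
```

A linear sequence such as the deck's test string typically yields the "path" case; a cycle only appears when the sequence wraps around or the end is joined back to the start.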
Prerequisites to efficient, accurate
reassembly algorithms
There are, of course, more ways to obtain an Eulerian cycle than the
standard ones covered in this presentation.
To keep the code as flexible as possible, create a Graph class along with a
GraphNode class and a GraphEdge class.
Prerequisites to Efficient, Accurate
Reassembly Algorithms (cont.)
The Graph class should have an attribute called adjacency list.
Each element in the adjacency list should be in the form : [node_value, [adj1, adj2, …, adjN] ] where adjX is the value
of an adjacent node.
Highly useful for GET_NEXT() functions, such as obtaining the next safe edge.
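One possible shape for such a class is sketched below. It stores entries in the [node_value, [adj1, …, adjN]] form described above; the `_index` lookup table and the method names (`add_node`, `add_edge`, `neighbors`, `get_next`) are assumptions for illustration, not the presenter's code.

```python
class Graph:
    """Directed multigraph whose adjacency list holds [node_value, [adj1, ..., adjN]]."""

    def __init__(self):
        self.adjacency_list = []  # entries of the form [node_value, [neighbors]]
        self._index = {}          # assumed helper: node_value -> position in adjacency_list

    def add_node(self, value):
        if value not in self._index:
            self._index[value] = len(self.adjacency_list)
            self.adjacency_list.append([value, []])

    def add_edge(self, u, v):
        self.add_node(u)
        self.add_node(v)
        self.adjacency_list[self._index[u]][1].append(v)

    def neighbors(self, value):
        return self.adjacency_list[self._index[value]][1]

    def get_next(self, value):
        """Return the first adjacent node, or None if there is none."""
        adjacent = self.neighbors(value)
        return adjacent[0] if adjacent else None
```

A real GET_NEXT() for safe edges would add a selection policy on top of `neighbors`; this version only shows where that policy would plug in.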
GraphNode :
Some recommended attributes :
Start (time)
Finish (time)
Color
Classification
In-degree
Out-degree
Distance (from source)
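The recommended attributes could be collected in a small dataclass, as in the sketch below. The field types, the defaults, and the white/gray/black colour convention are assumptions borrowed from common textbook traversals; `is_balanced` is a hypothetical convenience method.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class GraphNode:
    value: str
    start: Optional[int] = None           # discovery time in a depth-first search
    finish: Optional[int] = None          # finishing time in a depth-first search
    color: str = "white"                  # assumed convention: white / gray / black
    classification: Optional[str] = None  # free-form label; its meaning is left open here
    in_degree: int = 0
    out_degree: int = 0
    distance: Optional[int] = None        # hop distance from the source node

    def is_balanced(self):
        """True when in-degree equals out-degree (hypothetical helper)."""
        return self.in_degree == self.out_degree
```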
Aspects to This Problem
Parallel edges allowed.
Parallel edges not allowed.
Duplicate edges allowed.
Duplicate edges not allowed.
My approach focuses solely on the overlap graph (easier parsing and translation of
k-mer data).
Top-Level Steps
■ Create class<Graph> alongside class<GraphNode> and class<GraphEdge>.
■ (Optional) Create standard traversal functions such as breadth-first search,
depth-first search, Dijkstra’s algorithm, and Kruskal’s algorithm, and
implement a standard merge sort for sorting edges by weight.
■ Obtain Eulerian circuit via the two most popular methods :
– Fleury’s algorithm
– Hierholzer’s algorithm
■ Record frequency of these circuits:
– Whichever circuit has the highest frequency is returned (a list is returned if multiple
circuits tie).
■ Failed steps :
– Obtain all Eulerian circuits via modification of Fleury’s algorithm and Hierholzer’s
algorithm.
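The frequency-recording step above might look like the following sketch. It assumes each circuit arrives as a list of node values collected over repeated runs; `most_frequent_circuits` is a hypothetical name.

```python
from collections import Counter


def most_frequent_circuits(circuits):
    """Return every circuit that shares the highest observed frequency."""
    counts = Counter(tuple(circuit) for circuit in circuits)  # tuples are hashable
    if not counts:
        return []
    best = max(counts.values())
    return [list(circuit) for circuit, n in counts.items() if n == best]
```

Returning a list in all cases keeps the tie case and the single-winner case uniform for the caller.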
Fleury’s Algorithm (undirected):
Overview
■ Initialize an unvisited_edges cache.
■ Given some start node (if there are two odd-degree nodes, choose one of them), the next edge is one
incident to the start node.
■ Definition : bridge
– Given edge := (u, v), where u is the current node of the path, (u, v) is a bridge if, once it is
traversed, the remaining unvisited edges offer no route from v back to u.
■ Choose next edge according to this policy:
1. If any non-bridge adjacent edge exists, choose it.
2. Otherwise, choose an arbitrary next edge.
■ Continue getting the next edge, updating the current node to the end point of that edge, and popping the
edge out of the unvisited_edges cache until no edges remain.
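The steps above can be sketched as follows for an undirected multigraph. This is one possible reading rather than the presenter's implementation: the adjacency dictionary doubles as the unvisited-edges cache, and the bridge test compares how many nodes are reachable before and after removing the edge, which is simple but slow. All names are assumptions.

```python
from collections import defaultdict


def fleury(edges, start=None):
    """Return an Eulerian trail, as a list of nodes, of an undirected multigraph."""
    adj = defaultdict(list)  # plays the role of the unvisited-edges cache
    for u, v in edges:
        adj[u].append(v)
        adj[v].append(u)
    if not adj:
        return []

    def remove_edge(u, v):
        adj[u].remove(v)
        adj[v].remove(u)

    def reachable_count(source):
        # Count nodes reachable from source over the remaining edges.
        seen, stack = {source}, [source]
        while stack:
            node = stack.pop()
            for nxt in adj[node]:
                if nxt not in seen:
                    seen.add(nxt)
                    stack.append(nxt)
        return len(seen)

    def is_bridge(u, v):
        # (u, v) is a bridge if removing it shrinks what u can still reach.
        before = reachable_count(u)
        remove_edge(u, v)
        after = reachable_count(u)
        adj[u].append(v)  # restore the edge
        adj[v].append(u)
        return after < before

    odd = [node for node in adj if len(adj[node]) % 2 == 1]
    if len(odd) not in (0, 2):
        raise ValueError("no Eulerian trail: need zero or two odd-degree nodes")
    if start is None:
        start = odd[0] if odd else next(iter(adj))
    current, path = start, [start]
    while adj[current]:
        candidates = list(adj[current])  # copy, because is_bridge mutates adj
        nxt = next((v for v in candidates if not is_bridge(current, v)), candidates[0])
        remove_edge(current, nxt)
        current = nxt
        path.append(current)
    if any(adj.values()):
        raise ValueError("edges remain: graph is disconnected or start is invalid")
    return path
```

Because the bridge test runs a traversal per candidate edge, this version is quadratic in the number of edges in the worst case, which is one reason Hierholzer's algorithm is usually preferred for long sequences.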
Hierholzer’s Algorithm (directed):
Overview
■ Given that the graph passes the condition that every node has equal in-degree and out-degree, the algorithm is as follows :
– Initialize an unvisited edges cache.
– Obtain a cycle starting at the source node (remember to pop each traversed edge out of the cache).
– Scan this cycle for any nodes which still have unvisited edges.
– If one exists, obtain another cycle starting from this node, and splice it back into the
original cycle.
– Continue this procedure until there are no more unvisited edges or some case condition is
violated.
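A compact sketch of the procedure above for a directed multigraph follows. It uses the common stack-based formulation, which performs the cycle splicing implicitly instead of scanning and pasting explicitly; the function name and the error handling are assumptions for illustration.

```python
from collections import defaultdict


def hierholzer(edges, start=None):
    """Return an Eulerian circuit, as a list of nodes, of a directed multigraph."""
    if not edges:
        return []
    adj = defaultdict(list)  # unvisited outgoing edges per node
    in_degree = defaultdict(int)
    for u, v in edges:
        adj[u].append(v)
        in_degree[v] += 1
    nodes = set(adj) | set(in_degree)
    if any(len(adj[node]) != in_degree[node] for node in nodes):
        raise ValueError("every node needs equal in-degree and out-degree")

    if start is None:
        start = edges[0][0]
    stack, circuit = [start], []
    while stack:
        node = stack[-1]
        if adj[node]:
            stack.append(adj[node].pop())  # follow and consume an unvisited edge
        else:
            circuit.append(stack.pop())    # dead end: this node is final in the circuit
    circuit.reverse()
    if len(circuit) != len(edges) + 1:
        raise ValueError("graph is not connected: some edges were never reached")
    return circuit
```

Which of the many valid circuits comes back depends only on the order in which edges were inserted, so two runs over differently ordered input can reassemble different sequences.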
Accuracy Tests via Fleury/Hierholzer
■ Results for test string 1 :
– 20-40% range, with average of 30%.
■ For virtually all other strings :
– 25%.
Accuracy Tests for Separate
Fleury/Hierholzer Algorithms
Both algorithms display volatility in the rounds of accuracy tests when tested
separately.
In some sets of rounds, the individual algorithms reached 100% accuracy, and in other sets of rounds,
0%.
Hierholzer’s algorithm, in particular, was quite volatile for graphs with complex overlapping cycles.
Perhaps it was the implementation…
Graph of sequence:
AACTAACTATATG
Results of Previous String’s Eulerian
Reassembly
Individual Fleury’s :
~15%
Individual Hierholzer’s :
~ 0%
Average (both):
~6%
Flaws of Fleury’s and Hierholzer’s
Where is the start ? Is there any guarantee that the algorithm’s
calculated start is the actual start ?
Perhaps with some more background information, the accuracy of
these algorithms can be amplified.
Ex :
A new subspecies of algae has been discovered.
Using the genomic sequences of related subspecies, use “ATGTGCGTGCAGAG” as start sequence.
Decisions, decisions… why does algorithm need to choose its own
destiny ?! Why is fairness, in this case, unfair ?!?!
An Approach Towards Finding All Eulerian
Cycles
A carefully constructed recurrence relation
General idea :
From some arbitrary first possible node, obtain a path, and record every index in this path where alternative
options exist into a list : indices .
Add this path to PATHS_CACHE, then iterate backwards through indices and obtain a different path at each
alternative-options index.
After all possible paths from this first possible node have been obtained, repeat the above
procedure for all other first possible nodes.
Some variants of Fleury’s and Hierholzer’s are recommended.
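For comparison, a plain backtracking enumeration is sketched below. It is not the index-list recurrence described above but a simpler substitute: edges are tracked by a used/unused flag, so no unvisited-edges list has to be rebuilt when an alternative branch is explored. It runs in exponential time and is only practical for small graphs; all names are assumptions.

```python
def all_eulerian_circuits(edges, start):
    """Return every distinct node sequence that uses each directed edge once and returns to start."""
    out_edges = {}
    for index, (u, v) in enumerate(edges):
        out_edges.setdefault(u, []).append((index, v))
    used = [False] * len(edges)
    path = [start]
    found = set()

    def extend(node):
        if len(path) == len(edges) + 1:  # every edge has been consumed
            if node == start:
                found.add(tuple(path))
            return
        tried = set()
        for index, nxt in out_edges.get(node, []):
            if used[index] or nxt in tried:
                continue  # parallel edges to the same node give the same sequence
            tried.add(nxt)
            used[index] = True
            path.append(nxt)
            extend(nxt)
            path.pop()    # backtrack and free the edge for other branches
            used[index] = False

    extend(start)
    return sorted(found)
```

The recursion depth equals the number of edges, so long inputs would need an explicit stack or a raised recursion limit.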
Problems with Approach
Problems :
Each path reconstructed from the original may also have its own alternative-options list.
Where should these alternative-options lists be stored ? Oh, the memory !
The algorithms depend on retrieving edges from the unvisited edges list; how can these unvisited edges lists be reconstructed
for alternative paths ?
My implementation occasionally fails to produce a sequence of the proper length (unreliable !).
Some Wild Turns In this Venture
Each edge may have a weight. Perhaps one could tailor these
weights so that the Eulerian cycle steers towards one option more than
others at a specific point in time ?
Idea is related to traffic engineering.
I personally implemented Dijkstra’s Algorithm and Kruskal’s algorithm, but alas, I could not come up with a
theoretically sound idea in this respect.
Sorts based on attributes such as frequency and weight :
Merge-Sort
Can be used to sort not only standard Python types, but also classes (based on some given attribute)
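A minimal sketch of such a merge sort, taking a key function so that it works on class instances as well as built-in types. The `GraphEdge` dataclass and its fields are hypothetical stand-ins for whatever edge class is in use.

```python
from dataclasses import dataclass
from operator import attrgetter


@dataclass
class GraphEdge:  # hypothetical edge class, for illustration only
    source: str
    target: str
    weight: float = 1.0


def merge_sort(items, key=lambda item: item):
    """Return a new list sorted by key; the input is left untouched."""
    if len(items) <= 1:
        return list(items)
    middle = len(items) // 2
    left = merge_sort(items[:middle], key)
    right = merge_sort(items[middle:], key)
    merged, i, j = [], 0, 0
    while i < len(left) and j < len(right):
        if key(left[i]) <= key(right[j]):  # <= keeps equal keys in original order
            merged.append(left[i])
            i += 1
        else:
            merged.append(right[j])
            j += 1
    merged.extend(left[i:])
    merged.extend(right[j:])
    return merged
```

Sorting edges by weight is then `merge_sort(edges, key=attrgetter("weight"))`; in everyday Python code the built-in `sorted(edges, key=...)` does the same job.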
Some Wild Turns In this Venture (cont.)
Breadth-first search and Depth-first search :
The former starts from some source node and spreads its search outward level by level, visiting all adjacent nodes before
moving on to their adjacent nodes.
The latter starts from some source node and traverses as deep into the graph as possible before backtracking, recording time and
hop distance.
Application:
Can be used to parallelize sequence reassembly (useful for long samples).
For instance, if “ACTAGA” were 10002344 hops away from start sequence “ACCCCG”, then split the algorithm
into two processes : one starting from the start sequence, the other starting from “ACTAGA”, with 10002344
spaces between them.
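The hop distances used in the example above can be obtained with a breadth-first search, sketched here under the assumption that the graph is available as a dictionary from node value to a list of adjacent node values. `bfs_distances` is a hypothetical name.

```python
from collections import deque


def bfs_distances(adjacency, source):
    """Return the minimum number of hops from source to every reachable node."""
    distance = {source: 0}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for nxt in adjacency.get(node, []):
            if nxt not in distance:  # first visit is the shortest in hops
                distance[nxt] = distance[node] + 1
                queue.append(nxt)
    return distance
```

Note that this gives the shortest hop count, which equals the offset along the reassembled sequence only when the Eulerian path happens to follow a shortest route between the two nodes.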
One Needed Improvement
Better constructed test strings
Better storage of logs.