Decisions, decisions: Circuits of a graph

DECISIONS, DECISIONS:
OVERLAPPING TO ORDER
By Richard Pham
Question : how to obtain the correct
reassembled sequence via an Eulerian
Cycle?
■ Relevance to computational biology : reassembly
– Complex de Bruijn graphs have numerous Eulerian cycles, which means there are
numerous possible reassembled sequences, and hence a high probability of obtaining a
wrong sequence.
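■ Definition : Eulerian cycle
– A closed path which covers each edge of the graph exactly once and ends at the node where it started
■ Required conditions for guaranteed Eulerian cycle (connected graph) :
– Undirected : zero odd-degree nodes (exactly two odd-degree nodes give an Eulerian path, not a cycle)
– Directed : every node has in-degree equal to out-degree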
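A minimal sketch of such a check for the directed case, assuming the graph is given as a list of (u, v) pairs. The function name has_eulerian_cycle_directed is hypothetical. A connectivity test is included because balanced degrees alone are not enough when the edges fall into separate components.

```python
from collections import defaultdict

def has_eulerian_cycle_directed(edges):
    """True if in-degree == out-degree everywhere and all edges are connected."""
    out_deg = defaultdict(int)
    in_deg = defaultdict(int)
    neighbors = defaultdict(set)  # undirected view, used only for connectivity
    for u, v in edges:
        out_deg[u] += 1
        in_deg[v] += 1
        neighbors[u].add(v)
        neighbors[v].add(u)
    nodes = set(neighbors)
    if not nodes:
        return False
    if any(in_deg[n] != out_deg[n] for n in nodes):
        return False
    # Depth-first search over the undirected view of the graph.
    seen, stack = set(), [next(iter(nodes))]
    while stack:
        n = stack.pop()
        if n in seen:
            continue
        seen.add(n)
        stack.extend(neighbors[n] - seen)
    return seen == nodes
```

When every node is balanced, connectivity of the undirected view is sufficient, so no separate strong-connectivity test is needed.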
Prerequisites to efficient, accurate
reassembly algorithms
There are, of course, more ways to obtain an Eulerian cycle than the
standard ones in this presentation.
To keep the code as flexible as possible, create a Graph class along with a
GraphNode class and a GraphEdge class.
Prerequisites to Efficient, Accurate
Reassembly Algorithms (cont.)
Graph class should have an attribute called adjacency list.
Each element in the adjacency list should be in the form : [node_value, [adj1, adj2, …, adjN] ] where adjX is the value
of an adjacent node.
Highly useful for GET_NEXT() functions, such as obtaining the next safe edge.
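A minimal sketch of a Graph class holding an adjacency list in the [node_value, [adj1, …, adjN]] form. The method names add_node, add_edge and get_next, and the private _index lookup, are assumptions for illustration. This get_next simply takes any unused neighbor; a "safe edge" test would go in its place.

```python
class Graph:
    """Minimal directed multigraph backed by an adjacency list (sketch)."""

    def __init__(self):
        # Each element has the form [node_value, [adj1, adj2, ..., adjN]].
        self.adjacency_list = []
        self._index = {}  # node_value -> position in adjacency_list

    def add_node(self, value):
        if value not in self._index:
            self._index[value] = len(self.adjacency_list)
            self.adjacency_list.append([value, []])

    def add_edge(self, u, v):
        self.add_node(u)
        self.add_node(v)
        self.adjacency_list[self._index[u]][1].append(v)

    def get_next(self, u):
        """Remove and return some unused neighbor of u, or None if none is left."""
        adjacent = self.adjacency_list[self._index[u]][1]
        return adjacent.pop() if adjacent else None


g = Graph()
g.add_edge("AC", "CT")
g.add_edge("AC", "CG")
```

Parallel edges are kept here because a neighbor may appear in the list more than once.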
GraphNode :
Some recommended attributes :
Start (time)
Finish (time)
Color
Classification
In-degree
Out-degree
Distance (from source)
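One possible shape for such a node, sketched as a dataclass. The field types, the default values, the white/gray/black color convention and the is_balanced helper are assumptions, not part of the original design.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class GraphNode:
    value: str
    start: Optional[int] = None            # discovery time (depth-first search)
    finish: Optional[int] = None           # finishing time (depth-first search)
    color: str = "white"                   # assumed: white/gray/black visit state
    classification: Optional[str] = None   # free-form label; meaning assumed
    in_degree: int = 0
    out_degree: int = 0
    distance: Optional[int] = None         # hops from the source node

    def is_balanced(self) -> bool:
        """A balanced node is what a directed Eulerian cycle requires."""
        return self.in_degree == self.out_degree
```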
Aspects to This Problem
Parallel edges allowed.
Parallel edges not allowed.
Duplicate edges allowed.
Duplicate edges not allowed.
My approach focuses solely on the overlap graph (easier parsing and translation of
k-mer data).
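A rough illustration of building an overlap graph from k-mers, assuming that nodes are k-mers and that an edge a → b exists when the last k-1 characters of a equal the first k-1 characters of b. The helper names kmers and overlap_graph are hypothetical, and duplicate k-mers are collapsed here, which is only one of the variants listed above.

```python
def kmers(sequence, k):
    """All length-k substrings of sequence, in order, duplicates included."""
    return [sequence[i:i + k] for i in range(len(sequence) - k + 1)]

def overlap_graph(reads):
    """Map each distinct k-mer to the k-mers whose prefix equals its suffix."""
    unique = sorted(set(reads))
    return {a: [b for b in unique if a[1:] == b[:-1]] for a in unique}


graph = overlap_graph(kmers("AACTAACTATATG", 3))
```

In this graph the node "CTA" has two successors, "TAA" and "TAT", which is exactly the kind of branch point where a reassembly has to make a decision.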
Top-Level Steps
■ Create class<Graph> alongside class<GraphNode> and class<GraphEdge>.
■ (Optional) Create standard graph functions such as breadth-first search,
depth-first search, Dijkstra’s algorithm, and Kruskal’s algorithm, and
implement a standard merge-sort for sorting edges by weight.
■ Obtain Eulerian circuit via the two most popular methods :
– Fleury’s algorithm
– Hierholzer’s algorithm
■ Record the frequency of these circuits:
– Whichever circuit has the highest frequency is returned (a list is returned if multiple
circuits tie).
■ Failed steps :
– Obtain all Eulerian circuits via modification of Fleury’s algorithm and Hierholzer’s
algorithm.
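The frequency-recording step in the list above could be sketched as follows, assuming each run of either algorithm yields a circuit as a list of node values. The function name most_frequent_circuits is hypothetical.

```python
from collections import Counter

def most_frequent_circuits(circuits):
    """Return the circuit(s) seen most often; several entries mean a tie."""
    counts = Counter(tuple(c) for c in circuits)  # lists are not hashable
    if not counts:
        return []
    best = max(counts.values())
    return [list(c) for c, n in counts.items() if n == best]


runs = [["A", "B", "A"], ["A", "C", "A"], ["A", "B", "A"]]
winners = most_frequent_circuits(runs)
```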
Fleury’s Algorithm (undirected):
Overview
■ Initiate unvisited_edges cache.
■ Given some start node (if there are two odd-degree nodes, choose one of them), the next edge is one
incident to the start node.
■ Definition : bridge
– Given edge := (u, v), where u is the current node of the path, (u, v) is a bridge if removing it
disconnects the remaining unvisited edges, i.e. after crossing to v there is no way back to the unvisited edges left behind.
■ Choose next edge according to this policy:
1. If any non-bridge adjacent edge exists, choose it.
2. Otherwise, choose an arbitrary remaining edge.
■ Continue getting the next edge, updating the current node to the end point of said edge and popping
the edge out of the unvisited_edges cache, until no edges remain.
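A minimal sketch of the steps above for an undirected multigraph, assuming the input is a list of (u, v) pairs that already satisfies the degree conditions and that start is a valid start node. The bridge test compares how many nodes are reachable before and after temporarily removing the edge. All names are illustrative.

```python
from collections import defaultdict

def fleury(edges, start):
    """Trace an Eulerian trail (list of nodes) using Fleury's bridge rule."""
    adj = defaultdict(list)  # plays the role of the unvisited_edges cache
    for u, v in edges:
        adj[u].append(v)
        adj[v].append(u)

    def reachable_count(src):
        seen, stack = set(), [src]
        while stack:
            n = stack.pop()
            if n not in seen:
                seen.add(n)
                stack.extend(adj[n])
        return len(seen)

    def remove(u, v):
        adj[u].remove(v)
        adj[v].remove(u)

    def is_bridge(u, v):
        before = reachable_count(u)
        remove(u, v)
        after = reachable_count(u)
        adj[u].append(v)  # restore the edge
        adj[v].append(u)
        return after < before

    path, current = [start], start
    for _ in range(len(edges)):
        options = list(adj[current])  # copy: is_bridge reorders adj[current]
        # Prefer a non-bridge edge; otherwise fall back to a remaining edge.
        nxt = next((v for v in options if not is_bridge(current, v)), options[0])
        remove(current, nxt)
        path.append(nxt)
        current = nxt
    return path


bowtie = [("A", "B"), ("B", "C"), ("C", "A"), ("C", "D"), ("D", "E"), ("E", "C")]
trail = fleury(bowtie, "A")
```

Running a reachability search for every candidate edge keeps the sketch short but makes it slow on large graphs.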
Hierholzer’s Algorithm (directed):
Overview
■ Given that the graph passes the condition that every node has equal in-degree and out-degree, the algorithm is as follows :
– Initiate unvisited edges cache.
– Obtain a cycle starting at the source node (remember to pop each traversed edge out of the cache).
– Scan this cycle for any nodes which still have unvisited edges.
– If one exists, obtain another cycle starting from this node, and splice it back into the
original cycle.
– Continue this procedure until there are no more unvisited edges or some case condition is
violated.
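A minimal sketch for the directed case, assuming the graph is a list of (u, v) pairs that already satisfies the in-degree/out-degree condition. Note that this uses the common stack-based formulation, which performs the "obtain another cycle and splice it in" step implicitly rather than by scanning and pasting as described above. The function name is illustrative.

```python
def hierholzer(edges, source):
    """Return an Eulerian circuit of a directed graph as a list of nodes."""
    unvisited = {}  # node -> list of unused outgoing neighbors
    for u, v in edges:
        unvisited.setdefault(u, []).append(v)
        unvisited.setdefault(v, [])

    stack, circuit = [source], []
    while stack:
        node = stack[-1]
        if unvisited[node]:
            # Keep walking along an unused edge, popping it from the cache.
            stack.append(unvisited[node].pop())
        else:
            # Dead end: this node is finished, so emit it. Sub-cycles found
            # later are emitted before older nodes, which splices them in.
            circuit.append(stack.pop())
    return circuit[::-1]


example = [("A", "B"), ("B", "C"), ("C", "A"), ("B", "D"), ("D", "B")]
circuit = hierholzer(example, "A")
```

Which of several valid circuits comes out depends only on the order in which edges are popped, which is one way the "decision" problem shows up in practice.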
Accuracy Tests via Fleury/Hierholzer
■ Results for test string 1 :
– 20-40% range, with average of 30%.
■ For virtually all other strings:
– 25%.
Accuracy Tests for Separate
Fleury/Hierholzer Algorithms
Both algorithms display volatility across the rounds of accuracy tests when tested
separately.
In some sets of rounds, the individual algorithms reached 100% accuracy, and in other sets of rounds,
0%.
Hierholzer’s algorithm, in particular, was quite volatile for graphs with complex overlapping cycles.
Perhaps it was the implementation…
Graph of sequence:
AACTAACTATATG
Results of Previous String’s Eulerian
Reassembly
Individual Fleury’s : ~15%
Individual Hierholzer’s : ~0%
Average (both) : ~6%
Flaws of Fleury’s and Hierholzer’s
Where is the start ? Is there any guarantee that the algorithm’s
calculated start is the actual start ?
Perhaps with some more background information, the accuracy of
these algorithms can be amplified.
Ex :
A new subspecies of algae has been discovered.
Using the genomic sequences of related subspecies, use “ATGTGCGTGCAGAG” as the start sequence.
Decisions, decisions… why does algorithm need to choose its own
destiny ?! Why is fairness, in this case, unfair ?!?!
An Approach Towards Finding All Eulerian
Cycles
A carefully constructed recurrence relation
General idea :
From some arbitrary first possible node, obtain a path, and record every index in this path where alternative
options exist into a list : indices .
Add this path to PATHS_CACHE and, iterating backwards through indices, obtain a different path at each
alternative-options index.
After all possible paths from this first node have been obtained, repeat the above
procedure for all other possible first nodes.
Some variants of Fleury’s and Hierholzer’s are recommended.
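For comparison, a sketch of a different and simpler technique than the indices-based recurrence described above: plain recursive backtracking over a directed edge list. It is an assumption-laden illustration, not the approach attempted in this work. It takes exponential time in general, so it only suits small graphs. The function name is hypothetical.

```python
def all_eulerian_circuits(edges, start):
    """Yield every distinct Eulerian circuit from start as a list of nodes."""
    used = [False] * len(edges)
    path = [start]

    def extend(node, remaining):
        if remaining == 0:
            if node == start:
                yield list(path)  # copy, because path keeps changing
            return
        tried = set()  # parallel edges to the same node give the same path
        for i, (u, v) in enumerate(edges):
            if used[i] or u != node or v in tried:
                continue
            tried.add(v)
            used[i] = True
            path.append(v)
            yield from extend(v, remaining - 1)
            path.pop()        # undo the choice and try the next alternative
            used[i] = False

    yield from extend(start, len(edges))


figure_eight = [("A", "B"), ("B", "A"), ("A", "C"), ("C", "A")]
found = sorted(all_eulerian_circuits(figure_eight, "A"))
```

Here the alternatives at each branch point live on the call stack and the used flags are undone on the way back, so no separate list of alternative options has to be stored per path.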
Problems with Approach
Problems :
Each path reconstructed from the original may also have its own alternative options list.
Where to store these alternative options lists ? Oh, the memory ?!?!?
The algorithms depend on retrieving edges from a list of unvisited edges; how to reconstruct these unvisited edges lists
for alternative paths ?
Personal implementation occasionally does not produce a sequence of the proper length (unreliable !!).
Some Wild Turns In this Venture
Each edge may have a weight. Perhaps one could tune these
weights so the Eulerian cycle will steer towards one option more than
others at a specific point in time ?
Idea is related to traffic engineering.
I personally implemented Dijkstra’s Algorithm and Kruskal’s algorithm, but alas, I could not come up with a
theoretically sound idea in this respect.
Sorts based on attributes such as frequency and weight :
Merge-Sort
Can be used to sort not only standard Python types, but also instances of classes (based on some given attribute)
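A minimal merge-sort sketch that sorts by a caller-supplied attribute. The GraphEdge fields shown (source, target, weight) and the key parameter are assumptions for illustration.

```python
from dataclasses import dataclass

@dataclass
class GraphEdge:  # hypothetical edge record
    source: str
    target: str
    weight: float = 1.0

def merge_sort(items, key=lambda item: item):
    """Return a new list sorted ascending by key(item); equal keys keep order."""
    if len(items) <= 1:
        return list(items)
    mid = len(items) // 2
    left = merge_sort(items[:mid], key)
    right = merge_sort(items[mid:], key)
    merged, i, j = [], 0, 0
    while i < len(left) and j < len(right):
        # Taking from the left on ties is what makes the sort stable.
        if key(left[i]) <= key(right[j]):
            merged.append(left[i])
            i += 1
        else:
            merged.append(right[j])
            j += 1
    merged.extend(left[i:])
    merged.extend(right[j:])
    return merged


edge_list = [GraphEdge("A", "B", 3.0), GraphEdge("B", "C", 1.0), GraphEdge("C", "A", 2.0)]
by_weight = merge_sort(edge_list, key=lambda e: e.weight)
```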
Some Wild Turns In this Venture (cont.)
Breadth-first search and Depth-first search :
The former starts from some source node and spreads its search outward, visiting all adjacent nodes before
moving on to their adjacent nodes.
The latter starts from some source node and traverses into the deepest realms of the graph before backtracking, recording time and
hop distance.
Application:
Can be used to parallelize sequence reassembly (useful for long samples)
For instance, if “ACTAGA” was 10002344 hops away from start sequence “ACCCCG”, then split the algorithm
into two processes : one starting from the start sequence, the other starting from “ACTAGA”, with 10002344
spaces between them.
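A minimal breadth-first search sketch that records hop distances from a start node, assuming the graph is a dict mapping each node to a list of its successors. The function name and the toy k-mers are illustrative only.

```python
from collections import deque

def hop_distances(adjacency, source):
    """Return {node: fewest hops from source} for every reachable node."""
    distance = {source: 0}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for nxt in adjacency.get(node, []):
            if nxt not in distance:  # first visit is the shortest in hops
                distance[nxt] = distance[node] + 1
                queue.append(nxt)
    return distance


toy = {"ACC": ["CCG"], "CCG": ["CGA", "CGT"], "CGA": ["GAC"], "CGT": [], "GAC": []}
hops = hop_distances(toy, "ACC")
```

A split point for a second process could then be picked by looking up a node's entry in the returned dict.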
Needed Improvements
Better constructed test strings.
Better storage of logs.
