DECISIONS, DECISIONS: OVERLAPPING TO ORDER
By Richard Pham

Question: how to obtain the correct reassembled sequence via an Eulerian cycle?
■ Relevance to computational biology: reassembly
– Complex De Bruijn graphs have numerous Eulerian cycles, which means there are numerous candidate reassembled sequences, and hence a high probability of obtaining a wrong sequence.
■ Definition: Eulerian cycle
– A closed walk that traverses each edge of the graph exactly once and ends at the node where it started.
■ Required conditions for a guaranteed Eulerian cycle (connected graph):
– Undirected: zero odd-degree nodes. Exactly two odd-degree nodes guarantee an Eulerian path (not a cycle) that starts at one of them and ends at the other.
– Directed: every node has in-degree equal to out-degree.

Prerequisites to Efficient, Accurate Reassembly Algorithms
There are, of course, more ways to obtain an Eulerian cycle than the standard ones in this presentation.
To keep the code as flexible as possible, create a Graph class alongside a GraphNode class and a GraphEdge class.

Prerequisites to Efficient, Accurate Reassembly Algorithms (cont.)
The Graph class should have an attribute called the adjacency list.
Each element of the adjacency list should be of the form: [node_value, [adj1, adj2, …, adjN]], where adjX is the value of an adjacent node.
Highly useful for GET_NEXT() functions, such as obtaining the next safe edge.
GraphNode, some recommended attributes: start (time), finish (time), color, classification, in-degree, out-degree, distance (from source).

Aspects to This Problem
Parallel edges allowed.
Parallel edges not allowed.
Duplicate edges allowed.
Duplicate edges not allowed.
My approach focuses solely on the overlap graph (easier parsing and translation of k-mer data).

Top-Level Steps
■ Create class<Graph> alongside class<GraphNode> and class<GraphEdge>.
■ (unnecessary) Create standard traversal functions such as breadth-first search, depth-first search, Dijkstra’s algorithm, and Kruskal’s algorithm, and implement a standard merge sort for sorting edges by weight.
■ Obtain an Eulerian circuit via the two most popular methods:
– Fleury’s algorithm
– Hierholzer’s algorithm
■ Record the frequency of these circuits:
– Whichever circuit has the highest frequency is returned (returns a list if multiple circuits tie).
■ Failed steps:
– Obtain all Eulerian circuits via modification of Fleury’s algorithm and Hierholzer’s algorithm.

Fleury’s Algorithm (undirected): Overview
■ Initiate the unvisited_edges cache.
■ Given some start node (if there are two odd-degree nodes, choose one of them), the next edge is one adjacent to the start node.
■ Definition: bridge
– Given edge := (u, v), where u is the current node of the path, (u, v) is a bridge if, once it is removed, no path of unvisited edges leads from v back to u.
■ Choose the next edge according to this policy:
1. If any non-bridge adjacent edge exists, choose it.
2. Else, choose an arbitrary remaining edge (necessarily a bridge).
■ Continue getting the next edge, updating the current node to the end point of said edge, and popping the edge out of the unvisited_edges cache until no edges remain.

Hierholzer’s Algorithm (directed): Overview
■ Given that the graph passes the condition that every node has equal in-degree and out-degree, the algorithm is as follows:
– Initiate the unvisited edges cache.
– Obtain a cycle starting at the source node (remember to pop each traversed edge out of the cache).
– Scan this cycle for any nodes which still have unvisited edges.
– If one exists, obtain another cycle starting from that node and splice it back into the original cycle.
– Continue this procedure until no unvisited edges remain or some case condition is violated.

Accuracy Tests via Fleury/Hierholzer
■ Results for test string 1:
– 20-40% range, with an average of 30%.
■ For virtually all other strings:
– 25%.

Accuracy Tests for Separate Fleury/Hierholzer Algorithms
Both algorithms display volatility across rounds of accuracy tests when tested separately. In some sets of rounds the individual algorithms reached 100% accuracy, and in other sets 0%. Hierholzer’s algorithm, in particular, was quite volatile for graphs with complex overlapping cycles.
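
For reference, the Hierholzer overview above can be sketched as follows. This is a minimal sketch and not the implementation that was tested: the function name hierholzer_cycle and the plain-dict input (node -> list of successor nodes) are assumptions made for illustration, standing in for the Graph class. It uses the common stack-based formulation, in which the "find another cycle and splice it in" step happens implicitly as the stack unwinds.

```python
from collections import defaultdict


def hierholzer_cycle(adjacency, start):
    """Return one Eulerian cycle as a list of nodes, or None if none exists.

    adjacency: dict mapping node -> list of successor nodes (hypothetical
    input format; parallel edges are written as repeated successors).
    """
    # Precondition from the overview: in-degree == out-degree at every node.
    in_deg = defaultdict(int)
    out_deg = defaultdict(int)
    for node, successors in adjacency.items():
        out_deg[node] += len(successors)
        for nxt in successors:
            in_deg[nxt] += 1
    if any(in_deg[n] != out_deg[n] for n in set(in_deg) | set(out_deg)):
        return None

    # The "unvisited edges cache": a private copy we can pop from.
    unvisited = {node: list(succ) for node, succ in adjacency.items()}
    total_edges = sum(len(succ) for succ in unvisited.values())

    stack = [start]
    cycle = []
    while stack:
        node = stack[-1]
        if unvisited.get(node):
            # Follow (and consume) one unvisited edge out of this node.
            stack.append(unvisited[node].pop())
        else:
            # Dead end: this node is finished, so emit it. Nodes that still
            # had spare edges get their extra sub-cycles emitted first,
            # which is the splice step in disguise.
            cycle.append(stack.pop())
    cycle.reverse()

    # If some edges were unreachable from start, the walk is too short.
    if len(cycle) != total_edges + 1:
        return None
    return cycle
```

Note that the order in which edges are popped decides which of the many valid cycles comes back, so two correct implementations can return different cycles for the same graph.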
Perhaps it was the implementation…

Graph of sequence: AACTAACTATATG

Results of Previous String’s Eulerian Reassembly
Individual Fleury’s: ~15%
Individual Hierholzer’s: ~0%
Average (both): ~6%

Flaws of Fleury’s and Hierholzer’s
Where is the start? Is there any guarantee that the algorithm’s calculated start is the actual start?
Perhaps with some more background information, the accuracy of these algorithms can be improved.
Ex: A new subspecies of algae has been discovered. Using the genomic sequences of related subspecies, use “ATGTGCGTGCAGAG” as the start sequence.
Decisions, decisions… why does the algorithm need to choose its own destiny?! Why is fairness, in this case, unfair?!

An Approach Towards Finding All Eulerian Cycles
A carefully constructed recurrence relation.
General idea: from some arbitrary first possible node, obtain a path, and record every index in this path where alternative options exist into a list, indices.
Add this path to PATHS_CACHE and, iterating backwards through indices, obtain a different path at each alternative-options index.
After all possible paths from this first node have been obtained, repeat the procedure for all other possible first nodes.
Some variants of Fleury’s and Hierholzer’s are recommended.

Problems with Approach
Each reconstructed path may itself have another alternative-options list. Where should these lists be stored? Oh, the memory!
The algorithms depend on retrieving from a list of unvisited edges; how should these unvisited-edge lists be reconstructed for alternative paths?
Personal implementation occasionally does not produce a sequence of the proper length (unreliable!).

Some Wild Turns In this Venture
Each edge may have a weight. Perhaps one could tune these weights so that the Eulerian cycle steers towards one option more than others at a specific point in time? The idea is related to traffic engineering.
I personally implemented Dijkstra’s algorithm and Kruskal’s algorithm, but alas, I could not come up with a theoretically sound idea in this respect.
Sorts based on attributes such as frequency and weight: merge sort. It can sort not only standard Python types but also class instances (keyed on some given attribute).

Some Wild Turns In this Venture (cont.)
Breadth-first search and depth-first search: the former starts from some source node and spreads its search outward level by level, from each visited node to all of its adjacent nodes, which yields hop distances from the source. The latter starts from some source node and follows each branch as deep as it goes before backtracking, recording start and finish times.
Application: can be used to parallelize sequence reassembly (useful for long samples).
For instance, if “ACTAGA” is 10002344 hops away from the start sequence “ACCCCG”, then split the algorithm into two processes: one starting from the start sequence, the other starting from “ACTAGA”, with 10002344 positions between them.

Needed Improvements
Better-constructed test strings.
Better storage of logging.
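
Returning to the idea of finding all Eulerian paths: one possible way around the storage problems listed under "Problems with Approach" is plain recursive backtracking. The sketch below assumes a De Bruijn-style graph whose edges are the k-mers of the sequence and whose nodes are their (k-1)-mer prefixes and suffixes. The names kmers and all_reassemblies, the choice k = 3, and the ACGT alphabet are assumptions for illustration only. The call stack plays the role of the alternative-options lists, and each edge is put back into the unvisited multiset when the recursion returns, so no per-path copy of the cache is needed.

```python
from collections import Counter


def kmers(sequence, k):
    """All overlapping substrings of length k, in order of appearance."""
    return [sequence[i:i + k] for i in range(len(sequence) - k + 1)]


def all_reassemblies(sequence, k):
    """Every distinct string spelled by a path using each k-mer exactly once.

    Assumes k >= 2 and a DNA alphabet. Running time can grow exponentially
    with the number of branch points, so this is only for short test strings.
    """
    edges = Counter(kmers(sequence, k))   # multiset of unvisited edges
    total = sum(edges.values())
    results = set()

    def extend(text, remaining):
        if remaining == 0:
            results.add(text)             # every edge used: one reassembly
            return
        suffix = text[-(k - 1):]          # current node
        for base in "ACGT":
            edge = suffix + base
            if edges[edge] > 0:
                edges[edge] -= 1          # visit the edge
                extend(text + base, remaining - 1)
                edges[edge] += 1          # undo on the way back

    # Try every distinct k-mer as the first edge of the path.
    for first in list(edges):
        edges[first] -= 1
        extend(first, total - 1)
        edges[first] += 1
    return sorted(results)


if __name__ == "__main__":
    candidates = all_reassemblies("AACTAACTATATG", 3)
    print(len(candidates), "candidate reassemblies")
    for text in candidates:
        print(text)
```

Under these assumptions the test string AACTAACTATATG yields more than one candidate, all of the same length and with the same k-mer content. An algorithm that picks its next edge arbitrarily therefore has no information for preferring the original over the others.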