Prioritize Organism Selection for the Genomic Encyclopedia Project to Optimize Phylogenetic Diversity
Dongying Wu
April 10, 2007
Special Thanks: Sourav Chatterji, Jason Raymond

The lack of phylogenetic diversity is evident in the current whole-genome databases
- Certain phyla have been heavily sampled
- Others have only sparse representatives
- Many phyla have been ignored

The gaps in the current genome data are obstacles to:
- Getting the full picture of the "tree of life"
- Understanding the full range of ecosystems and biological mechanisms
- Anchoring metagenomic sequencing data
[Figure labels: Proteobacteria, Firmicutes]

Solutions
- Tree of Life
- The Genomic Encyclopedia Project for Bacteria and Archaea
- Greengenes ssu rRNAs: 134423 sequence entries
- ATCC: 18000 strains in more than 750 genera
- DSMZ: 13000 cultures representing 6900 species and 1400 genera (1207 bacteria and 77 archaea genera)

Prioritize Organism Selection to Optimize Phylogenetic Diversity
Phylogenetic diversity (PD): if T is a tree whose leaf labels comprise a set X of species and whose edges have non-negative real-valued lengths, then for a subset Y of X, the PD score of Y is the sum of the lengths of the edges of the minimal subtree of T that connects Y.
Input:
1. A tree (optional: a subtree)
2. A number (N)
Output: a list of N taxa that gives the maximum PD for the subtree
Algorithm: greedy algorithm
Reference: Vincent Moulton, Charles Semple and Mike Steel, Optimizing phylogenetic diversity under constraints, Journal of Theoretical Biology, doi:10.1016/j.jtbi.2006.12.021, 2006

Greedy algorithm
- Take a tree and a subtree
- Calculate the PD that each taxon would add to the subtree
- Grow the subtree to the taxon that adds the maximum PD
- Repeat the above steps N times; the resulting subtree is the one that gives the maximum PD under the imposed constraints

Gory Details
How is the tree structure stored in Perl? As a two-dimensional matrix.
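As a rough illustration of the two-dimensional matrix idea (written in Python, not taken from the Perl program itself), the sketch below stores a small tree as a square matrix indexed by node name. The node names follow the six-node example used in these slides; the branch lengths of 1.0 and the helper names `build_matrix` and `neighbors` are assumptions made for the example.

```python
# Six-node example tree: leaves A-D, internal nodes Node1 and Node2.
NODES = ["A", "B", "C", "D", "Node1", "Node2"]
# Branch lengths of 1.0 are assumed purely for illustration.
EDGES = [("A", "Node1", 1.0), ("B", "Node1", 1.0), ("Node1", "Node2", 1.0),
         ("C", "Node2", 1.0), ("D", "Node2", 1.0)]


def build_matrix(nodes, edges):
    """Return (index, matrix); matrix[i][j] holds a branch length or None."""
    index = {name: i for i, name in enumerate(nodes)}
    matrix = [[None] * len(nodes) for _ in nodes]
    for u, v, length in edges:
        matrix[index[u]][index[v]] = length
        matrix[index[v]][index[u]] = length  # undirected tree: keep it symmetric
    return index, matrix


def neighbors(node, nodes, index, matrix):
    """List the nodes that share an edge with `node`."""
    row = matrix[index[node]]
    return [nodes[j] for j, length in enumerate(row) if length is not None]


INDEX, MATRIX = build_matrix(NODES, EDGES)
```

A matrix of this kind makes edge lookups trivial, at the cost of memory that grows with the square of the number of nodes.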
[Figure: example tree with leaves A, B, C, D and internal nodes Node1, Node2]
[Matrix: rows and columns labeled A, B, C, D, Node1, Node2; five cells contain 1]

Build Subtree: Based on Indexed Paths
Choose any taxon from the original subtree as a reference taxon, and index all the paths that connect the reference taxon to the other taxa.
[Figure: example tree with leaves A, B, C, D and internal nodes Node1, Node2]
A is the reference taxon.
B: B, Node1, A
C: C, Node2, Node1, A
D: D, Node2, Node1, A

Build subtree: combine the paths
Subtree A-B-C:
B: B, Node1, A
C: C, Node2, Node1, A

Calculate and grow subtree: follow each path
Calculate the added PD if the subtree grows to D:
D: D, Node2, Node1, A

If no starting subtree is defined, the program identifies the longest path and uses it as the starting subtree.
Step 1: pick any taxon and identify the taxon farthest from it.
Step 2: starting from the farthest taxon found in step 1, identify the longest path. It is the longest path for the whole tree.
[Figure: two copies of the example tree with leaves A, B, C, D]

Run the program
On Bobcat:
/home/dwu/dwu_scripts/public_scripts/maxPD.pl -t input_tree -n number -o output -l input_list(optional)
-i: input tree
-n: the number of taxa that the user needs in the output list
-o: output
-l: input list; the user can define a list of taxa that must be included in the PD calculations (for example, species the user has to include)
-gml: yes or no, option to output a GML file

Output Format:
Taxon ID    PD addition to the subtree
ID00032     2.3960
ID99033     0.6701
ID23890     0.5024

Results Visualization
Free software to visualize network/tree structure: yEd
http://www.yworks.com/en/products_yed_about.htm

GML input format:
graph [
node [id 1 label "A" graphics [ w 50 h 50 type "circle" fill "#AA0000"]]
node [id 2 label "B" graphics [ w 50 h 50 type "circle" fill "#666666"]]
node [id 3 label "C" graphics [ w 50 h 50 type "circle" fill "#AA0000"]]
node [id 4 label "D" graphics [ w 50 h 50 type "circle" fill "#666666"]]
node [id 5 label "node1" graphics [ w 3 h 3 type "circle" fill "#666666"]]
node [id 6 label "node2" graphics [ w 3 h 3 type "circle" fill "#666666"]]
edge [source 1 target 5 graphics [ fill "#AA0000" width 4 ]]
edge [source 2 target 5 graphics [ fill "#666666" width 4 ]]
edge [source 5 target 6 graphics [ fill "#AA0000" width 4 ]]
edge [source 6 target 3 graphics [ fill "#AA0000" width 4 ]]
edge [source 6 target 4 graphics [ fill "#666666" width 4 ]]
]

Select 300 out of 30000 taxa based on an ssu rRNA neighbor-joining tree
[Plot: added PD (Y) versus added taxon (X); 30000 picks from 30000 taxa]
[Plot: PD of subtree (Y) versus added taxon (X); 30000 picks from 30000 taxa]
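The greedy selection and the path-indexing steps described in these slides can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the code of maxPD.pl: the function names (`index_paths`, `added_pd`, `longest_path_taxa`, `greedy_max_pd`), the branch lengths in the example tree, the use of nested dictionaries instead of a matrix, and the reading of N as the number of greedy additions are all choices made for this example.

```python
from collections import defaultdict


def build_adjacency(edges):
    """Turn (node, node, length) triples into a nested dict of branch lengths."""
    adj = defaultdict(dict)
    for u, v, length in edges:
        adj[u][v] = length
        adj[v][u] = length
    return adj


def index_paths(adj, reference):
    """Map each node to (distance, path); path runs from the node back to reference."""
    index = {reference: (0.0, [reference])}
    stack = [reference]
    while stack:
        node = stack.pop()
        dist, path = index[node]
        for nxt, length in adj[node].items():
            if nxt not in index:
                index[nxt] = (dist + length, [nxt] + path)
                stack.append(nxt)
    return index


def added_pd(adj, path, covered):
    """Sum the lengths of the edges on path that are not yet part of the subtree."""
    return sum(adj[u][v] for u, v in zip(path, path[1:])
               if frozenset((u, v)) not in covered)


def longest_path_taxa(adj, taxa):
    """Two-step search: farthest taxon from any taxon, then farthest from that one."""
    first = index_paths(adj, taxa[0])
    far = max(taxa, key=lambda t: first[t][0])
    second = index_paths(adj, far)
    other = max(taxa, key=lambda t: second[t][0])
    return [far, other]


def greedy_max_pd(edges, taxa, n, start=None):
    """Add n taxa greedily; return (all chosen taxa, [(taxon, added PD), ...])."""
    adj = build_adjacency(edges)
    chosen = list(start) if start else longest_path_taxa(adj, taxa)
    index = index_paths(adj, chosen[0])  # chosen[0] acts as the reference taxon
    covered = set()

    def grow(taxon):
        path = index[taxon][1]
        covered.update(frozenset(edge) for edge in zip(path, path[1:]))

    for taxon in chosen:
        grow(taxon)
    picks = []
    for _ in range(n):
        candidates = [t for t in taxa if t not in chosen]
        if not candidates:
            break
        gains = {t: added_pd(adj, index[t][1], covered) for t in candidates}
        best = max(candidates, key=gains.get)
        picks.append((best, gains[best]))
        chosen.append(best)
        grow(best)
    return chosen, picks


# Example tree from the slides; the branch lengths are invented for illustration.
EDGES = [("A", "Node1", 1.0), ("B", "Node1", 0.5), ("Node1", "Node2", 1.0),
         ("C", "Node2", 2.0), ("D", "Node2", 0.7)]
TAXA = ["A", "B", "C", "D"]
```

Calling `greedy_max_pd(EDGES, TAXA, 1, start=["A", "B", "C"])` mirrors the slide that grows subtree A-B-C to D; omitting `start` falls back to the longest path as the starting subtree. Because the subtree always contains the reference taxon, the uncovered edges on a taxon's indexed path are exactly the edges that taxon would add.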