Phylogenetic diversity

Prioritize Organism Selection for the Genomic Encyclopedia Project to Optimize Phylogenetic Diversity
Dongying Wu
April 10, 2007
Special Thanks:
Sourav Chatterji
Jason Raymond
The lack of phylogenetic diversity is evident in the current whole-genome databases
Certain phyla have been heavily sampled; others have only sparse representatives
Many phyla have been ignored
[Figure labels: Proteobacteria, Firmicutes]

Gaps in the current genome data are obstacles to:
Getting the full picture of the “tree of life”
Understanding the full range of ecosystems and biological mechanisms
Anchoring metagenomic sequencing data
Solutions
Tree of Life
the Genomic Encyclopedia Project for Bacteria and Archaea
Greengenes ssu rRNAs: 134423 sequence entries
ATCC: 18000 strains in more than 750 genera
DSMZ: 13000 cultures representing 6900 species and 1400 genera (1207 bacteria and 77 archaea genera)
Prioritize Organism Selection to Optimize Phylogenetic Diversity
Phylogenetic diversity (PD): if T is a tree whose leaf labels comprise a set X of species, and whose edges have non-negative real-valued lengths, then for a subset Y of X, the PD score of Y is the sum of the lengths of the edges of the minimal subtree of T that connects Y.
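The definition above can be illustrated with a minimal Python sketch. It is not the Perl program described in these slides; the six-node example tree is the one used later in the deck, and the branch lengths are invented for illustration. The function name `pd_score` is hypothetical.

```python
from collections import defaultdict

# Hypothetical example tree: (node, node, branch length). Lengths are invented.
EDGES = [("A", "Node1", 1.0), ("B", "Node1", 2.0), ("Node1", "Node2", 0.5),
         ("C", "Node2", 1.5), ("D", "Node2", 3.0)]

def pd_score(edges, taxa):
    """Sum the lengths of the edges in the minimal subtree connecting `taxa`."""
    taxa = set(taxa)
    if len(taxa) < 2:
        return 0.0
    adj = defaultdict(list)
    for u, v, w in edges:
        adj[u].append((v, w))
        adj[v].append((u, w))
    total = 0.0

    def visit(node, parent):
        # Returns True if the part of the tree below `node` contains a taxon.
        nonlocal total
        found = node in taxa
        for nxt, w in adj[node]:
            if nxt != parent and visit(nxt, node):
                total += w      # this edge lies on a path between two taxa
                found = True
        return found

    # Rooting at a member of `taxa` means every counted edge separates two taxa.
    visit(next(iter(taxa)), None)
    return total

print(pd_score(EDGES, ["A", "C"]))  # A-Node1, Node1-Node2, Node2-C
```

The recursion is fine for small trees; a tree with tens of thousands of taxa would likely need an iterative traversal instead.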
Input:
A tree (optional: a subtree)
A number (N)
Output: A list of N taxa that gives the maximum PD for the subtree
Algorithm: Greedy Algorithm
Reference: Vincent Moulton, Charles Semple and Mike Steel, Optimizing phylogenetic diversity under constraints, Journal of Theoretical Biology, doi:10.1016/j.jtbi.2006.12.021, 2006
Take a tree and a subtree
Calculate the added PD for each taxon not yet in the subtree
Grow the subtree to the taxon that adds the maximum PD
Repeat the above steps N times; the resulting subtree is the one that gives the maximum PD under the imposed constraints
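The greedy steps above can be sketched in Python as follows. This is an illustrative sketch under stated assumptions, not the Perl implementation: the starting subtree is passed as a connected set of node names (here a single taxon), the function name `greedy_max_pd` is hypothetical, and the branch lengths are invented.

```python
from collections import defaultdict

# Hypothetical example tree: (node, node, branch length). Lengths are invented.
EDGES = [("A", "Node1", 1.0), ("B", "Node1", 2.0), ("Node1", "Node2", 0.5),
         ("C", "Node2", 1.5), ("D", "Node2", 3.0)]

def greedy_max_pd(edges, start_nodes, n):
    """Grow the subtree n times; return [(taxon, added PD), ...] in pick order.

    `start_nodes` must be a connected set of nodes (assumption of this sketch).
    """
    adj = defaultdict(list)
    for u, v, w in edges:
        adj[u].append((v, w))
        adj[v].append((u, w))
    leaves = {node for node in adj if len(adj[node]) == 1}
    subtree = set(start_nodes)
    picks = []
    for _ in range(n):
        # Distance from the subtree to every other node, with back-pointers.
        dist = {node: 0.0 for node in subtree}
        prev = {}
        stack = list(subtree)
        while stack:
            node = stack.pop()
            for nxt, w in adj[node]:
                if nxt not in dist:
                    dist[nxt] = dist[node] + w
                    prev[nxt] = node
                    stack.append(nxt)
        candidates = [t for t in leaves if t not in subtree]
        if not candidates:
            break
        # The added PD of a taxon is its distance to the current subtree.
        best = max(candidates, key=lambda t: (dist[t], t))
        picks.append((best, dist[best]))
        # Grow the subtree along the path that joins the chosen taxon to it.
        node = best
        while node not in subtree:
            subtree.add(node)
            node = prev[node]
    return picks

print(greedy_max_pd(EDGES, {"A"}, 2))
```

Recomputing all distances on every round keeps the sketch short; it costs one full traversal per pick.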
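Gory Details
How is the tree structure stored in Perl?
Two-dimensional matrix.
Example tree: leaves A and B attach to Node1, leaves C and D attach to Node2, and Node1 connects to Node2.
Matrix rows and columns: A, B, C, D, Node1, Node2; a 1 marks each pair of directly connected nodes.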
Build Subtree: Based upon Indexed Paths
Choose any taxon from the original subtree as a reference taxon, and index all the paths that connect the reference taxon to the other taxa.
A is the reference taxon
B: B, Node1, A
C: C, Node2, Node1, A
D: D, Node2, Node1, A
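The path index for the example tree can be reproduced with a short Python sketch. The edge-list representation and the function name `index_paths` are assumptions made for illustration; the slides store the tree as a matrix in Perl.

```python
from collections import defaultdict

# Example tree from the slides, without branch lengths.
EDGES = [("A", "Node1"), ("B", "Node1"), ("Node1", "Node2"),
         ("C", "Node2"), ("D", "Node2")]

def index_paths(edges, reference):
    """Map each taxon (leaf) to the list of nodes leading back to `reference`."""
    adj = defaultdict(list)
    for u, v in edges:
        adj[u].append(v)
        adj[v].append(u)
    # Traverse outward from the reference, remembering each node's predecessor.
    prev = {reference: None}
    stack = [reference]
    while stack:
        node = stack.pop()
        for nxt in adj[node]:
            if nxt not in prev:
                prev[nxt] = node
                stack.append(nxt)
    paths = {}
    for node in adj:
        if len(adj[node]) == 1 and node != reference:   # leaves only
            path = [node]
            while path[-1] != reference:
                path.append(prev[path[-1]])
            paths[node] = path
    return paths

for taxon, path in sorted(index_paths(EDGES, "A").items()):
    print(f"{taxon}: {', '.join(path)}")
```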
Build subtree: combine the paths
Subtree A-B-C:
B: B, Node1, A
C: C, Node2, Node1, A
Calculate and grow subtree: Follow each path
Calculate added PD if subtree grows to D:
D, Node2, Node1, A
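Combining paths and following a path to get the added PD might look like the sketch below. The branch lengths are invented, and the names `subtree_nodes` and `added_pd` are hypothetical. The sketch assumes the reference taxon A belongs to the subtree, so the walk along a path can stop at the first node that is already in the subtree.

```python
# Indexed paths for the example tree, with A as the reference taxon.
PATHS = {
    "B": ["B", "Node1", "A"],
    "C": ["C", "Node2", "Node1", "A"],
    "D": ["D", "Node2", "Node1", "A"],
}
# Hypothetical branch lengths, keyed by the unordered pair of end nodes.
LENGTHS = {
    frozenset(("A", "Node1")): 1.0,
    frozenset(("B", "Node1")): 2.0,
    frozenset(("Node1", "Node2")): 0.5,
    frozenset(("C", "Node2")): 1.5,
    frozenset(("D", "Node2")): 3.0,
}

def subtree_nodes(paths, taxa, reference="A"):
    """Combine the indexed paths of `taxa` into the node set of the subtree."""
    nodes = {reference}
    for taxon in taxa:
        nodes.update(paths[taxon])
    return nodes

def added_pd(paths, lengths, subtree, taxon):
    """Follow the taxon's path until it meets the subtree, summing lengths."""
    total = 0.0
    path = paths[taxon]
    for here, nxt in zip(path, path[1:]):
        if here in subtree:
            break               # the rest of the path is already in the subtree
        total += lengths[frozenset((here, nxt))]
    return total

abc = subtree_nodes(PATHS, ["B", "C"])        # subtree A-B-C
print(sorted(abc), added_pd(PATHS, LENGTHS, abc, "D"))
```

With these lengths, growing subtree A-B-C to D adds only the D to Node2 branch, because Node2 is already part of the subtree.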
If no starting subtree is defined, the program will identify the longest path as the starting subtree
Step 1: pick any taxon and identify the taxon farthest from it
Step 2: starting from the farthest taxon identified in step 1, identify the longest path from it.
This is the longest path for the whole tree.
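The two-step search for the longest path can be sketched in Python as two traversals. This is a sketch under assumptions, not the original Perl code: the helper names `farthest_from` and `longest_path` are hypothetical and the branch lengths are invented.

```python
from collections import defaultdict

# Hypothetical example tree: (node, node, branch length). Lengths are invented.
EDGES = [("A", "Node1", 1.0), ("B", "Node1", 2.0), ("Node1", "Node2", 0.5),
         ("C", "Node2", 1.5), ("D", "Node2", 3.0)]

def farthest_from(adj, start):
    """Return (farthest node, its distance, path from that node back to start)."""
    dist = {start: 0.0}
    prev = {start: None}
    stack = [start]
    while stack:
        node = stack.pop()
        for nxt, w in adj[node]:
            if nxt not in dist:
                dist[nxt] = dist[node] + w
                prev[nxt] = node
                stack.append(nxt)
    end = max(dist, key=dist.get)
    path = [end]
    while prev[path[-1]] is not None:
        path.append(prev[path[-1]])
    return end, dist[end], path

def longest_path(edges):
    adj = defaultdict(list)
    for u, v, w in edges:
        adj[u].append((v, w))
        adj[v].append((u, w))
    any_taxon = edges[0][0]                            # step 1: pick any taxon
    first_end, _, _ = farthest_from(adj, any_taxon)    # ... and find the farthest one
    _, length, path = farthest_from(adj, first_end)    # step 2: search again from it
    return path, length

print(longest_path(EDGES))
```

With non-negative branch lengths, the farthest node from any starting point is an endpoint of a longest path, which is why the second traversal is enough.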
Run the program:
On Bobcat:
/home/dwu/dwu_scripts/public_scripts/maxPD.pl -t input_tree -n number -o output -l input_list(optional)
-i: input tree
-n: the number of taxa the user needs in the output list
-o: output
-l: input list; the user can define a list of taxa that must be included in the PD calculations (for example, species the user has to include)
-gml: yes or no; option to output a GML file
Output Format:
Taxon ID    PD Addition to the subtree
ID00032     2.3960
ID99033     0.6701
ID23890     0.5024
Results Visualization
Free software to visualize network/tree structure:
yEd
http://www.yworks.com/en/products_yed_about.htm
GML Input format:
graph [
node [id 1 label "A" graphics [ w 50 h 50 type "circle" fill "#AA0000"]]
node [id 2 label "B" graphics [ w 50 h 50 type "circle" fill "#666666"]]
node [id 3 label "C" graphics [ w 50 h 50 type "circle" fill "#AA0000"]]
node [id 4 label "D" graphics [ w 50 h 50 type "circle" fill "#666666"]]
node [id 5 label "node1" graphics [ w 3 h 3 type "circle" fill "#666666"]]
node [id 6 label "node2" graphics [ w 3 h 3 type "circle" fill "#666666"]]
edge [source 1 target 5 graphics [ fill "#AA0000" width 4 ]]
edge [source 2 target 5 graphics [ fill "#666666" width 4 ]]
edge [source 5 target 6 graphics [ fill "#AA0000" width 4 ]]
edge [source 6 target 3 graphics [ fill "#AA0000" width 4 ]]
edge [source 6 target 4 graphics [ fill "#666666" width 4 ]]
]
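A GML file in the layout shown above could be generated with a small Python sketch like the one below. The function name `to_gml` and its arguments are hypothetical. The coloring rule (red for selected taxa and for edges inside the selected subtree, grey otherwise, small grey circles for internal nodes) is inferred from the example and may not match the original program exactly.

```python
def to_gml(leaves, internal, edges, subtree):
    """Render a tree as GML text, highlighting the nodes and edges in `subtree`."""
    names = list(leaves) + list(internal)
    ids = {name: i for i, name in enumerate(names, start=1)}
    red, grey = "#AA0000", "#666666"
    lines = ["graph ["]
    for name in names:
        is_leaf = name in leaves
        size = 50 if is_leaf else 3           # internal nodes are drawn as dots
        fill = red if (is_leaf and name in subtree) else grey
        lines.append(f'node [id {ids[name]} label "{name}" graphics '
                     f'[ w {size} h {size} type "circle" fill "{fill}"]]')
    for u, v in edges:
        # In a tree, an edge whose two ends are in a connected subtree is in it.
        fill = red if (u in subtree and v in subtree) else grey
        lines.append(f'edge [source {ids[u]} target {ids[v]} graphics '
                     f'[ fill "{fill}" width 4 ]]')
    lines.append("]")
    return "\n".join(lines)

gml = to_gml(
    leaves=["A", "B", "C", "D"],
    internal=["node1", "node2"],
    edges=[("A", "node1"), ("B", "node1"), ("node1", "node2"),
           ("node2", "C"), ("node2", "D")],
    subtree={"A", "C", "node1", "node2"},
)
print(gml)
```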
Select 300 out of 30000 based upon an ssu rRNA neighbor-joining tree
Plot 1: Y axis: added PD; X axis: added taxon (30000 picks / 30000 taxa)
Plot 2: Y axis: PD of subtree; X axis: added taxon (30000 picks / 30000 taxa)