DendroPy Tutorial
Release 3.12.1
Jeet Sukumaran and Mark T. Holder
March 22, 2014
CHAPTER 1
Phylogenetic Data in DendroPy
1.1 Introduction to Phylogenetic Data Objects
1.1.1 Types of Phylogenetic Data Objects
Phylogenetic data in DendroPy is represented by one or more objects of the following classes:
Taxon A representation of an operational taxonomic unit, with an attribute, label, corresponding to
the taxon label.
TaxonSet A collection of Taxon objects representing a distinct definition of taxa (for example, as
specified explicitly in a NEXUS “TAXA” block, or implicitly in the set of all taxon labels used
across a Newick tree file).
Tree A collection of Node and Edge objects representing a phylogenetic tree. Each Tree object
maintains a reference to a TaxonSet object in its attribute, taxon_set, which specifies the set
of taxa that are referenced by the tree and its nodes. Each Node object has a taxon attribute
(which points to a particular Taxon object if there is an operational taxonomic unit associated with
this node, or is None if not), a parent_node attribute (which will be None if the Node has no
parent, e.g., a root node), an edge attribute (an Edge object), as well as a list of references to child nodes, a copy of
which can be obtained by calling child_nodes().
TreeList A list of Tree objects. A TreeList object has an attribute, taxon_set, which specifies the set of taxa that are referenced by all member Tree elements. This is enforced when a Tree
object is added to a TreeList, with the TaxonSet of the Tree object and all Taxon references
of the Node objects in the Tree mapped to the TaxonSet of the TreeList.
CharacterMatrix Representation of character data, with specializations for different data
types: DnaCharacterMatrix, RnaCharacterMatrix, ProteinCharacterMatrix,
StandardCharacterMatrix, ContinuousCharacterMatrix, etc. A CharacterMatrix can be
treated very much like a dict object, with Taxon objects as keys and character data as values
associated with those keys.
DataSet A meta-collection of phylogenetic data, consisting of lists of multiple TaxonSet objects (taxon_sets), TreeList objects (tree_lists), and CharacterMatrix objects
(char_matrices).
1.1.2 Creating New (Empty) Objects
All of the above names are imported into the dendropy namespace, so to instantiate new, empty objects
of these classes, you need only import dendropy:
>>> import dendropy
>>> tree1 = dendropy.Tree()
>>> tree_list1 = dendropy.TreeList()
>>> dna1 = dendropy.DnaCharacterMatrix()
>>> dataset1 = dendropy.DataSet()
Or import the names directly:
>>> from dendropy import Tree, TreeList, DnaCharacterMatrix, DataSet
>>> tree1 = Tree()
>>> tree_list1 = TreeList()
>>> dna1 = DnaCharacterMatrix()
>>> dataset1 = DataSet()
1.1.3 Reading and Writing Phylogenetic Data
DendroPy provides a rich set of tools for reading and writing phylogenetic data in various formats, such as NEXUS,
Newick, PHYLIP, etc. These are covered in detail in the following “Reading Phylogenetic Data” and “Writing Phylogenetic Data” chapters respectively.
1.2 Reading Phylogenetic Data
1.2.1 Creating and Populating New Objects
The Tree, TreeList, CharacterMatrix-derived, and DataSet classes all support “get_from_*” factory
methods that allow for the simultaneous instantiation and population of the objects from a data source:
get_from_stream(src, schema, **kwargs) Takes a file or file-like object opened for reading the data source as the first argument, and a string specifying the schema as the second.
get_from_path(src, schema, **kwargs) Takes a string specifying the path to the data
source file as the first argument, and a string specifying the schema as the second.
get_from_string(src, schema, **kwargs) Takes a string containing the source data as the
first argument, and a string specifying the schema as the second.
All these methods minimally take a source and schema specification string as arguments and return a new object of
the given type populated from the given source:
>>> import dendropy
>>> tree1 = dendropy.Tree.get_from_string("((A,B),(C,D))", schema="newick")
>>> tree_list1 = dendropy.TreeList.get_from_path("pythonidae.mcmc.nex", schema="nexus")
>>> dna1 = dendropy.DnaCharacterMatrix.get_from_stream(open("pythonidae.fasta"), "dnafasta")
>>> std1 = dendropy.StandardCharacterMatrix.get_from_path("python_morph.nex", "nexus")
>>> dataset1 = dendropy.DataSet.get_from_path("pythonidae.nex", "nexus")
The schema specification string can be one of: “nexus”, “newick”, “nexml”, “fasta”, or “phylip”. Not all
formats are supported for reading, and not all formats make sense for particular objects (for example, it would not
make sense to try and instantiate a Tree or TreeList object from a FASTA-formatted data source).
Alternatively, you can also pass a file-like object and a schema specification string to the constructor of these classes
using the keyword arguments stream and schema respectively:
>>> import dendropy
>>> tree1 = dendropy.Tree(stream=open("mle.tre"), schema="newick")
>>> tree_list1 = dendropy.TreeList(stream=open("pythonidae.mcmc.nex"), schema="nexus")
>>> dna1 = dendropy.DnaCharacterMatrix(stream=open("pythonidae.fasta"), schema="dnafasta")
>>> std1 = dendropy.StandardCharacterMatrix(stream=open("python_morph.nex"), schema="nexus")
>>> dataset1 = dendropy.DataSet(stream=open("pythonidae.nex"), schema="nexus")
Various keyword arguments can also be passed to these methods which customize or control how the data is parsed
and mapped into DendroPy object space. These are discussed below.
1.2.2 Reading and Populating (or Repopulating) Existing Objects
The Tree, TreeList, CharacterMatrix-derived, and DataSet classes all support a suite of
“read_from_*” instance methods that parallels the “get_from_*” factory methods described above:
read_from_stream(src, schema, **kwargs) Takes a file or file-like object opened for reading the data source as the first argument, and a string specifying the schema as the second.
read_from_path(src, schema, **kwargs) Takes a string specifying the path to the data
source file as the first argument, and a string specifying the schema as the second.
read_from_string(src, schema, **kwargs) Takes a string containing the
source data as the first argument, and a string specifying the schema as the second.
When called on an existing TreeList or DataSet object, these methods add the data from the data source to the
object, whereas when called on an existing Tree or CharacterMatrix object, they replace the object’s data with
data from the data source. As with the “get_from_*” methods, the schema specification string can be any supported
and type-appropriate schema, such as “nexus”, “newick”, “nexml”, “fasta”, “phylip”, etc.
For example, the following accumulates post-burn-in trees from several different files into a single TreeList object:
>>> import dendropy
>>> post_trees = dendropy.TreeList()
>>> post_trees.read_from_path("pythonidae.nex.run1.t", "nexus", tree_offset=200)
>>> print(post_trees.description())
TreeList object at 0x550990 (TreeList5573008): 801 Trees
>>> post_trees.read_from_path("pythonidae.nex.run2.t", "nexus", tree_offset=200)
>>> print(post_trees.description())
TreeList object at 0x550990 (TreeList5573008): 1602 Trees
>>> post_trees.read_from_path("pythonidae.nex.run3.t", "nexus", tree_offset=200)
>>> print(post_trees.description())
TreeList object at 0x550990 (TreeList5573008): 2403 Trees
>>> post_trees.read_from_path("pythonidae.nex.run4.t", "nexus", tree_offset=200)
>>> print(post_trees.description())
TreeList object at 0x5508a0 (TreeList5572768): 3204 Trees
The TreeList object automatically handles taxon management, and ensures that all appended Tree objects share
the same TaxonSet reference. Thus all the Tree objects created and aggregated from the data sources in the
example will share the same TaxonSet and Taxon objects, which is important if you are going to be carrying out
comparisons or operations between multiple Tree objects.
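The arithmetic behind the session above can be made explicit in plain Python. The following is only a sketch and does not use DendroPy: read_post_burnin is a hypothetical helper, each "tree" is a placeholder string, and the run size of 1001 sampled trees is an assumption chosen so that skipping 200 trees leaves 801 per run, as in the output shown.

```python
# Plain-Python illustration of aggregating post-burn-in samples.
# Nothing here is DendroPy API: `read_post_burnin` is a hypothetical helper,
# and each "tree" is just a placeholder string.

def read_post_burnin(run_trees, tree_offset=0):
    """Return the trees of one run, skipping the first `tree_offset` of them."""
    return run_trees[tree_offset:]

# Four hypothetical MCMC runs of 1001 sampled trees each (an assumption).
runs = [["run%d_tree%d" % (r, i) for i in range(1001)] for r in range(1, 5)]

post_trees = []   # plays the role of the aggregating TreeList
counts = []       # running total after each run is added
for run in runs:
    post_trees.extend(read_post_burnin(run, tree_offset=200))
    counts.append(len(post_trees))

print(counts)
```

Under these assumptions the running totals are 801, 1602, 2403 and 3204, mirroring the aggregating behavior of read_from_* on a TreeList.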
In contrast to the aggregating behavior of read_from_* of TreeList and DataSet objects, the read_from_*
methods of Tree- and CharacterMatrix-derived objects show replacement behavior. For example, the following
changes the contents of a Tree by re-reading it:
>>> import dendropy
>>> t = dendropy.Tree()
>>> t.read_from_path('pythonidae.mle.nex', 'nexus')
>>> print(t.description())
Tree object at 0x79c70 (Tree37413776: '0'): ('Python molurus':0.0779719244,(('Python sebae':0.1414715
>>> t.read_from_path('pythonidae.mcmc-con.nex', 'nexus')
>>> print(t.description())
Tree object at 0x79c70 (Tree37414064: 'con 50 majrule'): ('Python regius':0.212275,('Python sebae':0.
As with the get_from_* methods, keyword arguments can be used to provide control on the data source parsing.
1.2.3 Specifying the Data Source Format
All the get_from_* and read_from_* methods take a schema specification string using the schema argument
which specifies the format of the data source.
The string can be one of the following:
“nexus” To read Tree, TreeList, CharacterMatrix, or DataSet objects from a NEXUS-formatted source.
“newick” To read Tree, TreeList, or DataSet objects from a Newick-formatted source.
“fasta” To read CharacterMatrix or DataSet objects from a FASTA-formatted source. When reading
into a general class such as DataSet, FASTA sources require the additional keyword argument,
data_type, that describes the type of data: “dna”, “rna”, “protein”, “standard” (discrete data
represented as binary 0/1), “restriction” (restriction sites), or “infinite” (infinite sites).
“phylip” To read CharacterMatrix or DataSet objects from a PHYLIP-formatted source.
You would typically use a specific CharacterMatrix class depending on the data type, e.g.,
DnaCharacterMatrix, ContinuousCharacterMatrix, etc. If you use a more general
class, e.g., DataSet, then for PHYLIP sources you need to specify the additional keyword argument, data_type, that describes the type of data: “dna”, “rna”, “protein”, “standard”
(discrete data represented as binary 0/1), “restriction” (restriction sites), or “infinite” (infinite sites).
“beast-summary-tree” To read Tree or TreeList objects from a BEAST annotated
consensus tree source. Each node on the resulting tree(s) will have the following attributes:
“height”, “height_median”, “height_95hpd”, “height_range”,
“length”, “length_median”, “length_95hpd”, “length_range”, “posterior”.
Scalar values will be of float type, while ranges (e.g., “height_95hpd”,
“height_range”, “length_95hpd”, “length_range”) will be two-element lists of
float.
1.2.4 Customizing Data Creation and Reading
When specifying a data source from which to create or populate data objects using the get_from_* or
read_from_* methods, or when passing a data source stream to a constructor, you can also specify keyword arguments that provide
fine-grained control over how the data source is parsed. Some of these keyword arguments apply generally, regardless
of the format of the data source or the data object being created, while others are specific to the data object type or the
data source format.
All Formats
attached_taxon_set If True when reading into a DataSet object, then a new TaxonSet object
will be created and added to the taxon_sets list of the DataSet object, and the DataSet
object will be placed in “attached” (or single) taxon set mode, i.e., all taxa in any data sources
parsed or read will be mapped to the same TaxonSet object. By default, this is False, resulting in
a multi-taxon set mode DataSet object.
taxon_set If passed a TaxonSet object, then this TaxonSet will be used to manage all taxon references in the data source. When creating a new Tree, TreeList or CharacterMatrix object
from a data source, the TaxonSet object passed by this keyword will be used as the TaxonSet
associated with the object. When reading into a DataSet object, if the data source defines multiple collections of taxa (as is possible with, for example, the NEXML schema, or the Mesquite
variant of the NEXUS schema), then multiple new TaxonSet objects will be created. By passing
a TaxonSet object through the taxon_set keyword, you can force DendroPy to use the same
TaxonSet object for all taxon references.
exclude_trees If True, then all tree data in the data source will be skipped. Default value is False,
i.e., all tree data will be included.
exclude_chars If True, then all character data in the data source will be skipped. Default value is
False, i.e., all character data will be included.
NEXUS/Newick
The following snippets serve to show all the keywords that can be passed to various “read_from_*” or
“get_from_*” methods when the schema is NEXUS or Newick:
# Reading a mixed (trees and characters) NEXUS data source, and you want all
# the trees *and* all the characters.
data = dendropy.DataSet.get_from_path(
    "data.nex",
    "nexus",
    taxon_set=None,
    exclude_trees=False,
    exclude_chars=False,
    as_rooted=None,
    as_unrooted=None,
    default_as_rooted=None,
    default_as_unrooted=None,
    edge_len_type=float,
    extract_comment_metadata=False,
    store_tree_weights=False,
    case_sensitive_taxon_labels=False,
    preserve_underscores=False,
    suppress_internal_node_taxa=False,
    allow_duplicate_taxon_labels=False,
    hyphens_as_tokens=False)
# Reading a tree-only NEXUS or NEWICK data source, or you want just the trees
# from a mixed NEXUS data source. For NEWICK format data, replace "nexus" with
# "newick" below. You can use ``collection_offset`` to specify a particular
# tree block to read (integer; first block offset = 0), and ``tree_offset`` to
# skip to a particular tree (integer; offset of 0 skips no trees, offset of 1
# skips the first tree, etc.)
trees = dendropy.TreeList.get_from_path(
    "data.nex",
    "nexus",
    taxon_set=None,
    exclude_trees=False,
    exclude_chars=True,
    as_rooted=None,
    as_unrooted=None,
    default_as_rooted=None,
    default_as_unrooted=None,
    edge_len_type=float,
    extract_comment_metadata=False,
    store_tree_weights=False,
    case_sensitive_taxon_labels=False,
    preserve_underscores=False,
    suppress_internal_node_taxa=False,
    allow_duplicate_taxon_labels=False,
    hyphens_as_tokens=False,
    collection_offset=0,
    tree_offset=0)
# Reading a character-only NEXUS data source, or you want just the characters
# from a mixed data source. You can use ``matrix_offset`` to specify a particular
# character block to read (integer; first matrix offset = 0).
dna = dendropy.DnaCharacterMatrix.get_from_path(
    "data.nex",
    "nexus",
    taxon_set=None,
    exclude_trees=True,
    exclude_chars=False,
    as_rooted=None,
    as_unrooted=None,
    default_as_rooted=None,
    default_as_unrooted=None,
    edge_len_type=float,
    extract_comment_metadata=False,
    store_tree_weights=False,
    case_sensitive_taxon_labels=False,
    preserve_underscores=False,
    suppress_internal_node_taxa=False,
    allow_duplicate_taxon_labels=False,
    hyphens_as_tokens=False,
    matrix_offset=0)
The special keywords supported for reading NEXUS-formatted or NEWICK-formatted data include:
as_rooted, as_unrooted, default_as_rooted, default_as_unrooted
When reading into a Tree, TreeList, or DataSet object, these keywords determine how
trees in the data source will be rooted. The rooting state of a Tree object is set by the
is_rooted property. When parsing NEXUS- and Newick-formatted data, the rooting states
of the resulting Tree objects are given by [&R] (for rooted) or [&U] (for unrooted) comment
tags preceding the tree definition in the data source. If these tags are not present, then the trees
are assumed to be unrooted. This behavior can be changed by specifying keyword arguments
to the get_from_*, or read_from_* methods of both the Tree and TreeList classes,
or the constructors of these classes when specifying a data source from which to construct the
tree:
The as_rooted keyword argument, if True, forces all trees to be interpreted as rooted,
regardless of whether or not the [&R]/[&U] comment tags are given. Conversely, if
False, all trees will be interpreted as unrooted. For semantic clarity, you can also specify
as_unrooted to be True to force all trees to be unrooted.
#! /usr/bin/env python

import dendropy

# tree assumed to be unrooted unless '[&R]' is specified
tree = dendropy.Tree.get_from_path("pythonidae.mle.nex", "nexus")

# forces tree to be rooted
tree = dendropy.Tree.get_from_path("pythonidae.mle.nex", "nexus", as_rooted=True)

# forces tree to be unrooted
tree = dendropy.Tree.get_from_path("pythonidae.mle.nex", "nexus", as_rooted=False)

# forces tree to be unrooted
tree = dendropy.Tree.get_from_path("pythonidae.mle.nex", "nexus", as_unrooted=True)

# forces tree to be rooted
tree = dendropy.Tree.get_from_path("pythonidae.mle.nex", "nexus", as_unrooted=False)

# also applies to constructors ...
tree = dendropy.Tree(
    stream=open("pythonidae.mle.nex", "rU"),
    schema="nexus",
    as_rooted=True)

# and 'read_from_*' methods
tree = dendropy.Tree()
tree.read_from_path("pythonidae.mle.nex", "nexus", as_rooted=True)

# and TreeList constructor, 'get_from_*', and 'read_from_*' methods
tree_list = dendropy.TreeList(
    stream=open("pythonidae.mcmc.nex", "rU"),
    schema="nexus",
    as_rooted=True)

tree_list = dendropy.TreeList.get_from_path(
    "pythonidae.mcmc.nex",
    "nexus", as_rooted=True)

tree_list = dendropy.TreeList()
tree_list.read_from_path("pythonidae.mcmc.nex", "nexus", as_rooted=True)
In addition, you can specify a default_as_rooted keyword argument, which, if
True, forces all trees to be interpreted as rooted if the [&R]/[&U] comment tags
are not given. Otherwise the rooting will follow the [&R]/[&U] comment tags. Conversely, if default_as_rooted is False, all trees will be interpreted as unrooted if the
[&R]/[&U] comment tags are not given. Again, for semantic clarity, you can also specify
default_as_unrooted to be True to assume all trees are unrooted if not explicitly specified, though, as this is the default behavior, this should not be necessary.
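The precedence among these keywords and the [&R]/[&U] comment tags can be summarized in a short plain-Python sketch. This is an illustration of the rules as described above, not DendroPy's parser: resolve_rooting is a hypothetical function, and how contradictory combinations (e.g., as_rooted=True together with as_unrooted=True) are handled is an assumption of the sketch.

```python
import re

def resolve_rooting(tree_statement, as_rooted=None, as_unrooted=None,
                    default_as_rooted=None, default_as_unrooted=None):
    """Return True if the tree should be treated as rooted, False otherwise.

    Hypothetical helper mirroring the rules in the text; contradictory
    keyword combinations are resolved by checking `as_rooted` first
    (an assumption of this sketch).
    """
    # 1. The forcing keywords override any tag in the data source.
    if as_rooted is not None:
        return bool(as_rooted)
    if as_unrooted is not None:
        return not as_unrooted
    # 2. Otherwise an explicit [&R] or [&U] comment tag decides.
    match = re.search(r"\[&([RrUu])\]", tree_statement)
    if match:
        return match.group(1).upper() == "R"
    # 3. No tag: fall back on the default keywords, then on "unrooted".
    if default_as_rooted is not None:
        return bool(default_as_rooted)
    if default_as_unrooted is not None:
        return not default_as_unrooted
    return False

print(resolve_rooting("((A,B),(C,D));"))
print(resolve_rooting("[&U] ((A,B),(C,D));", default_as_rooted=True))
print(resolve_rooting("[&U] ((A,B),(C,D));", as_rooted=True))
```

The first call reports an unrooted tree (no tag, no keywords), the second still reports unrooted because the explicit [&U] tag outranks default_as_rooted, and the third reports rooted because as_rooted overrides the tag.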
edge_len_type
Specifies the type of the edge lengths (int or float).
extract_comment_metadata
If True, comments that begin with ‘&’ or ‘&&’ associated with items will be processed and
stored as part of the annotation set of the object (annotations). If False, this will be skipped.
Defaults to False.
store_tree_weights
If True, process the tree weight (“[&W 1/2]”) comment associated with each tree, if any.
encode_splits
Specifies whether or not split bitmasks will be calculated and attached to the edges.
finish_node_func
Is a function that will be applied to each node after it has been constructed.
case_sensitive_taxon_labels
If True, then taxon labels are case sensitive (different cases = different taxa); defaults to False.
preserve_underscores
If True, unquoted underscores in labels will not be converted to spaces. Defaults to False: all
underscores not protected by quotes will be converted to spaces.
suppress_internal_node_taxa
If False, internal node labels will be instantiated into Taxon objects. Defaults to True: internal node labels will not be treated as taxa.
allow_duplicate_taxon_labels
If True, then multiple identical taxon labels will be allowed. Defaults to False: treat multiple
identical taxon labels as an error.
hyphens_as_tokens
If True, hyphens will be treated as special punctuation characters. Defaults to False, hyphens
not treated as special punctuation characters.
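As an illustration of the preserve_underscores rule described above, the following plain-Python sketch shows how a single raw label token might be interpreted. It is not DendroPy's tokenizer: normalize_label is a hypothetical function, and it assumes single quotes are the protecting quotes, with a doubled quote inside a quoted label standing for one literal quote (the usual NEXUS convention).

```python
def normalize_label(raw_label, preserve_underscores=False):
    """Interpret one raw Newick/NEXUS label token (hypothetical helper).

    Quoted labels keep their underscores; unquoted labels have their
    underscores turned into spaces unless `preserve_underscores` is True.
    """
    if len(raw_label) >= 2 and raw_label[0] == "'" and raw_label[-1] == "'":
        # Strip the protecting quotes; '' inside stands for a single quote.
        return raw_label[1:-1].replace("''", "'")
    if preserve_underscores:
        return raw_label
    return raw_label.replace("_", " ")

print(normalize_label("Python_regius"))
print(normalize_label("Python_regius", preserve_underscores=True))
print(normalize_label("'Python_regius'"))
```

Only the first call yields a label containing a space; in the other two the underscore survives, either because preserve_underscores is set or because the label is protected by quotes.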
FASTA
The following snippets show all the keywords that can be passed to various “read_from_*” or “get_from_*”
methods when the schema is FASTA:
# When explicitly reading into
# a CharacterMatrix of a particular
# type, the `data_type` argument
# is not needed.
d = dendropy.DnaCharacterMatrix.get_from_path(
    "data.fas",
    "fasta",
    taxon_set=None,
    row_type='rich')

# Otherwise ...
d = dendropy.DataSet.get_from_path(
    "data.fas",
    "fasta",
    taxon_set=None,
    data_type="dna",
    row_type='rich')

# Or ...
d = dendropy.DataSet.get_from_path(
    "data.fas",
    "fasta",
    taxon_set=None,
    char_matrix_type=dendropy.DnaCharacterMatrix,
    row_type='rich')
The special keywords supported for reading FASTA-formatted data include:
data_type As noted above, if not reading into a CharacterMatrix of a particular type, the FASTA
format requires specification of the type of data using the data_type argument, which takes
one of the following strings: “dna”, “rna”, “protein”, “standard”, “restriction”, or
“infinite”.
row_type Defaults to 'rich': characters will be read into the full DendroPy character data object
model. Alternatively, 'str' can be specified: characters will be read as simple strings.
PHYLIP
The following snippets illustrate typical keyword arguments and their defaults when reading data from PHYLIP-formatted sources:
# When explicitly reading into
# a CharacterMatrix of a particular
# type, the `data_type` argument
# is not needed.
d = dendropy.DnaCharacterMatrix.get_from_path(
    "data.dat",
    "phylip",
    taxon_set=None,
    strict=False,
    interleaved=False,
    multispace_delimiter=False,
    underscores_to_spaces=False,
    ignore_invalid_chars=False)

# Otherwise ...
d = dendropy.DataSet.get_from_path(
    "data.dat",
    "phylip",
    taxon_set=None,
    data_type="dna",
    strict=False,
    interleaved=False,
    multispace_delimiter=False,
    underscores_to_spaces=False,
    ignore_invalid_chars=False)

# Or ...
d = dendropy.DataSet.get_from_path(
    "data.dat",
    "phylip",
    taxon_set=None,
    char_matrix_type=dendropy.DnaCharacterMatrix,
    strict=False,
    interleaved=False,
    multispace_delimiter=False,
    underscores_to_spaces=False,
    ignore_invalid_chars=False)
The special keywords supported for reading PHYLIP-formatted data include:
data_type As noted above, if not reading into a CharacterMatrix of a particular type, the
PHYLIP format requires specification of the type of data using the data_type argument, which
takes one of the following strings: “dna”, “rna”, “protein”, “standard”, “restriction”,
or “infinite”.
strict By default, the PHYLIP parser works in “relaxed” mode, which means that taxon labels can
be of arbitrary length, and taxon labels and corresponding sequences are separated by one or more
spaces. By specifying strict=True, the parser will behave in strict mode, i.e., taxon labels
are limited to 10 characters in length, and sequences start on column 11.
interleaved By default, the PHYLIP parser assumes that the data source is in sequential format. If
the data is in interleaved format, you should specify interleaved=True.
multispace_delimiter The default “relaxed” mode of the PHYLIP parser assumes that taxon
labels are separated from sequence characters by one or more spaces. By specifying
multispace_delimiter=True, the parser will require two or more spaces to separate taxon
labels from sequence characters, thus allowing you to use single spaces in your taxon labels.
ignore_invalid_chars By default, the PHYLIP parser will fail with an error if invalid characters
are found in a sequence. By specifying ignore_invalid_chars=True, the parser will simply
ignore these characters.
underscores_to_spaces In the default relaxed PHYLIP format mode, since the first occurrence
of a space in the data format is taken to denote the end of the taxon label, spaces are not permitted
within taxon labels. A common convention is to use underscores in place of spaces in cases like
this. By specifying underscores_to_spaces=True, the parser will automatically substitute
any underscores found in taxon labels with spaces, thus allowing for correspondence with the same
taxa represented in other formats that allow spaces, such as NEXUS or Newick.
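The interaction of strict, multispace_delimiter, and underscores_to_spaces can be sketched in plain Python for a single line of sequential-format data. This is only an illustration of the rules described above, not DendroPy's PHYLIP parser: split_phylip_line is a hypothetical function, it ignores the header line and interleaved data, and it raises ValueError if no delimiter is found in relaxed mode.

```python
import re

def split_phylip_line(line, strict=False, multispace_delimiter=False,
                      underscores_to_spaces=False):
    """Split one sequential-format PHYLIP data line into (label, sequence).

    Hypothetical helper illustrating the keyword semantics only.
    """
    line = line.rstrip("\n")
    if strict:
        # Strict mode: the label occupies columns 1-10, the sequence
        # starts on column 11.
        label, seq = line[:10].strip(), line[10:]
    else:
        # Relaxed mode: the label ends at the first run of one or more
        # spaces, or of two or more if multispace_delimiter is set.
        pattern = r"\s{2,}" if multispace_delimiter else r"\s+"
        label, seq = re.split(pattern, line.strip(), maxsplit=1)
    if underscores_to_spaces:
        label = label.replace("_", " ")
    # Spaces inside the sequence only aid readability; drop them.
    return label, seq.replace(" ", "")

print(split_phylip_line("Python_regius ACGT ACGT", underscores_to_spaces=True))
print(split_phylip_line("Python regius  ACGTACGT", multispace_delimiter=True))
print(split_phylip_line("Py regius ACGTACGT", strict=True))
```

All three calls recover a label containing a space: the first through underscore substitution, the second through the two-space delimiter, and the third through the fixed 10-column label field of strict mode.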
BEAST Summary Trees
ignore_missing_node_info If any nodes are missing annotations (given as a NEXUS-style
square-bracket wrapped comment string, with the first character of the comment string an ampersand), then by default the parser will throw an exception. If ignore_missing_node_info is
True, then missing annotations are silently ignored and all relevant attribute values will be set to
None.
1.3 Writing Phylogenetic Data
1.3.1 Writing to Streams, Filepaths, or Strings
The Tree, TreeList, CharacterMatrix-derived, and DataSet classes all support the following instance
methods for writing data:
write_to_stream(dest, schema, **kwargs) Takes a file or file-like object opened for writing the data as the first argument, and a string specifying the schema as the second.
write_to_path(dest, schema, **kwargs) Takes a string specifying the path to the file as
the first argument, and a string specifying the schema as the second.
as_string(schema, **kwargs) Takes a string specifying the schema as the first argument, and
returns a string containing the formatted-representation of the data.
1.3.2 Specifying the Data Writing Format
The schema specification string can be one of the following:
“nexus” To write Tree, TreeList, CharacterMatrix, or DataSet objects in NEXUS format.
“newick” To write Tree, TreeList, or DataSet objects in Newick format. With DataSet objects, only tree data will be written.
“fasta” To write CharacterMatrix or DataSet objects in FASTA format. With DataSet objects, only character data will be written.
“phylip” To write CharacterMatrix or DataSet objects in PHYLIP format. With DataSet
objects, only character data will be written.
1.3.3 Customizing the Data Writing Format
The writing of data can be controlled or fine-tuned using keyword arguments. As with reading, some of these arguments apply generally, while others are only available or make sense for a particular format.
All Formats
taxon_set When writing a DataSet object, if passed a specific TaxonSet, then only TreeList
and CharacterMatrix objects associated with this TaxonSet will be written. By default, this
is None, meaning that all data in the DataSet object will be written.
exclude_trees When writing a DataSet object, if True, then no tree data will be written (i.e.,
all TreeList objects in the DataSet will be skipped in the output). By default, this is False,
meaning that all tree data will be written.
exclude_chars When writing a DataSet object, if True, then no character data will be written (i.e.,
all CharacterMatrix objects in the DataSet will be skipped in the output). By default, this is
False, meaning that all character data will be written.
NEXUS and Newick
The following code fragment shows a typical invocation of a NEXUS-format write operation using all supported
keywords with their defaults:
d.write_to_path(
    'data.nex',
    'nexus',
    taxon_set=None,
    exclude_trees=False,
    exclude_chars=False,
    simple=False,
    suppress_taxa_block=True,
    preamble_blocks=[],
    supplemental_blocks=[],
    file_comments=None,
    suppress_leaf_taxon_labels=False,
    suppress_leaf_node_labels=True,
    suppress_internal_taxon_labels=False,
    suppress_internal_node_labels=False,
    suppress_rooting=False,
    suppress_edge_lengths=False,
    unquoted_underscores=False,
    preserve_spaces=False,
    store_tree_weights=False,
    suppress_annotations=False,
    annotations_as_nhx=False,
    suppress_item_comments=False,
    node_label_element_separator=' ',
    node_label_compose_func=None,
    edge_label_compose_func=None)
The following code fragment shows a typical invocation of a Newick-format write operation using all supported
keyword arguments with their default values:
d.write_to_path(
    'data.tre',
    'newick',
    taxon_set=None,
    suppress_leaf_taxon_labels=False,
    suppress_leaf_node_labels=True,
    suppress_internal_taxon_labels=False,
    suppress_internal_node_labels=False,
    suppress_rooting=False,
    suppress_edge_lengths=False,
    unquoted_underscores=False,
    preserve_spaces=False,
    store_tree_weights=False,
    suppress_annotations=True,
    annotations_as_nhx=False,
    suppress_item_comments=True,
    node_label_element_separator=' ',
    node_label_compose_func=None)
The special keywords supported for writing NEXUS-formatted output include:
simple When writing NEXUS-formatted data, if True, then character data will be represented as a
single “DATA” block, instead of separate “TAXA” and “CHARACTERS” blocks. By default this is
False.
block_titles When writing NEXUS-formatted data, if False, then title statements will not be added
to the various NEXUS blocks (i.e., “TAXA”, “CHARACTERS”, and “TREES”). By default, this is
True, i.e., block titles will be written.
suppress_taxa_block If True, do not write a “TAXA” block. Default is False.
exclude_trees When writing NEXUS-formatted data, if True, then no tree data will be written (i.e.,
all TreeList objects in the DataSet will be skipped in the output). By default, this is False,
meaning that all tree data will be written.
exclude_chars When writing NEXUS-formatted data, if True, then no character data will be written
(i.e., all CharacterMatrix objects in the DataSet will be skipped in the output). By default,
this is False, meaning that all character data will be written.
preamble_blocks When writing NEXUS-formatted data, a list of other blocks (or strings) to be
written at the beginning of the file.
supplemental_blocks When writing NEXUS-formatted data, a list of other blocks (or strings) to
be written at the end of the file.
file_comments When writing NEXUS-formatted data, then the contents of this variable (a string or
a list of strings) will be added as a NEXUS comment to the file (at the top). By default, this is None.
The special keywords supported for writing both NEXUS- and Newick-formatted trees include:
suppress_leaf_taxon_labels If True, then taxon labels will not be printed for leaves. Default
is False.
suppress_leaf_node_labels If False, then node labels (if available) will be printed for leaves.
Defaults to True. Note that DendroPy distinguishes between taxon labels and node labels. In a
typical NEWICK string, taxon labels are printed for leaf nodes, while leaf node labels are ignored
(hence the default True setting, to ignore leaf node labels).
suppress_internal_taxon_labels If True, then taxon labels will not be printed for internal
nodes. Default is False. NOTE: this replaces the internal_labels argument which has been
deprecated.
suppress_internal_node_labels If True, internal node labels will not be written. Default is
False. NOTE: this replaces the internal_labels argument which has been deprecated.
suppress_rooting If True, will not write rooting statement. Default is False. NOTE: this keyword
argument replaces the write_rooting argument which has now been deprecated.
suppress_edge_lengths If True, will not write edge lengths. Default is False. NOTE: this keyword argument replaces the edge_lengths argument which has now been deprecated.
unquoted_underscores If True, labels with underscores will not be quoted, which will mean that
they will be interpreted as spaces if read again (“soft” underscores). If False, then labels with
underscores will be quoted, resulting in “hard” underscores. Default is False. NOTE: this keyword
argument replaces the quote_underscores argument which has now been deprecated.
preserve_spaces If True, spaces in labels are not mapped to underscores. Default is False.
store_tree_weights If True, tree weights are written. Default is False.
suppress_annotations If True, will not write annotated attributes as comments. Default is False
if writing in NEXUS format and simple is False; otherwise, if writing in NEWICK format or
NEXUS format with simple set to True, then defaults to True.
annotations_as_nhx If True and suppress_annotations is False, then annotations will be
written in NHX format ('[&&field=value:field=value]'), as opposed to a more generic format with
only one leading ampersand ('[&field=value,field=value,field={value,value}]'). Defaults to False.
suppress_item_comments If True, will not write any additional comments associated with (tree)
items. Default is False if writing in NEXUS format and simple is False; otherwise, if writing in
NEWICK format or NEXUS format with simple set to True, then defaults to True.
node_label_element_separator If both suppress_leaf_taxon_labels and
suppress_leaf_node_labels are False, then this will be the string used to join them.
Defaults to ' '.
node_label_compose_func If not None, should be a function that takes a Node object as an
argument and returns the string to be used to represent the node in the tree statement.
The return value from this function is used unconditionally to print
a node representation in a tree statement, by-passing the default labelling function (and
thus ignoring suppress_leaf_taxon_labels, suppress_leaf_node_labels,
suppress_internal_taxon_labels, suppress_internal_node_labels, etc.).
Defaults to None.
edge_label_compose_func If not None, should be a function that takes an Edge object as an
argument, and returns the string to be used to represent the edge length in the tree statement.
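As a rough illustration of the two compose-function keywords above, the following sketch defines one function of each kind. It does not import DendroPy: the node and edge objects are hypothetical stand-ins built with SimpleNamespace, and the only assumptions made about the real classes are that a Node has taxon and label attributes and that an Edge has a length attribute.

```python
from types import SimpleNamespace


def compose_node_label(node):
    """Return the taxon label if present, else the node label, else ''."""
    if node.taxon is not None:
        return node.taxon.label
    if node.label is not None:
        return str(node.label)
    return ""


def compose_edge_label(edge):
    """Return the edge length rounded to four decimal places, or ''."""
    if edge.length is None:
        return ""
    return "%.4f" % edge.length


# Hypothetical stand-ins for a leaf node, an internal node and an edge.
leaf = SimpleNamespace(taxon=SimpleNamespace(label="Python regius"), label=None)
internal = SimpleNamespace(taxon=None, label="0.95")
edge = SimpleNamespace(length=0.0779719244)

print(compose_node_label(leaf))
print(compose_node_label(internal))
print(compose_edge_label(edge))
```

Functions like these would be passed to a write operation as node_label_compose_func=compose_node_label and edge_label_compose_func=compose_edge_label.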
FASTA
The following code fragment shows a typical invocation of a FASTA-format write operation using all supported keywords with their defaults:
d.write_to_path(
    'data.fas',
    'fasta',
    taxon_set=None,
    wrap=False,
    wrap_width=70)
The special keywords supported for writing FASTA-formatted data include:
wrap If True, then sequences will be wrapped at wrap_width characters. Defaults to False. Output is
prettier, but writing operations are considerably slower.
wrap_width If wrap is True, then sequences will be wrapped at this many characters. Defaults to
70.
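The effect of wrap and wrap_width can be sketched in plain Python. This is an illustration only, not DendroPy's writer: format_fasta_record is a hypothetical helper that mirrors the two keywords described above.

```python
def format_fasta_record(label, sequence, wrap=False, wrap_width=70):
    """Format one FASTA record, optionally wrapping the sequence lines."""
    lines = [">" + label]
    if wrap:
        # Emit the sequence in consecutive chunks of wrap_width characters.
        for start in range(0, len(sequence), wrap_width):
            lines.append(sequence[start:start + wrap_width])
    else:
        lines.append(sequence)
    return "\n".join(lines)


# A 160-character sequence wrapped at 70 gives lines of 70, 70 and 20.
record = format_fasta_record("Python regius", "ACGT" * 40, wrap=True, wrap_width=70)
print(record)
```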
PHYLIP
The following code fragment shows a typical invocation of a PHYLIP-format write operation using all supported
keywords with their defaults:
d.write_to_path(
    'data.day',
    'phylip',
    taxon_set=None,
    strict=False,
    spaces_to_underscores=False,
    force_unique_taxon_labels=False)
The special keywords supported for writing PHYLIP-formatted data include:
strict If True, write in “strict” PHYLIP format, i.e., with taxon labels truncated to 10 characters, and
sequence characters beginning on column 11. Defaults to False: writes in “relaxed” format (taxon
labels not truncated, and separated from sequence characters by two or more consecutive spaces).
spaces_to_underscores If True, replace all spaces in taxon labels with underscores; useful if
writing in relaxed mode, where spaces are used to delimit the beginning of sequence characters.
Defaults to False: labels not changed.
force_unique_taxon_labels If True, then identical taxon labels (or labels that are identical due
to truncation) will be disambiguated through the appending of indexes. Defaults to False.
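The difference between strict and relaxed rows can be sketched as follows. This is a simplified illustration of the layout described above, not DendroPy's writer: phylip_row is a hypothetical helper, it pads short labels with spaces in strict mode, and it does not attempt the disambiguation performed by force_unique_taxon_labels.

```python
def phylip_row(label, sequence, strict=False, spaces_to_underscores=False):
    """Format one PHYLIP matrix row in strict or relaxed layout."""
    if spaces_to_underscores:
        label = label.replace(" ", "_")
    if strict:
        # The label field is exactly 10 characters wide (truncated or
        # padded), so sequence characters begin on column 11.
        return label[:10].ljust(10) + sequence
    # Relaxed: the full label, then two spaces before the sequence.
    return label + "  " + sequence


print(phylip_row("Python regius", "ACGTA", strict=True))
print(phylip_row("Python regius", "ACGTA", spaces_to_underscores=True))
```

Note that in relaxed mode a label containing spaces is ambiguous to a reader that splits on whitespace, which is why spaces_to_underscores is useful there.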
1.4 Examining Data Objects
High-level summaries of the contents of DendroPy phylogenetic data objects are given by the description instance
method of the Tree, TreeList, CharacterMatrix-derived, and DataSet classes. This method optionally
takes a numeric value as its first argument that determines the level of detail (or depth) of the summary:
>>> import dendropy
>>> d = dendropy.DataSet.get_from_path('pythonidae.nex', 'nexus')
>>> print(d.description())
DataSet object at 0x79dd0: 1 Taxon Sets, 0 Tree Lists, 1 Character Matrices
>>> print(d.description(3))
DataSet object at 0x79dd0: 1 Taxon Sets, 0 Tree Lists, 1 Character Matrices
[Taxon Sets]
[0] TaxonSet object at 0x5a4a20 (TaxonSet5917216): 29 Taxa
[0] Taxon object at 0x22c0fd0 (Taxon36442064): ’Python regius’
[1] Taxon object at 0x22c0f10 (Taxon36441872): ’Python sebae’
[2] Taxon object at 0x22c0ed0 (Taxon36441808): ’Python brongersmai’
[3] Taxon object at 0x22c0f70 (Taxon36441968): ’Antaresia maculosa’
[4] Taxon object at 0x22c0f30 (Taxon36441904): ’Python timoriensis’
[5] Taxon object at 0x22c0f50 (Taxon36441936): ’Python molurus’
[6] Taxon object at 0x22c0ff0 (Taxon36442096): ’Morelia carinata’
[7] Taxon object at 0x23ae050 (Taxon37412944): ’Morelia boeleni’
[8] Taxon object at 0x23ae030 (Taxon37412912): ’Antaresia perthensis’
[9] Taxon object at 0x23ae070 (Taxon37412976): ’Morelia viridis’
[10] Taxon object at 0x23ae090 (Taxon37413008): ’Aspidites ramsayi’
[11] Taxon object at 0x23ae0b0 (Taxon37413040): ’Aspidites melanocephalus’
[12] Taxon object at 0x22c0fb0 (Taxon36442032): ’Morelia oenpelliensis’
[13] Taxon object at 0x23ae0d0 (Taxon37413072): ’Bothrochilus boa’
[14] Taxon object at 0x23ae130 (Taxon37413168): ’Morelia bredli’
[15] Taxon object at 0x23ae110 (Taxon37413136): ’Morelia spilota’
[16] Taxon object at 0x23ae150 (Taxon37413200): ’Antaresia stimsoni’
[17] Taxon object at 0x23ae0f0 (Taxon37413104): ’Antaresia childreni’
[18] Taxon object at 0x23ae1b0 (Taxon37413296): ’Leiopython albertisii’
[19] Taxon object at 0x23ae170 (Taxon37413232): ’Python reticulatus’
[20] Taxon object at 0x23ae190 (Taxon37413264): ’Morelia tracyae’
[21] Taxon object at 0x23ae1d0 (Taxon37413328): ’Morelia amethistina’
[22] Taxon object at 0x23ae230 (Taxon37413424): ’Morelia nauta’
[23] Taxon object at 0x23ae250 (Taxon37413456): ’Morelia kinghorni’
[24] Taxon object at 0x23ae210 (Taxon37413392): ’Morelia clastolepis’
[25] Taxon object at 0x23ae290 (Taxon37413520): ’Liasis fuscus’
[26] Taxon object at 0x23ae2b0 (Taxon37413552): ’Liasis mackloti’
[27] Taxon object at 0x23ae270 (Taxon37413488): ’Liasis olivaceus’
[28] Taxon object at 0x23ae2f0 (Taxon37413616): ’Apodora papuana’
[Character Matrices]
[0] DnaCharacterMatrix object at 0x22c0f90 (DnaCharacterMatrix36442000): 29 Sequences
[Taxon Set]
TaxonSet object at 0x5a4a20 (TaxonSet5917216): 29 Taxa
[Characters]
[0] Python regius : 1114 characters
[1] Python sebae : 1114 characters
[2] Python brongersmai : 1114 characters
[3] Antaresia maculosa : 1114 characters
[4] Python timoriensis : 1114 characters
[5] Python molurus : 1114 characters
[6] Morelia carinata : 1114 characters
[7] Morelia boeleni : 1114 characters
[8] Antaresia perthensis : 1114 characters
[9] Morelia viridis : 1114 characters
[10] Aspidites ramsayi : 1114 characters
[11] Aspidites melanocephalus : 1114 characters
[12] Morelia oenpelliensis : 1114 characters
[13] Bothrochilus boa : 1114 characters
[14] Morelia bredli : 1114 characters
[15] Morelia spilota : 1114 characters
[16] Antaresia stimsoni : 1114 characters
[17] Antaresia childreni : 1114 characters
[18] Leiopython albertisii : 1114 characters
[19] Python reticulatus : 1114 characters
[20] Morelia tracyae : 1114 characters
[21] Morelia amethistina : 1114 characters
[22] Morelia nauta : 1114 characters
[23] Morelia kinghorni : 1114 characters
[24] Morelia clastolepis : 1114 characters
[25] Liasis fuscus : 1114 characters
[26] Liasis mackloti : 1114 characters
[27] Liasis olivaceus : 1114 characters
[28] Apodora papuana : 1114 characters
If you want to see the data in a particular schema, you can call the as_string method, passing it a schema
specification string (“nexus”, “newick”, “fasta”, “phylip”, etc.), as well as other optional arguments specific to various
formats:
>>> import dendropy
>>> d = dendropy.DataSet.get_from_path('pythonidae.nex', 'nexus')
>>> print(d.as_string("nexus"))
>>> print(d.as_string("fasta"))
>>> print(d.as_string("phylip"))
It is also possible to get an ASCII text plot of a tree object:
>>> import dendropy
>>> t = dendropy.Tree.get_from_path('pythonidae.mle.nex', 'nexus')
>>> print(t.as_ascii_plot())
[ASCII text plot of the tree. Leaf labels, from top to bottom: Python regius, Python sebae, Python molurus,
Python curtus, Morelia bredli, Morelia spilota, Morelia tracyae, Morelia clastolepis, Morelia kinghorni,
Morelia nauta, Morelia amethistina, Morelia oenpelliensis, Antaresia maculosa, Antaresia perthensis,
Antaresia stimsoni, Antaresia childreni, Morelia carinata, Morelia viridisN, Morelia viridisS, Apodora papuana,
Liasis olivaceus, Liasis fuscus, Liasis mackloti, Antaresia melanocephalus, Antaresia ramsayi, Liasis albertisii,
Bothrochilus boa, Morelia boeleni, Python timoriensis, Python reticulatus, Xenopeltis unicolor, Candola aspera,
Loxocemus bicolor]
1.5 Converting Between Data Formats
Any data in a schema that can be read by DendroPy can be saved to files in any schema that can be written by
DendroPy. Converting data between formats is simply a matter of calling readers and writers of the appropriate type.
Converting from FASTA schema to NEXUS:
>>> import dendropy
>>> cytb = dendropy.DnaCharacterMatrix.get_from_path("pythonidae_cytb.fasta", "dnafasta")
>>> cytb.write_to_path("pythonidae_cytb.nexus", "nexus")
Converting a collection of trees from NEXUS schema to Newick:
>>> import dendropy
>>> mcmc = dendropy.TreeList.get_from_path("pythonidae.mcmc.nex", "nexus")
>>> mcmc.write_to_path("pythonidae.mcmc.newick", "newick")
Converting a single tree from Newick schema to NEXUS:
>>> import dendropy
>>> mle = dendropy.Tree.get_from_path("pythonidae.mle.newick", "newick")
>>> mle.write_to_path("pythonidae.mle.nex", "nexus")
Collecting data from multiple sources and writing to a NEXUS-formatted file:
>>> import dendropy
>>> ds = dendropy.DataSet()
>>> ds.read_from_path("pythonidae_cytb.fasta", "dnafasta")
>>> ds.read_from_path("pythonidae_aa.nex", "nexus", taxon_set=ds.taxon_sets[0])
>>> ds.read_from_path("pythonidae_morphological.nex", "nexus", taxon_set=ds.taxon_sets[0])
>>> ds.read_from_path("pythonidae.mle.tre", "nexus", taxon_set=ds.taxon_sets[0])
>>> ds.write_to_path("pythonidae_combined.nex", "nexus")
Note how, after the first data source has been loaded, the resulting TaxonSet (i.e., the first one) is passed to the
subsequent read_from_path statements, to ensure that the same taxa are referenced as objects corresponding to
the additional data sources are created. Otherwise, as each data source is read, a new TaxonSet will be created, and
this will result in multiple TaxonSet objects in the DataSet, with the data from each data source associated with
their own, distinct TaxonSet.
A better way to do this is to use the “attached taxon set” mode of the DataSet object:
>>> import dendropy
>>> ds = dendropy.DataSet(attached_taxon_set=True)
>>> ds.read_from_path("pythonidae_cytb.fasta", "dnafasta")
>>> ds.read_from_path("pythonidae_aa.nex", "nexus")
>>> ds.read_from_path("pythonidae_morphological.nex", "nexus")
>>> ds.read_from_path("pythonidae.mle.tre", "nexus")
>>> ds.write_to_path("pythonidae_combined.nex", "nexus")
CHAPTER 2
Working with Taxa
2.1 Taxa and Taxon Management
Operational taxonomic units in DendroPy are represented by Taxon objects, and distinct collections of operational
taxonomic units are represented by TaxonSet objects.
Every time a definition of taxa is encountered in a data source, for example, a “TAXA” block in a NEXUS file, a
new TaxonSet object is created and populated with Taxon objects corresponding to the taxa defined in the data
source. Some data formats do not have explicit definition of taxa, e.g. a Newick tree source. These nonetheless can be
considered to have an implicit definition of a collection of operational taxonomic units given by the aggregate of all
operational taxonomic units referenced in the data (i.e., the set of all distinct labels on trees in a Newick file).
Every time a reference to a taxon is encountered in a data source, such as a taxon label in a tree or matrix statement in
a NEXUS file, the current TaxonSet object is searched for a corresponding Taxon object with a matching label (see
below for details on how the match is made). If found, the Taxon object is used to represent the taxon. If not, a new
Taxon object is created, added to the TaxonSet object, and used to represent the taxon.
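The look-up-or-create behavior described above can be sketched in a few lines of plain Python. SketchTaxon, SketchTaxonSet and get_or_create are hypothetical names used only for this illustration; they are not DendroPy classes or methods, and the sketch uses exact label matching only.

```python
class SketchTaxon(object):
    """Stand-in for a taxon: just a label."""

    def __init__(self, label):
        self.label = label


class SketchTaxonSet(object):
    """Stand-in for a taxon collection with look-up-or-create semantics."""

    def __init__(self):
        self.taxa = []

    def get_or_create(self, label):
        # Reuse an existing taxon if its label matches exactly ...
        for taxon in self.taxa:
            if taxon.label == label:
                return taxon
        # ... otherwise create a new one and add it to the collection.
        taxon = SketchTaxon(label)
        self.taxa.append(taxon)
        return taxon


taxa = SketchTaxonSet()
first = taxa.get_or_create("Python regius")
second = taxa.get_or_create("Python regius")
third = taxa.get_or_create("Python_regius")
print(first is second, first is third, len(taxa.taxa))
```

The two references to "Python regius" resolve to the same object, while the label that differs by a single character produces a second, distinct taxon.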
DendroPy maps taxon definitions encountered in a data source to Taxon objects by the taxon label. The labels have
to match exactly for the taxa to be correctly mapped.
Some formats, such as NEXUS and Newick, treat taxon labels as case-insensitive: “Python regius”,
“PYTHON REGIUS” and “python regius” will all be considered the same taxon (this can be turned off by passing
case_insensitive_taxon_labels=False when reading the data in NEXUS or Newick formats). Most other
formats (e.g., PHYLIP, FASTA, NeXML) treat taxon labels as case-sensitive:
“Python regius”, “PYTHON REGIUS” and “python regius” will all be considered different taxa.
Further quirks may arise due to some schema-specific idiosyncrasies. For example, the NEXUS standard dictates that
an underscore (“_”) should be substituted for a space in all labels. Thus, when reading a NEXUS or Newick source,
the taxon labels “Python_regius” and “Python regius” are exactly equivalent, and will be mapped to the same Taxon
object.
However, this underscore-to-space mapping does not take place when reading, for example, a FASTA schema file.
Here, underscores are preserved, and thus “Python_regius” does not map to “Python regius”. This means that if you
were to read a NEXUS file with the taxon label, “Python_regius”, and later read a FASTA file with the same taxon
label, i.e., “Python_regius”, these would map to different taxa! This is illustrated by the following:
#! /usr/bin/env python

import dendropy

nexus1 = """
#NEXUS

begin taxa;
    dimensions ntax=2;
    taxlabels Python_regius Python_sebae;
end;

begin characters;
    dimensions nchar=5;
    format datatype=dna gap=- missing=? matchchar=.;
    matrix
        Python_regius ACGTA
        Python_sebae  ACGTA
    ;
end;
"""

fasta1 = """
>Python_regius
AAAA
>Python_sebae
ACGT
"""

d = dendropy.DataSet()
d.attach_taxon_set()
d.read_from_string(nexus1, "nexus")
d.read_from_string(fasta1, "dnafasta")
print(d.taxon_sets[0].description(2))
Which produces the following, almost certainly incorrect, result:
TaxonSet object at 0x43b4e0 (TaxonSet4437216): 4 Taxa
[0] Taxon object at 0x22867b0 (Taxon36202416): ’Python regius’
[1] Taxon object at 0x2286810 (Taxon36202512): ’Python sebae’
[2] Taxon object at 0x22867d0 (Taxon36202448): ’Python_regius’
[3] Taxon object at 0x2286830 (Taxon36202544): ’Python_sebae’
Even more confusingly, if this file is written out in NEXUS schema, it would result in the space/underscore substitution
taking place, resulting in two pairs of taxa with the same labels.
If you plan on mixing sources from different formats, it is important to keep in mind the space/underscore substitution
that takes place by default with NEXUS/Newick formats, but does not take place with other formats.
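Before combining sources, it can help to check which labels would fail to match. The sketch below imitates the two reading rules described above in plain Python; nexus_label and fasta_label are hypothetical helpers, and the NEXUS rule shown covers unquoted labels only.

```python
def nexus_label(token):
    """Unquoted NEXUS/Newick token: underscores are read as spaces."""
    return token.replace("_", " ")


def fasta_label(header_text):
    """FASTA header text: kept verbatim, underscores included."""
    return header_text


tokens = ["Python_regius", "Python_sebae"]
nexus_labels = set(nexus_label(t) for t in tokens)
fasta_labels = set(fasta_label(t) for t in tokens)

# Labels that end up as separate taxa when both sources are combined.
merged = sorted(nexus_labels | fasta_labels)
unmatched = sorted(fasta_labels - nexus_labels)
print(merged)
print(unmatched)
```

The same two tokens yield four distinct labels after merging, which corresponds to the four-taxon result shown earlier.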
You could simply avoid underscores and use only spaces instead:
#! /usr/bin/env python

import dendropy

nexus1 = """
#NEXUS

begin taxa;
    dimensions ntax=2;
    taxlabels 'Python regius' 'Python sebae';
end;

begin characters;
    dimensions nchar=5;
    format datatype=dna gap=- missing=? matchchar=.;
    matrix
        'Python regius' ACGTA
        'Python sebae'  ACGTA
    ;
end;
"""

fasta1 = """
>Python regius
AAAA
>Python sebae
ACGT
"""

d = dendropy.DataSet()
d.attach_taxon_set()
d.read_from_string(nexus1, "nexus")
d.read_from_string(fasta1, "dnafasta")
print(d.taxon_sets[0].description(2))
Which results in:
TaxonSet object at 0x43b4e0 (TaxonSet4437216): 2 Taxa
[0] Taxon object at 0x22867b0 (Taxon36202416): ’Python regius’
[1] Taxon object at 0x2286810 (Taxon36202512): ’Python sebae’
Or use underscores in the NEXUS-formatted data, but spaces in the non-NEXUS data:
#! /usr/bin/env python

import dendropy

nexus1 = """
#NEXUS

begin taxa;
    dimensions ntax=2;
    taxlabels Python_regius Python_sebae;
end;

begin characters;
    dimensions nchar=5;
    format datatype=dna gap=- missing=? matchchar=.;
    matrix
        Python_regius ACGTA
        Python_sebae  ACGTA
    ;
end;
"""

fasta1 = """
>Python regius
AAAA
>Python sebae
ACGT
"""

d = dendropy.DataSet()
d.attach_taxon_set()
d.read_from_string(nexus1, "nexus")
d.read_from_string(fasta1, "dnafasta")
print(d.taxon_sets[0].description(2))
Which results in the same as the preceding example:
TaxonSet object at 0x43b4e0 (TaxonSet4437216): 2 Taxa
[0] Taxon object at 0x22867b0 (Taxon36202416): ’Python regius’
[1] Taxon object at 0x2286810 (Taxon36202512): ’Python sebae’
You can also wrap the underscore-bearing labels in the NEXUS/Newick source in quotes, which protects the
underscores from being replaced by spaces:
#! /usr/bin/env python

import dendropy

nexus1 = """
#NEXUS

begin taxa;
    dimensions ntax=2;
    taxlabels 'Python_regius' 'Python_sebae';
end;

begin characters;
    dimensions nchar=5;
    format datatype=dna gap=- missing=? matchchar=.;
    matrix
        'Python_regius' ACGTA
        'Python_sebae'  ACGTA
    ;
end;
"""

fasta1 = """
>Python_regius
AAAA
>Python_sebae
ACGT
"""

d = dendropy.DataSet()
d.attach_taxon_set()
d.read_from_string(nexus1, "nexus")
d.read_from_string(fasta1, "dnafasta")
print(d.taxon_sets[0].description(2))
Which will result in:
TaxonSet object at 0x43c780 (TaxonSet4441984): 2 Taxa
[0] Taxon object at 0x2386770 (Taxon37250928): ’Python_regius’
[1] Taxon object at 0x2386790 (Taxon37250960): ’Python_sebae’
Finally, you can also override the default behavior of DendroPy’s NEXUS/Newick parser by passing the keyword
argument preserve_underscores=True to any read_from_*, get_from_* or stream-parsing constructor.
For example:
#! /usr/bin/env python

import dendropy

nexus1 = """
#NEXUS

begin taxa;
    dimensions ntax=2;
    taxlabels Python_regius Python_sebae;
end;

begin characters;
    dimensions nchar=5;
    format datatype=dna gap=- missing=? matchchar=.;
    matrix
        Python_regius ACGTA
        Python_sebae  ACGTA
    ;
end;
"""

fasta1 = """
>Python_regius
AAAA
>Python_sebae
ACGT
"""

d = dendropy.DataSet()
d.attach_taxon_set()
d.read_from_string(nexus1, "nexus", preserve_underscores=True)
d.read_from_string(fasta1, "dnafasta")
print(d.taxon_sets[0].description(2))
will result in:
TaxonSet object at 0x43c780 (TaxonSet4441984): 2 Taxa
[0] Taxon object at 0x2386770 (Taxon37250928): ’Python_regius’
[1] Taxon object at 0x2386790 (Taxon37250960): ’Python_sebae’
This may seem the simplest solution, in so far as it means that you need not maintain lexically-different taxon labels
across files of different formats, but a gotcha here is that if writing to NEXUS/Newick schema, any label with underscores will be automatically quoted to preserve the underscores (again, as dictated by the NEXUS standard), which
will mean that: (a) your output file will have quotes, and, as a result, (b) the underscores in the labels will be “hard”
underscores if the file is read by PAUP* or DendroPy. So, for example, continuing from the previous example, the
NEXUS-formatted output would look like:
>>> print(d.as_string('nexus'))
#NEXUS
BEGIN TAXA;
TITLE TaxonSet5736800;
DIMENSIONS NTAX=2;
TAXLABELS
'Python_regius'
'Python_sebae'
;
END;
BEGIN CHARACTERS;
TITLE DnaCharacterMatrix37505040;
LINK TAXA = TaxonSet5736800;
DIMENSIONS NCHAR=5;
FORMAT DATATYPE=DNA GAP=- MISSING=? MATCHCHAR=.;
MATRIX
'Python_regius' ACGTA
'Python_sebae' ACGTA
;
END;
BEGIN CHARACTERS;
TITLE DnaCharacterMatrix37504848;
LINK TAXA = TaxonSet5736800;
DIMENSIONS NCHAR=4;
FORMAT DATATYPE=DNA GAP=- MISSING=? MATCHCHAR=.;
MATRIX
'Python_regius' AAAA
'Python_sebae' ACGT
;
END;
Note that the taxon labels have changed semantically between the input and the NEXUS output, as, according to
the NEXUS standard, “Python_regius”, while equivalent to “Python regius”, is not equivalent to “'Python_regius'”.
To control this, you can pass the keyword argument quote_underscores=False (superseded by
unquoted_underscores=True, as noted in the Newick writing keywords) to any write_to_* or
as_string method, which will omit the quotes even if the labels contain underscores:
>>> print(d.as_string('nexus', quote_underscores=False))
#NEXUS
BEGIN TAXA;
TITLE TaxonSet5736800;
DIMENSIONS NTAX=2;
TAXLABELS
Python_regius
Python_sebae
;
END;
BEGIN CHARACTERS;
TITLE DnaCharacterMatrix37505040;
LINK TAXA = TaxonSet5736800;
DIMENSIONS NCHAR=5;
FORMAT DATATYPE=DNA GAP=- MISSING=? MATCHCHAR=.;
MATRIX
Python_regius ACGTA
Python_sebae ACGTA
;
END;
BEGIN CHARACTERS;
TITLE DnaCharacterMatrix37504848;
LINK TAXA = TaxonSet5736800;
DIMENSIONS NCHAR=4;
FORMAT DATATYPE=DNA GAP=- MISSING=? MATCHCHAR=.;
MATRIX
Python_regius AAAA
Python_sebae ACGT
;
END;
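The semantic change discussed above can be seen by round-tripping a label through a simplified writer and reader. Both functions below are hypothetical sketches of the quoting rules as described in this section, not DendroPy code; they ignore labels containing spaces, quotes or other special characters.

```python
def write_token(label, quote_underscores=True):
    """Write a label, quoting it if it contains underscores."""
    if quote_underscores and "_" in label:
        return "'%s'" % label
    return label


def read_token(token):
    """Read a label token under the default NEXUS/Newick rules."""
    if len(token) >= 2 and token.startswith("'") and token.endswith("'"):
        # Quoted: underscores are "hard" and are kept.
        return token[1:-1]
    # Unquoted: underscores are "soft" and are read as spaces.
    return token.replace("_", " ")


label = "Python_regius"
quoted_round_trip = read_token(write_token(label))
unquoted_round_trip = read_token(write_token(label, quote_underscores=False))
print(quoted_round_trip)
print(unquoted_round_trip)
```

With quoting, the label survives the round trip with its underscore intact; without quoting, a default reader turns the underscore back into a space.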
2.2 Partitions of Taxon Sets
A number of different applications require a specification of a partition of a set of taxa: for example, calculating
population genetic summary statistics for a multi-population sample, or counting deep coalescences given a particular
monophyletic grouping of taxa. The TaxonSetPartition object describes a partitioning of a TaxonSet into
an exhaustive set of mutually-exclusive TaxonSet subsets.
There are four different ways to specify a partitioning scheme: by using a function, attribute, dictionary or list. The
first three of these rely on providing a mapping of a Taxon object to a subset membership identifier, i.e., a string,
integer or some other type of value that identifies the grouping. The last explicitly describes the grouping as a list of
lists.
For example, consider the following:
>>> import dendropy
>>> seqstr = """\
... #NEXUS
...
... BEGIN TAXA;
...     DIMENSIONS NTAX=13;
...     TAXLABELS a1 a2 a3 b1 b2 b3 c1 c2 c3 c4 c5 d1 d2;
... END;
...
... BEGIN CHARACTERS;
...     DIMENSIONS NCHAR=7;
...     FORMAT DATATYPE=DNA MISSING=? GAP=- MATCHCHAR=.;
...     MATRIX
...         a1 ACCTTTG
...         a2 ACCTTTG
...         a3 ACCTTTG
...         b1 ATCTTTG
...         b2 ATCTTTG
...         b3 ACCTTTG
...         c1 ACCCTTG
...         c2 ACCCTTG
...         c3 ACCCTTG
...         c4 ACCCTTG
...         c5 ACCCTTG
...         d1 ACAAAAG
...         d2 ACCAAAG
...     ;
... END;
... """
>>> seqs = dendropy.DnaCharacterMatrix.get_from_string(seqstr, 'nexus')
>>> taxon_set = seqs.taxon_set
Here we have sequences sampled from four populations, with the population identified by the first character of the
taxon label. To create a partition of the TaxonSet resulting from parsing the file, we call the partition method.
This method takes one of the following four keyword arguments:
membership_func A function that takes a Taxon object as an argument and returns a population
membership identifier or flag (e.g., a string, an integer).
membership_attr_name Name of an attribute of Taxon objects that serves as an identifier for
subset membership.
membership_dict A dictionary with Taxon objects as keys and population membership identifier
or flag as values (e.g., a string, an integer).
membership_lists A container of containers of Taxon objects, with every Taxon object in
taxon_set represented once and only once in the sub-containers.
For example, using the membership function approach, we define a function that returns the first character of the taxon
label, and pass it to the partition using the membership_func keyword argument:
>>> def mf(t):
...     return t.label[0]
...
>>> tax_parts = taxon_set.partition(membership_func=mf)
Or, as would probably be done with such a simple membership function in practice:
>>> tax_parts = taxon_set.partition(membership_func=lambda x: x.label[0])
Either way, we would get the following results:
>>> for s in tax_parts.subsets():
...     print(s.description())
...
TaxonSet object at 0x101116838 (TaxonSet4312885304: ’a’): 3 Taxa
TaxonSet object at 0x101116788 (TaxonSet4312885128: ’c’): 5 Taxa
TaxonSet object at 0x101116730 (TaxonSet4312885040: ’d’): 2 Taxa
TaxonSet object at 0x1011167e0 (TaxonSet4312885216: ’b’): 3 Taxa
We could also add a population identification attribute to each Taxon object, and use the
membership_attr_name keyword to specify that subsets should be created based on the value of this
attribute:
>>> for t in taxon_set:
...     t.population = t.label[0]
...
>>> tax_parts = taxon_set.partition(membership_attr_name='population')
The results are identical to those above:
>>> for s in tax_parts.subsets():
...     print(s.description())
...
TaxonSet object at 0x1011166d8 (TaxonSet4312884952: ’a’): 3 Taxa
TaxonSet object at 0x1011165d0 (TaxonSet4312884688: ’c’): 5 Taxa
TaxonSet object at 0x1011169f0 (TaxonSet4312885744: ’d’): 2 Taxa
TaxonSet object at 0x101116680 (TaxonSet4312884864: ’b’): 3 Taxa
The third approach involves constructing a dictionary that maps Taxon objects to their identification label and passing
this using the membership_dict keyword argument:
>>> tax_pop_label_map = {}
>>> for t in taxon_set:
...     tax_pop_label_map[t] = t.label[0]
...
>>> tax_parts = taxon_set.partition(membership_dict=tax_pop_label_map)
Again, the results are the same:
>>> for s in tax_parts.subsets():
...     print(s.description())
...
TaxonSet object at 0x1011166e8 (TaxonSet4312884952: ’a’): 3 Taxa
TaxonSet object at 0x1011165f0 (TaxonSet4312884688: ’c’): 5 Taxa
TaxonSet object at 0x1011169f1 (TaxonSet4312885744: ’d’): 2 Taxa
TaxonSet object at 0x101116620 (TaxonSet4312884864: ’b’): 3 Taxa
Finally, a list of lists can be constructed and passed using the membership_lists argument:
>>> pops = []
>>> pops.append(taxon_set[0:3])
>>> pops.append(taxon_set[3:6])
>>> pops.append(taxon_set[6:11])
>>> pops.append(taxon_set[11:13])
>>> tax_parts = taxon_set.partition(membership_lists=pops)
Again, a TaxonSetPartition object with four TaxonSet subsets is the result, only this time the subset labels
are based on the list indices:
>>> subsets = tax_parts.subsets()
>>> print(subsets)
set([<TaxonSet object at 0x10069f838>, <TaxonSet object at 0x10069fba8>, <TaxonSet object at 0x101116
>>> for s in subsets:
...     print(s.description())
...
TaxonSet object at 0x10069f838 (TaxonSet4301912120: ’0’): 3 Taxa
TaxonSet object at 0x10069fba8 (TaxonSet4301913000: ’1’): 3 Taxa
TaxonSet object at 0x101116520 (TaxonSet4312884512: ’3’): 2 Taxa
TaxonSet object at 0x1011164c8 (TaxonSet4312884424: ’2’): 5 Taxa
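The grouping logic common to the function, attribute and dictionary approaches can be sketched in plain Python, independent of DendroPy. Here partition_by is a hypothetical helper, and plain label strings stand in for Taxon objects.

```python
from collections import defaultdict


def partition_by(items, membership_func):
    """Group items into subsets keyed by membership_func(item)."""
    groups = defaultdict(list)
    for item in items:
        groups[membership_func(item)].append(item)
    return dict(groups)


labels = "a1 a2 a3 b1 b2 b3 c1 c2 c3 c4 c5 d1 d2".split()
parts = partition_by(labels, lambda label: label[0])
sizes = dict((key, len(members)) for key, members in parts.items())
print(sizes)
```

Because every item is assigned to exactly one key, the resulting subsets are exhaustive and mutually exclusive, and their sizes match the counts reported above (3, 3, 5 and 2).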
CHAPTER 3
Working with Trees and Tree Lists
3.1 Trees and Collections of Trees
3.1.1 Trees and Tree Lists
Trees
Trees in DendroPy are represented by the class Tree. Every Tree object has a seed_node attribute. If the tree
is rooted, then this is the root node. If the tree is not rooted, however, then this is an artificial node that serves as the
“starting point” for the tree. The seed_node, like every other node on the tree, is a Node object. Every Node object
maintains a list of its immediate child Node objects as well as a reference to its parent Node object. You can request
a shallow-copy list of child Node objects using the child_nodes method, and you can access the parent Node
object directly through the parent_node attribute. By definition, the seed_node has no parent node, leaf nodes
have no child nodes, and internal nodes have both parent nodes and child nodes.
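The parent and child relationships described above can be pictured with a minimal stand-in. SketchNode is a hypothetical class written for this illustration, not DendroPy's Node, although it borrows the parent_node and child_nodes names from the text.

```python
class SketchNode(object):
    """Minimal tree node with a parent reference and a child list."""

    def __init__(self, label=None):
        self.label = label
        self.parent_node = None
        self._child_nodes = []

    def add_child(self, node):
        node.parent_node = self
        self._child_nodes.append(node)
        return node

    def child_nodes(self):
        # Return a shallow copy, so callers cannot alter the tree by
        # modifying the returned list.
        return list(self._child_nodes)

    def is_leaf(self):
        return not self._child_nodes


def leaf_labels(node):
    """Collect leaf labels by walking down from the given node."""
    if node.is_leaf():
        return [node.label]
    labels = []
    for child in node.child_nodes():
        labels.extend(leaf_labels(child))
    return labels


seed = SketchNode()
inner = seed.add_child(SketchNode())
inner.add_child(SketchNode("Python regius"))
inner.add_child(SketchNode("Python sebae"))
seed.add_child(SketchNode("Morelia spilota"))
print(leaf_labels(seed))
```

In this sketch the seed node has no parent, the leaves have no children, and the internal node has both.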
Tree Lists
TreeList objects are lists of Tree objects constrained to sharing the same TaxonSet. Any Tree object added
to a TreeList will have its taxon_set attribute assigned to the TaxonSet object of the TreeList, and all
referenced Taxon objects will be mapped to the same or corresponding Taxon objects of this new TaxonSet, with
new Taxon objects created if no suitable match is found.
3.1.2 Tree and TreeList Creation and Reading
Creating a New Tree or TreeList from a Data Source
Both the Tree and TreeList classes support the get_from_stream, get_from_path, and
get_from_string factory class methods for simultaneously instantiating and populating objects, taking a
data source as the first argument and a schema specification string (“nexus”, “newick”, “nexml”, “fasta”,
“phylip”, etc.) as the second:
>>> import dendropy
>>> tree = dendropy.Tree.get_from_path('pythonidae.mle.nex', 'nexus')
>>> treelist = dendropy.TreeList.get_from_path('pythonidae.mcmc.nex', 'nexus')
In addition, fine-grained control over the parsing of the data source is available through various keyword arguments.
Reading into an Existing Tree or TreeList from a Data Source
The read_from_stream, read_from_path, and read_from_string instance methods for populating existing objects are also supported, taking the same arguments (i.e., a data source, a schema specification string, as well
as optional keyword arguments to customize the parse behavior):
>>> import dendropy
>>> tree = dendropy.Tree()
>>> tree.read_from_path('pythonidae.mle.nex', 'nexus')
>>> treelist = dendropy.TreeList()
>>> treelist.read_from_path('pythonidae_cytb.mb.run1.t', 'nexus')
>>> treelist.read_from_path('pythonidae_cytb.mb.run2.t', 'nexus')
>>> treelist.read_from_path('pythonidae_cytb.mb.run3.t', 'nexus')
>>> treelist.read_from_path('pythonidae_cytb.mb.run4.t', 'nexus')
In the case of Tree objects, calling read_from_* repopulates (i.e., redefines) the Tree with data from the data
source, while in the case of TreeList objects, calling read_from_* appends the tree definitions in the data source
to the TreeList.
Cloning an Existing Tree or TreeList
You can also clone existing Tree and TreeList objects by passing them as arguments to their respective constructors.
For example, to create a clone of a Tree object:
>>> import dendropy
>>> tree1 = dendropy.Tree.get_from_path('pythonidae.mle.tree', 'nexus')
>>> tree2 = dendropy.Tree(tree1)
With this, tree2 will be an exact clone of tree1, and can be independently manipulated (e.g., derooted, branches
pruned, splits collapsed, etc.) without affecting tree1. Note, however, that the Taxon objects remain linked:
changing the label, for example, of a Taxon object on tree2 will result in the label of the corresponding Taxon
object in tree1 being similarly affected.
To create a clone of a TreeList object:
>>> import dendropy
>>> treelist1 = dendropy.TreeList.get_from_path('pythonidae.mcmc.nex', 'nexus')
>>> treelist2 = dendropy.TreeList(treelist1)
Here, treelist2 will be a deep-copy of treelist1, i.e., with each Tree object in treelist2 being a clone
of the corresponding Tree object in treelist1. The same constraint regarding Taxon objects applies: i.e., the
cloning does not extend to Taxon objects, and these are shared across all Tree objects in both treelist1 and
treelist2, as well as the TreeList objects themselves.
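The distinction between cloned structure and shared Taxon objects can be sketched with stand-in classes. SketchTaxon, SketchNode and clone_node are hypothetical names for this illustration and do not reflect how DendroPy implements cloning internally.

```python
class SketchTaxon(object):
    def __init__(self, label):
        self.label = label


class SketchNode(object):
    def __init__(self, taxon=None, children=None):
        self.taxon = taxon
        self.children = children or []


def clone_node(node):
    """Copy the node structure recursively, but share the taxon objects."""
    return SketchNode(node.taxon, [clone_node(c) for c in node.children])


regius = SketchTaxon("Python regius")
sebae = SketchTaxon("Python sebae")
tree1 = SketchNode(children=[SketchNode(regius), SketchNode(sebae)])
tree2 = clone_node(tree1)

# Structural changes to the clone leave the original untouched ...
tree2.children.pop()
# ... but a taxon label changed through the clone is seen by both trees.
tree2.children[0].taxon.label = "P. regius"
print(len(tree1.children), len(tree2.children), tree1.children[0].taxon.label)
```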
3.1.3 Tree and TreeList Saving and Writing
Writing to Files
The write_to_stream and write_to_path instance methods allow you to write the data of Tree and
TreeList objects to a file-like object or a file path respectively. These methods take a file-like object (in the case of
write_to_stream) or a string specifying a filepath (in the case of write_to_path) as the first argument, and
a schema specification string as the second argument.
The following example aggregates the post-burn-in MCMC samples from a series of NEXUS-formatted files, and
saves the collection as a Newick-formatted file:
>>> import dendropy
>>> treelist = dendropy.TreeList()
>>> treelist.read_from_path('pythonidae_cytb.mb.run1.t', 'nexus', tree_offset=200)
>>> treelist.read_from_path('pythonidae_cytb.mb.run2.t', 'nexus', tree_offset=200)
>>> treelist.read_from_path('pythonidae_cytb.mb.run3.t', 'nexus', tree_offset=200)
>>> treelist.read_from_path('pythonidae_cytb.mb.run4.t', 'nexus', tree_offset=200)
>>> treelist.write_to_path('pythonidae_cytb.mcmc-postburnin.tre', 'newick')
Fine-grained control over the output format can be specified using keyword arguments.
Composing a String
If you do not want to actually write to a file, but instead simply need a string representing the data in a particular
format, you can call the instance method as_string, passing a schema specification string as the first argument:
>>> import dendropy
>>> tree = dendropy.Tree()
>>> tree.read_from_path('pythonidae.mle.nex', 'nexus')
>>> s = tree.as_string('newick')
>>> print(s)
(Python_molurus:0.0779719244,((Python_sebae:0.1414715009,(((((Morelia_tracyae:0.0435011998,(Morel
As above, fine-grained control over the output format can be specified using keyword arguments.
3.1.4 Taxon Management with Trees and Tree Lists
Taxon Management with Trees
It is important to recognize that, by default, DendroPy will create a new TaxonSet object every time a data source is
parsed (and, if the data source has multiple taxon objects, there may be more than one TaxonSet created).
Consider the following example:
>>> import dendropy
>>> t1 = dendropy.Tree.get_from_string('((A,B),(C,D))', schema='newick')
>>> t2 = dendropy.Tree.get_from_string('((A,B),(C,D))', schema='newick')
>>> print(t1.description(2))
Tree object at 0x64b130 (Tree6599856): 7 Nodes, 7 Edges
[Taxon Set]
TaxonSet object at 0x64c270 (TaxonSet6603376): 4 Taxa
[Tree]
((A,B),(C,D))
>>> print(t2.description(2))
Tree object at 0x64b190 (Tree6600560): 7 Nodes, 7 Edges
[Taxon Set]
TaxonSet object at 0x64c1e0 (TaxonSet6603232): 4 Taxa
[Tree]
((A,B),(C,D))
We now have two distinct Tree objects, each associated with a distinct TaxonSet object, each with its own set of
Taxon objects that, while having the same labels, are distinct from one another:
>>> t1.leaf_nodes()[0].taxon == t2.leaf_nodes()[0].taxon
False
>>> t1.leaf_nodes()[0].taxon.label == t2.leaf_nodes()[0].taxon.label
True
This means that even though the tree shape and structure are identical between the two trees, they exist in different
universes as far as DendroPy is concerned, and many operations that involve comparing trees will fail:
>>> from dendropy import treecalc
>>> treecalc.robinson_foulds_distance(t1, t2)
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/Users/jeet/Documents/Projects/dendropy/dendropy/treecalc.py", line 263, in robinson_foulds_d
value_type=float)
File "/Users/jeet/Documents/Projects/dendropy/dendropy/treecalc.py", line 200, in splits_distance
% (hex(id(tree1.taxon_set)), hex(id(tree2.taxon_set))))
TypeError: Trees have different TaxonSet objects: 0x101f630 vs. 0x103bf30
The solution is to explicitly specify the same taxon_set when creating the trees. In DendroPy, all phylogenetic data
classes that are associated with TaxonSet objects have constructors, factory methods, and read_from_* methods
that take a specific TaxonSet object as an argument using the taxon_set keyword. For example:
>>> taxa = dendropy.TaxonSet()
>>> t1 = dendropy.Tree.get_from_string('((A,B),(C,D))', schema='newick', taxon_set=taxa)
>>> t2 = dendropy.Tree.get_from_string('((A,B),(C,D))', schema='newick', taxon_set=taxa)
>>> treecalc.robinson_foulds_distance(t1, t2)
0.0
Taxon Management with Tree Lists
The TreeList class is designed to manage collections of Tree objects that share the same TaxonSet. As Tree
objects are appended to a TreeList object, the TreeList object will automatically take care of remapping the
TaxonSet and associated Taxon objects:
>>> t1 = dendropy.Tree.get_from_string('((A,B),(C,D))', schema='newick')
>>> t2 = dendropy.Tree.get_from_string('((A,B),(C,D))', schema='newick')
>>> print(repr(t1.taxon_set))
<TaxonSet object at 0x1243a20>
>>> repr(t1.taxon_set)
'<TaxonSet object at 0x1243a20>'
>>> repr(t2.taxon_set)
'<TaxonSet object at 0x12439f0>'
>>> trees = dendropy.TreeList()
>>> trees.append(t1)
>>> trees.append(t2)
>>> repr(t1.taxon_set)
'<TaxonSet object at 0x1243870>'
>>> repr(t2.taxon_set)
'<TaxonSet object at 0x1243870>'
>>> treecalc.robinson_foulds_distance(t1, t2)
0.0
The same applies when using the read_from_* method of a TreeList object: all trees read from the data source
will be assigned the same TaxonSet object, and the taxa referenced in the tree definition will be mapped to
corresponding Taxon objects, identified by label, in the TaxonSet, with new Taxon objects created if no suitable match
is found.
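The label-based mapping described above can be sketched in plain Python. The SimpleTaxonSet class and its get_or_create method below are hypothetical stand-ins invented for this illustration (they are not DendroPy classes or methods); they only show the idea of looking a taxon up by label and creating it when no match exists:

```python
class Taxon(object):
    """Stand-in for a taxon: just a label."""
    def __init__(self, label):
        self.label = label

class SimpleTaxonSet(object):
    """Hypothetical label-indexed collection of Taxon objects."""
    def __init__(self):
        self._by_label = {}

    def get_or_create(self, label):
        # Reuse the existing Taxon for this label, or create a new one.
        if label not in self._by_label:
            self._by_label[label] = Taxon(label)
        return self._by_label[label]

    def __len__(self):
        return len(self._by_label)

taxa = SimpleTaxonSet()
# Labels found in two separate tree definitions read into the same set.
tree1_taxa = [taxa.get_or_create(lbl) for lbl in ["A", "B", "C", "D"]]
tree2_taxa = [taxa.get_or_create(lbl) for lbl in ["A", "B", "C", "E"]]

print(tree1_taxa[0] is tree2_taxa[0])
print(len(taxa))
```

Both trees end up referring to the same object for "A", and only the unmatched label "E" causes a new entry to be added.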
While TreeList objects ensure that all Tree objects created, read or added using them all have the same
TaxonSet object reference, if two TreeList objects are independently created, they will each have their own,
distinct, TaxonSet object reference. For example, if you want to read in two collections of trees and compare trees
between the two collections, the following will not work:
>>> import dendropy
>>> mcmc1 = dendropy.TreeList.get_from_path('pythonidae.mcmc1.nex', 'nexus')
>>> mcmc2 = dendropy.TreeList.get_from_path('pythonidae.mcmc2.nex', 'nexus')
Of course, reading both data sources into the same TreeList object will work insofar as ensuring all the Tree
objects have the same TaxonSet reference, but then you will lose the distinction between the two sources, unless
you keep track of the indexes of where one source begins and the other ends, which is error-prone and tedious. A better
approach would be simply to create a TaxonSet object, and pass it to the factory methods of both TreeList
objects:
>>> import dendropy
>>> taxa = dendropy.TaxonSet()
>>> mcmc1 = dendropy.TreeList.get_from_path('pythonidae.mcmc1.nex', 'nexus', taxon_set=taxa)
>>> mcmc2 = dendropy.TreeList.get_from_path('pythonidae.mcmc2.nex', 'nexus', taxon_set=taxa)
Now both mcmc1 and mcmc2 share the same TaxonSet, and thus so do the Tree objects created within them,
which means the Tree objects can be compared both within and between the collections.
You can also pass the TaxonSet to the constructor of TreeList. So, for example, the following is logically
identical to the previous:
>>> import dendropy
>>> taxa = dendropy.TaxonSet()
>>> mcmc1 = dendropy.TreeList(taxon_set=taxa)
>>> mcmc1.read_from_path('pythonidae.mcmc1.nex', 'nexus')
>>> mcmc2 = dendropy.TreeList(taxon_set=taxa)
>>> mcmc2.read_from_path('pythonidae.mcmc2.nex', 'nexus')
3.1.5 Efficiently Iterating Over Trees in a File
If you need to process a collection of trees defined in a file source, you can read the trees into a TreeList object and
iterate over the resulting collection:
>>> import dendropy
>>> trees = dendropy.TreeList.get_from_path('pythonidae.beast-mcmc.trees', 'nexus')
>>> for tree in trees:
...     print(tree.as_string('newick'))
In the above, the entire data source is parsed and stored in the trees object before being processed in the subsequent
lines. In some cases, you might not need to maintain all the trees in memory at the same time. For example, you might
be interested in calculating the distribution of a statistic over a collection of trees, but have no need to refer to any of the
trees after the statistic has been calculated. In this case, it might be more efficient to use the tree_source_iter
function. This takes a file-like object as its first argument and a schema specification as the second and returns an
iterator over the trees in the file. Additional keyword arguments to customize the parsing are the same as those for the
general get_from_* and read_from_* methods. For example, the following script reads a model tree from a
file, and then iterates over a collection of MCMC trees in another file, calculating and storing the symmetric distance
between the model tree and each of the MCMC trees one at a time:
#! /usr/bin/env python

import dendropy
from dendropy import tree_source_iter
from dendropy import treecalc

distances = []
# a single TaxonSet shared by both sources, so that the trees are comparable
taxa = dendropy.TaxonSet()
mle_tree = dendropy.Tree.get_from_path('pythonidae.mle.nex', 'nexus', taxon_set=taxa)
# trees are parsed and yielded one at a time, skipping the first 200 (burn-in)
for mcmc_tree in tree_source_iter(
        stream=open('pythonidae.mcmc.nex', 'rU'),
        schema='nexus',
        taxon_set=taxa,
        tree_offset=200):
    distances.append(treecalc.symmetric_difference(mle_tree, mcmc_tree))
# convert to float before dividing, and report the mean without truncation
print("Mean symmetric distance between MLE and MCMC trees: %s"
    % (float(sum(distances))/len(distances)))
Note how a TaxonSet object is created and passed to both the get_from_path and the tree_source_iter
functions using the taxon_set keyword argument. This is to ensure that the corresponding taxa in both sources
get mapped to the same Taxon objects in DendroPy object space, so as to enable comparisons of the trees. If this
was not done, then each tree would have its own distinct TaxonSet object (and associated Taxon objects), making
comparisons impossible.
Also note how the tree_offset keyword is used to skip over the burn-in trees from the MCMC sample.
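The memory argument behind this approach can be sketched in plain Python with a generator. The iter_newick_strings function below is a hypothetical helper written for this illustration (it is not part of DendroPy, and it does no real parsing); it assumes that every tree statement ends in a semicolon and that labels and comments contain no semicolons. It yields one statement at a time and skips the first tree_offset statements, so a running statistic can be accumulated without holding the whole collection in memory:

```python
import io

def iter_newick_strings(stream, tree_offset=0):
    """Yield Newick statements one at a time, skipping the first tree_offset."""
    buf = []
    count = 0
    while True:
        ch = stream.read(1)
        if not ch:
            break
        buf.append(ch)
        if ch == ";":
            statement = "".join(buf).strip()
            buf = []
            if count >= tree_offset:
                yield statement
            count += 1

# Four toy "trees"; the first two play the role of burn-in samples.
source = io.StringIO(u"(A,(B,C));\n((A,B),C);\n((A,C),B);\n(A,B,C);\n")

total = 0
n = 0
for newick in iter_newick_strings(source, tree_offset=2):
    # A stand-in statistic: the number of opening parentheses in the statement.
    total += newick.count("(")
    n += 1
mean_open_parens = float(total) / n
print(n, mean_open_parens)
```

Only the current statement and the two running totals are alive at any point in the loop, which is the property that makes an iterator preferable to a fully populated list for large samples.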
If you want to iterate over trees in multiple sources, you can use the multi_tree_source_iter function. This takes a
list of file-like objects or a list of filepath strings as its first argument, and a schema-specification string as its second
argument. Again, other keyword arguments supported by the general get_from_* and read_from_* methods
are also available.
For example:
#! /usr/bin/env python

import dendropy
from dendropy import multi_tree_source_iter
from dendropy import treecalc

distances = []
# a single TaxonSet shared by all sources, so that the trees are comparable
taxa = dendropy.TaxonSet()
mle_tree = dendropy.Tree.get_from_path('pythonidae.mle.nex', 'nexus', taxon_set=taxa)
mcmc_tree_file_paths = ['pythonidae.mb.run1.t',
                        'pythonidae.mb.run2.t',
                        'pythonidae.mb.run3.t',
                        'pythonidae.mb.run4.t']
for mcmc_tree in multi_tree_source_iter(
        mcmc_tree_file_paths,
        schema='nexus',
        taxon_set=taxa):
    distances.append(treecalc.symmetric_difference(mle_tree, mcmc_tree))
# convert to float before dividing, and report the mean without truncation
print("Mean symmetric distance between MLE and MCMC trees: %s"
    % (float(sum(distances))/len(distances)))
3.1.6 Viewing and Displaying Trees
Sometimes it is useful to get a visual representation of a Tree.
For quick inspection, the print_plot method will write an ASCII text plot to the standard output stream:
>>> t = dendropy.Tree.get_from_string("(A,(B,(C,D)))", "newick")
>>> t.print_plot()
/----------------------------------------------- A
+
|                /------------------------------ B
\----------------+
                 |          /------------------- C
                 \----------+
                            \------------------- D
If you need to store this representation as a string instead, you can use as_ascii_plot:
>>> s = t.as_ascii_plot()
>>> print(s)
/----------------------------------------------- A
+
|                /------------------------------ B
\----------------+
                 |          /------------------- C
                 \----------+
                            \------------------- D
While the write_to_path, write_to_stream and as_string methods provide for a rich and flexible way
to write representations of a Tree in various formats to various destinations, the print_newick method provides a
quick-and-dirty way to get a snapshot NEWICK string of the tree:
>>> t.print_newick()
(A,(B,(C,D)))
3.2 Tree Traversal and Navigation
3.2.1 Tree Traversal
Iterating Over Nodes
The following example shows how you might evolve a continuous character on a tree by recursively visiting each node,
and setting the value of the character to one drawn from a normal distribution centered on the value of the character of
the node's ancestor, with standard deviation given by the length of the edge subtending the node:
#! /usr/bin/env python

import random
import dendropy

def process_node(node, start=1.0):
    if node.parent_node is None:
        node.value = start
    else:
        node.value = random.gauss(node.parent_node.value, node.edge.length)
    for child in node.child_nodes():
        process_node(child)
    if node.taxon is not None:
        print("%s : %s" % (node.taxon, node.value))

mle = dendropy.Tree.get_from_path('pythonidae.mle.nex', 'nexus')
process_node(mle.seed_node)
While the previous example works, it is probably clearer and more efficient to use one of the pre-defined node iterator
methods:
preorder_node_iter Iterates over nodes in a Tree object in a depth-first search pattern, i.e., “visiting” a node before visiting the children of the node. This is the same traversal order as the previous
example. This traversal order is useful if you require ancestral nodes to be processed before descendent nodes, as, for example, when evolving sequences over a tree.
postorder_node_iter Iterates over nodes in a Tree object in a postorder search pattern, i.e., visiting the children of the node before visiting the node itself. This traversal order is useful if you
require descendent nodes to be processed before ancestor nodes, as, for example, when calculating
ages of nodes.
level_order_node_iter Iterates over nodes in a Tree object in a breadth-first search pattern, i.e.,
every node at a particular level is visited before proceeding to the next level.
leaf_iter Iterates over the leaf or tip nodes of a Tree object.
The previous example would thus be better implemented as follows:
#! /usr/bin/env python

import random
import dendropy

def evolve_char(tree, start=1.0):
    # preorder guarantees that a parent's value is set before its children's
    for node in tree.preorder_node_iter():
        if node.parent_node is None:
            node.value = start
        else:
            node.value = random.gauss(node.parent_node.value, node.edge.length)
    return tree

mle = dendropy.Tree.get_from_path('pythonidae.mle.nex', 'nexus')
evolve_char(mle)
for node in mle.leaf_iter():
    print("%s : %s" % (node.taxon, node.value))
The nodes returned by each of these iterators can be filtered if a filter function is passed as a second argument to the
iterator. This filter function should take a Node object as an argument, and return True if the node is to be returned or
False if it is not. For example, the following iterates over all nodes that have more than two children:
#! /usr/bin/env python

import dendropy

mle = dendropy.Tree.get_from_path('pythonidae.mle.nex', 'nexus')
multifurcating = lambda x: True if len(x.child_nodes()) > 2 else False
for nd in mle.postorder_node_iter(multifurcating):
    print(nd.description(0))
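To make the traversal orders and the filter-function convention described above concrete, here is a minimal plain-Python sketch. The Node class and the preorder, postorder and level_order generators are stand-ins written for this illustration (they are not DendroPy's implementation); each generator takes an optional filter function that receives a node and returns True if the node should be yielded:

```python
from collections import deque

class Node(object):
    """Stand-in tree node with a label and a list of children."""
    def __init__(self, label, children=()):
        self.label = label
        self.children = list(children)

def preorder(node, filter_fn=None):
    """Visit a node before its children."""
    if filter_fn is None or filter_fn(node):
        yield node
    for child in node.children:
        for nd in preorder(child, filter_fn):
            yield nd

def postorder(node, filter_fn=None):
    """Visit the children of a node before the node itself."""
    for child in node.children:
        for nd in postorder(child, filter_fn):
            yield nd
    if filter_fn is None or filter_fn(node):
        yield node

def level_order(node, filter_fn=None):
    """Visit every node at one level before moving to the next level."""
    queue = deque([node])
    while queue:
        nd = queue.popleft()
        queue.extend(nd.children)
        if filter_fn is None or filter_fn(nd):
            yield nd

# The tree ((A,B),(C,D)), with internal nodes labelled for illustration.
tree = Node("root", [Node("n1", [Node("A"), Node("B")]),
                     Node("n2", [Node("C"), Node("D")])])

pre = [nd.label for nd in preorder(tree)]
post = [nd.label for nd in postorder(tree)]
lvl = [nd.label for nd in level_order(tree)]
# A filter function that keeps only leaves (nodes without children).
leaves = [nd.label for nd in postorder(tree, lambda nd: not nd.children)]

print(pre)
print(post)
print(lvl)
print(leaves)
```

Note that the filter only controls which nodes are yielded; the traversal still descends through nodes that the filter rejects.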
Iterating Over Edges
The Edge objects associated with each Node can be accessed through the edge attribute of the Node object. So it is
possible to iterate over every edge on a tree by iterating over the nodes and referencing the edge attribute of the node
when processing the node. But it is clearer and probably more convenient to use one of the Edge iterators:
preorder_edge_iter Iterates over edges in a Tree object in a depth-first search pattern, i.e., “visiting” an edge before visiting the edges descending from that edge. This is the same traversal order
as the previous example. This traversal order is useful if you require ancestral edges to be processed
before descendent edges, as, for example, when calculating the sum of edge lengths from the root.
postorder_edge_iter Iterates over edges in a Tree object in a postorder search pattern, i.e., visiting the descendents of the edge before visiting the edge itself. This traversal order is useful if you
require descendent edges to be processed before ancestral edges, as, for example, when calculating
the sum of edge lengths from the tip.
level_order_edge_iter Iterates over edges in a Tree object in a breadth-first search pattern, i.e.,
every edge at a particular level is visited before proceeding to the next level.
The following example sets the edge lengths of a tree to the proportions of the total tree length that they represent:
#! /usr/bin/env python

import dendropy

mle = dendropy.Tree.get_from_path('pythonidae.mle.nex', 'nexus')
mle_len = mle.length()
for edge in mle.postorder_edge_iter():
    # edges without an assigned length (e.g., the root edge) are left as they are
    if edge.length is not None:
        edge.length = float(edge.length)/mle_len
print(mle.as_string("newick"))
While this one removes the edge lengths entirely:
#! /usr/bin/env python

import dendropy

mle = dendropy.Tree.get_from_path('pythonidae.mle.nex', 'nexus')
mle_len = mle.length()
for edge in mle.postorder_edge_iter():
    edge.length = None
print(mle.as_string("newick"))
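The proportional rescaling in the first of the two scripts above amounts to dividing each edge length by the total tree length. A small worked example on a plain list of made-up lengths (None standing for an edge with no assigned length, which is counted as 0 in the total and left untouched) shows the arithmetic; the values are purely illustrative:

```python
# Hypothetical edge lengths; None marks an edge without an assigned length.
edge_lengths = [None, 0.5, 1.5, 2.0, None, 1.0]

# Unassigned lengths contribute nothing to the total tree length.
total = sum(x for x in edge_lengths if x is not None)

# Each assigned length becomes its share of the total.
proportions = [None if x is None else x / total for x in edge_lengths]

print(total)
print(proportions)
```

After rescaling, the assigned lengths sum to 1 (up to floating-point rounding), so each value can be read directly as a fraction of the tree length.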
Like the node iterators, the edge iterators also optionally take a filter function as a second argument, except here the
filter function should take an Edge object as an argument. The following example shows how you might iterate over
all edges with lengths less than some value:
#! /usr/bin/env python

import dendropy

mle = dendropy.Tree.get_from_path('pythonidae.mle.nex', 'nexus')
# skip edges with no assigned length (e.g., the root edge)
short = lambda edge: True if edge.length is not None and edge.length < 0.01 else False
for edge in mle.postorder_edge_iter(short):
    print(edge.length)
3.2.2 Finding Nodes on Trees
Nodes with Particular Taxa
To retrieve a node associated with a particular taxon, we can use the find_node_with_taxon method, which takes a
filter function as an argument. The filter function should take a Taxon object as an argument and return True if the
node associated with that taxon is to be returned. For example:
#! /usr/bin/env python

import dendropy

tree = dendropy.Tree.get_from_path("pythonidae.mle.nex", "nexus")
filter = lambda taxon: True if taxon.label=='Antaresia maculosa' else False
node = tree.find_node_with_taxon(filter)
print(node.description())
Because we might find it easier to refer to Taxon objects by their labels, a convenience method that wraps the retrieval
of nodes associated with Taxon objects with a particular label is provided:
#! /usr/bin/env python

import dendropy

tree = dendropy.Tree.get_from_path("pythonidae.mle.nex", "nexus")
node = tree.find_node_with_taxon_label('Antaresia maculosa')
print(node.description())
Most Recent Common Ancestors
The MRCA (most recent common ancestor) of taxa or nodes can be retrieved by the instance method mrca. This
method takes a list of Taxon objects given by the taxa keyword argument, or a list of taxon labels given by the
taxon_labels keyword argument, and returns a Node object that corresponds to the MRCA of the specified taxa.
For example:
#! /usr/bin/env python

import dendropy

tree = dendropy.Tree.get_from_path("pythonidae.mle.nex", "nexus")
taxon_labels=['Python sebae',
              'Python regius',
              'Python curtus',
              'Python molurus']
mrca = tree.mrca(taxon_labels=taxon_labels)
print(mrca.description())
Note that this method is inefficient when you need to resolve MRCAs for multiple sets or pairs of taxa. In this context,
the PatristicDistanceMatrix offers a more efficient approach, and should be preferred for applications such
as calculating the patristic distances between all pairs of taxa.
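One way to see why repeated MRCA queries are costly is to look at a naive implementation. The sketch below uses a stand-in Node class with a parent pointer and a hypothetical mrca function (it is not how DendroPy resolves MRCAs, which is not documented here); every query walks from each node all the way up to the root:

```python
class Node(object):
    """Stand-in tree node that only knows its label and its parent."""
    def __init__(self, label, parent=None):
        self.label = label
        self.parent = parent

def ancestors_inclusive(node):
    """Return the chain [node, parent, grandparent, ..., root]."""
    chain = []
    while node is not None:
        chain.append(node)
        node = node.parent
    return chain

def mrca(nodes):
    """Return the deepest node that is an ancestor of (or equal to) every node."""
    nodes = list(nodes)
    # Candidates are ordered from the first node toward the root.
    candidates = ancestors_inclusive(nodes[0])
    for other in nodes[1:]:
        other_chain = set(id(nd) for nd in ancestors_inclusive(other))
        candidates = [nd for nd in candidates if id(nd) in other_chain]
    # The first surviving candidate is the one furthest from the root.
    return candidates[0]

# The tree (A,(B,(C,D))), with internal nodes labelled for illustration.
root = Node("root")
a = Node("A", root)
n1 = Node("n1", root)
b = Node("B", n1)
n2 = Node("n2", n1)
c = Node("C", n2)
d = Node("D", n2)

print(mrca([c, d]).label)
print(mrca([b, d]).label)
print(mrca([a, c]).label)
```

Because each call rebuilds the ancestor chains from scratch, answering the question for all pairs of taxa repeats the same walks many times, which is the situation in which a precomputed structure pays off.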
3.3 Tree Statistics, Metrics, and Calculations
3.3.1 Tree Length
The length method returns the sum of edge lengths of a Tree object, with edges that do not have any length
assigned being treated as edges with length 0. The following example shows how to identify the “critical” value for
an Archie-Faith-Cranston or PTP test from a sample of Tree objects, i.e. a tree length equal to or greater than 95%
of the trees in the sample:
#! /usr/bin/env python

import dendropy

trees = dendropy.TreeList.get_from_path("pythonidae.random.bd0301.tre", "nexus")
tree_lengths = [tree.length() for tree in trees]
tree_lengths.sort()
crit_index_95 = int(0.95 * len(tree_lengths))
crit_index_99 = int(0.99 * len(tree_lengths))

print("95%% critical value: %s" % tree_lengths[crit_index_95])
print("99%% critical value: %s" % tree_lengths[crit_index_99])
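The index arithmetic used above can be checked on a small made-up sample. The sketch below uses 100 hypothetical tree lengths (the values 1.0 to 100.0, chosen only so that the result is easy to verify by eye) and computes the index with integer arithmetic, which is a substitution made here to sidestep floating-point rounding in expressions such as int(0.95 * n); it is not the form used in the script above:

```python
# 100 hypothetical tree lengths: 1.0, 2.0, ..., 100.0
tree_lengths = [float(i) for i in range(1, 101)]
tree_lengths.sort()

n = len(tree_lengths)
# Index of the first value that is >= 95% (or 99%) of the sample.
crit_index_95 = (95 * n) // 100
crit_index_99 = (99 * n) // 100

crit_95 = tree_lengths[crit_index_95]
crit_99 = tree_lengths[crit_index_99]

# Count how many sampled lengths lie strictly below each critical value.
below_95 = sum(1 for x in tree_lengths if x < crit_95)
below_99 = sum(1 for x in tree_lengths if x < crit_99)

print(crit_95, below_95)
print(crit_99, below_99)
```

With a sorted sample, the value at position 95 out of 100 has exactly 95 smaller values before it, which is what makes it usable as the 95% critical value.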
3.3.2 Node Ages
The calc_node_ages method calculates the age of a node (i.e., the sum of edge lengths from the node to a tip)
and assigns it to the age attribute. The following example iterates through the post-burn-in of an MCMC sample of
ultrametric trees, calculating the age of the MRCA of two taxa, and reports the mean age of the node.
#! /usr/bin/env python

import dendropy

trees = dendropy.TreeList.get_from_path("pythonidae.beast-mcmc.trees",
        "nexus",
        tree_offset=200)
maculosa_childreni_ages = []
for idx, tree in enumerate(trees):
    tree.calc_node_ages()
    # mrca is an instance method that accepts taxon labels via taxon_labels
    mrca = tree.mrca(taxon_labels=["Antaresia maculosa", "Antaresia childreni"])
    maculosa_childreni_ages.append(mrca.age)
print("Mean age of MRCA of 'Antaresia maculosa' and 'Antaresia childreni': %s" \
    % (float(sum(maculosa_childreni_ages))/len(maculosa_childreni_ages)))
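The definition of node age given above (the sum of edge lengths from the node to a tip) can be sketched in plain Python. The Node class and calc_ages function below are stand-ins written for this illustration (not DendroPy's implementation), and the sketch assumes an ultrametric tree, so that following any child leads to the same age:

```python
class Node(object):
    """Stand-in node: label, length of the edge above it, and children."""
    def __init__(self, label, edge_length=0.0, children=()):
        self.label = label
        self.edge_length = edge_length
        self.children = list(children)
        self.age = None

def calc_ages(node):
    """Assign node.age in postorder: tips are 0, parents add a child's edge."""
    if not node.children:
        node.age = 0.0
    else:
        for child in node.children:
            calc_ages(child)
        # On an ultrametric tree every child gives the same answer.
        first = node.children[0]
        node.age = first.age + first.edge_length
    return node.age

# The ultrametric tree (A:3,(B:2,(C:1,D:1):1):1)
n2 = Node("n2", 1.0, [Node("C", 1.0), Node("D", 1.0)])
n1 = Node("n1", 1.0, [Node("B", 2.0), n2])
root = Node("root", 0.0, [Node("A", 3.0), n1])

calc_ages(root)
print(n2.age, n1.age, root.age)
```

The children are processed before their parent, which is the reason a postorder traversal is the natural fit for this calculation.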
3.3.3 Pybus-Harvey Gamma
The Pybus-Harvey Gamma statistic is given by the pybus_harvey_gamma instance method. The following example iterates through the post-burn-in of an MCMC sample of trees, reporting the mean Pybus-Harvey Gamma statistic:
#! /usr/bin/env python

import dendropy

trees = dendropy.TreeList.get_from_path("pythonidae.beast-mcmc.trees",
        "nexus",
        tree_offset=200)
pbhg = []
for idx, tree in enumerate(trees):
    pbhg.append(tree.pybus_harvey_gamma())
print("Mean Pybus-Harvey-Gamma: %s" \
    % (float(sum(pbhg))/len(pbhg)))
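For readers who want to see what the statistic measures, the following is a minimal sketch of the gamma statistic as defined by Pybus and Harvey (2000), computed from internode intervals rather than from a tree object. The function name gamma_from_intervals and its input convention are assumptions made for this illustration (this is not DendroPy's code): intervals[0] is the time during which there are 2 lineages, intervals[1] the time with 3 lineages, and so on up to n lineages for a tree with n tips.

```python
import math

def gamma_from_intervals(intervals):
    """Gamma statistic from internode intervals g_2, g_3, ..., g_n."""
    n = len(intervals) + 1  # number of tips
    if n < 3:
        raise ValueError("at least three tips are required")
    # k * g_k for k = 2 .. n
    weighted = [(k + 2) * g for k, g in enumerate(intervals)]
    T = sum(weighted)
    # Sum over i = 2 .. n-1 of the partial sums sum_{k=2..i} k * g_k
    running = 0.0
    accum = 0.0
    for w in weighted[:-1]:
        running += w
        accum += running
    numerator = accum / (n - 2) - T / 2.0
    denominator = T * math.sqrt(1.0 / (12 * (n - 2)))
    return numerator / denominator

# Intervals chosen so that k * g_k is the same for every k.
balanced = gamma_from_intervals([1.5, 1.0])
# Equal intervals give a nonzero value.
equal = gamma_from_intervals([1.0, 1.0])
print(balanced, equal)
```

When every k * g_k term is equal, the numerator cancels and the statistic is 0; departures from that pattern push it away from 0 in one direction or the other.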
3.3.4 Patristic Distances
The PatristicDistanceMatrix is the most efficient way to calculate the patristic distances between taxa or
leaves on a tree, when doing multiple such calculations. Its constructor takes a Tree object as an argument, and the
object returned is callable, taking two Taxon objects as arguments and returning the sum of edge lengths between the
two. The following example reports the pairwise distances between all taxa on the input tree:
#! /usr/bin/env python

import dendropy
from dendropy import treecalc

tree = dendropy.Tree.get_from_path("pythonidae.mle.nex", "nexus")
pdm = treecalc.PatristicDistanceMatrix(tree)
for i, t1 in enumerate(tree.taxon_set):
    for t2 in tree.taxon_set[i+1:]:
        print("Distance between '%s' and '%s': %s" % (t1.label, t2.label, pdm(t1, t2)))
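The quantity being reported, the sum of edge lengths on the path between two leaves, can be sketched for a single pair in plain Python. The Node class and the two helper functions below are stand-ins written for this illustration (they are not the algorithm used by PatristicDistanceMatrix, which precomputes distances for all pairs): the path from each leaf is followed up to the first shared ancestor, and the two partial sums are added.

```python
class Node(object):
    """Stand-in node: label, length of the edge above it, and its parent."""
    def __init__(self, label, edge_length=0.0, parent=None):
        self.label = label
        self.edge_length = edge_length
        self.parent = parent

def distances_to_ancestors(node):
    """Return [(ancestor, summed edge length from node up to ancestor), ...]."""
    dist = 0.0
    out = []
    while node is not None:
        out.append((node, dist))
        dist += node.edge_length
        node = node.parent
    return out

def patristic_distance(a, b):
    """Sum of edge lengths on the path between nodes a and b."""
    up_from_b = dict((id(nd), d) for nd, d in distances_to_ancestors(b))
    for nd, d in distances_to_ancestors(a):
        if id(nd) in up_from_b:
            # nd is the first ancestor of a that is also an ancestor of b.
            return d + up_from_b[id(nd)]
    raise ValueError("nodes are not on the same tree")

# The tree (A:3,(B:2,(C:1,D:1):1):1)
root = Node("root")
a = Node("A", 3.0, root)
n1 = Node("n1", 1.0, root)
b = Node("B", 2.0, n1)
n2 = Node("n2", 1.0, n1)
c = Node("C", 1.0, n2)
d = Node("D", 1.0, n2)

print(patristic_distance(c, d))
print(patristic_distance(b, c))
print(patristic_distance(a, d))
```

Each call here walks both leaves up toward the root, so computing all pairs this way repeats a lot of work, which is the cost a precomputed matrix avoids.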
3.3.5 Probability Under the Coalescent Model
The coalescent module provides a range of methods for simulations and calculations under Kingman’s coalescent
framework and related models. For example:
log_probability_of_coalescent_tree Given a Tree object as the first argument, and the
haploid population size as the second, returns the log probability of the Tree under the neutral
coalescent.
kl_divergence_coalescent_trees Reports the Kullback-Leibler divergence of a list of trees
from the theoretical distribution of neutral coalescent trees. Requires the de Hoon statistics
package to be installed.
3.3.6 Numbers of Deep Coalescences
reconciliation_discordance Given two Tree objects sharing the same leaf-set, this returns
the number of deep coalescences resulting from fitting the first tree (e.g., a gene tree) to the second
(e.g., a species tree). This is based on the algorithm described by Goodman et al. (Goodman et al.,
1979. Fitting the gene lineage into its species lineage, a parsimony strategy illustrated by cladograms
constructed from globin sequences. Syst. Zool. 28: 132-163).
Changed in version 3.3.0: Renamed and moved to reconcile module.
monophyletic_partition_discordance Given a Tree object as the first argument, and a list
of lists of Taxon objects representing the expected monophyletic partitioning of the TaxonSet
of the Tree as the second argument, this returns the number of deep coalescences found in the
relationships implied by the Tree object, conditional on the taxon groupings given by the second
argument. This statistic corresponds to the Slatkin and Maddison (1989) s statistic, as described
here.
Changed in version 3.3.0: Renamed and moved to reconcile module.
3.3.7 Number of Deep Coalescences when Embedding One Tree in Another (e.g. Gene/Species Trees)
Imagine we wanted to generate the distribution of the number of deep coalescences under two scenarios: one in which
a population underwent sequential or step-wise vicariance, and another when there was simultaneous fragmentation.
In this case, the containing tree and the embedded trees have different leaf sets, and there is a many-to-one mapping
of embedded tree taxa to containing tree taxa.
The ContainingTree class is designed to allow for counting deep coalescences in cases like this. It requires a
TaxonSetMapping object, which provides an association between the embedded taxa and the containing taxa.
The easiest way to get a TaxonSetMapping object is to call the special factory function
create_contained_taxon_mapping. This will create a new TaxonSet to manage the gene taxa, and create
the associations between the gene taxa and the containing tree taxa for you. It takes two arguments: the TaxonSet
of the containing tree, and the number of genes you want sampled from each species. If the gene-species associations
are more complex, e.g., different numbers of genes per species, we can pass in a list of values as the second argument
to create_contained_taxon_mapping. This approach should be used
with caution if we cannot be certain of the order of taxa (as is the case with data read in Newick formats). In these
cases, and in more complex cases, we might need to directly instantiate the TaxonSetMapping object. The API to
describe the associations when constructing this object is very similar to that of the TaxonSetPartition object:
you can use a function, attribute or dictionary.
The ContainingTree class has its own native contained coalescent simulator, embed_contained_kingman,
which simulates and embeds a contained coalescent tree at the same time.
#! /usr/bin/env python

import dendropy
from dendropy import treesim
from dendropy import reconcile

# simulation parameters and output
num_reps = 10

# population tree descriptions
stepwise_tree_str = "[&R](A:120000,(B:80000,(C:40000,D:40000):40000):40000):100000"
frag_tree_str = "[&R](A:120000,B:120000,C:120000,D:120000):100000"

# taxa and trees
containing_taxa = dendropy.TaxonSet()
stepwise_tree = dendropy.Tree.get_from_string(
        stepwise_tree_str,
        "newick",
        taxon_set=containing_taxa)
frag_tree = dendropy.Tree.get_from_string(
        frag_tree_str,
        "newick",
        taxon_set=containing_taxa)

# taxon set association
genes_to_species = dendropy.TaxonSetMapping.create_contained_taxon_mapping(
        containing_taxon_set=containing_taxa,
        num_contained=8)

# convert to containing tree
stepwise_tree = reconcile.ContainingTree(stepwise_tree,
        contained_taxon_set=genes_to_species.domain_taxon_set,
        contained_to_containing_taxon_map=genes_to_species)
frag_tree = reconcile.ContainingTree(frag_tree,
        contained_taxon_set=genes_to_species.domain_taxon_set,
        contained_to_containing_taxon_map=genes_to_species)

# for each rep
for rep in range(num_reps):
    stepwise_tree.embed_contained_kingman(default_pop_size=40000)
    frag_tree.embed_contained_kingman(default_pop_size=40000)

# write results

# returns dictionary with contained trees as keys
# and number of deep coalescences as values
stepwise_deep_coals = stepwise_tree.deep_coalescences()
stepwise_out = open("stepwise.txt", "w")
for tree in stepwise_deep_coals:
    stepwise_out.write("%d\n" % stepwise_deep_coals[tree])
stepwise_out.close()

# returns dictionary with contained trees as keys
# and number of deep coalescences as values
frag_deep_coals = frag_tree.deep_coalescences()
frag_out = open("frag.txt", "w")
for tree in frag_deep_coals:
    frag_out.write("%d\n" % frag_deep_coals[tree])
frag_out.close()
If you have used some other method to simulate your trees, you can use embed_tree to embed the trees and count
the number of deep coalescences.
#! /usr/bin/env python

import dendropy
from dendropy import treesim
from dendropy import reconcile

# simulation parameters and output
num_reps = 10

# population tree descriptions
stepwise_tree_str = "[&R](A:120000,(B:80000,(C:40000,D:40000):40000):40000):100000"
frag_tree_str = "[&R](A:120000,B:120000,C:120000,D:120000):100000"

# taxa and trees
containing_taxa = dendropy.TaxonSet()
stepwise_tree = dendropy.Tree.get_from_string(
        stepwise_tree_str,
        "newick",
        taxon_set=containing_taxa)
frag_tree = dendropy.Tree.get_from_string(
        frag_tree_str,
        "newick",
        taxon_set=containing_taxa)

# taxon set association
genes_to_species = dendropy.TaxonSetMapping.create_contained_taxon_mapping(
        containing_taxon_set=containing_taxa,
        num_contained=8)

# convert to containing tree
stepwise_tree = reconcile.ContainingTree(stepwise_tree,
        contained_taxon_set=genes_to_species.domain_taxon_set,
        contained_to_containing_taxon_map=genes_to_species)
frag_tree = reconcile.ContainingTree(frag_tree,
        contained_taxon_set=genes_to_species.domain_taxon_set,
        contained_to_containing_taxon_map=genes_to_species)

# for each rep
for rep in range(num_reps):
    gene_tree1 = treesim.contained_coalescent(containing_tree=stepwise_tree,
            gene_to_containing_taxon_map=genes_to_species,
            default_pop_size=40000)
    stepwise_tree.embed_tree(gene_tree1)
    gene_tree2 = treesim.contained_coalescent(containing_tree=frag_tree,
            gene_to_containing_taxon_map=genes_to_species,
            default_pop_size=40000)
    frag_tree.embed_tree(gene_tree2)

# write results

# returns dictionary with contained trees as keys
# and number of deep coalescences as values
stepwise_deep_coals = stepwise_tree.deep_coalescences()
stepwise_out = open("stepwise.txt", "w")
for tree in stepwise_deep_coals:
    stepwise_out.write("%d\n" % stepwise_deep_coals[tree])
stepwise_out.close()

# returns dictionary with contained trees as keys
# and number of deep coalescences as values
frag_deep_coals = frag_tree.deep_coalescences()
frag_out = open("frag.txt", "w")
for tree in frag_deep_coals:
    frag_out.write("%d\n" % frag_deep_coals[tree])
frag_out.close()
For more details on simulating contained coalescent trees and counting numbers of deep coalescences on them, see
“Contained Coalescent Trees” or “Simulating the Distribution of Number Deep Coalescences Under Different Phylogeographic History Scenarios”.
3.3.8 Majority-Rule Consensus Tree from a Collection of Trees
To get the majority-rule consensus tree of a TreeList object, you can call the consensus instance method. You
can specify the frequency threshold for the consensus tree by the min_freq argument, which defaults to 0.5 (i.e., a
50% majority rule tree). The following example aggregates the post-burn-in trees from four MCMC samples into a
single TreeList object, and prints the 95% majority-rule consensus as a Newick string:
#! /usr/bin/env python

import dendropy

trees = dendropy.TreeList()
for tree_file in ['pythonidae.mb.run1.t',
        'pythonidae.mb.run2.t',
        'pythonidae.mb.run3.t',
        'pythonidae.mb.run4.t']:
    trees.read_from_path(
            tree_file,
            'nexus',
            tree_offset=20)
con_tree = trees.consensus(min_freq=0.95)
print(con_tree.as_string('newick'))
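The effect of the min_freq threshold can be illustrated without DendroPy. The following is a minimal pure-Python sketch, not DendroPy's implementation: each tree is reduced to a set of clades (each a frozenset of leaf labels), and only clades whose frequency across the sample meets the threshold are kept. The function name majority_rule_clades and the toy data are hypothetical, and the sketch assumes that a clade occurring in exactly min_freq of the trees is retained.

```python
from collections import Counter


def majority_rule_clades(trees, min_freq=0.5):
    """Return {clade: frequency} for clades found in at least min_freq of the trees.

    Each tree is a set of clades; each clade is a frozenset of leaf labels.
    (Hypothetical helper for illustration only.)
    """
    counts = Counter(clade for tree in trees for clade in tree)
    total = float(len(trees))
    return {clade: n / total for clade, n in counts.items() if n / total >= min_freq}


# Hypothetical toy sample of four trees on the leaves A, B, C, D.
ab, cd, ac = frozenset("AB"), frozenset("CD"), frozenset("AC")
trees = [{ab, cd}, {ab, cd}, {ab}, {ac}]

kept = majority_rule_clades(trees, min_freq=0.75)
print(sorted("".join(sorted(clade)) for clade in kept))
```

Raising min_freq makes the consensus more conservative: in this toy sample the clade CD is supported by half of the trees, so it survives the default threshold of 0.5 but not a threshold of 0.75.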
3.3.9 Frequency of a Split in a Collection of Trees
The frequency_of_split method of a TreeList object returns the frequency of occurrence of a single split
across all the Tree objects in the TreeList. The split can be specified by passing a split bitmask directly using the
split_bitmask keyword argument, as a list of Taxon objects using the taxa keyword argument, or as a list of
taxon labels using the labels keyword argument. The following example shows how to calculate the frequency of a
split defined by two taxa, “Morelia amethistina” and “Morelia tracyae”, from the post-burn-in trees aggregated across
four MCMC samples:
#! /usr/bin/env python

import dendropy

trees = dendropy.TreeList()
for tree_file in ['pythonidae.mb.run1.t',
        'pythonidae.mb.run2.t',
        'pythonidae.mb.run3.t',
        'pythonidae.mb.run4.t']:
    trees.read_from_path(
            tree_file,
            'nexus',
            tree_offset=20)
split_leaves = ['Morelia amethistina', 'Morelia tracyae']
f = trees.frequency_of_split(labels=split_leaves)
print('Frequency of split %s: %s' % (split_leaves, f))
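The idea of identifying a split by a bitmask can be sketched in plain Python. This is an illustration only and makes no claim about DendroPy's internal encoding: here each taxon label is assigned one bit according to its position in a hypothetical label list, the bitmask of a group of taxa is the OR of its members' bits, and the frequency of a split is the fraction of trees whose set of bitmasks contains it. The sketch treats every group as a rooted clade and ignores the fact that, in an unrooted tree, a split and its complement describe the same bipartition.

```python
def bitmask(labels, all_labels):
    """OR together one bit per label; the bit position is the label's index."""
    mask = 0
    for label in labels:
        mask |= 1 << all_labels.index(label)
    return mask


def frequency_of_split(tree_masks, query_mask):
    """Fraction of trees (each a set of clade bitmasks) that contain query_mask."""
    hits = sum(1 for masks in tree_masks if query_mask in masks)
    return hits / float(len(tree_masks))


all_labels = ["A", "B", "C", "D"]  # hypothetical taxon labels

# Hypothetical sample: three trees, each reduced to its non-trivial clades.
tree_masks = [
    {bitmask("AB", all_labels), bitmask("CD", all_labels)},
    {bitmask("AB", all_labels)},
    {bitmask("AC", all_labels)},
]

query = bitmask(["A", "B"], all_labels)
freq = frequency_of_split(tree_masks, query)
print("mask=%s freq=%.3f" % (bin(query), freq))
```

Because a bitmask is a plain integer, looking a split up in a tree is a single set-membership test, which is what makes counting splits across thousands of sampled trees cheap.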
3.3.10 Tree Distances
Native Tree Methods
New in version 3.2.
The Tree class provides methods for calculating distances between two trees:
symmetric_difference This method returns the symmetric distance between two trees. The symmetric distance between two trees is the sum of the number of splits found in one of the trees but not
the other. It is common to see this statistic called the “Robinson-Foulds distance”, but in DendroPy
we reserve this term to apply to the Robinson-Foulds distance in the strict sense, i.e., the weighted
symmetric distance (see below).
false_positives_and_negatives This method returns a tuple pair, with the first element the
number of splits in the current Tree object not found in the Tree object to which it is being
compared, while the second element is the number of splits in the second Tree object that are
not in the current Tree. The sum of these two elements is exactly equal to the value reported by
symmetric_difference.
euclidean_distance This method returns the “branch length distance” of Felsenstein (2004), i.e.,
the square root of the sum of squared differences in branch lengths for equivalent splits between two trees, with the
branch length for a missing split taken to be 0.0.
robinson_foulds_distance This method returns the Robinson-Foulds distance between two
trees, i.e., the sum of absolute differences in branch lengths for equivalent splits between two
trees, with the branch length for a missing split taken to be 0.0.
For example:
>>> import dendropy
>>> s1 = "((t5:0.161175,t6:0.161175):0.392293,((t4:0.104381,(t2:0.075411,t1:0.075411):0.028969):0.065
>>> s2 = "((t5:2.161175,t6:0.161175):0.392293,((t4:0.104381,(t2:0.075411,t1:0.075411):1):0.065840,t3:
>>> tree1 = dendropy.Tree.get_from_string(s1, 'newick')
>>> tree2 = dendropy.Tree.get_from_string(s2, 'newick')
>>> tree1.symmetric_difference(tree2)
0
>>> tree1.false_positives_and_negatives(tree2)
(0, 0)
>>> tree1.euclidean_distance(tree2)
2.2232636377544162
>>> tree1.robinson_foulds_distance(tree2)
2.971031
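The two branch-length results in the session above can be reproduced by hand. In the visible portions of s1 and s2, the trees differ in only two branch lengths: the edge leading to t5 (0.161175 versus 2.161175) and the edge above the (t2, t1) clade (0.028969 versus 1). The following pure-Python sketch, which does not use DendroPy and whose variable and function names are hypothetical, shows that 2.971031 is the sum of the absolute differences and 2.2232636... is the square root of the sum of the squared differences.

```python
import math

# Branch lengths that differ between the two example trees, keyed by the
# leaf set below each branch. All other branches are identical in both
# trees and therefore contribute 0 to either distance.
tree1_lengths = {("t5",): 0.161175, ("t1", "t2"): 0.028969}
tree2_lengths = {("t5",): 2.161175, ("t1", "t2"): 1.0}


def branch_length_differences(a, b):
    """Per-split length differences; a split missing from one tree counts as 0.0."""
    return [a.get(split, 0.0) - b.get(split, 0.0) for split in set(a) | set(b)]


diffs = branch_length_differences(tree1_lengths, tree2_lengths)
sum_of_absolute = sum(abs(d) for d in diffs)
root_of_squares = math.sqrt(sum(d * d for d in diffs))
print("%.6f %.7f" % (sum_of_absolute, root_of_squares))
```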
Using the treecalc Module
The treecalc module provides these operations as independent functions that take two Tree objects as arguments. These independent functions require that both trees have the same TaxonSet reference, otherwise an
exception is raised:
>>> import dendropy
>>> from dendropy import treecalc
>>> s1 = "((t5:0.161175,t6:0.161175):0.392293,((t4:0.104381,(t2:0.075411,t1:0.075411):0.028969):0.065
>>> s2 = "((t5:2.161175,t6:0.161175):0.392293,((t4:0.104381,(t2:0.075411,t1:0.075411):1):0.065840,t3:
>>> tree1 = dendropy.Tree.get_from_string(s1, 'newick')
>>> tree2 = dendropy.Tree.get_from_string(s2, 'newick')
>>> treecalc.symmetric_difference(tree1, tree2)
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "treecalc.py", line 240, in symmetric_difference
t = false_positives_and_negatives(tree1, tree2)
File "treecalc.py", line 254, in false_positives_and_negatives
% (hex(id(reference_tree.taxon_set)), hex(id(test_tree.taxon_set))))
TypeError: Trees have different TaxonSet objects: 0x10111ec00 vs. 0x10111eaa0
>>> treecalc.euclidean_distance(tree1, tree2)
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "treecalc.py", line 236, in euclidean_distance
value_type=value_type)
File "treecalc.py", line 160, in splits_distance
% (hex(id(tree1.taxon_set)), hex(id(tree2.taxon_set))))
TypeError: Trees have different TaxonSet objects: 0x10111ec00 vs. 0x10111eaa0
>>> treecalc.robinson_foulds_distance(tree1, tree2)
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "treecalc.py", line 223, in robinson_foulds_distance
value_type=float)
File "treecalc.py", line 160, in splits_distance
% (hex(id(tree1.taxon_set)), hex(id(tree2.taxon_set))))
TypeError: Trees have different TaxonSet objects: 0x10111ec00 vs. 0x10111eaa0
>>> tree3 = dendropy.Tree.get_from_string(s2, 'newick', taxon_set=tree1.taxon_set)
>>> treecalc.symmetric_difference(tree1, tree3)
0
>>> treecalc.euclidean_distance(tree1, tree3)
2.2232636377544162
>>> treecalc.robinson_foulds_distance(tree1, tree3)
2.971031
3.4 Tree Manipulation and Restructuring
The Tree class provides both low-level and high-level methods for manipulating tree structure.
Note: In versions of DendroPy prior to 3.8.0, some of the functionality described here was available as standalone
functions in the treemanip module. With version 3.8.0, this functionality has been refactored into native instance
methods of the Tree class. The functions are still available in the treemanip module, but these will soon be
deprecated. All new code carrying out any of the operations described below should be written using native Tree
methods, rather than the standalone functions in the treemanip module.
Low-level methods are associated with Node objects, and allow you to restructure the relationships between nodes at a
fine level: add_child, new_child, remove_child, etc.
In most cases, however, you will be using high-level methods to restructure Tree objects.
In all cases, if any part of the Tree object’s structural relations change, and you are interested in calculating any
metrics or statistics on the tree or comparing the tree to another tree, you need to call update_splits on the object
to update the internal splits hash representation. This is not done for you automatically because there is a computational
cost associated with the operation, and the splits hashes are not always needed. Furthermore, even when needed, if
there are a number of structural changes to be made to a Tree object before calculations/comparisons, it makes sense
to postpone the splits rehashing until all the tree manipulations are completed. Most methods that affect the tree
structure and require the splits hashes to be updated take an update_splits argument. By specifying True for this, the
Tree object will recalculate the splits hashes after the changes have been made.
3.4.1 Rooting, Derooting and Rerooting
Setting the Rooting State
All Tree objects have a boolean property, is_rooted that DendroPy uses to track whether or not the tree should
be treated as rooted. The property is_unrooted is also defined, and these two properties are synchronized. Thus
setting is_rooted to True will result in is_unrooted being set to False and vice versa.
The state of a Tree object’s rootedness flag does not modify any internal structural relationship between nodes.
It simply determines how its splits hashes are calculated, which in turn affects a broad range of comparison and
metric operations. Thus you need to update the splits hashes after modifying the is_rooted property by calling update_splits before carrying out any calculations on or with the Tree object. Note that calling
update_splits on an unrooted tree will force the basal split to be a trifurcation. So if the original tree was
bifurcating, the end result will be a tree with a trifurcation at the root. This can be prevented by passing in the keyword
argument delete_outdegree_one=False to update_splits.
#! /usr/bin/env python

import dendropy

tree_str = "[&R] (A, (B, (C, (D, E))));"

tree = dendropy.Tree.get_from_string(
        tree_str,
        "newick")

print("Original:")
print(tree.as_ascii_plot())

tree.is_rooted = False
print("After `is_rooted=False`:")
print(tree.as_ascii_plot())

tree.update_splits()
print("After `update_splits()`:")
print(tree.as_ascii_plot())

tree2 = dendropy.Tree.get_from_string(
        tree_str,
        "newick")
tree2.is_rooted = False
tree2.update_splits(delete_outdegree_one=False)
print("After `update_splits(delete_outdegree_one=False)`:")
print(tree2.as_ascii_plot())
will result in:
Original:
/---------------------------------------------------- A
+
|            /--------------------------------------- B
\------------+
             |            /-------------------------- C
             \------------+
                          |            /------------- D
                          \------------+
                                       \------------- E

After `is_rooted=False`:
/---------------------------------------------------- A
+
|            /--------------------------------------- B
\------------+
             |            /-------------------------- C
             \------------+
                          |            /------------- D
                          \------------+
                                       \------------- E

After `update_splits()`:
/---------------------------------------------------- A
|
+---------------------------------------------------- B
|
|                /----------------------------------- C
\----------------+
                 |                 /----------------- D
                 \-----------------+
                                   \----------------- E

After `update_splits(delete_outdegree_one=False)`:
/---------------------------------------------------- A
+
|            /--------------------------------------- B
\------------+
             |            /-------------------------- C
             \------------+
                          |            /------------- D
                          \------------+
                                       \------------- E
Derooting
To deroot a rooted Tree, you can also call the deroot method, which collapses the root to a trifurcation if it is a
bifurcation and sets is_rooted to False. The deroot method has the same structural and semantic effect as setting
is_rooted to False and then calling update_splits. You would use the former if you are not going to be doing
any tree comparisons or calculating tree metrics, and thus do not want to calculate the splits hashes.
Rerooting
To reroot a Tree along an existing edge, you can use the reroot_at_edge method. This method takes an Edge
object as its first argument. This rerooting is a structural change that will require the splits hashes to be updated
before performing any tree comparisons or calculating tree metrics. If needed, you can do this yourself by calling
update_splits later, or you can pass in True as the second argument to the reroot_at_edge method call,
which instructs DendroPy to automatically update the splits for you.
As an example, the following reroots the tree along an internal edge (note that we do not recalculate the splits hashes,
as we are not carrying out any calculations or comparisons with the Tree):
#! /usr/bin/env python

import dendropy

tree_str = "[&R] (A, (B, (C, (D, E))));"

tree = dendropy.Tree.get_from_string(
        tree_str,
        "newick")

print("Before:")
print(tree.as_string('newick'))
print(tree.as_ascii_plot())
mrca = tree.mrca(taxon_labels=["D", "E"])
tree.reroot_at_edge(mrca.edge, update_splits=False)
print("After:")
print(tree.as_string('newick'))
print(tree.as_ascii_plot())
and results in:
Before:
[&R] (A,(B,(C,(D,E))));
/---------------------------------------------------- A
+
|            /--------------------------------------- B
\------------+
             |            /-------------------------- C
             \------------+
                          |            /------------- D
                          \------------+
                                       \------------- E

After:
[&R] ((D,E),(C,(B,A)));
                                   /----------------- D
/----------------------------------+
|                                  \----------------- E
+
|                /----------------------------------- C
\----------------+
                 |                 /----------------- B
                 \-----------------+
                                   \----------------- A
Another example, this time rerooting along an edge subtending a tip instead of an internal edge:
#! /usr/bin/env python

import dendropy

tree_str = "[&R] (A, (B, (C, (D, E))));"

tree = dendropy.Tree.get_from_string(
        tree_str,
        "newick")

print("Before:")
print(tree.as_string('newick'))
print(tree.as_ascii_plot())
node_D = tree.find_node_with_taxon_label("D")
tree.reroot_at_edge(node_D.edge, update_splits=False)
print("After:")
print(tree.as_string('newick'))
print(tree.as_ascii_plot())
which results in:
Before:
[&R] (A,(B,(C,(D,E))));
/---------------------------------------------------- A
+
|            /--------------------------------------- B
\------------+
             |            /-------------------------- C
             \------------+
                          |            /------------- D
                          \------------+
                                       \------------- E

After:
[&R] (D,(E,(C,(B,A))));
/---------------------------------------------------- D
+
|            /--------------------------------------- E
\------------+
             |            /-------------------------- C
             \------------+
                          |            /------------- B
                          \------------+
                                       \------------- A
To reroot a Tree at a node instead, you can use the reroot_at_node method:
#! /usr/bin/env python

import dendropy

tree_str = "[&R] (A, (B, (C, (D, E))));"

tree = dendropy.Tree.get_from_string(
        tree_str,
        "newick")

print("Before:")
print(tree.as_string('newick'))
print(tree.as_ascii_plot())
mrca = tree.mrca(taxon_labels=["D", "E"])
tree.reroot_at_node(mrca, update_splits=False)
print("After:")
print(tree.as_string('newick'))
print(tree.as_ascii_plot())
which results in:
Before:
[&R] (A,(B,(C,(D,E))));
/---------------------------------------------------- A
+
|            /--------------------------------------- B
\------------+
             |            /-------------------------- C
             \------------+
                          |            /------------- D
                          \------------+
                                       \------------- E

After:
[&R] (D,E,(C,(B,A)));
/---------------------------------------------------- D
|
+---------------------------------------------------- E
|
|                /----------------------------------- C
\----------------+
                 |                 /----------------- B
                 \-----------------+
                                   \----------------- A
You can also reroot the tree such that a particular node is moved to the outgroup position using the
to_outgroup_position method, which takes a Node as the first argument. Again, you can update the splits hashes
in situ by passing True as the second argument, and again, here we do not because we are not carrying out any calculations. For example:
#! /usr/bin/env python

import dendropy

tree_str = "[&R] (A, (B, (C, (D, E))));"

tree = dendropy.Tree.get_from_string(
        tree_str,
        "newick")

print("Before:")
print(tree.as_string('newick'))
print(tree.as_ascii_plot())
outgroup_node = tree.find_node_with_taxon_label("C")
tree.to_outgroup_position(outgroup_node, update_splits=False)
print("After:")
print(tree.as_string('newick'))
print(tree.as_ascii_plot())
which will result in:
Before:
[&R] (A,(B,(C,(D,E))));
/---------------------------------------------------- A
+
|            /--------------------------------------- B
\------------+
             |            /-------------------------- C
             \------------+
                          |            /------------- D
                          \------------+
                                       \------------- E

After:
[&R] (C,(D,E),(B,A));
/---------------------------------------------------- C
|
|                         /-------------------------- D
+-------------------------+
|                         \-------------------------- E
|
|                         /-------------------------- B
\-------------------------+
                          \-------------------------- A
If you have a tree with edge lengths specified, you can reroot it at the midpoint, using the reroot_at_midpoint
method:
#! /usr/bin/env python

import dendropy

tree_str = "[&R] (A:0.55, (B:0.82, (C:0.74, (D:0.42, E:0.64):0.24):0.15):0.20):0.3;"

tree = dendropy.Tree.get_from_string(
        tree_str,
        "newick")

print("Before:")
print(tree.as_string('newick'))
print(tree.as_ascii_plot(plot_metric='length'))
tree.reroot_at_midpoint(update_splits=False)
print("After:")
print(tree.as_string('newick'))
print(tree.as_ascii_plot(plot_metric='length'))
which results in:
Before:
[&R] (A:0.55,(B:0.82,(C:0.74,(D:0.42,E:0.64):0.24):0.15):0.2):0.3;
/------------------- A
+
|      /---------------------------- B
\------+
       |    /-------------------------- C
       \----+
            |        /-------------- D
            \--------+
                     \---------------------- E

After:
[&R] ((C:0.74,(D:0.42,E:0.64):0.24):0.045,(B:0.82,A:0.75):0.105):0.3;
  /------------------------------- C
/-+
| |         /------------------ D
| \---------+
+           \---------------------------- E
|
|   /------------------------------------ B
\---+
    \-------------------------------- A
3.4.2 Pruning Subtrees and Tips
To remove a set of tips from a Tree, you can use either the prune_taxa or the prune_taxa_with_labels
methods. The first takes a container of Taxon objects as an argument, while the second takes a container of strings.
In both cases, nodes associated with the specified taxa (as given by the Taxon objects directly in the first case,
or Taxon objects with labels given in the list of strings in the second case) will be removed from the tree. For
example:
#! /usr/bin/env python

import dendropy

tree_str = "[&R] ((A, (B, (C, (D, E)))),(F, (G, H)));"

tree = dendropy.Tree.get_from_string(
        tree_str,
        "newick")

print("Before:")
print(tree.as_string('newick'))
print(tree.as_ascii_plot())
tree.prune_taxa_with_labels(["A", "C", "G"])
print("After:")
print(tree.as_string('newick'))
print(tree.as_ascii_plot())
which results in:
Before:
[&R] ((A,(B,(C,(D,E)))),(F,(G,H)));
          /------------------------------------------- A
/---------+
|         |          /-------------------------------- B
|         \----------+
|                    |          /--------------------- C
|                    \----------+
+                               |          /---------- D
|                               \----------+
|                                          \---------- E
|
|                               /--------------------- F
\-------------------------------+
                                |          /---------- G
                                \----------+
                                           \---------- H

After:
[&R] ((B,(D,E)),(F,H));
                  /----------------------------------- B
/-----------------+
|                 |                 /----------------- D
|                 \-----------------+
+                                   \----------------- E
|
|                                   /----------------- F
\-----------------------------------+
                                    \----------------- H
Alternatively, the tree can be pruned based on a set of taxa that you want to keep. This can be done through the use
of the counterpart “retain” methods, retain_taxa and retain_taxa_with_labels. For example:
#! /usr/bin/env python

import dendropy

tree_str = "[&R] ((A, (B, (C, (D, E)))),(F, (G, H)));"

tree = dendropy.Tree.get_from_string(
        tree_str,
        "newick")

print("Before:")
print(tree.as_string('newick'))
print(tree.as_ascii_plot())
tree.retain_taxa_with_labels(["A", "C", "G"])
print("After:")
print(tree.as_string('newick'))
print(tree.as_ascii_plot())
which results in:
Before:
[&R] ((A,(B,(C,(D,E)))),(F,(G,H)));
          /------------------------------------------- A
/---------+
|         |          /-------------------------------- B
|         \----------+
|                    |          /--------------------- C
|                    \----------+
+                               |          /---------- D
|                               \----------+
|                                          \---------- E
|
|                               /--------------------- F
\-------------------------------+
                                |          /---------- G
                                \----------+
                                           \---------- H

After:
[&R] ((A,C),G);
                           /-------------------------- A
/--------------------------+
+                          \-------------------------- C
|
\----------------------------------------------------- G
Again, it should be noted that, as these operations modify the structure of the tree, you need to call update_splits
to update the internal splits hashes, before carrying out any calculations, comparisons, or metrics.
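The structural effect of pruning and retaining can be sketched on a toy representation. The following pure-Python example does not use DendroPy and is not its implementation: a tree is a nested tuple with strings as leaves, and the helper names prune, retain, leaves and newick are hypothetical. Removing a leaf may leave an internal node with a single child; the sketch collapses such nodes, which is how the printed results above arrive at ((B,(D,E)),(F,H)) and ((A,C),G).

```python
def prune(tree, drop):
    """Return a copy of a nested-tuple tree without the leaves in `drop`.

    Leaves are strings and internal nodes are tuples of children. Internal
    nodes left with a single child are collapsed; None means nothing is left.
    """
    if isinstance(tree, str):
        return None if tree in drop else tree
    kept = [c for c in (prune(child, drop) for child in tree) if c is not None]
    if not kept:
        return None
    return kept[0] if len(kept) == 1 else tuple(kept)


def leaves(tree):
    """List the leaf labels of a nested-tuple tree, left to right."""
    return [tree] if isinstance(tree, str) else [x for c in tree for x in leaves(c)]


def retain(tree, keep):
    """Prune every leaf that is not in `keep`."""
    return prune(tree, set(leaves(tree)) - set(keep))


def newick(tree):
    """Render a nested-tuple tree as a Newick-style string without the semicolon."""
    return tree if isinstance(tree, str) else "(" + ",".join(newick(c) for c in tree) + ")"


tree = (("A", ("B", ("C", ("D", "E")))), ("F", ("G", "H")))
pruned = newick(prune(tree, {"A", "C", "G"}))
retained = newick(retain(tree, {"A", "C", "G"}))
print(pruned)
print(retained)
```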
3.4.3 Rotating
You can ladderize trees (sort the child nodes in order of the number of their children) by calling the ladderize
method. This method takes one argument, ascending. If ascending=True, which is the default, then the
nodes are sorted in ascending order (i.e., nodes with fewer children sort before nodes with more children). If
ascending=False, then the nodes are sorted in descending order (i.e., nodes with more children sorting before
nodes with fewer children). For example:
#! /usr/bin/env python

import dendropy

tree_str = "[&R] ((A, (B, (C, (D, E)))),(F, (G, H)));"

tree = dendropy.Tree.get_from_string(
        tree_str,
        "newick")

print("Before:")
print(tree.as_string('newick'))
print(tree.as_ascii_plot())
tree.ladderize(ascending=True)
print("Ladderize, ascending=True:")
print(tree.as_string('newick'))
print(tree.as_ascii_plot())
tree.ladderize(ascending=False)
print("Ladderize, ascending=False:")
print(tree.as_string('newick'))
print(tree.as_ascii_plot())
results in:
Before:
[&R] ((A,(B,(C,(D,E)))),(F,(G,H)));
          /------------------------------------------- A
/---------+
|         |          /-------------------------------- B
|         \----------+
|                    |          /--------------------- C
|                    \----------+
+                               |          /---------- D
|                               \----------+
|                                          \---------- E
|
|                               /--------------------- F
\-------------------------------+
                                |          /---------- G
                                \----------+
                                           \---------- H

Ladderize, ascending=True:
[&R] ((F,(G,H)),(A,(B,(C,(D,E)))));
                                /--------------------- F
/-------------------------------+
|                               |          /---------- G
|                               \----------+
+                                          \---------- H
|
|         /------------------------------------------- A
\---------+
          |          /-------------------------------- B
          \----------+
                     |          /--------------------- C
                     \----------+
                                |          /---------- D
                                \----------+
                                           \---------- E

Ladderize, ascending=False:
[&R] (((((D,E),C),B),A),((G,H),F));
                                           /---------- D
                                /----------+
                     /----------+          \---------- E
                     |          |
          /----------+          \--------------------- C
          |          |
/---------+          \-------------------------------- B
|         |
|         \------------------------------------------- A
+
|                                          /---------- G
|                               /----------+
\-------------------------------+          \---------- H
                                |
                                \--------------------- F
Tree rotation operations do not actually change the tree structure, at least in so far as splits are concerned, so it is not
necessary to update the splits hashes.
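Both points, the reordering itself and the fact that it leaves the splits untouched, can be checked on a toy representation. The sketch below is pure Python rather than DendroPy code: trees are nested tuples with strings as leaves, all function names are hypothetical, and child nodes are ordered by the number of leaves below them, which is an assumption about the sort key made for this illustration.

```python
def count_leaves(tree):
    """Number of leaves below a node of a nested-tuple tree."""
    return 1 if isinstance(tree, str) else sum(count_leaves(c) for c in tree)


def ladderize(tree, ascending=True):
    """Return a copy with children sorted by the number of leaves below them."""
    if isinstance(tree, str):
        return tree
    children = [ladderize(c, ascending) for c in tree]
    # sorted() is stable, so ties keep their original left-to-right order.
    return tuple(sorted(children, key=count_leaves, reverse=not ascending))


def clades(tree, acc=None):
    """Return (leaf set below this node, set of leaf sets of all internal nodes)."""
    acc = set() if acc is None else acc
    if isinstance(tree, str):
        return frozenset([tree]), acc
    below = frozenset().union(*[clades(c, acc)[0] for c in tree])
    acc.add(below)
    return below, acc


def newick(tree):
    """Render a nested-tuple tree as a Newick-style string without the semicolon."""
    return tree if isinstance(tree, str) else "(" + ",".join(newick(c) for c in tree) + ")"


tree = (("A", ("B", ("C", ("D", "E")))), ("F", ("G", "H")))
asc = ladderize(tree, ascending=True)
desc = ladderize(tree, ascending=False)
print(newick(asc))
print(newick(desc))
print(clades(tree)[1] == clades(asc)[1] == clades(desc)[1])
```

The last line compares the sets of clades before and after rotation; since only the order of children changes, the sets are identical.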
3.5 Tree Simulation and Generation
The treesim module provides functions for the simulation of trees under a variety of theoretical models.
3.5.1 Birth-Death Process Trees
There are two different birth-death process tree simulation routines in DendroPy:
birth_death Returns a tree generated under a continuous-time birth-death process, with branch
lengths in arbitrary time units.
discrete_birth_death Returns a tree generated under a discrete-time birth-death process, with
branch lengths in generation units.
Both of these functions have identical interfaces, and will grow a tree under a branching process with the specified
birth rate and death rate until the termination condition (pre-specified number of leaves or maximum amount of time)
is met.
For example, to get a continuous-time tree with 10 leaves, generated under a birth rate of 1.0 and death rate of 0.5:
>>> from dendropy import treesim
>>> t = treesim.birth_death(birth_rate=1.0, death_rate=0.5, ntax=10)
>>> t.print_plot()
              /-------------------------------------------- T1
              |
/-------------+                             /-------------- T2
|             |              /--------------+
|             \--------------+              \-------------- T3
|                            |
|                            \----------------------------- T4
+
|                            /----------------------------- T5
|             /--------------+
|             |              |              /-------------- T6
|             |              \--------------+
\-------------+                             \-------------- T7
              |
              |                             /-------------- T8
              |              /--------------+
              \--------------+              \-------------- T9
                             |
                             \----------------------------- T10
While to get a continuous time tree generated under the same rates after 6 time units:
>>> t = treesim.birth_death(birth_rate=1.0, death_rate=0.5, max_time=6.0)
If both conditions are given simultaneously, then tree growth will terminate when any of the termination conditions
(i.e., number of tips == ntax, or number of tips == len(taxon_set) or maximum time == max_time) are met.
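How the two termination conditions interact can be sketched with a stripped-down simulation. The following pure-Python example is not DendroPy's treesim code: it tracks only the number of living lineages rather than building a tree, the function name simulate_lineages is hypothetical, and it assumes the standard continuous-time formulation in which, with n lineages, the waiting time to the next event is exponential with rate n * (birth_rate + death_rate).

```python
import random


def simulate_lineages(birth_rate, death_rate, ntax=None, max_time=None, rng=random):
    """Follow the lineage count of a continuous-time birth-death process.

    Stops when the count reaches ntax, when max_time elapses, or when every
    lineage has gone extinct. Returns (lineage_count, elapsed_time).
    """
    if ntax is None and max_time is None:
        raise ValueError("give ntax, max_time, or both")
    n, t = 1, 0.0
    while n > 0 and (ntax is None or n < ntax):
        # Waiting time to the next event across all n lineages.
        wait = rng.expovariate(n * (birth_rate + death_rate))
        if max_time is not None and t + wait > max_time:
            return n, max_time  # the time limit is hit first
        t += wait
        # The event is a birth with probability b / (b + d), otherwise a death.
        if rng.random() < birth_rate / (birth_rate + death_rate):
            n += 1
        else:
            n -= 1
    return n, t


rng = random.Random(1)
n_tips, elapsed = simulate_lineages(1.0, 0.0, ntax=10, rng=rng)
n_timed, elapsed_timed = simulate_lineages(1.0, 0.5, ntax=10, max_time=6.0, rng=rng)
print(n_tips, elapsed)
print(n_timed, elapsed_timed)
```

In the second call either condition may end the run: the count may reach 10 before 6 time units have passed, the clock may run out first, or the process may go extinct (a count of 0).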
Specifying a TaxonSet
By default, a new Taxon object will be created and associated with each leaf (labeled “T1”, “T2”, etc.), all belonging
to a new TaxonSet object associated with the resulting tree.
You can pass in an explicit TaxonSet object using the “taxon_set” keyword:
>>> import dendropy
>>> from dendropy import treesim
>>> taxa = dendropy.TaxonSet(['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h'])
>>> t = treesim.birth_death(0.4, 0.1, taxon_set=taxa)
>>> t.print_plot()
            /-------------------------------------- h
            |
/-----------+                         /------------ c
|           |            /------------+
|           \------------+            \------------ a
|                        |
+                        \------------------------- g
|
|                                     /------------ e
|                        /------------+
|                        |            \------------ f
\------------------------+
                         |            /------------ d
                         \------------+
                                      \------------ b
In this case, the branching process underlying the tree generation will terminate when the number of leaves in the tree
equals the number of taxa in the TaxonSet “taxa”, and the Taxon objects in “taxa” will be randomly assigned
to the leaves.
The “taxon_set” keyword can be combined with the “ntax” keyword. If the size of the TaxonSet object given
by the taxon_set argument is greater than the specified target tree taxon number, then a random subset of Taxon
objects in the TaxonSet will be assigned to the leaves:
>>> import dendropy
>>> from dendropy import treesim
>>> taxa = dendropy.TaxonSet(['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h'])
>>> t = treesim.birth_death(birth_rate=1.0, death_rate=0.5, ntax=5, taxon_set=taxa)
>>> t.print_plot()
/-------------------------------------------------- g
|
+                        /------------------------- a
|           /------------+
|           |            |            /------------ d
\-----------+            \------------+
            |                         \------------ c
            |
            \-------------------------------------- f
If the size of the TaxonSet object is less than the target taxon number, then new Taxon objects will be created as
needed and added to the TaxonSet object as well as associated with the leaves:
>>> import dendropy
>>> from dendropy import treesim
>>> taxa = dendropy.TaxonSet(['a', 'b'])
>>> t = treesim.birth_death(birth_rate=1.0, death_rate=0.5, ntax=5, taxon_set=taxa)
>>> t.print_plot()
                                 /---------------- a
/--------------------------------+
|                                \---------------- b
+
|               /--------------------------------- T3
\---------------+
                |                /---------------- T4
                \----------------+
                                 \---------------- T5
Repeating Failed Branching Processes
With a non-zero death rate, it is possible for all lineages of a tree to go extinct before the termination conditions are
reached. In this case, by default a TreeSimTotalExtinctionException will be raised:
>>> t = treesim.birth_death(birth_rate=1.0, death_rate=0.9, ntax=10)
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/Users/jeet/Projects/DendroPy/dendropy/treesim.py", line 188, in birth_death
raise TreeSimTotalExtinctionException()
dendropy.treesim.TreeSimTotalExtinctionException
If the keyword argument “repeat_until_success” is given, then instead of raising an exception the process
starts again and repeats until the termination condition is met:
>>> t = treesim.birth_death(birth_rate=1.0,
...                         death_rate=0.9,
...                         ntax=10,
...                         repeat_until_success=True)
>>> t.print_plot()
                                       /------------------- T1
/--------------------------------------+
|                                      |         /--------- T2
|                                      \---------+
|                                                \--------- T3
+
|                                                /--------- T4
|        /---------------------------------------+
|        |                                       \--------- T5
|        |
\--------+                   /----------------------------- T6
         |         /---------+
         |         |         |         /------------------- T7
         |         |         \---------+
         \---------+                   |         /--------- T8
                   |                   \---------+
                   |                             \--------- T9
                   |
                   \--------------------------------------- T10
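The retry behaviour can be sketched as a simple loop around a single simulation attempt. The example below is pure Python and is not how treesim is implemented: the exception class TotalExtinction and the functions grow and grow_until_success are hypothetical stand-ins, and a single attempt follows only the sequence of birth and death events (ignoring waiting times) until the lineage count reaches the target or drops to zero.

```python
import random


class TotalExtinction(Exception):
    """Raised when every lineage dies before the target size is reached."""


def grow(birth_rate, death_rate, ntax, rng):
    """One attempt: follow the lineage count until it hits ntax or zero."""
    n = 1
    p_birth = birth_rate / (birth_rate + death_rate)
    while 0 < n < ntax:
        n += 1 if rng.random() < p_birth else -1
    if n == 0:
        raise TotalExtinction()
    return n


def grow_until_success(birth_rate, death_rate, ntax, rng):
    """Retry grow() until it succeeds; return (lineage_count, failed_attempts)."""
    failures = 0
    while True:
        try:
            return grow(birth_rate, death_rate, ntax, rng), failures
        except TotalExtinction:
            failures += 1


rng = random.Random(42)
n_lineages, failed_attempts = grow_until_success(1.0, 0.9, ntax=10, rng=rng)
print("reached %d lineages after %d failed attempts" % (n_lineages, failed_attempts))
```

With a death rate close to the birth rate, most attempts end in extinction, so a retry loop of this kind can take many iterations before it returns.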
Suppressing Taxon Assignment
You can specify “assign_taxa” to be False to prevent taxa from being automatically assigned to a tree (for example, when you want to build a tree in stages – see below).
Extending an Existing Tree
Both these functions also accept a Tree object (with valid branch lengths) as an argument passed using the keyword
tree. If given, then this tree will be used as the starting point; otherwise a new one will be created.
Evolving Birth and Death Rates
The same functions can also produce trees generated under variable birth and death rates. The “birth_rate_sd”
keyword argument specifies the standard deviation of the normally-distributed error of birth rates as they evolve from
parent to child node, while the “death_rate_sd” keyword argument specifies the same for the death rates. For
example, to get a 10-taxon tree generated under a birth- and death-rate that evolves with a standard deviation of 0.1:
>>> t = treesim.birth_death(birth_rate=1.0,
...                         death_rate=0.5,
...                         birth_rate_sd=0.1,
...                         death_rate_sd=0.1,
...                         ntax=10)
Building a Tree in Multiple Stages under Different Conditions
You might want to generate a tree under different conditions in different stages. To do this, you would start with an
empty tree and pass it to the birth-death function as an argument using the “tree” keyword argument, and at
the same time suppress the automatic taxon assignment using the “assign_taxa=False” keyword argument to
avoid taxa being assigned to what will eventually become internal nodes. When the tree is ready, you will call the
randomly_assign_taxa function to assign taxa at random to the leaves.
For example, the following generates a birth-death tree with equal birth and death rates, but both rates shifting for a
short while to temporarily higher (though equal) rates:
#! /usr/bin/env python

import random
import dendropy
from dendropy import treesim

def generate(birth_rates, death_rates):
    assert len(birth_rates) == len(death_rates)
    tree = dendropy.Tree()
    for i, br in enumerate(birth_rates):
        tree = treesim.birth_death(birth_rates[i],
                                   death_rates[i],
                                   max_time=random.randint(1,8),
                                   tree=tree,
                                   assign_taxa=False,
                                   repeat_until_success=True)
        print(tree.as_string('newick'))
    tree.randomly_assign_taxa(create_required_taxa=True)
    return tree

tree = generate([0.1, 0.6, 0.1], [0.1, 0.6, 0.1])
print(tree.as_string('newick'))
Another example draws birth and death rates from a normal distribution with the same mean and standard deviation in
multiple stages:
#! /usr/bin/env python

import random
import dendropy
from dendropy import treesim

def generate(mean, sd, num_periods):
    tree = dendropy.Tree()
    for i in range(num_periods):
        tree = treesim.birth_death(birth_rate=random.gauss(mean, sd),
                                   death_rate=random.gauss(mean, sd),
                                   max_time=random.randint(1,5),
                                   tree=tree,
                                   assign_taxa=False,
                                   repeat_until_success=True)
    tree.randomly_assign_taxa(create_required_taxa=True)
    return tree

tree = generate(0.1, 0.01, 100)
print(tree.as_string('newick'))
3.5.2 Star Trees
The star_tree function generates a simple polytomy tree, with a single node as the immediate ancestor to a set of leaves,
with one leaf per Taxon in the TaxonSet object given by the taxon_set argument. For example:
>>> import dendropy
>>> from dendropy import treesim
>>> taxa = dendropy.TaxonSet(['a', 'b', 'c', 'd', 'e'])
>>> tree = treesim.star_tree(taxa)
>>> print(tree.as_ascii_plot())
/-------------------------------------- a
|-------------------------------------- b
+-------------------------------------- c
|-------------------------------------- d
\-------------------------------------- e
3.5.3 Population Genetic Trees
Coming soon: pop_gen_tree.
3.5.4 (Pure Neutral) Coalescent Trees
The pure_kingman function returns a tree generated under an unconstrained neutral coalescent model. The first
argument to this function, taxon_set, is a TaxonSet object, where each member Taxon object represents a gene
to be coalesced. The second argument, pop_size, specifies the population size in terms of the number of gene
copies in the population. This means that for a diploid population of size N, pop_size should be N*2, while for a
haploid population of size N, pop_size should be N. If pop_size is None, 1, or 0, then the edge lengths of the
returned gene tree will be in population units (i.e., 1 unit of edge length equals 2N generations for a diploid population
or N generations for a haploid population). Otherwise, the edge lengths will be in generation units. For example:
#! /usr/bin/env python
import dendropy
from dendropy import treesim
taxa = dendropy.TaxonSet(["z1", "z2", "z3", "z4", "z5", "z6", "z7", "z8"])
tree = treesim.pure_kingman(taxon_set=taxa,
                            pop_size=10000)
print(tree.as_string("newick"))
print(tree.as_ascii_plot())
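The effect of pop_size on edge-length units can be illustrated without DendroPy. The following is a minimal sketch of the waiting times between coalescent events under the standard Kingman model, assuming that with k lineages the waiting time is exponential with rate k(k-1)/2 in population units and that multiplying by pop_size rescales to generations. The function name and its arguments are hypothetical and are not part of treesim.

```python
import random

def kingman_coalescence_times(num_genes, pop_size=None, rng=None):
    """Return the waiting times between successive coalescent events.

    With k lineages remaining, the waiting time is exponential with rate
    k*(k-1)/2 in population units. Multiplying by pop_size (the number of
    gene copies) rescales the times to generations.
    """
    rng = rng or random.Random()
    # None, 0 and 1 all leave the times in population units
    scale = pop_size if pop_size not in (None, 0, 1) else 1
    times = []
    for k in range(num_genes, 1, -1):
        rate = k * (k - 1) / 2.0
        times.append(rng.expovariate(rate) * scale)
    return times

# Same random draws, reported in the two different units
in_pop_units = kingman_coalescence_times(8, rng=random.Random(42))
in_generations = kingman_coalescence_times(8, pop_size=10000,
                                           rng=random.Random(42))
```

Eight genes yield seven waiting times, and with the same seed each time in generations is 10000 times the corresponding time in population units.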
3.5.5 Contained Coalescent Trees
The contained_coalescent function returns a tree generated under a neutral coalescent model conditioned on
population splitting times or events given by a containing species or population tree. Such a tree is often referred to
as a contained, embedded, censored, truncated, or constrained genealogy/tree. At a minimum, this function takes two
arguments: a Tree object representing the containing (species or population) tree, and a TaxonSetMapping object
describing how the sampled gene taxa map or are associated with the species/population Tree taxa.
The Tree object representing the containing species or population tree should be rooted and ultrametric. If
edge lengths are given in generations, then a meaningful population size needs to be communicated to the
contained_coalescent function. In general, for coalescent operations in DendroPy, unless otherwise specified, population sizes are the haploid population size, i.e., the number of genes in the population. This is 2N for
a diploid population with N individuals, or N for a haploid population of N individuals. If edge lengths are given in
population units (e.g., N), then the appropriate population size to use is 1.
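As a small worked illustration of this convention, the sketch below computes the population size as a number of gene copies and rescales an edge length from generations to population units. Both helper names are hypothetical, and the figures are chosen only for the arithmetic.

```python
def gene_copies(num_individuals, ploidy):
    """Number of gene copies in the population (the pop_size convention)."""
    return num_individuals * ploidy

def to_population_units(edge_length_in_generations, pop_size):
    """Rescale an edge length from generations to population units."""
    return edge_length_in_generations / float(pop_size)

diploid = gene_copies(20000, ploidy=2)   # 20000 diploid individuals
haploid = gene_copies(40000, ploidy=1)   # 40000 haploid individuals

# An edge of 40000 generations is one population unit long in either case
scaled = to_population_units(40000, diploid)
```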
If the population size is fixed throughout the containing species/population tree, then simply passing in the appropriate
value using the default_pop_size argument to the contained_coalescent function is sufficient. If, on the
other hand, the population size varies, then a special attribute must be added to each edge, “pop_size”, that specifies
the population size for that edge. For example:
tree = dendropy.Tree.get_from_path("sp.tre", "newick")
for edge in tree.postorder_edge_iter():
    edge.pop_size = 100000
The easiest way to get a TaxonSetMapping object is to call the special factory function
create_contained_taxon_mapping. This will create a new TaxonSet to manage the gene taxa, and
create the associations between the gene taxa and the containing tree taxa for you. It takes two arguments: the
TaxonSet of the containing tree, and the number of genes you want sampled from each species.
The following example shows how to create a TaxonSetMapping using
create_contained_taxon_mapping, and then calls contained_coalescent to produce a contained
coalescent tree:
#! /usr/bin/env python

import dendropy
from dendropy import treesim

sp_tree_str = """\
[&R] (A:10,(B:6,(C:4,(D:2,E:2):2):2):4)
"""
sp_tree = dendropy.Tree.get_from_string(sp_tree_str, "newick")
gene_to_species_map = dendropy.TaxonSetMapping.create_contained_taxon_mapping(
        containing_taxon_set=sp_tree.taxon_set,
        num_contained=3)
gene_tree = treesim.contained_coalescent(containing_tree=sp_tree,
        gene_to_containing_taxon_map=gene_to_species_map)
print(gene_tree.as_string('newick'))
print(gene_tree.as_ascii_plot())
In the above example, the branch lengths were in haploid population units, so we did not specify a population size. If
the gene-species associations are more complex, e.g., different numbers of genes per species, we can pass in a list of
values as the second argument to create_contained_taxon_mapping:
#! /usr/bin/env python
import dendropy
from dendropy import treesim
sp_tree_str = """\
[&R] (A:10,(B:6,(C:4,(D:2,E:2):2):2):4)
"""
sp_tree = dendropy.Tree.get_from_string(sp_tree_str, "newick")
gene_to_species_map = dendropy.TaxonSetMapping.create_contained_taxon_mapping(
        containing_taxon_set=sp_tree.taxon_set,
        num_contained=[3, 4, 6, 2, 2,])
gene_tree = treesim.contained_coalescent(containing_tree=sp_tree,
        gene_to_containing_taxon_map=gene_to_species_map)
print(gene_tree.as_string('newick'))
print(gene_tree.as_ascii_plot())
This approach should be used with caution if we cannot be certain of the order of taxa (as is the case with data
read in Newick formats). In these cases, and in more complex cases, we might need to directly instantiate the
TaxonSetMapping object. The API to describe the associations when constructing this object is very similar
to that of the TaxonSetPartition object: you can use a function, attribute or dictionary.
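The reason for the caution about taxon order can be shown with plain dictionaries. The sketch below is not the TaxonSetMapping API: the helper names and the "species_number" gene labels are hypothetical. It pairs a list of counts with species by position, and then does the same from a dictionary keyed by species label.

```python
def map_by_position(species_labels, num_contained):
    """Pair counts with species purely by position in the taxon list."""
    mapping = {}
    for species, count in zip(species_labels, num_contained):
        for i in range(1, count + 1):
            mapping["%s_%d" % (species, i)] = species
    return mapping

def map_by_label(counts_by_species):
    """Key the counts by species label, so taxon order no longer matters."""
    mapping = {}
    for species, count in counts_by_species.items():
        for i in range(1, count + 1):
            mapping["%s_%d" % (species, i)] = species
    return mapping

counts = [3, 4, 6, 2, 2]
order_1 = ["A", "B", "C", "D", "E"]
order_2 = ["E", "D", "C", "B", "A"]   # same taxa, read in a different order

by_pos_1 = map_by_position(order_1, counts)
by_pos_2 = map_by_position(order_2, counts)
by_label = map_by_label(dict(zip(order_1, counts)))

genes_for_a_1 = sum(1 for s in by_pos_1.values() if s == "A")
genes_for_a_2 = sum(1 for s in by_pos_2.values() if s == "A")
```

With positional pairing, species A receives three genes under the first ordering and two under the second; the label-keyed version gives the same result regardless of order.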
3.5.6 Simulating the Distribution of the Number of Deep Coalescences Under Different
Phylogeographic History Scenarios
A typical application for simulating censored coalescent trees is to produce a distribution of trees under different
hypotheses of demographic or phylogeographic histories.
For example, imagine we wanted to generate the distribution of the number of deep coalescences under two scenarios:
one in which a population underwent sequential or step-wise vicariance, and another when there was simultaneous fragmentation. This can be achieved by generating trees under contained_coalescent, and then using a
ContainingTree object to embed the trees and count the number of deep coalescences.
#! /usr/bin/env python

import dendropy
from dendropy import treesim
from dendropy import reconcile

# simulation parameters and output
num_reps = 10

# population tree descriptions
stepwise_tree_str = "[&R](A:120000,(B:80000,(C:40000,D:40000):40000):40000):100000"
frag_tree_str = "[&R](A:120000,B:120000,C:120000,D:120000):100000"

# taxa and trees
containing_taxa = dendropy.TaxonSet()
stepwise_tree = dendropy.Tree.get_from_string(
        stepwise_tree_str,
        "newick",
        taxon_set=containing_taxa)
frag_tree = dendropy.Tree.get_from_string(
        frag_tree_str,
        "newick",
        taxon_set=containing_taxa)

# taxon set association
genes_to_species = dendropy.TaxonSetMapping.create_contained_taxon_mapping(
        containing_taxon_set=containing_taxa,
        num_contained=8)

# convert to containing tree
stepwise_tree = reconcile.ContainingTree(stepwise_tree,
        contained_taxon_set=genes_to_species.domain_taxon_set,
        contained_to_containing_taxon_map=genes_to_species)
frag_tree = reconcile.ContainingTree(frag_tree,
        contained_taxon_set=genes_to_species.domain_taxon_set,
        contained_to_containing_taxon_map=genes_to_species)

# for each rep
for rep in range(num_reps):
    gene_tree1 = treesim.contained_coalescent(containing_tree=stepwise_tree,
            gene_to_containing_taxon_map=genes_to_species,
            default_pop_size=40000)
    stepwise_tree.embed_tree(gene_tree1)
    gene_tree2 = treesim.contained_coalescent(containing_tree=frag_tree,
            gene_to_containing_taxon_map=genes_to_species,
            default_pop_size=40000)
    frag_tree.embed_tree(gene_tree2)

# write results

# returns dictionary with contained trees as keys
# and number of deep coalescences as values
stepwise_deep_coals = stepwise_tree.deep_coalescences()
with open("stepwise.txt", "w") as stepwise_out:
    for tree in stepwise_deep_coals:
        stepwise_out.write("%d\n" % stepwise_deep_coals[tree])

# returns dictionary with contained trees as keys
# and number of deep coalescences as values
frag_deep_coals = frag_tree.deep_coalescences()
with open("frag.txt", "w") as frag_out:
    for tree in frag_deep_coals:
        frag_out.write("%d\n" % frag_deep_coals[tree])
Actually, the ContainingTree class has its own native contained coalescent simulator,
embed_contained_kingman, which simulates and embeds a contained coalescent tree at the same time.
So a more practical approach might be:
#! /usr/bin/env python

import dendropy
from dendropy import treesim
from dendropy import reconcile

# simulation parameters and output
num_reps = 10

# population tree descriptions
stepwise_tree_str = "[&R](A:120000,(B:80000,(C:40000,D:40000):40000):40000):100000"
frag_tree_str = "[&R](A:120000,B:120000,C:120000,D:120000):100000"

# taxa and trees
containing_taxa = dendropy.TaxonSet()
stepwise_tree = dendropy.Tree.get_from_string(
        stepwise_tree_str,
        "newick",
        taxon_set=containing_taxa)
frag_tree = dendropy.Tree.get_from_string(
        frag_tree_str,
        "newick",
        taxon_set=containing_taxa)

# taxon set association
genes_to_species = dendropy.TaxonSetMapping.create_contained_taxon_mapping(
        containing_taxon_set=containing_taxa,
        num_contained=8)

# convert to containing tree
stepwise_tree = reconcile.ContainingTree(stepwise_tree,
        contained_taxon_set=genes_to_species.domain_taxon_set,
        contained_to_containing_taxon_map=genes_to_species)
frag_tree = reconcile.ContainingTree(frag_tree,
        contained_taxon_set=genes_to_species.domain_taxon_set,
        contained_to_containing_taxon_map=genes_to_species)

# for each rep
for rep in range(num_reps):
    stepwise_tree.embed_contained_kingman(default_pop_size=40000)
    frag_tree.embed_contained_kingman(default_pop_size=40000)

# write results

# returns dictionary with contained trees as keys
# and number of deep coalescences as values
stepwise_deep_coals = stepwise_tree.deep_coalescences()
with open("stepwise.txt", "w") as stepwise_out:
    for tree in stepwise_deep_coals:
        stepwise_out.write("%d\n" % stepwise_deep_coals[tree])

# returns dictionary with contained trees as keys
# and number of deep coalescences as values
frag_deep_coals = frag_tree.deep_coalescences()
with open("frag.txt", "w") as frag_out:
    for tree in frag_deep_coals:
        frag_out.write("%d\n" % frag_deep_coals[tree])
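The quantity being tallied in these scripts can be illustrated without DendroPy. The following is a minimal sketch of a topology-only count of deep coalescences (extra lineages): for each species-tree edge, it counts the maximal gene-tree subtrees whose genes all belong to species below that edge, and adds that count minus one. It is not the ContainingTree implementation, which works from the embedded tree and its edge lengths and may therefore report different numbers. Trees are written as nested tuples of labels, and every name here is hypothetical.

```python
def leaf_set(tree):
    """Return the set of leaf labels of a nested-tuple tree."""
    if isinstance(tree, str):
        return frozenset([tree])
    result = frozenset()
    for child in tree:
        result |= leaf_set(child)
    return result

def species_clusters(tree):
    """Yield the set of species below every node of the species tree."""
    yield leaf_set(tree)
    if not isinstance(tree, str):
        for child in tree:
            for cluster in species_clusters(child):
                yield cluster

def count_maximal_subtrees(gene_tree, cluster, gene_to_species):
    """Count maximal gene subtrees whose genes all map into the cluster."""
    species = frozenset(gene_to_species[g] for g in leaf_set(gene_tree))
    if species <= cluster:
        return 1
    if isinstance(gene_tree, str):
        return 0
    return sum(count_maximal_subtrees(child, cluster, gene_to_species)
               for child in gene_tree)

def deep_coalescences(species_tree, gene_tree, gene_to_species):
    """Sum, over species-tree edges, the lineages leaving the edge minus one."""
    total = 0
    for cluster in species_clusters(species_tree):
        k = count_maximal_subtrees(gene_tree, cluster, gene_to_species)
        total += max(k - 1, 0)
    return total

species_tree = (("A", "B"), "C")
gene_to_species = {"a": "A", "b": "B", "c": "C"}
congruent = deep_coalescences(species_tree, (("a", "b"), "c"), gene_to_species)
discordant = deep_coalescences(species_tree, (("a", "c"), "b"), gene_to_species)
```

A gene tree that matches the species tree needs no deep coalescences, while the discordant gene tree forces two lineages to pass through the edge above (A, B), giving a count of one.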
CHAPTER 4
Working with Characters
4.1 Character Matrices
4.1.1 Types of Character Matrices
The CharacterMatrix object represents character data in DendroPy. In most cases, you will not deal with objects
of the CharacterMatrix class directly, but rather with objects of one of the classes specialized to handle specific
data types:
• DnaCharacterMatrix, for DNA nucleotide sequence data
• RnaCharacterMatrix, for RNA nucleotide sequence data
• ProteinCharacterMatrix, for amino acid sequence data
• ContinuousCharacterMatrix, for continuous-valued data
• StandardCharacterMatrix, for discrete-value data
4.1.2 CharacterMatrix Creating and Reading
As with most other phylogenetic data objects, objects of the CharacterMatrix-derived classes support the
get_from_* factory and read_from_* instance methods to populate objects from a data source. These methods
take a data source as the first argument, and a schema specification string (“nexus”, “newick”, “nexml”, “fasta”,
or “phylip”, etc.) as the second, as well as optional keyword arguments to customize the reading behavior.
Creating a New CharacterMatrix from a Data Source
The following examples simultaneously instantiate and populate CharacterMatrix objects of the appropriate type
from various file data sources:
>>> import dendropy
>>> dna = dendropy.DnaCharacterMatrix.get_from_path('pythonidae_cytb.nex', 'nexus')
>>> rna = dendropy.RnaCharacterMatrix.get_from_path('hiv1_env.nex', 'nexus')
>>> aa = dendropy.ProteinCharacterMatrix.get_from_path('pythonidae_mos.nex', 'nexus')
>>> cv = dendropy.ContinuousCharacterMatrix.get_from_path('pythonidae_sizes.nex', 'nexus')
>>> sm = dendropy.StandardCharacterMatrix.get_from_path('pythonidae_skull.nex', 'nexus')
Repopulating a CharacterMatrix from a Data Source
The read_from_* instance methods replace the calling object with data from the data source, overwriting existing
data:
>>> import dendropy
>>> dna = dendropy.DnaCharacterMatrix()
>>> dna.read_from_path('pythonidae_cytb.nex', 'nexus')
>>> dna.read_from_path('pythonidae_rag1.nex', 'nexus')
The second read_from_* will result in the dna object being re-populated with data from the file
pythonidae_rag1.nex.
4.1.3 CharacterMatrix Saving and Writing
Writing to Files
The write_to_stream and write_to_path instance methods allow you to write the data of a
CharacterMatrix to a file-like object or a file path respectively. These methods take a file-like object (in the
case of write_to_stream) or a string specifying a filepath (in the case of write_to_path) as the first argument, and a schema specification string as the second argument.
The following example reads a FASTA-formatted file and writes it out to a NEXUS-formatted file:
>>> import dendropy
>>> dna = dendropy.DnaCharacterMatrix.get_from_path('pythonidae_cytb.fasta', 'dnafasta')
>>> dna.write_to_path('pythonidae_cytb.nexus', 'nexus')
Fine-grained control over the output format can be specified using keyword arguments.
Composing a String
If you do not want to actually write to a file, but instead simply need a string representing the data in a particular
format, you can call the instance method as_string, passing a schema specification string as the first argument:
>>> import dendropy
>>> dna = dendropy.DnaCharacterMatrix.get_from_path('pythonidae_cytb.fasta', 'dnafasta')
>>> s = dna.as_string('nexus')
>>> print(s)
As above, fine-grained control over the output format can be specified using keyword arguments.
4.1.4 Taxon Management with Character Matrices
Taxon management with CharacterMatrix-derived objects works very much the same as it does with Tree or
TreeList objects: every time a CharacterMatrix-derived object is independently created or read, a new
TaxonSet is created, unless an existing one is specified. Thus, again, if you are creating multiple character matrices
that refer to the same set of taxa, you will want to make sure to pass each of them a common TaxonSet reference:
>>> import dendropy
>>> taxa = dendropy.TaxonSet()
>>> dna1 = dendropy.DnaCharacterMatrix.get_from_path("pythonidae_cytb.fasta", "dnafasta", taxon_set=taxa)
>>> std1 = dendropy.ProteinCharacterMatrix.get_from_path("pythonidae_morph.nex", "nexus", taxon_set=taxa)
4.1.5 Accessing Data
Each sequence for a particular Taxon object is organized into a CharacterDataVector object, which, in
turn, is a list of CharacterDataCell objects. You can retrieve the CharacterDataVector for a particular taxon by passing the corresponding Taxon object, its label, or its index to the CharacterMatrix object.
Thus, to get the character sequence vector associated with the first taxon (“Python regius”) from the data source
pythonidae_cytb.fasta:
>>> from dendropy import DnaCharacterMatrix
>>> cytb = DnaCharacterMatrix.get_from_path('pythonidae_cytb.fasta', 'dnafasta')
>>> v1 = cytb[0]
>>> v2 = cytb['Python regius']
>>> v3 = cytb[cytb.taxon_set[0]]
>>> v1 == v2 == v3
True
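The three equivalent lookups can be pictured with a toy container. The sketch below is not how CharacterMatrix is implemented; ToyTaxon and ToyMatrix are hypothetical stand-ins that only show one way a single indexing operation might dispatch on an integer index, a label, or a taxon object. The second taxon label and both sequences are made-up placeholders.

```python
class ToyTaxon(object):
    def __init__(self, label):
        self.label = label

class ToyMatrix(object):
    def __init__(self):
        self.taxa = []      # taxon objects, in the order they were added
        self._rows = {}     # taxon object -> list of character states

    def add(self, label, sequence):
        taxon = ToyTaxon(label)
        self.taxa.append(taxon)
        self._rows[taxon] = list(sequence)
        return taxon

    def __getitem__(self, key):
        if isinstance(key, int):
            # integer: position of the taxon
            key = self.taxa[key]
        elif isinstance(key, str):
            # string: taxon label
            matches = [t for t in self.taxa if t.label == key]
            if not matches:
                raise KeyError(key)
            key = matches[0]
        # otherwise the key is assumed to be the taxon object itself
        return self._rows[key]

m = ToyMatrix()
m.add("Python regius", "ACGT")
m.add("Taxon two", "ACGA")
v1 = m[0]
v2 = m["Python regius"]
v3 = m[m.taxa[0]]
```

All three lookups return the same underlying list of states.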
4.2 Phylogenetic Character Analyses
4.2.1 Phylogenetic Independent Contrasts (PIC)
Basic Analysis
A phylogenetic independent contrasts analysis (Felsenstein 1985; Garland et al. 2005) can be carried out
using the PhylogeneticIndependentConstrasts class. This requires you to have a Tree and a
ContinuousCharacterMatrix which reference the same TaxonSet. Thus, if the tree and characters are in the same file:
>>> import dendropy
>>> dataset = dendropy.DataSet.get_from_path("primates.cont.nex", "nexus")
>>> tree = dataset.tree_lists[0][0]
>>> chars = dataset.char_matrices[0]
If the tree and characters are in different files:
>>> import dendropy
>>> taxa = dendropy.TaxonSet()
>>> tree = dendropy.Tree.get_from_path("primates.tre", "newick", taxon_set=taxa)
>>> chars = dendropy.ContinuousCharacterMatrix.get_from_path("primates.cc.nex", "nexus", taxon_set=taxa)
In either case, we have a Tree object, tree, and a ContinuousCharacterMatrix object, chars, that both
reference the same TaxonSet.
Once the data is loaded, we create the PhylogeneticIndependentConstrasts object:
>>> from dendropy import continuous
>>> pic = dendropy.continuous.PhylogeneticIndependentContrasts(tree=tree, char_matrix=chars)
At this point, the data is ready for analysis. Typically, we want to map the contrasts onto a tree. The
contrasts_tree method takes a single mandatory argument, the 0-based index of the character (or column) to be
analyzed, and returns a Tree object that is a clone of the original input Tree, but with the following attributes added
to each Node:
• pic_state_value
• pic_state_variance
• pic_contrast_raw
• pic_contrast_variance
• pic_contrast_standardized
• pic_edge_length_error
• pic_corrected_edge_length
In addition to the 0-based index first argument, character_index, the contrasts_tree method takes the
following optional arguments:
annotate_pic_statistics If True then the PIC statistics attributes will be annotated (i.e., serialized or persisted when the tree is written out or saved). Defaults to False.
state_values_as_node_labels If True then the label attribute of each Node object will be
set to the value of the character.
corrected_edge_lengths If True then the Tree returned will have its edge lengths adjusted to
the corrected edge lengths as yielded by the PIC analysis.
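For readers who want to see where these per-node statistics come from, the following is a minimal sketch of the standard independent contrasts recursion (Felsenstein 1985) on a strictly bifurcating tree, written without DendroPy. The tuple-based tree layout and all names are hypothetical. The tip values and edge lengths are copied from the first Newick string shown later in this section, with tip names assigned in the order of its TAXA block, and only the subtree rooted at HPMA is used. The edge length above HPMA is cut off in that output, so 0.0 is used as a placeholder; this does not affect the values checked here.

```python
import math

def pic(node):
    """Return (state, corrected_edge_length, results) for a node.

    A leaf is (name, value, edge_length); an internal node is
    (name, left_child, right_child, edge_length).
    """
    if len(node) == 3:
        name, value, length = node
        return value, length, {}
    name, left, right, length = node
    x1, v1, res1 = pic(left)
    x2, v2, res2 = pic(right)
    results = dict(res1)
    results.update(res2)
    raw = x1 - x2                      # raw contrast
    variance = v1 + v2                 # variance of the raw contrast
    error = (v1 * v2) / (v1 + v2)      # extra length added above this node
    # state estimate: average of the children weighted by 1/edge length
    state = (x1 / v1 + x2 / v2) / (1.0 / v1 + 1.0 / v2)
    results[name] = {
        "state": state,
        "contrast_raw": raw,
        "contrast_variance": variance,
        "contrast_standardized": raw / math.sqrt(variance),
        "edge_length_error": error,
        "corrected_edge_length": length + error,
    }
    return state, length + error, results

hp = ("HP", ("Homo", 4.09434, 0.21), ("Pongo", 3.61092, 0.21), 0.28)
hpm = ("HPM", hp, ("Macaca", 2.37024, 0.49), 0.13)
# 0.0 is a placeholder: the real edge length above HPMA is not shown
hpma = ("HPMA", hpm, ("Ateles", 2.02815, 0.62), 0.0)

_, _, results = pic(hpma)
```

Under these assumptions the state, raw contrast, and edge length error computed for HP, HPM and HPMA agree with the corresponding columns of the table in “Results as a Table”.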
Results as a Table
So the following retrieves the contrasts tree for the first character (index=0), and prints a table of the various statistics:
>>> ctree1 = pic.contrasts_tree(character_index=0,
...         annotate_pic_statistics=True,
...         state_values_as_node_labels=False,
...         corrected_edge_lengths=False)
>>> for nd in ctree1.postorder_internal_node_iter():
...     row = [nd.pic_state_value,
...            nd.pic_state_variance,
...            nd.pic_contrast_raw,
...            nd.pic_edge_length_error]
...     row_str = [(("%10.8f") % i) for i in row]
...     row_str = "    ".join(row_str)
...     label = nd.label.ljust(6)
...     print("%s %s" % (label, row_str))
HP     3.85263000    0.38500000    0.48342000    0.10500000
HPM    3.20037840    0.34560000    1.48239000    0.21560000
HPMA   2.78082358    0.60190555    1.17222840    0.22190555
Root   1.18372461    0.37574347    4.25050358    0.37574347
Results as a Newick String with State Values as Node Labels
Alternatively, you might want to visualize the results as a tree showing the numeric values of the states. The following
produces this for each character in the matrix by first requesting that contrasts_tree replace existing node labels
with the state values for that node, and then, when writing out in Newick format, suppressing taxon labels and printing
node labels in their place:
#! /usr/bin/env python

import dendropy
from dendropy import continuous

taxa = dendropy.TaxonSet()
tree = dendropy.Tree.get_from_path(
        "primates.cc.tre",
        "newick",
        taxon_set=taxa)
chars = dendropy.ContinuousCharacterMatrix.get_from_path(
        "primates.cc.nex",
        "nexus",
        taxon_set=taxa)
pic = dendropy.continuous.PhylogeneticIndependentConstrasts(
        tree=tree,
        char_matrix=chars)
for cidx in range(chars.vector_size):
    ctree1 = pic.contrasts_tree(character_index=cidx,
            annotate_pic_statistics=True,
            state_values_as_node_labels=True,
            corrected_edge_lengths=False)
    print(ctree1.as_string("newick",
            suppress_leaf_taxon_labels=True,
            suppress_leaf_node_labels=False,
            suppress_internal_taxon_labels=True,
            suppress_internal_node_labels=False))
This results in:
[&R] ((((4.09434:0.21,3.61092:0.21)3.85263:0.28,2.37024:0.49)3.2003784:0.13,2.02815:0.62)2.7808235791
[&R] ((((4.74493:0.21,3.3322:0.21)4.038565:0.28,3.3673:0.49)3.7432084:0.13,2.89037:0.62)3.43796714996
Results as a NEXUS Document with Analysis Statistics as Node Metadata
However, probably the best way to visualize the results would be as a tree marked up with metadata that can be viewed
in FigTree (by checking “Node Labels” and selecting the appropriate statistics from the drop-down menu). This is,
in fact, even easier to do than the above, as it will result from the default options. The following illustrates this.
It collects the metadata-annotated contrast analysis trees produced by contrasts_tree in a TreeList object,
and then prints the TreeList as a NEXUS-formatted string. The default options to contrasts_tree result in
annotated attributes, while the default options to the writing method result in the annotations being written out as
comment metadata.
#! /usr/bin/env python

import dendropy
from dendropy import continuous

taxa = dendropy.TaxonSet()
tree = dendropy.Tree.get_from_path(
        "primates.cc.tre",
        "newick",
        taxon_set=taxa)
chars = dendropy.ContinuousCharacterMatrix.get_from_path(
        "primates.cc.nex",
        "nexus",
        taxon_set=taxa)
pic = dendropy.continuous.PhylogeneticIndependentConstrasts(
        tree=tree,
        char_matrix=chars)
pic_trees = dendropy.TreeList(taxon_set=taxa)
for cidx in range(chars.vector_size):
    ctree1 = pic.contrasts_tree(character_index=cidx)
    ctree1.label = "PIC %d" % (cidx+1)
    pic_trees.append(ctree1)
print(pic_trees.as_string("nexus"))
Thus, we get:
#NEXUS
BEGIN TAXA;
DIMENSIONS NTAX=5;
TAXLABELS
Homo
Pongo
Macaca
Ateles
Galago
;
END;
BEGIN TREES;
TREE PIC_1 = [&R] ((((Homo:0.21[&pic_contrast_variance=None,pic_edge_length_error=0.0,pic_state_v
TREE PIC_2 = [&R] ((((Homo:0.21[&pic_contrast_variance=None,pic_edge_length_error=0.0,pic_state_v
END;
Multifurcating Trees and Polytomies
By default, the PhylogeneticIndependentConstrasts class only handles fully-bifurcating trees, and throws
an exception if the input tree has polytomies. You can change this behavior by specifying one of the following strings
to the “polytomy_strategy” argument of the class constructor:
“ignore” Polytomies will be handled without complaint:
>>> pic = dendropy.continuous.PhylogeneticIndependentContrasts(tree=tree,
...         char_matrix=chars,
...         polytomy_strategy='ignore')
Note that in this case the raw contrast and the raw contrast variance calculated for nodes that have
more than two children will be invalid. The reconstructed state values should still be valid, though.
“resolve” Polytomies will be arbitrarily resolved with 0-length branches:
>>> pic = dendropy.continuous.PhylogeneticIndependentContrasts(tree=tree,
...         char_matrix=chars,
...         polytomy_strategy='resolve')
In this case the validity of the analysis for nodes with (originally) more than two children is dubious,
as the resulting contrasts are non-independent.
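What resolving a polytomy with 0-length branches means can be sketched on a toy tree. The code below is only an illustration under assumed conventions, not the routine DendroPy uses: nodes are (name, edge_length, children) tuples, the helper names are hypothetical, and children are grouped in a fixed order where a library might choose them at random.

```python
def resolve_polytomies(node):
    """Return a copy of the tree in which no node has more than two children.

    A node is (name, edge_length, children); leaves have an empty list.
    Surplus children are grouped under new internal nodes of length 0.
    """
    name, length, children = node
    children = [resolve_polytomies(child) for child in children]
    while len(children) > 2:
        # join the last two children under a new zero-length internal node
        right = children.pop()
        left = children.pop()
        children.append((None, 0.0, [left, right]))
    return (name, length, children)

def max_children(node):
    """Largest number of children found at any node of the tree."""
    name, length, children = node
    return max([len(children)] + [max_children(c) for c in children])

def leaf_names(node):
    """Names of all leaves, left to right."""
    name, length, children = node
    if not children:
        return [name]
    names = []
    for child in children:
        names.extend(leaf_names(child))
    return names

star = ("root", 0.0, [(label, 1.0, []) for label in "abcde"])
resolved = resolve_polytomies(star)
```

The five-way star becomes a fully bifurcating tree with the same leaves; because the new internal edges have zero length, the contrasts computed across them are not independent, which is the concern raised above.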
4.3 Population Genetic Summary Statistics
The popgenstat module provides functions that calculate some common population genetic summary statistics.
For example, given a DnaCharacterMatrix as an argument, the num_segregating_sites function returns the raw number of segregating sites, average_number_of_pairwise_differences returns the average number of pairwise differences, and nucleotide_diversity returns the nucleotide diversity.
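As a rough guide to what these statistics measure, the sketch below computes textbook versions of them on plain strings. It is not the popgenstat code: it assumes aligned sequences of equal length, treats every differing character as a difference, ignores gaps and ambiguity codes, and reports Watterson's theta per sequence rather than per site. How popgenstat handles these details is not specified here.

```python
from itertools import combinations

def num_segregating_sites(seqs):
    """Number of alignment columns with more than one state."""
    return sum(1 for column in zip(*seqs) if len(set(column)) > 1)

def average_number_of_pairwise_differences(seqs):
    """Mean number of differing sites over all pairs of sequences."""
    pairs = list(combinations(seqs, 2))
    diffs = [sum(1 for a, b in zip(s1, s2) if a != b) for s1, s2 in pairs]
    return sum(diffs) / float(len(pairs))

def nucleotide_diversity(seqs):
    """Average pairwise differences per site."""
    return average_number_of_pairwise_differences(seqs) / len(seqs[0])

def wattersons_theta(seqs):
    """Segregating sites divided by the harmonic number a_n = sum 1/i."""
    a_n = sum(1.0 / i for i in range(1, len(seqs)))
    return num_segregating_sites(seqs) / a_n

seqs = ["ACGTACGT", "ACGTACGA", "ACGAACGT", "ACGTACGT"]
s = num_segregating_sites(seqs)
k = average_number_of_pairwise_differences(seqs)
pi = nucleotide_diversity(seqs)
theta = wattersons_theta(seqs)
```

For these four made-up sequences there are two segregating sites, the six pairs differ at one site on average, and the nucleotide diversity is 1/8.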
More complex statistics are provided by the PopulationPairSummaryStatistics class. Objects of this class
are instantiated with two lists of CharacterDataVector objects as arguments, each representing a sample of
DNA sequences drawn from two distinct but related populations. Once instantiated, the following attributes of the
PopulationPairSummaryStatistics object are available:
average_number_of_pairwise_differences The average number of pairwise differences
between every sequence across both populations.
average_number_of_pairwise_differences_between The average number of pairwise
differences between every sequence between both populations.
average_number_of_pairwise_differences_within The average number of pairwise differences between every sequence within each population.
average_number_of_pairwise_differences_net The net number of pairwise differences.
num_segregating_sites The number of segregating sites.
wattersons_theta Watterson’s theta.
wakeleys_psi Wakeley’s psi.
tajimas_d Tajima’s D.
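The distinction between the total, between, within, and net averages can be made concrete with a small sketch on plain strings. This is one common set of definitions, stated here as an assumption rather than as the class's own formulas: "within" is averaged over all same-population pairs pooled together, and "net" is taken as between minus within. The function names are hypothetical.

```python
from itertools import combinations, product

def count_differences(s1, s2):
    """Number of sites at which two aligned sequences differ."""
    return sum(1 for a, b in zip(s1, s2) if a != b)

def mean(values):
    values = list(values)
    return sum(values) / float(len(values))

def pairwise_summary(p1, p2):
    """Average pairwise differences across, between and within two samples."""
    between = mean(count_differences(a, b) for a, b in product(p1, p2))
    within_pairs = list(combinations(p1, 2)) + list(combinations(p2, 2))
    within = mean(count_differences(a, b) for a, b in within_pairs)
    total = mean(count_differences(a, b)
                 for a, b in combinations(p1 + p2, 2))
    return {"total": total,
            "between": between,
            "within": within,
            "net": between - within}

p1 = ["AAAA", "AAAT"]
p2 = ["TTAA", "TTAT"]
summary = pairwise_summary(p1, p2)
```

In this made-up example, sequences from different populations differ at 2.5 sites on average and sequences from the same population at 1.0, leaving a net difference of 1.5.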
The following example calculates the suite of population genetic summary statistics for sequences drawn from two
populations of sticklebacks. The original data consists of 23 sequences, with individuals from Eastern Pacific populations identified by their taxon labels beginning with “EPAC” and individuals from Western Pacific populations
identified by their taxon labels beginning with “WPAC”. The taxon labels thus are used as the basis for sorting the
sequences into the required lists of CharacterDataVector objects, p1 and p2.
 1  #! /usr/bin/env python
 2
 3  import dendropy
 4  from dendropy import popgenstat
 5
 6  seqs = dendropy.DnaCharacterMatrix.get_from_path("orti1994.nex", schema="nexus")
 7  p1 = []
 8  p2 = []
 9  for idx, t in enumerate(seqs.taxon_set):
10      if t.label.startswith('EPAC'):
11          p1.append(seqs[t])
12      else:
13          p2.append(seqs[t])
14  pp = popgenstat.PopulationPairSummaryStatistics(p1, p2)
15
16  print('Average number of pairwise differences (total): %s' \
17      % pp.average_number_of_pairwise_differences)
18  print('Average number of pairwise differences (between populations): %s' \
19      % pp.average_number_of_pairwise_differences_between)
20  print('Average number of pairwise differences (within populations): %s' \
21      % pp.average_number_of_pairwise_differences_within)
22  print('Average number of pairwise differences (net): %s' \
23      % pp.average_number_of_pairwise_differences_net)
24  print('Number of segregating sites: %s' \
25      % pp.num_segregating_sites)
26  print("Watterson's theta: %s" \
27      % pp.wattersons_theta)
28  print("Wakeley's Psi: %s" \
29      % pp.wakeleys_psi)
30  print("Tajima's D: %s" \
31      % pp.tajimas_d)
Lines 6-13 build up the two lists of CharacterDataVector objects by sorting the original sequences into their
source populations based on the taxon label: operational taxonomic units with labels beginning with “EPAC”
come from the Eastern Pacific and are assigned to the list p1, while those with labels beginning with “WPAC” come from
the Western Pacific and are assigned to the list p2. These lists are then passed as the instantiation arguments to the
PopulationPairSummaryStatistics constructor in line 14. The calculations are performed immediately,
and the results reported in the following lines.
CHAPTER 5
Working with Data Sets
5.1 Data Sets
The DataSet class provides for objects that allow you to manage multiple types of phylogenetic data.
It has three primary attributes:
taxon_sets A list of all TaxonSet objects in the DataSet, in the order that they were added or
read, including TaxonSet objects added implicitly through being associated with added TreeList
or CharacterMatrix objects.
tree_lists A list of all TreeList objects in the DataSet, in the order that they were added or
read.
char_matrices A list of all CharacterMatrix objects in the DataSet, in the order that they
were added or read.
5.1.1 DataSet Creation and Reading
Creating a new DataSet from a Data Source
You can use the get_from_stream, get_from_path, and get_from_string factory class methods for
simultaneously instantiating and populating an object, taking a data source as the first argument and a schema specification string (“nexus”, “newick”, “nexml”, “fasta”, “phylip”, etc.) as the second:
>>> import dendropy
>>> ds = dendropy.DataSet.get_from_path('pythonidae.nex', 'nexus')
In addition, fine-grained control over the parsing of the data source is available through various keyword arguments.
Reading into an Existing DataSet from a Data Source
The read_from_stream, read_from_path, and read_from_string instance methods for populating existing objects are also supported, taking the same arguments (i.e., a data source, a schema specification string, as well
as optional keyword arguments to customize the parse behavior):
>>> import dendropy
>>> ds = dendropy.DataSet()
>>> ds.attach_taxon_set()
>>> ds.read_from_path('pythonidae.cytb.fasta', 'dnafasta')
>>> ds.read_from_path('pythonidae.mle.nex', 'nexus')
Note how the attach_taxon_set method is called before invoking any read_from_* statements, to ensure that
all the taxon references in the data sources get mapped to the same TaxonSet instance.
Cloning an Existing DataSet
You can also clone an existing DataSet object by passing it as an argument to the DataSet constructor:
>>> import dendropy
>>> ds1 = dendropy.DataSet.get_from_path('pythonidae.cytb.fasta', 'dnafasta')
>>> ds2 = dendropy.DataSet(ds1)
Following this, ds2 will be a full deep-copy clone of ds1, with distinct and independent, but identical, Taxon,
TaxonSet, TreeList, Tree and CharacterMatrix objects. Note that, in distinction to the similar cloning
methods of Tree and TreeList, even the Taxon and TaxonSet objects are cloned, meaning that you can manipulate
the Taxon and TaxonSet objects of ds2 without in any way affecting those of ds1.
Creating a New DataSet from Existing TreeList and CharacterMatrix Objects
You can add independently created or parsed data objects to a DataSet by passing them as unnamed arguments to
the constructor:
>>> import dendropy
>>> treelist1 = dendropy.TreeList.get_from_path('pythonidae_cytb.mb.run1.t', 'nexus')
>>> treelist2 = dendropy.TreeList.get_from_path('pythonidae_cytb.mb.run2.t', 'nexus')
>>> treelist3 = dendropy.TreeList.get_from_path('pythonidae_cytb.mb.run3.t', 'nexus')
>>> treelist4 = dendropy.TreeList.get_from_path('pythonidae_cytb.mb.run4.t', 'nexus')
>>> cytb = dendropy.DnaCharacterMatrix.get_from_path('pythonidae_cytb.fasta', 'dnafasta')
>>> ds = dendropy.DataSet(cytb, treelist1, treelist2, treelist3, treelist4)
>>> ds.unify_taxa()
Note how we call the instance method unify_taxa after the creation of the DataSet object. This method will remove all existing TaxonSet objects from the DataSet, create and add a new one, and then map all taxon references
in all contained TreeList and CharacterMatrix objects to this new, unified TaxonSet.
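The effect of unifying taxa can be pictured with a toy model in which every collection starts out holding its own taxon objects. The sketch below is a hypothetical illustration of remapping by label, not the unify_taxa implementation; ToyTaxon and the helper function are invented for the example, and the second label is a placeholder.

```python
class ToyTaxon(object):
    def __init__(self, label):
        self.label = label

def unify_taxa(collections):
    """Remap every taxon reference to one shared object per label.

    Each collection is a list of ToyTaxon objects and is updated in place.
    Returns the list of unified taxon objects.
    """
    unified = {}
    for taxa in collections:
        for i, taxon in enumerate(taxa):
            if taxon.label not in unified:
                unified[taxon.label] = ToyTaxon(taxon.label)
            taxa[i] = unified[taxon.label]
    return list(unified.values())

# Two independently read sources: same labels, distinct objects
trees_taxa = [ToyTaxon("Python regius"), ToyTaxon("Taxon two")]
chars_taxa = [ToyTaxon("Taxon two"), ToyTaxon("Python regius")]

same_before = trees_taxa[0] is chars_taxa[1]
shared = unify_taxa([trees_taxa, chars_taxa])
same_after = trees_taxa[0] is chars_taxa[1]
```

Before unification the two "Python regius" entries are different objects that merely share a label; afterwards both collections refer to the same object.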
Adding Data to an Existing DataSet
You can add independently created or parsed data objects to a DataSet using the add method:
>>> import dendropy
>>> ds = dendropy.DataSet()
>>> treelist1 = dendropy.TreeList.get_from_path('pythonidae_cytb.mb.run1.t', 'nexus')
>>> treelist2 = dendropy.TreeList.get_from_path('pythonidae_cytb.mb.run2.t', 'nexus')
>>> treelist3 = dendropy.TreeList.get_from_path('pythonidae_cytb.mb.run3.t', 'nexus')
>>> treelist4 = dendropy.TreeList.get_from_path('pythonidae_cytb.mb.run4.t', 'nexus')
>>> cytb = dendropy.DnaCharacterMatrix.get_from_path('pythonidae_cytb.fasta', 'dnafasta')
>>> ds.add(treelist1)
>>> ds.add(treelist2)
>>> ds.add(treelist3)
>>> ds.add(treelist4)
>>> ds.add(cytb)
>>> ds.unify_taxa()
Here, again, we call unify_taxa to map all taxon references to the same, common, unified TaxonSet.
5.1.2 DataSet Saving and Writing
Writing to Files
The write_to_stream and write_to_path instance methods allow you to write the data of a DataSet
object to a file-like object or a file path respectively. These methods take a file-like object (in the case of
write_to_stream) or a string specifying a filepath (in the case of write_to_path) as the first argument,
and a schema specification string as the second argument.
The following example aggregates the post-burn-in MCMC samples from a series of NEXUS-formatted tree files into
a single TreeList, then adds the TreeList as well as the original character data into a single DataSet object,
which is then written out as a NEXUS-formatted file:
#! /usr/bin/env python

import dendropy

trees = dendropy.TreeList()
trees.read_from_path('pythonidae.mb.run1.t', 'nexus', tree_offset=200)
trees.read_from_path('pythonidae.mb.run2.t', 'nexus', tree_offset=200)
trees.read_from_path('pythonidae.mb.run3.t', 'nexus', tree_offset=200)
trees.read_from_path('pythonidae.mb.run4.t', 'nexus', tree_offset=200)
ds = dendropy.DataSet(trees)
ds.read_from_path('pythonidae_cytb.fasta',
        schema='fasta',
        data_type='dna',
        taxon_set=ds.taxon_sets[0])
ds.write_to_path('pythonidae_combined.nex', 'nexus')
Fine-grained control over the output format can be specified using keyword arguments.
Composing a String
If you do not want to actually write to a file, but instead simply need a string representing the data in a particular
format, you can call the instance method as_string, passing a schema specification string as the first argument:
>>> import dendropy
>>> ds = dendropy.DataSet(attach_taxon_set=True)
>>> ds.read_from_path('pythonidae.cytb.fasta', 'dnafasta')
>>> s = ds.as_string('nexus')
As above, fine-grained control over the output format can be specified using keyword arguments.
5.1.3 Taxon Management with Data Sets
The DataSet object, representing a meta-collection of phylogenetic data, differs in one important way from all the
other phylogenetic data objects discussed so far with respect to taxon management, in that it is not associated with any
particular TaxonSet object. Rather, it maintains a list (in the property taxon_sets) of all the TaxonSet objects
referenced by its contained TreeList objects (in the property tree_lists) and CharacterMatrix objects (in
the property char_matrices).
With respect to taxon management, DataSet objects operate in one of two modes: “detached taxon set” mode and
“attached taxon set” mode.
Detached (Multiple) Taxon Set Mode
In the “detached taxon set” mode, which is the default, a DataSet object tracks all TaxonSet references of its
data members in the property taxon_sets, but no effort is made at taxon management as such. Thus, every time a
data source is read with a “detached taxon set” mode DataSet object, by default, a new TaxonSet object will be
created and associated with the Tree, TreeList, or CharacterMatrix objects created from each data source,
resulting in multiple independent TaxonSet references. As such, “detached taxon set” mode DataSet objects are
suitable for handling data with multiple distinct sets of taxa.
For example:
>>> import dendropy
>>> ds = dendropy.DataSet()
>>> ds.read_from_path("primates.nex", "nexus")
>>> ds.read_from_path("snakes.nex", "nexus")
The dataset, ds, will now contain two distinct TaxonSet objects, one for the taxa defined in “primates.nex”,
and the other for the taxa defined in “snakes.nex”. In this case, this behavior is correct, as the two files do indeed refer
to different sets of taxa.
However, consider the following:
>>> import dendropy
>>> ds = dendropy.DataSet()
>>> ds.read_from_path("pythonidae_cytb.fasta", "dnafasta")
>>> ds.read_from_path("pythonidae_aa.nex", "nexus")
>>> ds.read_from_path("pythonidae_morphological.nex", "nexus")
>>> ds.read_from_path("pythonidae.mle.tre", "nexus")
Here, even though all the data files refer to the same set of taxa, the resulting DataSet object will actually have
four distinct TaxonSet objects, one for each of the independent reads, and a taxon with a particular label in the first
file (e.g., “Python regius” of “pythonidae_cytb.fasta”) will map to a completely different Taxon object from a taxon
with the same label in the second file (e.g., “Python regius” of “pythonidae_aa.nex”). This is incorrect behavior, and to
achieve the correct behavior with a multiple taxon set mode DataSet object, we need to explicitly pass a TaxonSet
object to each of the subsequent read_from_path statements:
>>> import dendropy
>>> ds = dendropy.DataSet()
>>> ds.read_from_path("pythonidae_cytb.fasta", "dnafasta")
>>> ds.read_from_path("pythonidae_aa.nex", "nexus", taxon_set=ds.taxon_sets[0])
>>> ds.read_from_path("pythonidae_morphological.nex", "nexus", taxon_set=ds.taxon_sets[0])
>>> ds.read_from_path("pythonidae.mle.tre", "nexus", taxon_set=ds.taxon_sets[0])
>>> ds.write_to_path("pythonidae_combined.nex", "nexus")
In the previous example, the first read_from_path statement results in a new TaxonSet object, which is added
to the taxon_sets property of the DataSet object ds. This TaxonSet object gets passed via the taxon_set
keyword to subsequent read_from_path statements, and thus as each of the data sources is processed, the taxon
references get mapped to Taxon objects in the same, single, TaxonSet object.
While this approach works to ensure correct taxon mapping across multiple data object reads and instantiation, in this
context, it is probably more convenient to use the DataSet in “attached taxon set” mode.
Attached (Single) Taxon Set Mode
In the “attached taxon set” mode, DataSet objects ensure that the taxon references of all data objects that are added
to them are mapped to the same TaxonSet object (at least one for each independent read or creation operation).
The “attached taxon set” mode can be set by passing the keyword argument attach_taxon_set=True to the
constructor of the DataSet when instantiating a new DataSet object (in which case a new TaxonSet object
will be created and added to the DataSet object as the default), by passing an existing TaxonSet object to which
to attach using the keyword argument taxon_set, or by calling attach_taxon_set on an existing DataSet
object.
For example:
>>> import dendropy
>>> ds = dendropy.DataSet(attach_taxon_set=True)
>>> ds.read_from_path("pythonidae_cytb.fasta", "dnafasta")
>>> ds.read_from_path("pythonidae_aa.nex", "nexus")
>>> ds.read_from_path("pythonidae_morphological.nex", "nexus")
>>> ds.read_from_path("pythonidae.mle.tre", "nexus")
Or:
>>> import dendropy
>>> taxa = dendropy.TaxonSet(label="global")
>>> ds = dendropy.DataSet(taxon_set=taxa)
>>> ds.read_from_path("pythonidae_cytb.fasta", "dnafasta")
>>> ds.read_from_path("pythonidae_aa.nex", "nexus")
>>> ds.read_from_path("pythonidae_morphological.nex", "nexus")
>>> ds.read_from_path("pythonidae.mle.tre", "nexus")
Or:
>>> import dendropy
>>> ds = dendropy.DataSet()
>>> ds.attach_taxon_set()
<TaxonSet object at 0x5779c0>
>>> ds.read_from_path("pythonidae_cytb.fasta", "dnafasta")
>>> ds.read_from_path("pythonidae_aa.nex", "nexus")
>>> ds.read_from_path("pythonidae_morphological.nex", "nexus")
>>> ds.read_from_path("pythonidae.mle.tre", "nexus")
All of the above will result in only a single TaxonSet object that has all the taxa from the four data sources mapped
to it. Note how attach_taxon_set returns the new TaxonSet object created and attached when called. If
you needed to detach the TaxonSet object and then later reattach it, you would assign the return value to a
variable, and pass it as an argument to the later call to attach_taxon_set.
Switching Between Attached and Detached Taxon Set Modes
As noted above, you can use the attach_taxon_set method to switch a DataSet object to attached taxon set
mode. To restore it to multiple taxon set mode, you would use the detach_taxon_set method:
>>> import dendropy
>>> ds = dendropy.DataSet()
>>> ds.attach_taxon_set()
<TaxonSet object at 0x5779c0>
>>> ds.read_from_path("pythonidae_cytb.fasta", "dnafasta")
>>> ds.read_from_path("pythonidae_aa.nex", "nexus")
>>> ds.read_from_path("pythonidae_morphological.nex", "nexus")
>>> ds.read_from_path("pythonidae.mle.tre", "nexus")
>>> ds.detach_taxon_set()
>>> ds.read_from_path("primates.nex", "nexus")
Here, the same TaxonSet object is used to manage taxon references for data parsed from the first four files, while
the data from the fifth and final file gets its own, distinct, TaxonSet object and associated Taxon object references.
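The difference between the two modes can be modeled without the library at all. The sketch below is a toy, assuming that a "taxon set" is nothing more than a dictionary from labels to objects; ToyDataSet and its read method are hypothetical names invented for this illustration, and only the method names attach_taxon_set and detach_taxon_set echo the real API.

```python
class ToyDataSet(object):
    """Hypothetical model of the two taxon-management modes (not DendroPy)."""

    def __init__(self):
        self.taxon_sets = []   # every label -> object namespace seen so far
        self.attached = None   # the shared namespace, or None when detached

    def attach_taxon_set(self, taxon_set=None):
        if taxon_set is None:
            taxon_set = {}
        # Compare by identity: two empty dicts are equal but not the same set.
        if not any(ts is taxon_set for ts in self.taxon_sets):
            self.taxon_sets.append(taxon_set)
        self.attached = taxon_set
        return taxon_set

    def detach_taxon_set(self):
        self.attached = None

    def read(self, labels):
        """Simulate one read; return the namespace the labels were mapped into."""
        if self.attached is not None:
            target = self.attached          # attached mode: reuse the shared set
        else:
            target = {}                     # detached mode: a fresh set per read
            self.taxon_sets.append(target)
        for label in labels:
            target.setdefault(label, object())
        return target


ds = ToyDataSet()
ds.attach_taxon_set()
a = ds.read(["Python regius", "Python sebae"])
b = ds.read(["Python regius", "Morelia bredli"])
ds.detach_taxon_set()
c = ds.read(["Homo sapiens"])

print(len(ds.taxon_sets))
```

The first two reads land in the same namespace, so "Python regius" resolves to one object across both, while the read made after detaching gets a namespace of its own.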
Attaching a Particular Taxon Set
When attach_taxon_set is called without arguments, a new TaxonSet object is created and added to the
taxon_sets list of the DataSet object, and taxon references of all data subsequently read (or created and added
independently) will be mapped to Taxon objects in this new TaxonSet object. If you want to use an existing
TaxonSet object instead of a new one, you can pass this object as an argument to the attach_taxon_set
method:
>>> import dendropy
>>> ds = dendropy.DataSet()
>>> ds.read_from_path("pythonidae_cytb.fasta", "dnafasta")
>>> ds.read_from_path("primates.nex", "nexus")
>>> ds.attach_taxon_set(ds.taxon_sets[0])
<TaxonSet object at 0x5b8150>
>>> ds.read_from_path("pythonidae_aa.nex", "nexus")
>>> ds.read_from_path("pythonidae_morphological.nex", "nexus")
>>> ds.read_from_path("pythonidae.mle.tre", "nexus")
>>> ds.detach_taxon_set()
Here, the first two read_from_path statements result in two distinct TaxonSet objects, one for each read, each
with its own independent Taxon objects. The attach_taxon_set statement is passed the TaxonSet object
from the first read operation, and all data created from the next three read_from_path statements will have their
taxon references mapped to this first TaxonSet object.
CHAPTER 6
Working with Metadata Annotations
DendroPy provides a rich infrastructure for decorating most types of phylogenetic objects (e.g., the DataSet,
TaxonSet, Taxon, TreeList, Tree, and various CharacterMatrix classes) with metadata information.
These phylogenetic objects have an attribute, annotations, that is an instance of the AnnotationSet class,
which is an iterable (derived from dendropy.utility.containers.OrderedSet) that serves to manage a
collection of Annotation objects. Each Annotation object tracks a single annotation element. These annotations
will be rendered as meta elements when writing to NeXML format or as ampersand-prepended comment strings
when writing to NEXUS/NEWICK format. Note that full and robust expression of metadata annotations, including
stable and consistent round-tripping of information, can only be achieved in the NeXML format.
6.1 Overview of the Infrastructure for Metadata Annotation in DendroPy
Each item of metadata is maintained in an object of the Annotation class. This class has the following attributes:
name The name of the metadata item or annotation.
value The value or content of the metadata item or annotation.
datatype_hint Custom data type indication for NeXML output (e.g. “xsd:string”).
name_prefix Prefix that represents an abbreviation of the namespace associated with this metadata
item.
namespace The namespace (e.g. “http://www.w3.org/XML/1998/namespace”) of this metadata item
(NeXML output).
annotate_as_reference If True, indicates that this annotation should not be interpreted semantically as a literal value, but rather as a source to be dereferenced.
is_hidden If True, indicates that this annotation should not be printed or written out.
prefixed_name Returns the name of this annotation with its namespace prefix (e.g. “dc:subject”).
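As a rough mental model of these attributes, the sketch below defines a small stand-in record. ToyAnnotation is a hypothetical class written for this illustration, not DendroPy's Annotation; the default prefix and namespace are assumptions taken from the default "dendropy" namespace mapping described later in this chapter.

```python
class ToyAnnotation(object):
    """Hypothetical record with the attributes listed above (not DendroPy's class)."""

    def __init__(self, name, value, datatype_hint=None,
                 name_prefix="dendropy",
                 namespace="http://packages.python.org/DendroPy/",
                 annotate_as_reference=False, is_hidden=False):
        self.name = name
        self.value = value
        self.datatype_hint = datatype_hint
        self.name_prefix = name_prefix
        self.namespace = namespace
        self.annotate_as_reference = annotate_as_reference
        self.is_hidden = is_hidden

    @property
    def prefixed_name(self):
        # Join the namespace abbreviation and the bare name, e.g. "dc:subject".
        return "%s:%s" % (self.name_prefix, self.name)


annotations = [
    ToyAnnotation("subject", "Pythonidae",
                  name_prefix="dc",
                  namespace="http://purl.org/dc/elements/1.1/"),
    ToyAnnotation("answer", 42, datatype_hint="xsd:integer"),
    ToyAnnotation("note", "internal use only", is_hidden=True),
]

# Only annotations that are not hidden would be written out.
visible = [a.prefixed_name for a in annotations if not a.is_hidden]
print(visible)
```

Here prefixed_name is derived from name_prefix and name rather than stored, which is why it is written as a read-only property in the sketch.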
These Annotation objects are typically collected and managed in an “annotations manager” container class,
AnnotationSet. This is a specialization of dendropy.utility.containers.OrderedSet whose elements
are instances of Annotation. The full set of annotations associated with each object of the DataSet,
TaxonSet, Taxon, TreeList, Tree, various CharacterMatrix, and other phylogenetic data class types is
available through the annotations attribute of those objects, which is an instance of AnnotationSet. The
AnnotationSet includes the following additional methods to support the creation, access, and management of the
Annotation object elements contained within it:
• add_new
• add_bound_attribute
• add_citation
• findall
• find
• drop
• values_as_dict
The following code snippet reads in a data file in NeXML format, and dumps out the annotations:
#! /usr/bin/env python
import sys
import dendropy
ds = dendropy.DataSet.get_from_path("sample1.xml", "nexml")
print "-- (dataset) ---\n"
for a in ds.annotations:
    print "%s = '%s'" % (a.name, a.value)
for tree_list in ds.tree_lists:
    for tree in tree_list:
        print "\n-- (tree '%s') --\n" % tree.label
        for a in tree.annotations:
            print "%s = '%s'" % (a.name, a.value)
Running the above results in:
-- (dataset) ---
bibliographicCitation = ’Wiklund H., Altamira I.V., Glover A., Smith C., Baco A., & Dahlgren T.G. 201
subject = ’whale-fall’
changeNote = ’Generated on Wed Jun 06 11:02:45 EDT 2012’
subject = ’wood-fall’
title = ’Systematics and biodiversity of Ophryotrocha (Annelida, Dorvilleidae) with descriptions of s
publicationName = ’Systematics and Biodiversity’
creator = ’Wiklund H., Altamira I.V., Glover A., Smith C., Baco A., & Dahlgren T.G.’
publisher = ’Systematics and Biodiversity’
contributor = ’Wiklund H.’
volume = ’’
contributor = ’Altamira I.V.’
number = ’’
contributor = ’Glover A.’
historyNote = ’Mapped from TreeBASE schema using org.cipres.treebase.domain.nexus.nexml.NexmlDocument
contributor = ’Smith C.’
modificationDate = ’2012-06-04’
contributor = ’Baco A.’
contributor = ’Dahlgren T.G.’
identifier.study.tb1 = ’None’
publicationDate = ’2012’
section = ’Study’
doi = ’’
title.study = ’Systematics and biodiversity of Ophryotrocha (Annelida, Dorvilleidae) with description
subject = ’New species’
subject = ’Ophryotrocha’
creationDate = ’2012-05-09’
subject = ’polychaeta’
date = ’2012-06-04’
subject = ’molecular phylogeny’
identifier.study = ’12713’
-- (tree ’con 50 majrule’) --
ntax.tree = ’41’
kind.tree = ’Species Tree’
quality.tree = ’Unrated’
isDefinedBy = ’http://purl.org/phylo/treebase/phylows/study/TB2:S12713’
type.tree = ’Consensus’
The following sections discuss these methods and attributes in detail, describing how to create, read, write, search,
and manipulate annotations.
6.2 Metadata Annotation Creation
6.2.1 Reading Data from an External Source
When reading data in NeXML format, metadata annotations given in the source are automatically created and associated with the corresponding data objects.
The metadata annotations associated with the phylogenetic data objects are collected in the attribute annotations
of the objects, which is an object of type AnnotationSet. Each annotation item is represented as an object of type
Annotation.
For example:
#! /usr/bin/env python
import dendropy
ds = dendropy.DataSet.get_from_path("pythonidae.annotated.nexml",
        "nexml")
for a in ds.annotations:
    print "Data Set '%s': %s" % (ds.label, a)
for taxon_set in ds.taxon_sets:
    for a in taxon_set.annotations:
        print "Taxon Set '%s': %s" % (taxon_set.label, a)
    for taxon in taxon_set:
        for a in taxon.annotations:
            print "Taxon '%s': %s" % (taxon.label, a)
for tree_list in ds.tree_lists:
    for a in tree_list.annotations:
        print "Tree List '%s': %s" % (tree_list.label, a)
    for tree in tree_list:
        for a in tree.annotations:
            print "Tree '%s': %s" % (tree.label, a)
produces:
Data Set ’None’: description="composite dataset of Pythonid sequences and trees"
Data Set ’None’: subject="Pythonidae"
Taxon Set ’None’: subject="Pythonidae"
Taxon ’Python regius’: closeMatch="http://purl.uniprot.org/taxonomy/51751"
Taxon ’Python sebae’: closeMatch="http://purl.uniprot.org/taxonomy/51752"
Taxon ’Python molurus’: closeMatch="http://purl.uniprot.org/taxonomy/51750"
Taxon ’Python curtus’: closeMatch="http://purl.uniprot.org/taxonomy/143436"
Taxon ’Morelia bredli’: closeMatch="http://purl.uniprot.org/taxonomy/461327"
Taxon ’Morelia spilota’: closeMatch="http://purl.uniprot.org/taxonomy/51896"
Taxon ’Morelia tracyae’: closeMatch="http://purl.uniprot.org/taxonomy/129332"
Taxon ’Morelia clastolepis’: closeMatch="http://purl.uniprot.org/taxonomy/129329"
Taxon ’Morelia kinghorni’: closeMatch="http://purl.uniprot.org/taxonomy/129330"
Taxon ’Morelia nauta’: closeMatch="http://purl.uniprot.org/taxonomy/129331"
Taxon ’Morelia amethistina’: closeMatch="http://purl.uniprot.org/taxonomy/51895"
Taxon ’Morelia oenpelliensis’: closeMatch="http://purl.uniprot.org/taxonomy/461329"
Taxon ’Antaresia maculosa’: closeMatch="http://purl.uniprot.org/taxonomy/51891"
Taxon ’Antaresia perthensis’: closeMatch="http://purl.uniprot.org/taxonomy/461324"
Taxon ’Antaresia stimsoni’: closeMatch="http://purl.uniprot.org/taxonomy/461325"
Taxon ’Antaresia childreni’: closeMatch="http://purl.uniprot.org/taxonomy/51888"
Taxon ’Morelia carinata’: closeMatch="http://purl.uniprot.org/taxonomy/461328"
Taxon ’Morelia viridisN’: closeMatch="http://purl.uniprot.org/taxonomy/129333"
Taxon ’Morelia viridisS’: closeMatch="http://purl.uniprot.org/taxonomy/129333"
Taxon ’Apodora papuana’: closeMatch="http://purl.uniprot.org/taxonomy/129310"
Taxon ’Liasis olivaceus’: closeMatch="http://purl.uniprot.org/taxonomy/283338"
Taxon ’Liasis fuscus’: closeMatch="http://purl.uniprot.org/taxonomy/129327"
Taxon ’Liasis mackloti’: closeMatch="http://purl.uniprot.org/taxonomy/51889"
Taxon ’Antaresia melanocephalus’: closeMatch="http://purl.uniprot.org/taxonomy/51883"
Taxon ’Antaresia ramsayi’: closeMatch="http://purl.uniprot.org/taxonomy/461326"
Taxon ’Liasis albertisii’: closeMatch="http://purl.uniprot.org/taxonomy/129326"
Taxon ’Bothrochilus boa’: closeMatch="http://purl.uniprot.org/taxonomy/461341"
Taxon ’Morelia boeleni’: closeMatch="http://purl.uniprot.org/taxonomy/129328"
Taxon ’Python timoriensis’: closeMatch="http://purl.uniprot.org/taxonomy/51753"
Taxon ’Python reticulatus’: closeMatch="http://purl.uniprot.org/taxonomy/37580"
Taxon ’Xenopeltis unicolor’: closeMatch="http://purl.uniprot.org/taxonomy/196253"
Taxon ’Candoia aspera’: closeMatch="http://purl.uniprot.org/taxonomy/51853"
Taxon ’Loxocemus bicolor’: closeMatch="http://purl.uniprot.org/taxonomy/39078"
Tree ’0’: treeEstimator="RAxML"
Tree ’0’: substitutionModel="GTR+G+I"
Metadata annotations in NEXUS and NEWICK must be given in the form of “hot comments” either in BEAST/FigTree
syntax:
[&subject=’Pythonidae’]
[&length_hpd95={0.01917252,0.06241567},length_quant_5_95={0.02461821,0.06197141},length_range={0.0157
or NHX-like syntax:
[&&subject=’Pythonidae’]
[&&length_hpd95={0.01917252,0.06241567},length_quant_5_95={0.02461821,0.06197141},length_range={0.015
However, by default these annotations are not parsed into the DendroPy data model unless the keyword argument
extract_comment_metadata=True is passed in to the call:
>>> ds = dendropy.DataSet.get_from_path("data.nex",
... "nexus",
... extract_comment_metadata=True)
In general, support for metadata in NEXUS and NEWICK formats is very basic and lossy, and is limited to a small
range of phylogenetic data types (taxa, trees, nodes, edges). These issues and limits are fundamental to the NEXUS
and NEWICK formats, and thus if metadata is important to you and your work, you should be working with NeXML
format. The NeXML format provides for rich, flexible and robust metadata annotation for the broad range of phylogenetic data, and DendroPy provides full support for metadata reading and writing in NeXML.
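To illustrate why comment-based metadata is lossy, here is a deliberately simple sketch of turning a “hot comment” into key/value pairs. The function parse_hot_comment is hypothetical and is not DendroPy's parser; it assumes comma-separated key=value items, optional quotes, and brace-delimited lists, and it keeps every value as a string because the comment itself carries no type information.

```python
def parse_hot_comment(comment):
    """Parse '[&k=v,...]' or '[&&k=v,...]' into a dict of strings and lists.

    Simplified illustration only: commas inside quoted values, nested
    braces, and escaped characters are not handled.
    """
    body = comment.strip()
    if not (body.startswith("[&") and body.endswith("]")):
        raise ValueError("not a metadata comment: %r" % comment)
    body = body[1:-1].lstrip("&")

    # Split on commas, except for commas inside {...} lists.
    items, depth, current = [], 0, ""
    for ch in body:
        if ch == "{":
            depth += 1
        elif ch == "}":
            depth -= 1
        if ch == "," and depth == 0:
            items.append(current)
            current = ""
        else:
            current += ch
    if current:
        items.append(current)

    result = {}
    for item in items:
        key, _, value = item.partition("=")
        value = value.strip()
        if value.startswith("{") and value.endswith("}"):
            value = [v.strip() for v in value[1:-1].split(",")]
        else:
            value = value.strip("'\"")
        result[key.strip()] = value
    return result


beast_style = parse_hot_comment("[&subject='Pythonidae']")
nhx_like = parse_hot_comment("[&&subject='Pythonidae']")
ranges = parse_hot_comment(
    "[&length_hpd95={0.01917252,0.06241567},"
    "length_quant_5_95={0.02461821,0.06197141}]")
print(beast_style, ranges["length_hpd95"])
```

Note that the numeric ranges come back as lists of strings: nothing in the comment says whether a value is a float, an integer, or text, and there is no namespace to say what "subject" means, which is part of what NeXML adds.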
6.2.2 Direct Composition with Literal Values
The add_new method of the annotations attribute allows for direct adding of metadata. This method has two
mandatory arguments, “name” and “value”:
>>> import dendropy
>>> tree = dendropy.Tree.get_from_path('pythonidae.mle.tree', 'nexus')
>>> tree = dendropy.Tree.get_from_path('examples/pythonidae.mle.nex', 'nexus')
>>> tree.annotations.add_new(
...     name="subject",
...     value="Python phylogenetics",
...     )
When printing the tree in NeXML, the metadata will be rendered as a “<meta>” tag child element of the associated
“<tree>” element:
<nex:nexml
version="0.9"
xsi:schemaLocation="http://www.nexml.org/2009"
xmlns:dendropy="http://packages.python.org/DendroPy/"
xmlns="http://www.nexml.org/2009"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xmlns:xml="http://www.w3.org/XML/1998/namespace"
xmlns:nex="http://www.nexml.org/2009"
>
.
.
.
<trees id="x4320340992" otus="x4320340552">
<tree id="x4320381904" label="0" xsi:type="nex:FloatTree">
<meta xsi:type="nex:LiteralMeta" property="dendropy:subject" content="Python phylogenetic
.
.
.
As can be seen, by default, the metadata property is mapped to the “dendropy” namespace (i.e.,
‘xmlns:dendropy="http://packages.python.org/DendroPy/"‘). This can be customized by using
the “name_prefix” and “namespace” arguments to the call to add_new:
>>> tree.annotations.add_new(
...     name="subject",
...     value="Python phylogenetics",
...     name_prefix="dc",
...     namespace="http://purl.org/dc/elements/1.1/",
...     )
This will result in the following NeXML fragment:
<nex:nexml
version="0.9"
xsi:schemaLocation="http://www.nexml.org/2009"
xmlns:dc="http://purl.org/dc/elements/1.1/"
xmlns="http://www.nexml.org/2009"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xmlns:xml="http://www.w3.org/XML/1998/namespace"
xmlns:nex="http://www.nexml.org/2009"
xmlns:dendropy="http://packages.python.org/DendroPy/"
>
.
.
.
<trees id="x4320340904" otus="x4320340464">
<tree id="x4320377872" label="0" xsi:type="nex:FloatTree">
<meta xsi:type="nex:LiteralMeta" property="dc:subject" content="Python phylogenetics" id=
.
.
.
Note that “name_prefix” and “namespace” must be specified together; that is, if one is specified, then
the other must be specified as well. For convenience, you can specify the name of the annotation with the name prefix
prepended by specifying “name_is_prefixed=True”, though the namespace must still be provided separately:
>>> tree.annotations.add_new(
...     name="dc:subject",
...     value="Python phylogenetics",
...     name_is_prefixed=True,
...     namespace="http://purl.org/dc/elements/1.1/",
...     )
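The naming rules just described can be summarized in a few lines of plain Python. The function resolve_name below is a hypothetical helper written for this illustration; in particular, raising ValueError when only one of the prefix and namespace is given is an assumption about how such a rule might be enforced, not a statement about what add_new does.

```python
DEFAULT_PREFIX = "dendropy"
DEFAULT_NAMESPACE = "http://packages.python.org/DendroPy/"


def resolve_name(name, name_prefix=None, namespace=None, name_is_prefixed=False):
    """Return (prefix, bare_name, namespace) following the rules in the text."""
    if name_is_prefixed:
        if ":" not in name:
            raise ValueError("expected a 'prefix:name' string, got %r" % name)
        # The prefix travels inside the name; the namespace is still separate.
        name_prefix, name = name.split(":", 1)
    if name_prefix is None and namespace is None:
        # Nothing specified: fall back to the default mapping.
        return DEFAULT_PREFIX, name, DEFAULT_NAMESPACE
    if name_prefix is None or namespace is None:
        # Assumed behavior for this sketch only.
        raise ValueError("name_prefix and namespace must be given together")
    return name_prefix, name, namespace


DC = "http://purl.org/dc/elements/1.1/"
default_case = resolve_name("subject")
explicit_case = resolve_name("subject", name_prefix="dc", namespace=DC)
prefixed_case = resolve_name("dc:subject", namespace=DC, name_is_prefixed=True)
print(default_case, explicit_case, prefixed_case)
```

The explicit and prefixed forms resolve to the same triple, which is the sense in which name_is_prefixed is only a convenience.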
For NeXML output, you can also specify a datatype:
>>> tree.annotations.add_new(
...     name="subject",
...     value="Python phylogenetics",
...     datatype_hint="xsd:string",
...     )
>>> tree.annotations.add_new(
...     name="answer",
...     value=42,
...     datatype_hint="xsd:integer",
...     )
When writing to NeXML, this will result in the following fragment:
<trees id="x4320340992" otus="x4320340552">
<tree id="x4320381968" label="0" xsi:type="nex:FloatTree">
<meta xsi:type="nex:LiteralMeta" property="dendropy:answer" content="42" datatype="xsd:intege
<meta xsi:type="nex:LiteralMeta" property="dendropy:subject" content="Python phylogenetics" d
You can also specify that the data should be interpreted as a source to be dereferenced in NeXML by passing in
annotate_as_reference=True. Note that this does not actually populate the contents of the annotation from
the source (unlike the dynamic attribute value binding discussed below), but just indicates that the contents of the
annotation should be interpreted differently by semantic readers. Thus, the following annotation:
>>> tree.annotations.add_new(
...     name="subject",
...     value="http://en.wikipedia.org/wiki/Pythonidae",
...     name_prefix="dc",
...     namespace="http://purl.org/dc/elements/1.1/",
...     annotate_as_reference=True,
...     )
will be rendered in NeXML as:
<meta xsi:type="nex:ResourceMeta" rel="dc:subject" href="http://en.wikipedia.org/wiki/Pythonidae" />
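The two renderings differ only in the element type and attribute names, which the following sketch makes explicit. The function render_meta is a hypothetical helper, not part of DendroPy, and it omits the id attribute that the real output carries; it uses xml.sax.saxutils.quoteattr from the standard library to quote attribute values.

```python
from xml.sax.saxutils import quoteattr


def render_meta(prefixed_name, value, annotate_as_reference=False,
                datatype_hint=None):
    """Render one annotation as a NeXML-style <meta> element string."""
    if annotate_as_reference:
        # A reference: the value is a location to follow, not literal content.
        return '<meta xsi:type="nex:ResourceMeta" rel=%s href=%s />' % (
            quoteattr(prefixed_name), quoteattr(str(value)))
    parts = [
        '<meta xsi:type="nex:LiteralMeta"',
        "property=%s" % quoteattr(prefixed_name),
        "content=%s" % quoteattr(str(value)),
    ]
    if datatype_hint is not None:
        parts.append("datatype=%s" % quoteattr(datatype_hint))
    return " ".join(parts) + " />"


as_reference = render_meta(
    "dc:subject", "http://en.wikipedia.org/wiki/Pythonidae",
    annotate_as_reference=True)
as_literal = render_meta("dendropy:answer", 42, datatype_hint="xsd:integer")
print(as_reference)
print(as_literal)
```

A literal annotation uses property and content, whereas a reference uses rel and href, so a semantic reader can tell a value from a pointer without inspecting the string itself.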
Sometimes, you may want to annotate an object with metadata, but do not want it to be printed or written out. Passing
the is_hidden=True argument will result in the annotation being suppressed in all output:
>>> tree.annotations.add_new(
...     name="subject",
...     value="Python phylogenetics",
...     name_prefix="dc",
...     namespace="http://purl.org/dc/elements/1.1/",
...     is_hidden=True,
...     )
The is_hidden attribute of an Annotation object can also be set directly:
>>> subject_annotations = tree.annotations.findall(name="citation")
>>> for a in subject_annotations:
...     a.is_hidden = True
6.2.3 Dynamically Binding Annotation Values to Object Attribute Values
In some cases, instead of “hard-wiring” in metadata for an object, you may want to write out metadata that takes its
value from the value of an attribute of the object. The add_bound_attribute method allows you to do this. This
method takes, as a minimum, a string specifying the name of an existing attribute to which the value of the annotation
will be dynamically bound.
For example:
#! /usr/bin/env python
import dendropy
import random
categories = {
    "A" : "N/A",
    "B" : "N/A",
    "C" : "N/A",
    "D" : "N/A",
    "E" : "N/A"
}
tree = dendropy.Tree.get_from_string(
        "(A,(B,(C,(D,E))));",
        "newick")
for taxon in tree.taxon_set:
    taxon.category = categories[taxon.label]
    taxon.annotations.add_bound_attribute("category")
for node in tree.postorder_node_iter():
    node.pop_size = None
    node.annotations.add_bound_attribute("pop_size")
for node in tree.postorder_node_iter():
    node.pop_size = random.randint(100, 10000)
    if node.taxon is not None:
        if node.pop_size >= 8000:
            node.taxon.category = "large"
        elif node.pop_size >= 6000:
            node.taxon.category = "medium"
        elif node.pop_size >= 4000:
            node.taxon.category = "small"
        elif node.pop_size >= 2000:
            node.taxon.category = "tiny"
print tree.as_string("nexml")
results in:
<?xml version="1.0" encoding="ISO-8859-1"?>
<nex:nexml
version="0.9"
xsi:schemaLocation="http://www.nexml.org/2009"
xmlns:dendropy="http://packages.python.org/DendroPy/"
xmlns="http://www.nexml.org/2009"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xmlns:xml="http://www.w3.org/XML/1998/namespace"
xmlns:nex="http://www.nexml.org/2009"
>
<otus id="x4320344648">
<otu id="x4320380112" label="A">
<meta xsi:type="nex:LiteralMeta" property="dendropy:category" content="tiny" id="meta4320
</otu>
<otu id="x4320380432" label="B">
<meta xsi:type="nex:LiteralMeta" property="dendropy:category" content="medium" id="meta43
</otu>
<otu id="x4320380752" label="C">
<meta xsi:type="nex:LiteralMeta" property="dendropy:category" content="N/A" id="meta43203
</otu>
<otu id="x4320381072" label="D">
<meta xsi:type="nex:LiteralMeta" property="dendropy:category" content="tiny" id="meta4320
</otu>
<otu id="x4320381264" label="E">
<meta xsi:type="nex:LiteralMeta" property="dendropy:category" content="tiny" id="meta4320
</otu>
</otus>
<trees id="x4320344560" otus="x4320344648">
<tree id="x4320379600" xsi:type="nex:FloatTree">
<node id="x4320379856">
<meta xsi:type="nex:LiteralMeta" property="dendropy:pop_size" content="5491" id="meta
</node>
<node id="x4320379984" otu="x4320380112">
<meta xsi:type="nex:LiteralMeta" property="dendropy:pop_size" content="2721" id="meta
</node>
<node id="x4320380176">
<meta xsi:type="nex:LiteralMeta" property="dendropy:pop_size" content="4627" id="meta
</node>
<node id="x4320380304" otu="x4320380432">
<meta xsi:type="nex:LiteralMeta" property="dendropy:pop_size" content="7202" id="meta
</node>
<node id="x4320380496">
<meta xsi:type="nex:LiteralMeta" property="dendropy:pop_size" content="5337" id="meta
</node>
<node id="x4320380624" otu="x4320380752">
<meta xsi:type="nex:LiteralMeta" property="dendropy:pop_size" content="1478" id="meta
</node>
<node id="x4320380816">
<meta xsi:type="nex:LiteralMeta" property="dendropy:pop_size" content="1539" id="meta
</node>
<node id="x4320380944" otu="x4320381072">
<meta xsi:type="nex:LiteralMeta" property="dendropy:pop_size" content="3457" id="meta
</node>
<node id="x4320381136" otu="x4320381264">
<meta xsi:type="nex:LiteralMeta" property="dendropy:pop_size" content="3895" id="meta
</node>
<rootedge id="x4320379920" target="x4320379856" />
<edge id="x4320380048" source="x4320379856" target="x4320379984" />
<edge id="x4320380240" source="x4320379856" target="x4320380176" />
<edge id="x4320380368" source="x4320380176" target="x4320380304" />
<edge id="x4320380560" source="x4320380176" target="x4320380496" />
<edge id="x4320380688" source="x4320380496" target="x4320380624" />
<edge id="x4320380880" source="x4320380496" target="x4320380816" />
<edge id="x4320381008" source="x4320380816" target="x4320380944" />
<edge id="x4320381200" source="x4320380816" target="x4320381136" />
</tree>
</trees>
</nex:nexml>
By default, the add_bound_attribute method uses the name of the attribute as the name of the annotation. The
“annotation_name” argument allows you to explicitly set the name of the annotation. In addition, the method call
also supports the other customization arguments of the add_new method: “datatype_hint”, “name_prefix”,
“namespace”, “name_is_prefixed”, “annotate_as_reference”, “is_hidden”, etc.:
>>> tree.source_uri = None
>>> tree.annotations.add_bound_attribute(
...     "source_uri",
...     annotation_name="dc:subject",
...     namespace="http://purl.org/dc/elements/1.1/",
...     annotate_as_reference=True)
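The essence of a bound annotation is that its value is looked up when it is needed rather than copied when it is created. The sketch below shows one way to get that behavior in plain Python; BoundAnnotation and Node are hypothetical classes written for this illustration, and the use of getattr is an assumption about a reasonable design, not a description of DendroPy's internals.

```python
class BoundAnnotation(object):
    """Hypothetical annotation whose value is read from its target on demand."""

    def __init__(self, target, attr_name, annotation_name=None):
        self.target = target
        self.attr_name = attr_name
        # By default the annotation takes the name of the attribute.
        if annotation_name is None:
            annotation_name = attr_name
        self.name = annotation_name

    @property
    def value(self):
        # Looked up at access time, so later changes to the target are seen.
        return getattr(self.target, self.attr_name)


class Node(object):
    """Minimal stand-in for any object that can carry attributes."""
    pass


node = Node()
node.pop_size = None
annotation = BoundAnnotation(node, "pop_size")
value_before = annotation.value

node.pop_size = 5491
value_after = annotation.value
print(annotation.name, value_before, value_after)
```

This mirrors the example above, where the annotations are bound while pop_size is still None and the values written out are the ones assigned afterwards.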
6.2.4 Adding Citation Metadata
You can add citation annotations using the add_citation method. This method takes at least one argument,
citation. This can be a string representing the citation as a BibTex record or a dictionary with BibTex fields as
keys and field content as values.
For example:
#! /usr/bin/env python
import dendropy
citation = """\
@article{HeathHH2012,
Author = {Tracy A. Heath and Mark T. Holder and John P. Huelsenbeck},
Doi = {10.1093/molbev/msr255},
Journal = {Molecular Biology and Evolution},
Number = {3},
Pages = {939-955},
Title = {A {Dirichlet} Process Prior for Estimating Lineage-Specific Substitution Rates.},
Url = {http://mbe.oxfordjournals.org/content/early/2011/11/04/molbev.msr255.abstract},
Volume = {29},
Year = {2012}
}
"""
dataset = dendropy.DataSet.get_from_string(
"(A,(B,(C,(D,E))));",
"newick")
dataset.annotations.add_citation(citation)
print dataset.as_string("nexml")
will result in:
<?xml version="1.0" encoding="ISO-8859-1"?>
<nex:nexml
version="0.9"
xsi:schemaLocation="http://www.nexml.org/2009"
xmlns:bibtex="http://www.edutella.org/bibtex#"
xmlns="http://www.nexml.org/2009"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xmlns:xml="http://www.w3.org/XML/1998/namespace"
xmlns:nex="http://www.nexml.org/2009"
xmlns:dendropy="http://packages.python.org/DendroPy/"
>
<meta xsi:type="nex:LiteralMeta" property="bibtex:journal" content="Molecular Biology and Evoluti
<meta xsi:type="nex:LiteralMeta" property="bibtex:bibtype" content="article" datatype="xsd:string
<meta xsi:type="nex:LiteralMeta" property="bibtex:number" content="3" datatype="xsd:string" id="m
<meta xsi:type="nex:LiteralMeta" property="bibtex:citekey" content="HeathHH2012" datatype="xsd:st
<meta xsi:type="nex:LiteralMeta" property="bibtex:pages" content="939-955" datatype="xsd:string"
<meta xsi:type="nex:LiteralMeta" property="bibtex:volume" content="29" datatype="xsd:string" id="
<meta xsi:type="nex:LiteralMeta" property="bibtex:year" content="2012" datatype="xsd:string" id="
<meta xsi:type="nex:LiteralMeta" property="bibtex:doi" content="10.1093/molbev/msr255" datatype="
<meta xsi:type="nex:LiteralMeta" property="bibtex:title" content="A {Dirichlet} Process Prior for
<meta xsi:type="nex:LiteralMeta" property="bibtex:url" content="http://mbe.oxfordjournals.org/con
<meta xsi:type="nex:LiteralMeta" property="bibtex:author" content="Tracy A. Heath and Mark T. Hol
.
.
.
The following results in the same output as above, but the citation is given as a dictionary with BibTex fields as keys
and content as values:
#! /usr/bin/env python
import dendropy
citation = {
"BibType": "article",
"Author": "Tracy A. Heath and Mark T. Holder and John P. Huelsenbeck",
"Doi": "10.1093/molbev/msr255",
"Journal": "Molecular Biology and Evolution",
"Number": "3",
"Pages": "939-955",
"Title": "A {Dirichlet} Process Prior for Estimating Lineage-Specific Substitution Rates.",
"Url": "http://mbe.oxfordjournals.org/content/early/2011/11/04/molbev.msr255.abstract",
"Volume": "29",
"Year": "2012",
}
dataset = dendropy.DataSet.get_from_string(
"(A,(B,(C,(D,E))));",
"newick")
dataset.annotations.add_citation(citation)
print dataset.as_string("nexml")
By default, the citation gets annotated as a series of separate BibTex elements. You can specify alternate formats by
using the “store_as” argument. This argument can take one of the following values:
• “bibtex” Each BibTex field gets recorded as a separate annotation, with the name given by the field name and the
content by the field value. This is the default, and the results in NeXML are shown above.
• “dublin” A subset of the BibTex fields gets recorded as a set of Dublin Core annotations, one per field:
<meta xsi:type="nex:LiteralMeta" property="dc:date" content="2012" datatype="xsd:string" id=
<meta xsi:type="nex:LiteralMeta" property="dc:publisher" content="Molecular Biology and Evol
<meta xsi:type="nex:LiteralMeta" property="dc:title" content="A {Dirichlet} Process Prior fo
<meta xsi:type="nex:LiteralMeta" property="dc:creator" content="Tracy A. Heath and Mark T. H
• “prism” A subset of the BibTex fields gets recorded as a set of PRISM (Publishing Requirements for Industry
Standard Metadata) annotations, one per field:
<meta xsi:type="nex:LiteralMeta" property="prism:volume" content="29" datatype="xsd:string"
<meta xsi:type="nex:LiteralMeta" property="prism:pageRange" content="939-955" datatype="xsd:
<meta xsi:type="nex:LiteralMeta" property="prism:publicationDate" content="2012" datatype="x
<meta xsi:type="nex:LiteralMeta" property="prism:publicationName" content="Molecular Biology
In addition, the method call also supports some of the other customization arguments of the add_new method:
“name_prefix”, “namespace”, “name_is_prefixed”, “is_hidden”.
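The relationship between the BibTex fields and the “dublin” and “prism” annotations can be pictured as a simple field-to-property lookup. The following is only a rough sketch: the two mapping tables are read off the NeXML fragments shown above, and the names DUBLIN_CORE_MAP, PRISM_MAP and map_citation are hypothetical helpers, not part of DendroPy. In DendroPy itself the choice is made with the “store_as” argument of add_citation.

```python
# Illustrative stand-in only: the field-to-property pairs below are read off
# the NeXML output shown above, not taken from DendroPy's source.
DUBLIN_CORE_MAP = {
    "Year": "dc:date",
    "Journal": "dc:publisher",
    "Title": "dc:title",
    "Author": "dc:creator",
}
PRISM_MAP = {
    "Volume": "prism:volume",
    "Pages": "prism:pageRange",
    "Year": "prism:publicationDate",
    "Journal": "prism:publicationName",
}

def map_citation(citation, field_map):
    """Return {property: content} for the BibTex fields present in field_map."""
    return dict(
        (field_map[field], content)
        for field, content in citation.items()
        if field in field_map
    )

citation = {
    "BibType": "article",
    "Journal": "Molecular Biology and Evolution",
    "Pages": "939-955",
    "Volume": "29",
    "Year": "2012",
}
dublin = map_citation(citation, DUBLIN_CORE_MAP)
prism = map_citation(citation, PRISM_MAP)
print(sorted(dublin.items()))
print(sorted(prism.items()))
```

Fields without an entry in the chosen table (here, “BibType”) are simply left out, which matches the “subset of the BibTex fields” wording above.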
6.2.5 Copying Metadata Annotations from One Phylogenetic Data Object to Another
As the AnnotationSet is derived from dendropy.utility.containers.OrderedSet, it has the
dendropy.utility.containers.OrderedSet.add and dendropy.utility.containers.OrderedSet.update
methods available for direct addition of Annotation objects. The following example shows how to add metadata
annotations associated with a DataSet object to all its Tree objects:
import dendropy
ds = dendropy.DataSet.get_from_path("sample1.xml",
        "nexml")
ds_annotes = ds.annotations.findall(name_prefix="dc")
for tree_list in ds.tree_lists:
    for tree in tree_list:
        tree.annotations.update(ds_annotes)
Or, alternatively:
import dendropy
ds = dendropy.DataSet.get_from_path("sample1.xml",
        "nexml")
ds_annotes = ds.annotations.findall(name_prefix="dc")
for tree_list in ds.tree_lists:
    for tree in tree_list:
        for a in ds_annotes:
            tree.annotations.add(a)
6.3 Metadata Annotation Access and Manipulation
6.3.1 Iterating Over Collections of Annotations
The collection of Annotation objects representing metadata annotations associated with particular phylogenetic
data objects can be accessed through the annotations attribute of each particular object.
For example:
#! /usr/bin/env python
import dendropy
ds = dendropy.DataSet.get_from_path("pythonidae.annotated.nexml",
        "nexml")
for a in ds.annotations:
    print "The dataset has metadata annotation '%s' with content '%s'" % (a.name, a.value)
tree = ds.tree_lists[0][0]
for a in tree.annotations:
    print "Tree '%s' has metadata annotation '%s' with content '%s'" % (tree.label, a.name, a.value)
will result in:
The dataset has metadata annotation ’description’ with content ’composite dataset of Pythonid sequenc
The dataset has metadata annotation ’subject’ with content ’Pythonidae’
Tree ’0’ has metadata annotation ’treeEstimator’ with content ’RAxML’
Tree ’0’ has metadata annotation ’substitutionModel’ with content ’GTR+G+I’
6.3.2 Retrieving Annotations By Search Criteria
Instead of iterating through every element in the annotations attribute of data objects, you can use the findall
method of the annotations object to return a collection of Annotation objects that match the search or filter
criteria specified in keyword arguments to the findall call. These keyword arguments should specify attributes of
Annotation and the corresponding value to be matched. Multiple keyword-value pairs can be specified, and only
Annotation objects that match all the criteria will be returned.
For example, the following returns a collection of annotations that have a name of “contributor”:
import dendropy
ds = dendropy.DataSet.get_from_path("sample1.xml",
        "nexml")
results = ds.annotations.findall(name="contributor")
for a in results:
    print "%s='%s'" % (a.name, a.value)
and will result in:
contributor=’Dahlgren T.G.’
contributor=’Baco A.’
contributor=’Smith C.’
contributor=’Glover A.’
contributor=’Altamira I.V.’
contributor=’Wiklund H.’
While the following returns a collection of annotations that are in the Dublin Core namespace:
import dendropy
ds = dendropy.DataSet.get_from_path("sample1.xml",
        "nexml")
results = ds.annotations.findall(namespace="http://purl.org/dc/elements/1.1/")
for a in results:
    print "%s='%s'" % (a.name, a.value)
and results in:
subject=’wood-fall’
contributor=’Wiklund H.’
publisher=’Systematics and Biodiversity’
subject=’whale-fall’
contributor=’Dahlgren T.G.’
contributor=’Smith C.’
date=’2012-06-04’
subject=’polychaeta’
contributor=’Glover A.’
subject=’Ophryotrocha’
title=’Systematics and biodiversity of Ophryotrocha (Annelida, Dorvilleidae) with descriptions of six
subject=’New species’
subject=’molecular phylogeny’
contributor=’Altamira I.V.’
creator=’Wiklund H., Altamira I.V., Glover A., Smith C., Baco A., & Dahlgren T.G.’
contributor=’Baco A.’
The following, in turn, searches for and suppresses printing of annotations that have a name prefix of “dc” and have
empty values:
import dendropy
ds = dendropy.DataSet.get_from_path("sample1.xml",
        "nexml")
results = ds.annotations.findall(name_prefix="dc", value="")
for a in results:
    a.is_hidden = True
Modifying the Annotation objects in a returned collection modifies the metadata of the parent data object. For
example, the following sets all the field values to upper case characters:
import dendropy
ds = dendropy.DataSet.get_from_path("sample1.xml",
        "nexml")
results = ds.annotations.findall(name="contributor")
for a in results:
    a.value = a.value.upper()
results = ds.annotations.findall(name="contributor")
for a in results:
    print a.value
and results in:
DAHLGREN T.G.
BACO A.
SMITH C.
GLOVER A.
ALTAMIRA I.V.
WIKLUND H.
The collection returned by the findall method is an object of type AnnotationSet. However, while modifying Annotation objects in this collection will result in the metadata of the parent object being modified (as in the
previous example), adding new annotations to this returned collection will not add them to the collection of metadata annotations of the parent object. Thus, the following example shows that the size of the annotations collection
associated with the dataset is unchanged by adding new annotations to the results of a findall call:
import dendropy
ds = dendropy.DataSet.get_from_path("sample1.xml",
"nexml")
print len(ds.annotations)
results = ds.annotations.findall(namespace="http://purl.org/dc/elements/1.1/")
results.add_new(name="color", value="blue")
results.add_new(name="height", value="100")
results.add_new(name="length", value="200")
results.add_new(name="width", value="50")
print len(ds.annotations)
The above produces:
30
30
As can be seen, no new annotations are added to the data set metadata.
If no matching Annotation objects are found then the AnnotationSet that is returned is empty.
If no keyword arguments are passed to findall, then all annotations are returned:
import dendropy
ds = dendropy.DataSet.get_from_path("sample1.xml",
"nexml")
results = ds.annotations.findall()
print len(results) == len(ds.annotations)
The above produces:
True
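The rules described in this section (every criterion must match, no criteria matches everything, and the result is a separate collection that shares its member objects with the parent) can be mimicked in a few lines of plain Python. This is only a toy model: FakeAnnotation and find_all are hypothetical names invented for illustration, and nothing here is DendroPy code.

```python
# Toy model of the behaviour described in this section; FakeAnnotation and
# find_all are hypothetical names, not DendroPy classes or functions.
class FakeAnnotation(object):
    def __init__(self, name, value, name_prefix):
        self.name = name
        self.value = value
        self.name_prefix = name_prefix

def find_all(annotations, **criteria):
    """Return a new list of the annotations matching every keyword criterion."""
    return [
        a for a in annotations
        if all(getattr(a, attr) == wanted for attr, wanted in criteria.items())
    ]

parent = [
    FakeAnnotation("contributor", "Baco A.", "dc"),
    FakeAnnotation("contributor", "Smith C.", "dc"),
    FakeAnnotation("volume", "", "prism"),
]

results = find_all(parent, name="contributor", name_prefix="dc")
results[0].value = results[0].value.upper()            # shared object: parent sees this
results.append(FakeAnnotation("color", "blue", "dc"))  # new list: parent does not

print(parent[0].value)
print(len(parent))
print(len(find_all(parent)) == len(parent))
```

The design point is that the returned collection is new, while the objects inside it are not copies. This is why editing a returned annotation changes the parent's metadata, but adding to the returned collection does not.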
6.3.3 Retrieving a Single Annotation By Search Criteria
The find method of the annotations object returns the first Annotation object that matches the search or
filter criteria specified in keyword arguments to the find call. These keyword arguments should specify attributes
of Annotation and the corresponding value to be matched. Multiple keyword-value pairs can be specified, and only
the first Annotation object that matches all the criteria will be returned.
For example:
import dendropy
ds = dendropy.DataSet.get_from_path("sample1.xml",
"nexml")
print ds.annotations.find(name="contributor")
and will result in:
contributor=’Dahlgren T.G.’
While the following returns the first annotation in the Dublin Core namespace:
import dendropy
ds = dendropy.DataSet.get_from_path("sample1.xml",
"nexml")
print ds.annotations.find(namespace="http://purl.org/dc/elements/1.1/")
and results in:
subject=’wood-fall’
If no matching Annotation objects are found then a default of None is returned:
>>> print ds.annotations.find(name="author")
None
Unlike findall, it is invalid to call find with no search criteria keyword arguments, and a TypeError exception
will be raised.
6.3.4 Retrieving the Value of a Single Annotation
For convenience, the get_value method is provided. This will search the AnnotationSet for the first
Annotation that has its name field equal to the first argument passed to the get_value method, and return its
value. If no match is found, the second argument is returned (or None, if no second argument is specified). Examples:
>>> print tree.annotations.get_value("subject")
molecular phylogeny
>>> print tree.annotations.get_value("creator")
Yoder A.D., & Yang Z.
>>> print tree.annotations.get_value("generator")
None
>>> print tree.annotations.get_value("generator", "unspecified")
unspecified
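The lookup rule (first match by name, otherwise the supplied default) is the same pattern as dict.get, except that names may repeat and only the first occurrence counts. A minimal sketch of that rule follows. It uses (name, value) tuples in place of Annotation objects, the function below is a hypothetical stand-in rather than the DendroPy method, and the duplicate “subject” entry is made up for illustration.

```python
# Hypothetical stand-in for the lookup described above; not DendroPy code.
def get_value(annotations, name, default=None):
    """Return the value of the first (name, value) pair whose name matches."""
    for a_name, a_value in annotations:
        if a_name == name:
            return a_value
    return default

tree_annotations = [
    ("subject", "molecular phylogeny"),
    ("creator", "Yoder A.D., & Yang Z."),
    ("subject", "a later duplicate (made up for this sketch)"),
]
print(get_value(tree_annotations, "subject"))
print(get_value(tree_annotations, "generator"))
print(get_value(tree_annotations, "generator", "unspecified"))
```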
6.3.5 Transforming Annotations to a Dictionary
In some applications, it might be more convenient to work with dictionaries rather than AnnotationSet objects.
The values_as_dict method creates a dictionary populated with key-value pairs from the collection. By default, the keys are the name attribute of the Annotation object and the values are the value attribute. Thus, the
following:
import dendropy
ds = dendropy.DataSet.get_from_path("sample1.xml",
"nexml")
a = ds.annotations.values_as_dict()
print a
results in:
{’volume’: ’’,
’doi’: ’’,
’date’: ’2012-06-04’,
’bibliographicCitation’: ’Wiklund H., Altamira I.V., Glover A., Smith C., Baco A., & Dahlgren T.G. 20
’changeNote’: ’Generated on Wed Jun 06 11:02:45 EDT 2012’,
’creator’: ’Wiklund H., Altamira I.V., Glover A., Smith C., Baco A., & Dahlgren T.G.’,
’section’: ’Study’,
’title’: ’Systematics and biodiversity of Ophryotrocha (Annelida, Dorvilleidae) with descriptions of
’publisher’: ’Systematics and Biodiversity’,
’identifier.study.tb1’: None,
’number’: ’’,
’identifier.study’: ’12713’,
’modificationDate’: ’2012-06-04’,
’historyNote’: ’Mapped from TreeBASE schema using org.cipres.treebase.domain.nexus.nexml.NexmlDocumen
’publicationDate’: ’2012’,
’contributor’: ’Wiklund H.’,
’publicationName’: ’Systematics and Biodiversity’,
’creationDate’: ’2012-05-09’,
’title.study’: ’Systematics and biodiversity of Ophryotrocha (Annelida, Dorvilleidae) with descriptio
’subject’: ’molecular phylogeny’}
Note that no attempt is made to prevent or account for key collision: Annotation objects with the same name value will
overwrite each other in the dictionary. Custom control of the dictionary key/value generation can be specified via
keyword arguments:
key_attr String specifying an Annotation object attribute name to be used as keys for the dictionary.
key_func Function that takes an Annotation object as an argument and returns the value to be used as
a key for the dictionary.
value_attr String specifying an Annotation object attribute name to be used as values for the dictionary.
value_func Function that takes an Annotation object as an argument and returns the value to be used
as a value for the dictionary.
For example:
import dendropy
ds = dendropy.DataSet.get_from_path("sample1.xml",
"nexml")
a = ds.annotations.values_as_dict(key_attr="prefixed_name")
a = ds.annotations.values_as_dict(key_attr="prefixed_name", value_attr="namespace")
a = ds.annotations.values_as_dict(key_func=lambda a: a.namespace + a.name)
a = ds.annotations.values_as_dict(key_func=lambda a: a.namespace + a.name,
value_attr="value")
As the collection returned by the findall method is an object of type AnnotationSet, this can also be transformed to a dictionary. For example:
import dendropy
ds = dendropy.DataSet.get_from_path("sample1.xml",
"nexml")
a = ds.annotations.findall(name_prefix="dc").values_as_dict()
print a
will result in:
{’publisher’: ’Systematics and Biodiversity’,
’creator’: ’Wiklund H., Altamira I.V., Glover A., Smith C., Baco A., & Dahlgren T.G.’,
’title’: ’Systematics and biodiversity of Ophryotrocha (Annelida, Dorvilleidae) with descriptions of
’date’: ’2012-06-04’,
’contributor’: ’Baco A.’,
’subject’: ’molecular phylogeny’}
Note how only one entry for “contributor” is present: the others were overwritten/replaced.
Adding to, deleting, or modifying either the keys or the values of the dictionary returned by values_as_dict in
no way changes any of the original metadata: it serves as a snapshot copy of the literal values of the metadata.
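The key collision described in this section is ordinary dictionary behaviour, and it can be reproduced without DendroPy. The sketch below is a toy model under stated assumptions: the named tuples stand in for Annotation objects, and as_dict is a hypothetical helper that imitates the key_func/value_func idea, not the values_as_dict method itself. It also shows one possible way to keep repeated values, by grouping them into lists.

```python
# Toy model of key collision; the tuples stand in for Annotation objects and
# as_dict is a hypothetical helper, not the DendroPy method itself.
from collections import namedtuple

Ann = namedtuple("Ann", ["name", "value", "name_prefix"])

annotations = [
    Ann("contributor", "Wiklund H.", "dc"),
    Ann("contributor", "Baco A.", "dc"),
    Ann("publisher", "Systematics and Biodiversity", "dc"),
]

def as_dict(anns, key_func=lambda a: a.name, value_func=lambda a: a.value):
    """Build a plain dict; later entries silently replace earlier ones."""
    result = {}
    for a in anns:
        result[key_func(a)] = value_func(a)
    return result

by_name = as_dict(annotations)
print(by_name["contributor"])

# One way to keep every value: group values into lists under each name.
grouped = {}
for a in annotations:
    grouped.setdefault(a.name, []).append(a.value)
print(grouped["contributor"])
```

If repeated names matter in your data (as with “contributor” here), iterating over the result of findall is usually safer than relying on a dictionary view.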
6.3.6 Deleting or Removing Metadata Annotations
The drop method of AnnotationSet objects takes search criteria similar to findall, but instead of returning
the matched Annotation objects, it removes them from the parent collection. For example, the following removes
all metadata annotations with the name prefix “dc” from the DataSet object ds:
import dendropy
ds = dendropy.DataSet.get_from_path("sample1.xml",
"nexml")
print "Original: %d items" % len(ds.annotations)
removed = ds.annotations.drop(name_prefix="dc")
print "Removed: %d items" % len(removed)
print "Current: %d items" % len(ds.annotations)
and results in:
Original: 30 items
Removed: 16 items
Current: 14 items
As can be seen, the drop method returns the Annotation objects removed as a new AnnotationSet collection. This is useful if you still want to use the removed Annotation objects elsewhere.
As with the findall method, multiple keyword criteria can be specified:
import dendropy
ds = dendropy.DataSet.get_from_path("sample1.xml",
"nexml")
ds.annotations.drop(name_prefix="dc", name="contributor")
In addition, again similar in behavior to the findall method, no keyword arguments result in all the annotations
being removed. Thus, the following results in all metadata annotations being deleted from the DataSet object ds:
import dendropy
ds = dendropy.DataSet.get_from_path("sample1.xml",
"nexml")
print "Original: %d items" % len(ds.annotations)
removed = ds.annotations.drop()
print "Removed: %d items" % len(removed)
print "Current: %d items" % len(ds.annotations)
and results in:
Original: 30 items
Removed: 30 items
Current: 0 items
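The “remove the matches and hand them back” contract of drop can be sketched in plain Python as follows. This is a hypothetical model only: plain dictionaries stand in for Annotation objects, and the drop function below is written for illustration rather than taken from DendroPy.

```python
# Hypothetical model of "remove the matches and hand them back"; plain dicts
# stand in for Annotation objects and drop() is not the DendroPy method.
def drop(annotations, **criteria):
    """Remove matching items from `annotations` in place and return them."""
    removed = [
        a for a in annotations
        if all(a.get(key) == wanted for key, wanted in criteria.items())
    ]
    removed_ids = set(id(a) for a in removed)
    annotations[:] = [a for a in annotations if id(a) not in removed_ids]
    return removed

annotations = [
    {"name": "contributor", "name_prefix": "dc"},
    {"name": "title", "name_prefix": "dc"},
    {"name": "volume", "name_prefix": "prism"},
]
removed = drop(annotations, name_prefix="dc")
print("Removed: %d items, current: %d items" % (len(removed), len(annotations)))
remaining = drop(annotations)   # no criteria: everything matches
print("Removed: %d items, current: %d items" % (len(remaining), len(annotations)))
```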
6.4 Writing or Saving Metadata
When writing to NeXML format, all metadata annotations are preserved and can be fully round-tripped. Currently,
this is the only data format that allows for robust treatment of metadata.
Due to the fundamental limitations of the NEXUS/Newick format, metadata handling in this format is limited and
rather idiosyncratic. Currently, metadata will be written out as name-value pairs (separated by “=”) in ampersand-prepended comments associated with the particular phylogenetic data object. This syntax corresponds to the BEAST
or FigTree style of metadata annotation. However, this association might not be preserved. For example, metadata
annotations associated with edges and nodes of trees will be written out fully in NEXUS and NEWICK formats, but
when read in again will all be associated with nodes. The keyword argument annotations_as_nhx=True passed
to the call to write the data in NEXUS/NEWICK format will result in a double ampersand prefix to the comment, thus
(partially) conforming to NHX specifications. Metadata associated with DataSet objects will be written out in
the same BEAST/FigTree/NHX syntax at the top of the file, while metadata associated with TaxonSet and Taxon
objects will be written out immediately after the start of the Taxa Block and taxon labels respectively. This is very
fragile: for example, a metadata annotation before a taxon label will be associated with the previous taxon when being
read in again. As noted above, if metadata annotations are important to you, your workflow, or your task, then
the NeXML format should be used rather than NEXUS or NEWICK.
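To give a feel for what such comments look like, the sketch below renders name-value pairs in the single-ampersand and double-ampersand styles mentioned above. It is an approximation under stated assumptions: the comma used to separate pairs is borrowed from BEAST/FigTree-style files and is not confirmed by this tutorial, no quoting or escaping of values is attempted, and format_comment is a hypothetical helper rather than a DendroPy function.

```python
# Sketch of the comment syntax described above. The comma separator between
# pairs is an assumption borrowed from BEAST/FigTree-style files, and
# format_comment is a hypothetical helper rather than a DendroPy function.
def format_comment(pairs, as_nhx=False):
    """Render (name, value) pairs as a bracketed metadata comment."""
    prefix = "&&" if as_nhx else "&"
    body = ",".join("%s=%s" % (name, value) for name, value in pairs)
    return "[%s%s]" % (prefix, body)

pairs = [("treeEstimator", "RAxML"), ("substitutionModel", "GTR+G+I")]
print(format_comment(pairs))
print(format_comment(pairs, as_nhx=True))
```

Because a comment carries no explicit pointer to the object it describes, a reader can only infer the association from its position in the file, which is the source of the fragility noted above.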
CHAPTER 7
Interoperability with Other Programs, Libraries and Applications
7.1 Working with GenBank Molecular Sequence Databases
The genbank module provides the classes and methods to download sequences from GenBank and instantiate them
into DendroPy phylogenetic data objects. Three classes are provided, all of which have an identical interface, varying
only in the type of data retrieved:
GenBankDna
Acquire and manage DNA sequence data from the GenBank Nucleotide database.
GenBankRna
Acquire and manage RNA sequence data from the GenBank Nucleotide database.
GenBankProtein
Acquire and manage AA sequence data from the GenBank Protein database.
7.1.1 Quick Start
The basic way to retrieve sequence data is to create a GenBankDna, GenBankRna, or GenBankProtein object, and
pass in a list of identifiers to be retrieved using the “ids” argument. The value of this argument should be a container
with either GenBank accession identifiers or GI numbers:
>>> from dendropy.interop import genbank
>>> gb_dna = genbank.GenBankDna(ids=['EU105474', 'EU105475'])
>>> for gb in gb_dna:
...     print gb
gi|158930545|gb|EU105474.1| Homo sapiens Ache non-coding region T864 genomic sequence
gi|158930546|gb|EU105475.1| Homo sapiens Arara non-coding region T864 genomic sequence
The records are stored as GenBankAccessionRecord objects. These records store the full information available in a GenBank record, including the references, feature table, qualifiers, and other details, and these are available as attributes of the GenBankAccessionRecord objects (e.g., “primary_accession”, “taxonomy”,
“feature_table” and so on).
To generate a CharacterMatrix object from the collection of sequences, call the generate_char_matrix
method:
>>> from dendropy.interop import genbank
>>> gb_dna = genbank.GenBankDna(ids=['EU105474', 'EU105475'])
>>> char_matrix = gb_dna.generate_char_matrix()
>>> print char_matrix.as_string("nexus")
#NEXUS
BEGIN TAXA;
DIMENSIONS NTAX=2;
TAXLABELS
EU105474
EU105475
;
END;
BEGIN CHARACTERS;
DIMENSIONS NCHAR=494;
FORMAT DATATYPE=DNA GAP=- MISSING=? MATCHCHAR=.;
MATRIX
EU105474    TCTCTTATCA...
EU105475    TCTCTTATCA...
;
END;
As can be seen, by default the taxon labels assigned to the sequences are set to the identifier used to request the sequences. This, and many other aspects of the character matrix generation, including annotation of taxa and sequences,
can be customized, as discussed in detail below.
7.1.2 Acquiring Data from GenBank
The GenBankDna, GenBankRna, and GenBankProtein classes provide for the downloading and management
of DNA, RNA, and protein (AA) sequences from GenBank. The first two of these query the “nucleotide” or “nuccore” database, while the last queries the “protein” database. The constructors of these classes accept the following
arguments:
ids A list of accession identifiers or GI numbers of the records to be downloaded. E.g.
“ids=['EU105474', 'EU105475']”, “ids=['158930545', 'EU105475']”,
or “ids=['158930545', '158930546']”. If “prefix” is specified, this string will
be prepended to all values in the list.
id_range A tuple of integers that specify the first and last values (inclusive) of accession or GI numbers of the records to be downloaded. If “prefix” is specified, this string will be prepended to all
numbers in this range. Thus specifying “id_range=(158930545, 158930550)” is exactly
equivalent to specifying “ids=[158930545, 158930546, 158930547, 158930548,
158930549, 158930550]”, while specifying “id_range=(105474, 105479),
prefix="EU"” is exactly equivalent to specifying “ids=["EU105474", "EU105475",
"EU105476", "EU105477", "EU105478", "EU105479"]”.
prefix This string will be prepended to all values resulting from the “ids” and “id_range”.
verify By default, the results of the download are checked to make sure there is a one-to-one correspondence between requested id’s and retrieved records. Setting “verify=False” skips this
checking.
So, for example, the following are all different ways of instantiating a GenBank resource data store:
>>> from dendropy.interop import genbank
>>> gb_dna = genbank.GenBankDna(ids=['EU105474', 'EU105475'])
>>> gb_dna = genbank.GenBankDna(ids=['158930545', 'EU105475'])
>>> gb_dna = genbank.GenBankDna(ids=['158930545', '158930546'])
>>> gb_dna = genbank.GenBankDna(ids=['105474', '105475'], prefix="EU")
>>> gb_dna = genbank.GenBankDna(id_range=(105474, 105478), prefix="EU")
>>> gb_dna = genbank.GenBankDna(id_range=(158930545, 158930546))
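The equivalences between “ids”, “id_range” and “prefix” can be checked offline with a small helper that expands the arguments into the identifiers they imply. The function expand_ids is a hypothetical illustration of the documented rules, not how DendroPy builds its queries. For simplicity it converts every identifier to a string.

```python
# Hypothetical helper mirroring the documented equivalences; not DendroPy code.
def expand_ids(ids=None, id_range=None, prefix=None):
    """Return the list of identifier strings implied by the arguments."""
    values = list(ids) if ids is not None else []
    if id_range is not None:
        first, last = id_range
        values.extend(range(first, last + 1))   # both endpoints are included
    prefix = prefix or ""
    return [prefix + str(v) for v in values]

print(expand_ids(id_range=(105474, 105479), prefix="EU"))
print(expand_ids(ids=["105474", "105475"], prefix="EU"))
print(expand_ids(id_range=(158930545, 158930546)))
```

Note the inclusive upper bound: “id_range=(105474, 105479)” covers six records, not five.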
You can add more records to an existing instance of GenBankDna, GenBankRna, or GenBankProtein objects
by using the “acquire” or “acquire_range” methods. The “acquire” method takes a sequence of accession
identifiers or GI numbers for the first argument (“ids”), and, in addition an optional string prefix to be prepended
can be supplied using the second argument, “prefix”, while verification can be disabled by specifying False for the
third argument, “verify”. The “acquire_range” method takes two mandatory integer arguments: the first and
last value of the range of accession or GI numbers of the records to be downloaded. As with the other method, a string
prefix to be prepended can be optionally supplied using the argument “prefix”, while verification can be disabled
by specifying “verify=False”. For example:
>>> from dendropy.interop import genbank
>>> gb_dna = genbank.GenBankDna(['EU105474', 'EU105475'])
>>> print len(gb_dna)
2
>>> gb_dna.acquire([158930547, 158930548])
>>> print len(gb_dna)
4
>>> gb_dna.acquire_range(105479, 105480, prefix="EU")
>>> print len(gb_dna)
6
7.1.3 Accessing GenBank Records
The GenBank records accumulated in GenBankDna, GenBankRna, and GenBankProtein objects are represented by collections of GenBankAccessionRecord objects. Each of these GenBankAccessionRecord
objects represents the full information from the GenBank source as a rich Python object.
>>> from dendropy.interop import genbank
>>> gb_dna = genbank.GenBankDna(['EU105474', 'EU105475'])
>>> for gb_rec in gb_dna:
...     print gb_rec.gi
...     print gb_rec.locus
...     print gb_rec.length
...     print gb_rec.moltype
...     print gb_rec.topology
...     print gb_rec.strandedness
...     print gb_rec.division
...     print gb_rec.update_date
...     print gb_rec.create_date
...     print gb_rec.definition
...     print gb_rec.primary_accession
...     print gb_rec.accession_version
...     print "(other seq ids)"
...     for osi_key, osi_value in gb_rec.other_seq_ids.items():
...         print "    ", osi_key, osi_value
...     print gb_rec.source
...     print gb_rec.organism
...     print gb_rec.taxonomy
...     print "(references)"
...     for ref in gb_rec.references:
...         print "    ", ref.number , ref.position , ref.authors , ref.consrtm , ref.title , ref.jour
...     print "(feature_table)"
...     for feature in gb_rec.feature_table:
...         print "    ", feature.key, feature.location
...         for qualifier in feature.qualifiers:
...             print "        ", qualifier.name, qualifier.value
...
158930545
EU105474
494
DNA
linear
double
PRI
27-NOV-2007
27-NOV-2007
Homo sapiens Ache non-coding region T864 genomic sequence
EU105474
EU105474.1
(other seq ids)
gb EU105474.1
gi 158930545
Homo sapiens (human)
Homo sapiens
Eukaryota; Metazoa; Chordata; Craniata; Vertebrata; Eutel...
(references)
1 1..494 [] None Statistical evaluation of alternativ...
2 1..494 [] None Direct Submission Submitted (17-AUG-...
(feature_table)
source 1..494
organism Homo sapiens
mol_type genomic DNA
db_xref taxon:9606
chromosome 18
note Ache
misc_feature 1..494
note non-coding region T864
.
.
.
(etc.)
7.1.4 Generating Character Matrix Objects from GenBank Data
The “generate_char_matrix()” method of GenBankDna, GenBankRna, and GenBankProtein objects
creates and returns a CharacterMatrix object of the appropriate type out of the data collected in them. When
called without any arguments, it generates a new TaxonSet block, creating one new Taxon object for every sequence in the collection with a label corresponding to the identifier used to request the sequence:
>>> from dendropy.interop import genbank
>>> gb_dna = genbank.GenBankDna(ids=[158930545, 'EU105475'])
>>> char_matrix = gb_dna.generate_char_matrix()
>>> print char_matrix.as_string("nexus")
#NEXUS
BEGIN TAXA;
DIMENSIONS NTAX=2;
TAXLABELS
158930545
EU105475
;
END;
BEGIN CHARACTERS;
DIMENSIONS NCHAR=494;
FORMAT DATATYPE=DNA GAP=- MISSING=? MATCHCHAR=.;
MATRIX
158930545    TCTCTTATCAAACTA...
EU105475     TCTCTTATCAAACTA...
;
END;
BEGIN SETS;
END;
Customizing/Controlling Sequence Taxa
The taxon assignment can be controlled in one of two ways:
1. Using the “label_components” and optionally the “label_component_separator” arguments.
2. Specifying a custom function using the “gb_to_taxon_func” argument that takes a
GenBankAccessionRecord object and returns the Taxon object to be assigned to the sequence;
this approach requires specification of a TaxonSet object passed using the “taxon_set” argument.
Specifying a Custom Label for Sequence Taxa
The “label_components” and the “label_component_separator” arguments allow for customization
of the taxon labels of the Taxon objects created for each sequence. The “label_components” argument
should be assigned an ordered container (e.g., a list) of strings that correspond to attributes of objects of the
GenBankAccessionRecord class. The values of these attributes will be concatenated to compose the Taxon
object label. By default, the components will be separated by spaces, but you can override this by passing the string to
be used by the “label_component_separator” argument. For example:
>>> from dendropy.interop import genbank
>>> gb_dna = genbank.GenBankDna(ids=[158930545, 'EU105475'])
>>> char_matrix = gb_dna.generate_char_matrix(
... label_components=["accession", "organism", ],
... label_component_separator="_")
>>> print [t.label for t in char_matrix.taxon_set]
[’EU105474_Homo_sapiens’, ’EU105475_Homo_sapiens’]
>>> char_matrix = gb_dna.generate_char_matrix(
... label_components=["organism", "moltype", "gi"],
... label_component_separator=".")
>>> print [t.label for t in char_matrix.taxon_set]
[’Homo.sapiens.DNA.158930545’, ’Homo.sapiens.DNA.158930546’]
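The way a label is assembled from record attributes can be sketched as a join over the named attributes. The sketch below is an approximation, not DendroPy's implementation: FakeRecord and compose_label are hypothetical, and the replacement of spaces inside a value by the separator is inferred only from the labels printed above (“Homo_sapiens”, “Homo.sapiens”). The attribute names used are ones that appear on GenBankAccessionRecord objects elsewhere in this chapter.

```python
# Stand-in record and helper; the whitespace replacement is inferred from the
# labels printed above, not taken from DendroPy's source.
class FakeRecord(object):
    def __init__(self, **fields):
        self.__dict__.update(fields)

def compose_label(record, components, separator=" "):
    """Join the named attributes, using the separator in place of spaces too."""
    parts = [str(getattr(record, name)) for name in components]
    return separator.join(separator.join(p.split()) for p in parts)

rec = FakeRecord(primary_accession="EU105474", organism="Homo sapiens",
                 moltype="DNA", gi="158930545")
print(compose_label(rec, ["primary_accession", "organism"], "_"))
print(compose_label(rec, ["organism", "moltype", "gi"], "."))
```

Choosing a separator that cannot occur inside the component values makes it possible to split a label back into its parts later.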
Specifying a Custom Taxon-Discovery Function
Full control over the Taxon object assignment process is given by using the “gb_to_taxon_func” argument.
This should be used to specify a function that takes a GenBankAccessionRecord object and returns the Taxon
object to be assigned to the sequence. The specification of a TaxonSet object passed using the “taxon_set”
argument is also required, so that this can be assigned to the CharacterMatrix object.
A simple example that illustrates the usage of the “gb_to_taxon_func” argument by creating a custom label:
#! /usr/bin/env python
import dendropy
from dendropy.interop import genbank
def gb_to_taxon(gb):
    locality = gb.feature_table.find("source").qualifiers.find("note").value
    label = "GI" + gb.gi + "." + locality
    taxon = dendropy.Taxon(label=label)
    return taxon
taxon_set = dendropy.TaxonSet()
gb_dna = genbank.GenBankDna(ids=[158930545, 'EU105475'])
char_matrix = gb_dna.generate_char_matrix(
        taxon_set=taxon_set,
        gb_to_taxon_func=gb_to_taxon)
print [t.label for t in char_matrix.taxon_set]
which results in:
[’GI158930545.Ache’, ’GI158930546.Arara’]
A more complex case is one where you already have a TaxonSet with existing Taxon objects that you
want to associate with the sequences. The following illustrates how to do this:
#! /usr/bin/env python
import dendropy
from dendropy.interop import genbank
tree = dendropy.Tree.get_from_string(
        "(Ache, (Arara, (Bribri, (Guatuso, Guaymi))))",
        "newick")
def gb_to_taxon(gb):
    locality = gb.feature_table.find("source").qualifiers.find("note").value
    taxon = tree.taxon_set.get_taxon(label=locality)
    assert taxon is not None
    return taxon
gb_ids = [158930545, 158930546, 158930547, 158930548, 158930549]
gb_dna = genbank.GenBankDna(ids=gb_ids)
char_matrix = gb_dna.generate_char_matrix(
        taxon_set=tree.taxon_set,
        gb_to_taxon_func=gb_to_taxon)
print [t.label for t in char_matrix.taxon_set]
print tree.taxon_set is char_matrix.taxon_set
for taxon in tree.taxon_set:
    print "{}: {}".format(
            taxon.label,
            char_matrix[taxon].symbols_as_string()[:10])
which results in:
['Ache', 'Arara', 'Bribri', 'Guatuso', 'Guaymi']
True
Ache: TCTCTTATCA
Arara: TCTCTTATCA
Bribri: TCTCTTATCA
Guatuso: TCTCTTATCA
Guaymi: TCTCTTATCA
The important thing to note here is that the Taxon objects in the DnaCharacterMatrix do not just have the same
labels as the Taxon object in the Tree, “tree”, but actually are the same objects (i.e., reference the same operational
taxonomic units within DendroPy).
Adding the GenBank Record as an Attribute
It is sometimes useful to maintain a handle on the original GenBank record in the CharacterMatrix resulting from “generate_char_matrix()”. The “set_taxon_attr” and “set_seq_attr” arguments of the
“generate_char_matrix()” method allow you to do this. The values supplied to these arguments should be strings
that specify the name of the attribute that will be created on the Taxon or CharacterDataVector objects,
respectively. The value of this attribute will be the GenBankAccessionRecord that underlies the Taxon or
CharacterDataVector sequence. For example:
#! /usr/bin/env python
import dendropy
from dendropy.interop import genbank
gb_dna = genbank.GenBankDna(ids=[158930545, 'EU105475'])
char_matrix = gb_dna.generate_char_matrix(set_taxon_attr="gb_rec")
for taxon in char_matrix.taxon_set:
    print "Data for taxon '{}' is based on GenBank record: {}".format(
            taxon.label,
            taxon.gb_rec.definition)
will result in:
Data for taxon ’158930545’ is based on GenBank record: Homo sapiens Ache non-coding region T864 genom
Data for taxon ’EU105475’ is based on GenBank record: Homo sapiens Arara non-coding region T864 genom
Alternatively, the following:
#! /usr/bin/env python
import dendropy
from dendropy.interop import genbank
gb_dna = genbank.GenBankDna(ids=[158930545, 'EU105475'])
char_matrix = gb_dna.generate_char_matrix(set_seq_attr="gb_rec")
for sidx, sequence in enumerate(char_matrix.vectors()):
    print "Sequence {} ('{}') is based on GenBank record: {}".format(
            sidx+1,
            char_matrix.taxon_set[sidx].label,
            sequence.gb_rec.defline)
will result in:
Sequence 1 (’158930545’) is based on GenBank record: gi|158930545|gb|EU105474.1| Homo sapiens Ache no
Sequence 2 (’EU105475’) is based on GenBank record: gi|158930546|gb|EU105475.1| Homo sapiens Arara no
Annotating with GenBank Data and Metadata
To persist the information in the GenBankAccessionRecord object through serialization and deserialization,
you can request that this information gets added as an Annotation (see “Working with Metadata Annotations”) to
the corresponding Taxon or CharacterDataVector object.
Reference Annotation
Specifying “add_ref_annotation_to_taxa=True” will result in a reference-style metadata annotation being added to the
Taxon object, while specifying “add_ref_annotation_to_seqs=True” will result in a reference-style metadata annotation being added to the sequence. The reference-style annotation is a brief, single annotation that points to the URL of
the original record. As with metadata annotations in general, you need to use the NeXML format for full
functionality.
So, for example:
#! /usr/bin/env python
import dendropy
from dendropy.interop import genbank
gb_dna = genbank.GenBankDna(ids=[158930545, 'EU105475'])
char_matrix = gb_dna.generate_char_matrix(add_ref_annotation_to_taxa=True)
print char_matrix.as_string("nexml")
will result in:
<?xml version="1.0" encoding="ISO-8859-1"?>
<nex:nexml
version="0.9"
xsi:schemaLocation="http://www.nexml.org/2009 ../xsd/nexml.xsd"
xmlns:dcterms="http://purl.org/dc/terms/"
xmlns="http://www.nexml.org/2009"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xmlns:xml="http://www.w3.org/XML/1998/namespace"
xmlns:nex="http://www.nexml.org/2009"
xmlns:xsd="http://www.w3.org/2001/XMLSchema#"
>
<otus id="d4320533416">
<otu id="d4323884688" label="158930545">
<meta xsi:type="nex:ResourceMeta" rel="dcterms:source" href="http://www.ncbi.nlm.nih.gov/
</otu>
<otu id="d4323884816" label="EU105475">
<meta xsi:type="nex:ResourceMeta" rel="dcterms:source" href="http://www.ncbi.nlm.nih.gov/
</otu>
</otus>
.
.
.
</nex:nexml>
Alternatively:
#! /usr/bin/env python
import dendropy
from dendropy.interop import genbank
gb_dna = genbank.GenBankDna(ids=[158930545, 'EU105475'])
char_matrix = gb_dna.generate_char_matrix(add_ref_annotation_to_seqs=True)
print char_matrix.as_string("nexml")
will result in:
<?xml version="1.0" encoding="ISO-8859-1"?>
<nex:nexml
version="0.9"
xsi:schemaLocation="http://www.nexml.org/2009 ../xsd/nexml.xsd"
xmlns:dcterms="http://purl.org/dc/terms/"
xmlns="http://www.nexml.org/2009"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xmlns:xml="http://www.w3.org/XML/1998/namespace"
xmlns:nex="http://www.nexml.org/2009"
xmlns:xsd="http://www.w3.org/2001/XMLSchema#"
>
<matrix>
<row id="d4320533856" otu="d4322811536">
<meta xsi:type="nex:ResourceMeta" rel="dcterms:source" href="http://www.ncbi.nlm.nih.
<seq>TCTCTTATCAAAC...</seq>
</row>
<row id="d4320534384" otu="d4322811664">
<meta xsi:type="nex:ResourceMeta" rel="dcterms:source" href="http://www.ncbi.nlm.nih.
<seq>TCTCTTATCAAAC...</seq>
</row>
</matrix>
</characters>
</nex:nexml>
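Once written out as NeXML, the reference annotations can be read back with any XML parser. The sketch below uses the standard library's xml.etree.ElementTree on a small hand-written sample that imitates the structure of the first output above. The href values are hypothetical placeholders (the real URLs are truncated in the listings), and source_links is an illustrative helper, not part of DendroPy.

```python
import xml.etree.ElementTree as ET

NEXML_NS = "http://www.nexml.org/2009"

# Hand-written sample; the href values are placeholders, not real record URLs.
SAMPLE = """<nex:nexml version="0.9"
    xmlns="http://www.nexml.org/2009"
    xmlns:nex="http://www.nexml.org/2009"
    xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
    xmlns:dcterms="http://purl.org/dc/terms/">
  <otus id="otus1">
    <otu id="otu1" label="158930545">
      <meta xsi:type="nex:ResourceMeta" rel="dcterms:source"
            href="http://example.org/records/158930545" id="meta1"/>
    </otu>
    <otu id="otu2" label="EU105475">
      <meta xsi:type="nex:ResourceMeta" rel="dcterms:source"
            href="http://example.org/records/EU105475" id="meta2"/>
    </otu>
  </otus>
</nex:nexml>"""


def source_links(nexml_text):
    """Map each OTU label to the href of its dcterms:source annotation."""
    root = ET.fromstring(nexml_text)
    links = {}
    for otu in root.iter("{%s}otu" % NEXML_NS):
        for meta in otu.findall("{%s}meta" % NEXML_NS):
            if meta.get("rel") == "dcterms:source":
                links[otu.get("label")] = meta.get("href")
    return links


links = source_links(SAMPLE)
for label in sorted(links):
    print("{} -> {}".format(label, links[label]))
```

The same loop would work on the row elements of the second output by iterating over row instead of otu, assuming the annotations are direct children there as well.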
Full Annotation
Specifying “add_full_annotation_to_taxa=True” or “add_full_annotation_to_seqs=True” will result in the entire
GenBank record being added as a set of annotations to the Taxon or CharacterDataVector object, respectively.
For example:
#! /usr/bin/env python
import dendropy
from dendropy.interop import genbank
gb_dna = genbank.GenBankDna(ids=[158930545, 'EU105475'])
char_matrix = gb_dna.generate_char_matrix(add_full_annotation_to_taxa=True)
print char_matrix.as_string("nexml")
will result in the following:
<?xml version="1.0" encoding="ISO-8859-1"?>
<nex:nexml
version="0.9"
xsi:schemaLocation="http://www.nexml.org/2009 ../xsd/nexml.xsd"
xmlns:genbank="http://www.ncbi.nlm.nih.gov/dtd/INSD_INSDSeq.mod.dtd"
xmlns:dcterms="http://purl.org/dc/terms/"
xmlns="http://www.nexml.org/2009"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xmlns:xml="http://www.w3.org/XML/1998/namespace"
xmlns:nex="http://www.nexml.org/2009"
xmlns:xsd="http://www.w3.org/2001/XMLSchema#"
>
<otus id="d4320533416">
<otu id="d4323884688" label="158930545">
<meta xsi:type="nex:ResourceMeta" rel="dcterms:source" href="http://www.ncbi.nlm.nih.gov/
<meta xsi:type="nex:LiteralMeta" property="genbank:INSDSeq_locus" content="EU105474"
<meta xsi:type="nex:LiteralMeta" property="genbank:INSDSeq_length" content="494" id="
<meta xsi:type="nex:LiteralMeta" property="genbank:INSDSeq_moltype" content="DNA" id=
<meta xsi:type="nex:LiteralMeta" property="genbank:INSDSeq_topology" content="linear"
<meta xsi:type="nex:LiteralMeta" property="genbank:INSDSeq_strandedness" content="dou
<meta xsi:type="nex:LiteralMeta" property="genbank:INSDSeq_division" content="PRI" id
<meta xsi:type="nex:LiteralMeta" property="genbank:INSDSeq_update-date" content="27-N
<meta xsi:type="nex:LiteralMeta" property="genbank:INSDSeq_create-date" content="27-N
<meta xsi:type="nex:LiteralMeta" property="genbank:INSDSeq_definition" content="Homo
<meta xsi:type="nex:LiteralMeta" property="genbank:INSDSeq_primary-accesison" content
<meta xsi:type="nex:LiteralMeta" property="genbank:INSDSeq_accession-version" content
<meta xsi:type="nex:ResourceMeta" rel="genbank:otherSeqIds" id="d4323902032" >
<meta xsi:type="nex:LiteralMeta" property="genbank:gb" content="EU105474.1" id="d
<meta xsi:type="nex:LiteralMeta" property="genbank:gi" content="158930545" id="d4
</meta>
<meta xsi:type="nex:LiteralMeta" property="genbank:INSDSeq_source" content="Homo sapi
<meta xsi:type="nex:LiteralMeta" property="genbank:INSDSeq_organism" content="Homo sa
<meta xsi:type="nex:LiteralMeta" property="genbank:INSDSeq_taxonomy" content="Eukaryo
<meta xsi:type="nex:ResourceMeta" rel="genbank:INSDSeq_references" id="d4323902416" >
<meta xsi:type="nex:ResourceMeta" rel="genbank:INSDReference_reference" id="d4323
<meta xsi:type="nex:LiteralMeta" property="genbank:INSDReference_reference" c
<meta xsi:type="nex:LiteralMeta" property="genbank:INSDReference_position" co
<meta xsi:type="nex:LiteralMeta" property="genbank:INSDReference_title" conte
<meta xsi:type="nex:LiteralMeta" property="genbank:INSDReference_journal" con
<meta xsi:type="nex:LiteralMeta" property="genbank:INSDReference_pubmed" cont
</meta>
<meta xsi:type="nex:ResourceMeta" rel="genbank:INSDReference_reference" id="d4323
<meta xsi:type="nex:LiteralMeta" property="genbank:INSDReference_reference" c
<meta xsi:type="nex:LiteralMeta" property="genbank:INSDReference_position" co
<meta xsi:type="nex:LiteralMeta" property="genbank:INSDReference_title" conte
<meta xsi:type="nex:LiteralMeta" property="genbank:INSDReference_journal" con
</meta>
</meta>
<meta xsi:type="nex:ResourceMeta" rel="genbank:INSDSeq_feature-table" id="d4323902480
<meta xsi:type="nex:ResourceMeta" rel="genbank:INSDSeq_feature" id="d4323903312"
<meta xsi:type="nex:LiteralMeta" property="genbank:INSDFeature_key" content="
<meta xsi:type="nex:LiteralMeta" property="genbank:INSDFeature_location" cont
<meta xsi:type="nex:ResourceMeta" rel="genbank:INSDFeature_quals" id="d432390
<meta xsi:type="nex:LiteralMeta" property="genbank:organism" content="Hom
<meta xsi:type="nex:LiteralMeta" property="genbank:mol_type" content="gen
<meta xsi:type="nex:LiteralMeta" property="genbank:db_xref" content="taxo
<meta xsi:type="nex:LiteralMeta" property="genbank:chromosome" content="1
<meta xsi:type="nex:LiteralMeta" property="genbank:note" content="Ache" i
</meta>
</meta>
<meta xsi:type="nex:ResourceMeta" rel="genbank:INSDSeq_feature" id="d4323903568"
<meta xsi:type="nex:LiteralMeta" property="genbank:INSDFeature_key" content="
<meta xsi:type="nex:LiteralMeta" property="genbank:INSDFeature_location" cont
<meta xsi:type="nex:ResourceMeta" rel="genbank:INSDFeature_quals" id="d432390
<meta xsi:type="nex:LiteralMeta" property="genbank:note" content="non-cod
</meta>
</meta>
</meta>
</meta>
</otu>
<otu id="d4323884816" label="EU105475">
<meta xsi:type="nex:ResourceMeta" rel="dcterms:source" href="http://www.ncbi.nlm.nih.gov/
<meta xsi:type="nex:LiteralMeta" property="genbank:INSDSeq_locus" content="EU105475"
<meta xsi:type="nex:LiteralMeta" property="genbank:INSDSeq_length" content="494" id="
.
.
.
(etc.)
</meta>
</otu>
</otus>
.
.
.
(etc.)
</nex:nexml>
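The full annotation is a tree of nested meta elements: literal annotations carry a property and a content attribute, while resource annotations carry a rel attribute and may contain further annotations. As a rough illustration of how such a tree might be flattened into key/value pairs, here is a sketch using xml.etree.ElementTree. The sample is a cut-down, hand-written imitation of one otu element using values visible in the output above; flatten_annotations and the short id values are hypothetical.

```python
import xml.etree.ElementTree as ET

META_TAG = "{http://www.nexml.org/2009}meta"

# Cut-down imitation of one annotated <otu>; ids are made up for the sketch.
SAMPLE_OTU = """<otu xmlns="http://www.nexml.org/2009"
    xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
    id="otu1" label="158930545">
  <meta xsi:type="nex:LiteralMeta" property="genbank:INSDSeq_locus" content="EU105474" id="m1"/>
  <meta xsi:type="nex:LiteralMeta" property="genbank:INSDSeq_length" content="494" id="m2"/>
  <meta xsi:type="nex:ResourceMeta" rel="genbank:otherSeqIds" id="m3">
    <meta xsi:type="nex:LiteralMeta" property="genbank:gb" content="EU105474.1" id="m4"/>
    <meta xsi:type="nex:LiteralMeta" property="genbank:gi" content="158930545" id="m5"/>
  </meta>
</otu>"""


def flatten_annotations(element, path=()):
    """Return (key, content) pairs; nested keys are joined with '/'."""
    pairs = []
    for meta in element.findall(META_TAG):
        if meta.get("property") is not None:
            # Literal annotation: record its value under the current path.
            key = "/".join(path + (meta.get("property"),))
            pairs.append((key, meta.get("content")))
        else:
            # Resource annotation: descend into its children.
            pairs.extend(
                flatten_annotations(meta, path + (meta.get("rel", ""),)))
    return pairs


pairs = flatten_annotations(ET.fromstring(SAMPLE_OTU))
for key, value in pairs:
    print("{} = {}".format(key, value))
```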
7.2 Seq-Gen
DendroPy includes native infrastructure for phylogenetic sequence simulation on DendroPy trees under the HKY
model. Being pure-Python, however, it is a little slow. If Seq-Gen is installed on your system, though, you can take
advantage of the dendropy.interop.seqgen.SeqGen wrapper to efficiently simulate sequences under a wide
variety of models. The following examples should be enough to get started, and the class is simple enough
that its options should be largely self-documenting.
#! /usr/bin/env python
import dendropy
from dendropy.interop import seqgen
trees = dendropy.TreeList.get_from_path("pythonidae.mcmc.nex",
        "nexus")
s = seqgen.SeqGen()
# generate one alignment per tree
# as substitution model is not specified, defaults to a JC model
# will result in a DataSet object with one DnaCharacterMatrix per input tree
d0 = s.generate(trees)
print len(d0.char_matrices)
# instruct Seq-Gen to scale branch lengths by factor of 0.1
# note that this does not modify the input trees
s.scale_branch_lens = 0.1
# more complex model
s.char_model = seqgen.SeqGen.GTR
s.state_freqs = [0.4, 0.4, 0.1, 0.1]
s.general_rates = [0.8, 0.4, 0.4, 0.2, 0.2, 0.1]
d1 = s.generate(trees)
print len(d1.char_matrices)
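To get a feel for what scaling branch lengths does to the simulated data, recall that under the Jukes-Cantor model the expected proportion of sites that differ across a branch of length t (in substitutions per site) is 3/4 * (1 - exp(-4t/3)). The sketch below applies this formula before and after scaling by 0.1. It does not call Seq-Gen or DendroPy; jc_expected_divergence and scaled_lengths are hypothetical helper names, and the branch lengths are made up.

```python
import math


def jc_expected_divergence(branch_length):
    """Expected proportion of differing sites under Jukes-Cantor."""
    return 0.75 * (1.0 - math.exp(-4.0 * branch_length / 3.0))


def scaled_lengths(branch_lengths, factor):
    """Return a new list of scaled lengths; the input list is not modified."""
    return [length * factor for length in branch_lengths]


original = [0.05, 0.2, 1.0]           # made-up branch lengths
scaled = scaled_lengths(original, 0.1)

for before, after in zip(original, scaled):
    print("length {:.3f} -> {:.3f}: divergence {:.4f} -> {:.4f}".format(
        before, after,
        jc_expected_divergence(before),
        jc_expected_divergence(after)))
```

Note that the divergence is not simply divided by ten on long branches, because multiple substitutions at the same site saturate toward the 0.75 ceiling.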
7.3 PAUP
The paup module provides functions to estimate a tree given a data matrix, or a substitution model given a tree and a
data matrix.
Trees can be estimated using likelihood:
#! /usr/bin/env python
import dendropy
from dendropy.interop import paup
data = dendropy.DnaCharacterMatrix.get_from_path("pythonidae.nex", "nexus")
tree = paup.estimate_tree(data,
        tree_est_criterion='likelihood',
        num_states=2,
        unequal_base_freqs=True,
        gamma_rates=False,
        prop_invar=False)
print tree.as_string("newick")
Or neighbor-joining:
#! /usr/bin/env python
import dendropy
from dendropy.interop import paup
data = dendropy.DnaCharacterMatrix.get_from_path("pythonidae.nex", "nexus")
tree = paup.estimate_tree(data,
        tree_est_criterion='nj')
print tree.as_string("newick")
Estimating substitution model parameters requires both a tree and a data matrix:
#! /usr/bin/env python
import dendropy
from dendropy.interop import paup
data = dendropy.DnaCharacterMatrix.get_from_path("pythonidae.nex", "nexus")
tree = paup.estimate_tree(data,
        tree_est_criterion='nj')
model = paup.estimate_model(data,
        tree,
        num_states=2,
        unequal_base_freqs=True,
        gamma_rates=False,
        prop_invar=False)
for k, v in model:
    print k, v
7.4 RAxML
The raxml module provides the RaxmlRunner class, which is a lightweight (i.e., mostly “pass-through”) wrapper
to the RAxML maximum-likelihood tree estimation program. RAxML needs to be installed on the system for this class
to be used.
The class handles the exporting of the DendroPy dataset in a format suitable for RAxML analysis, and re-reading the
resulting tree back into a DendroPy object.
The basic call assumes nucleotide data and estimates a tree under the GTRCAT model:
#! /usr/bin/env python
import dendropy
from dendropy.interop import raxml
data = dendropy.DnaCharacterMatrix.get_from_path("pythonidae.nex", "nexus")
rx = raxml.RaxmlRunner()
tree = rx.estimate_tree(data)
Currently, the only way to customize the call to the underlying RAxML program using estimate_tree is to pass
it a list of command-line arguments, with each argument token a separate list element. So, for example, to include
invariant sites in the substitution model and run 250 independent tree searches:
>>> tree = rx.estimate_tree(data, ['-m', 'GTRCATI', '-N', '250'])
Obviously, while this works, it is neither ideal nor very Pythonic. Future releases will polish up the interface.
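Until then, the token list can be built programmatically rather than typed by hand. The sketch below shows two ways of producing the list used above: splitting a command-line string with the standard library's shlex.split, and flattening (flag, value) pairs. Both helper functions are hypothetical conveniences, not part of DendroPy, and neither calls RAxML.

```python
import shlex


def tokens_from_string(command_line):
    """Split a shell-style option string into separate argument tokens."""
    return shlex.split(command_line)


def tokens_from_options(options):
    """Flatten (flag, value) pairs; a value of None means a bare flag."""
    tokens = []
    for flag, value in options:
        tokens.append(flag)
        if value is not None:
            tokens.append(str(value))  # every token must be a string
    return tokens


from_string = tokens_from_string("-m GTRCATI -N 250")
from_options = tokens_from_options([("-m", "GTRCATI"), ("-N", 250)])
print(from_string)
print(from_options)
```

Either list could then be passed as the second argument to estimate_tree in the same way as the literal list shown above.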
7.5 [*deprecated*] NCBI (National Center for Biotechnology Information) Databases
Deprecated since version 3.12.0: This module has been deprecated. Use the “genbank module” instead.
Warning: This module has been deprecated. For currently-supported approaches to download and generate
character matrices from GenBank data, see “Working with GenBank Molecular Sequence Databases”.
The ncbi module provides the Entrez class, which wraps up some basic querying and retrieval of data from the
NCBI (National Center for Biotechnology Information) life-sciences databases.
7.5.1 Retrieving Nucleotide Data
By Lists of Accession Identifiers
The fetch_nucleotide_accessions method takes a list of nucleotide accession numbers as its argument:
>>> from dendropy.interop import ncbi
>>> entrez = ncbi.Entrez()
>>> data = entrez.fetch_nucleotide_accessions(['EU105474', 'EU105475'])
>>> for t in data.taxon_set:
...     print(t)
...
gi|158930546|gb|EU105475.1| Homo sapiens Arara non-coding region T864 genomic sequence
gi|158930545|gb|EU105474.1| Homo sapiens Ache non-coding region T864 genomic sequence
As can be seen, if successful, it returns a DnaCharacterMatrix object that consists of sequences corresponding
to the requested accessions. This can, of course, be written to a file or manipulated as needed:
>>> data.write_to_path('eu105474-105475.nex', 'nexus')
By Ranges of Accession Identifiers
Sometimes, it might be more convenient to specify the required accession numbers as a range. For example, a publication may list sequences accessioned as “EU105474-106045”. The fetch_nucleotide_accession_range method
takes three arguments: a numeric value indicating the start of the range, a numeric value indicating the end of the
range, and a string giving a prefix to be added to each number within the range to yield the full accession identifier. Note
that, unlike Python’s native range function, the last or end value is included as part of the range. So, to get all
the sequences given in a publication as “EU105474-106045”:
>>> from dendropy.interop import ncbi
>>> entrez = ncbi.Entrez()
>>> data = entrez.fetch_nucleotide_accession_range(105474, 106045, prefix="EU")
>>> for t in data.taxon_set:
...     print(t)
...
gi|158930636|gb|EU105565.1| Homo sapiens Arara non-coding region T946 genomic sequence
gi|158930635|gb|EU105564.1| Homo sapiens Ache non-coding region T946 genomic sequence
gi|158930638|gb|EU105567.1| Homo sapiens Guatuso non-coding region T946 genomic sequenc
.
.
.
>>> data.write_to_path('data2.fas', 'fasta')
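The inclusive-range behaviour can be mimicked in a few lines of plain Python. This is only a sketch of the idea, under the assumption that identifiers are formed by simple concatenation of the prefix and the number; accession_range is a hypothetical helper, not the method's actual implementation.

```python
def accession_range(first, last, prefix=""):
    """List accession identifiers from `first` to `last`, both inclusive."""
    if last < first:
        raise ValueError("end of range precedes its start")
    # range() excludes its upper bound, hence the + 1.
    return ["{}{}".format(prefix, number)
            for number in range(first, last + 1)]


ids = accession_range(105474, 106045, prefix="EU")
print("{} ... {} ({} accessions)".format(ids[0], ids[-1], len(ids)))
```

Such a list could equally be passed to fetch_nucleotide_accessions, which is convenient when a publication lists several disjoint ranges.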
Tracking NCBI Curation Information
The Taxon objects associated with each sequence downloaded will have three additional attributes:
ncbi_accession, ncbi_version, and ncbi_gi. These track the accession, sequence version, and GI numbers, respectively, of the sequence:
>>> from dendropy.interop import ncbi
>>> entrez = ncbi.Entrez()
>>> data = entrez.fetch_nucleotide_accessions(['EU105474', 'EU105475'])
>>> for taxon in data.taxon_set:
...     print(taxon.ncbi_accession, taxon.ncbi_version, taxon.ncbi_gi)
...
('EU105475', 'EU105475.1', '158930546')
('EU105474', 'EU105474.1', '158930545')
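All three values also appear in the FASTA defline of each record, as seen in the taxon labels earlier in this section. The following sketch shows one way they might be pulled out of a defline of the form "gi|<GI>|<db>|<accession.version>| <description>". The parse_defline function is hypothetical and assumes exactly that layout; it is not how the Entrez class populates its attributes.

```python
def parse_defline(defline):
    """Split a 'gi|...|db|ACC.VER| description' defline into its parts."""
    fields = defline.split("|", 4)
    if len(fields) != 5 or fields[0] != "gi":
        raise ValueError("unexpected defline layout: %r" % defline)
    version = fields[3]
    return {
        "gi": fields[1],
        "accession": version.split(".")[0],
        "version": version,
        "description": fields[4].strip(),
    }


info = parse_defline(
    "gi|158930545|gb|EU105474.1| Homo sapiens Ache non-coding region "
    "T864 genomic sequence")
print("{} {} {}".format(info["accession"], info["version"], info["gi"]))
```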
RNA vs. DNA
As noted, it is assumed that the data is DNA, and thus the query will result in a DnaCharacterMatrix
object. By specifying matrix_type=RnaCharacterMatrix to fetch_nucleotide_accessions or
fetch_nucleotide_accession_range, you can retrieve RNA data as well.
Error Handling
By default, if you give non-existent accession numbers, an exception will be raised:
>>> entrez = ncbi.Entrez()
>>> data = entrez.fetch_nucleotide_accessions(['zzz0', 'zzz1'])
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "dendropy/interop/ncbi.py", line 232, in fetch_nucleotide_accessions
    raise Entrez.AccessionFetchError(missing_ids)
dendropy.interop.ncbi.AccessionFetchError: Failed to retrieve accessions: zzz0, zzz1
An exception will be raised even if some of the specified accessions are valid:
>>> data = entrez.fetch_nucleotide_accessions(['zzz0', 'zzz1', 'EU105475'])
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "dendropy/interop/ncbi.py", line 232, in fetch_nucleotide_accessions
    raise Entrez.AccessionFetchError(missing_ids)
dendropy.interop.ncbi.AccessionFetchError: Failed to retrieve accessions: zzz0, zzz1
By passing in verify=False to fetch_nucleotide_accessions or fetch_nucleotide_accession_range, you can request that data retrieval failures be ignored, and
only existing accessions be returned:
>>> data = entrez.fetch_nucleotide_accessions(['zzz0', 'zzz1', 'EU105475'], verify=False)
>>> len(data)
1
>>> for t in data.taxon_set:
...     print(t.label)
...
gi|158930546|gb|EU105475.1| Homo sapiens Arara non-coding region T864 genomic sequence
Note that specifying verify=False means that you might end up with empty DnaCharacterMatrix objects:
>>> data = entrez.fetch_nucleotide_accessions(['zzz0', 'zzz1'], verify=False)
>>> len(data)
0
Also, perhaps more of a concern, turning off verification may lead to wrong sequences being retrieved. For example,
if you try to download a range of accessions but inadvertently omit the prefix to be prepended
to the identifiers, the query might match the wrong sequences, based on GI values:
>>> data = entrez.fetch_nucleotide_accession_range(1000, 1001, verify=False)
>>> print(len(data))
2
>>> for t in data.taxon_set:
...     print(t.genbank_id, ": ", t.label)
...
Z18639 :  gi|1000|emb|Z18639.1| D.leucas gene for large subunit rRNA
Z18638 :  gi|1001|emb|Z18638.1| D.leucas gene for small subunit rRNA
Here, the sequences were retrieved based on matching GI numbers (1000, 1001) rather than the accession ids (e.g.,
“AY1000”, “AY1001”).
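When verification is switched off, a check of this kind can be done by hand after the fact. The sketch below compares the identifiers that were requested with the accessions that actually came back, using the values from the example above. The check_retrieved function is a hypothetical helper written for this illustration; it does not reproduce the checks performed by the Entrez class.

```python
def check_retrieved(requested_ids, retrieved_accessions):
    """Return (missing, unexpected) identifiers, each as a sorted list."""
    requested = set(requested_ids)
    retrieved = set(retrieved_accessions)
    missing = sorted(requested - retrieved)      # asked for, not returned
    unexpected = sorted(retrieved - requested)   # returned, never asked for
    return missing, unexpected


# Identifiers as they would have been sent without a prefix, and the
# accessions of the records that were returned in the example above.
missing, unexpected = check_retrieved(["1000", "1001"],
                                      ["Z18639", "Z18638"])
print("missing: {}".format(", ".join(missing)))
print("unexpected: {}".format(", ".join(unexpected)))
```

Any non-empty "unexpected" list is a sign that the query matched on something other than the intended accession identifiers.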
Switching off verification can also lead to some confusing errors. For example:
>>> data = entrez.fetch_nucleotide_accession_range(1000, 1003, verify=False)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "dendropy/interop/ncbi.py", line 282, in fetch_nucleotide_accession_range
  File "dendropy/interop/ncbi.py", line 245, in fetch_nucleotide_accessions
    sys.stderr.write("---\nNCBI Entrez Query returned:\n%s\n---\n" % results_str)
  File "dendropy/utility/iosys.py", line 199, in get_from_string
    readable.read_from_string(src, schema, **kwargs)
  File "dendropy/utility/iosys.py", line 260, in read_from_string
    return self.read(stream=s, schema=schema, **kwargs)
  File "dendropy/dataobject/char.py", line 653, in read
    d = DataSet(stream=stream, schema=schema, **kwargs)
  File "dendropy/dataobject/dataset.py", line 90, in __init__
    self.process_source_kwargs(**kwargs)
  File "dendropy/utility/iosys.py", line 221, in process_source_kwargs
    self.read(stream=stream, schema=schema, **kwargs)
  File "dendropy/dataobject/dataset.py", line 172, in read
    raise x
dendropy.utility.error.DataParseError: Error parsing data source on line 42 at column 3: Unrecognized
Here, the sequence with GI number of “1003” was a protein sequence, so it included characters not part of the DNA
alphabet, resulting in the DataParseError exception being raised.
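A quick pre-check of the raw sequence text can make this kind of failure easier to diagnose. The sketch below reports any symbols that fall outside an assumed DNA symbol set (the IUPAC nucleotide codes plus gap and missing-data characters). Both the symbol set and the non_dna_symbols helper are assumptions made for illustration, and may not match the alphabet DendroPy actually accepts; the peptide string is invented.

```python
# Assumed symbol set: IUPAC nucleotide codes plus gap and missing-data marks.
DNA_SYMBOLS = set("ACGTRYSWKMBDHVN-?")


def non_dna_symbols(sequence):
    """Return the sorted symbols in `sequence` that are not DNA symbols."""
    return sorted(set(sequence.upper()) - DNA_SYMBOLS)


dna_fragment = "TCTCTTATCAAAC"       # start of the sequences shown earlier
peptide_fragment = "MKVLAAGIVGE"     # invented protein-like string

print(non_dna_symbols(dna_fragment))
print(non_dna_symbols(peptide_fragment))
```

Note that several amino-acid letters are also valid nucleotide ambiguity codes, so a short protein fragment can slip through a check like this; it is a diagnostic aid, not a guarantee.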
7.5.2 (Auto-)Generating Analysis-Friendly Sequence Labels
When fetching nucleotides, you can request the Entrez object to generate labels that are a little more compact and
analysis-friendly by passing generate_labels=True to the constructor. This will generate a new taxon label for
each sequence based on the GenBank FASTA defline value. By default, it will compose a label in the form of:
<GBNUM>_<Genus>_<species>_<other>
So, for example, a sequence with the defline:
gi|158930547|gb|EU105476.1| Homo sapiens Bribri non-coding region T864 genomic sequence
will get the taxon label:
EU105476_Homo_sapiens_Bribri
You can control details of the label construction by the following arguments to the constructor:
• label_num_desc_components specifies the number of components from the defline to use. By default,
this is 3, which usually corresponds (in a sensible defline) to the genus name, the species epithet, and either the
sub-species or locality information.
• label_separator specifies the string used in between different label components. By default, this is an
underscore.
• label_id_in_front specifies whether the GenBank accession number should form the beginning
(True; default) or tail (False) end of the label.
Furthermore, you can request that the data get sorted by label value by specifying sort_taxa_by_label=True.
So, for example:
>>> entrez = ncbi.Entrez(generate_labels=True, sort_taxa_by_label=True)
>>> data = entrez.fetch_nucleotide_accessions(['EU105474', 'EU105475', 'EU105476'])
>>> for t in data.taxon_set:
...     print(t)
...
EU105474_Homo_sapiens_Ache
EU105475_Homo_sapiens_Arara
EU105476_Homo_sapiens_Bribri
>>> data.write_to_path('gb2.nex', 'nexus')
Or:
>>> entrez = ncbi.Entrez(generate_labels=True,
...         label_num_desc_components=2,
...         label_id_in_front=False,
...         label_separator='.')
>>> data = entrez.fetch_nucleotide_accessions(['EU105474', 'EU105475', 'EU105476'])
>>> for t in data.taxon_set:
...     print(t)
...
Homo.sapiens.EU105476
Homo.sapiens.EU105475
Homo.sapiens.EU105474
>>> data.write_to_path('seqs.dat', 'phylip', strict=False)
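The label-composition rules described in this section can be approximated in a few lines. The sketch below is an illustration only: compose_label is a hypothetical function whose parameters mirror the constructor arguments named above, and it assumes a defline of the form "gi|<GI>|<db>|<accession.version>| <description>". The actual Entrez implementation may handle edge cases differently.

```python
def compose_label(defline, num_desc_components=3, separator="_",
                  id_in_front=True):
    """Build a compact label from a GenBank FASTA defline."""
    fields = defline.split("|", 4)
    accession = fields[3].split(".")[0]             # drop the version suffix
    words = fields[4].split()[:num_desc_components]  # leading description words
    parts = [accession] + words if id_in_front else words + [accession]
    return separator.join(parts)


defline = ("gi|158930547|gb|EU105476.1| Homo sapiens Bribri non-coding "
           "region T864 genomic sequence")

default_label = compose_label(defline)
custom_label = compose_label(defline, num_desc_components=2,
                             separator=".", id_in_front=False)
print(default_label)
print(custom_label)
```

With the defaults this yields the same form as the first example above, and with the second set of arguments the same form as the labels in the second example.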