Bioinformatics
AffyTrees: Facilitating Comparative Analysis of
Affymetrix Plant Microarray Chips1[C][OA]
Tancred Frickey, Vagner Augusto Benedito, Michael Udvardi, and Georg Weiller*
Australian Research Council Centre of Excellence for Integrative Legume Research and Bioinformatics
Laboratory, Genomic Interactions Group, Research School of Biological Sciences, Australian National
University, Canberra, Australian Capital Territory 2601, Australia (T.F., G.W.); and The Samuel Roberts Noble
Foundation, Ardmore, Oklahoma 73401 (V.A.B., M.U.)
Microarrays measure the expression of large numbers of genes simultaneously and can be used to delve into interaction networks
involving many genes at a time. However, it is often difficult to decide to what extent knowledge about the expression of genes
gleaned in one model organism can be transferred to other species. This can be examined either by measuring the expression of
genes of interest under comparable experimental conditions in other species, or by gathering the necessary data from comparable
microarray experiments. However, it is essential to know which genes to compare between the organisms. To facilitate comparison
of expression data across different species, we have implemented a Web-based software tool that provides information about
sequence orthologs across a range of Affymetrix microarray chips. AffyTrees provides a quick and easy way of assigning which
probe sets on different Affymetrix chips measure the expression of orthologous genes. Even in cases where gene or genome
duplications have complicated the assignment, groups of comparable probe sets can be identified. The phylogenetic trees provide
a resource that can be used to improve sequence annotation and detect biases in the sequence complement of Affymetrix chips.
Being able to identify sequence orthologs and recognize biases in the sequence complement of chips is necessary for reliable cross-species microarray comparison. As the amount of work required to generate a single phylogeny in a nonautomated manner is
considerable, AffyTrees can greatly reduce the workload for scientists interested in large-scale cross-species comparisons.
Microarray experiments have made it possible to
rapidly quantify the expression of large numbers of
genes for a given experimental condition. The rapidity
and ease of use of this technology has enabled research
into complex aspects of growth and development involving multiple genes at a time. However, it remains
difficult to extend findings from one organism to another, as it is often not known which of the spots on
different microarray chips measure the expression of
comparable (i.e. orthologous) genes.
The basic idea of using model organisms is that the
knowledge gained from studying such an organism
will, to a large extent, be transferable to other species.
Taking the regulatory feedback loop controlling branching in Arabidopsis (Arabidopsis thaliana) as an example,
validating analyses needed to be performed in a range
of other species to determine to what extent this mechanism was conserved and how far the knowledge
gained in Arabidopsis could be applied to other plants
(Johnson et al., 2006).
1 This work was supported by the Australian Research Council
Centre of Excellence. Funding to pay for the publication charges was
provided by the same grant.
* Corresponding author; e-mail [email protected].
The author responsible for distribution of materials integral to the
findings presented in this article in accordance with the policy
described in the Instructions for Authors (www.plantphysiol.org) is:
Georg Weiller ([email protected]).
[C] Some figures in this article are displayed in color online but in
black and white in the print edition.
[OA] Open Access articles can be viewed online without a subscription.
www.plantphysiol.org/cgi/doi/10.1104/pp.107.109603
Approaches to validate such regulatory networks
range from crudely determining whether the necessary
genes might be present in another genome and then
assuming the complete network of gene interaction to
be conserved, to quantifying the expression of the corresponding genes under comparable experimental conditions and verifying that the genes actually do behave
in a similar manner. The former is a crude but quick,
cheap, and easy approach, while the latter is more
refined, but work intensive, expensive, and complicated. Data-mining available microarray data may
provide an intermediate solution to the problem. Microarray data repositories such as the Gene Expression
Omnibus (Edgar et al., 2002) provide a wealth of
information about how an organism responds to a
wide variety of experimental conditions and may provide information about the expression of a gene of
interest in a species of interest under an experimental
condition of interest.
Regardless of the approach used, it is necessary to
know which genes can be compared between organisms. In many cases, available gene annotation or best
BLAST (Altschul et al., 1997) hits are used. However,
gene annotation is not always correct or up to date, and
best BLAST hits do not always correspond to the closest
phylogenetic relative (Koski and Golding, 2001). The
orthology of genes, i.e. gene copies that arose due to a
speciation event, is the quintessential feature to look for
Plant Physiology, February 2008, Vol. 146, pp. 377–386, www.plantphysiol.org © 2007 American Society of Plant Biologists
Downloaded from on June 18, 2017 - Published by www.plantphysiol.org
Copyright © 2008 American Society of Plant Biologists. All rights reserved.
when attempting to compare genes or gene products.
The underlying assumption is that a gene in an emergent species will continue to perform the same function
it had in the ancestral species. Genes that arose via
duplication (i.e. paralogous genes) are a different matter, as two copies of the gene are present in the genome
of the organism, making it less likely that changes in
one of the duplicates will lead to a noticeable reduction
in fitness, making it more likely that such changes will
be passed on to the next generation. The paralogous
genes we observe today were therefore less restrained
in their ability to change, be lost, be inactivated, or
evolve toward a new function. Alternatively, both of the
duplicates may have changed only slightly, each continuing to perform a subset of the original gene’s tasks
or both may have remained fully functional, accumulating only minor changes in the regulation of their
expression to counteract potential dosage effects. This
freedom of paralogs to change is the main reason why
comparison of paralogous genes is unlikely to be beneficial or intended, and cross-species comparisons
should be confined to orthologous or co-orthologous
genes.
A number of tools and databases exist that attempt
to determine which genes are orthologous and therefore comparable across organisms (e.g. COG [Tatusov
et al., 1997], Orthomcl [Li et al., 2003], KOG [Tatusov
et al., 2003], Genome Clusters Database [Horan et al.,
2005], Inparanoid [O’Brien et al., 2005], Multiparanoid
[Alexeyenko et al., 2006], and Orthologid [Chiu et al.,
2006]). Unfortunately, some of these provide orthology
assignments for only a very restricted set of species,
while others require completed genomes on which to
base their predictions. Both these points make these
databases next to useless for researchers wanting to
compare sequences from organisms for which completed genomes are not yet available and that were not
part of the select set of species that were included in the
databases. For such organisms, researchers generally
have to rely on sequence similarity searches to determine potential sequence orthologs in better-described
species. In addition, the majority of the methods do
not base their orthology predictions on phylogenetic
trees but on other clustering methods and only use
phylogenies to visualize the results. Finally, none of the
methods provide an easy lookup of which Affymetrix
sequences are comparable across chips, making an
additional mapping of Affymetrix exemplar sequences
to predicted sequence orthologs necessary.
Our Web-based software tool provides a quick and
easy way of assessing the orthology of protein-coding
genes for a variety of plant microarray chips, irrespective of whether the genome of the organism is completed or not. We focused on Affymetrix chips, as the
overwhelming majority of microarray data present in
public repositories is based on these (Gene Expression
Omnibus; Edgar et al., 2002). These chips generally
provide a reasonable coverage of the transcriptome of
an organism, and the corresponding sequence data are
readily available. As many chips are designed and sold
before the corresponding organism is completely sequenced, there may be cases where sequences spotted
on a chip are thought no longer to be present in the
genome, or some genes in the genome may be represented multiple times or missing on the chip. In contrast to other methods, we do not use open reading
frames (ORFs) predicted from genomic data but the
sequences from which the probe sets for a given chip
were derived, hereafter referred to as either exemplar
or consensus sequences. We thereby avoid problems
arising from inaccurate ORF prediction, genome sequences being revised and changed, as well as errors in
assigning the various probe sets to predicted genomic
ORFs. For each of the consensus sequences, we provide
the results of sequence similarity searches against a
number of sequence databases, a profile hidden Markov
model (HMM) representative of the sequence family,
as well as a multiple sequence alignment and phylogenetic tree for that family. An additional utility permits determining sequence orthologs in a species of
choice for the sequences present on an Affymetrix chip.
A Web interface is provided to PHAT, part of the
PhyloGenie package (Frickey and Lupas, 2004), which
allows the repository of phylogenetic trees to be mined
for trees corresponding to specific topological or species constraints.
CONSTRUCTION AND CONTENT
The National Center for Biotechnology Information
(NCBI) nonredundant protein database ‘‘nr’’ and
6-frame translations of the plant microarray chip consensus sequences provided by Affymetrix provide the
set of sequences on which we base our predictions. The
6-frame translations of the consensus sequences provide information as to what proteins are represented on
the various microarray chips. The ‘‘nr’’ database contains a wide variety of species suitable as outgroups for
the phylogenies and provides sequences that may have
failed to be included on the microarray chips of the
various organisms. The latter are of special importance,
as they provide critical data when attempting to assess
whether two sequences are orthologous or paralogous
(Fig. 1).
PhyloGenie is used to automatically search for sequence homologs and infer phylogenetic trees for all
consensus sequences on a chip. This tool was originally
developed to generate and analyze phylomes with regard to gene duplications and lateral gene transfers
and can be briefly described as follows. Each microarray consensus sequence is compared against the
above-mentioned databases using BLAST. The result
of these sequence similarity searches is used to identify
potential sequence homologs. BLAST high-scoring segment pairs (HSPs) with greater than 70% coverage of
the query and E values better than 1e-5 are extracted
and aligned to one another. These parameters were
chosen lax enough to detect nontrivial sequence similarities yet stringent enough to exclude high-scoring
Figure 1. An ancestral gene undergoes a duplication and gives rise to
two paralogous genes, A and B. Some time later, a speciation event
gives rise to two species (light and dark). Each of these has retained both
paralogs in its genome, but only genes A for the dark species and B′
for the light species are included on the chip. Simple pairwise comparison of the chip sequences alone would predict A (dark) and B′
(light) to be sequence orthologs, as these would appear to be reciprocal
closest relatives. Including additional sequence data, such as sequences of outgroup species or the sequences A′ (light) and B (dark)
missing on the chips but present in the genomes of the blue and red
species, can help clarify relationships and allow unambiguous assignment of sequence orthologs.
local similarities that would, by themselves, not warrant the assignment of two sequences as being orthologous. The resulting alignment contains the sequence
regions we regard as homologous to the query. Hmmer
(http://hmmer.janelia.org/) is used to derive an HMM
from this alignment and search the full-length sequences of all BLAST-HSPs with E values better than
1. Deriving an HMM from the above alignment gives a
better representation of the sequence family. Using this
HMM to search against full-length sequences of even
marginal BLAST hits allows detection of more of the
distant sequence homologs and better defines the start
and end of homologous sequence regions than a single
BLAST search could. Sequence regions matching the
full-length HMM with E values better than 1e-5 are
combined into a multiple sequence alignment. A phylogenetic tree with 100 bootstrap replicates is inferred
from this alignment. Due to limited computational
resources, we use neighbor-joining (Saitou and Nei,
1987) to infer phylogenies. All intermediary files are
made available so that the process can be followed from
beginning to end, and alternative approaches, for example, a different method of tree inference, could be
used. The trees are rooted at the phylogenetic node
closest to the ‘‘last universal common ancestor,’’ as
described in the PhyloGenie manuscript (Frickey and
Lupas, 2004).
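The HSP selection step described above can be sketched as follows. This is only an illustration under stated assumptions, not PhyloGenie's code: the Hsp record, its field names, and the example hits are hypothetical, and coverage is assumed to mean the aligned query span divided by the query length. Only the two thresholds (more than 70% query coverage, E value better than 1e-5) are taken from the text.

```python
from dataclasses import dataclass


@dataclass
class Hsp:
    """One BLAST high-scoring segment pair (hypothetical minimal record)."""
    subject_id: str
    query_start: int  # 1-based, inclusive
    query_end: int    # 1-based, inclusive
    evalue: float


def select_homolog_hsps(hsps, query_length, min_coverage=0.70, max_evalue=1e-5):
    """Keep HSPs that cover more than min_coverage of the query with E < max_evalue."""
    kept = []
    for hsp in hsps:
        coverage = (hsp.query_end - hsp.query_start + 1) / query_length
        if coverage > min_coverage and hsp.evalue < max_evalue:
            kept.append(hsp)
    return kept


# Invented example hits for a query of length 200.
hits = [
    Hsp("subject_1", 1, 180, 1e-40),   # 90% coverage, strong E value: kept
    Hsp("subject_2", 50, 120, 1e-20),  # 35.5% coverage: local similarity only, dropped
    Hsp("subject_3", 10, 190, 1e-3),   # good coverage but weak E value: dropped
]
selected = select_homolog_hsps(hits, query_length=200)
print([h.subject_id for h in selected])
```

The same pattern, with a different threshold, would apply to the later selection of HMM matches with E values better than 1e-5.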
The set of trees generated by PhyloGenie provides
the basis of our prediction of sequence orthologs. The
actual prediction requires a number of user-specified
parameters and is performed on-the-fly, allowing for a
high degree of flexibility. Detection of sequence orthologs is based on the number of nodes separating the
query sequence, i.e. the sequence for which a tree was
derived, from sequences of any given species in the
tree. In the following examples, we assume that the user
selected the Arabidopsis ATH1-121501 chip and was
attempting to find sequence orthologs in Medicago
truncatula.
Determining sequence orthologs is done in the following manner (Fig. 2). The number of nodes separating each M. truncatula sequence (yellow) from the
query (purple) is determined (minimum no. 4, SD
2.87). An additional scaling factor (default, 0.5) allows
the user to specify the range within which they are willing to
accept M. truncatula sequences as potential sequence
orthologs. Increasing this value causes the program to
take into account more distant sequence relatives as
potential orthologs, while decreasing this value causes
the program to focus on the most closely related sequences only. In the presented analysis, we used a value
of 0.5, as this allowed us to determine orthologs for
most of the chip sequences while not causing too many
of the query sequences to be assigned multiple orthologs in the other species. The distance within which
sequences are accepted as potential sequence orthologs
is referred to as the permissive range in this manuscript. The permissive range is calculated as the minimal number of nodes separating the query sequence
from a M. truncatula homolog in the tree plus the SD
multiplied by the scaling factor. The SD reflects the
dispersal pattern of M. truncatula sequences throughout the tree. The more clades in a tree containing M.
truncatula sequences, the greater the uncertainty about
which of these clades contains sequences orthologous
to the query. We therefore use the SD of the number of
nodes separating M. truncatula sequences from the
query as a measure for how uncertain we are that the
sequences closest to each other, in number of nodes,
really are the sequence orthologs. For the tree shown in
Figure 2, the permissive range is highlighted in green
and encompasses all sequences less than six nodes
removed from the query. Affymetrix Arabidopsis
ATH1-121501 sequences less than six nodes removed
from the query are regarded as sequence paralogs of the
query (here, 260439_at). M. truncatula sequences within the
permissive range are regarded as potential sequence
orthologs (Mtr.28509.1.S1_at, Mtr.17370.1.S1_at, and
Mtr.21922.1.S1_at).
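The permissive range calculation can be written out as a short sketch. The formula (minimum node distance plus SD multiplied by the scaling factor) and the Figure 2 values (minimum 4, SD 2.87, scaling factor 0.5) come from the text. The function names are hypothetical, the use of the population SD is an assumption because the text does not say which SD estimator is used, and the inclusive comparison and toy distances are for illustration only.

```python
import statistics


def permissive_range(min_nodes, sd, scaling_factor=0.5):
    """Permissive range = minimum node distance + SD * scaling factor."""
    return min_nodes + sd * scaling_factor


def permissive_range_from_distances(node_distances, scaling_factor=0.5):
    """Compute the range from raw node distances (population SD is an assumption)."""
    return permissive_range(
        min(node_distances),
        statistics.pstdev(node_distances),
        scaling_factor,
    )


# Values quoted for the tree in Figure 2.
fig2_range = permissive_range(4, 2.87, 0.5)
print(round(fig2_range, 3))  # 5.435: sequences fewer than six nodes away fall inside

# Invented node distances from a query to four sequences of the target species.
toy_distances = [4, 5, 9, 11]
toy_range = permissive_range_from_distances(toy_distances)
accepted = [d for d in toy_distances if d <= toy_range]
print(accepted)
```

With the default scaling factor, a widely dispersed set of target-species sequences (large SD) widens the range, which matches the stated intent of treating dispersal as uncertainty.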
For each of the potential orthologs, we subsequently
perform a reverse lookup. We calculate the minimum
and SD of the number of nodes separating each potential
ortholog from the Affymetrix Arabidopsis ATH1-121501 sequences present in the tree. As the minimum
and SD are greatly influenced by the position in the tree
of the sequence for which the values are being calculated, the permissive ranges of the potential orthologs
may be quite different from one another. Red and blue
lines show the permissive ranges for two of our three
Figure 2. Determining sequence orthologs based on the number of nodes separating them from the query. This example provides
a case where multiple clades containing both M. truncatula and closely related Arabidopsis homologs are present. Sequences
from the Arabidopsis microarray chip ATH1-121501 are highlighted in blue, the query sequence for which this tree was
computed is highlighted in magenta, and sequences from the M. truncatula microarray chip are highlighted in yellow. The
permissive range for the query is shown with a colored background (green), and red and blue lines, above and below the tree,
respectively, show the permissive range for the reverse lookup for two of the three potential sequence orthologs. Circles show
which of the Arabidopsis ATH1-121501 sequences were recovered in the respective reverse lookups. [See online article for color
version of this figure.]
potential orthologs. The query sequence does not lie
within the permissive range of Mtr.21922.1.S1_at (blue
line). This sequence is therefore removed from the set
of potential orthologs, as it appears much more closely
related to the Affymetrix Arabidopsis sequence
‘‘257728_at’’ than to the query. Mtr.28509.1.S1_at (red
line) and Mtr.17370.1.S1_at (not shown) recover the
query sequence in their permissive ranges, and both are
retained as sequence orthologs to the query. Analysis of
this tree therefore tells us that our query sequence
‘‘245641_at’’ has a sequence paralog (260439_at) on the
Affymetrix Arabidopsis ATH1-121501 chip and two
sequence orthologs (or co-orthologs) on the Affymetrix
M. truncatula chip.
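The forward and reverse lookups can be combined in a small sketch. It assumes that node distances between sequences have already been read from the tree; the identifiers and distance values below are invented and are not taken from Figure 2, and the population SD and inclusive comparison are assumptions. The logic follows the description above: a candidate is kept only if the query lies within the candidate's own permissive range over the chip sequences.

```python
import statistics


def within_range(source, targets, dist, scaling_factor=0.5):
    """Return the targets that lie inside the permissive range of `source`."""
    distances = [dist[frozenset((source, t))] for t in targets]
    limit = min(distances) + statistics.pstdev(distances) * scaling_factor
    return {t for t, d in zip(targets, distances) if d <= limit}


def predict_orthologs(query, chip_seqs, other_seqs, dist, scaling_factor=0.5):
    """Forward lookup from the query, then a reverse lookup from each candidate."""
    candidates = within_range(query, other_seqs, dist, scaling_factor)
    return {
        c for c in candidates
        if query in within_range(c, chip_seqs, dist, scaling_factor)
    }


# Invented node distances (number of nodes separating two sequences).
raw = {
    ("query", "mtr_1"): 4, ("query", "mtr_2"): 5,
    ("query", "mtr_3"): 5, ("query", "mtr_4"): 12,
    ("ath_b", "mtr_1"): 6, ("ath_b", "mtr_2"): 5, ("ath_b", "mtr_3"): 9,
    ("ath_c", "mtr_1"): 12, ("ath_c", "mtr_2"): 11, ("ath_c", "mtr_3"): 1,
}
dist = {frozenset(pair): nodes for pair, nodes in raw.items()}

chip_seqs = ["query", "ath_b", "ath_c"]            # sequences on the query's chip
other_seqs = ["mtr_1", "mtr_2", "mtr_3", "mtr_4"]  # sequences of the target species

orthologs = predict_orthologs("query", chip_seqs, other_seqs, dist)
print(sorted(orthologs))
```

In this toy case mtr_3 passes the forward lookup but is discarded in the reverse lookup because it sits one node from ath_c, which mirrors the removal of a candidate that is much closer to another chip sequence than to the query.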
The aim of this tool is two-fold: it offers a fully
automated way of retrieving sequence orthologs for
microarray consensus sequences from a wide variety of
species and provides the results of a BLAST search,
multiple sequence alignment, and phylogenetic inference for every consensus sequence on a chip. This
allows manual validation of any dubious orthology
predictions by comparing the various intermediate results leading to the phylogeny against the corresponding
phylogenetic trees and alignments. In addition, the large
number of alignments generated in the process of
constructing the phylogenies are a useful resource on
which to base further analyses, as they provide sets of
aligned sequence homologs for every consensus sequence on a chip.
UTILITY
The user interface has five Web pages. The home
page allows querying of individual genes and links to
the remaining pages, some help, and supplemental
data. The other four pages of the interface deal with
batch requests, analysis of chip phylomes, generation
of phylogenies for sequences provided by the user,
and prediction of sequence orthologs between the
consensus sequences represented on a chip and other
species.
The results of an individual query are shown in
Figure 3. Tabs at the top of the page allow navigation
between the results of a BLAST search (BLAST), alignment of HSPs (CLN), the derived HMM (HMM), results
of the HMM search (HMS), alignment of high-scoring
HMM hits (HLN), and either a textual or applet-based
representation of a Neighbor-Joining tree (TRE). The
tabs allow the user to retrace every step leading from
query sequence to phylogeny and are very useful to
gain a better understanding of why two genes were
regarded as homologous, included in the same tree, or
predicted to be sequence orthologs. To facilitate interpretation
of batch requests and complete phylome
analyses, intermediate pages can be generated that
gather the results, order them, and link to the results
pages of the various genes. Prediction of sequence
orthologs between microarray chip consensus sequences
and a species of choice generates a tab-delimited list
containing information about which sequences on the
chip could be assigned sequence orthologs in another
species, which sequences should be regarded as co-orthologous or paralogous, and which other homologous sequences were present in the phylogenies but
could not be assigned a more precise relationship.
Supplemental data, providing further information
about the programs used, the individual steps performed to generate the data, as well as the parameters
the user can tweak, are available at http://bioinfoserver.rsbs.anu.edu.au/utils/affytrees/help.php.
Figure 3. Screenshot of results using the Arabidopsis ATH1-121501 chip consensus sequence 261590_at as a query. Part of the
corresponding phylogenetic tree is displayed. Red (dark) dots highlight M. truncatula sequences, yellow (light) dots highlight
ATH1-121501 sequences, and a blue dot (bottom) highlights the query sequence. The tabs at the top of the page allow navigation
between BLAST results (BLAST), the alignment of HSPs (CLN), the derived HMM (HMM), the HMM search results (HMS), the
alignment from which the phylogeny is inferred (HLN), and either a text or graphical representation of the phylogenetic tree
(TRE). [See online article for color version of this figure.]
Results of phylome analyses, custom phylogenetic trees, and orthology
predictions are stored for a week and can be
accessed by referring to the job identifier provided in
the results.
This tool differs from other databases and programs
in a number of ways. It provides the data on which tree
inference and orthology prediction are based and
thereby allows the user to retrace each step of the
decision process. Our trees include sequences from the
‘‘nr’’ database that greatly facilitate correct rooting and
interpretation. In addition, this allows us to potentially
detect sequence orthologs for any species represented
in ‘‘nr’’ instead of being limited to those species for
which complete genomes or proteomes are available.
The use of a user-defined ‘‘scaling factor’’ avoids problems co-orthologous genes cause for approaches relying solely on reciprocal best hits between genomes. If,
for example, a species has a gene of interest, gene A, that
was duplicated in another species, giving rise to genes
B and B′, reciprocal best hit approaches may identify
genes A and B or A and B′ as reciprocal best hits and
assign them as sequence orthologs. However, if A
appears most similar to B but B′ appears most similar
to A, a possible scenario if nonsymmetric scoring
schemes such as those employed by BLAST are used, then
no reciprocal best hits can be determined and no
sequence orthologs are assigned. All of the above cases
produce an incorrect assignment of gene orthology, as B
and B′ are co-orthologous to A (i.e. duplicates derived
from a gene that was orthologous to A) and should be
treated as such.
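The first failure mode described above, in which a reciprocal best hit search reports only one of two co-orthologs, can be reproduced with a toy example. The gene names, the unrelated gene X, and the directional similarity scores are invented for illustration; B_prime stands for B′. This is a generic reciprocal best hit sketch, not the procedure of any specific tool.

```python
def best_hit(query, candidates, score):
    """Return the candidate with the highest similarity score for `query`."""
    return max(candidates, key=lambda c: score[(query, c)])


def reciprocal_best_hits(genes_1, genes_2, score):
    """Pairs (g1, g2) in which each gene is the other's best hit."""
    pairs = []
    for g1 in genes_1:
        g2 = best_hit(g1, genes_2, score)
        if best_hit(g2, genes_1, score) == g1:
            pairs.append((g1, g2))
    return pairs


species_1 = ["A", "X"]        # X is an unrelated gene
species_2 = ["B", "B_prime"]  # duplicates that are co-orthologous to A

# Invented directional scores: score[(query, subject)].
score = {
    ("A", "B"): 200, ("A", "B_prime"): 190,
    ("X", "B"): 20, ("X", "B_prime"): 25,
    ("B", "A"): 198, ("B", "X"): 20,
    ("B_prime", "A"): 195, ("B_prime", "X"): 25,
}

rbh = reciprocal_best_hits(species_1, species_2, score)
print(rbh)
```

Only the pair (A, B) is reported, although B_prime scores almost as well against A and is equally co-orthologous, which is the situation the scaling factor is meant to accommodate.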
Another part of this tool allows the user to search
through the trees of a given species or chip for those
corresponding to specific topological selection criteria.
For example, to find all trees in which a clade contains
at least one M. truncatula and Arabidopsis sequence,
but no sequences from the Arabidopsis ATH1-121501
chip, the selection string ‘‘((Medicago truncatula &
Arabidopsis) & !Arabidopsis ATH1-121501)’’ could be
used. Trees containing such clades could identify sequences present in M. truncatula, the orthologs of which
cannot be measured using the Affymetrix Arabidopsis
ATH1-121501 chip, as no sequence orthologs are present on that chip. As an example of such a case (Fig. 4),
we show a tree derived for a hypothetical protein from
M. truncatula, the ortholog of which was not included
on the ATH1-121501 chip, even though orthologous
sequences are present in the Arabidopsis genome
as well as throughout the plant, fungal, and animal
kingdoms.
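The intent of such a selection string can be illustrated with a small sketch. This is not PHAT and does not parse its selection syntax; it only hard-codes the one constraint from the example above. The nested-tuple tree representation, the leaf label format (species or chip name followed by an identifier), and both example trees are hypothetical.

```python
def leaves(node):
    """Return all leaf labels below a node of a nested-tuple tree."""
    if isinstance(node, str):
        return [node]
    return [leaf for child in node for leaf in leaves(child)]


def internal_nodes(node):
    """Yield every internal node (clade) of a nested-tuple tree."""
    if isinstance(node, str):
        return
    yield node
    for child in node:
        yield from internal_nodes(child)


def has_matching_clade(tree):
    """True if some clade holds M. truncatula and Arabidopsis but no ATH1-121501 sequence."""
    for clade in internal_nodes(tree):
        labels = leaves(clade)
        has_mtr = any(label.startswith("Medicago truncatula") for label in labels)
        has_ath = any(label.startswith("Arabidopsis") for label in labels)
        has_chip = any("ATH1-121501" in label for label in labels)
        if has_mtr and has_ath and not has_chip:
            return True
    return False


# Hypothetical trees; labels are "<species or chip> <identifier>".
tree_a = (
    (("Medicago truncatula m1", "Arabidopsis nr_1"), "Oryza sativa o1"),
    "Arabidopsis ATH1-121501 p1",
)
tree_b = (
    ("Medicago truncatula m2", "Arabidopsis ATH1-121501 p2"),
    "Oryza sativa o2",
)

print(has_matching_clade(tree_a), has_matching_clade(tree_b))
```

In tree_a the M. truncatula sequence groups with an Arabidopsis sequence that is not on the chip, so the tree is selected; in tree_b the closest Arabidopsis relative is a chip sequence, so it is not.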
Future developments include, as a first step, extending this tool beyond the currently available seven chips
to include all publicly available Affymetrix plant microarray chips. Because this system is not limited as to
what species can be analyzed, provided some sequence
information for the species is available, it is conceivable
that the system may be extended to cover all available
Affymetrix microarray chips. Beyond that, the aim will
be to develop and implement methods that further
facilitate comparative analysis of microarray expression data across species.
RESULTS AND DISCUSSION
To determine whether the AffyTrees orthology predictions were comparable to, less, or more accurate
than reciprocal best BLAST hits, the most widely used
method to identify sequence orthologs, we compared
the orthology predictions generated by both methods.
Phylogenetically orthologous sequences are generally
expected to fulfill the same function in different species,
and functionally orthologous sequences are expected to
be similarly expressed across different species. Therefore, phylogenetic orthologs can be expected to show a
certain degree of similarity in their expression across
species. We based our comparison on prediction of
sequence orthologs between the Arabidopsis ATH1-121501 and M. truncatula Affymetrix chips. These species were chosen specifically, because sets of comparable
microarray experiments were available and provided
us with the opportunity to test whether and how well
sequence orthology, as predicted by reciprocal best
BLAST hits and AffyTrees, was reflected in similarity
of expression.
The results of comparing the orthology predictions
for these two microarray chips are shown in Figure 5A.
BLAST produced many more reciprocal best hits
(7,025) than AffyTrees predicted orthologs (5,793). Of
these, 2,926 predictions of sequence orthologs coincided, 4,099 orthology predictions were unique to the
reciprocal best BLAST hits, and 2,867 orthology predictions were unique to AffyTrees. Even though BLAST
produced nearly 30% more orthology predictions,
fewer individual sequences were assigned an ortholog
in BLAST than in AffyTrees. This was due to many of
the BLAST hits having multiple ortholog assignments.
On average, each M. truncatula chip sequence was
assigned 1.78 Arabidopsis chip sequences as reciprocal best BLAST hits, and every Arabidopsis chip sequence was assigned 1.57 M. truncatula chip sequences.
This artificially inflated the number of ‘‘orthology’’
predictions provided by BLAST. Dividing the number of reciprocal best BLAST hits by the average number of
assignments per sequence for each species gives us the
number of individual genes for each species that could
be assigned at least one ortholog in the other species:
the exclusively BLAST-based predictions assigned
2,303 sequences from Medicago one or more orthologs
in Arabidopsis, and 2,611 sequences in Arabidopsis
could be assigned one or more orthologs in Medicago.
The exclusively AffyTrees-based predictions assigned
2,515 Medicago sequences orthologs in Arabidopsis
and 2,537 Arabidopsis sequences orthologs in Medicago,
138 more sequences than assigned by reciprocal best
BLAST hits.
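The counts in this paragraph can be checked with a few lines of arithmetic. All numbers are taken from the text; the variable names are hypothetical, and applying the per-sequence averages to the BLAST-only predictions (4,099) is an assumption inferred from the fact that this reproduces the reported gene counts.

```python
blast_total = 7025      # reciprocal best BLAST hits
affytrees_total = 5793  # AffyTrees orthology predictions
shared = 2926           # predictions made by both methods

blast_only = blast_total - shared          # predictions unique to BLAST
affytrees_only = affytrees_total - shared  # predictions unique to AffyTrees

# Average number of partners per sequence among the BLAST predictions.
ath_per_mtr = 1.78  # Arabidopsis sequences per M. truncatula sequence
mtr_per_ath = 1.57  # M. truncatula sequences per Arabidopsis sequence

# Assumption: the averages are applied to the BLAST-only predictions.
mtr_with_blast_ortholog = round(blast_only / ath_per_mtr)
ath_with_blast_ortholog = round(blast_only / mtr_per_ath)

blast_genes = mtr_with_blast_ortholog + ath_with_blast_ortholog
affytrees_genes = 2515 + 2537

print(blast_only, affytrees_only)
print(mtr_with_blast_ortholog, ath_with_blast_ortholog)
print(affytrees_genes - blast_genes)
```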
To determine which of the methods provided a more
accurate orthology prediction, we compared the expression of predicted sequence orthologs in two sets of
microarray experiments, one for Arabidopsis (Schmid
et al., 2005) and one for M. truncatula (V. Benedito, I.
Torres-Jerez, J. Murray, A. Andriankaja, S. Allen, K.
Kakar, M. Wandrey, J. Verdier, H. Zuber, T. Ott, S. Moreau,
Figure 4. Phylogenetic tree of a protein-coding gene present in a wide variety of eukaryotes that is not represented on either of
the Affymetrix Arabidopsis chips. This is recognizable by the sequence identifiers. The Arabidopsis sequences (yellow, light dot)
have NCBI gi-numbers instead of Affymetrix identifiers, signifying that these sequences were taken from the ‘‘nr’’ database and
not one of the 6-frame translations of the microarray chip consensus sequences. The bottom-most sequence is the M. truncatula
query sequence for which this tree was generated. Other M. truncatula sequences are highlighted with a red (dark) dot. [See
online article for color version of this figure.]
A. Niebel, T. Frickey, G. Weiller, J. He, X. Dai, P. Zhao,
Y. Tang, and M. Udvardi, unpublished data; Medicago
Gene Atlas, ArrayExpress accession E-MEXP-1097).
The expression of genes was compared across seven
tissue types: stems, petioles, leaves, vegetative buds,
flowers, roots, and seeds. Different laboratories generated the data, and differences in harvesting, preparation, experimental procedure, growth conditions, and
of course the plants themselves undoubtedly will have
affected the experiments and provide ample explanation for why some sequence orthologs might not be
correlated in their expression in these two species.
Therefore, we did not expect all sequence orthologs
to show a strong positive correlation in their expression, but a general positive trend in correlation was
certainly expected. However, our aim was not to show
that sequence orthologs share similar expression patterns but to use the available expression data to assess
the accuracy of the two prediction methods.
Accepting the 2,926 orthology assignments both
BLAST and AffyTrees agreed upon as ‘‘true’’ orthologs,
we used the Pearson (linear) correlation coefficient of
the expression values to measure the coexpression of all
predicted ortholog pairs. The histogram in Figure 5B
shows the number of predicted ortholog pairs for a
given correlation coefficient as well as a fitted scaled
extreme value distribution (EVD; Fig. 5B). Most of the
predicted ortholog pairs produced positive correlation
coefficients, supporting our expectation that sequence
orthologs, in general, should show similar expression
across different organisms. In addition, the graph provides us with a means of testing the accuracy of
reciprocal best BLAST hits and AffyTrees orthology
predictions as seen in Figure 5C. Rather than comparing histograms directly, we approximated the histograms by a distribution with a small number of
parameters to facilitate comparison of multiple datasets. The EVD approximates the various histograms
depicted in Figure 5 quite well. The more accurate the
set of orthologs predicted by each method, the better
the corresponding fitted EVD should approximate the
EVD derived from our set of 2,926 true orthologs.
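As a minimal sketch of the coexpression measure, the Pearson correlation coefficient of one predicted ortholog pair across the seven tissue types could be computed as below. The expression values are invented for illustration and are not taken from either dataset; the function is a plain textbook implementation, and the fitting of the extreme value distribution is not shown.

```python
import math


def pearson(x, y):
    """Pearson (linear) correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


tissues = ["stems", "petioles", "leaves", "vegetative buds",
           "flowers", "roots", "seeds"]

# Invented expression values for one Arabidopsis probe set and its
# predicted M. truncatula ortholog, one value per tissue.
ath_expression = [5.1, 5.4, 7.9, 6.0, 6.2, 3.1, 2.8]
mtr_expression = [4.8, 5.0, 8.3, 6.4, 5.9, 3.5, 2.2]

r = pearson(ath_expression, mtr_expression)
print(f"Pearson r across {len(tissues)} tissues: {r:.2f}")
```

A coefficient near 1 indicates that the two genes rise and fall together across tissues, near 0 indicates no linear relationship, and near -1 indicates opposite expression patterns.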
Plant Physiol. Vol. 146, 2008
383
Downloaded from on June 18, 2017 - Published by www.plantphysiol.org
Copyright © 2008 American Society of Plant Biologists. All rights reserved.
Frickey et al.
Figure 5. A, Overlap of reciprocal best
BLAST hits (yellow) with AffyTrees orthology predictions (blue). B, Histogram and fitted EVD of ortholog pairs
predicted by both BLAST and AffyTrees
over the correlation coefficient of their
expression values across the microarray experiments. For comparison purposes, the fitted EVD curve (green) for
this data is represented in 5C and 5D as
well. A vertical dotted line is placed at
the peak of the EVD, and the correlation coefficient at which the peak is
found is stated in black numbers at the
bottom. The median value of each
dataset is marked in the top left corner.
C, Histogram and fitted EVD of the
genes assigned orthologs in either
BLAST (yellow) or AffyTrees (blue)
over the average correlation coefficient
of the assigned orthologs. D, Histogram and fitted EVD over the average
correlation coefficient for genes assigned orthologs randomly (black) or
by indiscriminately using any sequences present in the AffyTrees phylogenies as orthologs (magenta).
We then compared the sets of genes for which sequence orthologs were predicted by only one of the two methods, either
BLAST or AffyTrees. Whenever one gene was assigned
multiple sequence orthologs, we averaged their correlation coefficients to reflect that the method generating
the prediction could not discriminate further among
the predicted orthologs. A total of 4,914
genes were assigned sequence orthologs only by reciprocal best BLAST hits and 5,052 genes were assigned sequence orthologs only by AffyTrees. The graphs of the
histograms and fitted EVDs for these sets of genes are
shown in Figure 5C. Both BLAST and AffyTrees were
able to predict orthologs for similar numbers of genes;
however, the maximum of the BLAST-EVD lies at 0.47,
while the maximum of the AffyTrees-EVD lies at 0.66.
The EVD based on the AffyTrees predictions also better
approximates the EVD based on the set of true orthologs. Taking the median of the correlation coefficients as
the comparison metric leads to similar results (Fig. 5, B–
D). Bootstrap sampling of the BLAST and AffyTrees
distributions (10,000 samples, 1,000 replicates) showed
the median values of the distributions to be very
resilient to change. The probability of generating a
randomly sampled distribution with the median value
observed in the other method was, in both cases, very
low (BLAST, 2.1e-36; AffyTrees, 6.2e-26). Both the
median values of the distributions and the maxima of the fitted EVDs show that the histogram of the
AffyTrees predictions (blue) is more similar to the
histogram of the true orthologs (green) than the histogram of the best BLAST-based predictions (yellow) is to
the true orthologs. This points to the AffyTrees predictions being more reliable than the predictions based on
best BLAST hits.
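The averaging of correlation coefficients per gene and the bootstrap of the median can be sketched as follows. This is a minimal sketch under stated assumptions: the function names, the tuple layout, and the use of an empirical fraction are hypothetical choices for illustration, and the text does not state how the reported probabilities were actually computed.

```python
import random
import statistics
from collections import defaultdict


def mean_correlation_per_gene(pair_correlations):
    """Average r over all orthologs assigned to the same gene.

    pair_correlations: iterable of (gene, ortholog, r) tuples (assumed layout).
    """
    grouped = defaultdict(list)
    for gene, _ortholog, r in pair_correlations:
        grouped[gene].append(r)
    return {gene: statistics.fmean(rs) for gene, rs in grouped.items()}


def bootstrap_medians(values, sample_size, replicates, seed=0):
    """Median of each bootstrap replicate, sampling with replacement."""
    rng = random.Random(seed)
    return [
        statistics.median(rng.choices(values, k=sample_size))
        for _ in range(replicates)
    ]


def fraction_at_least(medians, threshold):
    """Empirical fraction of bootstrap medians that reach the threshold."""
    return sum(m >= threshold for m in medians) / len(medians)


# Hypothetical input: gene g1 has two predicted orthologs, g2 has one
per_gene = mean_correlation_per_gene(
    [("g1", "o1", 0.2), ("g1", "o2", 0.6), ("g2", "o3", 0.5)]
)
medians = bootstrap_medians(list(per_gene.values()), sample_size=10, replicates=100)
```

An empirical fraction over 1,000 replicates cannot resolve probabilities much smaller than 1 in 1,000, so very small probabilities would require some parametric approximation of the bootstrap distribution, which is not attempted here.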
However, it was recently shown that GCRMA (Wu
et al., 2004) normalization can lead to overprediction of
correlated genes (Lim et al., 2007). To see whether this
was affecting our results, we repeated the above analysis using MAS5 (Hubbell et al., 2002) normalized data.
The median values of the resulting distributions were
0.417 for our set of true orthologs, 0.339 for the
AffyTrees orthologs, 0.275 for the BLAST predictions,
0.267 for AffyTrees homologs, and 0.018 for random
sequence pairs. These values are similar to those calculated based on the GCRMA normalized data, indicating that, although GCRMA normalization does
seem to increase the median value of the distributions,
the increase is slight, and no qualitative difference in
how the methods compare to one another is apparent.
In an attempt to determine why the BLAST-based
prediction fared poorly, we examined how various
modes of orthology assignment influence the fitted
EVD. We show the histograms and fitted EVD for two
further datasets (Fig. 5D). The first set was generated by
randomly pairing sequences from within our set of true
orthologs (black) and the second by accepting all sequence homologs present in the AffyTrees phylogenies
as sequence orthologs (magenta). These phylogenies provide a large number of groupings of homologous
sequences. Many of the trees are known to
contain paralogous sequences, and misassigning sequence paralogs as orthologs is one of the key difficulties in accurately detecting sequence orthologs. The
graph shows that an EVD fitted to the random orthology assignments (black) has its maximum close to zero.
Indiscriminately assigning all sequence homologs present in a tree as sequence orthologs generates many
more orthology predictions, as is visible from the increased
amplitude of the EVD. However, the maximum of the
fitted EVD is close to 0.5, well below the 0.68 maximum
we determined for the EVD of the set of true orthologs
(green). We therefore expect the maximum of EVDs
fitted to various methods of orthology assignment, for
this dataset, to lie between 0 and 0.7. The closer the maximum lies to 0.7 or above, the better the prediction
method is likely to be. Not differentiating between
orthology and homology, thereby causing too many
sequences to be assigned as sequence orthologs, shifts
the maximum of the fitted EVD to around 0.5. BLAST-based predictions more frequently assigned multiple
sequence orthologs to genes than the AffyTrees predictions. This might explain why the maximum of the
BLAST-EVD lies at 0.47. The best BLAST approach,
while quite suited to detecting sequence homologs,
therefore does not appear very accurate when used to
distinguish between sequence orthologs and other homologs. The AffyTrees method, in contrast, appears far
better at reliably determining orthologous sequences.
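The two control assignments discussed above can be sketched in a few lines of Python. Shuffling the partners is only one possible way to pair sequences randomly, and the function names and the species lookup table are hypothetical; the sketch is not the procedure used to produce Figure 5D.

```python
import random


def random_pairs(true_pairs, seed=0):
    """Randomly re-pair genes drawn from the set of true ortholog pairs."""
    rng = random.Random(seed)
    genes_a = [a for a, _ in true_pairs]
    genes_b = [b for _, b in true_pairs]
    rng.shuffle(genes_b)  # break the real pairing while keeping the same genes
    return list(zip(genes_a, genes_b))


def all_homolog_pairs(tree_members, species_of, species_a, species_b):
    """Treat every cross-species pair within one tree as an 'ortholog'."""
    in_a = [m for m in tree_members if species_of[m] == species_a]
    in_b = [m for m in tree_members if species_of[m] == species_b]
    return [(x, y) for x in in_a for y in in_b]


# Hypothetical identifiers for illustration
pairs = [("a1", "b1"), ("a2", "b2"), ("a3", "b3")]
shuffled = random_pairs(pairs, seed=1)
species = {"a1": "A", "a2": "A", "b1": "B"}
indiscriminate = all_homolog_pairs(["a1", "a2", "b1"], species, "A", "B")
```

The indiscriminate control grows with the product of the two species' member counts in each tree, which is consistent with the larger number of predictions described in the text.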
CONCLUSION
AffyTrees provides a repository of phylogenetic trees
inferred from every consensus sequence represented
on a variety of Affymetrix plant microarray chips. This
repository can be used to gain insights into the relationship of sequence homologs, improve annotation
data, or automatically generate a list of sequence
orthologs between a species and the consensus sequences represented on a specific microarray chip. The
inclusion of sequences from the “nr” database and our
method of detecting sequence orthologs circumvent the
problems reciprocal best hit approaches have when
dealing with co-orthologous genes. For sequences
represented on Affymetrix plant microarray chips,
AffyTrees can identify sequence orthologs present on
other Affymetrix plant microarray chips, as well as
sequence orthologs present in the “nr” database.
The ability to filter chip phylomes for specific selection criteria allows discrepancies or systematic biases
between the sequence complements of chips and the
corresponding genomes to be detected. Affymetrix
chips were designed to measure the transcription of
genes and therefore are biased toward highly expressed and protein-coding genes. This is a known
and useful bias of these chips. However, other biases,
for example, systematic preference for long or short
sequences, differences in the EST libraries on which the
chips were based, or differences in the ability to successfully predict short genes in different species, will
have affected which sequences were included on a chip
and thereby influence the results.
We provide a means of comparing the sequence
complement of microarray chips to the publicly available sequence data of the corresponding organism as
well as to the microarrays of other species. Robust ways
of assessing sequence orthologs and knowledge about
systematic differences in the sequence complement of
various chips are prerequisites to making cross-species
analyses of microarray expression data feasible. Without knowledge of the sequence orthologs present on
other microarray chips, there is no way of determining
which probe sets are comparable across chips. Similarly, without a way of estimating sequence biases or
genes missing on a chip, the conclusions drawn from
the presence or absence of groups of genes derived
from expression data are likely to be flawed.
We show, to the extent that the limitations of the
available experimental data permitted, that the majority of genes predicted to be orthologous show similar
expression across the two examined species. We also
show that AffyTrees is able to assign sequence orthologs to more genes than a comparable approach relying on reciprocal best BLAST hits and, by comparing
the expression of predicted sequence orthologs, that
the AffyTrees orthologs appear more reliable than the
BLAST-based predictions.
AffyTrees provides prediction of sequence orthologs
for a wide variety of species at greater accuracy than
reciprocal best BLAST hits. Combined with the available phylogenetic trees, sequence alignments, and additional utilities, AffyTrees should provide a useful
resource for comparative analyses of transcriptomes
and proteomes.
MATERIALS AND METHODS
The sequences we based our sequence-similarity searches on originated from
either the “nr” database, downloaded from NCBI (ftp://ftp.ncbi.nlm.nih.gov/blast/db/FASTA/nr.gz), or from 6-frame translations of exemplar sequences
for a variety of Affymetrix chips. The nucleotide exemplar sequences were
downloaded, after registration, from the Affymetrix Web site by following the
links to the various species (http://www.affymetrix.com/support/technical/byproduct.affx?cat=exparrays). BLAST searches were performed against the
NCBI nonredundant protein database “nr” and 6-frame translations of consensus sequences for the Affymetrix microarray chips ATH1-121501, AtGenome1,
Barley1, Citrus, Cotton, Grape, Maize, Medicago, Poplar, Rice, Soybean, Sugar
Cane, Tomato, and Wheat. The BLAST results for sequences represented on the
Arabidopsis (Arabidopsis thaliana) ATH1-121501 and Medicago truncatula chips
were retrieved via the AffyTrees Web interface. Putative sequence orthologs
between M. truncatula and Arabidopsis sequences were predicted as described
above (scaling factor = 0.5) based on the phylogenies provided by AffyTrees.
To keep the results as comparable as possible, the same cutoffs used to generate
the phylogenies (i.e. >70% coverage of the query and E values better than 1e-5)
were used as a lower limit for analysis of the reciprocal best BLAST hits. BLAST
hits that did not satisfy these cutoffs were not taken into account. In cases where
multiple BLAST hits had identical best E values, all of these best hits were taken
into account. This made it possible for some genes to be assigned multiple
reciprocal best BLAST hits. The method of orthology prediction we describe
allows genes in one species to be assigned multiple orthologs in another. In such
cases, all of the predicted sequence orthologs were taken into account. A
noticeable discrepancy was apparent in the number of predicted sequence
orthologs compared to the number of reciprocal best BLAST hits. To keep both
approaches of detecting sequence orthologs as comparable as possible, we
compared reciprocal AffyTrees orthologs to the reciprocal best BLAST hits. This
allowed both methods to use “reciprocality” as a further criterion to reduce the
number of false-positive orthology predictions.
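The reciprocal best BLAST hit procedure with the cutoffs and tie handling described above can be sketched as follows. The `Hit` record and the function names are assumptions made for this illustration; parsing of actual BLAST output is not shown, and coverage is assumed to be available as a fraction of the query length.

```python
from collections import defaultdict
from typing import NamedTuple


class Hit(NamedTuple):
    query: str
    subject: str
    evalue: float
    query_coverage: float  # fraction of the query covered, 0..1 (assumed field)


def best_hits(hits, max_evalue=1e-5, min_coverage=0.70):
    """Map each query to the set of subjects tied for its best E value."""
    by_query = defaultdict(list)
    for h in hits:
        # Apply the same cutoffs used to build the phylogenies
        if h.evalue < max_evalue and h.query_coverage > min_coverage:
            by_query[h.query].append(h)
    best = {}
    for query, kept in by_query.items():
        top = min(h.evalue for h in kept)
        best[query] = {h.subject for h in kept if h.evalue == top}  # keep ties
    return best


def reciprocal_best_hits(a_vs_b, b_vs_a, **cutoffs):
    """Pairs (a, b) where b is a best hit of a and a is a best hit of b."""
    forward = best_hits(a_vs_b, **cutoffs)
    reverse = best_hits(b_vs_a, **cutoffs)
    return {
        (a, b)
        for a, subjects in forward.items()
        for b in subjects
        if a in reverse.get(b, set())
    }


# Hypothetical hits: a1 has two tied best hits; a2's only hit fails the coverage cutoff
a_vs_b = [
    Hit("a1", "b1", 1e-50, 0.9),
    Hit("a1", "b2", 1e-50, 0.9),
    Hit("a1", "b3", 1e-10, 0.9),
    Hit("a2", "b3", 1e-20, 0.5),
]
b_vs_a = [
    Hit("b1", "a1", 1e-40, 0.8),
    Hit("b2", "a2", 1e-30, 0.8),
    Hit("b3", "a1", 1e-8, 0.9),
]
rbh = reciprocal_best_hits(a_vs_b, b_vs_a)
```

Because tied best hits are all retained, a gene can end up with more than one reciprocal best hit, as noted in the text.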
For each plant species, the Affymetrix CEL files of the experiments we
wanted to compare were normalized using both GCRMA (Wu et al., 2004) and
MAS5 (Hubbell et al., 2002) for comparison. All experimental files for a species
were normalized at the same time, as normalizing each set of experiments
individually would have artificially increased the differences observed between the experimental conditions. Linear correlation coefficients were calculated using the average expression value of each gene over the three available
experimental replicates.
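A minimal sketch of the replicate averaging step is given below. It assumes, purely for illustration, that the normalized values of one gene are stored with the three replicates of each condition next to one another; the function name is hypothetical, and the normalization itself (GCRMA or MAS5) is not reproduced here.

```python
from statistics import fmean


def average_replicates(values, n_replicates=3):
    """Collapse consecutive replicate measurements into one mean per condition."""
    if len(values) % n_replicates != 0:
        raise ValueError("number of values must be a multiple of n_replicates")
    return [
        fmean(values[i:i + n_replicates])
        for i in range(0, len(values), n_replicates)
    ]


# Hypothetical normalized values: two conditions, three replicates each
profile = average_replicates([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
```

The averaged profiles of two genes, ordered by the same conditions, are then the inputs to the linear correlation coefficient.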
Availability and Requirements
The tool is freely accessible at http://bioinfoserver.rsbs.anu.edu.au/utils/affytrees/. Further information and help is available at http://bioinfoserver.rsbs.anu.edu.au/utils/affytrees/help.php. JavaScript should be enabled in
the browser and a Java 1.5 or above browser plugin should be installed for
visualization of phylogenetic trees.
Received September 23, 2007; accepted December 3, 2007; published
December 7, 2007.
LITERATURE CITED
Alexeyenko A, Tamas I, Liu G, Sonnhammer EL (2006) Automatic clustering of orthologs and inparalogs shared by multiple proteomes.
Bioinformatics 22: e9–e15
Altschul SF, Madden TL, Schaffer AA, Zhang J, Zhang Z, Miller W,
Lipman DJ (1997) Gapped BLAST and PSI-BLAST: a new generation of
protein database search programs. Nucleic Acids Res 25: 3389–3402
Chiu JC, Lee EK, Egan MG, Sarkar IN, Coruzzi GM, DeSalle R (2006)
OrthologID: automation of genome-scale ortholog identification within
a parsimony framework. Bioinformatics 22: 699–707
Edgar R, Domrachev M, Lash AE (2002) Gene Expression Omnibus: NCBI
gene expression and hybridization array data repository. Nucleic Acids
Res 30: 207–210
Frickey T, Lupas AN (2004) PhyloGenie: automated phylome generation
and analysis. Nucleic Acids Res 32: 5231–5238
Horan K, Lauricha J, Bailey-Serres J, Raikhel N, Girke T (2005) Genome
cluster database. A sequence family analysis platform for Arabidopsis
and rice. Plant Physiol 138: 47–54
Hubbell E, Liu WM, Mei R (2002) Robust estimators for expression
analysis. Bioinformatics 18: 1585–1592
Johnson X, Brcich T, Dun EA, Goussot M, Haurogne K, Beveridge CA,
Rameau C (2006) Branching genes are conserved across species. Genes
controlling a novel signal in pea are coregulated by other long-distance
signals. Plant Physiol 142: 1014–1026
Koski LB, Golding GB (2001) The closest BLAST hit is often not the nearest
neighbor. J Mol Evol 52: 540–542
Li L, Stoeckert CJ Jr, Roos DS (2003) OrthoMCL: identification of ortholog
groups for eukaryotic genomes. Genome Res 13: 2178–2189
Lim WK, Wang K, Lefebvre C, Califano A (2007) Comparative analysis of
microarray normalization procedures: effects on reverse engineering
gene networks. Bioinformatics 23: 282–288
O’Brien KP, Remm M, Sonnhammer EL (2005) Inparanoid: a comprehensive database of eukaryotic orthologs. Nucleic Acids Res 33: D476–480
Saitou N, Nei M (1987) The neighbor-joining method: a new method for
reconstructing phylogenetic trees. Mol Biol Evol 4: 406–425
Schmid M, Davison TS, Henz SR, Pape UJ, Demar M, Vingron M,
Scholkopf B, Weigel D, Lohmann JU (2005) A gene expression map of
Arabidopsis thaliana development. Nat Genet 37: 501–506
Tatusov RL, Koonin EV, Lipman DJ (1997) A genomic perspective on
protein families. Science 278: 631–637
Tatusov RL, Fedorova ND, Jackson JD, Jacobs AR, Kiryutin B, Koonin EV,
Krylov DM, Mazumder R, Mekhedov SL, Nikolskaya AN, et al (2003)
The COG database: an updated version includes eukaryotes. BMC
Bioinformatics 4: 41
Wu Z, Irizarry RA, Gentleman R, Murillo FM, Spencer F (2004) A model based background adjustment for oligonucleotide expression arrays.
Technical Report. Department of Biostatistics Working Papers. Johns
Hopkins University, Baltimore, MD