BIOINFORMATICS APPLICATIONS NOTE
Vol. 20 no. 18 2004, pages 3643–3646
doi:10.1093/bioinformatics/bth397
DAGchainer: a tool for mining segmental genome
duplications and synteny
Brian J. Haas∗,†, Arthur L. Delcher†, Jennifer R. Wortman and
Steven L. Salzberg
The Institute for Genomic Research, 9712 Medical Center Drive, Rockville,
MD 20850, USA
Received on May 26, 2004; accepted on June 29, 2004
Advance Access publication July 9, 2004
ABSTRACT
Summary: Given the positions of protein-coding genes along
genomic sequence and probability values for protein alignments between genes, DAGchainer identifies chains of gene
pairs sharing conserved order between genomic regions, by
identifying paths through a directed acyclic graph (DAG).
These chains of collinear gene pairs can represent segmentally duplicated regions and genes within a single genome
or syntenic regions between related genomes. Automated
mining of the Arabidopsis genome for segmental duplications
illustrates the use of DAGchainer.
Contact: [email protected]
INTRODUCTION
The occurrence of collinear gene order in comparisons
between genomes can have varied implications. For distantly
related organisms, it may be indicative of some functional
relevance, such as coordinated regulation of gene expression as found in bacterial operons (Ermolaeva et al., 2001).
For more closely related organisms, collinear gene order, in
this context termed synteny, is expected simply because not
enough evolutionary time has passed to accumulate change
(Bennetzen and Freeling, 1997); breaks in gene order are often
due to large-scale genome rearrangements including chromosomal segment inversions, deletions and translocations. The
identification of syntenic regions between related genomes is
required in order to fully leverage scientific knowledge gained
from the study of model organisms with sequenced genomes,
such as from mouse to human (Nobrega and Pennacchio,
2004), and from Arabidopsis and rice to important plant crop
species, including broccoli, corn and wheat (Hall et al., 2002;
Shimamoto and Kyozuka, 2002).
∗ To whom correspondence should be addressed.
† The authors wish it to be known that, in their opinion, the first two authors should be regarded as joint First Authors.
Bioinformatics vol. 20 issue 18 © Oxford University Press 2004; all rights reserved.
In addition to detecting homologous regions between
different genomes, internally duplicated chromosome segments are readily recognized by the conserved ordering of
paralogous genes. Whole-genome duplications are part of the
evolutionary history of eukaryotes, including yeast (Wolfe and
Shields, 1997), Arabidopsis (Blanc et al., 2003; Ermolaeva
et al., 2003) and vertebrates (Durand, 2003). The duplicated
genome segments yield a source of functional redundancy and
provide substrates for the evolution of new gene functions
(Kondrashov et al., 2002). Dissecting the duplicated regions
within a genome provides us with insight into the molecular events responsible for the current genome architecture,
and enables researchers to consider the potential functional
redundancy provided by gene duplicates as they attempt to
unravel the function of each gene.
Given raw genome sequences, genome architecture can
be studied by performing comparisons at the nucleotide
level. Tools exist for such comparisons, including MUMmer
(Kurtz et al., 2004), BLASTZ (Schwartz et al., 2003) and
LAGAN (Brudno et al., 2003). These are powerful, computationally intensive tools and, in the absence of gene
annotations, provide an excellent starting point for identifying candidate duplicated or syntenic regions. They do
not, however, directly yield the identity of the syntenic or
duplicated genes.
With fully annotated genomes, alternative approaches can
be taken which rely more heavily or exclusively on gene
content. DiagHunter coupled with GenoPix2D (Cannon et al.,
2003) directly identifies collinear genes by finding diagonals of paired genes in a two-dimensional (2D) plot, where
each axis represents a contiguous genomic sequence and
genes are paired by BLAST (Altschul et al., 1997) matches.
DAGchainer operates within the same paradigm, but computes chains of collinear genes using a distinctly different
algorithm and scoring function, and provides an alternative
tool for mining gene synteny and genome duplications in the
post-genomic era.
Fig. 1. Chromosome segmental duplication map for Arabidopsis thaliana as computed by DAGchainer. Dot plots are shown for each pairwise
comparison between chromosomes. Chains of duplicated genes found in the same forward order on both chromosomes are shown in red and
inversions are shown in blue.
PROGRAM OVERVIEW
DAGchainer requires a single tab-delimited input file which
describes genes paired by BLAST matches and the E-value
for each match. Each gene description includes the identity of
the genomic sequence that contains it, a unique gene identifier
and the coordinate span for the gene on the sequence.
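As a rough illustration of this input, the sketch below reads one tab-delimited record per gene pair and reduces each gene's coordinate span to its midpoint. The column order (sequence, gene identifier, start, end for the first gene, the same four fields for the second gene, then the E-value) is an assumption made for the example and is not specified in the text; `GenePair` and `parse_pairs` are hypothetical names.

```python
import csv
import io
from typing import List, NamedTuple


class GenePair(NamedTuple):
    """One BLAST-matched gene pair (hypothetical record layout)."""
    seq_a: str
    gene_a: str
    mid_a: float   # midpoint of gene A's coordinate span
    seq_b: str
    gene_b: str
    mid_b: float   # midpoint of gene B's coordinate span
    evalue: float


def parse_pairs(handle) -> List[GenePair]:
    """Parse tab-delimited gene-pair records.

    Assumed columns (not confirmed by the text):
    seq_a, gene_a, start_a, end_a, seq_b, gene_b, start_b, end_b, evalue
    """
    pairs = []
    for row in csv.reader(handle, delimiter="\t"):
        if not row or row[0].startswith("#"):
            continue  # skip blank lines and comment lines
        seq_a, gene_a, s_a, e_a, seq_b, gene_b, s_b, e_b, evalue = row
        pairs.append(GenePair(
            seq_a, gene_a, (int(s_a) + int(e_a)) / 2,
            seq_b, gene_b, (int(s_b) + int(e_b)) / 2,
            float(evalue),
        ))
    return pairs


# A single made-up record, for illustration only.
example = "chr1\tgeneA1\t1000\t2000\tchr2\tgeneB1\t5000\t7000\t1e-50\n"
pairs = parse_pairs(io.StringIO(example))
```

Each parsed record then carries what a chaining step needs: the two sequence identities, one position per gene, and the E-value of the match.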
To compare two ordered sequences of genes, $A = a_1, \ldots, a_m$
and $B = b_1, \ldots, b_n$, we regard each gene pair
$(a_i, b_j)$ as a point $(x_i, y_j)$ in 2D space, where $x_i$ and $y_j$
are the midpoint positions of genes $a_i$ and $b_j$ on their
respective genomic sequences, with $x_1 < x_2 < \cdots < x_m$ and
$y_1 < y_2 < \cdots < y_n$. In an order-preserving chain of gene
pairs, gene pair $(a_i, b_j)$ can precede pair $(a_k, b_\ell)$ if and only
if $i < k$ and $j < \ell$. Thus, we have a directed acyclic graph
(DAG), where nodes are gene pairs and edges connect pairs
that can be in chains. Paths in this graph correspond exactly
to conserved-order gene chains.
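The precedence relation can be made concrete with a small sketch. Because positions increase strictly with gene index, comparing indices is equivalent to comparing midpoint coordinates, which is what the code below does. The function names and the four example points are hypothetical.

```python
from typing import List, Tuple

Point = Tuple[float, float]  # (x, y): midpoints of the paired genes


def precedes(u: Point, v: Point) -> bool:
    """True if gene pair u can come before gene pair v in a chain."""
    return u[0] < v[0] and u[1] < v[1]


def dag_edges(points: List[Point]) -> List[Tuple[Point, Point]]:
    """List every edge (u, v) of the DAG over the given gene pairs."""
    nodes = sorted(points)
    return [(u, v) for u in nodes for v in nodes if precedes(u, v)]


# Four made-up gene pairs; the second and third are out of order in y.
points = [(10.0, 100.0), (20.0, 300.0), (30.0, 200.0), (40.0, 400.0)]
edges = dag_edges(points)
```

Here the pairs at (20, 300) and (30, 200) are not connected in either direction, so no single chain can contain both. The graph is acyclic because every edge strictly increases both coordinates.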
We define a score for each path that combines the quality of
the matches and their proximity to one another. Specifically,
the score for a path is the sum of:
• The match score of each gene-pair node $\nu$ on the path,
defined as $\mathrm{MatchScore}(\nu) = \min\{-\log_{10} \text{E-val}(\nu),\ \mathrm{MaxMatch}\}$,
where $\text{E-val}(\nu)$ is the E-value of the
gene-pair match and MaxMatch is a configurable parameter; and
• A gap penalty between each two consecutive gene pairs
$u = (a_i, b_j)$ and $\nu = (a_k, b_\ell)$ on the path, which takes
into account the distance between gene pairs and diagonal
properties, defined by
\[
\mathrm{NumGaps} = \left\lfloor \frac{(x_k - x_i) + (y_\ell - y_j) + \left|(x_k - x_i) - (y_\ell - y_j)\right|}{2 \cdot \mathrm{GapUnitLen}} + 0.5 \right\rfloor
\]
\[
\mathrm{GapPenalty}(u, \nu) =
\begin{cases}
-\infty & \text{if } \max\{x_k - x_i,\ y_\ell - y_j\} > \mathrm{MaxDist} \\
\mathrm{GapOpen} + (\mathrm{NumGaps} \cdot \mathrm{GapExtend}) & \text{if } \mathrm{NumGaps} > 0 \\
0 & \text{otherwise,}
\end{cases}
\]
where GapOpen, GapExtend, GapUnitLen and
MaxDist are all configurable parameters.
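A minimal Python sketch of these scoring terms is given below, together with the dynamic-programming recurrence and repeated chain extraction that the following paragraphs describe. It is not the authors' C++ implementation. All function names are hypothetical, and the parameter values are the defaults quoted in the example application. NumGaps is read here as a rounded count (the floor of the expression plus 0.5) with an absolute value on the difference term; that reading, the handling of an E-value of zero, and the recomputation of all scores after each chain is removed are assumptions of this sketch.

```python
import math
from typing import List, Tuple

# Defaults quoted in the example application section.
GAP_UNIT_LEN = 10_000
GAP_OPEN = 0.0
GAP_EXTEND = -3.0
MAX_MATCH = 50.0
MAX_DIST = 200_000

Node = Tuple[float, float, float]  # (x midpoint, y midpoint, E-value)


def match_score(evalue: float) -> float:
    """min(-log10(E-value), MaxMatch); an E-value of 0 is capped (assumption)."""
    if evalue <= 0.0:
        return MAX_MATCH
    return min(-math.log10(evalue), MAX_MATCH)


def num_gaps(dx: float, dy: float) -> int:
    """Rounded number of gap units; dx + dy + |dx - dy| equals 2 * max(dx, dy)."""
    return math.floor((dx + dy + abs(dx - dy)) / (2 * GAP_UNIT_LEN) + 0.5)


def gap_penalty(u: Node, v: Node) -> float:
    """Penalty for placing v directly after u on a path."""
    dx, dy = v[0] - u[0], v[1] - u[1]
    if max(dx, dy) > MAX_DIST:
        return -math.inf
    n = num_gaps(dx, dy)
    return GAP_OPEN + n * GAP_EXTEND if n > 0 else 0.0


def best_chain(nodes: List[Node]) -> Tuple[float, List[Node]]:
    """Highest-scoring path: PathScore(v) = MatchScore(v) + max(0, best predecessor)."""
    nodes = sorted(nodes)
    if not nodes:
        return 0.0, []
    score = [0.0] * len(nodes)
    back = [-1] * len(nodes)
    for k, v in enumerate(nodes):
        best, prev = 0.0, -1  # 0 means "start a new path at v"
        for i in range(k):
            u = nodes[i]
            if u[0] < v[0] and u[1] < v[1]:  # u precedes v
                cand = score[i] + gap_penalty(u, v)
                if cand > best:
                    best, prev = cand, i
        score[k] = match_score(v[2]) + best
        back[k] = prev
    end = max(range(len(nodes)), key=score.__getitem__)
    top, chain = score[end], []
    while end != -1:  # follow back-pointers to recover the path
        chain.append(nodes[end])
        end = back[end]
    return top, chain[::-1]


def all_chains(nodes: List[Node], min_score: float) -> List[Tuple[float, List[Node]]]:
    """Repeatedly extract the best chain, remove its nodes, and search again."""
    remaining, chains = list(nodes), []
    while remaining:
        top, chain = best_chain(remaining)
        if top <= min_score:
            break
        chains.append((top, chain))
        used = set(chain)
        remaining = [n for n in remaining if n not in used]
    return chains


# Four made-up gene pairs: three collinear, one off the diagonal.
nodes = [(10_000, 110_000, 1e-60), (20_000, 120_000, 1e-30),
         (30_000, 90_000, 1e-40), (40_000, 140_000, 1e-20)]
top, chain = best_chain(nodes)
```

With these inputs the three collinear pairs form the best chain, and the off-diagonal pair is left to form a chain of its own only if the minimum score threshold allows it. Recomputing every score after each removal keeps the sketch short at the cost of speed.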
Dynamic programming is then used to find the highest-scoring
paths, where the score of the best path ending at node $\nu$ is
given by
\[
\mathrm{PathScore}(\nu) = \mathrm{MatchScore}(\nu) + \max_{u \text{ precedes } \nu} \{\mathrm{PathScore}(u) + \mathrm{GapPenalty}(u, \nu),\ 0\}.
\]
After finding the highest-scoring path in the DAG, its nodes
are removed and the next highest-scoring path is found. This
cycle continues until there is no path scoring higher than a
specified minimum score threshold. It is worth noting that
node scores are recomputed only when the highest-scoring
path contains previously removed nodes, thereby speeding up
the computation.
To find chains of corresponding genes in the opposite
orientation, the DAG is recreated with a reversed version of
the second genomic sequence. All subsequent calculations are
performed exactly as described above.
SOFTWARE IMPLEMENTATION
The DAGchainer algorithm was implemented as a C++
program. A Perl script wrapper handles parsing of
the input data and invoking the C++ program for
each pairwise comparison between genomic contigs in
the input dataset. The software is freely available at
<ftp://ftp.tigr.org/pub/software/DAGchainer>.
EXAMPLE APPLICATION: ARABIDOPSIS GENOME DUPLICATION
The Arabidopsis genome architecture and annotation have
been reviewed heavily since the sequence was declared
finished (Blanc et al., 2003; Wortman et al., 2003), and therefore provide a useful reference for evaluating the efficacy of
newly developed bioinformatics tools. Using the latest release
(version 5) of the Arabidopsis genome annotation (available at
ftp://ftp.tigr.org/pub/data/a_thaliana/ath1), DAGchainer was
employed to identify genes found in chromosomal segmental duplications (Fig. 1). The proteins corresponding
to the splicing isoform encoding the longest open reading
frame of each of 26 207 protein-coding genes were searched
all-against-all using WU-BLASTP [Gish, W. (1996–2004),
http://blast.wustl.edu, parameters ‘V=5 B=5 E=1e-10
-filter seg’]. DAGchainer was run with default settings:
GapUnitLen = 10 000, GapOpen = 0, GapExtend = −3,
MaxMatch = 50, MaxDist = 200 000 and only chains with
at least six gene pairs were reported. DAGchainer processed a
dataset of 40 834 gene pairs and identified 8153 genes found
in 292 chains within 6 s using a single 2.4 GHz Pentium 4
processor. There were 129 chains that contained more than
10 members and the longest chain contained 287 gene pairs.
The results are mostly consistent with those reported by Blanc
et al. (2003), with agreement among 94 and 92% of the genes classified as recent and old gene duplicates, respectively; DAGchainer
also reported over 2000 additional candidate gene duplicates.
ACKNOWLEDGEMENTS
This research was supported in part by NIH grant
R01-LM007938 to S.L.S.
REFERENCES
Altschul,S.F., Madden,T.L., Schaffer,A.A., Zhang,J., Zhang,Z.,
Miller,W. and Lipman,D.J. (1997) Gapped BLAST and PSI-BLAST: a new generation of protein database search programs.
Nucleic Acids Res., 25, 3389–3402.
Bennetzen,J.L. and Freeling,M. (1997) The unified grass genome:
synergy in synteny. Genome Res., 7, 301–306.
Blanc,G., Hokamp,K. and Wolfe,K.H. (2003) A recent polyploidy
superimposed on older large-scale duplications in the Arabidopsis
genome. Genome Res., 13, 137–144.
Brudno,M., Do,C.B., Cooper,G.M., Kim,M.F., Davydov,E.,
Green,E.D., Sidow,A. and Batzoglou,S. (2003) LAGAN and
Multi-LAGAN: efficient tools for large-scale multiple alignment
of genomic DNA. Genome Res., 13, 721–731.
Cannon,S.B., Kozik,A., Chan,B., Michelmore,R. and Young,N.D.
(2003) DiagHunter and GenoPix2D: programs for genomic
comparisons, large-scale homology discovery and visualization.
Genome Biol., 4, R68.
Durand,D. (2003) Vertebrate evolution: doubling and shuffling with
a full deck. Trends Genet., 19, 2–5.
Ermolaeva,M.D., White,O. and Salzberg,S.L. (2001) Prediction
of operons in microbial genomes. Nucleic Acids Res.,
29, 1216–1221.
Ermolaeva,M.D., Wu,M., Eisen,J.A. and Salzberg,S.L. (2003) The
age of the Arabidopsis thaliana genome duplication. Plant Mol.
Biol., 51, 859–866.
Hall,A.E., Fiebig,A. and Preuss,D. (2002) Beyond the Arabidopsis
genome: opportunities for comparative genomics. Plant Physiol.,
129, 1439–1447.
Kondrashov,F.A., Rogozin,I.B., Wolf,Y.I. and Koonin,E.V. (2002)
Selection in the evolution of gene duplications. Genome Biol., 3,
RESEARCH0008.
Kurtz,S., Phillippy,A., Delcher,A.L., Smoot,M., Shumway,M.,
Antonescu,C. and Salzberg,S.L. (2004) Versatile and open
software for comparing large genomes. Genome Biol., 5, R12.
Nobrega,M.A. and Pennacchio,L.A. (2004) Comparative genomic
analysis as a tool for biological discovery. J. Physiol.,
554, 31–39.
Schwartz,S., Kent,W.J., Smit,A., Zhang,Z., Baertsch,R.,
Hardison,R.C., Haussler,D. and Miller,W. (2003) Human–mouse
alignments with BLASTZ. Genome Res., 13, 103–107.
Shimamoto,K. and Kyozuka,J. (2002) Rice as a model for comparative genomics of plants. Annu. Rev. Plant Biol., 53, 399–419.
Wolfe,K.H. and Shields,D.C. (1997) Molecular evidence for an
ancient duplication of the entire yeast genome. Nature, 387,
708–713.
Wortman,J.R., Haas,B.J., Hannick,L.I., Smith,R.K.,Jr, Maiti,R.,
Ronning,C.M., Chan,A.P., Yu,C., Ayele,M., Whitelaw,C.A.,
White,O.R. and Town,C.D. (2003) Annotation of the Arabidopsis
genome. Plant Physiol., 132, 461–468.