BIOINFORMATICS APPLICATIONS NOTE
Vol. 20 no. 18 2004, pages 3643–3646
doi:10.1093/bioinformatics/bth397

DAGchainer: a tool for mining segmental genome duplications and synteny

Brian J. Haas∗,†, Arthur L. Delcher†, Jennifer R. Wortman and Steven L. Salzberg
The Institute for Genomic Research, 9712 Medical Center Drive, Rockville, MD 20850, USA

Received on May 26, 2004; accepted on June 29, 2004
Advance Access publication July 9, 2004

ABSTRACT
Summary: Given the positions of protein-coding genes along genomic sequence and probability values for protein alignments between genes, DAGchainer identifies chains of gene pairs sharing conserved order between genomic regions, by identifying paths through a directed acyclic graph (DAG). These chains of collinear gene pairs can represent segmentally duplicated regions and genes within a single genome or syntenic regions between related genomes. Automated mining of the Arabidopsis genome for segmental duplications illustrates the use of DAGchainer.
Contact: [email protected]

INTRODUCTION
The occurrence of collinear gene order in comparisons between genomes can have varied implications. For distantly related organisms, it may be indicative of some functional relevance, such as coordinated regulation of gene expression as found in bacterial operons (Ermolaeva et al., 2001). For more closely related organisms, collinear gene order, in this context termed synteny, is expected simply because not enough evolutionary time has passed to accumulate change (Bennetzen and Freeling, 1997); breaks in gene order are often due to large-scale genome rearrangements including chromosomal segment inversions, deletions and translocations.
The identification of syntenic regions between related genomes is required in order to fully leverage scientific knowledge gained from the study of model organisms with sequenced genomes, such as from mouse to human (Nobrega and Pennacchio, 2004), and from Arabidopsis and rice to important plant crop species, including broccoli, corn and wheat (Hall et al., 2002; Shimamoto and Kyozuka, 2002).

In addition to homologous regions between different genomes, internally duplicated chromosome segments are readily recognized by the conserved ordering of paralogous genes. Whole-genome duplications are part of the evolutionary history of eukaryotes, including yeast (Wolfe and Shields, 1997), Arabidopsis (Blanc et al., 2003; Ermolaeva et al., 2003) and vertebrates (Durand, 2003). The duplicated genome segments yield a source of functional redundancy and provide substrates for the evolution of new gene functions (Kondrashov et al., 2002). Dissecting the duplicated regions within a genome provides us with insight into the molecular events responsible for the current genome architecture, and enables researchers to consider the potential functional redundancy provided by gene duplicates as they attempt to unravel the function of each gene.

∗ To whom correspondence should be addressed.
† The authors wish it to be known that, in their opinion, the first two authors should be regarded as joint First Authors.
Bioinformatics vol. 20 issue 18 © Oxford University Press 2004; all rights reserved.

Given raw genome sequences, genome architecture can be studied by performing comparisons at the nucleotide level. Tools exist for such comparisons, including MUMmer (Kurtz et al., 2004), BLASTZ (Schwartz et al., 2003) and LAGAN (Brudno et al., 2003). These are powerful, computationally intensive tools and, in the absence of gene annotations, provide an excellent starting point for identifying candidate duplicated or syntenic regions.
They do not, however, directly yield the identity of the syntenic or duplicated genes. With fully annotated genomes, alternative approaches can be taken which rely more heavily or exclusively on gene content. DiagHunter coupled with GenoPix2D (Cannon et al., 2003) directly identifies collinear genes by finding diagonals of paired genes in a two-dimensional (2D) plot, where each axis represents a contiguous genomic sequence and genes are paired by BLAST (Altschul et al., 1997) matches. DAGchainer operates within the same paradigm, but computes chains of collinear genes using a distinctly different algorithm and scoring function, and provides an alternative tool for mining gene synteny and genome duplications in the post-genomic era.

PROGRAM OVERVIEW
DAGchainer requires a single tab-delimited input file which describes genes paired by BLAST matches and the E-value for each match. Each gene description includes the identity of the genomic sequence that contains it, a unique gene identifier and the coordinate span for the gene on the sequence.

Fig. 1. Chromosome segmental duplication map for Arabidopsis thaliana as computed by DAGchainer. Dot plots are shown for each pairwise comparison between chromosomes. Chains of duplicated genes found in the same forward order on both chromosomes are shown in red and inversions are shown in blue.

To compare two ordered sequences of genes, $A = a_1, \ldots, a_m$ and $B = b_1, \ldots, b_n$, we regard each gene pair $(a_i, b_j)$ as a point $(x_i, y_j)$ in 2D space, where $x_i$ and $y_j$ are the midpoint positions of genes $a_i$ and $b_j$ on their respective genomic sequences, with $x_1 < x_2 < \cdots < x_m$ and $y_1 < y_2 < \cdots < y_n$. In an order-preserving chain of gene pairs, gene pair $(a_i, b_j)$ can precede pair $(a_k, b_\ell)$ if and only if $i < k$ and $j < \ell$. Thus, we have a directed acyclic graph (DAG), where nodes are gene pairs and edges connect pairs that can be in chains.
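The order-preserving relation just defined can be sketched in Python as follows. This is a minimal illustration and not the DAGchainer implementation, which is written in C++; the names `GenePair`, `midpoint`, `can_precede` and `build_edges` and the coordinates are hypothetical. The sketch compares midpoint coordinates rather than indices, which is equivalent because the midpoints are strictly increasing along each sequence, and it lists every edge explicitly without applying any distance cutoff.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class GenePair:
    """One matched gene pair, placed at the midpoints of its two genes."""
    x: float  # midpoint of gene a_i on sequence A
    y: float  # midpoint of gene b_j on sequence B


def midpoint(start: int, end: int) -> float:
    """Midpoint of a gene's coordinate span."""
    return (start + end) / 2.0


def can_precede(u: GenePair, v: GenePair) -> bool:
    """True if u may come before v in an order-preserving chain."""
    return u.x < v.x and u.y < v.y


def build_edges(pairs: List[GenePair]) -> Dict[int, List[int]]:
    """Map each node index to the indices of the nodes it can precede."""
    return {
        i: [k for k, v in enumerate(pairs) if can_precede(u, v)]
        for i, u in enumerate(pairs)
    }


# Hypothetical gene spans: (start, end) on sequence A, then on sequence B.
pairs = [
    GenePair(midpoint(1000, 2000), midpoint(5000, 6000)),
    GenePair(midpoint(3000, 4000), midpoint(7000, 8000)),
    GenePair(midpoint(5000, 6000), midpoint(6500, 6900)),
]
edges = build_edges(pairs)
```

Here the first pair can precede both others, while the second cannot precede the third because its position on sequence B is larger. Because every edge points toward strictly larger coordinates on both axes, the graph cannot contain a cycle.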
Paths in this graph correspond exactly to conserved-order gene chains. We define a score for each path that combines the quality of the matches and their proximity to one another. Specifically, the score for a path is the sum of:

• The match score of each gene-pair node $\nu$ on the path, defined as
\[
\mathrm{MatchScore}(\nu) = \min\{-\log_{10} \text{E-val}(\nu),\ \mathrm{MaxMatch}\},
\]
where E-val($\nu$) is the E-value of the DAGchainer gene-pair match and MaxMatch is a configurable parameter; and

• A gap penalty between each two consecutive gene pairs $u = (a_i, b_j)$ and $\nu = (a_k, b_\ell)$ on the path, which takes into account the distance between gene pairs and diagonal properties, defined by
\[
\mathrm{NumGaps} = \left\lfloor \frac{(x_k - x_i) + (y_\ell - y_j) + \left|(x_k - x_i) - (y_\ell - y_j)\right|}{2 \cdot \mathrm{GapUnitLen}} + 0.5 \right\rfloor
\]
\[
\mathrm{GapPenalty}(u, \nu) =
\begin{cases}
-\infty & \text{if } \max\{x_k - x_i,\ y_\ell - y_j\} > \mathrm{MaxDist} \\
\mathrm{GapOpen} + (\mathrm{NumGaps} \cdot \mathrm{GapExtend}) & \text{if } \mathrm{NumGaps} > 0 \\
0 & \text{otherwise,}
\end{cases}
\]
where GapOpen, GapExtend, GapUnitLen and MaxDist are all configurable parameters.

Dynamic programming is then used to find the highest-scoring paths, where the score of the best path ending at node $\nu$ is given by
\[
\mathrm{PathScore}(\nu) = \mathrm{MatchScore}(\nu) + \max_{u \text{ precedes } \nu} \{\mathrm{PathScore}(u) + \mathrm{GapPenalty}(u, \nu),\ 0\}.
\]
After finding the highest-scoring path in the DAG, its nodes are removed and the next highest-scoring path is found. This cycle continues until there is no path scoring higher than a specified minimum score threshold. It is worth noting that node scores are recomputed only when the highest-scoring path contains previously removed nodes, thereby speeding up the computation.

To find chains of corresponding genes in the opposite orientation, the DAG is recreated with a reversed version of the second genomic sequence. All subsequent calculations are performed exactly as described above.

SOFTWARE IMPLEMENTATION
The DAGchainer algorithm was implemented as a C++ program. A Perl script wrapper handles parsing of the input data and invoking the C++ program for each pairwise comparison between genomic contigs in the input dataset. The software is freely available at <ftp://ftp.tigr.org/pub/software/DAGchainer>.

EXAMPLE APPLICATION: ARABIDOPSIS GENOME DUPLICATION
The Arabidopsis genome architecture and annotation have been reviewed heavily since the sequence was declared finished (Blanc et al., 2003; Wortman et al., 2003), and therefore provide a useful reference for evaluating the efficacy of newly developed bioinformatics tools. Using the latest release (version 5) of the Arabidopsis genome annotation (available at ftp://ftp.tigr.org/pub/data/a_thaliana/ath1), DAGchainer was employed to identify genes found in chromosomal segmental duplications (Fig. 1). The proteins corresponding to the splicing isoform encoding the longest open reading frame of each of 26 207 protein-coding genes were searched all-against-all using WU-BLASTP [Gish, W. (1996–2004), http://blast.wustl.edu, parameters ‘V=5 B=5 E=1e-10 -filter seg’]. DAGchainer was run with default settings: GapUnitLen = 10 000, GapOpen = 0, GapExtend = −3, MaxMatch = 50 and MaxDist = 200 000; only chains with at least six gene pairs were reported.

DAGchainer processed a dataset of 40 834 gene pairs and identified 8153 genes found in 292 chains within 6 s using a single 2.4 GHz Pentium 4 processor. There were 129 chains that contained more than 10 members and the longest chain contained 287 gene pairs. The results are mostly consistent with those reported by Blanc et al. (2003), with agreement among 94% and 92% of the genes classified as recent and old gene duplicates, respectively. DAGchainer also reported over 2000 additional candidate gene duplicates.

ACKNOWLEDGEMENTS
This research was supported in part by NIH grant R01-LM007938 to S.L.S.

REFERENCES
Altschul,S.F., Madden,T.L., Schaffer,A.A., Zhang,J., Zhang,Z., Miller,W. and Lipman,D.J. (1997) Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res., 25, 3389–3402.
Bennetzen,J.L. and Freeling,M. (1997) The unified grass genome: synergy in synteny.
Genome Res., 7, 301–306.
Blanc,G., Hokamp,K. and Wolfe,K.H. (2003) A recent polyploidy superimposed on older large-scale duplications in the Arabidopsis genome. Genome Res., 13, 137–144.
Brudno,M., Do,C.B., Cooper,G.M., Kim,M.F., Davydov,E., Green,E.D., Sidow,A. and Batzoglou,S. (2003) LAGAN and Multi-LAGAN: efficient tools for large-scale multiple alignment of genomic DNA. Genome Res., 13, 721–731.
Cannon,S.B., Kozik,A., Chan,B., Michelmore,R. and Young,N.D. (2003) DiagHunter and GenoPix2D: programs for genomic comparisons, large-scale homology discovery and visualization. Genome Biol., 4, R68.
Durand,D. (2003) Vertebrate evolution: doubling and shuffling with a full deck. Trends Genet., 19, 2–5.
Ermolaeva,M.D., White,O. and Salzberg,S.L. (2001) Prediction of operons in microbial genomes. Nucleic Acids Res., 29, 1216–1221.
Ermolaeva,M.D., Wu,M., Eisen,J.A. and Salzberg,S.L. (2003) The age of the Arabidopsis thaliana genome duplication. Plant Mol. Biol., 51, 859–866.
Hall,A.E., Fiebig,A. and Preuss,D. (2002) Beyond the Arabidopsis genome: opportunities for comparative genomics. Plant Physiol., 129, 1439–1447.
Kondrashov,F.A., Rogozin,I.B., Wolf,Y.I. and Koonin,E.V. (2002) Selection in the evolution of gene duplications. Genome Biol., 3, RESEARCH0008.
Kurtz,S., Phillippy,A., Delcher,A.L., Smoot,M., Shumway,M., Antonescu,C. and Salzberg,S.L. (2004) Versatile and open software for comparing large genomes. Genome Biol., 5, R12.
Nobrega,M.A. and Pennacchio,L.A. (2004) Comparative genomic analysis as a tool for biological discovery. J. Physiol., 554, 31–39.
Schwartz,S., Kent,W.J., Smit,A., Zhang,Z., Baertsch,R., Hardison,R.C., Haussler,D. and Miller,W. (2003) Human–mouse alignments with BLASTZ. Genome Res., 13, 103–107.
Shimamoto,K. and Kyozuka,J. (2002) Rice as a model for comparative genomics of plants. Annu. Rev. Plant Biol., 53, 399–419.
Wolfe,K.H. and Shields,D.C. (1997) Molecular evidence for an ancient duplication of the entire yeast genome. Nature, 387, 708–713.
Wortman,J.R., Haas,B.J., Hannick,L.I., Smith,R.K.,Jr, Maiti,R., Ronning,C.M., Chan,A.P., Yu,C., Ayele,M., Whitelaw,C.A., White,O.R. and Town,C.D. (2003) Annotation of the Arabidopsis genome. Plant Physiol., 132, 461–468.
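
As a supplementary illustration of the scoring scheme and dynamic program described under PROGRAM OVERVIEW, the following Python sketch computes match scores, gap penalties and path scores, then extracts chains one at a time. It is not the distributed C++ code. The function names, the example data and the `min_score` value are hypothetical; the parameter values are the defaults quoted in the example application. Two further assumptions are made: an E-value of 0 is given the score MaxMatch, and NumGaps is rounded to the nearest integer. For simplicity the sketch recomputes all scores after each chain is removed instead of recomputing selectively.

```python
import math
from typing import Dict, List, Optional, Tuple

# Default settings quoted in the example application.
GAP_UNIT_LEN = 10_000
GAP_OPEN = 0
GAP_EXTEND = -3
MAX_MATCH = 50
MAX_DIST = 200_000

Node = Tuple[float, float, float]  # (x midpoint, y midpoint, E-value)


def match_score(e_value: float) -> float:
    """min{-log10(E-value), MaxMatch}; E-value 0 is assumed to score MaxMatch."""
    if e_value <= 0.0:
        return float(MAX_MATCH)
    return min(-math.log10(e_value), float(MAX_MATCH))


def gap_penalty(u: Node, v: Node) -> float:
    """Penalty for stepping from u to v, where u is assumed to precede v."""
    dx, dy = v[0] - u[0], v[1] - u[1]
    if max(dx, dy) > MAX_DIST:
        return -math.inf
    num_gaps = math.floor((dx + dy + abs(dx - dy)) / (2 * GAP_UNIT_LEN) + 0.5)
    return float(GAP_OPEN + num_gaps * GAP_EXTEND) if num_gaps > 0 else 0.0


def best_chain(nodes: List[Node]) -> Tuple[float, List[int]]:
    """Return the score and node indices of the highest-scoring path."""
    if not nodes:
        return 0.0, []
    order = sorted(range(len(nodes)), key=lambda i: nodes[i][0])
    score: Dict[int, float] = {}
    back: Dict[int, Optional[int]] = {}
    for pos, v in enumerate(order):
        best_prev, best_val = None, 0.0  # 0 means "start a new path at v"
        for u in order[:pos]:
            if nodes[u][0] < nodes[v][0] and nodes[u][1] < nodes[v][1]:
                val = score[u] + gap_penalty(nodes[u], nodes[v])
                if val > best_val:
                    best_prev, best_val = u, val
        score[v] = match_score(nodes[v][2]) + best_val
        back[v] = best_prev
    end = max(score, key=lambda i: score[i])
    chain: List[int] = []
    cur: Optional[int] = end
    while cur is not None:  # follow back-pointers to the start of the path
        chain.append(cur)
        cur = back[cur]
    chain.reverse()
    return score[end], chain


def all_chains(nodes: List[Node], min_score: float) -> List[Tuple[float, List[int]]]:
    """Repeatedly take the best path and remove its nodes from the graph."""
    remaining = list(range(len(nodes)))
    chains = []
    while remaining:
        top, local = best_chain([nodes[i] for i in remaining])
        if top <= min_score:
            break
        chain = [remaining[i] for i in local]
        chains.append((top, chain))
        used = set(chain)
        remaining = [i for i in remaining if i not in used]
    return chains


# Hypothetical gene pairs: three near one diagonal and one far from it.
nodes = [
    (10_000.0, 10_000.0, 1e-30),
    (25_000.0, 22_000.0, 1e-60),
    (40_000.0, 35_000.0, 1e-20),
    (30_000.0, 500_000.0, 1e-40),
]
chains = all_chains(nodes, min_score=45.0)
```

With these values the three near-diagonal pairs form one chain scoring 88: match scores of 30, 50 (capped by MaxMatch) and 20, less two gap penalties of 6. The fourth pair lies more than MaxDist from the others on the second sequence, so it cannot join the chain, and on its own it scores below the hypothetical threshold.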