BIOINFORMATICS, Oxford University Press

CRAWview: for viewing splicing variation, gene families, and polymorphism in clusters of ESTs and full-length sequences

A. Chou and J. Burke

Abstract

Motivation: DNA sequence clustering has become a valuable method in support of gene discovery and gene expression analysis. Our interest lies in leveraging the sequence diversity within clusters of expressed sequence tags (ESTs) to model gene structure for the study of gene variants that arise from, among other things, alternative mRNA splicing, polymorphism, and divergence after gene duplication, fusion, and translocation events. In previous work, CRAW was developed to discover gene variants from assembled clusters of ESTs. Most importantly, novel gene features (the differing units between gene variants, for example alternative exons, polymorphisms, transposable elements, etc.) that are specialized to tissue, disease, population, or developmental states can be identified when these tools collate DNA source information with gene variant discrimination. While the goal is complete automation of novel feature and gene variant detection, current methods are far from perfect, and hence the development of effective tools for visualization and exploratory data analysis is of paramount importance in the process of sifting through candidate genes and validating targets.

Results: We present CRAWview, a Java-based visualization extension to CRAW. Features that vary between gene forms are displayed using an automatically generated color-coded index. The reporting format of CRAWview gives a brief, high-level summary report to display overlap and divergence within clusters of sequences, as well as the ability to ‘drill down’ and see detailed information concerning regions of interest. Additionally, the alignment viewing and editing capabilities of CRAWview make it possible to interactively correct frame-shifts and otherwise edit cluster assemblies. We have implemented CRAWview as a Java application across Windows NT/95 and UNIX platforms.

Availability: A beta version of CRAWview will be freely available to academic users from Pangea Systems (http://www.pangeasystems.com).

Contact: [email protected]

Introduction

The large quantity of single-read sequence from the ends of sufficiently expressed mRNAs (known as Expressed Sequence Tags or ESTs; Wilcox et al., 1991; Adams et al., 1991; Okubo et al., 1991) has led to the discovery of many genes before the completion of genomic sequencing of the human or other organismal genomes (Adams et al., 1992; Venter, 1993; Matsubara and Okubo, 1993). EST data has also facilitated large-scale expression studies (Okubo et al., 1992, 1994; Adams et al., 1995), the construction of a physical map of the genome (Hudson et al., 1995), and a gene map that localizes many genes with respect to markers of the physical map (Schuler et al., 1996). The creation of standardized data repositories (Boguski et al., 1993; Benson et al., 1994) has improved the reliability and concurrence of EST data.

EST clustering and gene indexing projects

Several projects are underway to construct gene indices, where EST data and known gene sequence data can be consolidated and placed in correct mapping, expression, and physiological context. Although specific methods vary between projects, all gene indices are constructed using some form of cluster analysis, where distance is defined based upon the sequence similarity of transcripts. The central idea of EST clustering is that ESTs be grouped into the same cluster if and only if they are derived from the same gene.
Published gene indexing efforts include UniGene (Boguski et al., 1995; Boguski and Schuler, 1995) from NCBI; the TIGR Gene Index (TGI) from The Institute for Genomic Research (http://www.tigr.org/tdb/hgi/hgi.html; Sutton et al., 1995; White and Kerlavage, 1996); the Merck-Washington University Gene Index (Williamson et al., 1995; Eckman et al., 1998; http://www.merck.com/mrl/merck_gene_index.2.html; Aaronson et al., 1996); the GenExpress project (Houlgatte et al., 1995); and the STACK project from the South African National Bioinformatics Institute (SANBI) (Hide et al., 1994, 1997; Miller et al., 1997).

Representing variations within an EST cluster

The visualization and quantification of gene variants within clusters has not been the primary focus of most gene indexing projects. For example, the UniGene project does not attempt to make assemblies and hence provides no visual report of how transcripts in a cluster overlap. The TIGR Gene Index (TGI) was apparently the first project to provide a space-compressed report, called a THC report (Adams et al., 1995), to display overlap in assemblies of ESTs with respect to tentative consensi (TC) and full-length sequence. Other tools not associated with any specific gene indexing project, yet of great value for viewing sequence assemblies, are Consed (Gordon et al., 1998) and phrapview (P. Green, unpublished). An iterative search method for constructing EST assemblies for single genes of interest has also been proposed (Gill et al., 1997). These methods, however, focus on presenting a single assembly and do not generalize easily to the case where multiple consensi are needed simultaneously to model the information in a sequence cluster, as is the case when sufficiently divergent gene variants are present. Nor do these methods automatically detect the presence of polymorphisms for display.
The STACK project, on the other hand, uses CRAW analysis (Burke et al., 1998) as a post-processing step to clustering in order to automatically discriminate between and simultaneously view distinct gene variants.

The CRAW approach to gene variants and EST clusters

CRAW functions by partitioning sequence clusters into sub-clusters based upon sequence dissimilarity. Specifically, a greedy method is used to construct maximal sub-clusters. Membership in a sub-cluster is restricted by a constraint on the divergence within a global alignment between members and the sub-cluster consensus. When the original clusters are created with similarity threshold (equivalent to minimal-linkage) clustering, as is the case with STACK and UniGene, any two sequences that share an identical domain of sufficient length will be in the same cluster. The creation of sub-clusters is necessary to resolve inconsistencies (for instance, the inclusion of alternate exons in different isoforms of the same gene) through partitioning into one or more sub-clusters. In addition to segregating clusters into distinct gene isoforms, the partitioning is used to identify false joins caused by ESTs derived from chimeric clones, genomic contamination, and other artifacts.

The first use of loose grouping followed by stricter separation for biological sequence databases was apparently the conserved regions database in BEAUTY (Worley et al., 1995), in which minimal linkage is used to perform an initial joining of protein sequences containing similar domains, and maximal linkage is used to resolve inconsistencies in the clusters caused by the ‘chaining effect’ (Johnson and Wichern, 1992). It seems that the first published application of this concept to EST data was the ‘THC_build’ system used to generate TGI (G. Sutton, personal communication; Adams et al., 1995), where pairwise sequence similarity results from BLAST (Altschul et al., 1990) and a modified FASTA (Pearson, 1990) algorithm are collated with a relational database to form loose clusters of related sequences that are aligned with tigr_assembler (Sutton et al., 1995) under conservative parameter sets and strict constraints.

CRAWview

Here we present in detail CRAWview, a Java implementation of a visualization extension to CRAW. CRAWview provides brief cluster reports that display consensus sequences from EST assemblies even when, due to the presence of gene variants, more than one possible cluster consensus exists. CRAWview also highlights regions of divergence and conservation between alternate consensi, and automatically flags polymorphic or otherwise divergent regions. cDNA library information is also included in the reports to aid in the detection of state-specific gene variants as well as the identification of disease-associated polymorphism and alternative exon usage.

Systems and methods

CRAWview may be run on Windows NT/95 and UNIX systems. A Java Runtime Environment (JRE) is required, which may be obtained free of charge from Sun Microsystems for PC or SUN platforms (http://java.sun.com/products/jdk/1.1/jre/index.html). For non-SUN UNIX architectures, the JRE may be obtained from the hardware manufacturer. For LINUX the JRE may be obtained from: http://browserwatch.internet.com/news/story/java35.html.

Algorithms and implementation of CRAWview

Upon completion of CRAW analysis, an EST cluster will have been assembled and partitioned into sub-clusters. A separate consensus sequence will also have been derived for each sub-cluster. CRAWview accepts as its input the output flat file generated by CRAW and presents a graphical view of overlap patterns and sequence divergence within EST clusters.
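To make the idea of partitioning a cluster into sub-clusters with one consensus each more concrete, the following Python sketch may help. It is only an illustration under simplifying assumptions and is not CRAW's published algorithm (see Burke et al., 1998 for that): the function names consensus, divergence, and greedy_partition and the parameter max_divergence are hypothetical, the sequences are assumed to be already globally aligned to the same length, and every mismatching column (gaps included) counts equally toward divergence.

```python
from collections import Counter


def consensus(aligned):
    """Column-wise majority consensus of equal-length, gapped sequences."""
    return "".join(Counter(col).most_common(1)[0][0] for col in zip(*aligned))


def divergence(seq, cons):
    """Fraction of alignment columns where seq differs from the consensus."""
    return sum(a != b for a, b in zip(seq, cons)) / len(cons)


def greedy_partition(aligned, max_divergence):
    """Greedily grow sub-clusters (hypothetical stand-in for CRAW's method)."""
    remaining = list(aligned)
    groups = []
    while remaining:
        group = [remaining.pop(0)]  # seed a new sub-cluster
        leftover = []
        for seq in remaining:
            # Constrain divergence between a member and the group consensus.
            if divergence(seq, consensus(group)) <= max_divergence:
                group.append(seq)
            else:
                leftover.append(seq)
        groups.append(group)
        remaining = leftover
    return groups


# Toy cluster: two sequences lack a 4-base domain (a mock alternative exon).
cluster = [
    "ACGTACGTAC",
    "ACGTACGTAC",
    "ACGT----AC",
    "ACGT----AC",
    "ACGTACGTAA",  # single-base difference, stays with the first group
]
groups = greedy_partition(cluster, max_divergence=0.2)
consensi = [consensus(g) for g in groups]
print(consensi)
```

In this toy input the sequences lacking the domain end up in a second sub-cluster with their own consensus, which is the kind of structure CRAWview is then asked to display.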
The CRAWview report represents a cluster assembly of alignment_length positions with num_cols columns and a row for every sequence in the cluster. Within a row, in order to display sequence alignment information within num_cols columns, each column symbol represents the sequence diversity of d = floor(alignment_length/num_cols) positions, except for the last column, which represents d + (alignment_length mod num_cols) positions. So that CRAWview reports fit within a printed page, we typically use num_cols = 60. Within each window of d positions, discrete domains of sequence identity between sub-cluster consensi, as well as divergence of individual sequences from consensi, are represented by assignment of color index symbols as described in the pseudo-code in Table 1.

Table 1. Pseudo-code for representing divergence and identity between sub-groups: pseudo-code of CRAWview color-index assignment for a cluster that has been partitioned into (maximum_group) sub-groups by CRAW. Variables are displayed in bold while variables that are parameters are in italics

    group_num = 1;
    while( group_num <= maximum_group ) {
        for all sequences (i) in group group_num: {
            for all non-overlapping windows in (i): {
                if( > num_gaps gaps in window )
                    display a ‘gap’ symbol
                else if( > num_ambig indeterminate bases in window )
                    display ‘unknown’ symbol
                else {
                    look_group = 1;
                    while( look_group <= group_num ) {
                        if( >= (d - num_diffs) identical bases with consensus of look_group ) {
                            display ‘look_group’ symbol
                            stop for this window
                        }
                        look_group++;
                    }
                    display ‘diverge’ symbol
                }
            } /* finished window */
        } /* finished sequence */
        group_num++
    } /* finished group */

The CRAWview color index symbols typically consist of the following. Lines indicate gaps in the multiple alignment. Bar colors are assigned as follows: red is reserved for divergence from the consensus sequence (such as when a single nucleotide polymorphism, or SNP, is encountered) and white indicates indeterminate sequence (an example of this is the ‘N’ commonly used in DNA sequence to represent an unknown base). Other colors indicate discrete regions of sequence identity.

In addition to providing a high-level view of EST assemblies, CRAWview provides the user with the ability to ‘drill down’ on interesting features by calling upon alignment viewing/editing features. If it is suspected that manual editing of the multiple alignment may produce better results, the multiple alignment may be edited and resubmitted to CRAW for sub-group reassignment.

Fig. 1. CRAWview report for the human dishevelled 3 gene: sub-group two is identical to group one except for a missing domain (a putative alternative exon), a feature unique to NCGAP_Co9, a colon cancer library. The two yellow vertical bars have been set by the user to drill down on the 3′ end of the missing domain. This causes the MSA editor/viewer component to be spawned at the correct zoom and center to display this region in detail.

For example, Figure 1 shows the CRAWview report for an EST cluster from the UniGene gene index. The corresponding full length gene is Human Disheveled 3/RACK 8 Protein Kinase. Divergence is seen in the second sub-group, which contains two ESTs specific to the library NCGAP_Co3 representing a colon cancer state. The distinguishing feature between the two shown sub-groups is a region (a putative exon) missing from the ESTs derived from colon cancer libraries. As an example of ‘drill-down’ analysis, the user has highlighted a small portion from the 3′ end of the missing domain and the assembly viewing/editing components are spawned such that the zoom and center of the assembly viewer covers the highlighted region.
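The window arithmetic and the color-index assignment of Table 1 can be sketched in Python as follows. This is a minimal reading of the pseudo-code, not the CRAWview source: the function names window_bounds and color_row are hypothetical, only ‘-’ is treated as a gap and only ‘N’ as an indeterminate base, and the identity threshold uses the actual window length so that the longer last column is handled the same way as the others (Table 1 states the threshold in terms of d).

```python
def window_bounds(alignment_length, num_cols):
    """Half-open (start, end) ranges; the last column absorbs the remainder."""
    d = alignment_length // num_cols
    bounds = [(c * d, (c + 1) * d) for c in range(num_cols - 1)]
    bounds.append(((num_cols - 1) * d, alignment_length))
    return bounds


def color_row(seq, group_num, consensi, num_cols, num_gaps, num_ambig, num_diffs):
    """Return one symbol per column for a sequence in sub-group group_num.

    consensi[0] is the consensus of group 1, consensi[1] of group 2, etc.
    """
    symbols = []
    for start, end in window_bounds(len(seq), num_cols):
        window = seq[start:end]
        if window.count("-") > num_gaps:
            symbols.append("gap")
        elif window.count("N") > num_ambig:
            symbols.append("unknown")
        else:
            symbol = "diverge"
            # Try earlier groups first, so shared domains reuse their color.
            for look_group in range(1, group_num + 1):
                cons = consensi[look_group - 1][start:end]
                identical = sum(a == b for a, b in zip(window, cons))
                if identical >= len(window) - num_diffs:
                    symbol = look_group
                    break
            symbols.append(symbol)
    return symbols


# Toy data: 12 alignment positions shown in 3 columns (d = 4).
consensi = ["ACGTACGTACGT", "ACGTTTTTACGT"]
row_a = color_row("ACGTTTTTACGA", 2, consensi, 3, num_gaps=1, num_ambig=1, num_diffs=0)
row_b = color_row("----TTTTNNGT", 2, consensi, 3, num_gaps=1, num_ambig=1, num_diffs=0)
print(row_a)
print(row_b)
```

Because each window is compared against the consensi in group order, a domain that a later sub-group shares with group 1 is drawn with the group 1 symbol, and only a window that matches no consensus up to the sequence's own group is marked as divergent.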
CRAWview is a Java application and is implemented in a combination of Java Foundation Classes (JFC)/Swing (http://java.sun.com/products/jfc/swingdoc-api-1.0.3/frame.html) and Abstract Window Toolkit (AWT) (Zukowski, 1997). CRAWview is rendered and displayed in a JScrollPane, which contains a main viewport, a horizontal heading viewport, and optional vertical and horizontal scrollbars. The JScrollPane resides in a JFrame along with a JMenuBar and a JToolBar. The header port contains the legend and ruler; they always stay at the top, and the ruler is calibrated to alignment positions. The main viewport section of the CRAWview report is scrollable and contains the sequence overlap diagram as well as supplementary sequence information such as accession number, clone identifier, and cDNA library information. CRAWview allows for standard printing through AWT PrintJob, or the user can save the color CRAWview report as a GIF file. CRAWview GIF file generation uses GIFEncoder developed by Adam Doppelt (unpublished).

Using CRAWview

In order to use CRAWview it is necessary either to use CRAW output or to emulate the CRAW output formats. In CRAW, sub-cluster membership, variation, and sequence alignment information is conveyed in two files: *.draw and *.ali. An academic version of CRAW may be obtained through the University of Houston (contact [email protected]). Better performance is obtained from the commercial version of CRAW (available from Pangea Systems, www.pangeasystems.com). If one does not wish to or is not able to use CRAW, then CRAWview can still be used if one emulates the file formats of *.draw and *.ali. The *.draw file contains sub-cluster membership information as well as a text version of the color output generated by CRAWview. The details of the *.draw format, with many examples, can be found in previous work (Burke et al., 1998). The *.ali file contains the actual sequence information with gaps.
The sequences are listed in the same order as in the *.draw file and are in the older style ‘gde’ format, i.e. every sequence is listed as:

For further clarity several complete example cluster files with *.ali and *.draw files are available online at the Bioinformatics website. CRAW can be used directly with sequence aligners such as CLUSTALW. Given a multiple FASTA format file, say 0, an example of complete CRAW/CRAWview usage would be:

1. clustalw -output=gde 0
2. cat 0.gde | craw 0.5 50 60 > 0.draw (the 0.ali file is generated automatically).
3. Run CRAWview and choose 0.draw from the file/open menu.

(The file 0 as well as 0.ali and 0.draw are available at the Bioinformatics website.) A beta version of CRAWview is available to academics free of charge.

Discussion

EST data can be used to model gene structure in the absence of full genomic sequence or a matching positionally cloned gene. Additionally, it provides a cheap and abundant source of information concerning gene variability and, to some extent, expression. We have presented the capabilities of CRAWview, a tool for browsing and performing exploratory data analysis of EST clusters with the purpose of assisting in the identification of state-specific gene variability and novel disease-associated features in mature gene transcripts.

In theory, due to the large amounts of data to be processed, high-throughput bioinformatics procedures should be fully automatic. However, since many biological features have yet to be fully characterized by ‘rule-sets’, it is inevitable that the optimal results according to an algorithm’s objective function will sometimes be incorrect or sub-optimal in a biological context. This caveat applies to sequence alignment algorithms in general and is especially true for multiple sequence alignment or assembly.
For this reason, CRAWview combines high-level viewing and automatic feature reporting capabilities with an alignment editing feature that allows a user to interactively improve multiple alignments of transcripts.

Another interesting application of CRAWview is cross-validation of different clustering and assembly protocols. We often use CRAWview to display the results of other assembly and clustering programs in the context of the consensi derived by CRAW. This is especially important given that multiple alignment is often performed through heuristic procedures that are not guaranteed to converge to the correct ‘biological’ answer, and the comparison of results produced by different methods is necessary for validation.

In future CRAWview developments we plan to include an open reading frame (ORF) finder. We will increase the functionality of the assembly editor and make communications between the editor and the ORF finder instantaneous so that the user may immediately see the effects of edits on consensus coding potential. Additionally, we plan to allow the user flexibility in assigning color codes.

Acknowledgments

The authors would like to express thanks to Chris Tarnas for assistance with the optimization of CRAWview and to Kristina Chi for assistance in typing this manuscript. The authors are especially grateful to Matthew Huang for careful review and helpful ideas. This work was supported by Pangea Systems.

References

Aaronson,J.S., Eckman,B., Blevins,R.A., Borkowski,J.A., Myerson,J., Imran,S. and Elliston,K.O. (1996) Toward the development of a gene index to the human genome: an assessment of the nature of high-throughput EST sequence data. Genome Res., 6, 829–845.
Adams,M.D., Kelley,J.M., Gocayne,J.D., Dubnick,M., Polymeropoulos,M.H., Xiao,H., Merril,C.R., Wu,A., Olde,B., Moreno,R.F., Kerlavage,A.R., McCombie,W.R. and Venter,J.C. (1991) Complementary DNA sequencing: expressed sequence tags and human genome project. Science, 252, 1651–1656.
Adams,M.D., Dubnick,M., Kerlavage,A.R., Moreno,R., Kelley,J.M., Utterback,T.R., Nagle,J.W., Fields,C. and Venter,J.C. (1992) Sequence identification of 2375 human brain genes. Nature, 355, 632–634.
Adams,M.D., Kerlavage,A.R., Fleischmann,R.D., Fuldner,R.A., Bult,C.J., Lee,N.H., Kirkness,E.F., Weinstock,K.G., Gocayne,J.D., White,O., Sutton,G., Blake,J.A., Brandon,R.C., Chiu,M.W., Clayton,R.A., Cline,R.T., Cotton,M.D., Earle-Hughes,J., Fine,L.D., FitzGerald,L.M., FitzHugh,W.M., Fritchman,J.L., Geoghagen,N.S.M., Glodek,A., Gnehm,C.L., Venter,C. et al. (1995) Initial assessment of human gene diversity and expression patterns based upon 83 million nucleotides of cDNA sequence. Nature, 377(suppl.), 3–17.
Altschul,S.F., Gish,W., Miller,W., Myers,E.W. and Lipman,D.J. (1990) Basic local alignment search tool. J. Mol. Biol., 215, 403–410.
Benson,D.A., Boguski,M.S., Lipman,D.J. and Ostell,J. (1994) GenBank. Nucleic Acids Res., 22, 3441–3444.
Boguski,M.S. and Schuler,G.D. (1995) ESTablishing a human transcript map. Nature Genetics, 10, 369–371.
Boguski,M.S., Lowe,T.M. and Tolstoshev,C.M. (1993) dbEST: database for ‘expressed sequence tags’. Nature Genetics, 4, 332–333.
Burke,J., Wang,H., Hide,W. and Davison,D. (1998) Alternative gene form discovery and candidate gene selection from gene indexing projects. Genome Res., 8, 276–290.
Eckman,B.A., Aaronson,J.S., Borkowski,J.A., Bailey,W.J., Elliston,K.O., Williamson,A.R. and Blevins,R.A. (1998) The Merck Gene Index Browser: an extensible data integration system for gene finding, gene characterization and EST data mining. Bioinformatics, 14, 2–13.
Gill,R., Hodgman,T., Littler,C., Oxer,M., Montgomery,D., Taylor,S. and Sanseau,P. (1997) A new dynamic tool to perform assembly of expressed sequence tags (ESTs). CABIOS, 13, 453–457.
Gordon,D., Abajian,C. and Green,P. (1998) Consed: a graphical tool for sequence finishing. Genome Res., 8, 195–202.
Hide,W., Burke,J. and Davison,D. (1994) Biological evaluation of d2, an algorithm for high-performance sequence comparison. J. Comp. Biol., 1, 199–215.
Hide,W., Burke,J., Christoffels,A. and Miller,R. (1997) A novel approach towards a comprehensive consensus representation of the expressed human genome. In Miyano,S. and Takagi,T. (eds), Genome Informatics 1997. Universal Academy Press, Tokyo, pp. 187–196.
Houlgatte,R., Mariage-Samson,R., Duprat,S., Tessier,A., Bentolila,S., Lamy,B. and Auffray,C. (1995) The GenExpress Index: a resource for gene discovery and the genic map of the human genome. Genome Res., 5, 272–304.
Hudson,T.J., Stein,L.D., Gerety,S.S., Ma,J., Castle,A.B., Silva,J., Slonim,D.K., Baptista,R., Kruglyak,L., Xu,S., Hu,X., Colbert,A.M.E., Rosenberg,C., Reeve-Daly,M.P., Rozen,S., Hui,L., Wu,X., Vastergaard,C., Wilson,K.M., Sae,J.S., Maitra,S., Ganiatsas,S., Evans,C.A., DeAngelis,M.M., Ingalls,K.A., Nahf,R.W., Horton,L.T., Anderson,M.O., Collymore,A.J., Ye,W., Koyoumjian,V., Zemsteva,I.S., Tam,J., Devine,R., Courtney,D.F., Renauld,M.T., Nguyen,H., Fizames,C., Faure,S., Gyapay,G., Dib,C., Morissette,J., Orlin,J.B., Birren,B.W., Goodman,N., Weissenbach,J., Hawkins,T.L., Foote,S., Page,D.C. and Lander,E.S. (1995) An STS-based map of the human genome. Science, 270, 1945–1954.
Johnson,R.A. and Wichern,D.W. (1992) Applied Multivariate Statistical Methods, 3rd edn. Englewood Cliffs, NJ.
Matsubara,K. and Okubo,K. (1993) Identification of new genes by systematic analysis of cDNAs and database construction. Curr. Opinion Biotech., 4, 672–677.
Miller,R., Burke,J., Christoffels,A. and Hide,W. (1997) Towards a more comprehensive conceptual consensus of the expressed genome: development of sequence tag alignment and consensus knowledgebase (STACK) a novel error analytical approach to EST consensus databases. Ninth International Genome Sequencing and Analysis Conference.
Okubo,K., Hori,H., Matuba,R., Niiyama,T. and Matsubara,K. (1991) A novel system for large-scale sequencing of cDNA by PCR amplification. DNA Sequence, 2, 137–144.
Okubo,K., Hori,H., Matuba,R., Niiyama,T., Fukushima,A., Kiojima,Y. and Matsubara,K. (1992) Large-scale cDNA sequencing analysis of quantitative and qualitative aspects of gene expression. Nature Genetics, 2, 173–179.
Okubo,K., Yoshii,J., Yokouchi,H., Kameyama,M. and Matsubara,K. (1994) An expression profile of active genes in human colonic mucosa. DNA Res., 1, 37–45.
Pearson,W.R. (1990) Rapid and sensitive sequence comparison with FASTP and FASTA. In Doolittle,R.F. (ed.), Molecular Evolution: Computer Analysis of Protein and Nucleic Acid Sequences, Methods in Enzymology. Academic Press, San Diego, pp. 63–98.
Schuler,G.D., Boguski,M.S., Stewart,E.A., Stein,L.D., Gyapay,G., Rice,K., White,R.E., Rodriguez-Tome,P., Aggarwal,A., Bajorek,E., Bentolila,S., Birren,B.B., Butler,A., Castle,A.B., Chiannilkulchai,N., Chu,A., Clee,C., Cowles,S., Day,P.J.R., Dibling,T., Drouot,N., Dunham,I., Duprat,S., East,C., Edwards,C., Fan,J.B., Fang,N., Fizames,C., Garrett,C., Green,L., Hudson,T.J. et al. (1996) A gene map of the human genome. Science, 274, 540–546.
Sutton,G., White,O., Adams,M.D. and Kerlavage,A.R. (1995) TIGR assembler: a new tool for assembling large shotgun sequencing projects. Genome Sci. Technol., 1, 9–18.
Venter,J.C. (1993) Identification of new human receptor and transporter genes by high throughput cDNA (EST) sequencing. J. Pharm. Pharmacol., 45(suppl. 1), 355–360.
White,O. and Kerlavage,A.R. (1996) TDB: new databases for biological discovery. Methods Enzymol., 206, 27–41.
Wilcox,A.S., Khan,A.S., Hopkins,J.A. and Sikela,J.M. (1991) Use of 3′ untranslated sequences of human cDNAs for rapid chromosomal assignment and conversion to STSs: implications for an expression map of the genome. Nucleic Acids Res., 19, 1837–1843.
Williamson,A.R., Elliston,K.O. and Sturchio,J.L. (1995) The Merck Gene Index, a public resource for genomics research. J. NIH Res., 7, 61–63.
Worley,K.C., Wiese,B.A. and Smith,R. (1995) BEAUTY: an enhanced BLAST-based search tool that integrates multiple biological information resources into sequence similarity results. Genome Res., 5, 173–184.
Zukowski,J. (1997) Java AWT Reference. O’Reilly Press, Sebastopol, CA.