BIOINFORMATICS
CRAWview: for viewing splicing variation, gene
families, and polymorphism in clusters of ESTs
and full-length sequences
A.Chou and J.Burke
Abstract
Motivation: DNA sequence clustering has become a valuable method in support of gene discovery and gene
expression analysis. Our interest lies in leveraging the
sequence diversity within clusters of expressed sequence tags
(ESTs) to model gene structure for the study of gene variants
that arise from, among other things, alternative mRNA
splicing, polymorphism, and divergence after gene duplication, fusion, and translocation events. In previous work,
CRAW was developed to discover gene variants from
assembled clusters of ESTs. Most importantly, novel gene
features (the differing units between gene variants, for
example alternative exons, polymorphisms, transposable
elements, etc.) that are specialized to tissue, disease,
population, or developmental states can be identified when
these tools collate DNA source information with gene variant
discrimination. While the goal is complete automation of
novel feature and gene variant detection, current methods
are far from perfect and hence the development of effective
tools for visualization and exploratory data analysis is of
paramount importance in the process of sifting through
candidate genes and validating targets.
Results: We present CRAWview, a Java-based visualization
extension to CRAW. Features that vary between gene forms
are displayed using an automatically generated color-coded
index. The reporting format of CRAWview gives a brief,
high-level summary report to display overlap and divergence
within clusters of sequences as well as the ability to ‘drill
down’ and see detailed information concerning regions of
interest. Additionally, the alignment viewing and editing
capabilities of CRAWview make it possible to interactively
correct frame-shifts and otherwise edit cluster assemblies.
We have implemented CRAWview as a Java application
across Windows NT/95 and UNIX platforms.
Availability: A beta version of CRAWview will be freely
available to academic users from Pangea Systems
(http://www.pangeasystems.com).
Contact: [email protected]
Introduction
The large quantity of single-read sequence from the ends of
sufficiently expressed mRNAs (known as Expressed Sequence Tags or ESTs; Wilcox et al., 1991; Adams et al.,
1991; Okubo et al., 1991) has led to the discovery of many
genes before the completion of genomic sequencing of the
human or other organismal genomes (Adams et al., 1992;
Venter, 1993; Matsubara and Okubo, 1993). EST data has
also facilitated large-scale expression studies (Okubo et
al., 1992, 1994; Adams et al., 1995), the construction of a
physical map of the genome (Hudson et al., 1995), and a
gene map that localizes many genes with respect to markers
of the physical map (Schuler et al., 1996). The creation of
standardized data repositories (Boguski et al., 1993; Benson
et al., 1994) has improved the reliability and concurrence of
EST data.
EST clustering and gene indexing projects
Several projects are underway to construct gene indices,
where EST data and known gene sequence data can be consolidated and placed in correct mapping, expression, and
physiological context. Although specific methods vary between projects, all gene indices are constructed using some
form of cluster analysis, where distance is defined based
upon the sequence similarity of transcripts. The central idea
of EST clustering is that ESTs be grouped into the same
cluster if and only if they are derived from the same gene.
Published gene indexing efforts include UniGene (Boguski
et al., 1995; Boguski and Schuler, 1995) from NCBI; the
TIGR Gene Index (TGI) from The Institute for Genomic Research (http://www.tigr.org/tdb/hgi/hgi.html; Sutton et al.,
1995; White and Kerlavage, 1996); the Merck-Washington
University Gene Index (Williamson et al., 1995; Eckman et
al., 1998; http://www.merck.com/mrl/merck_gene_index.2.html; Aaronson et al., 1996); the GenExpress project (Houlgatte et al., 1995); and the STACK project from the South
African National Bioinformatics Institute (SANBI) (Hide et
al., 1994, 1997; Miller et al., 1997).
Oxford University Press
Representing variations within an EST cluster
The visualization and quantification of gene variants within
clusters has not been the primary focus of most gene indexing projects. For example, the UniGene project does not attempt to make assemblies and hence provides no visual report of how transcripts in a cluster overlap. The TIGR Gene
Index (TGI) was apparently the first project to provide a
space-compressed report, called a THC report (Adams et al.,
1995), to display overlap in assemblies of ESTs with respect
to tentative consensi (TC) and full-length sequence. Other
tools not associated with any specific gene indexing project,
yet of great value for viewing sequence assemblies, are
Consed (Gordon et al., 1998) and phrapview (P. Green, unpublished). An iterative search method for constructing EST
assemblies for single genes of interest has also been proposed (Gill et al., 1997). These methods, however, focus on
presenting a single assembly and do not generalize easily to
the case where multiple consensi are needed simultaneously
to model the information in a sequence cluster, as is the case
when sufficiently divergent gene variants are present. Nor do
these methods automatically detect the presence of polymorphisms for display. The STACK project, on the other
hand, uses CRAW analysis (Burke et al., 1998) as a post-processing step to clustering in order to automatically discriminate between and simultaneously view distinct gene variants.
The CRAW approach to gene variants and EST clusters
CRAW functions by partitioning sequence clusters into subclusters based upon sequence dissimilarity. Specifically, a
greedy method is used to construct maximal sub-clusters.
Membership in the sub-cluster is restricted in that a constraint is put on the divergence within a global alignment between members and the sub-cluster consensus. When the
original clusters are created with similarity threshold (equivalent to minimal-linkage) clustering, as is the case with
STACK and UniGene, any two sequences that share an
identical domain of sufficient length will be in the same
cluster. The creation of sub-clusters is necessary to resolve
inconsistencies (for instance, the inclusion of alternate exons
in different isoforms of the same gene) through partitioning
into one or more sub-clusters. In addition to segregating
clusters into distinct gene isoforms, the partitioning is used
to identify false joins caused by ESTs derived from chimeric
clones, genomic contamination, and other artifacts. Apparently, the first use of an approach that follows a loose grouping with a stricter separation for biological sequence databases was
the conserved regions database in BEAUTY (Worley et al.,
1995), in which minimal linkage is used to perform an initial
joining of protein sequences containing similar domains, and
maximal linkage is used to resolve inconsistencies in the
clusters caused by the ‘chaining effect’ (Johnson and Wichern, 1992). It seems that the first published application of
this concept to EST data was the ‘THC_build’ system used to
generate TGI (G. Sutton, personal communication; Adams
et al., 1995) where pairwise sequence similarity results from
BLAST (Altschul et al., 1990) and a modified FASTA (Pearson, 1990) algorithm are collated with a relational database
to form loose clusters of related sequences that are aligned
with tigr_assembler (Sutton et al., 1995) under conservative
parameter sets and strict constraints.
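As a rough illustration of the greedy partitioning described at the start of this section, the following Python sketch builds sub-clusters from rows of an existing gapped multiple alignment. It is not the CRAW implementation. The majority-vote consensus, the treatment of leading and trailing gaps as "not covered", the max_divergence threshold, the choice of the first unassigned sequence as seed, and all function names are assumptions made for the example.

```python
from collections import Counter

GAP = "-"


def covered_span(row):
    """Return (start, end) of the region between leading and trailing gaps."""
    start = len(row) - len(row.lstrip(GAP))
    end = len(row.rstrip(GAP))
    return start, end


def consensus(rows):
    """Majority vote per column, counting only rows that cover the column."""
    spans = [covered_span(r) for r in rows]
    cons = []
    for col in range(len(rows[0])):
        symbols = [r[col] for r, (s, e) in zip(rows, spans) if s <= col < e]
        cons.append(Counter(symbols).most_common(1)[0][0] if symbols else GAP)
    return "".join(cons)


def divergence(row, cons):
    """Fraction of differing positions in the overlap of row and consensus.

    Internal gaps count as differences, so a skipped exon is divergent.
    Rows with no overlap are treated as fully divergent (an assumption).
    """
    rs, re_ = covered_span(row)
    cs, ce = covered_span(cons)
    start, end = max(rs, cs), min(re_, ce)
    if end <= start:
        return 1.0
    mismatches = sum(row[i] != cons[i] for i in range(start, end))
    return mismatches / (end - start)


def greedy_subclusters(rows, max_divergence=0.05):
    """Seed a sub-cluster, then absorb every row close enough to its consensus."""
    unassigned = list(range(len(rows)))
    groups = []
    while unassigned:
        seed = unassigned.pop(0)
        members = [seed]
        cons = rows[seed]
        for idx in list(unassigned):
            if divergence(rows[idx], cons) <= max_divergence:
                members.append(idx)
                unassigned.remove(idx)
                cons = consensus([rows[m] for m in members])
        groups.append(members)
    return groups


rows = [
    "ACGTTTGACCAGGT",  # full-length form
    "ACGTTTGACC----",  # partial read of the full-length form
    "ACGT------AGGT",  # form lacking the middle region
    "--GT------AGGT",  # partial read of the shorter form
]
print(greedy_subclusters(rows, max_divergence=0.1))
```

In this toy alignment the two reads that lack the middle region are placed in their own sub-cluster, which mirrors the alternate-exon case discussed above.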
CRAWview
Here we present in detail CRAWview, a Java implementation
of a visualization extension to CRAW. CRAWview provides
brief cluster reports that display consensus sequences from
EST assemblies even when, due to the presence of gene variants, more than one possible cluster consensus exists.
CRAWview also highlights regions of divergence and conservation between alternate consensi, and automatically
flags polymorphic or otherwise divergent regions. cDNA library information is also included in the reports to aid in the
detection of state-specific gene variants as well as the
identification of disease-associated polymorphism and alternative exon usage.
Systems and methods
CRAWview may be run on Windows NT/95 and UNIX systems. A Java Runtime Environment (JRE) is required, which
may be obtained free of charge from Sun Microsystems for
PC or SUN platforms (http://java.sun.com/products/jdk/1.1/jre/index.html). For non-SUN UNIX architectures,
the JRE may be obtained from the hardware manufacturer.
For LINUX the JRE may be obtained from http://browserwatch.internet.com/news/story/java35.html.
Algorithms and implementation of CRAWview
Upon completion of CRAW analysis, an EST cluster will
have been assembled and partitioned into sub-clusters. A
separate consensus sequence will also have been derived for
each sub-cluster. CRAWview accepts as its input the output
flat file generated by CRAW and presents a graphical view
of overlap patterns and sequence divergence within EST
clusters.
The CRAWview report represents a cluster assembly of
alignment_length positions with num_cols columns and a
row for every sequence in the cluster. Within a row, in order
to display sequence alignment information within num_cols
columns, each column symbol represents the sequence diversity of
d = floor(alignment_length / num_cols)
positions, except for the last column, which represents
d + (alignment_length mod num_cols)
positions. So that CRAWview reports fit within a printed
page, we typically use num_cols = 60.
Within each window of d positions, discrete domains of
sequence identity between sub-cluster consensi, as well as
divergence of individual sequences from consensi, are represented by assignment of color index symbols as described in
the pseudo-code in Table 1.
Table 1. Pseudo-code for representing divergence and identity between sub-groups: pseudo-code of CRAWview color-index assignment for a cluster that
has been partitioned into (maximum_group) sub-groups by CRAW. Variables are displayed in bold while variables that are parameters are in italics
group_num = 1;
while( group_num <= maximum_group )
{
    for all sequences (i) in group group_num:
    {
        for all non-overlapping windows in (i):
        {
            if( > num_gaps gaps in window )
                display a ‘gap’ symbol
            else if( > num_ambig indeterminate bases in window )
                display ‘unknown’ symbol
            else
            {
                look_group = 1;
                while( look_group <= group_num )
                {
                    if( >= (d - num_diffs) identical bases with consensus of look_group )
                    {
                        display ‘look_group’ symbol
                        stop for this window
                    }
                    look_group++;
                }
                if no look_group matched:
                    display ‘diverge’ symbol
            }
        } /* finished window */
    } /* finished sequence */
    group_num++;
} /* finished group */
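The window arithmetic and the pseudo-code of Table 1 can be sketched in runnable form as follows. This is a minimal Python illustration, not CRAWview's Java code. It assumes that the alignment rows and consensi are equal-length strings, that ‘-’ marks a gap and ‘N’ an indeterminate base, that alignment_length is at least num_cols, and that symbols are returned as values rather than drawn. The function names and the string labels are hypothetical.

```python
def window_bounds(alignment_length, num_cols):
    """Return d and the (start, end) bounds of each of the num_cols windows.

    Every window spans d positions except the last, which also absorbs the
    remainder: d + (alignment_length mod num_cols) positions.
    """
    d = alignment_length // num_cols
    bounds = [(c * d, (c + 1) * d) for c in range(num_cols)]
    bounds[-1] = ((num_cols - 1) * d, alignment_length)
    return d, bounds


def color_index(groups, consensi, num_cols, num_gaps, num_ambig, num_diffs):
    """Assign one symbol per window for every sequence, following Table 1.

    groups[g] holds the aligned rows of sub-group g + 1 and consensi[g] is
    that sub-group's consensus. A window receives the number of the first
    sub-group (up to the sequence's own) whose consensus it matches.
    """
    d, bounds = window_bounds(len(consensi[0]), num_cols)
    report = []
    for group_num, rows in enumerate(groups, start=1):
        for row in rows:
            symbols = []
            for start, end in bounds:
                window = row[start:end]
                if window.count("-") > num_gaps:
                    symbols.append("gap")
                elif window.upper().count("N") > num_ambig:
                    symbols.append("unknown")
                else:
                    symbol = "diverge"
                    for look_group in range(1, group_num + 1):
                        cons = consensi[look_group - 1][start:end]
                        identical = sum(a == b for a, b in zip(window, cons))
                        # As in Table 1, the threshold uses d for every window.
                        if identical >= d - num_diffs:
                            symbol = look_group
                            break
                    symbols.append(symbol)
            report.append(symbols)
    return report


groups = [
    ["AAAACCCCGGGG", "AAAACCNNGGGG"],
    ["AAAATTTTGGGG", "AAAA----GGTG"],
]
consensi = ["AAAACCCCGGGG", "AAAATTTTGGGG"]
for symbols in color_index(groups, consensi, num_cols=3,
                           num_gaps=1, num_ambig=1, num_diffs=0):
    print(symbols)
```

In a viewer, each returned value would be mapped to a bar color or a gap line; here the mapping is left out so that the assignment logic stays visible.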
The CRAWview color index symbols typically consist of
the following. Lines indicate gaps in the multiple alignment. Bar colors are assigned as follows: red is reserved for
divergence from the consensus sequence (such as when a
single nucleotide polymorphism, or SNP, is encountered)
and white indicates indeterminate sequence (for example,
the ‘N’ commonly used in DNA sequence to represent
an unknown base). Other colors indicate discrete regions of
sequence identity.
In addition to providing a high-level view of EST assemblies, CRAWview provides the user with the ability to ‘drill
down’ on interesting features by calling upon alignment
viewing/editing features. If it is suspected that manual editing of the multiple alignment may produce better results, the
multiple alignment may be edited and resubmitted to CRAW
for sub-group reassignment.
Fig. 1. CRAWview report for the human dishevelled 3 gene: sub-group two is identical to group one except for a missing domain (a putative
alternative exon), a feature unique to NCGAP_Co9, a colon cancer library. The two yellow vertical bars have been set by the user
to drill down on the 3′ end of the missing domain. This causes the MSA editor/viewer component to be spawned at the correct zoom and center
to display this region in detail.
For example, Figure 1 shows the CRAWview report for an
EST cluster from the UniGene gene index. The corresponding full length gene is Human Disheveled 3/RACK 8 Protein Kinase. Divergence is seen in the second sub-group
which contains two ESTs specific to the library
NCGAP_Co3 representing a colon cancer state. The distinguishing feature between the two shown sub-groups is a region (a putative exon) missing from the ESTs derived from
colon cancer libraries. As an example of ‘drill-down’ analysis, the user has highlighted a small portion from the 3′ end
of the missing domain and the assembly viewing/editing
components are spawned such that the zoom and center of
the assembly viewer covers the highlighted region.
CRAWview is a Java application and is implemented in a
combination of Java Foundation Classes (JFC)/Swing (http://java.sun.com/products/jfc/swingdoc-api-1.0.3/frame.html) and
Abstract Window Toolkit (AWT) (Zukowski, 1997). CRAWview is rendered and displayed in a JScrollPane, which contains
a main viewport, a horizontal heading viewport, and optional
vertical and horizontal scrollbars. The JScrollPane resides in a
JFrame along with a JMenuBar and a JToolBar. The heading viewport
contains the legend and ruler, which always stay at the top; the
ruler is calibrated to alignment positions. The main viewport
section of the CRAWview report is scrollable and contains the
sequence overlap diagram as well as supplementary sequence
information such as accession number, clone identifier, and
cDNA library information. CRAWview allows for standard
printing through AWT PrintJob, or the user can save the color
CRAWview report as a GIF file. CRAWview GIF file generation uses GIFEncoder developed by Adam Doppelt (unpublished).
Using CRAWview
In order to use CRAWview it is necessary either to use
CRAW output or to emulate the CRAW output formats. In
CRAW, sub-cluster membership, variation, and sequence
alignment information is conveyed in two files: *.draw and
*.ali. An academic version of CRAW may be obtained
through the University of Houston (contact [email protected]). Better performance is obtained from the
commercial version of CRAW (available from Pangea Systems, www.pangeasystems.com). If one does not wish to or
is not able to use CRAW, then CRAWview can still be used
if one emulates the file formats of *.draw and *.ali. The
*.draw file contains sub-cluster membership information as
well as a text version of the color output generated by
CRAWview. The details of the *.draw format, with many
examples, can be found in previous work (Burke et al., 1998).
The *.ali file contains the actual sequence information with
gaps. The sequences are listed in the same order as in the
*.draw file and are in the older style ‘gde’ format, i.e. every
sequence is listed as:
For further clarity several complete example cluster files
with *.ali and *.draw files are available online at the Bioinformatics website.
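For readers who want to emulate or inspect an *.ali file, the following Python sketch reads records from flat ‘gde’-style text. It assumes the layout written by CLUSTALW with GDE output selected, in which each record begins with a header line starting with ‘#’ (nucleotide) or ‘%’ (protein), followed by one or more lines of gapped sequence. That layout is an assumption here, not a specification of the CRAW *.ali format, and the example files mentioned above remain the authority. The function name parse_gde is hypothetical.

```python
def parse_gde(text):
    """Return a list of (name, gapped_sequence) pairs in file order."""
    records = []
    name, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line[0] in "#%":
            # A new header closes the previous record, if there was one.
            if name is not None:
                records.append((name, "".join(chunks)))
            name, chunks = line[1:].strip(), []
        elif name is not None:
            chunks.append(line)
    if name is not None:
        records.append((name, "".join(chunks)))
    return records


example = "#est1\nacgt--\nac\n#est2\n--gtacgt\n"
for name, seq in parse_gde(example):
    print(name, seq)
```

Because the sequences in *.ali are listed in the same order as in *.draw, keeping the records in a list rather than a dictionary preserves the correspondence between the two files.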
CRAW can be used directly with sequence aligners such
as CLUSTALW. Given a multiple FASTA format file, say 0,
an example of complete CRAW/CRAWview usage would
be:
1. clustalw -output=gde 0.
2. cat 0.gde | craw 0.5 50 60 > 0.draw (the 0.ali file is
generated automatically).
3. Run CRAWview and choose 0.draw from the file/open
menu.
(The file 0 as well as 0.ali and 0.draw are available at the
Bioinformatics website.) A beta version of CRAWview is
available to academics free of charge.
Discussion
EST data can be used to model gene structure in the absence
of full genomic sequence or a matching positionally cloned
gene. Additionally, it provides a cheap and abundant source
of information concerning gene variability and, to some extent, expression. We have presented the capabilities of
CRAWview, a tool for browsing and performing exploratory
data analysis of EST clusters with the purpose of assisting in
the identification of state-specific gene variability and novel
disease-associated features in mature gene transcripts.
In theory, due to the large amounts of data to be processed,
high-throughput bioinformatics procedures should be fully
automatic. However, since many biological features have yet
to be fully characterized by ‘rule-sets’, it is inevitable that the
optimal results according to an algorithm’s objective function will sometimes be incorrect or sub-optimal in a biological context. This caveat applies to sequence alignment algorithms in general and is especially true for multiple sequence
alignment or assembly. For this reason, CRAWview combines high-level viewing and automatic feature reporting capabilities with an alignment editing feature that allows a user
to interactively improve multiple alignments of transcripts.
Another interesting application of CRAWview is cross-validation of different clustering and assembly protocols. We often
use CRAWview to display the results of other assembly and
clustering programs in the context of the consensi derived by
CRAW. This is especially important given that multiple alignment is often performed through heuristic procedures that are
not guaranteed to converge to the correct ‘biological’ answer
and the comparison of results produced by different methods is
necessary for validation. In future CRAWview developments
we plan to include an open reading frame finder. We will increase the functionality of the assembly editor and make communications between the editor and the ORF finder instantaneous so that the user may immediately see the effects of edits
on consensus coding potential. Additionally, we plan to allow
the user flexibility in assigning color codes.
Acknowledgments
The authors would like to express thanks to Chris Tarnas for
assistance with the optimization of CRAWview and to
Kristina Chi for assistance in typing this manuscript. The
authors are especially grateful to Matthew Huang for careful
review and helpful ideas. This work was supported by Pangea Systems.
References
Aaronson,J.S., Eckman,B., Blevins,R.A., Borkowski,J.A., Myerson,J.,
Imran,S. and Elliston,K.O. (1996) Toward the development of a
gene index to the human genome: an assessment of the nature of
high-throughput EST sequence data. Genome Res., 6, 829–845.
Adams,M.D., Kelley,J.M., Gocayne,J.D., Dubnick,M., Polymeropoulos,M.H., Xiao,H., Merril,C.R., Wu,A., Olde,B., Moreno,R.F.,
Kerlavage,A.R., McCombie,W.R. and Venter,J.C. (1991) Complementary DNA sequencing: expressed sequence tags and human
genome project. Science, 252, 1651–1656.
Adams,M.D., Dubnick,M., Kerlavage,A.R., Moreno,R., Kelley,J.M.,
Utterback,T.R., Nagle,J.W., Fields,C. and Venter,J.C. (1992) Sequence identification of 2375 human brain genes. Nature, 355,
632–634.
Adams,M.D., Kerlavage,A.R., Fleischmann,R.D., Fuldner,R.A.,
Bult,C.J., Lee,N.H., Kirkness,E.F., Weinstock,K.G., Gocayne,J.D.,
White,O., Sutton,G., Blake,J.A., Brandon,R.C., Chiu,M.W., Clayton,R.A., Cline,R.T., Cotton,M.D., Earle-Hughes,J., Fine,L.D., FitzGerald,L.M., FitzHugh,W.M., Fritchman,J.L., Geoghagen,N.S.M.,
Glodek,A., Gnehm,C.L., Venter,C. et al. (1995) Initial assessment of
human gene diversity and expression patterns based upon 83 million
nucleotides of cDNA sequence. Nature, 377(suppl.), 3–17.
Altschul,S.F., Gish,W., Miller,W., Myers,E.W. and Lipman,D.J. (1990)
Basic local alignment search tool. J. Mol. Biol., 215, 403–410.
Benson,D.A., Boguski,M.S., Lipman,D.J. and Ostell,J. (1994) GenBank. Nucleic Acids Res., 22, 3441–3444.
Boguski,M.S. and Schuler,G.D. (1995) ESTablishing a human transcript map. Nature Genetics, 10, 369–371.
Boguski,M.S., Lowe,T.M. and Tolstoshev,C.M. (1993) dbEST: database for ‘expressed sequence tags’. Nature Genetics, 4, 332–333.
Burke,J., Wang,H., Hide,W. and Davison,D. (1998) Alternative gene
form discovery and candidate gene selection from gene indexing
projects. Genome Res., 8, 276–290.
Eckman,B.A., Aaronson,J.S., Borkowski,J.A., Bailey,W.J., Elliston,K.O., Williamson,A.R. and Blevins,R.A. (1998) The Merck
Gene Index Browser: an extensible data integration system for gene
finding, gene characterization and EST data mining. Bioinformatics,
14, 2–13.
Gill,R., Hodgman,T., Littler,C., Oxer,M., Montgomery,D., Taylor,S.
and Sanseau,P. (1997) A new dynamic tool to perform assembly of
expressed sequence tags (ESTs). CABIOS, 13, 453–457.
Gordon,D., Abajian,C. and Green,P. (1998) Consed: a graphical tool
for sequence finishing. Genome Res., 8, 195–202.
Hide,W., Burke,J. and Davison,D. (1994) Biological evaluation of d2,
an algorithm for high-performance sequence comparison. J. Comp.
Biol., 1, 199–215.
Hide,W., Burke,J., Christoffels,A. and Miller,R. (1997) A novel
approach towards a comprehensive consensus representation of the
expressed human genome. In Miyano,S. and Takagi,T. (eds),
Genome Informatics 1997. Universal Academy Press, Tokyo, pp.
187–196.
Houlgatte,R., Mariage-Samson,R., Duprat,S., Tesslier,A., Bentolila,S., Lamy,B. and Auffray,C. (1995) The GenExpress Index: a
resource for gene discovery and the genic map of the human
genome. Genome Res., 5, 272–304.
Hudson,T.J., Stein,L.D., Gerety,S.S., Ma,J., Castle,A.B., Silva,J.,
Slonim,D.K., Baptista,R., Kruglyak,L., Xu,S., Hu,X., Colbert,A.M.E., Rosenberg,C., Reeve-Daly,M.P., Rozen,S., Hui,L.,
Wu,X., Vastergaard,C., Wilson,K.M., Sae,J.S., Maitra,S., Ganiatsas,S., Evans,C.A., DeAngelis,M.M., Ingalls,K.A., Nahf,R.W.,
Horton,L.T., Anderson,M.O., Collymore,A.J., Ye,W., Koyoumjian,V., Zemsteva,I.S., Tam,J., Devine,R., Courtney,D.F., Renauld,M.T., Nguyen,H., Fizames,C., Faure,S., Gyapay,G., Dib,C.,
Morissette,J., Orlin,J.B., Birren,B.W., Goodman,N., Weissenbach,J., Hawkins,T.L., Foote,S., Page,D.C. and Lander,E.S. (1995)
An STS-based map of the human genome. Science, 270,
1945–1954.
Johnson,R.A. and Wichern,D.W. (1992) Applied Multivariate Statistical Analysis, 3rd edn. Prentice Hall, Englewood Cliffs, NJ.
Matsubara,K. and Okubo,K. (1993) Identification of new genes by
systematic analysis of cDNAs and database construction. Curr.
Opinion Biotech., 4, 672–677.
Miller,R., Burke,J., Christoffels,A. and Hide,W. (1997) Towards a
more comprehensive conceptual consensus of the expressed genome: development of sequence tag alignment and consensus
knowledgebase (STACK) a novel error analytical approach to EST
consensus databases. Ninth International Genome Sequencing and
Analysis Conference.
Okubo,K., Hori,H., Matuba,R., Niiyama,T. and Matsubara,K. (1991)
A novel system for large-scale sequencing of cDNA by PCR
amplification. DNA Sequence, 2, 137–144.
Okubo,K., Hori,H., Matuba,R., Niiyama,T., Fukushima,A., Kiojima,Y. and Matsubara,K. (1992) Large-scale cDNA sequencing
analysis of quantitative and qualitative aspects of gene expression.
Nature Genetics, 2, 173–179.
Okubo,K., Yoshii,J., Yokouchi,H., Kameyama,M. and Matsubara,K.
(1994) An expression profile of active genes in human colonic
mucosa. DNA Res., 1, 37–45.
Pearson,W.R. (1990) Rapid and sensitive sequence comparison with
FASTP and FASTA. In Doolittle,R.F. (ed.), Molecular Evolution:
Computer Analysis of Protein and Nucleic Acid Sequences, Methods
in Enzymology. Academic Press, San Diego, pp. 63–98.
Schuler,G.D., Boguski,M.S., Stewart,E.A., Stein,L.D., Gyapay,G.,
Rice,K., White,R.E., Rodriguez-Tome,P., Aggarwal,A., Bajorek,E.,
Bentolila,S., Birren,B.B., Butler,A., Castle,A.B., Chiannilkulchai,N., Chu,A., Clee,C., Cowles,S., Day,P.J.R., Dibling,T.,
Drouot,N., Dunham,I., Duprat,S., East,C., Edwards,C., Fan,J.B.,
Fang,N., Fizames,C., Garrett,C. Green,L., Hudson,T.J. et al. (1996)
A gene map of the human genome. Science, 274, 540–546.
Sutton,G., White,O., Adams,M.D. and Kerlavage,A.R. (1995) TIGR
assembler: a new tool for assembling large shotgun sequencing
projects. Genome Sci. Technol., 1, 9–18.
Venter,J.C. (1993) Identification of new human receptor and transporter genes by high throughput cDNA (EST) sequencing. J.
Pharm. Pharmacol., 45(suppl. 1), 355–360.
White,O. and Kerlavage,A.R. (1996) TDB: new databases for
biological discovery. Methods Enzymol., 206, 27–41.
Wilcox,A.S., Khan,A.S., Hopkins,J.A. and Sikela,J.M. (1991) Use of
3′ untranslated sequences of human cDNAs for rapid chromosomal
assignment and conversion to STSs: implications for an expression
map of the genome. Nucleic Acids Res., 19, 1837–1843.
Williamson,A.R., Elliston,K.O. and Sturchio,J.L. (1995) The Merck
Gene Index, a public resource for genomics research. J. NIH Res., 7,
61–63.
Worley,K.C., Wiese,B.A. and Smith,R. (1995) BEAUTY: an enhanced
BLAST-based search tool that integrates multiple biological information resources into sequence similarity results. Genome Res.,
5, 173–184.
Zukowski,J. (1997) Java AWT Reference. O’Reilly Press, Sebastopol,
CA.