CrusView: A Java-Based Visualization Platform
for Comparative Genomics Analyses in
Brassicaceae Species[OPEN]
Hao Chen and Xiangfeng Wang*
School of Plant Sciences, University of Arizona, Tucson, Arizona, 85721
ORCID IDs: 0000-0003-1224-0688 (H.C.); 0000-0002-6406-5597 (X.W.).
In plants and animals, chromosomal breakage and fusion events based on conserved syntenic genomic blocks lead to conserved
patterns of karyotype evolution among species of the same family. However, karyotype information has not been well utilized in
genomic comparison studies. We present CrusView, a Java-based bioinformatic application utilizing Standard Widget Toolkit/
Swing graphics libraries and a SQLite database for performing visualized analyses of comparative genomics data in Brassicaceae
(crucifer) plants. Compared with similar software and databases, one of the unique features of CrusView is its integration of
karyotype information when comparing two genomes. This feature allows users to perform karyotype-based genome assembly
and karyotype-assisted genome synteny analyses with preset karyotype patterns of the Brassicaceae genomes. Additionally,
CrusView is a local program, which gives its users high flexibility when analyzing unpublished genomes and allows users to
upload self-defined genomic information so that they can visually study the associations between genome structural variations
and genetic elements, including chromosomal rearrangements, genomic macrosynteny, gene families, high-frequency recombination sites, and tandem and segmental duplications between related species. This tool will greatly facilitate karyotype,
chromosome, and genome evolution studies using visualized comparative genomics approaches in Brassicaceae species.
CrusView is freely available at http://www.cmbb.arizona.edu/CrusView/.
The Brassicaceae (crucifer) plant family contains
more than 3,700 species, including the model plant
organism Arabidopsis (Arabidopsis thaliana); economically important crop species, such as Brassica rapa and
Brassica napus; and close relatives of Arabidopsis used
in abiotic stress research, such as Eutrema salsugineum
and Schrenkiella parvula. Because Brassicaceae plants
have high scientific and economic importance, several
whole-genome sequencing projects of the species in
this family have been recently launched (http://www.brassica.info). Moreover, Brassicaceae is also a good
system for population genomics. The 1001 Arabidopsis
Genomes Project (http://www.1001genomes.org/)
plans to generate complete genome sequences for 1,001
Arabidopsis strains to study the associations between
genetic variation and phenotypic diversity. The Value-directed Evolutionary Genomics Initiative project aims
to understand the genome evolution of Brassicaceae
species by sequencing several close relatives of Arabidopsis, such as Arabidopsis lyrata and Capsella rubella.
Recent advances in high-throughput sequencing technology have greatly expedited these whole-genome
sequencing projects of diverse nonmodel organisms.
* Address correspondence to [email protected].
The author responsible for distribution of materials integral to the
findings presented in this article in accordance with the policy described in the Instructions for Authors (www.plantphysiol.org) is:
Xiangfeng Wang ([email protected]).
[OPEN]
Articles can be viewed online without a subscription.
www.plantphysiol.org/cgi/doi/10.1104/pp.113.219444
Although increasingly longer reads can now be
produced from high-throughput sequencing experiments, de novo assembler tools can only generate
contig and/or scaffold sequences from high-throughput
sequencing reads. These tools cannot generate complete
chromosome sequences without genetic and/or physical maps that typically require years to create. This
limitation makes chromosome-scale structural variation
(i.e. translocation, inversion, deletion and insertion, and
segmental and tandem duplication) and genomic macrosynteny analyses difficult to perform.
In both plants and animals, genomes of species
within the same family have evolved with conserved
karyotype patterns due to the rearrangements of large
chromosomal segments. Chromosomal karyotypes can
be obtained from comparative chromosomal painting
(CCP) experiments by performing in situ hybridization experiments on bacterial artificial chromosome
sequences between related species. The genome of
each Brassicaceae member is composed of 24 conserved genomic blocks that have been considered as
the basic units of chromosomal rearrangement during
genome evolution (Lysak et al., 2006). The sizes of
these conserved blocks range from several to dozens
of megabases. Currently, karyotypes profiled by CCP
experiments in approximately 20 Brassicaceae species
are available; such karyotypes include those from
Arabidopsis (n = 5), Hornungia alpina (n = 6), Eutrema
spp. (n = 7), A. lyrata (n = 8), B. rapa (n = 10), and
Polyctenium fremontii (n = 14). By utilizing the karyotype information in Brassicaceae, we have developed
a tool, KGBassembler (for Karyotype-based Genome
assembler for Brassicaceae), to finalize the assembly of
chromosomes from scaffolds/contigs without relying
on a genetic/physical map (Ma et al., 2012).
Plant Physiology®, September 2013, Vol. 163, pp. 354–362, www.plantphysiol.org © 2013 American Society of Plant Biologists. All Rights Reserved.
Over the past 2 years, complete whole-genome
sequences of several Brassicaceae species have been
released, including the aforementioned A. lyrata,
S. parvula, B. rapa, and E. salsugineum (Dassanayake
et al., 2011; Hu et al., 2011; Wang et al., 2011; Wright
and Agren, 2011; Wu et al., 2012; Yang et al., 2013).
These genomic resources have opened a new era of
comparative genomics in Brassicaceae to better understand the genomic evolution (Cheng et al., 2012).
Numerous tools and databases are available for performing comparative genomics analysis in plants.
CoGe is a comparative genomics analysis platform
that is now a part of the iPlant Collaborative Project
(Goff et al., 2011). The CoGe database currently includes nearly 2,000 genome sequences of approximately 1,500 organisms, allowing users to perform
online visual analyses of genome synteny and duplication events (Tang and Lyons, 2012). PLAZA and
Vista are also Web-based databases that provide
comparative analysis services on the genomic data
deposited in the databases (Frazer et al., 2004; Van Bel
et al., 2012). Other stand-alone bioinformatic applications for comparative genomic analysis, such as Easyfig
and genoPlotR, are commonly used to generate synteny
plots of given genome segments at a scale ranging from
a single gene to one chromosome (Guy et al., 2010;
Sullivan et al., 2011).
In this work, we present a Java-based bioinformatic
application, CrusView, for performing visualized
analyses of genome synteny and karyotype evolution
in Brassicaceae species. CrusView features a user-friendly graphical user interface (GUI) implemented
with Standard Widget Toolkit (SWT)/Swing graphics
libraries and a SQLite database used to manage local
genomic data. Compared with the most commonly
used tools in comparative genomics, one of the
unique features of CrusView is that available karyotype data of a Brassicaceae species are incorporated to
facilitate karyotype-based chromosome assembly and
analyses of chromosomal structural evolution. Compared with Web-based tools, the stand-alone CrusView tool was also designed to give users higher
flexibility in analyzing currently unpublished genome
data and integrating self-defined genomic information based on the users’ interests, such as gene families, gene duplications, chromosomal break points,
Gene Ontology terms, and groups of orthologs/
paralogs, with the genomic synteny maps. In addition, CrusView can generate images representing
genomic synteny between two compared genomes
in PNG/SVG/PDF high-resolution formats that are
suitable for publication.
RESULTS AND DISCUSSION
To demonstrate the basic functionality of CrusView,
we prepared two example genomes and related data
sets from Arabidopsis (n = 5) and E. salsugineum (n = 7)
to perform visualized comparative genomics analyses.
E. salsugineum (also known as salt cress and Thellungiella halophila) is a halophytic relative of Arabidopsis;
it inhabits the seashore saline soils of eastern China.
Because E. salsugineum and Arabidopsis share similar
life cycles, morphological characters, and genetic
composition, E. salsugineum has been widely used in
plant salt-tolerance studies using the genetic systems
and molecular tools previously established in Arabidopsis. The E. salsugineum genome (243 Mb) contains
seven chromosomes and approximately 24,000
protein-coding genes (Yang et al., 2013). The karyotype maps derived from CCP experiments of both
E. salsugineum and Arabidopsis are currently available (Lysak et al., 2006). We used these two genomes
to demonstrate the karyotype-based genome assembly of the E. salsugineum chromosomes and the comparative analyses of E. salsugineum and Arabidopsis
with integrated karyotype information.
Overview of the Functional Panels in CrusView
CrusView can be launched via Web start at http://www.cmbb.arizona.edu/crusview. The navigation
panel includes quick buttons that perform basic operations in CrusView. The published karyotypes of 20
Brassicaceae species have been integrated into CrusView, and they are shown in the left “karyotype”
panel. We will continue to collect published karyotypes generated from CCP experiments. Each
time CrusView is launched, the program will automatically query the CrusView server to update the
local karyotype database. Genomic data files from
E. salsugineum and Arabidopsis can be imported into
the SQLite database to run a demonstration for users
who run CrusView for the first time. The primary visualization window shows the seven chromosomes
of the primary E. salsugineum genome (Fig. 1). The
protein-coding genes of E. salsugineum are designated
with the corresponding colors based on the conserved genomic blocks in which they are located.
The top right panel shows the color schemes and the
letter labels for the 24 genomic blocks (A–X), while
the bottom right panel shows the five chromosomes
of the secondary Arabidopsis genome (Fig. 1). The
information window displays the genomic annotations of the genes in the primary genome recorded in
the Browser Extensible Data (BED) file, including the
gene identifiers (IDs), chromosomal locations, genomic block IDs, orthologous group IDs, sequence
similarities with the homologs in the secondary genome, gene functional descriptions, and other user-defined information (Fig. 1). The user can switch the
primary and secondary genomes, zoom in/out of the
chromosome images, perform a query for genes of
interest, and invoke a chromosome-level comparison
window using the quick buttons in the navigation
panel.
Figure 1. Functional panels in the CrusView main screen. A, Navigation panel. B, List of available karyotypes in Brassicaceae
species. C, Main window showing the primary genome (E. salsugineum). D, Color scheme and letter labels of the 24 conserved
genomic blocks. E, Window showing the secondary genome (Arabidopsis). F, Gene annotation panel. G, Digital ancestral
karyotype of A. lyrata. H, Digital karyotype of Arabidopsis. I, Digital karyotype of E. salsugineum.
Visualized Karyotype Comparison between E. salsugineum
and Arabidopsis
One of the unique functions of CrusView is that it
can generate the digital karyotype of a genome,
allowing users to visually compare the chromosomal
karyotypes of the primary and secondary genomes.
The A. lyrata (n = 8) genome represents an ancestral
karyotype in the Brassicaceae family in which each
member’s genome is composed of 24 conserved genomic blocks according to the karyotype analyses of
several representative species in the family using CCP
experiments (Lysak et al., 2006). Each conserved genomic block is a large chromosomal segment that can
be represented by a group of Arabidopsis genes in
synteny with their orthologs in the genomes of other
Brassicaceae species. Thus, the Arabidopsis genes can
be used as markers to infer the assignment of the 24
conserved genomic blocks to another species’ genome in Brassicaceae (Lysak et al., 2006; Yang et al.,
2013). Our previously developed software program
KGBassembler includes a pipeline to assign the genes
in a Brassicaceae species genome to the 24 conserved
genome blocks, with a color scheme and a letter label
(A–X) based on the homology with Arabidopsis genes
(Ma et al., 2012). Here, we elucidate this procedure
using E. salsugineum as a newly sequenced genome
based on three basic steps: first, the Arabidopsis amino
acid sequences were mapped to the E. salsugineum
scaffold sequences using BLAST, followed by the selection of the best aligned locations; second, the Arabidopsis genes mapped onto the E. salsugineum
scaffolds were used to infer the conserved genomic
blocks, followed by the assignment of the color
schemes and letter labels of the 24 blocks to the
E. salsugineum genes; and third, pseudochromosome
sequences were generated based on the CCP-derived
(n = 7) karyotype of E. salsugineum. This pipeline was
integrated into CrusView and can be applied to any
newly sequenced Brassicaceae species genome to perform karyotype-based genome assembly and generate
digital karyotypes for comparison purposes.
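The first two steps of this procedure can be illustrated with a short sketch. The Python fragment below is not the KGBassembler implementation; it assumes BLAST tabular output (`-outfmt 6`), in which columns 1 and 2 hold the query and subject IDs, columns 9 and 10 the subject coordinates, and column 12 the bit score. The `gene_to_block` lookup and all example values are hypothetical and do not reflect real block boundaries.

```python
from collections import Counter

def best_hits(blast_lines):
    """Keep the highest-bit-score alignment for each query (Arabidopsis protein)."""
    best = {}
    for line in blast_lines:
        if not line.strip() or line.startswith("#"):
            continue
        cols = line.rstrip("\n").split("\t")
        query, scaffold = cols[0], cols[1]
        # Subject coordinates may be reversed for minus-strand hits.
        start, end = sorted((int(cols[8]), int(cols[9])))
        bitscore = float(cols[11])
        if query not in best or bitscore > best[query][3]:
            best[query] = (scaffold, start, end, bitscore)
    return best

def blocks_per_scaffold(hits, gene_to_block):
    """Assign each scaffold the block letter supported by most mapped genes."""
    votes = {}
    for gene, (scaffold, _start, _end, _score) in hits.items():
        block = gene_to_block.get(gene)
        if block is not None:
            votes.setdefault(scaffold, Counter())[block] += 1
    return {scaffold: c.most_common(1)[0][0] for scaffold, c in votes.items()}

# Hypothetical example data (illustrative values only).
example = [
    "AT1G01010.1\tscaffold_1\t85.0\t200\t30\t0\t1\t200\t1000\t1600\t1e-90\t310",
    "AT1G01010.1\tscaffold_7\t60.0\t180\t72\t0\t1\t180\t500\t1040\t1e-40\t150",
    "AT1G01020.1\tscaffold_1\t90.0\t150\t15\t0\t1\t150\t2500\t2050\t1e-70\t250",
]
gene_to_block = {"AT1G01010.1": "A", "AT1G01020.1": "A"}  # hypothetical lookup

hits = best_hits(example)
scaffold_blocks = blocks_per_scaffold(hits, gene_to_block)
```

A majority vote per scaffold is only one plausible way to summarize the mapped genes; a scaffold spanning two blocks would need a position-aware assignment instead.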
In CrusView, the digital karyotypes of the primary
and secondary genomes will greatly facilitate visualized
genomic comparison and the identification of major
chromosomal rearrangement events causing the genomic evolution of the chromosomal karyotype in the
studied Brassicaceae species genome. For example,
Arabidopsis chromosome 2 (AtChr2) resulted from the
merging of E. salsugineum chromosome 4 (EsChr4) and
the long arm (14–37 Mb) of EsChr3 (Fig. 1). Moreover,
when compared with the ancestral karyotype of the
eight A. lyrata chromosomes, users may study the different evolutionary paths of the karyotype in another
species. For example, although AtChr1 resulted from
the merging of A. lyrata AlChr1 and AlChr2, the structure
of EsChr1 remains unchanged compared with AlChr1
(Fig. 1). Furthermore, users can search for the IDs of genes of interest
or for ortholog group IDs from the navigation panel and
map their positions on the compared primary and secondary genomic karyotypes.
Visualized Fine Adjustment of Pseudochromosome
Assembly in CrusView
The automatic generation of pseudochromosome
sequences based on the KGBassembler algorithm may
miss or misplace scaffolds that are relatively short or
contain too many repetitive sequences, because such
scaffolds do not carry sufficient gene synteny information
for inferring the assignment of conserved genomic blocks.
Additionally, de novo scaffold assembly is
usually interrupted at the edges of highly repeated
centromere sequences. Thus, manual adjustment of the
pseudochromosomes may be necessary. Unlike
KGBassembler, in which users need to edit a text file
for manual adjustment, CrusView allows users to
perform visualized fine adjustment of pseudochromosome assembly in the GUI and to consider additional
genomic information, such as positions of genetic
markers, centromere-specific (CentO) tandem repeats,
and the density of protein-coding genes during the
adjustment. Users can directly load the project result
produced in KGBassembler for visualized fine adjustment or use the “assembling” function in CrusView to
assemble pseudochromosomes from the scaffold sequences. When the assembling function in CrusView is
run for the first time, users must indicate the working
folder containing the required input files described in
“Materials and Methods” and an output folder to save
the generated chromosome sequences. Users may set
up necessary parameters in the “parameter panel” and
save the parameters into a Windows Initialization
(INI) configuration file that can be directly loaded to
run the assembling function (Fig. 2). The details of the
parameters are explained in the KGBassembler
manual, and users may wish to apply different parameter settings to produce the best assembly, which is largely dependent on the quality of the
scaffold sequences themselves as generated by de novo
assembler tools.
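As an illustration of how such an INI configuration file can be read, the following Python sketch uses the standard `configparser` module. The section name and the two parameter names are hypothetical placeholders; the actual parameters are those defined in the KGBassembler manual.

```python
import configparser

# Hypothetical configuration; real parameter names are defined in the
# KGBassembler manual and are not reproduced here.
EXAMPLE_INI = """
[assembly]
; minimum number of mapped genes needed to place a scaffold (assumed name)
min_genes_per_scaffold = 5
; scaffolds shorter than this many bp are left unplaced (assumed name)
min_scaffold_length = 10000
"""

def load_parameters(text):
    """Parse INI text and return the assembly parameters as integers."""
    parser = configparser.ConfigParser()
    parser.read_string(text)
    section = parser["assembly"]
    return {
        "min_genes_per_scaffold": section.getint("min_genes_per_scaffold"),
        "min_scaffold_length": section.getint("min_scaffold_length"),
    }

params = load_parameters(EXAMPLE_INI)
```

Keeping the parameters in a file rather than only in the GUI makes it straightforward to rerun the assembly with several settings and compare the results.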
Figure 2. Genome assembling function. A, Digital karyotype of E. salsugineum. B, Unplaced short-scaffold sequences.
C, Parameter panel. D, Menu bar. E, Main working panel for the manual curation of the genome assembly of E. salsugineum.
F, Density of protein-coding genes on scaffolds. G, CentO tandem repeat. H, Genetic marker track.
To fine-tune the draft pseudochromosome sequences, CrusView allows users to add files containing
genetic markers and CentO tandem repeats. In plants,
CentO sequences are approximately 170-bp motifs that
are tandemly arrayed and specifically located in the
core centromeric regions (Benson, 1999). CentO repeats
located at one end of a long scaffold are generally
indicative of the centromeric end of that scaffold (Fig. 2).
Moreover, the density of protein-coding genes is typically higher in the euchromatic regions of short and
long arms than in the pericentromeric heterochromatic
regions (Fig. 2). Thus, these types of information are
very useful in assisting users to further inspect and
adjust the scaffold layouts and orientations on the
chromosomes as well as the genomic positions of
the genetic markers. Users can simply perform drag-and-drop actions with a mouse to correct potentially
misplaced scaffolds or to adjust the orientation of
scaffolds. When a manual adjustment is performed,
users can save the pseudochromosome sequences to a
FASTA file and simultaneously generate the gene annotation file. Finally, users can use the “push to main
screen” function to directly add the assembled pseudochromosome and perform further visualized comparative analyses.
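The use of CentO repeats as an orientation hint can be sketched as follows. This Python fragment is an illustration under stated assumptions, not CrusView's own logic: repeat locations are taken as BED-style (start, end) intervals on one scaffold, and the `terminal_fraction` threshold that defines a scaffold "end" is a hypothetical parameter.

```python
def centromeric_end(scaffold_length, repeats, terminal_fraction=0.1):
    """Guess which scaffold end faces the centromere from CentO repeat positions.

    repeats: list of (start, end) intervals in bp (0-based, half-open).
    Returns "start", "end", or None when the repeats give no clear signal.
    """
    window = scaffold_length * terminal_fraction
    # Total repeat length lying entirely within each terminal window.
    near_start = sum(e - s for s, e in repeats if e <= window)
    near_end = sum(e - s for s, e in repeats if s >= scaffold_length - window)
    if near_start > near_end:
        return "start"
    if near_end > near_start:
        return "end"
    return None

# Hypothetical scaffold of 1 Mb with CentO arrays close to its 3' end.
terminal_call = centromeric_end(1_000_000, [(950_000, 951_700), (960_000, 960_850)])
# Repeats in the middle of the scaffold are not informative.
internal_call = centromeric_end(1_000_000, [(400_000, 400_170)])
```

Such a call is only a suggestion for manual curation; gene density and genetic markers should be inspected alongside it.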
Visualization of Genomic Synteny between Two Genomes
The “compare two genomes” function in CrusView
can provide a visualization of genomic synteny for
each pair of homologous chromosomes for the primary
and secondary genomes. Chromosome-scale genomic
synteny can be visualized in two ways: as a chromosomal karyotype with homologous genes linked between the two chromosomes and as a dot plot indicating
chromosomal macrosynteny with duplication events
(Fig. 3A). For example, a comparison of the karyotypes
of EsChr4 and AtChr2 indicated that Arabidopsis
chromosome 2 resulted from an event in which the
entire chromosome 4 (genomic blocks I and J) merged
with the long arm of chromosome 3 (genomic blocks
K, G, and H) in E. salsugineum (Fig. 3A). In addition,
the visualized chromosomal synteny with karyotype
information can also allow users to examine the differences in the chromosome structures between the
two genomes. For instance, the 18-Mb-long region from
27 to 35 Mb of block J on EsChr4 remains highly similar
to the 17-Mb-long region from 13 to 20 Mb on AtChr2,
whereas the 25-Mb-long I block of EsChr4 has seemingly expanded dramatically, with highly enriched
repetitive sequences and transposable elements, compared with the corresponding approximately 17-Mb I
block region on AtChr2. More interestingly, a small
region of EsChr4 between positions 10 and 11 Mb was
found to result from the inverted translocation of a region from AtChr2. Selecting a genomic region
with the mouse invokes the information window,
which lists the genes located in the region of interest. By clicking on a gene homologous to the corresponding Arabidopsis gene, users will be redirected to
The Arabidopsis Information Resource database, which
contains detailed gene function information.
Figure 3. Visualization of genome synteny and gene alignment. A, Panels for genome synteny visualization: a, navigation bar;
b, primary genome; c, secondary genome; d, chromosome synteny; e, dot plot; f, genes in the selected area; g, action list;
h, selection of segmental duplication; i, genes in the ortholog groups. B, Alignment of multiple gene members in the CDPK
family showing tandem duplication events. C, Exon-level alignment of the SALT OVERLY SENSITIVE1 (SOS1) genes between
Arabidopsis and E. salsugineum.
Chromosome-scale genomic synteny can also be visualized as a dot plot in CrusView to facilitate the
identification of segmental duplication and tandem
duplication events between the two compared species.
From the dot-plot screen, users can select the regions
containing duplication events of interest with the
mouse to obtain information regarding the genes located in the selected regions (Fig. 3A). Right clicking
the mouse will invoke a pull-down list of advanced
actions, such as querying selected genes in the external
The Arabidopsis Information Resource database to
view detailed functional descriptions, retrieving gene
sequences to a FASTA file, performing exon-level
sequence alignment for a single gene, and aligning
multiple genes in a user-defined synteny region using
AJaligner. Figure 3B demonstrates a genomic region
between 23.8 and 24.1 Mb on AtChr4 encompassing
two tandem duplication events of the gene members in
the calcium-dependent protein kinase (CDPK) family
that may be involved in stress-responsive pathways
in Arabidopsis. While AtCDPK27 and AtCDPK31
represent a pair of tandemly duplicated genes that
correspond to the single-copy E. salsugineum gene
Thhalv10028618m.g, AtCDPK21 and AtCDPK23 correspond to the single-copy gene Thhalv10028567m.g
(Fig. 3B). An exon-level sequence alignment of a pair of
interesting orthologous genes will reveal exon-level
structural variations, amino acid variations, insertions and deletions, and single-nucleotide polymorphisms, which is illustrated by the comparison of
SALT OVERLY SENSITIVE1 in Arabidopsis and its
E. salsugineum ortholog (Fig. 3C).
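Conceptually, each dot in such a plot is a homologous gene pair placed at its position in the primary genome (x axis) and in the secondary genome (y axis), and a mouse selection corresponds to a rectangle in that plane. The Python sketch below illustrates this idea only; the tuple layout, gene IDs, and coordinates are hypothetical and are not taken from CrusView.

```python
def select_region(pairs, x_range, y_range):
    """Return the homologous gene pairs whose dot falls inside a rectangle.

    pairs: list of (primary_gene, primary_pos, secondary_gene, secondary_pos),
    with positions in bp. x_range and y_range are (low, high) in either order.
    """
    (x0, x1), (y0, y1) = sorted(x_range), sorted(y_range)
    return [p for p in pairs if x0 <= p[1] <= x1 and y0 <= p[3] <= y1]

# Hypothetical homologous pairs between a primary and a secondary chromosome.
pairs = [
    ("Es_gene_1", 10_200_000, "At_gene_1", 14_100_000),
    ("Es_gene_2", 10_800_000, "At_gene_2", 14_050_000),
    ("Es_gene_3", 27_500_000, "At_gene_3", 13_400_000),
]
selected = select_region(pairs, (10_000_000, 11_000_000), (13_000_000, 15_000_000))
selected_ids = [p[0] for p in selected]
```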
Visualization of a User-Defined List of Genes, Duplication
Events, and Copy Number Variations in a Genomic
Synteny Plot
Using CrusView, users may visualize a group of
genes of interest in the two compared genomes to
determine their associations with genomic synteny
and possible duplication events. We demonstrate
this utility by analyzing the tandemly duplicated
F-box superfamily that has been found to display great
copy number variations between Arabidopsis (505
genes) and E. salsugineum (613 genes). First, the genes
in E. salsugineum were assigned to the orthologous
groups annotated in the OrthoMCL database (Li et al.,
2003). Each ortholog group indicated by a unique ID
contains the putative orthologous genes in Arabidopsis and E. salsugineum. We found that one of the
ortholog groups (OG5_127192) that showed high variation in copy number contained 148 and 130 F-box
genes in Arabidopsis and E. salsugineum, respectively. In plants, F-box genes constitute a large superfamily encoding components of E3 ubiquitin ligases that are involved
in substrate-specific protein degradation. Next, using
the “predict tandem duplication” function in CrusView, highly homologous genes (defined with a cutoff
of 40% protein identity) located adjacent to each
other within a 5-kb window were highlighted in green
in the dot plot of EsChr3 and AtChr3 (Fig. 4). The
protein identity cutoff and window size can both be
adjusted by the user when predicting tandem duplications. Then, using the “keyword search” function, a
group of genes of interest can be displayed in the current
dot plot. For instance, when searching ID OG5_127192,
F-box genes classified in this ortholog group by
OrthoMCL were highlighted in red in the same dot-plot image (Fig. 4). From the overlapping green dots
(tandemly duplicated genes) and red dots (F-box genes
in group OG5_127192), we observed a macrosyntenic
block covering an approximately 5-Mb region on AtChr3
and an approximately 15-Mb region on EsChr3,
encompassing 59 and 78 tandemly arrayed F-box
genes in Arabidopsis and E. salsugineum, respectively
(Fig. 4).
Similarly, users can also add additional genomic
information to the BED file to allow searching for self-defined keywords, such as Gene Ontology terms, gene
functional descriptions, or gene families. CrusView
also allows users to filter a list of genes or genomic
positions of interest from the user-defined genomic
information file, which can be displayed on the dot-plot synteny map. Users can define the color schemes
for different gene groups on the plots using the setting
function of CrusView. Finally, the digital karyotype
maps, macrosynteny plots based on the 24 color-coded
genomic blocks, and dot-plot synteny map showing
duplication events and mapped genes of interest can
be saved as publication-quality PNG/SVG/PDF images.
CONCLUSION
In this work, we developed a Java-based bioinformatic application, CrusView, using the powerful
SWT/Swing graphics libraries in Java and a SQLite
database; this application was designed to facilitate
research in comparative genomics. We demonstrated
the basic functionality of CrusView by performing a
visual comparison of the Arabidopsis and E. salsugineum
genomes in the plant Brassicaceae (crucifer) family.
Compared with other bioinformatic tools that have
been developed for similar purposes, one of CrusView’s unique features is its incorporation of genomic
karyotype information derived from CCP experiments.
The karyotype of a species associated with the genome
structure visualized in CrusView can greatly assist
users in identifying chromosomal rearrangements,
genomic synteny, and major duplication events among
the related species. Thus, this unique CrusView feature
may facilitate the understanding of karyotype, chromosome, and genome evolution based on a comparative genomics approach. Furthermore, by taking
advantage of a species’ karyotype, CrusView provides a unique function to infer pseudochromosome
sequences from scaffold sequences generated by de
novo assemblers based on conserved genomic blocks.
This feature is especially convenient for nonmodel
species that lack a genetic and/or physical map. However, users should be aware that CrusView does not
replace de novo assembler tools, and its performance
in finalizing the assembly of a pseudochromosome sequence depends largely on the quality of the scaffolds
and contigs produced from whole-genome shotgun
sequencing projects.
Figure 4. Mapping duplication events and genes of interest onto the dot-plot synteny map. A dot-plot synteny map of EsChr3
and AtChr3 is shown. The blue dots represent homologous gene pairs in the Arabidopsis and E. salsugineum genomes. The blue
dots arranged along the diagonal line indicate a macrosynteny region. The aligned blue dots deviating from the diagonal line
indicate segmental duplications. The green dots represent potential tandemly duplicated genes selected using a protein identity
cutoff of 40% and a 5-kb window size. The red dots represent F-box genes selected by a keyword search. The overlapping red
dots and green dots indicate the tandemly duplicated F-box genes on EsChr3 and AtChr3.
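The tandem-duplication criterion used for Figure 4 (a protein identity cutoff of 40% and a 5-kb window, both adjustable) can be sketched in Python as follows. This is a minimal illustration rather than the CrusView implementation: it assumes that "within a 5-kb window" means that the gap between neighboring genes on the same chromosome is at most 5 kb, and the gene records and identity values are hypothetical.

```python
def tandem_pairs(genes, identity, min_identity=40.0, window=5000):
    """Flag neighboring genes that look like tandem duplicates.

    genes: list of (gene_id, chrom, start, end) in bp.
    identity: dict mapping frozenset({id_a, id_b}) to percent protein identity.
    """
    by_chrom = {}
    for gene in genes:
        by_chrom.setdefault(gene[1], []).append(gene)
    pairs = []
    for chrom_genes in by_chrom.values():
        chrom_genes.sort(key=lambda g: g[2])
        # Compare each gene with its immediate downstream neighbor only.
        for left, right in zip(chrom_genes, chrom_genes[1:]):
            gap = right[2] - left[3]
            pct = identity.get(frozenset((left[0], right[0])), 0.0)
            if gap <= window and pct >= min_identity:
                pairs.append((left[0], right[0]))
    return pairs

# Hypothetical genes and pairwise identities.
genes = [
    ("geneA", "Chr3", 1_000, 3_000),
    ("geneB", "Chr3", 4_500, 6_500),
    ("geneC", "Chr3", 20_000, 22_000),
    ("geneD", "Chr3", 23_000, 25_000),
]
identity = {
    frozenset(("geneA", "geneB")): 62.0,  # close and similar: tandem pair
    frozenset(("geneB", "geneC")): 80.0,  # similar but 13.5 kb apart
    frozenset(("geneC", "geneD")): 25.0,  # close but below the cutoff
}
predicted = tandem_pairs(genes, identity)
```

Comparing only immediate neighbors keeps the sketch simple; arrays interrupted by an unrelated gene would need a wider neighborhood.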
CrusView also includes an array of utilities that can
be used to visualize genome synteny and duplication
events and to map a list of genes of interest associated
with syntenic regions between the two analyzed genomes. Compared with database-based comparative
genomics tools, CrusView is much more flexible in
the ability to analyze unpublished genomes; it allows
users to integrate self-defined genomic information,
such as Gene Ontology classifications, gene families of
interest, hot spots of chromosomal breakage/fusion
points, high-frequency recombination sites, and tandem duplication to study their correlations with genomic variations and duplication events. User-defined
information and genome synteny plots can be exported
as high-resolution, publication-quality PNG/SVG/PDF
images.
Karyotype mapping based on in situ hybridization
experiments is a common genomic technique that is
widely used in animals and plants. Conserved patterns
of chromosomal rearrangements based on syntenic
genomic blocks as basic units of chromosomal breakage and fusion events are commonly observed in
the animal and plant kingdoms (Lysak et al., 2006;
Ferguson-Smith and Trifonov, 2007). Therefore, although CrusView was primarily developed and preset
based on the karyotype evolution patterns in the
Brassicaceae family (primarily for the convenience of
the Brassicaceae community), this software program
may also be used to perform karyotype-based genome
assembly or karyotype-assisted genome synteny
analysis in other plant families or in other organisms
for which karyotype data exist. If users wish to use the
current version of CrusView for non-Brassicaceae species, they can access the “setting” function to define the
color schemes and letter labels of the conserved genomic blocks based on the karyotype evolution patterns
of the species of interest. Additionally, to promote the
broad use of CrusView in other organisms, the source
code of CrusView has been released through Sourceforge.net to allow academic users to freely download
and modify the programs.
to be used in CrusView. The user must provide the genome/scaffold
sequences in FASTA format, the gene annotation file in GFF or GTF format, and one additional karyotype file if the user wants to perform
karyotype-based assembly of pseudochromosome sequences. The user is
also prompted to submit protein sequences to the OrthoMCL online database (http://www.orthomcl.org) to assign the genes to the corresponding
ortholog groups to facilitate genome comparisons, gene duplication analyses, and copy number variation analyses. To assign the 24 conserved genome block IDs to the genes, the user must provide a BLAST result of the
protein sequences of the analyzed genome against Arabidopsis proteins.
Additional genomic information that the user wishes to include will be
integrated into the last column of the BED file to enable the keyword search
function in CrusView.
MATERIALS AND METHODS
Basic Input Files for CrusView
CrusView utilizes the Java Web-start function so that it can be launched
through the CrusView homepage. When it is run for the first time, CrusView
creates a “CrusView” folder on the user’s local computer and automatically
installs the programs and basic data set in the folder. CrusView simultaneously creates a local Java SQLite database to manage the genomic data that
the user wishes to analyze. The data files include a FASTA file containing
chromosome or scaffold/contig sequences and a GFF file containing gene
model annotation that will be imported into the SQLite database. The user
must also prepare a BED file in the “bed” folder to provide additional information, such as ortholog group IDs, genome block IDs, and protein sequence
identities between the primary and secondary genomes. To enable the advanced search function, the BED file may also include the user’s self-defined
genomic information and functional descriptions added in the last column,
such as Gene Ontology terms, gene families, recombination hot spots, and so
on. To analyze a specific group of genes of interest, the user can load a TXT
file containing the gene IDs or genomic positions, together with their descriptions,
into CrusView.
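To make the import step concrete, the following minimal Python sketch loads gene features from GFF3-style lines into an SQLite table. The table name, its columns, and the function name are hypothetical and are not taken from CrusView; the sketch also assumes that each gene line carries an `ID=` attribute in column 9.

```python
import sqlite3


def load_genes(gff_lines, conn):
    """Insert 'gene' features from GFF3 lines into a hypothetical 'gene' table.

    Returns the number of gene records written.
    """
    conn.execute(
        "CREATE TABLE IF NOT EXISTS gene ("
        "gene_id TEXT PRIMARY KEY, seqid TEXT, "
        "gene_start INTEGER, gene_end INTEGER, strand TEXT)"
    )
    rows = []
    for line in gff_lines:
        line = line.rstrip("\n")
        if not line or line.startswith("#"):
            continue  # skip blank lines, directives, and comments
        cols = line.split("\t")
        if len(cols) != 9 or cols[2] != "gene":
            continue  # keep only well-formed gene features
        # Column 9 holds semicolon-separated key=value attributes.
        attrs = dict(
            field.split("=", 1) for field in cols[8].split(";") if "=" in field
        )
        gene_id = attrs.get("ID")
        if gene_id is None:
            continue
        rows.append((gene_id, cols[0], int(cols[3]), int(cols[4]), cols[6]))
    conn.executemany("INSERT OR REPLACE INTO gene VALUES (?, ?, ?, ?, ?)", rows)
    conn.commit()
    return len(rows)
```

An in-memory database (`sqlite3.connect(":memory:")`) is convenient for trying the function before pointing it at a file on disk.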
Input Files for Karyotype-Based Genome Assembly
For species with only scaffold sequences but with an available CCP-derived karyotype map, a karyotype-based genome assembly of pseudochromosomes from scaffold sequences is recommended. The KGBassembler
will be invoked by the “assembling” function in CrusView. The assembly
function requires the following input files: a KARYOTYPE file containing
CCP-based karyotype information obtained from the CrusView Web site or
prepared by the user based on instructions, a PSL file containing Arabidopsis
(Arabidopsis thaliana) genes aligned on the scaffolds, and a FASTA file containing scaffold sequences. The user can either provide a configuration file in
Windows Initialization (INI) format or edit the “Parameter” tab in the CrusView
interface to set the parameters required for assembly. If the user has prepared a genetic map with
gene marker information as a GMM file in the format
described in the CrusView manual, CrusView can also incorporate this
information during the manual adjustment of the pseudochromosomes. To facilitate the prediction of scaffold orientations on the pseudochromosomes, the
user may run the Tandem Repeats Finder program (Benson, 1999) to
identify the scaffolds containing CentO sequences. CentO repeat locations formatted as a BED file can be loaded into CrusView as an additional track.
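As a rough illustration of how the PSL input relates Arabidopsis genes to scaffolds, the Python sketch below reads standard 21-column PSL records, keeps the alignment with the most matching bases for each gene, and groups the genes by scaffold in positional order. It is not the KGBassembler algorithm; the function name and the best-match rule are assumptions made for this example.

```python
from collections import defaultdict


def genes_by_scaffold(psl_lines):
    """Group Arabidopsis genes by the scaffold of their best PSL alignment.

    Returns {scaffold: [(target_start, gene, strand), ...]} sorted by position.
    """
    best = {}  # gene -> (matches, scaffold, target_start, strand)
    for line in psl_lines:
        cols = line.rstrip("\n").split("\t")
        if len(cols) != 21 or not cols[0].isdigit():
            continue  # skip the optional PSL header and malformed lines
        matches, strand, gene = int(cols[0]), cols[8], cols[9]
        scaffold, t_start = cols[13], int(cols[15])
        if gene not in best or matches > best[gene][0]:
            best[gene] = (matches, scaffold, t_start, strand)
    scaffolds = defaultdict(list)
    for gene, (_matches, scaffold, t_start, strand) in best.items():
        scaffolds[scaffold].append((t_start, gene, strand))
    return {name: sorted(hits) for name, hits in scaffolds.items()}
```

The ordered gene list per scaffold is the kind of evidence that can then be compared with the gene order expected from the karyotype blocks.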
After the KGBassembler has generated the pseudochromosome sequences,
the user may use CrusView to perform fine adjustments to the orientations and
orders of the scaffolds on the pseudochromosomes based on the additional
information provided by the user, such as the density of protein-coding genes,
user-customized genetic markers, and the locations of CentO tandem repeats
on the scaffolds. The CrusView GUI allows the user to further adjust
the pseudochromosome assembly with drag-and-drop mouse actions.
When the user clicks the “save assembly” button,
the pseudochromosome sequences and gene annotation information are
saved in a FASTA file and a GFF file, respectively.
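The effect of ordering and orienting scaffolds can be sketched in a few lines of Python. The example below is only an illustration under stated assumptions: the function names are hypothetical, scaffolds are held in memory as plain strings, and adjacent scaffolds are separated by a run of N characters whose default length (100) is chosen arbitrarily here.

```python
COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence (A, C, G, T, N only)."""
    return seq.translate(COMPLEMENT)[::-1]


def build_pseudochromosome(scaffolds, layout, gap_length=100):
    """Join scaffolds in the given order and orientation.

    scaffolds: dict {scaffold name: sequence}
    layout: list of (scaffold name, orientation), orientation being "+" or "-"
    gap_length: number of N characters placed between scaffolds (an assumption)
    """
    gap = "N" * gap_length
    parts = []
    for name, orientation in layout:
        seq = scaffolds[name]
        parts.append(seq if orientation == "+" else reverse_complement(seq))
    return gap.join(parts)


def to_fasta(name, seq, width=60):
    """Format one sequence as a FASTA record with fixed-width lines."""
    lines = [seq[i:i + width] for i in range(0, len(seq), width)]
    return ">" + name + "\n" + "\n".join(lines) + "\n"
```

Changing the order or orientation of a scaffold then amounts to editing one entry of `layout` and rebuilding the sequence, which is the operation that a drag-and-drop adjustment would drive.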
Conversion of a User’s Unpublished Genome Sequence and Self-Defined Gene Annotation to Input Files Compatible with CrusView
To support the analysis of a yet-to-be-published genome sequence,
CrusView includes a function that helps the user prepare the input files necessary
Inference of Genomic Macrosynteny Based on Conserved Genomic Blocks
The genomes of the Brassicaceae species share 24 conserved genomic blocks
(large chromosomal segments) designated A to X. The additional ID “0” is used
by CrusView to label undetermined regions that are not assigned to any genomic blocks. The chromosomal locations of the 24 genomic blocks can be
inferred from the CCP-derived karyotype. All genes located within the same
conserved genomic block are assigned the same color code to illustrate the
digital karyotype of the studied species. Genes that share the same genomic block ID are considered to lie in the same macrosyntenic
region. To analyze a genome lacking a CCP-derived karyotype, or a genome
from another plant or animal family with different conserved
genomic blocks, the user can define custom block IDs with hexadecimal color
codes in the BED file.
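A self-defined color scheme of this kind can be illustrated with a short Python sketch. The palette below is generated for this example only and does not reproduce the preset CrusView colors; the function names and the gray color chosen for block "0" are likewise assumptions.

```python
import colorsys
import string

# The 24 conserved block IDs, "A" through "X".
BLOCK_IDS = list(string.ascii_uppercase[:24])


def default_palette():
    """Return {block ID: "#RRGGBB"} with evenly spaced hues.

    Block "0" (undetermined regions) is given a neutral gray in this sketch.
    """
    palette = {"0": "#808080"}
    for i, block in enumerate(BLOCK_IDS):
        r, g, b = colorsys.hsv_to_rgb(i / len(BLOCK_IDS), 0.75, 0.9)
        palette[block] = "#{:02X}{:02X}{:02X}".format(
            round(r * 255), round(g * 255), round(b * 255)
        )
    return palette


def hex_to_rgb(code):
    """Convert "#RRGGBB" (or "RRGGBB") to an (R, G, B) tuple of integers."""
    code = code.lstrip("#")
    if len(code) != 6:
        raise ValueError("expected a six-digit hexadecimal color code")
    return tuple(int(code[i:i + 2], 16) for i in (0, 2, 4))
```

For another plant or animal family, `BLOCK_IDS` would simply be replaced by that family's own block labels.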
Visualization of Chromosomal Karyotype, Genomic Synteny, and Gene Alignment
The GUI and visualization functions of CrusView were implemented with
the Java SWT/Swing libraries. Visualization of the genomic
data of an analyzed species can be performed at three levels: the genome level,
the chromosome level, and the gene level. If the karyotype information has
been associated with the studied genome, all of the chromosomes will be visualized with the 24 genomic block IDs with corresponding colors. The user
can select any two chromosomes of interest in the two compared species to
visualize chromosomal synteny. When comparing the karyotypes of two
chromosomes, the pairs of orthologous genes between the two species are
linked to indicate major chromosomal rearrangement events. CrusView also
generates a dot plot for each pair of selected chromosomes to visualize tandem
and segmental duplication events. The user may select a group of genes from
the dot plot using a mouse framing action to trigger gene-level visualization.
At the gene level, CrusView can display a multigene alignment within a designated
genomic region (less than 1 Mb) between the two genomes, as well as an exon-to-exon
alignment of one pair of orthologous genes with single-nucleotide polymorphism
information.
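The data behind a chromosome-versus-chromosome dot plot can be sketched as below. This is a simplified illustration, not the CrusView implementation: it assumes that each gene is described by a hypothetical tuple of gene ID, ortholog group ID, and midpoint position in base pairs, and it emits one point for every pair of genes that share an ortholog group.

```python
from collections import defaultdict


def dot_plot_points(genes_a, genes_b):
    """Compute dot-plot coordinates for two chromosomes.

    genes_a, genes_b: iterables of (gene_id, ortholog_group, midpoint_bp).
    Returns a sorted list of (x_bp, y_bp, gene_a, gene_b), one entry for each
    pair of genes that belong to the same ortholog group.
    """
    by_group = defaultdict(list)
    for gene_id, group, pos in genes_b:
        by_group[group].append((gene_id, pos))
    points = []
    for gene_id, group, pos in genes_a:
        for other_id, other_pos in by_group.get(group, []):
            points.append((pos, other_pos, gene_id, other_id))
    return sorted(points)
```

In such a plot, a gene on one chromosome that pairs with several closely spaced genes on the other produces a tight cluster of points, which is the kind of pattern a tandem duplication would be expected to produce.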
Output Image Files Generated from CrusView
CrusView can generate high-resolution images
and save them in PNG, SVG, or PDF format for publication use. Such images
include digital karyotypes, genome synteny plots, dot plots of two chromosomes, multigene alignments within a genomic region, exon-to-exon alignment
plots, plots of genomic duplication events, and mapping of a list of genes of
interest in the genomic synteny plots.
Software Availability
CrusView is publicly available online (http://www.cmbb.arizona.edu/crusview)
and has been implemented as a Java Web Start application for
32- and 64-bit Windows and Linux systems, with options for different memory
sizes. Sample data sets from Arabidopsis and Eutrema salsugineum are provided to demonstrate the basic functions of CrusView. The software manual
and a series of video tutorials for CrusView are also provided online
(http://www.cmbb.arizona.edu/crusview/video_tutorial).
Received April 10, 2013; accepted July 20, 2013; published July 29, 2013.
Plant Physiol. Vol. 163, 2013
LITERATURE CITED
Benson G (1999) Tandem repeats finder: a program to analyze DNA sequences. Nucleic Acids Res 27: 573–580
Cheng F, Wu J, Fang L, Wang X (2012) Syntenic gene analysis between
Brassica rapa and other Brassicaceae species. Front Plant Sci 3: 198
Dassanayake M, Oh DH, Haas JS, Hernandez A, Hong H, Ali S, Yun DJ,
Bressan RA, Zhu JK, Bohnert HJ, et al (2011) The genome of the extremophile crucifer Thellungiella parvula. Nat Genet 43: 913–918
Ferguson-Smith MA, Trifonov V (2007) Mammalian karyotype evolution.
Nat Rev Genet 8: 950–962
Frazer KA, Pachter L, Poliakov A, Rubin EM, Dubchak I (2004) VISTA:
computational tools for comparative genomics. Nucleic Acids Res 32:
W273–W279
Goff SA, Vaughn M, McKay S, Lyons E, Stapleton AE, Gessler D,
Matasci N, Wang L, Hanlon M, Lenards A, et al (2011) The iPlant
collaborative: cyberinfrastructure for plant biology. Front Plant Sci 2: 34
Guy L, Kultima JR, Andersson SG (2010) genoPlotR: comparative gene
and genome visualization in R. Bioinformatics 26: 2334–2335
Hu TT, Pattyn P, Bakker EG, Cao J, Cheng JF, Clark RM, Fahlgren N,
Fawcett JA, Grimwood J, Gundlach H, et al (2011) The Arabidopsis
lyrata genome sequence and the basis of rapid genome size change. Nat
Genet 43: 476–481
Li L, Stoeckert CJ Jr, Roos DS (2003) OrthoMCL: identification of ortholog
groups for eukaryotic genomes. Genome Res 13: 2178–2189
Lysak MA, Berr A, Pecinka A, Schmidt R, McBreen K, Schubert I (2006)
Mechanisms of chromosome number reduction in Arabidopsis thaliana
and related Brassicaceae species. Proc Natl Acad Sci USA 103: 5224–5229
Ma C, Chen H, Xin M, Yang R, Wang X (2012) KGBassembler: a karyotype-based genome assembler for Brassicaceae species. Bioinformatics 28:
3141–3143
Sullivan MJ, Petty NK, Beatson SA (2011) Easyfig: a genome comparison
visualizer. Bioinformatics 27: 1009–1010
Tang H, Lyons E (2012) Unleashing the genome of Brassica rapa. Front
Plant Sci 3: 172
Van Bel M, Proost S, Wischnitzki E, Movahedi S, Scheerlinck C, Van de
Peer Y, Vandepoele K (2012) Dissecting plant genomes with the PLAZA
comparative genomics platform. Plant Physiol 158: 590–600
Wang X, Wang H, Wang J, Sun R, Wu J, Liu S, Bai Y, Mun JH, Bancroft I,
Cheng F, et al (2011) The genome of the mesopolyploid crop species
Brassica rapa. Nat Genet 43: 1035–1039
Wright SI, Agren JA (2011) Sizing up Arabidopsis genome evolution.
Heredity (Edinb) 107: 509–510
Wu HJ, Zhang ZH, Wang JY, Oh DH, Dassanayake M, Liu BH, Huang QF,
Sun HX, Xia R, Wu YR, et al (2012) Insights into salt tolerance from the
genome of Thellungiella salsuginea. Proc Natl Acad Sci USA 109: 12219–12224
Yang R, Jarvis DE, Chen H, Beilstein M, Grimwood J, Jenkins J, Shu S,
Prochnik S, Xin M, Ma C, et al (2013) The reference genome of the
halophytic plant Eutrema salsugineum. Front Plant Sci 4: 46