Conserved RNA secondary structures in viral genomes: a survey

Vol. 20 no. 10 2004, pages 1495–1499
doi:10.1093/bioinformatics/bth108
BIOINFORMATICS
Conserved RNA secondary structures in
viral genomes: a survey
I.L. Hofacker1, ∗, P.F. Stadler1,2 and R.R. Stocsits1
1 Institut
für Theoretische Chemie und Molekulare Strukturbiologie, Universität Wien,
Währingerstraße 17, Vienna, A-1090, Austria and 2 Bioinformatik, Institut für Informatik,
Universität Leipzig, Leipzig, D-04103, Germany
Received on October 12, 2003; accepted on February 3, 2004
ABSTRACT
Summary: The genomes of RNA viruses often carry conserved RNA structures that perform vital functions during the
life cycle of the virus. Such structures can be detected using a
combination of structure prediction and co-variation analysis.
Here we present results from pilot studies on a variety of viral
families performed during bioinformatics computer lab courses
in past years.
Contact: [email protected]
INTRODUCTION
In addition to coding for proteins, the genomes of RNA viruses
often contain functionally active RNA structures that are vital
during the various stages of the viral life cycle. In many cases
only a small part of the viral genome is functionally relevant
at the level of RNA. Detecting these motifs is a difficult and
tedious task since the secondary structures of such regions
in general do not look significantly different from structures
formed by random sequences.
RNA secondary structures are however very fragile to accumulation of point mutations: computer simulations (Schuster
et al., 1994) showed that a small number of point mutations is
very likely to cause large changes in the secondary structures:
mutations in some 10% of the sequence positions already lead
almost certainly to unrelated structures if the mutated positions are chosen randomly. Secondary structure elements that
are consistently present in a group of sequences with less than,
say 95%, average pairwise identity are therefore most likely
the result of stabilizing selection, not a consequence of the
high degree of sequence conservation.
This fact is exploited by the alidot algorithm for a systematic search of conserved secondary structure patterns in
large RNA. In brief, independent predictions of the secondary structure for each of the sequences and a multiple
sequence alignment that is obtained without any reference
to the predicted secondary structures are combined to a
list of homologous base pairs. This list is then sorted
∗ To
whom correspondence should be addressed.
Bioinformatics 20(10) © Oxford University Press 2004; all rights reserved.
by means of hierarchical credibility criteria that explicitly
take both thermodynamic information and information on
sequence covariation into account. A detailed description
of the method can be found in Hofacker et al. (1998) and
Hofacker and Stadler (1999), and the programs are available
from http://www.tbi.univie.ac.at/RNA/ as part of the Vienna
RNA Package.
A comprehensive analysis of the genomic secondary features is available for Picornaviridae (Witwer et al., 2001),
Flaviviridae (Thurner et al., 2004), and Hepadnaviridae
(Stocsits et al., 1999; Kidd-Ljunggren et al., 2000). In this
contribution we report on preliminary surveys of a variety of
virus families.
LEVIVIRIDAE
Leviviridae are small ssRNA phages without an envelope and
tail. The replication cycle includes no DNA stage. The total
genome length is 3466–4276 nt depending on the type of
strain. Most Levivirus species have four (partly) overlapping
genes; some exceptions exist that contain only three open
reading frames. Currently, two structural virion proteins have
been found and identified.
We have investigated eight sequences of the Levivirus
genus. Currently, there are five types of strains represented in
the GenBank database as complete genomes: the Enterobacteria phages MS2, KU1, GA, fr and AP205. The last one was
eliminated from the data set analyzed here because of large
changes in genomic structure.
Figure 1 gives an overview of potentially conserved structures in Leviviridae, in the form of a mountain plot. In the
mountain plot, each base pair (i, j ) is represented as a slab
ranging from i to j with a height corresponding to the mean
probability of the pair. In the gray-scale version shown here,
shades of gray, from light to dark, indicate an increasing
number of consistent mutations. The base pairs marked by
sparse and dense white stripes have one or two non-compatible
sequences, respectively. Detailed results for each of these
regions can be found online at http://rna.tbi.univie.ac.at/virus.
Of particular interest is the 5 terminal end, where we detect
1495
I.L.Hofacker et al.
GG
G
C
C
U
A
G
CC G
C G
C G
C G
C G
U C
U
AB
C
Adsorption A2
DE
F
G
Coat
H
I
Polymerase
Fig. 1. The Hogeweg mountain plot on the left gives an overview of the potentially conserved secondary structures of the complete RNA
genome of Levivirus. The underlying alignment is obtained using code2aln (Stocsits, 2003). In the genome chart below credible RNA
structures and positions of the open reading frames are indicated. The position numbers of these elements (compiled in the small table on
the right) refer to the alignment available at http://rna.tbi.univie.ac.at/virus/. The 5 -end structure A is homologous to the 5 -terminal RNA
replicase recognition site of the allolevivirus Qβ where it is required for the recognition of the template and its amplification. Element F
could correspond to the coat protein binding site that is located just upstream of the polymerase gene in Qβ .
a short GC-rich hairpin (tetraloop) that follows an unpaired
GGG element (see upper right of Fig. 1). This element is most
likely the analogon to the stem-loop-structure that is recognized by the Qβ replicase enzyme and is essential for viral
replication (Biebricher and Luce, 1993).
NEGATIVE-STRANDED RNA VIRUSES
The latest version of Virus Taxonomy (Büchen-Osmond,
2003) classifies the families Paramyxoviridae, Rhabdoviridae,
Filoviridae and Bornaviridae as members of the order
Mononegavirales. The families Orthomyxoviridae, Bunyaviridae and Arenaviridae remain unassigned.
Paramyxoviridae have genome sizes of about 15–16 kb.
The only unambiguous RNA structure is a panhandle structure that is highly conserved. In addition, there are a few
weak signals for specific small RNA features in the genera
Respirovirus and Morbillivirus that deserve a more detailed
analysis.
Rhabdoviridae have a genome size of about 12 kb. No unambiguous signals have been detected. In the genus Rhabdovirus
1496
conserved gene-end and -start sequences have been reported (Wertz et al., 1994) that might be associated also with
conserved structural features.
Filoviridae. Complete sequences are available only for
Reston and Zaire Ebola and for the Marburg virus. These
three sequences are insufficient for the alidot approach.
The program qrna (Rivas and Eddy, 2001) was therefore
used to identify possible conserved RNA features in a pairwise
alignment. An example is shown in Figure 2.
Bunyaviridae exhibit panhandle structures that are confirmed by the existence of compensatory mutations. Neither
the genus Hantavirus nor the genus Bunyavirus (which has
a genome consisting of three distinct chromosomes) exhibit
convincing additional RNA secondary structure motifs in
either the (+) or the (−)-strand although there are some
small hairpins containing a few consistent and compensatory
mutations reported in Fekete (2000).
Orthomyxoviridae have genomes that contain a moderate
number of small segments (eight in the case of the influenza virus). As for most negative-stranded RNA viruses our
computations confirm the well-known panhandle structures
(Cheong et al., 2003). A hairpin loop structure at both the 5
Conserved RNA structures in viral genomes
G
U C UC
AG AG
A CC A A A CG
AA
U A C
C
UU A A CUC
C
A
A C
A A A UU A A C
C U C C C A A GA
U
U
G
AGG_ A A U A A A C
C C
GA A AGAG
A A A A C A U
GA AGA CU A
U
U
A C A UUUU A A C
A CGGGGG
CUUU A A C
UGUG
A
C
A
A
U
A
A
AG
A A U A AUACA A G G G
A
A
U A A CUCC
A
U
A
A
A A A
U
C
U
U
A
AA
AA
C A CA
A
Fig. 2. Common RNA structure of Reston Ebola and Marburg virus in the region from 13 100 to 13 270 of their clustalw alignment. The
signal was detected by qrna and then re-analyzed using alidot. Circles mark sequence variation supporting the structure.
and 3 ends (which competes with the panhandle and hence
does not show in our data) is required for efficient endonuclease activity of influenza virus RNA polymerase, an activity
that is required for the cap-snatching activity of primers from
host pre-mRNA (Leahy et al., 2001).
Arenaviridae also exhibit panhandle structures that have
been repeatedly reported in the literature. Conserved intergenic regions containing poly-U stretches are reminiscent of
Rhabdoviridae. We did not find evidence that these sequences
form conserved secondary structures.
Panhandle structures appear to be the only common RNA
secondary structure features of negative stranded RNA viruses, despite at least weak indications of additional conserved
features in some groups.
POSITIVE-STRANDED RNA VIRUSES
The survey of positive-stranded RNA viruses is far from complete at this point. Apart from the families Picornaviridae
(Witwer et al., 2001) and Flaviviridae (Thurner et al., 2004),
which have been studied in detail, preliminary data are
available for three additional unrelated groups.
Dicistroviridae are an only recently described family of
arthropod viruses. The genome contains a 5 untranslated
region of 500–800 nt followed by two ORFs of around 5500 nt
and 2600 nt separated by an untranslated intergenic region
(IGR) of approximately 190 nt. This di-cistronic structure
of the genome is special among RNA viruses. The IGR
between the replicase and the four capsid proteins contains
an IRES region while the 5 end contains another IRES motif
despite the fact that the translation is fMet-initiated. Preliminary data indicate that the IGR–IRES is structurally well
conserved.
Comoviridae. A number of conserved elements with a larger number of compensatory mutations can be found within
the three genera Comovirus, Fabavirus and Nepovirus. A
hairpin structure with a well-conserved stem and a relatively
large A/U-rich loop region (pos. 1257–1271 of the Nepovirus
alignment) is reminiscent of the cis-acting replication element
(CRE) of Picornaviridae. The close similarity of the life cycles
of Picornaviridae and Comoviridae lends credibility to this
conjecture.
Potyviridae have genomes of almost 10 000 nt consisting of
a single ORF, coding for a polyprotein, and short UTRs at the
5 and 3 ends. Many complete sequences are available since
the family includes important plant pathogens. Because of
high sequence diversity, it is difficult to get good alignments;
nevertheless, a number of interesting features were found.
As an example Figure 3 shows the conserved structure in the
3 UTR of Potyvirus.
RETROVIRIDAE
Retroviridae, in contrast, exhibit a large number of functional
RNA secondary structures. In the case of the lentivirus HIV,
some of them are very well characterized both experimentally
and computationally, such as the TAR hairpin at the 5 end,
the gag/pol frameshift hairpin (also present in Mammalian
Type B Retrovirus) and the RRE structure.
Our pilot study yields indications of functional RNA structures throughout the family. However, there seems to be little
similarity in the motifs used by different genera. A comparison between different genera is difficult because the low
sequence homology does not permit reliable alignments. A
much more detailed analysis will be necessary to determine whether some secondary motifs are shared between all
Retroviridae.
DISCUSSION
By combining thermodynamic structure prediction with
co-variation analysis, conserved RNA secondary structures
in viral RNA genomes can be detected reliably. The analysis
works best if at least five sequences with pairwise identities of around 80% are available. For many types of viruses
such sequence data are now available, and we have begun
a systematic survey of viral RNA structures, the results of
which are made available in an online database at http://
rna.tbi.univie.ac.at/virus/
All the RNA viruses studied so far contain at least some
functional RNA structures. These conserved structures occur
particularly often in the 5 and 3 UTRs, but even within coding
regions conserved RNA structures are not rare.
To our surprise, the bibliographic search for experimental
evidence of and further information on ‘RNA secondary
structures’ in a given group of virus—a seemingly rather
straightforward task—turned out to be more tedious than
the work on the actual sequence and structure data. We
have therefore developed the litsift tool (Faulstich et al.,
2003) to make a bibliographic search in the context of
the Atlas of Viral Secondary Structures more effective. To
1497
I.L.Hofacker et al.
UU
G
A
G
A C G
A U
A A
A
U G
G U
A G
U U
G UA
G A
C
A
C
G
U AC
U
A
A
A UA
A CC
C
G AG
A
G C
G UU
U U
C
G A
A
C GA
A
A
U C
U
G GU
U
U
G U
A U GU
G UG
A A
A C
G
A A
A C
U U
U AA
A
U CA
C
CG UG G
A
C G C
U U CGU
U
A A
C
U U A C C G A A G A A AG A A
A A
U U A AUU G
G
GGC
G
U
GC U
A
A
GG
G
U G A CG
U
A A A
C G U AA
G
A A C U G C C CC
GU A A
U AA U _ A _
C
AAUC
G
UA A
G
G
C
G
U
_
A
G
G
A
UA A
U
G U U _
GU
GU
G
C G A
U UU U A C A C G G
A
A
CU CAC UG
U
_
U
U
C
A
U
C
C
U
A
G
C
G
U
A
U
A
A
A AG
G A U G UG
U
U A A GA GAG
A G
U
G
C
U
C A
G A AA AC
U
U AA C
U
U
A
CC
C
G GU
A C
C
U
_ UC GA
U
G GG
G
C
A
A C GU U U
G U GC AG
A
A A G GU G
A
A
U
ACA
C
CC UA UC UA
G
G
AU
A
A
AGA
A
A
C U
A
UA
GC
CG
CG
UA
GA U
A
CG
U
G
G
G CA AC U
U G A C AU A
G
UA GU C G
C U A
UG AC
G
A U
GC
GG
U A G C U CG U
GC
A AG
G G A U
U GU
A
AC U A AU G U
G
A C A C
A
A C
U
GC
A
A GC
G
U
AU
C
AA U
A
A CG
A
A
U
CUG
U
UA C
GC
CG
GC
AA
GC
AU
A UA
C
UG
G C A UU A
A U U UG C
G C A AA U
UG U CU A
G
U
UG
U G
UA
CG
U U
A
U
G
G
Fig. 3. Conserved secondary structures in the 3 end of Potyvirus. The region spans from approximately 150 nt before to 400 nt after the stop
codon. Potyvirus sequences are too divers to obtain a reliable alignment of all sequences. Here we used five sequences (NC_001517 Pepper
Mottle Virus, NC_001616 Potato Virus Y, NC_004010 Potato Virus V, NC_004426 Wild Potato Mosaic Virus, NC_004573 Peru Tomato
Mosaic Virus) aligned using clustalw.
this end, we use classifiers trained on sample corpora in
a system that filters and ranks search results from bibliographic databases such as Pubmed. Preliminary results
indicate that a classifier trained on one virus group can indeed
be applied successfully to search the literature on other virus
groups.
ACKNOWLEDGEMENTS
This work is supported by the Austrian Fonds zur Förderung
der Wissenschaftlichen Forschung, Project Nos. P-13545MAT and P-15893, and the German DFG Bioinformatics Initiative. Pilot studies on individual virus groups were conducted
as part of bioinformatics computer lab courses in Vienna
and Leipzig—Winter 2000: Leviviridae (Erwin Gaubitzer,
Philipp Hohensinner); Winter 2001: Retroviridae (Peter
Andorfer, Veronika Bayer, Alexander Biedermann, Matthias
Binder, Claudia Blaukopf, Ulrich Braunschweig, Claudia
Fried, Stefan Fringer, Sebastian Glatt, David Grote, Michaela
Gruber, Jakob Haglmüller, Leonhard Heinz, Astrid Hrdina,
Alexander Kassler, Vedhu Krystufek, Christian Lahsnig,
Beate Lichtenberger, Christian Martschitsch, Georg Mitterer,
Michael Müller, Gregor Obernosterer, Florian Pauler, Paul
Perco, Thomas Perlot, Sonja Prohaska, Markus Reschke,
Miriam Satler, Manfred Schifrer, Yvonne Schindlegger,
Ulrike Seifert, Sabine Stampfl, Stefanie Strobach, Andrea
Tanzer, Thomas Taylor, Siegfried Ussar, Stefan Washietl,
1498
Wolfgang Winkler); Winter 2002: Mononegavirales (Philipp
Adaktylos, Katja Balazs, Florian Baumgart, Stefan Bruckner,
Christoph Buchmann, Valerie Diederichs, Karin Ebner,
Jörg Ettenauer, Wolfgang Fischl, Andrea Förster, Andreas
Gruber, Susanne Haider, Philipp Hainzl, Jennifer Hetzl,
Hans-Peter Kantner, Martina Knapp, Stefan Kraus, Roman
Ferdinand Kreindl, Dominik Muggenhumer, Murat Özcelik,
Jürgen Pfatschbacher, Petra Pokorny, Josef Ruppert, Johann
Schmidt, Simone Schopper, Andreas Verhounig, Michael
Wittinger) Spring 2003: Dicistroviridae (Christine Körner,
Torsten Glomb); Potyviridae (Franziska Heinze, Sonja
Karam, Jörg Lehmann, Markus Rohrschneider); Comoviridae
(Jan Buck, Alexander Groß, Jana Hertel, Manuela
Lindemeyer, Peter Menzel, Alexander Schubert, Stefan
Seemann).
REFERENCES
Biebricher,C.K. and Luce,R. (1993) Sequence analysis of RNA
species synthesized by Qbeta replicase without template.
Biochemistry, 32, 4848–4854.
Büchen-Osmond,C. (2003) The universal virus database ICTVdB.
Comput. Sci. Eng., 5, 16–25.
Cheong,H.-K., Cheong,C., Lee,Y.-S., Seong,B.L. and Choi,B.-S.
(2003) Structure of influenza virus panhandle RNA studied by
NMR spectroscopy and molecular modeling. Nucleic Acids Res.,
27, 1392–1397.
Conserved RNA structures in viral genomes
Faulstich,L.C., Stadler,P.F., Thurner,C. and Witwer,C. (2003)
litsift: Automated text categorization in bibliographic
search. In Scheffer,T. and Leser,U. (eds), Data Mining and
Text Mining for Bioinformatics, (ECML/PKDD 2003), Berlin,
pp. 20–25.
Fekete,M. (2000) Scanning RNA virus genomes for functional
secondary structures. Ph.D. thesis, University of Vienna.
Hofacker,I.L. and Stadler,P.F. (1999) Automatic detection of conserved base pairing patterns in RNA virus genomes. Comp. Chem.,
23, 401–414.
Hofacker,I.L., Fekete,M., Flamm,C., Huynen,M.A., Rauscher,S.,
Stolorz,P.E. and Stadler,P.F. (1998) Automatic detection of conserved RNA structure elements in complete RNA virus genomes.
Nucleic Acids Res., 26, 3825–3836.
Kidd-Ljunggren,K., Zuker,M., Hofacker,I.L. and Kidd,A.H. (2000)
The hepatitis B virus pregenome: prediction of RNA structure
and implications for the emergence of deletions. Intervirology,
43, 154–164.
Leahy,M.B., Dobbyn,H.C. and Brownlee,G.G. (2001). Hairpin loop
structure in the 3 arm of the influenza A virus virion RNA
promoter is required for endonuclease activity. J. Virol., 75,
7042–7049.
Rivas,E. and Eddy,S.R. (2001). Noncoding RNA gene detection
using comparative sequence analysis. BMC Bioinformatics, 2,
1–19.
Schuster,P., Fontana,W., Stadler,P.F. and Hofacker,I.L. (1994) From
sequences to shapes and back: A case study in RNA secondary
structures. Proc. R. Soc. London B, 255, 279–284.
Stocsits,R., Hofacker,I.L. and Stadler,P.F. (1999) Conserved secondary structures in hepatitis B virus RNA. In Computer Science
in Biology. University of Bielefeld, Bielefeld, D, pp. 73–79.
Proceedings of the GCB’99, Hannover, D.
Stocsits,R. (2003) Nucleic acid sequence alignments of partly coding
regions. Ph.D. thesis, University of Vienna.
Thurner,C., Witwer,C., Hofacker,I. and Stadler,P.F. (2004) Conserved RNA secondary structures in Flaviviridae genomes. J. Gen.
Virol., submitted.
Wertz,G., Whelan,S., LeGrone,A. and Ball,L.A. (1994) Extent of
terminal complementarity modulates the balance between transcription and replication of vesicular stomatitis virus RNA. Proc.
Natl Acad. Sci., USA, 91, 8587–8591.
Witwer,C., Rauscher,S., Hofacker,I.L. and Stadler,P.F. (2001) Conserved RNA secondary structures in picornaviridae genomes.
Nucleic Acids Res., 29, 5079–5089.
1499