Detection of polyadenylation signals in human DNA sequences

Gene 231 (1999) 77–86
Jack E. Tabaska*, Michael Q. Zhang
Cold Spring Harbor Laboratory, Cold Spring Harbor, NY 11724, USA
Received 27 October 1998; received in revised form 15 February 1999; accepted 16 February 1999; Received by E.Y. Chen
Abstract
We present polyadq, a program for detection of human polyadenylation signals. To avoid training on possibly flawed data,
the development of polyadq began with a de novo characterization of human mRNA 3′ processing signals. This information was
used in training two quadratic discriminant functions that polyadq uses to evaluate potential polyA signals. In our tests, polyadq
predicts polyA signals with a correlation coefficient of 0.413 on whole genes and 0.512 in the last two exons of genes, substantially
outperforming other published programs on the same data set. polyadq is also the only program that is able to consistently detect
the ATTAAA variant of the polyA signal. © 1999 Elsevier Science B.V. All rights reserved.
Keywords: Bioinformatics; Downstream element; Quadratic discriminant analysis
1. Introduction
Although improvements in computer gene-finding
programs have made it relatively easy to detect internal
protein-coding exons in genomic sequences, it remains
difficult to find the terminal exons of genes (Claverie,
1997). There are several reasons for this, but perhaps
the most significant is the fact that these programs rely
heavily on measures of codon bias (Fickett and Tung,
1992; Fickett, 1996). Terminal exons, however, consist
largely of the non-coding untranslated regions (UTRs)
of mRNAs (Zhang, 1998), sequences that are apparently
constrained only insofar as they may contain signals
that regulate the mRNA’s translation or stability (e.g.
Pesole et al., 1994, 1997). One may therefore expect
that accurate terminal exon recognition would depend
primarily upon detection of such signals, rather than
upon bulk sequence characteristics.
Being information-rich and nearly omnipresent
downstream of the coding regions of human genes, the
sequences that direct mRNA cleavage and polyadenylation
are a natural choice to serve as the basis for a 3′
terminal exon prediction method. The core of the 3′
processing signal (Fig. 1) consists of two sequence
elements that bracket the cleavage/polyadenylation
(polyA) site (for a recent review, see Colgan and Manley,
1997). The familiar hexamer AAUAAA (and the
common variant AUUAAA) comprises the upstream
portion of the signal, known as the polyA signal (PAS).
The downstream element (DE) consists of a much less
well-characterized U- or GU-rich sequence. The PAS
and DE are bound, respectively, by cleavage and
polyadenylation specificity factor (CPSF) and cleavage
stimulation factor (CStF). These two factors bind the mRNA
cooperatively, so a strong DE can compensate for a
weak PAS, and vice versa (Wahle, 1995; Colgan and
Manley, 1997). Cleavage preferentially occurs immediately
downstream of a CA dinucleotide, apparently as
directed by CStF (Wahle, 1995; Colgan and Manley,
1997). The efficiency of 3′ end processing may be
modulated by various cis- and trans-acting factors
involving sequences that may occur either upstream or
downstream of the polyA site (Keller, 1995; Wahle,
1995; Colgan and Manley, 1997). A single gene may
have multiple polyA sites, the choice of which may
affect the gene’s post-transcriptional regulation
(Edwalds-Gilbert and Milcarek, 1995; Takagaki et al.,
1996; Lou et al., 1998).
Fig. 1. mRNA 3′ end processing site and associated signals. Distances
are as described by Colgan and Manley (1997).
Abbreviations: Ad2, adenovirus type 2; CC, correlation coefficient;
CDS, coding sequence; CPSF, cleavage and polyadenylation specificity
factor; CStF, cleavage stimulation factor; DE, downstream element;
EST, expressed sequence tag; FN, false negative; FP, false positive;
LDF, linear discriminant function; PAS, polyadenylation signal; QDF,
quadratic discriminant function; RBD, RNA binding domain; SN,
sensitivity; SP, specificity; TN, true negative; TP, true positive; UTR,
untranslated region.
* Corresponding author. Fax: +1-516-367-8461.
E-mail address: [email protected] (J.E. Tabaska)
0378-1119/99/$ – see front matter © 1999 Elsevier Science B.V. All rights reserved.
PII: S0378-1119(99)00104-3
Since the PAS hexamer may occur quite frequently
in genomic sequences (approximately once every 4096
bases), accurate recognition of mRNA 3′ ends depends
critically upon reliable identification of DEs. Some effort
has therefore been made to better understand these
elements. The DE is generally found approximately 50
bases 3′ of the PAS (Chou et al., 1994; Chen et al.,
1995), but may occur much farther downstream if the
secondary structure of the mRNA can bring the two
elements into close spatial proximity (Ahmed et al.,
1991; Brown et al., 1991). It has also been observed
that a single polyA site can have several DEs associated
with it (Wahle, 1995). Early sequence analysis by
McLauchlan et al. (1985) suggested a consensus of
YGUGUUYY for the DE. Experiments by Chou et al.
(1994) and Chen et al. (1995) suggest that a pentamer
containing at least four U residues is required to exist
between 10 and 30 bases downstream of an mRNA
cleavage site for efficient processing to occur. More
recently, SELEX experiments with the RNA binding
domain (RBD) of the 64 kDa subunit of CStF resulted
in binding sites consisting of a short (two to four
residues) G/U-rich segment upstream of a variable-length run of C residues, followed by another region
that is G/U-rich, but rarely contains GG dinucleotides
(Takagaki and Manley, 1997). A similar experiment
using the entire 64 kDa subunit suggested a bipartite
structure for the DE, consisting of a short segment
similar to the McLauchlan consensus followed by a
longer pyrimidine-rich segment (Beyer et al., 1997).
Several attempts have been made to detect 3′ processing sites in DNA sequences in silico. Yada et al. (1994)
have used class II quantification theory to generate a
weight matrix representing the region from −80 to +48
with respect to the PAS (where +1 is the 5′-most base
of the PAS). Kondrakhin et al. (1994) have constructed
a generalized consensus matrix of positional triplet
counts in a 68 base region surrounding the polyA sites
of 60 mRNAs. The gene prediction program GRAIL II
(Matis et al., 1996) also contains a weight matrix
covering the −6 to +65 region around the PAS. In
contrast to these solely matrix-based approaches,
Salamov and Solovyev (1997) have recently described a
method based on a linear discriminant function (LDF)
of eight variables characterizing the −100 to +100
region around the PAS.
As part of our efforts to improve the detection of the
3′ terminal exons of genes, we have developed a program,
polyadq, which finds polyA signals in human DNA
sequences using a pair of quadratic discriminants in
three variables. As we will show here, polyadq outperforms the PAS detection methods described previously,
and is the first that can detect significant numbers of
ATTAAA-type signals. Along the way, we have created
a database of known active polyA sites and have further
refined the characteristics of the PAS, the DE, and the
relationship between the two.
2. Materials and methods
2.1. PolyA site database
Expressed sequence tags (ESTs) with a polyA tail
(which we defined as a run of 30 or more Ts at the
beginning of a 3′ EST, or 30 or more As at the end of
a 5′ EST) were taken from build 24 of the human
UniGene database (NCBI News, August 1996; database
dated 2/20/98). ESTs belonging to clusters labeled with
repeat element warnings were excluded. When a cluster
was represented by more than one EST, those ESTs
were aligned using FASTA (Pearson and Lipman, 1988),
and a minimal set covering all unique polyA sites
determined. The polyA site of each of these ESTs was
defined as the last non-A or first non-T base in the
sequence, as applicable.
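The tail and site definitions above can be illustrated with a minimal sketch. The function name, the `read_end` labels, and the choice to return a 0-based index are assumptions made for illustration; they are not taken from the original pipeline.

```python
import re

MIN_TAIL = 30  # minimum run length that counts as a polyA tail (from the text)


def polya_site_index(est, read_end):
    """Return the 0-based index of the polyA site base, or None if no tail.

    read_end is "3prime" (tail appears as leading Ts) or "5prime"
    (tail appears as trailing As); both labels are hypothetical.
    """
    seq = est.upper()
    if read_end == "3prime":
        run = re.match(r"T+", seq)
        if run and run.end() >= MIN_TAIL and run.end() < len(seq):
            return run.end()  # first non-T base
    elif read_end == "5prime":
        run = re.search(r"A+$", seq)
        if run and len(run.group()) >= MIN_TAIL and run.start() > 0:
            return run.start() - 1  # last non-A base
    return None
```

Note that a transcript whose final genomic base is itself an A cannot be distinguished from the tail by this rule, which is one reason a site may be placed a few bases away from its true position.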
Using FASTA, each EST was aligned with all of the
DNA and non-EST mRNA sequences in that EST’s
UniGene cluster. If the alignment showed >90% identity
for more than 50 bases upstream of the EST’s polyA
site, the relative position of the polyA site within the
DNA or mRNA was noted, and that sequence/site pair
added to our database. Likely cases of internal priming
were discarded. If there was more than one good match
to a particular EST, at most one of each sequence type
(DNA or mRNA) was placed in the database, generally
the best scoring match of the type. Occasionally, an
mRNA sequence was polyadenylated at a site slightly
different (usually within ±5 bases) from its matching
EST. In these cases, the polyA site of the mRNA
sequence was used in the database.
During the compilation of these sequences, we
observed a recurring type of chimerism that attached
the 3′ terminal 22 bases of the mitochondrial ATP6 gene
mRNA plus its polyA tail to specific sites within nuclear
mRNA sequences. This produced many (at least 30)
apparent polyadenylated ESTs with no PAS. Analysis
of the cases where it happened to ESTs from known
genes ruled out the possibility that we were observing
alternative splicing or a new mechanism for nuclear
mRNA 3′ end processing. Since we can have no knowledge of how often this kind of artifact might occur
involving as yet undiscovered genes, we decided to
discard any EST with no good DNA or mRNA match.
To the DNA database were added sequences from
GenBank release 105 (Benson et al., 1998; database
dated 2/98) containing a ‘polyA_signal’ or ‘polyA_site’
feature tag carrying an ‘evidence=experimental’ label.
Such sequences were added only if they were tagged
with a reference that could be used to verify the position
of the annotated signal or site. The resulting database
of known polyA sites contains 280 mRNA sequences
and 136 DNA sequences. The DNA sequences contain
144 polyA sites.
To obtain a negative training set for polyadq, a region
encompassing the coding sequence (including introns)
of each gene in the polyA database was scanned for
occurrences of AATAAA and ATTAAA. We purposely
excluded the 3′ UTRs and intergenic regions of these
sequences from this scan, since these areas may contain
active PASes whose use has not yet been observed. The
negative training set contains 462 pseudosignals.
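A scan of this kind can be sketched in a few lines. The function name and the 0-based, half-open region bounds are illustrative assumptions; the lookahead pattern is used so that overlapping hexamers (as in AATAAATAAA) are all reported.

```python
import re

# Zero-width lookahead so that overlapping hexamers are each found.
PAS_PATTERN = re.compile(r"(?=(AATAAA|ATTAAA))")


def find_hexamers(seq, start=0, end=None):
    """Return (position, hexamer) pairs lying wholly within seq[start:end]."""
    seq = seq.upper()
    end = len(seq) if end is None else end
    hits = []
    for match in PAS_PATTERN.finditer(seq, start, end):
        if match.start() + 6 <= end:  # hexamer must fit inside the region
            hits.append((match.start(), match.group(1)))
    return hits
```

Restricting `start` and `end` to the span from the first to the last coding base would give candidate pseudosignals in the sense used above; reporting only the AATAAA hits over a whole sequence corresponds to the Simple baseline described in Section 2.4.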
All sequence data are available from the authors
upon request.
2.2. Weight matrices
PAS and DE weight matrices were generated using
the program gibbs-seq, which implements a templated
exponential perceptron as described by Heumann et al.
(1994). When presented with a set of unaligned sequence
fragments known to bind to a protein, gibbs-seq trains
a neural network to find sites of a given length that can
best distinguish the input fragments from a background
‘genome’. For this work, we used a genome consisting
of equiprobable random bases, which makes the network
training procedure equivalent to finding an alignment
that maximizes information content (Heumann et al.,
1994). The output of gibbs-seq is a list of sites recognized
in the input fragments and the perceptron’s weight
matrix. The latter can be used to search for sites in new
sequences.
Training proceeded as follows. For a given set of
sequence fragments, training runs were made assuming
various binding site sizes. For each size, gibbs-seq was
run 25 times starting from random seed weights. The
25 resulting weight matrices were then collected and
aligned. This alignment was performed by sliding each
matrix past a basis matrix (usually the matrix from the
first run) and looking for the maximum dot product of
the overlapping regions. In doing so, it was found that
the matrices aligned perfectly up to some critical length
L_c, beyond which the alignments showed a conserved
core of L_c bases flanked by ragged ends. It was reasoned
that the L_c core bases were the most important for
recognition of the input fragments, as apparently any
bases in excess of L_c could be drawn from anywhere
outside the core without substantially improving
recognition. The ‘best’ binding site size to use was
therefore fixed at L_c. We then chose from the 25 matrices
of size L_c the one giving the highest average score on
the training fragments.
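The sliding alignment of weight matrices described above can be sketched as follows. Matrices are represented here as lists of per-position weight tuples (A, C, G, T); the function names and the shift convention are assumptions for illustration only.

```python
def overlap_dot(basis, other, shift):
    """Dot product over the positions that overlap when other[0] sits at basis[shift]."""
    total = 0.0
    for j, row in enumerate(other):
        i = j + shift
        if 0 <= i < len(basis):
            total += sum(a * b for a, b in zip(basis[i], row))
    return total


def best_shift(basis, other):
    """Return the shift of `other` against `basis` with the maximum overlap dot product."""
    shifts = range(-(len(other) - 1), len(basis))
    return max(shifts, key=lambda s: overlap_dot(basis, other, s))
```

Applying `best_shift` to each of the 25 matrices against the basis matrix and then inspecting which columns coincide across runs gives the kind of conserved-core picture from which L_c was judged.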
2.2.1. PolyA signal matrix
gibbs-seq was trained on 50-base fragments immediately upstream of the polyA sites of the 280 mRNA
sequences in our database. Matrix sets were acquired
for site sizes of 6, 8, 10, and 12 bases. The best site size
was judged by alignment of the matrices to be 6 bases.
The best PAS matrix we found is shown in Table 1.
2.2.2. Downstream element matrix
The gibbs-seq DE training set consisted of 106 fragments of 100 bases. These were taken from the regions
immediately downstream of the polyA sites in the DNA
sequences in the polyA database. Sequences with less
than 100 bases downstream of the polyA site or with
ambiguous base symbols in this region were excluded.
For the DE training runs, we set gibbs-seq to search for
two sites per fragment. This was done chiefly because
of the bipartite structure of the DE described by Beyer
et al. (1997), but also because manual inspection of the
downstream flanking sequences of the genes in our
database revealed that many (approximately 50%) of
these contain multiple T-rich segments. Furthermore,
the DEs of several other genes apparently consist simply
of one very long (>20 bases) run of T and G residues
in their downstream regions. In two-site mode, gibbs-seq could represent these large DEs as two abutting
short DEs. The assumption of two sites per fragment
therefore effectively increased the size of our training
set by some 60%, thereby giving better (i.e. less degenerate and better conserved across training trials) matrices
than could be obtained by training on a single site
per sequence. gibbs-seq was trained for binding site sizes
of 6, 8, 10, 12, and 14 bases. Through alignment of
these matrices, the best site size was decided to be 10
bases. The DE matrix used in polyadq is shown in
Table 2. Since the gibbs-seq program in two-site mode
searches for multiple repetitions of a motif in a given
sequence, rather than two distinct motifs, this matrix
does not necessarily represent a ‘distal’ or ‘proximal’
DE, but essentially an average of the two.
Table 1
Weight matrix for polyA signals
         1        2        3        4        5        6
A    1.806    1.449   −0.569    1.836    1.774    1.839
C   −0.644   −0.756   −0.632   −0.616   −0.550   −0.634
G   −0.581   −0.617   −0.579   −0.637   −0.674   −0.634
T   −0.581   −0.076    1.780   −0.583   −0.550   −0.571
Table 2
Weight matrix for downstream elements
         1        2        3        4        5        6        7        8        9       10
A   −0.426   −0.557   −0.363   −0.480   −0.372   −0.666   −0.559   −0.743   −0.471   −0.666
C   −0.396   −0.323    0.108    1.377   −0.551    0.062   −0.555   −0.437   −0.646   −0.510
G   −0.080    0.093   −0.377   −0.589   −0.662   −0.038   −0.469    0.417    0.148    0.543
T    0.903    0.788    0.632   −0.307    1.585    0.642    1.584    0.764    0.969    0.633
2.3. polyadq
polyadq finds polyA signals in DNA sequences using
two quadratic discriminant functions (QDFs; the ‘dq’
in polyadq stands for ‘double QDF’). The QDFs are
both functions of the three variables described below,
but one is specific for AATAAA-type PASes, and the
other for ATTAAA-type PASes.
When presented with a sequence, polyadq scans the
sequence to find occurrences of AATAAA and
ATTAAA. Once a candidate PAS is found, the program
evaluates the following variables.
Average DE score. The bases from +16 to +100 are
scanned with the DE weight matrix shown in Table 2,
and the mean score calculated.
Weighted average DE hit position. During the DE
matrix scan, positions with a positive score (‘hits’) are
noted. The weighted average hit position is simply
calculated as:
$$P = \frac{\sum D\,d}{\sum D}$$
where D is the weight matrix score, d is the distance
between the candidate PAS and the hit position, and
the sums are over all hits encountered.
Downstream dimer preference. Preference scores for
each of the 16 dinucleotides were generated by determining the frequency at which that dimer occurs in the +1
to +100 region downstream of true PASes (f_t) and
pseudosignals (f_p). The dinucleotide’s preference score
is then f_t/(f_t + f_p). Given these preference scores, polyadq
scans the +1 to +100 region downstream of a candidate
PAS and averages the preference scores of each dinucleotide encountered.
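The three variables can be sketched as follows, using the Table 2 weights. This is an illustration under stated assumptions, not the polyadq source: the function names are hypothetical, the caller is assumed to pass the already-extracted downstream region with its first base at offset `first_offset`, the region is assumed to contain only A, C, G and T, and the handling of regions without hits and of dimers absent from both training sets is a guess.

```python
from collections import Counter

# Table 2 weights; one row per DE position 1-10, columns ordered A, C, G, T.
DE_MATRIX = [
    (-0.426, -0.396, -0.080, 0.903),
    (-0.557, -0.323, 0.093, 0.788),
    (-0.363, 0.108, -0.377, 0.632),
    (-0.480, 1.377, -0.589, -0.307),
    (-0.372, -0.551, -0.662, 1.585),
    (-0.666, 0.062, -0.038, 0.642),
    (-0.559, -0.555, -0.469, 1.584),
    (-0.743, -0.437, 0.417, 0.764),
    (-0.471, -0.646, 0.148, 0.969),
    (-0.666, -0.510, 0.543, 0.633),
]
BASE_INDEX = {"A": 0, "C": 1, "G": 2, "T": 3}


def de_scan(region, first_offset=16):
    """Score each 10-mer in region; region[0] is assumed to sit at +first_offset."""
    width = len(DE_MATRIX)
    scored = []
    for i in range(len(region) - width + 1):
        window = region[i:i + width]
        score = sum(DE_MATRIX[k][BASE_INDEX[base]] for k, base in enumerate(window))
        scored.append((first_offset + i, score))
    return scored


def de_features(region, first_offset=16):
    """Return (average DE score, weighted average hit position P)."""
    scored = de_scan(region, first_offset)
    average = sum(score for _, score in scored) / len(scored)
    hits = [(d, score) for d, score in scored if score > 0]
    if not hits:
        return average, None  # P is undefined without hits (assumed handling)
    position = sum(score * d for d, score in hits) / sum(score for _, score in hits)
    return average, position


def dimer_frequencies(regions):
    """Relative frequency of each dinucleotide over a set of regions."""
    counts = Counter(r[i:i + 2] for r in regions for i in range(len(r) - 1))
    total = sum(counts.values())
    return {dimer: n / total for dimer, n in counts.items()}


def dimer_preferences(true_regions, pseudo_regions):
    """Preference f_t / (f_t + f_p) for every dimer seen in either set."""
    f_t = dimer_frequencies(true_regions)
    f_p = dimer_frequencies(pseudo_regions)
    return {d: f_t.get(d, 0.0) / (f_t.get(d, 0.0) + f_p.get(d, 0.0))
            for d in set(f_t) | set(f_p)}


def mean_dimer_preference(region, prefs):
    """Average preference over the dimers of region; unseen dimers get 0.5 (assumed)."""
    dimers = [region[i:i + 2] for i in range(len(region) - 1)]
    return sum(prefs.get(d, 0.5) for d in dimers) / len(dimers)
```

For a candidate PAS, `de_features` would be applied to the +16 to +100 region and `mean_dimer_preference` to the +1 to +100 region, giving the three values passed to the discriminant.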
Once calculated, these values are supplied to the
appropriate QDF, giving a score for the candidate signal.
Note that polyadq does not use any sequence information upstream of a candidate PAS. In testing, it was
found that QDFs of upstream variables or combinations
of upstream and downstream variables generally performed worse than did those of downstream variables
only.
QDF training was performed using the S-PLUS software package (MathSoft, Seattle, WA, USA, 1997) by
means of the qda function described in Venables and
Ripley (1994). The AATAAA QDF was trained on a
set of 81 true and 258 false AATAAA signals, while the
ATTAAA QDF was trained on 17 true and 204 false
ATTAAA signals. The parameters describing these
QDFs were exported from S-PLUS for use in polyadq.
Readers are referred to Venables and Ripley (1994) and
Zhang (1997) for a more complete description of quadratic discriminant analysis.
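For readers unfamiliar with the technique, a quadratic discriminant can be sketched in plain Python as the difference of two Gaussian log-densities, each with its own mean vector and covariance matrix. This is a generic illustration of quadratic discriminant analysis, not the S-PLUS qda implementation or the exported polyadq parameters; all names, the equal default prior, and the zero decision threshold are assumptions.

```python
import math


def mean_and_cov(rows):
    """Sample mean vector and covariance matrix of a list of feature tuples."""
    n, d = len(rows), len(rows[0])
    mean = [sum(r[j] for r in rows) / n for j in range(d)]
    cov = [[sum((r[i] - mean[i]) * (r[j] - mean[j]) for r in rows) / (n - 1)
            for j in range(d)] for i in range(d)]
    return mean, cov


def solve_with_logdet(matrix, vec):
    """Gauss-Jordan elimination: return (matrix^-1 vec, log|det matrix|)."""
    d = len(vec)
    aug = [list(matrix[i]) + [vec[i]] for i in range(d)]
    logdet = 0.0
    for col in range(d):
        pivot = max(range(col, d), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        logdet += math.log(abs(aug[col][col]))
        for r in range(d):
            if r != col:
                factor = aug[r][col] / aug[col][col]
                aug[r] = [x - factor * y for x, y in zip(aug[r], aug[col])]
    return [aug[i][d] / aug[i][i] for i in range(d)], logdet


def log_gaussian(x, mean, cov):
    """Gaussian log-density up to an additive constant shared by both classes."""
    diff = [a - b for a, b in zip(x, mean)]
    solved, logdet = solve_with_logdet(cov, diff)
    mahalanobis = sum(a * b for a, b in zip(diff, solved))
    return -0.5 * (mahalanobis + logdet)


def qdf_score(x, true_stats, false_stats, prior_true=0.5):
    """Positive scores favor the 'true signal' class."""
    return (log_gaussian(x, *true_stats) - log_gaussian(x, *false_stats)
            + math.log(prior_true / (1.0 - prior_true)))
```

In use, `mean_and_cov` would be applied separately to the three-variable feature vectors of the true and false training signals, once for AATAAA and once for ATTAAA, and a candidate would be called positive when its score exceeds a chosen cut-off.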
polyadq is available through the World Wide Web.
Sequences in FASTA format may be submitted to polyadq through http://www.cshl.org/mzhanglab.
2.4. PAS prediction test
Along with polyadq, we tested the following
programs.
Simple. Intended solely as a baseline for PAS prediction performance, Simple is a minimal PAS finder that
just calls every instance of AATAAA in a sequence
a PAS.
Quant2. The Quant2 software was obtained through
http://www.icot.or.jp/AITEC/IFS/IFS-abst/098.html.
Because neither the training set used in Yada et al.
(1994) nor the sample data supplied with the program
contained ATTAAA-type PASes, we trained our local
copy of Quant2 on the polyadq training set. The training
fragments consisted of the −80 to +48 region of each
signal or pseudosignal, as described by Yada et al.
(1994). Since our pseudosignal set contains no non-AATAAA/ATTAAA PASes, Quant2 could not be
trained to recognize such signals. As such, we only
tested the program on 76 of the 78 true signals in the
test set (see Section 2.4.1).
POLYAH. Sequences were submitted to POLYAH
through the Baylor College of Medicine Gene Finder
web site (http://dot.imgen.bcm.tmc.edu:9331/genefinder/gf.html) or the e-mail server (service@theory.bchs.uh.edu).
GRAIL. Sequences were submitted to the GRAIL
web server (http://compbio.ornl.gov/Grail-bin/EmptyGrailForm). Predicted PASes on the minus strands of
the sequences were ignored.
2.4.1. Test procedure
We decided to test the PAS prediction
programs only on the calls they made at predetermined
negative sites in our test genes and on calls made within
a reasonable distance of a true polyA site. There were
several reasons for this: the programs GRAIL and
POLYAH do not report negative predictions, so we
could not get an actual true negative (TN) count for
them; Quant2 is intended (judging by its usage in Yada
et al., 1994 and our own experience with the program)
only to evaluate particular sequence fragments, not to
scan whole sequences; the various programs have
different ‘dead zones’ near the 5′ and 3′ ends of a
sequence where they cannot make any calls; and we
wanted to ensure that all of the programs were working
with the same data.
For our test sequence set, human DNA sequences
dated 1995 or later (selected because this would avoid
the training sets of POLYAH and GRAIL) and containing both ‘polyA_signal’ and ‘polyA_site’ feature tags
were drawn from GenBank release 106 (4/98). These
sequences were inspected to ensure that the annotated
sites seemed reasonable with respect to the literature on
mRNA 3′ end processing. For example, tagged PASes
on a gene’s non-coding strand were considered irrelevant
and therefore disregarded. Sequences already in our
polyA site database (i.e. the polyadq training set) were
also excluded.
Two sets of negative sites were constructed. For the
first, the ‘full gene’ test, the entire coding sequence
(CDS) and introns of the test sequences were scanned
for AATAAA and ATTAAA hexamers, as described in
Section 2.1 for polyadq’s negative training set. The 3′
UTRs were excluded because they may contain functional but unobserved PASes. For the second negative
set, the ‘last two exons’ test, we scanned only the one
or two (depending on availability) 3′-most introns and
exons of each gene for AATAAA and ATTAAA, again
excluding the 3′ UTRs. We used the last two exons and
introns because this made the number of negative sites
in this test approximately equal to the number of
positive sites.
The test set contains 74 sequences encoding 75 genes
with 78 PASes (66 AATAAA, 10 ATTAAA, two others).
There are 307 pseudosignals (186 AATAAA, 121
ATTAAA) in the full gene test set, and 72 (49 AATAAA,
23 ATTAAA) in the last two exons test set. Because
PAS prediction programs may utilize up to 100 bases
on either side of a putative PAS, signals and pseudosignals
with fewer than 100 bases of flanking sequence on either
side are not included in these figures.
Called PASes within 50 bases upstream of a positive
site in the test set were scored as true positives (TP). In
several cases, there are multiple possible PASes associated with a polyA site. If a program called more than
one of these, it was counted as only a single TP. One
sequence in the test set (GenBank accession U31767)
apparently has a PAS more than 50 bases from its polyA
site, and an allowance was made for this. Lack of a
called PAS within the 50 base window was counted as
a false negative (FN). False positives (FP) and TNs
were counted, respectively, when a program called or
did not call a positive at a test pseudosite. All other
calls (generally those within a gene’s 3′ UTR or
intergenic region) were classified as Nulls.
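The scoring rules above can be summarized in a short sketch. The function name, the use of plain integer coordinates on one strand, and the exact window boundaries (a call counts if it lies 1 to 50 bases upstream of the site) are assumptions; the special allowance made for U31767 is not modeled.

```python
def score_calls(calls, polya_sites, pseudosites, window=50):
    """Classify called PAS positions into TP, FN, FP, TN and Null counts.

    calls: positions at which a program called a PAS
    polya_sites: positions of true polyA sites
    pseudosites: predetermined negative (pseudosignal) positions
    """
    calls = set(calls)
    pseudo = set(pseudosites)
    used = set()
    tp = fn = 0
    for site in polya_sites:
        near = {c for c in calls if 0 < site - c <= window}
        if near:
            tp += 1          # several calls near one site count as a single TP
            used |= near
        else:
            fn += 1
    fp = len(calls & pseudo)
    tn = len(pseudo - calls)
    null = len(calls - used - pseudo)  # e.g. calls in 3' UTRs or intergenic regions
    return {"TP": tp, "FN": fn, "FP": fp, "TN": tn, "Null": null}
```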
Note that as a consequence of our scoring system, if
a program called a positive at a non-AATAAA/ATTAAA site within range of a true polyA
site, it was counted as a TP, but similar calls made
elsewhere were not scored as FPs. Since polyadq cannot
make non-AATAAA/ATTAAA calls, but GRAIL and
POLYAH theoretically can, the latter programs benefited from this scoring scheme. GRAIL and POLYAH
were given a further potential advantage in that our test
set contains two non-AATAAA/ATTAAA positives.
3. Results
3.1. Characterization of polyA signals
Ironically, our initial attempts at creating a PAS
prediction program were hindered by the fact that it is
so widely known that the hexamer AATAAA is the
signal sequence for polyadenylation. Early on, it became
fairly obvious from compilations of GenBank sequences
containing polyA_signal feature tags that the submitters
often attached such tags to any AATAAA (and sometimes anything remotely similar) downstream of a gene’s
CDS. To base our program on these data could only
serve to reinforce preconceived notions about the nature
of the PAS. Further complications arose from reports
that run counter to the conventional wisdom: that the
consensus PAS may actually be represented by the
octamer CAATAAAY (Yada et al., 1994) and that the
PAS may not be present in the majority of mRNAs
(Claverie, 1997). It was therefore decided that it would
be best to build our polyA database from scratch, by
first searching for polyadenylated ESTs and then using
these to identify the polyA sites in their corresponding
gene or mRNA sequences. We could then use the
sequence motif-finding program gibbs-seq to ‘rediscover’
the signals in the vicinity of the polyA site.
When trained on the 50 bases upstream of the polyA
sites of each sequence in our database, gibbs-seq quite
readily finds the canonical PAS hexamer. Fig. 2 is a
sequence logo of the PASes found by gibbs-seq. There
is a slight sequence bias toward A and C in the position
immediately 5′ to the signal hexamer (the actual base
counts are 109 A, 84 C, 32 G, 55 T), but clearly if this
position were considered to be part of the PAS, it would
make no significant contribution to the signal’s total
information content (0.12 bits vs. 10.45 bits for the
hexamer). There is no remarkable sequence bias in the
3′ flanking position. These findings generally refute the
extended PAS proposed by Yada et al. (1994). We also
note that when we trained Yada’s program, Quant2, on
our positive and negative training sets, the positions
immediately adjacent to the PAS in the weight matrix
generated by the program (not shown) show no significant base preferences.
Fig. 2. Sequence logo (Schneider and Stephens, 1990) of the polyA
signals (positions 11–16) found by gibbs-seq, along with flanking
sequence.
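The information content quoted for the 5′ flanking position can be checked from the base counts given in the text. The sketch below assumes the usual sequence-logo definition (2 bits minus the Shannon entropy of the base frequencies) with the approximate small-sample correction 3/(2 n ln 2); whether the authors applied that correction is an assumption, although with it the result rounds to the quoted 0.12 bits, and without it to 0.13 bits.

```python
import math


def information_content(counts, small_sample_correction=True):
    """Information content (bits) of one DNA position from its base counts."""
    n = sum(counts.values())
    entropy = -sum((c / n) * math.log2(c / n) for c in counts.values() if c > 0)
    bits = 2.0 - entropy
    if small_sample_correction:
        bits -= 3.0 / (2.0 * math.log(2) * n)  # approximate correction for 4 symbols
    return bits


flank_counts = {"A": 109, "C": 84, "G": 32, "T": 55}
flank_bits = information_content(flank_counts)
```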
Broken down by PAS sequence, the 280 mRNAs in
our database contain 214 AATAAA (76.4%), 37
ATTAAA (13.2%), and 22 (7.9%) other one-base variants of AATAAA (e.g. AGTAAA). The remaining seven
sequences (2.5%) contain two-base variants of
AATAAA, but since there is high probability (~82%)
of finding a two-base variant of any given hexamer
within a 50-base window of random sequence, we consider these latter sequences as having no discernible
PAS. The proportions among our 144 DNA sequences
(which partially overlap the mRNAs) are similar: 104
AATAAA (72.2%), 22 ATTAAA (15.3%), 14 other one-base variants (9.7%), and four (2.8%) with no discernible
PAS. We note that our fraction of AATAAA (~75%)
is somewhat lower than the oft-quoted 90% figure
(Wahle, 1995; Colgan and Manley, 1997), but on the
whole our figures tend to support prevailing opinion
more than the observations of Claverie (1997).
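The ~82% figure can be reproduced with a back-of-the-envelope calculation. The sketch below assumes equiprobable random bases, counts every hexamer within two mismatches of the target (including exact and one-base matches), and treats the 45 overlapping start positions in a 50-base window as independent; these are simplifying assumptions that happen to give about 0.82, and are not necessarily the calculation the authors performed.

```python
from math import comb


def prob_variant_in_window(k=6, max_mismatches=2, window=50):
    """Chance that a random window holds a k-mer within max_mismatches of a target."""
    matching = sum(comb(k, m) * 3 ** m for m in range(max_mismatches + 1))
    p_site = matching / 4 ** k              # 154 / 4096 for the defaults
    n_positions = window - k + 1            # 45 start positions for the defaults
    return 1.0 - (1.0 - p_site) ** n_positions


two_base_variant_chance = prob_variant_in_window()
```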
Examination of our DE matrix (Table 2) shows that
it prefers U residues at every position except the fourth,
and will accept C at positions 3, 4, and 6, and G at 2,
8, 9, and 10. It is notable that this appears to be a
somewhat compressed version of the SELEX products
generated by the RBD of the 64 kDa subunit of CStF,
reported by Takagaki and Manley (1997).
When the matrix scores of the PASes in the DNA
sequences of our database are plotted against the maximum DE score for each (Fig. 3), an inverse relationship
between the strengths of the PASes and their associated
DEs is observed. The median DE scores for the different
classes of PAS are 6.40 for AATAAA (PAS matrix
score=10.5); 6.70 for ATTAAA (PAS score=8.95);
6.88 for other one-base variants of AATAAA (PAS
scores between 8.00 and 8.50); and 8.32 for two-base
variants (PAS scores less than 7). With the caveat that
one should be cautious about extrapolating physical
meaning from these computer-generated matrix scores,
this seems to reflect the known cooperativity of PAS
and DE activity (Wahle, 1995; Colgan and Manley,
1997).
Fig. 3. Relationship between maximum DE matrix score and polyA
signal score for the DNA sequences in our polyA database. AATAAA
signals score 10.5 on the vertical axis; ATTAAA=8.95; other one-base
variants, ~8.25; two-base variants, less than 7.
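The PAS matrix scores quoted for each class of signal follow directly from summing the Table 1 weights along the hexamer, as the following sketch shows. The function name is hypothetical; the small difference between the computed ATTAAA score and the quoted 8.95 presumably reflects rounding of the published weights.

```python
# Table 1 weights; for each base, one weight per hexamer position 1-6.
PAS_MATRIX = {
    "A": (1.806, 1.449, -0.569, 1.836, 1.774, 1.839),
    "C": (-0.644, -0.756, -0.632, -0.616, -0.550, -0.634),
    "G": (-0.581, -0.617, -0.579, -0.637, -0.674, -0.634),
    "T": (-0.581, -0.076, 1.780, -0.583, -0.550, -0.571),
}


def pas_score(hexamer):
    """Sum the weight of the observed base at each of the six positions."""
    return sum(PAS_MATRIX[base][i] for i, base in enumerate(hexamer.upper()))


canonical = pas_score("AATAAA")   # about 10.48
variant = pas_score("ATTAAA")     # about 8.96
one_base = pas_score("AGTAAA")    # about 8.42, within the 8.00 to 8.50 range
```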
The spatial relationship between the polyA sites in
our database and their corresponding PASes and maximally scoring DEs is shown in Fig. 4. The PASes are
tightly distributed about 10 to 25 bases upstream of the
polyA sites, in agreement with prior observations
(Colgan and Manley, 1997). Most of the DEs occur
between 5 and 30 bases downstream of the polyA site,
again in accord with other data (Colgan and Manley,
1997; Chou et al., 1994; Chen et al., 1995). The DE
distribution also contains a secondary peak around 40
bases. While interesting, this is probably not surprising
in light of the fact that the DE matrix was trained
assuming two DEs per sequence. It is notable, however,
that although the DE matrix was generated using
sequence fragments consisting of the first 100 bases after
the polyA sites in our training set, in no case does the
best DE occur more than 50 bases from the polyA site.
3.2. polyadq
The design philosophy behind polyadq can best be
understood by examination of Fig. 3 and the test results
described below. Our first attempts to develop a QDF
for PAS detection included a PAS weight matrix score.
It was found that this one variable consistently dominated such QDFs to the extent that they called a positive
at nearly every AATAAA in a sequence, and a negative
almost everywhere else, including most ATTAAA-type signals, which comprise ~15% of the PASes in our
database. The performance statistics for these QDFs
were in fact practically identical to those of the LDF-based POLYAH (see Tables 3 and 4). We drew two
conclusions from this: first, that the QDF should not
directly use any kind of score that concerns the PAS
hexamer itself; and second, that the greatest gain in PAS
prediction performance was to be made by accurately
detecting ATTAAA signals while simultaneously reducing the false positive rate of AATAAA signal prediction.
Fig. 4. Distribution of identified PAS (open bars) and DE (closed bars) positions relative to the polyA site. Negative and positive coordinates are,
respectively, 5′ and 3′ of the polyA site at zero. Positions are for the first base of the PAS or DE. Only the highest scoring DE for each polyA site
is counted.
Table 3
PolyA signal prediction test results for full genes (A) and the last two exons of genes (B). See Section 2.4.1 for test conditions
                      TP    TN    FP    FN   SNa (%)   SPb (%)     CCc
(A) Full Genes
Simple                66   121   186    12     84.6      26.2    0.203
Quant2                35   213    86    41     46.1      28.9    0.149
GRAIL                 66   121   186    12     84.6      26.2    0.203
POLYAH                67   125   182    11     85.9      26.9    0.224
polyadq               48   256    51    30     61.5      48.5    0.413
(B) Last Two Exons
Simple                66    23    49    12     84.6      57.4    0.196
Quant2                35    49    20    41     46.1      63.6    0.176
GRAIL                 66    23    49    12     84.6      57.4    0.196
POLYAH                67    25    47    11     85.9      58.8    0.241
polyadq               50    62    10    28     64.1      83.3    0.512
a SN = sensitivity = TP/(TP+FN).
b SP = specificity = TP/(TP+FP).
c CC = correlation coefficient:
$$\mathrm{CC} = \frac{TP \cdot TN - FP \cdot FN}{\sqrt{(TP+FP)(TP+FN)(TN+FP)(TN+FN)}}$$
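The three performance measures defined in the footnotes to Table 3 can be recomputed from the raw counts. The sketch below uses a hypothetical function name; note that SP, as defined here, is the fraction of positive calls that are correct.

```python
import math


def performance(tp, tn, fp, fn):
    """Return (SN, SP, CC) as defined in the footnotes to Table 3."""
    sn = tp / (tp + fn)
    sp = tp / (tp + fp)
    cc = (tp * tn - fp * fn) / math.sqrt(
        (tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return sn, sp, cc


# polyadq rows of Table 3: (A) full genes and (B) last two exons.
full_gene = performance(tp=48, tn=256, fp=51, fn=30)
last_two_exons = performance(tp=50, tn=62, fp=10, fn=28)
```

Rounded to the precision of the table, these give SN 61.5%, SP 48.5%, CC 0.413 for the full gene test and SN 64.1%, SP 83.3%, CC 0.512 for the last two exons test.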
We therefore set about training a QDF that classifies
AATAAA and ATTAAA hexamers using only the flanking sequences. But in Fig. 3, the two types of signals
obviously have different maximum DE score distributions. Likewise, we found that the values of other
variables (not shown) were distributed differently
depending on PAS hexamer sequence. To allow for these
differences without use of any type of PAS score in the
discriminant, we decided to implement our program
with separate QDFs for evaluating AATAAA and
ATTAAA-type signals.
Hence, polyadq. The results of our tests on polyadq
and the other PAS prediction programs are summarized
in Table 3 and in Fig. 5. Table 3 shows prediction performance on full genes (part A) and on the 3′ terminal
region (last two exons and introns) of the test genes
(part B). Since we envision polyadq being used in
conjunction with exon-finding programs such as MZEF
(Zhang, 1997), which can help to restrict the area being
searched, we consider the latter case to be more relevant,
and so have optimized polyadq for searching in the
terminal regions of genes. As is clear from Table 3,
polyadq has a considerably higher correlation coefficient
(CC) than the other programs in this test. In the
sensitivity–specificity curve in Fig. 5, it can be seen that
when polyadq is adjusted (by means of the program’s
two cut-off scores for calling AATAAA or ATTAAA
positives) to the same sensitivity level as POLYAH and
GRAIL, polyadq still has a higher specificity level than
these other programs.
Table 4 reveals a rather surprising result. Here we
have classified by PAS sequence the total positive calls
made by polyadq, Simple, GRAIL, and POLYAH.
Clearly, the GRAIL PAS finder does exactly what our
Simple program does, calling a positive wherever it
encounters an AATAAA and negatives everywhere else.
POLYAH does only slightly better, calling negatives at
six of the 297 AATAAA hexamers scanned and positives
at eight non-AATAAA sites. polyadq, on the other
hand, finds a significant number of ATTAAA signals
and better discriminates between true AATAAA signals
and non-PAS AATAAA hexamers in the test set.

Table 4
Breakdown of positive calls by polyA signal sequence for Simple, GRAIL, POLYAH, and polyadq. Figures under ‘All Calls’ are for all positives called by each program, not just those counted in the prediction test. Entries for AATAAA and ATTAAA counts in this column are listed as (number called)/(number scanned). The number scanned for each program excludes AATAAA and ATTAAA hexamers in that program’s dead zone. Under ‘Last 2 Exons’, positive calls are further broken down by their classification in the last two exons test of Table 3B.

                                   Last 2 Exons
Program   Signal    All Calls      TP    FP    Null
Simple    AATAAA    312/312        66    49    197
          ATTAAA    0/195
          Other     0
GRAIL     AATAAA    304/304        66    49    189
          ATTAAA    0/192
          Other     0
POLYAH    AATAAA    291/297        65    46    180
          ATTAAA    4/188          1     0     3
          Other     4              1     1     2
polyadq   AATAAA    99/301         42    9     48
          ATTAAA    41/190         8     1     31
          Other     0

Fig. 5. Sensitivity–specificity curve for polyadq (solid line) in the last two exons test compared with performance of other programs (Quant2=&; POLYAH=%; Simple/GRAIL=#). To obtain the curve for polyadq, the program’s cut-off parameters were initially adjusted to call a single positive in the test set. The parameters were then incremented so as to add one positive call at a time to its predictions. For each run, SN and SP statistics were calculated, and sorted by SN into bins 1% wide. The highest SP in each bin is plotted here.

To compare the performance of polyadq with that of Kondrakhin’s generalized weight matrix, we used our program to detect the PASes in the direct strand of the adenovirus type 2 genome (Ad2; GenBank accession J01917). With its cut-off values set to give maximum performance in the last two exons test, polyadq found seven of the nine PASes in the sequence with only one FP. At the same level of sensitivity, Kondrakhin et al. (1994) report approximately 825 FPs in the adenovirus genome. This is probably a somewhat unfair comparison, since polyadq considered only 28 sites while Kondrakhin’s matrix was used to evaluate every 66 base segment in the ~36 kbp genome. The latter’s performance would undoubtedly be better if the matrix were used to score only those segments containing a properly placed AATAAA or ATTAAA. Nevertheless, these are the results reported by the matrix’s constructors and therefore presumably reflect its intended use.

As a further comparison with the other programs, we found that when tested on the Ad2 direct strand, Quant2 reports two TP with four FP; GRAIL and the Simple method, eight TP with six FP; and POLYAH, eight TP with five FP (this is for the version of POLYAH available through the BCM e-mail server; Salamov and Solovyev (1997) report that POLYAH can be set to generate eight TP with four FP). When adjusted to the same sensitivity level as POLYAH and GRAIL, and even when set to find all nine Ad2 direct strand PASes, polyadq reports only two FPs.

4. Discussion
We have shown here that our program, polyadq, is
better able to detect polyA signals in sequences than
other previously published programs. This is accomplished using quadratic discriminants in only three variables covering only 100 bases of sequence, while the
best competing program uses seven variables and twice
as much sequence data. This means polyadq can be used
where the amount of sequence available is limited due
to noise or other considerations, and can be expected
to be more robust, since the small number of feature
variables has likely forced the program to learn
general features, rather than idiosyncrasies, of its training data. Furthermore, polyadq is the first program to
perform substantially better than the simplistic approach
of calling every AATAAA a polyA signal.
The disparity between the performance numbers of
POLYAH in our tests and those obtained by Salamov
and Solovyev (1997) probably deserves some explanation. Our test set contains several large genomic and
‘complete CDS’ sequences that were placed in GenBank
after the publication of POLYAH. The majority of false
positives called by POLYAH were in the introns of these
sequences. The sequences dated before about
mid-1997 — those that would have comprised the data
sets used by Salamov and Solovyev — were primarily
single (3′ terminal) exon entries. Our last two exons
performance test was therefore probably closer to that
performed by POLYAH’s authors. And indeed, in this
test POLYAH does exhibit the same level of sensitivity
and higher specificity than Salamov and Solovyev
reported. Yet our CC for POLYAH (0.241) is much
lower than what they found (0.62). This is because of
the nature of our test. For the reasons stated in
Section 2.4.1, we tested prediction at predetermined sites
within our sequences, rather than on all possible sites.
Based on the description by Salamov and Solovyev
(1997) and the propensities illustrated in Table 4, we
presume that POLYAH considers and rejects a large
number of non-AATAAA sites as it scans a sequence,
resulting in a high TN count and a correspondingly
higher CC.
Suppose, for the sake of argument, that in our
last two exons test POLYAH evaluated and called
negatives at an additional 250 pseudosites. POLYAH
also made one non-AATAAA FP prediction in the last
two exons test (see Table 4), so we would also include
this in the program’s score. This would raise POLYAH’s
CC to 0.62, as reported by the program’s authors. If we
added these 251 pseudosites to the test set, polyadq
would call negatives at all of the new sites, since they
would be neither AATAAA nor ATTAAA — all pseudosites of these types were in the original test set.
polyadq’s CC on this new test set would then be 0.68,
once again higher than POLYAH. For reference, we
would also find that the CC of the Simple program
would be 0.61 in this hypothetical test.
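The effect described in this argument, that extra sites correctly called negative raise CC without any change in TP, FP, or FN, can be checked numerically. The sketch below implements the CC formula given with Table 3; the function name and the counts are hypothetical and are not taken from Table 3.

```python
import math


def correlation_coefficient(tp: int, tn: int, fp: int, fn: int) -> float:
    """CC = (TP*TN - FP*FN) / sqrt((TP+FP)(TP+FN)(TN+FP)(TN+FN))."""
    denom = (tp + fp) * (tp + fn) * (tn + fp) * (tn + fn)
    if denom == 0:
        return 0.0  # convention chosen here for degenerate tables
    return (tp * tn - fp * fn) / math.sqrt(denom)


# Hypothetical counts for illustration only.
tp, fp, fn, tn = 60, 40, 20, 80
cc_before = correlation_coefficient(tp, tn, fp, fn)

# Add 250 pseudosites that the program evaluates and calls negative:
# only TN changes, yet CC rises.
cc_after = correlation_coefficient(tp, tn + 250, fp, fn)

print(f"CC before: {cc_before:.3f}, CC after adding 250 TN: {cc_after:.3f}")
```

Because TN appears in both the numerator and the denominator, CC depends on how many trivially rejectable sites are counted, which is why CC values are comparable only between programs scored on the same set of candidate sites.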
We note that the comprehensive gene-finding program
GENSCAN (Burge and Karlin, 1997) has PAS detection
capability. We did not include GENSCAN in this analysis, however, because the PAS finder does not operate
in a standalone fashion. GENSCAN’s PAS finder is
integrated into the program’s gene model such that
detection of an upstream CDS is required for PAS
detection. In fact, in our tests of the program (not
shown), the primary reason GENSCAN missed
PASes was that it could not find, or it misidentified,
the 3′ end of the CDS. It also bears mentioning that
GENSCAN’s model does not allow multiple PASes in
a gene to be identified, and that GENSCAN’s PAS
model covers only the signal hexamer, disregarding the
influence of the DE. We have found that the latter gives
GENSCAN a marked tendency to place the PAS at the
first AATAAA downstream of a gene’s CDS.
Some readers may ask about the relative importance
of the three feature variables used in polyadq.
Unfortunately, this cannot be answered rigorously. One
of the advantages of using a quadratic, rather than
linear, discriminant is that correlations between variables
can be taken into account. The drawback to this is that
the individual contributions of the variables to the QDF
score cannot be separated. Subjectively, though, the
authors observed that the most important factor in
obtaining a good PAS discriminant was the DE matrix
score/DE position score pair employed. We tried several
DE sequence (average DE, best DE, top two DEs, etc.)
and DE position (weighted average hit position, best
DE position, top two positions, etc.) scoring strategies,
and it appeared, for instance, that the average DE score
variable must be paired with some type of average DE
position variable to get good predictive performance.
The downstream dimer preference score seems to be of
secondary importance, but works well in combination
with any DE score/DE position pair.
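To make the idea of matched score/position pairs concrete, the sketch below computes two such pairs from a list of DE matrix hits. The definitions used here (mean hit score paired with a score-weighted mean position, and best hit score paired with that hit's position) are plausible readings of the strategy names above, not the definitions used in polyadq. The function name and input layout are hypothetical, and scores are assumed to be positive.

```python
from typing import Dict, List, Tuple

Hit = Tuple[int, float]  # (position downstream of the hexamer, DE matrix score)


def de_feature_pairs(hits: List[Hit]) -> Dict[str, Tuple[float, float]]:
    """Return matched (DE score, DE position) pairs for two scoring strategies."""
    if not hits:
        raise ValueError("at least one DE hit is required")
    total_score = sum(score for _, score in hits)
    if total_score <= 0:
        raise ValueError("scores are assumed to be positive")

    # "Average" pair: mean score with score-weighted mean position.
    avg_score = total_score / len(hits)
    weighted_pos = sum(pos * score for pos, score in hits) / total_score

    # "Best" pair: the single highest-scoring hit and where it sits.
    best_pos, best_score = max(hits, key=lambda hit: hit[1])

    return {
        "average": (avg_score, weighted_pos),
        "best": (best_score, float(best_pos)),
    }
```

Pairing each score summary with the position summary computed from the same hits keeps the two variables describing the same evidence, which is one way to read the observation that an average DE score needs an average DE position alongside it.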
Accurate identification of polyA signals is important
because it gives us a handle on a region of the gene that
has been largely ignored by computational analysis, the
3′ UTR. Being the only part of an mRNA not scanned
by the ribosome, the 3′ UTR is quite likely to contain
post-transcriptional regulatory and cellular localization
signals (Pesole et al., 1997), especially those that involve
large secondary structures. The in silico characterization
of gene expression and control will therefore require
reliable determination of the boundaries of the 3′ UTR.
Additionally, the prediction of terminal exons of genes
will benefit from improvements in the detection of polyA
signals. We are presently working on the application of
the research described here to these areas.
Acknowledgement
This work was supported by a grant from the Merck
Genome Research Institute.
References
Ahmed, Y.F., Gilmartin, G.M., Hanly, S.M., Nevins, J.R., Greene,
W.C., 1991. The HTLV-I rex response element mediates a novel
form of mRNA polyadenylation. Cell 64, 727–737.
Benson, D.A., Boguski, M.S., Lipman, D.J., Ostell, J., Ouellette,
B.F.F., 1998. GenBank. Nucleic Acids Res. 26, 1–7.
Beyer, K., Dandekar, T., Keller, W., 1997. RNA ligands selected by
cleavage stimulation factor contain distinct sequence motifs that
function as downstream elements in 3′-end processing of pre-mRNA. J. Biol. Chem. 272, 26769–26779.
Brown, P.H., Tiley, L.S., Cullen, B.R., 1991. Effect of RNA secondary
structure on polyadenylation site selection. Genes Devel. 5,
1277–1284.
Burge, C., Karlin, S., 1997. Prediction of complete gene structures in
human genomic DNA. J. Mol. Biol. 268, 78–94.
Chen, F., MacDonald, C.C., Wilusz, J., 1995. Cleavage site determinants in the mammalian polyadenylation signal. Nucleic Acids Res.
23, 2614–2620.
Chou, Z.-F., Chen, F., Wilusz, J., 1994. Sequence and position requirements for uridylate-rich downstream elements of polyadenylation
signals. Nucleic Acids Res. 22, 2525–2531.
Claverie, J.-M., 1997. Computational methods for the identification of
genes in vertebrate genomic sequences. Human Mol. Genet. 6,
1735–1744.
Colgan, D.F., Manley, J.L., 1997. Mechanism and regulation of
mRNA polyadenylation. Genes Devel. 11, 2755–2766.
Edwalds-Gilbert, G., Milcarek, C., 1995. Regulation of poly(A) site
use during mouse B-cell development involves a change in the binding of a general polyadenylation factor in a B-cell stage-specific
manner. Mol. Cell. Biol. 15, 6420–6429.
Fickett, J.W., Tung, C.-S., 1992. Assessment of protein coding measures. Nucleic Acids Res. 20, 6441–6450.
Fickett, J.W., 1996. Finding genes by computer: the state of the art.
Trends Genet. 12, 316–320.
Heumann, J.M., Lapedes, A.S., Stormo, G.D., 1994. Neural networks
for determining protein specificity and multiple alignment of binding sites, Proc. 2nd Int. Conf. on Intelligent Systems for Molecular
Biology. AAAI Press, Menlo Park, CA, pp. 188–194.
Keller, W., 1995. No end yet to messenger RNA 3′ processing! Cell
81, 829–832.
Kondrakhin, Yu.V., Shamin, V.V., Kolchanov, N.A., 1994. Construction of a generalized consensus matrix for recognition of vertebrate
pre-mRNA 3′-terminal processing sites. CABIOS 10, 597–603.
Lou, H., Neugebauer, K.M., Gagel, R.F., Berget, S.M., 1998. Regulation of alternative polyadenylation by U1 snRNPs and SRp20.
Mol. Cell. Biol. 18, 4977–4985.
Matis, S., Xu, Y., Shah, M., Guan, X., Einstein, J.R., Mural, R.,
Uberbacher, E., 1996. Detection of RNA polyadenylation sites in
human DNA sequence. Comput. Chem. 20, 135–140.
McLauchlan, J., Gaffney, D., Whitton, J.L., Clements, J.B., 1985. The
consensus sequence YGTGTTYY located downstream from the
AATAAA signal is required for efficient formation of mRNA 3′
termini. Nucleic Acids Res. 13, 1347–1368.
Pearson, W.R., Lipman, D.J., 1988. Improved tools for biological
sequence comparison. Proc. Natl. Acad. Sci. USA 85, 2444–2448.
Pesole, G., Fiormarino, G., Saccone, C., 1994. Sequence analysis and
compositional properties of untranslated regions of human
mRNAs. Gene 140, 219–225.
Pesole, G., Liuni, S., Grillo, G., Saccone, C., 1997. Structural and
compositional features of untranslated regions of eukaryotic
mRNAs. Gene 205, 95–102.
Salamov, A.A., Solovyev, V.V., 1997. Recognition of 3′-processing
sites of human mRNA precursors. CABIOS 13, 23–28.
Schneider, T.D., Stephens, R.M., 1990. Sequence logos: a new way to
display consensus sequences. Nucleic Acids Res. 18, 6097–6100.
Takagaki, Y., Seipelt, R.L., Peterson, M.L., Manley, J.L., 1996. The
polyadenylation factor CstF-64 regulates alternative processing of
IgM heavy chain pre-mRNA during B cell differentiation. Cell
87, 941–952.
Takagaki, Y., Manley, J.L., 1997. RNA recognition by the human
polyadenylation factor CstF. Mol. Cell. Biol. 17, 3907–3914.
Venables, W.N., Ripley, B.D., 1994. Modern Applied Statistics with
S-Plus. Springer, New York.
Wahle, E., 1995. 3′-End cleavage and polyadenylation of mRNA precursors. Biochim. Biophys. Acta 1261, 183–194.
Yada, T., Ishikawa, M., Totki, Y., Okubo, K., 1994. Statistical Analysis of Human DNA Sequences in the Vicinity of POLY(A)
SIGNAL. ICOT Technical Report TR-876.
Zhang, M.Q., 1997. Identification of protein coding regions in the
human genome by quadratic discriminant analysis. Proc. Natl.
Acad. Sci. USA 94, 565–568.
Zhang, M.Q., 1998. Statistical features of human exons and their
flanking regions. Human Mol. Genet. 7, 919–932.