Physiol Genomics
5: 81–87, 2001.
Amino acid translation program for full-length cDNA
sequences with frameshift errors
YOSHIFUMI FUKUNISHI AND YOSHIHIDE HAYASHIZAKI
Laboratory for Genome Exploration Research Group, RIKEN Genomic Sciences Center, RIKEN
Yokohama Institute, Yokohama City, Kanagawa 230-0045; Genome Science Laboratory, RIKEN
Tsukuba Institute, and Core Research of Evolutional Science and Technology, Japan Science
and Technology Corporation, Tsukuba City, Ibaraki 305-0074, Japan
Received 5 October 2000; accepted in final form 19 January 2001
phred score; Kozak consensus; codon usage; initiation codon;
base-call error
A LARGE-SCALE SEQUENCING EFFORT
has been started at the
Institute of Physical and Chemical Research (RIKEN)
that includes compiling the complete coding sequence
(CDS) of all the mouse full-length cDNAs. Base-call
error in the CDS, especially frameshift error, is a
relevant problem when trying to deduce amino acid
sequences from uncorrected cDNA or genomic sequences. Complete determination (finishing) of these
sequences is time-consuming, and information about
the predicted position of frameshift error is useful
when designing primers for the finishing step. In addition, having a first view of the possible amino acid
sequence facilitates study and classification of genes
before the finishing steps.
Several methods have been developed to identify and
correct sequencing-derived frameshift errors. Iseli and
colleagues (9) have developed ESTScan, a program for
detecting and reconstructing the coding sequence in
expressed sequence tags (ESTs) that may carry frameshift errors. For predicting the CDS, this method
adopted a novel hidden Markov model, which is robust
against frameshift error. However, most computer software programs designed to correct frameshifts in DNA
sequences have been developed for analyzing genomic
DNA rather than full-length cDNA sequences (11, 13,
15). Efforts to determine all the full-length cDNAs of
an organism, such as the RIKEN Full-Length cDNA
Encyclopedia Project, require high-quality correction of
unfinished cDNA sequences to maximize the accuracy
of the putative amino acid translation predictions.
Article published online before print. See web site for date of
publication (http://physiolgenomics.physiology.org).
Address for reprint requests and other correspondence: Y. Fukunishi, 1-1, Higashi/Tsukuba, Ibaraki 305-0074, Japan (E-mail: rgscerg@gsc.riken.go.jp).
The conventional approach to correct sequence errors tries to identify frameshift errors through exon
predictions of the six frames of the experimental sequence. Exon prediction can be achieved through homology search against known genes or genomic sequences of closely related organisms. Alternative exon
prediction methods (e.g., the Markov model) are more
elegant and incorporate amino acid frequency, preferred codon usage, and other statistical information.
Regardless of the method used, the final result is that
two, shorter open reading frames (ORFs) are merged
by changing frames, thereby creating a single, longer
ORF. However, the conventional method is associated
with two main disadvantages.
The first problem with the conventional method of
exon prediction is that this method may predict two or
three exons, reflecting different frames, for the same
region. In addition, the regulatory mechanism of translation may require a low preferred codon usage in the
CDS to regulate the amino acid translation for some
sequences. In this situation, conventional methods have difficulty predicting the exons that comprise
the corrected CDS.
The second problem is that traditional software programs for exon prediction are designed to correct
frameshifts in light of information from text-base (sequence of bases) analysis of DNA sequences. However,
these programs fail to account for the quality of the
data (e.g., the clarity of the gel and length of run) that
gave rise to the inaccurate sequence. Both the text-base and sequence quality information are very important components of the correction process.
1094-8341/01 $5.00 Copyright © 2001 the American Physiological Society
Downloaded from http://physiolgenomics.physiology.org/ by 10.220.33.3 on June 18, 2017
Fukunishi, Yoshifumi, and Yoshihide Hayashizaki.
Amino acid translation program for full-length cDNA sequences with frameshift errors. Physiol Genomics 5: 81–87,
2001.—Here we present an amino acid translation program
designed to suggest the position of experimental frameshift
errors and predict amino acid sequences for full-length cDNA
sequences having phred scores. Our program generates artificial insertions into and artificial deletions from low-accuracy
positions of the original sequence, thereby generating many
candidate sequences. The validity of the most probable sequence (the likelihood that it represents the actual protein) is
evaluated by using a score (Va) that is calculated in light of
the Kozak consensus, preferred codon usage, and position of
the initiation codon. To evaluate the software, we have used
a database in which, out of 612 cDNA sequences, 524 (86%)
carried 773 frameshift errors in the coding sequence. Our
software detected and corrected 48% of the total frameshift
errors in 62% of the total cDNA sequences with frameshift
errors. The false positive rate of frameshift correction was
9%, and 91% of the suggested frameshifts were true.
AMINO ACID TRANSLATION PROGRAM FOR FULL-LENGTH cDNA
METHOD
Removing a frameshift that arose from a sequencing error
requires inserting or deleting a variable number of bases.
Designed to correct frameshifts in this way, our computer
program is composed of three principal steps. The first step
generates all possible sequences created after one or two
bases are inserted into or deleted from the site of the presumed frameshift. The second step calculates the Va score,
which represents the likelihood that the modified sequence is
the correct sequence. In the third step, the Va score is used to
choose the most probable sequence among the various candidates. The Va score reflects the Kozak consensus (10),
preferred codon usage (1, 4, 8), and the position of the initiation codon. The Va score is calculated for all candidate complete CDSs, which are the sequences between any ATG
and any stop codon. The candidate CDS with the lowest Va
score is chosen as the corrected CDS.
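As a rough illustration of what a candidate complete CDS is, the following Python sketch enumerates every ATG on the forward strand and pairs it with the first in-frame stop codon downstream. The function name `candidate_cdss`, the restriction to the forward strand, and the choice of the first in-frame stop are assumptions made for this sketch; the text above says only that a candidate runs between an ATG and a stop codon.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}

def candidate_cdss(seq):
    """Yield (start, end) for every ATG that is followed by an in-frame stop.

    `start` is the index of the A of ATG; `end` is one past the stop codon.
    Assumption: forward strand only, and the first in-frame stop ends the CDS.
    """
    seq = seq.upper()
    for start in range(len(seq) - 2):
        if seq[start:start + 3] != "ATG":
            continue
        # Walk codon by codon in the frame set by this ATG.
        for pos in range(start + 3, len(seq) - 2, 3):
            if seq[pos:pos + 3] in STOP_CODONS:
                yield start, pos + 3
                break

example = "CCATGAAATAGGG"
found = list(candidate_cdss(example))
```

Each pair returned here would then be scored, and the candidate with the lowest Va kept.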
Let us suppose the existence of a set of an infinite number
of theoretically allowed CDSs that start with ATG and end
with a stop codon. These sequences are sorted in light of
preferred codon usage, Kozak consensus, and the position of
the initiation ATG. Both the sequence of a codon that is used
frequently and the sequence of a good Kozak consensus have
a low index, t. The most probable CDS-like sequence has an
index number of 0, and the least probable CDS-like sequence
receives an index of 1. If the candidate sequence has the
index P, then the Va score of the candidate sequence is
defined as Va = ln P. Thus the Va score represents the log of
the probability that one could obtain by chance a sequence
more CDS-like than the candidate sequence in the sorted
sequences. Note that the Va score is not the log of the
probability that the sequence carries CDS, since our method
is based on the rank order of sequences. The Va includes a
frameshift penalty that depends on the phred score at the
artificial frameshift position if the artificial insertion(s)
and/or deletion(s) is added to the candidate sequence. The Va
of the candidate CDS is defined as
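$$V_a = \ln P_{\text{codon-pref}} + \ln P_{\text{Kozak}} + \ln P_{\text{atg}} + \ln P_{\text{frameshift}} \tag{1}$$
$$\ln P_{\text{codon-pref}} = \sum_{i = w_1}^{w_2} \ln P_{\text{codon-pref}}(i) \tag{2}$$
$$\ln P_{\text{frameshift}} = -D_{\text{penalty}} \times \left[ \sum_{j} \ln P_{\text{err}}(j) \right] \tag{3}$$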
where Pcodon-pref, PKozak, and Patg are the probabilities of
generating a random sequence that shows a more preferable
codon usage, Kozak consensus, and ATG position, respectively, than the candidate sequence that is generated from
the unadjusted cDNA. In Eq. 2, i is the base position of the
cDNA sequence, w1 is the position of one of the ATGs, and w2
is the position of one of the stop codons in the sequence when
we focus on a single ORF. Note that the value of i changes
every three bases. The higher the preferred codon usage, the
better the sequence. All events (preferred codon usage, Kozak
consensus, and ATG position) are assumed to be independent
of each other. In Eq. 3, j is the base position at which the
artificial insertion/deletion occurs. Perr(j) is the probability of
a base-call error at frameshift position j.
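The terms of Eqs. 1–3 can be combined as in the following minimal Python sketch. It assumes that the per-codon probabilities, PKozak, and Patg have already been looked up, and that `perr_at_indels` lists Perr(j) for each artificial insertion or deletion. The function and argument names are hypothetical and are not taken from the program described here.

```python
import math

def va_score(p_codons, p_kozak, p_atg, perr_at_indels, d_penalty):
    """Return Va as the sum of the four log terms of Eq. 1."""
    ln_codon_pref = sum(math.log(p) for p in p_codons)            # Eq. 2
    ln_frameshift = -d_penalty * sum(math.log(p)
                                     for p in perr_at_indels)     # Eq. 3
    return ln_codon_pref + math.log(p_kozak) + math.log(p_atg) + ln_frameshift

# One codon and no artificial frameshift: the penalty term is zero.
base = va_score([57 / 64], 16 / 16, 1 / 7, [], d_penalty=1.0)
# The same candidate reached through one artificial indel at a base with Perr = 0.01.
shifted = va_score([57 / 64], 16 / 16, 1 / 7, [0.01], d_penalty=1.0)
```

Because ln Perr is negative, the frameshift term is positive, so every artificial insertion or deletion raises Va and makes the candidate less favorable unless the other terms improve enough to compensate.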
To reduce the computation time, the number of artificial
insertions/deletions is restricted to no more than the number
of experimental frameshift errors. The experimental frameshift error is expected to occur around bases for which base-calling was not reliable. The reliability of the base-call is the
so-called phred score (Q), which is calculated from the dispersion of the distance and height of the signal peaks of each
base and the distance from the nearest nonassigned base
(“N”). To restrict the number of artificial insertions/deletions,
a penalty value is added to Va for each artificial insertion/
deletion, and the penalty value (Pframeshift) depends on the
phred score of the position at which the artificial insertion/
deletion occurs.
The penalty for frameshifts (Pframeshift) is not a probability
but rather an empirical parameter. The probability of a
frameshift can be calculated from Perr, which is a parameter
of an experimental process. In contrast, the other probabilities (Pcodon-pref, PKozak, and Patg) are parameters of the artificial process (random sequence generation); note that
Pcodon-pref, PKozak, and Patg are not the probabilities that the
sequence could be the true CDS, because the actual sequence
is not mathematically random. Therefore, the penalty for
frameshift (Pframeshift) cannot be treated in the same way as the
parameters Pcodon-pref, PKozak, and Patg. However, it is natural to introduce the following relation: the lower the probability of base-call error (Perr), the greater the penalty for
frameshift (Pframeshift). Equation 3 is an equation that satisfies this relation.
Perr in Eq. 3 is calculated from Q by the definition (5, 6)
$$P_{\text{err}}(j) = 10^{-Q(j)/10} \tag{4}$$
In Eq. 3, Dpenalty is a parameter that must be optimized for
maximal prediction accuracy.
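To make the link between the phred score and the penalty concrete, the sketch below evaluates Eq. 4 and then Eq. 3 directly from phred scores. The helper names are hypothetical. Substituting Eq. 4 into Eq. 3 gives a penalty of Dpenalty × ln(10) × Q/10 per artificial insertion or deletion, so the penalty grows linearly with Q.

```python
import math

def perr_from_phred(q):
    """Eq. 4: base-call error probability for phred score Q."""
    return 10 ** (-q / 10)

def frameshift_penalty(phred_at_indels, d_penalty):
    """Eq. 3, evaluated from the phred scores at the artificial indel positions."""
    return -d_penalty * sum(math.log(perr_from_phred(q)) for q in phred_at_indels)

low_quality = frameshift_penalty([10], d_penalty=1.0)    # indel at an unreliable base
high_quality = frameshift_penalty([30], d_penalty=1.0)   # indel at a reliable base
```

An artificial frameshift placed at a reliable base (high Q) therefore costs more than one placed at an unreliable base, which is the relation the text asks Eq. 3 to satisfy.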
To estimate the number of sequences generated by the
trial-and-error process, let the maximum number of insertions and/or deletions be two and the number of base pairs in
the input sequence be N. The program generates one sequence with no artificial frameshift, 2 × N sequences with a
single artificial insertion or deletion, and 2 × N × (N − 1)
sequences with two frameshifts by combining insertions and
deletions.
To overcome these problems, we developed a new
trial-and-error method of exon prediction that is designed especially for full-length cDNA sequence data.
Our method artificially inserts bases into and/or deletes bases from the experimental sequence and predicts a candidate CDS for every trial insertion and
deletion. If the experimental frameshift error is removed by the trial insertion/deletion, then a CDS-like
ORF will emerge. The most convincing sequence is
then chosen from the various candidates.
This method circumvents the principal problems of
conventional exon prediction schemes in the following
ways. First, it incorporates additional information that
is specific to full-length cDNA, such as the length of the
ORF, Kozak consensus (10), and length of the 5′ untranslated region (UTR). Second, our method incorporates information about the quality of the uncorrected
sequence data. A numerical score, the phred score,
represents the sequence quality. This information is
useful for suggesting the position of a frameshift error
(5, 6). Furthermore, a quantitative index for evaluating
the reliability of amino acid translation would be helpful when using the resulting hypothetical proteins. To
this end, our method estimates the validity of the
predicted CDS (the likelihood that the predicted CDS
is the corrected CDS) in light of the phred score of the
base at which the putative frameshift occurs.
Table 1. Summary of PKozak

Number  Consensus  Population 1, %  Population 2, %  PKozak
1       AnnatgG    27.4             21.2             1/16
2       GnnatgG    18.8             19.8             2/16
3       AnnatgA    10.4             14.2             3/16
5       AnnatgC    8.6              7.1              4/16
4       AnnatgT    7.3              8.3              5/16
6       GnnatgA    7.2              6.5              6/16
7       GnnatgC    5.9              5.5              7/16
9       GnnatgT    5.0              4.5              8/16
8       CnnatgG    2.4              4.8              9/16
10      TnnatgG    2.1              1.8              10/16
11      CnnatgA    0.9              1.8              11/16
13      TnnatgA    0.8              1.2              12/16
14      CnnatgC    0.8              1.0              13/16
16      TnnatgC    0.7              0.5              14/16
12      CnnatgT    0.6              1.3              15/16
15      TnnatgT    0.6              0.7              16/16
1       NnnatgG    50.9             47.6             1/4
2       NnnatgA    19.4             23.7             2/4
4       NnnatgC    16.0             14.1             3/4
3       NnnatgT    13.6             14.8             4/4
1       AnnatgN    53.8             50.8             1/4
2       GnnatgN    37.0             36.3             2/4
3       CnnatgN    4.7              8.9              3/4
4       TnnatgN    4.1              4.2              4/4

PKozak, probability of generating a random sequence that shows a
Kozak sequence. “Consensus” refers to the sequence around the
initiation codon, and the first and last bases are in question. “Population 1” values are estimated according to statistics of Suzuki et al. (14).
“Population 2” refers to the representation among 2,815 complete coding sequences (CDS).
Table 2. Summary of Patg

Order  Population, %  Patg
1st    60.3           1/7
2nd    18.1           2/7
3rd    9.0            3/7
4th    4.0            4/7
5th    2.6            5/7
6th    1.4            6/7
>7th   7.5            7/7

For “Population,” values are percent of 2,815 complete CDS sequences. “Order” indicates the order of the initiation codon (ATG) in
question from the 5′ end of the full-length cDNA. Patg, probability of
generating a random sequence that shows an ATG position.
randomly generating a sequence with a more preferable
consensus (PKozak) than that of each string in Table 1. The
consensus is denoted as XnnatgY (X, Y = A, G, C, T), where
X and Y are the bases whose identities are in question.
GnnatgG is the second-most preferred consensus among the
16 options; therefore, PKozak of GnnatgG equals 2/16, which
means that a consensus that is more preferable than
GnnatgG occurs at the probability of 2/16. The numbers in
the second column of Table 1 are generated by using the
statistics of Suzuki et al. (14), and the numbers in the third
column are calculated by using our test data set. Because
PKozak as defined by the statistics of Suzuki et al. (14) differs
from that defined by our data set, we examined both parameters in the test calculation.
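Because only the rank order of the consensus strings is used, PKozak can be read from an ordered list. The Python sketch below encodes the row order of the PKozak column of Table 1 as (X, Y) pairs of XnnatgY, with the N rows handled separately. The names `KOZAK_ORDER` and `p_kozak`, and the decision to raise an error when both bases are unknown, are assumptions made for this illustration.

```python
# (X, Y) pairs of XnnatgY in the row order of Table 1 (PKozak = 1/16 ... 16/16).
KOZAK_ORDER = ["AG", "GG", "AA", "AC", "AT", "GA", "GC", "GT",
               "CG", "TG", "CA", "TA", "CC", "TC", "CT", "TT"]
X_ORDER_WHEN_Y_UNKNOWN = "AGCT"   # AnnatgN, GnnatgN, CnnatgN, TnnatgN
Y_ORDER_WHEN_X_UNKNOWN = "GACT"   # NnnatgG, NnnatgA, NnnatgC, NnnatgT

def p_kozak(x, y):
    """Return PKozak for the bases X (position -3) and Y (position +4)."""
    x, y = x.upper(), y.upper()
    if x == "N" and y == "N":
        # Assumption: Table 1 lists no value for this case.
        raise ValueError("Table 1 gives no value when both bases are unknown")
    if x == "N":
        return (Y_ORDER_WHEN_X_UNKNOWN.index(y) + 1) / 4
    if y == "N":
        return (X_ORDER_WHEN_Y_UNKNOWN.index(x) + 1) / 4
    return (KOZAK_ORDER.index(x + y) + 1) / 16
```

For example, `p_kozak("G", "G")` gives 2/16, matching the GnnatgG case discussed in the text.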
Table 2 illustrates how our software “chooses” which ATG
to use as the initiation codon and the probability of a randomly generated sequence that has a more preferable initiator ATG (Patg) than that of the candidate sequence; “n”
means that the initiation codon is the nth ATG from the 5′
end in all frames. In 60.3% of transcripts, the initiation codon
is the first ATG from the 5′ end of the unadjusted sequence,
and in 18.1% of transcripts, the initiation codon is the second
ATG from the 5′ end. The upstream (5′) ATG is preferred to
that downstream, and the average length of the 5′-UTR is
151.6 bp. We are unsure whether the specific ATG or the
length of the 5′-UTR is more important in determining the
site of initiation. We chose to adopt the specific ATG, and the
denominator of Patg is set arbitrarily at 7 for simplicity. The factor
7 does not affect the prediction result, because this factor is a
constant in Eq. 1.
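A minimal Python sketch of the Patg lookup follows. It counts ATGs from the 5′ end in all frames, as described above, and caps the order at 7 to match the last row of Table 2. The helper names are hypothetical.

```python
def atg_positions(seq):
    """Return the start index of every ATG in the sequence, in all frames."""
    seq = seq.upper()
    return [i for i in range(len(seq) - 2) if seq[i:i + 3] == "ATG"]

def p_atg(seq, start):
    """Return Patg for the ATG that begins at index `start`."""
    order = atg_positions(seq).index(start) + 1   # 1 for the first ATG, 2 for the second, ...
    return min(order, 7) / 7

cdna = "TTATGCCATGAA"          # ATGs begin at indices 2 and 7
first, second = p_atg(cdna, 2), p_atg(cdna, 7)
```

Since every candidate shares the same denominator, replacing 7 with another constant would shift all Va values by the same amount and leave the ranking unchanged.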
Pcodon-pref is summarized in Table 3. The codons are sorted
by the preferred codon usage. Information about preferred
codon usage is available by using GCG (8) and at the TransTerm web site (http://uther.otago.ac.nz/Transterm.html)
(1, 4). Note that we order in light of the preferred codon usage
instead of the frequency of codon usage and that P × 64 (Table
3) is 64 times Pcodon-pref. If the codon includes an unclear base
(“N”), then Pcodon-pref is set at 32/64 for simplicity. PKozak is
not independent of Pcodon-pref; therefore, the probability in
Eq. 1 is double counted. In addition, the Kozak consensus
includes the first base of the next codon (e.g., the last G of
AnnATGG belongs to the following codon). When we applied
our method to proteins of more than 30 amino acids, the
double count negligibly affected Va and therefore is ignored
in Eq. 1.
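The codon-preference term of Eq. 2 can be sketched in Python as below. The ranked list is transcribed from Table 3 (rank 1 is CTG, rank 64 is TAG), and a codon containing N is assigned 32/64 as stated above. The names `RANKED_CODONS` and `ln_p_codon_pref` are assumptions of this sketch, as is the choice to pass the CDS as a plain string that starts at the ATG.

```python
import math

# Codons in the rank order of Table 3 (P x 64 = 1 ... 64).
RANKED_CODONS = (
    "CTG GAG AAG GCC CAG GGC GTG GAC ACC CCC TTC ATC CTC AGC AAC CGC "
    "ATG TAC TCC GGG CAC GAA GTC TGC GCT GCG TGG CGG GAT CCT GGA AGG "
    "GCA CCG AAA CCA TCT GGT ACG ACA TAT ACT AGT TTT TGT TCG ATT AAT "
    "TTG AGA CAA CTT CAT CGT GTT CGA CTA TCA GTA TGA ATA TTA TAA TAG"
).split()
RANK = {codon: i + 1 for i, codon in enumerate(RANKED_CODONS)}

def ln_p_codon_pref(cds):
    """Sum ln(Pcodon-pref) over consecutive codons of `cds` (Eq. 2)."""
    total = 0.0
    for i in range(0, len(cds) - 2, 3):
        codon = cds[i:i + 3].upper()
        rank = 32 if "N" in codon else RANK[codon]   # unclear base: 32/64
        total += math.log(rank / 64)
    return total
```

A stretch of frequently used codons gives a strongly negative sum, which lowers Va and favors that candidate.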
Let us consider an example of the calculation of the Va
score. Let us suppose there exists a cDNA sequence 5′-TnnATGCTATGAnnGnnATGGAGTGA-3′ that includes the
two candidate CDSs Seq1 (TnnATGCTA) and Seq2 (GnnATGGAG).
The region of the initiation codon offers 16 variations of the
Kozak sequence in the form of XnnATGY, where X and Y are
The program thus generates a total of 1 + 2N + 2N(N − 1)
sequences, and Va is calculated for all frames of these sequences. In actual use, the number of insertions and deletions per test sequence is limited to two or four. For three
and/or four insertions/deletions, the trial-and-error procedure is divided into two steps. The trial using one and two
artificial frameshifts is performed first, and the most probable sequence is chosen. Then the additional artificial frameshifts are applied to the sequence generated in the previous
step, and the most probable sequence is chosen from this
second set of trial sequences.
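The size of the search can be checked with a short Python sketch. The generator below produces the 2N single-frameshift variants of a sequence, and `n_candidates` evaluates the total 1 + 2N + 2N(N − 1) given in the text. Writing the inserted base as "N" is an assumption of this sketch, as are the function names.

```python
def single_indel_variants(seq):
    """Yield the 2N sequences that differ from `seq` by one deletion or one insertion."""
    for i in range(len(seq)):
        yield seq[:i] + seq[i + 1:]          # delete the base at position i
        yield seq[:i] + "N" + seq[i:]        # insert an unknown base before position i

def n_candidates(n):
    """Total trial sequences for up to two artificial frameshifts in N bases."""
    return 1 + 2 * n + 2 * n * (n - 1)

variants = list(single_indel_variants("ACGT"))
total_for_average_cdna = n_candidates(2370)   # 2,370 bp is the average test cDNA length
```

For a 2,370-bp cDNA the formula gives 11,233,801 trial sequences, which illustrates why the search is split into two steps when three or four frameshifts are allowed.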
Pcodon-pref, PKozak, and Patg are derived from the statistics
of the actual cDNA database. To determine Dpenalty, we
performed the CDS prediction for a data set of known genes
while changing Dpenalty to find the optimal value for this
variable.
We prepared a test data set from 2,815 complete CDS
mouse cDNAs in the National Center for Biotechnology
Information (NCBI) nonredundant database. The minimum
length of the 5′-UTR is set at 25 bp. All statistical data in the
following sections are derived from this data set. The base
composition of the CDS is: A, 25.9%; G, 26.4%; C, 25.9%; and
T, 21.8%. In the 5′-UTR, the base composition is: A, 20.8%; G,
29.2%; C, 29.8%; and T, 20.3%. The populations of A, G, C,
and T are almost equivalent in the CDS, whereas the 5′-UTR
is GC rich. We assume that the populations of A, G, C, and T
in the CDS region are equivalent.
The data of the complete CDS mouse cDNAs of the NCBI
nonredundant database are not necessarily full length. Since
Suzuki and colleagues (14) have succeeded in making and
analyzing a 5′-rich, full-length-rich cDNA library, we
adopted their statistics.
Table 1 shows the various possible Kozak sequences that
can occur around the initiation codon and the probability of a
Table 3. Summary of Pcodon-pref

Codon  P × 64   Codon  P × 64   Codon  P × 64   Codon  P × 64
CTG    1        GAG    2        AAG    3        GCC    4
CAG    5        GGC    6        GTG    7        GAC    8
ACC    9        CCC    10       TTC    11       ATC    12
CTC    13       AGC    14       AAC    15       CGC    16
ATG    17       TAC    18       TCC    19       GGG    20
CAC    21       GAA    22       GTC    23       TGC    24
GCT    25       GCG    26       TGG    27       CGG    28
GAT    29       CCT    30       GGA    31       AGG    32
GCA    33       CCG    34       AAA    35       CCA    36
TCT    37       GGT    38       ACG    39       ACA    40
TAT    41       ACT    42       AGT    43       TTT    44
TGT    45       TCG    46       ATT    47       AAT    48
TTG    49       AGA    50       CAA    51       CTT    52
CAT    53       CGT    54       GTT    55       CGA    56
CTA    57       TCA    58       GTA    59       TGA    60
ATA    61       TTA    62       TAA    63       TAG    64

Pcodon-pref, probability of generating a random sequence that shows
a more preferable codon usage. P × 64 indicates Pcodon-pref times 64.
each one of four bases (A, G, C, and T). TnnATGC is the
rarest Kozak consensus among the 16 options (Table 1); all
other sequences are more likely. Thus PKozak for Seq1 is
equal to 16/16. GnnATGG is the second-best consensus
around the initiation codon among the 16 sequences, so
PKozak for Seq2 is equal to 2/16. Because Seq1 is upstream of
Seq2, Patg of Seq1 is 1/7 and that of Seq2 is 2/7. CTA in Seq1
is a rare codon whose frequency of occurrence is the 57th
among those of the 64 codons. However, GAG of Seq2 is the
second-most frequent codon among the 64. Therefore,
Pcodon-pref of CTA is 57/64, and that of GAG is 2/64. To summarize, PKozak, Patg, and Pcodon-pref for each frame are PKozak =
16/16, Patg = 1/7, and Pcodon-pref = 57/64 for Seq1; and PKozak =
2/16, Patg = 2/7, and Pcodon-pref = 2/64 for Seq2. The Va scores
of Seq1 and Seq2 are Va(Seq1) = ln(16/16) + ln(1/7) +
ln(57/64) = −2.06; and Va(Seq2) = ln(2/16) + ln(2/7) +
ln(2/64) = −6.80. Thus Seq2 is more CDS-like than Seq1.
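The arithmetic of this example can be reproduced with a few lines of Python. The sketch below evaluates the three log terms with natural logarithms, as written in Eq. 1, and repeats the sums with base-10 logarithms for comparison. The function name `va_example` is hypothetical, and no frameshift penalty is included because neither candidate uses an artificial insertion or deletion.

```python
import math

def va_example(p_kozak, p_atg, p_codon, log=math.log):
    """Sum of the three log terms of Eq. 1 for a candidate without artificial frameshifts."""
    return log(p_kozak) + log(p_atg) + log(p_codon)

va_seq1 = va_example(16 / 16, 1 / 7, 57 / 64)
va_seq2 = va_example(2 / 16, 2 / 7, 2 / 64)

# The same sums in base 10, to show that the ranking does not depend on the base.
va10_seq1 = va_example(16 / 16, 1 / 7, 57 / 64, log=math.log10)
va10_seq2 = va_example(2 / 16, 2 / 7, 2 / 64, log=math.log10)
```

With natural logarithms the sums are about −2.06 for Seq1 and −6.80 for Seq2; with base-10 logarithms they are about −0.90 and −2.95. In both cases Seq2 has the lower score and is therefore ranked as more CDS-like.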
Preparation of test sample data. To determine the parameter Dpenalty in Eq. 3 and to evaluate the prediction accuracy
of our method, we needed two sequence sets. One consisted of
test sequences with phred scores and frameshift error, and
the other contained reference sequences without base-call
error. As mentioned in the previous section, Dpenalty is a
parameter that must be optimized to give the maximal prediction accuracy.
For the set of reference sequences, we downloaded 612
known mouse complete CDS sequences from the NCBI nonredundant database. The average length of the cDNAs in the
test data set is 2,370 bp. We generated the test sequences
from these reference sequences by using a Monte Carlo
simulation technique. Because phred scores were not available with the sequence data in the NCBI database, we first
prepared a phred score for each base of the sequences by
using a random number generator. We set the base-call
accuracy of the test data at ~99% (Q ~ 20), in other words,
average Perr = 1%.
Next, we generated artificial frameshift errors for each
reference sequence, and the probability of insertion and deletion depends on the phred score. Note that not all the
base-call errors are frameshift errors. We checked the incidence of insertion, deletion, and substitution error in our
experimental electropherograms. We observed 51 deletion
errors, 38 insertion errors, and 250 substitution errors, for a
total of 339 base-call errors; thus deletion errors accounted
for 15.0% of all base-call errors, insertion errors accounted
RESULTS
The prediction accuracy of our program is evaluated
by counting the number of predicted amino acid sequences for which percent identity to the correct amino
acid sequence is higher than the threshold. We set the
identity threshold at 100%, 98%, 95%, 90%, and 85%.
After testing various values for Dpenalty to identify the
optimal value, this parameter was set at 80. Note that
we tried both PKozak parameters (as defined by the
statistics of Suzuki et al., Ref. 14, and by our statistics)
in the following prediction calculation and found that
changing how the PKozak parameter is defined does not
affect the prediction result at all. The program does not
use the population in Table 1 but the order of the
probability (Table 1). The effect of this choice is that
the difference due to the order is negligible between
two sets of parameters.
Table 6 shows the prediction accuracy of several
methods. “Longest frame” refers to the sequence of the
longest single ORF of the unadjusted cDNA data. For
Table 4. Distribution of the number of insertions
and deletions in the CDS of the 612 test sequences
generated by Monte Carlo method

Number of Insertions and Deletions   Distribution, % of 612 test sequences
0                                    14.4
1                                    61.1
2                                    18.0
3                                    2.3
4                                    2.8
>5                                   1.4
for 11.2%, and substitution errors accounted for 73.8%. A
stop codon can appear in the ORF by substitution error, and
a stop codon can be lost by substitution error [TAA (stop) →
AAA (lysine) by T → A substitution]. The probability of
stop-codon occurring by substitution error is 4%, and the
probability of loss of the stop codon by substitution error is
1–2%, so we ignored the substitution error. Artificial mutation was only the deletion of a base and the insertion of “N.”
We adopted the following conditions to generate the mutation: 1/6 Perr (16.7%) is the probability of deletion, 1/6 Perr
(16.7%) is the probability of insertion, and 4/6 Perr is the
probability of substitution error, which is ignored. For simplicity, the factor 1/6 is adopted instead of the actual rate of
deletion (15.0%) and insertion (11.2%).
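The mutation rule described above can be sketched in Python as follows. Each base is deleted with probability Perr/6 or followed by an inserted "N" with probability Perr/6, and substitutions are ignored. The function name, the use of a seeded `random.Random`, the short repeated reference string, and the constant phred score of 20 are assumptions made for this illustration; the text does not state how the random phred scores were drawn.

```python
import random

def add_frameshift_errors(seq, phred, rng):
    """Return (mutated sequence, number of deletions, number of insertions)."""
    out, n_del, n_ins = [], 0, 0
    for base, q in zip(seq, phred):
        perr = 10 ** (-q / 10)          # Eq. 4
        r = rng.random()
        if r < perr / 6:                # deletion: drop this base
            n_del += 1
            continue
        out.append(base)
        if r < 2 * perr / 6:            # insertion: add "N" after this base
            out.append("N")
            n_ins += 1
    return "".join(out), n_del, n_ins

rng = random.Random(0)                  # fixed seed so the run is repeatable
reference = "ATGGCCAAGCTGTGA" * 20      # toy reference sequence without any N
mutated, n_del, n_ins = add_frameshift_errors(reference, [20] * len(reference), rng)
```

Keeping the counts of deletions and insertions alongside the mutated sequence makes it possible to tabulate the number of frameshifts per test sequence.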
Table 4 shows the population of our test sequences with
frameshift error in CDS. Of the 612 cDNA sequences in our
test set, 524 (86%) cDNAs carry 773 frameshift errors in the
CDS; 14% of CDSs are free of frameshift error. These statistics do not include the insertions and deletions in the 5′- and
3′-UTRs.
In addition, we prepared a set comprising the actual experimental data of 17 sequences (Table 5) from known genes
with complete CDSs to show the actual prediction accuracy of
the program. These genes were sequenced by using the primer-walking method and ABI377 sequencers (Perkin-Elmer
Biosystems, Foster City, CA; http://www.pebio.com/). All
of the sequences in this data set include frameshift error. The
phred score was calculated from the gel-image file of the
ABI377 sequencers.
Table 5. Prediction accuracy of our
DECODER program

                            Longest Frame            DECODER
Accession No.  Size 1, aa   Size 2, aa  Identity, %  Size 2, aa  Identity, %
AF049879       479          435         90.8         479         99.8
AF230074       484          393         81.2         546         85.7
U34691         367          262         68.9         474         74.5
AB027963       266          179         63.2         379         65.7
AF067146       423          222         49.9         423         93.2
AF123533       784          378         48.0         784         96.6
U90435         428          206         44.4         428         96.7
L48514         354          158         40.6         336         87.7
AB011002       435          164         35.9         473         90.5
AB016784       729          257         34.8         559         70.2
AB038243       178          94          30.3         146         56.4
D87990         322          156         25.7         318         73.8
X63003         478          103         20.9         282         54.2
X60452         504          217         12.1         475         91.7
U50406         205          205         100.0        318         64.5
U10435         371          391         94.4         625         56.8
U90123         154          179         81.6         243         60.1

The number of artificial frameshifts is limited to 2. “Longest
frame” refers to the sequence of the longest single open reading
frame (ORF) of the unadjusted cDNA data. For the first 14 sequences, DECODER predicts a protein that is more highly identical
to the actual product than the longest frame, whereas the last 3
sequences are not. Size 1, length of actual protein; Size 2, length of
predicted amino acid sequence; Identity, percent identity to actual
protein product; aa, amino acid.
comparison, we used Genscan, which was developed for
analyzing genomic DNA rather than cDNA (2, 3). This
program is one of the best exon-prediction programs,
and it can be applied to cDNA analysis, offering prediction of the part of the CDS, which is better than the
simple choice of the longest frame. Genscan is designed
to work with promoter prediction statistics to identify
components of the promoter sequence (e.g., the TATA
box and the cap site), after which it performs exon
prediction. Because cDNA transcripts do not contain
the promoter site, Genscan is limited in its applications
with cDNA.
We named our program DECODER. As shown in the
right column of Table 6, 68–70% of the proteins predicted by DECODER are at least 85% identical to the
correct proteins. When the required prediction accuracy is higher than 85%, DECODER has the highest
prediction accuracy among the three methods we evaluated. Under the boundary condition of 85% identity,
the Genscan result is almost equivalent to that of
DECODER. However, when a higher accuracy is required, DECODER achieves a much better prediction
score than Genscan. Of the amino acid sequences predicted by DECODER, 35–36% are at least 98% identical to the sequence of the actual protein product, and
43% show ≥95% identity (Table 6). In contrast, only
15% of the predicted proteins generated by using Genscan show more than 98% identity to the actual proteins, and only 69% show more than 85% identity.
Genscan and DECODER can be used together. Genscan is well-suited to finding exons and portions of the
CDS that lack ATG and/or a stop codon. However,
Table 6. Prediction accuracy of various exon-prediction methods

Sequence Identity   100%  >98%  >95%  >90%  >85%
Longest frame       12    12    15    19    25
Genscan ver. 1.0    10    15    31    54    69
DECODER 1           10    36    43    59    68
DECODER 2           10    35    43    59    70

Values are numbers of predicted proteins with sequence identity to
the actual protein product. “Sequence Identity” indicates the amino
acid sequence identity to actual protein product, 100%, >98%, etc.
“Longest frame” refers to the sequence of the longest single ORF of
the unadjusted cDNA data. DECODER 1, the number of artificial
frameshifts is limited to 2; DECODER 2, the number of artificial
frameshifts is limited to 4.
frameshift error drastically reduces the prediction accuracy of Genscan. If DECODER is used to remove the
frameshift error in the test sequence, then Genscan
can be applied to the modified test sequence. We found
that 73% of the amino acid sequences then predicted by
Genscan show more than 85% identity to the actual proteins.
Of the 612 cDNA sequences that compose the reference sequence set, 524 (86%) carry 773 frameshift
errors in the CDS. DECODER corrected 48% of the
total frameshift errors in 62% of the 524 cDNA sequences that had frameshift error. The false positive
rate of frameshift correction is 9%, and 91% of the
suggested frameshifts were true.
Table 5 shows the prediction accuracy of DECODER
for the experimental sequence data. The identity of the
proteins predicted by DECODER exceeds the threshold
(85%) for 8 of the 17 sequences tested. For the first 14
sequences, DECODER predicts a protein that is more
closely identical to the actual product than the longest
frame. In the three remaining cases, the amino acid
sequence predicted by DECODER is longer than the
true amino acid sequence. Hence, DECODER is likely
to overestimate the length of the CDS.
One drawback of our procedure is that it is very
time-consuming. In the presented test of 612 sequences, the processing time of DECODER was 45 s
per sequence on a DEC Alpha personal workstation
(600 MHz). If the size of the sequence is M bp, then the
computational time of the conventional method of exon
prediction is proportional to M. In contrast, the computational time of our method is proportional to M^2
when two artificial frameshifts are used in the trial-and-error procedure. Thus the conventional method
can be used for genomic as well as cDNA sequences.
Because the computation time of our method grows with
the square of the length of the sequence evaluated,
our method is not well-suited to analysis of genomic
sequences. However, cDNAs are typically shorter than
genomic sequences. Therefore, our method realistically
can be applied to studying cDNA sequences, for which
purpose we created the DECODER program.
We also compared the results of DECODER and
ESTScan, which is a program that can detect coding
regions in DNA sequences, even if they carry frame-
CONCLUSIONS
The amino acid translation program DECODER is
useful for evaluating full-length cDNA sequences with
experimental frameshift errors early, before the completion of a given project. This program can suggest the
position of a frameshift, predict an amino acid sequence,
and evaluate the likelihood that the deduced amino acid
sequence is that of the actual protein product.
To remove frameshift error, the program tries to
make insertions and/or deletions in the experimental
sequence, and the candidate CDS is predicted for every
trial. If the trial insertion/deletion removes the frameshift error, then a CDS-like ORF emerges. The software chooses the sequence most likely to represent
that of the actual protein product. This likelihood is
reflected as the Va score, which is calculated on the
basis of the Kozak consensus, the preferred codon usage, and the position of the initiation codon. The more
likely the predicted sequence is the actual sequence,
the lower the Va score.
Our data demonstrate that DECODER shows high
accuracy for predicting CDS and frameshift error of
full-length cDNA. With respect to the predicted amino
acid sequence, Genscan yields almost equivalent results to those of DECODER at low thresholds of identity (≤85%). However, when a higher accuracy of the
predicted amino acid sequence is required, DECODER
achieves a much better prediction score than does
Genscan.
Using DECODER and Genscan in concert capitalizes
on the relative strengths of these methods. DECODER
can remove the frameshift error of unadjusted cDNA
sequences; the corrected cDNA sequences are then
submitted to the amino acid prediction algorithm of
Genscan, which does not require as much computation
time as DECODER does. When Genscan predicts proteins for DECODER-corrected cDNAs, 73% of the predicted proteins show more than 85% identity to the
actual proteins. DECODER can detect and remove 48%
of the total frameshift errors. Thus 62% of the cDNAs with
frameshift errors were detected, and at least one of their
frameshift errors was corrected. The false positive rate
of frameshift correction is 9%, and 91% of the suggested frameshifts were true.
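The two rates quoted here are complementary: the false positive rate is the fraction of suggested frameshifts that are not true, so 91% true suggestions correspond to a 9% false positive rate. The following Python sketch shows how such figures could be computed from lists of positions. The function name and the example positions are hypothetical and are not data from this study; exact position matching is also an assumption.

```python
def frameshift_metrics(true_sites, suggested_sites):
    """Return (detected_fraction, false_positive_rate) for frameshift positions.

    detected_fraction: share of true frameshifts that were suggested.
    false_positive_rate: share of suggested frameshifts that are not true,
    following the usage in the text (9% false positives = 91% true).
    """
    true_set, suggested_set = set(true_sites), set(suggested_sites)
    hits = len(true_set & suggested_set)
    detected = hits / len(true_set) if true_set else 0.0
    false_pos = 1 - hits / len(suggested_set) if suggested_set else 0.0
    return detected, false_pos

# Hypothetical positions for illustration only.
detected, false_pos = frameshift_metrics([120, 450, 900, 1300], [120, 900, 1500])
print(f"detected {detected:.0%}, false positives {false_pos:.0%}")
```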
Farabaugh et al. (7, 12) suggested that some RNA
sequences program the ribosome to alter the reading
frame efficiently to allow for the expression of alternative translational products. Sites that cause the ribosome to shift frames, termed programmed frameshift
sites, occur in organisms from bacteria to higher eukaryotes. Medigue et al. (11) reported that their frameshift detection program detected these natural frameshifts. Distinguishing between natural and artificial
frameshifts is difficult, and neither DECODER nor any
other exon-prediction method overcomes this shortcoming.
ESTScan overestimates the frameshift errors, whereas
DECODER underestimates them. DECODER therefore shows a lower
false-positive rate than ESTScan, and, conversely, ESTScan shows a lower
false-negative rate of frameshift detection than DECODER. Note
that there are parameters for modifying the selectivity
and sensitivity in both ESTScan and DECODER, and
the result depends on these parameters. The significant difference between these two programs is that the
purpose of ESTScan is detection of CDS in DNA sequences and that of DECODER is prediction of amino
acid sequence and frameshift errors in full-length
cDNA. If it is unknown whether the DNA sequence is
full-length cDNA, then ESTScan can be applied. In this
case, DECODER cannot be applied, since DECODER is
not an exon prediction program. If the sequence is
known to be a full-length cDNA, then DECODER can
show better amino acid prediction results than Genscan and ESTScan.
DECODER can be applied to the forthcoming uncorrected full-length cDNA sequences provided by the
RIKEN Encyclopedia Project and the Mammalian Gene
Collection of the National Institutes of Health. In the
transcriptome analysis, the highest priority is given to
the comprehensive collection of unadjusted cDNAs,
especially the collection of de novo transcriptional sequences. The CDS and the initiation codon in the cDNA
sequence can be predicted by comparing the unadjusted full-length cDNA sequence with the genomic
DNA sequence. DECODER can be modified by using
this information. The current trial-and-error algorithm
of DECODER is sufficiently time-consuming that using
it to analyze genomic sequences is impractical. To
address this disadvantage, a possible alternative
algorithm for the future is the hidden Markov model,
which can be robust against frameshift errors.
shifts. Also, ESTScan can detect and correct frameshift
errors. Because we used the ESTScan server on a web site of the
European Molecular Biology network for this comparison (http://www.ch.embnet.org/software/ESTScan.html),
a smaller number of test sequences was desirable than the 612 sequences of the previous test
data set. The new test subdata set was composed of
50 sequences randomly selected from the 612 sequences of the test data set. The average length of the
cDNAs in the test subdata set was 2,169 bp. Of the 50
cDNA sequences that composed the reference sequence
set, 47 (86%) carry 56 frameshift errors in the CDS.
The size and error rate of the test subdata set reflect
those of the original data set of 612 sequences.
ESTScan could not detect a CDS for 3 (6%) of the 50
sequences. DECODER predicted a CDS for all
of the sequences, because it is not an exon
prediction program; rather, it reports the most probable
CDS in any sequence. ESTScan and DECODER corrected 86% and 54% of the total frameshift errors, and
the false positive rates of frameshift corrections were
44% and 7%, respectively. On average, the amino acid sequences
predicted by ESTScan and DECODER were 80% and 88%
identical to the correct amino acid sequences, respectively. ESTScan predicted only 4% of the
start codons correctly (the author of ESTScan mentions
this problem on the web home page), whereas DECODER predicted 80% of the start codons correctly.
We thank Dr. C. Iseli for suggestions. We thank Shiro Fukuda and
Hiroshi Minami for support, and we thank the members of the
RIKEN Genome Science Center for the data preparation.
This study has been supported by Special Coordination Funds and
a Research Grant (to Y. Hayashizaki) for the RIKEN Genome Exploration Research Project, Core Research for Evolutional Science
and Technology (CREST), and Research and Development for Applying Advanced Computational Science and Technology (ACT-JST) of
Japan Science and Technology Corporation (JST) from the Science
Technology Agency of the Japanese Government. This work also was
supported by a Grant-in-Aid for Scientific Research on Priority Areas
and the Human Genome Program from the Ministry of Education,
Science, and Culture of Japan and by a Grant-in-Aid for a Second-Term Comprehensive 10-Year Strategy for Cancer Control from the
Ministry of Health and Welfare of Japan (to Y. Hayashizaki).
REFERENCES
1. Brown CM, Dalphin ME, Stockwell PA, and Tate WP. The
translational termination signal database. Nucleic Acids Res 21:
3119–3123, 1993.
2. Burge C and Karlin S. Prediction of complete gene structures
in human genomic DNA. J Mol Biol 268: 78–94, 1997.
3. Burge CB and Karlin S. Finding the genes in genomic DNA.
Curr Opin Struct Biol 8: 346–354, 1998.
4. Dalphin ME, Brown CM, Stockwell PA, and Tate WP. The
translational signal database, TransTerm, is now a relational
database. Nucleic Acids Res 26: 335–337, 1998.
5. Ewing B, Hillier L, Wendl MC, and Green P. Base-calling of
automated sequencer traces using phred. I. Accuracy assessment. Genome Res 8: 175–185, 1998.
6. Ewing B and Green P. Base-calling of automated sequencer
traces using phred. II. Error probabilities. Genome Res 8: 186–194, 1998.
7. Farabaugh PJ. Programmed translational frameshifting. Annu
Rev Genet 30: 507–528, 1996.
8. Gribskov M, Devereux J, and Burgess RR. The preferred
codon usage plot: graphic analysis of protein coding sequences
and prediction of gene expression. Nucleic Acids Res 12: 539–549, 1984.
9. Iseli C, Jongeneel CV, and Bucher P. ESTScan: a program
for detecting, evaluating, and reconstructing potential coding
regions in EST sequences. ISMB 138–148, 1999.
10. Kozak M. An analysis of 5′-noncoding sequences from 699
vertebrate messenger RNAs. Nucleic Acids Res 15: 8125–8148, 1987.
11. Medigue C, Rose M, Viari A, and Danchin A. Detecting and
analyzing DNA sequencing errors: toward a higher quality of the
Bacillus subtilis genome sequence. Genome Res 9: 1116–1127, 1999.
12. Pande S, Vimaladithan A, Zhao H, and Farabaugh PJ.
Pulling the ribosome out of frame by +1 at a programmed
frameshift site by cognate binding of aminoacyl-tRNA. Mol Cell
Biol 15: 298–304, 1995.
13. Richterich P. Estimation of errors in “raw” DNA sequences: a
validation study. Genome Res 8: 251–259, 1998.
14. Suzuki Y, Ishihara D, Sasaki M, Nakagawa H, Hata H,
Tsunoda T, Watanabe M, Komatsu T, Ota T, Isogai T,
Suyama A, and Sugano S. Statistical analysis of the 5′ untranslated region of human mRNA using “oligo-capped” cDNA
libraries. Genomics 64: 286–297, 2000.
15. White O, Dunning T, Sutton G, Adams M, Venter JC, and
Fields C. A quality control algorithm for DNA sequencing
projects. Nucleic Acids Res 21: 3829–3838, 1993.