The Triplet Code From First Principles

Journal of Biomolecular Structure & Dynamics, ISSN 0739-1102
Volume 22, Issue Number 1, (2004)
©Adenine Press (2004)
Edward N. Trifonov
Genome Diversity Center, Institute of Evolution, University of Haifa, Haifa 31905, Israel
Phone: +972 4 828 8096
Fax: +972 4 824 6554
http://www.jbsdonline.com
Abstract
The temporal order (“chronology”) of appearance of amino acids and their respective codons on the evolutionary scene is reconstructed. A consensus chronology of amino acids is built on the basis of 60 different criteria, each offering a certain temporal order. After several steps of filtering, the chronology vectors are averaged, resulting in the consensus order: G, A, D, V, P, S, E, (L, T), R, (I, Q, N), H, K, C, F, Y, M, W. It reveals two important features: the amino acids synthesized in the imitation experiments of S. Miller appeared first, while the amino acids associated with codon capture events came last. The reconstruction of codon chronology is based on the above consensus temporal order of amino acids, supplemented by the stability and complementarity rules first suggested by M. Eigen and P. Schuster, and on the earlier established processivity rule. At no point in the reconstruction was the consensus amino-acid chronology in conflict with these three rules. The derived genealogy of all 64 codons suggests several important predictions that are confirmed. The reconstruction of the origin and evolutionary history of the triplet code thus becomes a powerful research tool for molecular evolution studies, especially in its early stages.
Key words: Amino acids; Code origin; Codon capture; Codon history; Codons; Complementarity; Earliest genes; Earliest proteins; Early evolution; First codons; Molecular chronology; Origin of Life; Processivity.
Introduction
Research on the origin and evolution of early life is an area of elaborate speculations, almost none of which has become a theory proper, that is, one complemented by testable and subsequently confirmed predictions. An important exception is the prediction by S. Miller that some biologically relevant substances could have been produced abiotically in the atmosphere of the early Earth. The prediction was spectacularly confirmed in the imitation experiments (1, 2) [see also (3)] that yielded as many as ten different natural amino acids, thus laying the foundation of the theory of the origin of life and early evolution. Eigen and Schuster noted (4) that alanine and glycine, the highest-yield amino acids in the experiments of Miller, are encoded today by the (G+C)-rich complementary triplets GCC and GGC. They put forward the hypothesis on the importance of complementarity and thermostability in the evolution of the triplet code.
Many hypothetical scenarios of this evolution are discussed in the literature. It is commonly believed that the 20 amino acids of present-day proteins did not appear simultaneously on the evolutionary scene. Rather, some amino acids entered at earlier stages while others are comparatively late. Many factors may have influenced the temporal order of engagement of the amino acids and codons in the triplet code. In this work a total of 60 criteria for the amino-acid chronology are summarized in a consensus chronology that reveals two fundamental features: the amino acids of Miller are first in the list, while the amino acids associated with codon capture, especially those for which codon assignments are not yet fully stabilized, appear last in the chronology. The order of appearance of the amino acids suggests a unique order of engagement of the respective codons. A complete reconstruction of the genealogy of all 64 codons is described. It follows strictly the rules of complementarity and thermostability of Eigen and Schuster. A new rule of processivity (5) is confirmed, whereby new codons are not introduced de novo but rather appear progressively as simple point-change derivatives and complementary copies of those engaged earlier, starting with a single ancestral codon pair GCC·GGC.
The reconstructed stages of evolution of the triplet code are thus uniquely defined by five most basic principles: I. Abiotic start, II. Complementarity, III. Thermostability, IV. Processivity, and V. Codon capture at the end.
Results
Amino-acid Chronology
An early version of the consensus amino-acid chronology was derived on the basis of 40 different criteria (5). In this section the updated amino-acid chronology is presented, built on 60 criteria, with more advanced data filtering applied. In the derivation of the chronology no specific theory is given preference. Instead, all the diverse knowledge and thoughts accumulated over decades are taken into account and expressed in the form of the consensus temporal order.
Every theory or viewpoint that suggests a certain temporal order for the amino acids is taken as a contribution to the consensus order, in the form of a specific detailed or approximate ranking of the amino acids, the earliest first. Sixty rankings for 60 different criteria are used for the derivation of the consensus. Many of the criteria are apparently independent while some are related. To avoid bias, such related criteria were combined in groups, and each group was represented by a single (average) ranking. The criteria were then divided into two classes that are not mixed: single-factor and multi-factor criteria. The single-factor criteria are all independent. When a positive correlation between the respective chronological orders is observed, it reflects only the fact that they express some common trend, presumably related to the amino-acid chronology we are aiming at. The multi-factor criteria, by contrast, that is, various synthetic hypotheses on the origin and evolution of the triplet code, are interdependent. Each of them is built on several basic ideas, frequently shared by different combined criteria but in different combinations and with different priorities, depending on the expertise and intuition of their author(s). Essentially, the multi-factor criteria are individually biased opinions on the same matter. The two classes of criteria are used in this work for separate reconstructions of the amino-acid chronology. One is the result of equal-weight averaging of independent temporal orders (single-factor criteria), while the other corresponds to the consensus of opinions, also taken with equal weights (multi-factor criteria). As described below, the resulting two chronologies turn out to be very similar.
A. Single-factor Criteria: After combining related criteria in groups, an ensemble of 17 independent rankings for the single-factor criteria is derived (Table I). Brief descriptions of the criteria and the respective suggested chronological orders are presented. The ranks of amino acids in parentheses are uncertain and are thus taken to be the same for the amino acids indicated. For example, in the case of criterion N43 (see Table I) the ranking vector is A = 3, C = 14, D = 14, E = 14, F = 14, G = 3, H = 14, I = 9, K = 14, L = 6, M = 19.5, N = 14, P = 3, Q = 14, R = 7, S = 8, T = 3, V = 3, W = 19.5, and Y = 14. In continuity with earlier work (5) the numbers for the criteria N1 to N40 are kept the same. The new ones are N41 to N60 (see also Table III). The numbers for correlated groups are primed (e.g., the ranking for group N1’ combines criteria N1, N34, N35, N37 and N44). Several single-factor criteria are not included in the calculation of the consensus order as they show negative correlation with the rest of the criteria (N15, N17, N18, N25, N27, N31, N46 and N51). Most of the independent single-factor criteria (17 of the total of 25) correlate positively with each other. Correlation coefficients between the rankings are calculated as in (5), on the basis of normalized sums of absolute differences between components of the respective 20D ranking vectors (see Methods).
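The conversion of a written ordering into a ranking vector can be sketched as follows. This is a minimal illustration, assuming that tied amino acids (those in parentheses) share the mean of the rank positions they jointly occupy, which is what the N43 numbers quoted above imply. The function name `ranking_vector` is a hypothetical helper, not something taken from the original work.

```python
from typing import Dict


def ranking_vector(order: str) -> Dict[str, float]:
    """Turn an ordering such as "(AGPTV), L, R" into tied average ranks."""
    ranks: Dict[str, float] = {}
    position = 1  # next free rank position
    for group in order.replace(" ", "").split(","):
        members = group.strip("()")
        # Tied amino acids share the mean of the positions they occupy.
        shared = position + (len(members) - 1) / 2
        for amino_acid in members:
            ranks[amino_acid] = shared
        position += len(members)
    return ranks


# Criterion N43 as listed in Table I.
n43 = ranking_vector("(AGPTV), L, R, S, I, (CDEFHKNQY), (MW)")
```

Under this assumption the five amino acids of the first group occupy positions 1 to 5 and share rank 3, the nine of the large tied group share rank 14, and M and W share rank 19.5, in agreement with the vector given in the text.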
Table I
Single-factor criteria of amino-acid chronology.
N1’. Criteria based on various evaluations of complexity of amino acids. N1, N34, N35, N37 – as in (5), and N44 (6).
G, A, S, P, V, C, D, T, N, L, K, I, E, (MQ), H, R, F, Y, W
N2’. Criteria based on evolution of amino-acyl-tRNA synthetases. N2, N7 (5).
(AG), (DFHKNPST), (CEILMQRV), (WY)
N3’. Yields of amino acids in imitated primordial conditions. N3, N20, N21, N22 (5).
G, A, L, V, D, E, I, P, S, T, F, M, Y, K, (CHNQRW)
N4’. Criteria based on amino-acid compositions of various sets of presumably ancient proteins. N4, N26 (5), N41 (7), N45 (8), N47 (9), N48 (10), and N58 (11).
A, G, V, S, D, L, T, P, E, I, R, K, N, Q, H, F, M, Y, C, W
N5’. Criteria based on chemical inertness of amino acids. N5 (5), N42 (7), N49 (12).
G, A, V, S, (IL), D, (NQ), (FP), T, C, (EHKRWY), M
N6’. Stability of complementary interactions. N6, N39, N40 (5).
A, G, S, P, R, D, T, C, E, (VW), H, L, (MQ), I, N, Y, F, K
N8’. Stability of codon assignments. N8 (5), N52 (13).
(ADEFGHP), V, S, T, L, R, I, Q, (KNY), (MW), C
N11. GCU based theory (14).
A, (DGPSTV), E, (CFHIKLMNQRWY)
N12. RRY hypothesis of Crick (15).
(DGNS), (ACEFHIKLMPQRTVWY)
N13. RNY hypothesis (4).
(AG), (DINSTV), (CEFHKLMPQRWY)
N15. (excluded) Hypothesis of Ferreira (16).
(FGKLNP), (CDEHQRSTVW), (AIMY)
N17. (excluded) Early copolymerization code of Nelsestuen (17).
(DEFHIKLMSTVY), (ACGNPQRW)
N18. (excluded) Composition of proteinoids of Fox (18).
A, E, V, (GK), M, L, C, Y, (NQ), I, (DF), R, H, P, W, T, S
N24. Primordial code in tRNA sequence (19).
(ADGV), (CEFHIKLMNPQRSTWY)
N25. (excluded) Evolutionary distances between isoacceptor tRNAs (20).
Q, H, P, (LS), G, C, W, R, V, (DE), A, Y, T, (IM), F, (KN)
N27. (excluded) Match scores of BLOSUM matrix (21).
(AILSV), (EKMQRT), (DFGN), (PY), H, C, W
N31. (excluded) Algebraic model of Hornos and Hornos (22).
(CDFSV), (EKLRY), (HP), (AGIMNQTW)
N32. Composition of translated Urgen (23, 24).
V, (AGP), (ENRT), (LQS), (CDFHIKMYW)
N33. Murchison meteorite (25).
(AG), (DEPV), (CFHIKLMNQRSTWY)
N38. Minimal alphabet for folding (26).
(AGEIK), (CDFHLMNPQRSTVWY)
N43. Mutational stability of codon assignments (27). More stable assignments – earlier.
(AGPTV), L, R, S, I, (CDEFHKNQY), (MW)
N46. (excluded) Complementary circular code (28).
(ADEFGIKLNQTVY), (CHMPRSW)
N51. (excluded) Metabolic cost (H. Akashi, personal communication). Costly amino acids – late.
D, N, T, E, Q, K, P, I, M, S, G, A, R, V, L, C, H, Y, F, W
N56. Composition of an urancestral tRNA gene (29).
S, G, P, (LQV), H, (EINRTW), A, (CDFKMY)
N59. Arginine first (30).
R, (ACDEFGHIKLMNPQSTVWY)
The average ranking calculated for the 17 independent criteria is shown in Table II. Two major features are revealed by the averaging. One is the chronological priority of the amino acids presumably synthesized abiotically before life started. All ten topmost (earliest) amino acids of the consensus temporal order belong to the amino acids that appeared in the imitation experiments of Miller (1, 2). As indicated earlier (5), this is observed even when Miller’s data are excluded from the calculations. This strongly supports the abiotic start theory: apparently, Life opportunistically utilized first what was available, in particular those amino acids that were provided by the chemistry of the early environment. The chronologically first Ala, Gly, Asp and Val are the highest-yield amino acids of Miller’s mix.
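The equal-weight averaging of the 17 retained single-factor rankings can be sketched as below. The orderings are copied from Table I. The tie handling (tied amino acids share the mean of their positions) and the names `TABLE_I`, `ranking_vector`, `average` and `consensus` are assumptions made for this illustration. With these assumptions the averages for G, A, V and S round to 2.8, 3.9, 6.5 and 7.1, the values listed for them in Table II; the remaining entries are left for the reader to compare.

```python
from statistics import mean

# Orderings of the 17 retained single-factor criteria, copied from Table I.
TABLE_I = {
    "N1'": "G, A, S, P, V, C, D, T, N, L, K, I, E, (MQ), H, R, F, Y, W",
    "N2'": "(AG), (DFHKNPST), (CEILMQRV), (WY)",
    "N3'": "G, A, L, V, D, E, I, P, S, T, F, M, Y, K, (CHNQRW)",
    "N4'": "A, G, V, S, D, L, T, P, E, I, R, K, N, Q, H, F, M, Y, C, W",
    "N5'": "G, A, V, S, (IL), D, (NQ), (FP), T, C, (EHKRWY), M",
    "N6'": "A, G, S, P, R, D, T, C, E, (VW), H, L, (MQ), I, N, Y, F, K",
    "N8'": "(ADEFGHP), V, S, T, L, R, I, Q, (KNY), (MW), C",
    "N11": "A, (DGPSTV), E, (CFHIKLMNQRWY)",
    "N12": "(DGNS), (ACEFHIKLMPQRTVWY)",
    "N13": "(AG), (DINSTV), (CEFHKLMPQRWY)",
    "N24": "(ADGV), (CEFHIKLMNPQRSTWY)",
    "N32": "V, (AGP), (ENRT), (LQS), (CDFHIKMYW)",
    "N33": "(AG), (DEPV), (CFHIKLMNQRSTWY)",
    "N38": "(AGEIK), (CDFHLMNPQRSTVWY)",
    "N43": "(AGPTV), L, R, S, I, (CDEFHKNQY), (MW)",
    "N56": "S, G, P, (LQV), H, (EINRTW), A, (CDFKMY)",
    "N59": "R, (ACDEFGHIKLMNPQSTVWY)",
}


def ranking_vector(order):
    """Ranks with ties: a parenthesized group shares the mean of its positions."""
    ranks, position = {}, 1
    for group in order.replace(" ", "").split(","):
        members = group.strip("()")
        for amino_acid in members:
            ranks[amino_acid] = position + (len(members) - 1) / 2
        position += len(members)
    return ranks


vectors = [ranking_vector(order) for order in TABLE_I.values()]
# Equal-weight mean rank of every amino acid over the 17 criteria.
average = {aa: mean(v[aa] for v in vectors) for aa in vectors[0]}
# Amino acids sorted from earliest (lowest mean rank) to latest.
consensus = sorted(average, key=average.get)
```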
Table II
Consensus temporal order of amino acids (single-factor criteria).
Order  Amino acid  Average rank (± 0.7)  Amino acids of Miller
1      G           2.8                   +
2      A           3.9                   +
3      V           6.5                   +
4      S           7.1                   +
5      P           7.4                   +
6      D           7.7                   +
7      T           9.0                   +
8      E           9.9                   +
9      L           10.3                  +
10     I           10.9                  +
11     N           11.2
12     R           11.7
13     H           12.7
14     Q           12.8
15     K           13.2
16     F           13.2
17     C           13.9
18     M           15.0
19     W           15.3
20     Y           15.3
Codon capture cases (marks, top to bottom): (+), (+), +, +, +, +, +, +, +
Another important feature is the bottom-most placement of the amino acids whose codons are known, or appear, to have been borrowed from repertoires established earlier (13). Despite the overwhelming universality of the present-day genetic code, there is a variety of cases in which the codon assignments within some codon quartets are deviant (see Discussion), indicating the capture of earlier codons by later amino acids. Such data indicate, for example, that the codon for methionine (AUG) has been borrowed from the original isoleucine repertoire of codons (AUN) (31). It is not clear a priori who has borrowed from whom; the chronology in Table II points to late methionine as the captor.
B. Multi-factor Criteria: A total of 13 multi-factor criteria, out of the 16 found among the 60 criteria collected, are used for the chronology analysis. All are independent in terms of their authorship. Each is based on two or more explicitly stated principles. For example, Jukes’ chronology (32) exploits an analysis of codon-anticodon (mRNA-tRNA) contacts, and the amino acids with larger numbers of codons assigned to them are assumed to be amongst the earliest. Most of the multi-factor criteria are also influenced by implicit trends, the choices made by the authors on the basis of their expert intuition. The multi-factor criteria intercorrelate because they often share the same single factors or similar types of consideration, like the idea of coevolution of amino acids and their codons by Dillon (33), Wong (34-36) and Wächtershäuser (37). The intercorrelation clearly reflects the very consensus we are looking for, which may emerge above the differences between the various expert opinions.
Table III presents the temporal orders for the amino acids as suggested by various authors. As with the single-factor set above, the rankings that correlate negatively with the rest of the criteria (3 of the total of 16) are excluded from the calculation of the consensus.
Table III
Multi-factor criteria of amino-acid chronology.
N9. Jukes’ theory of the origin of the code (32).
(ADEGHLPQRV), (CFIKNSTY), (MW)
N10. Coevolution theory of Wong (34-36).
(ADEGS), V, (PT), (IL), F, C, Y, (KR), (NQ), H, (MW)
N14. Hypothesis of Hartman (38).
G, P, A, R, (DENQST), (HK), C, (FILVY), (MW)
N16. Prebiotic physicochemical code of Altshtein-Efimov (39).
(ADEGKRSTV), (CFHILMNPQWY)
N19. Coevolution theory of Dillon (33).
G, A, D, V, E, Q, (HLPR), N, T, (IS), (KM), F, (CY), W
N23. Coevolution theory of Wächtershäuser (37).
(DE), (ACGNPQST), (ILMV), (FHKRWY)
N28. (excluded) A/U start theory (40).
(FIKLMNY), (CDEHQRSTVW), (AGP)
N29. N-fixing amino acids first, Davis (41).
(DENQ), (APSV), (CG), T, (ILM), R, K, (FY), H, W
N30. GNN codons first, Taylor and Coates (42).
(ADEGV), (CFHIKLMNPQRSTWY)
N36. Jimenez-Montano (43).
(ADGV), (LPR), (CIKQST), (EFHMNWY)
N50. SNS code of Ikehara (44).
(ADGV), E, (HLPQR), (CFIKMNSTWY)
N53. Cavalier-Smith (45).
(AILPV), (DEGST), (CFHNY), (KMQRW)
N54. (excluded) Self-referential code (46).
(GPS), (DEFKLN), (AHQRV), (CIMTWY)
N55. (excluded) Synthesis with hydropathicity (47).
G, S, D, N, K, (AF), (HT), E, Q, L, (PV), R, (CW), (IM), Y
N57. Maslov, GGG start (48).
G, (DS), A, (LPV), (EFHIKMNQR), (CTWY)
N60. Three stages of Baumann and Oro (49).
(ADEGILPQRSTV), (KN), (CFHY), (MW)
Table IV
Consensus temporal order of amino acids (multi-factor criteria).
Order  Amino acid  Average rank (± 0.7)  Amino acids of Miller
1      A           4.1                   +
2      G           4.2                   +
3      D           4.2                   +
4      V           6.1                   +
5      E           6.3                   +
6      P           7.2                   +
7      S           8.0                   +
8      L           9.5                   +
9      T           9.8                   +
10     Q           9.9
11     R           10.2
12     N           11.4
13     I           11.9                  +
14     H           13.2
15     K           13.4
16     C           13.8
17     F           15.1
18     Y           15.2
19     M           15.9
20     W           17.7
Codon capture cases (marks, top to bottom): (+), (+), (+), +, +, +, +, +, +
The average ranking (consensus) of the multi-factor criteria is presented in Table IV. The two major features displayed by the single-factor set, i.e., Miller first and codon capture last, are characteristic of the multi-factor set as well.
C. Final Consensus: Considering the rankings presented in Tables II and IV as different evaluations of the same temporal order, one can combine the two into one final consensus, presented in Table V.
Although the derived consensus order of engagement of the amino acids is built largely on the basis of numerous speculations, it has the merit of a best guess, since all available diverse factors and opinions are taken into account in its derivation. As such it is, hopefully, not very far from the true order, especially because Miller’s amino acids and the codon capture cases are found in the consensus order just about where one would expect, as the earliest and latest, respectively.
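How the two consensus orders might be combined is not spelled out above. The sketch below assumes a plain equal-weight mean of the average ranks in Tables II and IV. Under that assumption the result agrees with every Table V value to within 0.05 rank units, which is the rounding of the tabulated numbers. The dictionary names are illustrative only.

```python
# Average ranks copied from Tables II, IV and V (amino acid -> rank).
TABLE_II = dict(zip("GAVSPDTELINRHQKFCMWY", [
    2.8, 3.9, 6.5, 7.1, 7.4, 7.7, 9.0, 9.9, 10.3, 10.9,
    11.2, 11.7, 12.7, 12.8, 13.2, 13.2, 13.9, 15.0, 15.3, 15.3]))
TABLE_IV = dict(zip("AGDVEPSLTQRNIHKCFYMW", [
    4.1, 4.2, 4.2, 6.1, 6.3, 7.2, 8.0, 9.5, 9.8, 9.9,
    10.2, 11.4, 11.9, 13.2, 13.4, 13.8, 15.1, 15.2, 15.9, 17.7]))
TABLE_V = dict(zip("GADVPSETLRNIQHKCFYMW", [
    3.5, 4.0, 6.0, 6.3, 7.3, 7.6, 8.1, 9.4, 9.9, 11.0,
    11.3, 11.4, 11.4, 13.0, 13.3, 13.8, 14.2, 15.2, 15.4, 16.5]))

# Assumed combination rule: equal-weight mean of the two consensus ranks.
combined = {aa: (TABLE_II[aa] + TABLE_IV[aa]) / 2 for aa in TABLE_II}
final_order = "".join(sorted(combined, key=combined.get))

# Largest deviation from the tabulated final consensus.
max_deviation = max(abs(combined[aa] - TABLE_V[aa]) for aa in TABLE_V)
```

Sorting by the recomputed means reproduces the Table V order except among N, I and Q, whose ranks differ by no more than 0.1 and are therefore indistinguishable within the stated error of 0.7.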
Reconstruction of Temporal Order of 64 Codons
A. First Steps of the Reconstruction: The build-up described below is based on the derived consensus amino-acid chronology, on the principles of thermostability and complementarity suggested by Eigen and Schuster (4), and on the processivity rule established earlier (5). These guidelines never come into conflict and can be followed strictly all the way through the reconstruction, resulting in a very clear and consistent picture.
As initially suggested (4), the most stable complementary pair of codons, GCC·GGC, would correspond to the presumably earliest amino acids, alanine and glycine, thus encoded in complementary strands of an RNA duplex or hairpin stem. Both these amino acids are at the top of Miller’s mix, and they lead the consensus chronological list of amino acids. Notably, the respective codon assignments, GCC and GGC, are the same as in the present-day codon table.
The next codons to appear should serve valine and aspartic acid, according to the consensus amino-acid chronology. The strongest triplets of the modern codon repertoires for valine and aspartate, that is, those characterized by the highest melting enthalpy of the respective complementary pairs (see Methods), are GUC (valine) and GAC (aspartate). Remarkably, they form a complementary pair as well, just like the ancestral GCC and GGC triplets. Valine and aspartate, indeed, should have appeared simultaneously, as apparently happened with alanine and glycine. The simplest way to derive the GUC and GAC triplets from the chronologically earlier GCC and GGC is either a transition of the middle C to U in GCC, with subsequent complementary copying, or a transition of G to A in GGC, with complementary copying, or both. Amongst the various point mutations, the transitions, pyrimidine to pyrimidine and purine to purine, are known to be the most frequent and, therefore, “cheap” [see, e.g., (50)]. In the case of valine and aspartate, again, the respective modern codons fit right into the reconstruction of the earliest codons.
Next to appear was proline (see Table V). Today it is encoded by the CCC triplet (the strongest of the four proline codons). It should have come simultaneously with its complementary GGG. This codon, in turn, apparently appeared by mutation of the codon GGC (already present) to GGG, in the third, redundant codon position, which is another type of “cheap” mutation. All subsequent codons in the reconstruction below appeared the same way: by a change in the third position of one of the earlier codons, and complementary copying.
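The first three steps can be checked mechanically. The sketch below is an illustration with hypothetical helper names. It treats complementary codons as reverse complements read 5' to 3' on opposite strands, and reports at which codon position two triplets differ.

```python
COMPLEMENT = str.maketrans("ACGU", "UGCA")


def reverse_complement(codon: str) -> str:
    """Codon read 5'->3' on the strand complementary to `codon`."""
    return codon.translate(COMPLEMENT)[::-1]


def point_changes(old: str, new: str) -> list:
    """1-based codon positions at which two triplets differ."""
    return [i for i, (a, b) in enumerate(zip(old, new), start=1) if a != b]


# Steps 1-3 of the reconstruction: (codon, complementary partner).
first_steps = [("GCC", "GGC"), ("GUC", "GAC"), ("CCC", "GGG")]
pairs_are_complementary = all(
    reverse_complement(codon) == partner for codon, partner in first_steps
)

# GCC -> GUC is a change in the middle position (C to U transition);
# GGC -> GGG is a change in the third position.
middle_change = point_changes("GCC", "GUC")
third_change = point_changes("GGC", "GGG")
```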
B. Consecutive Exhaustion of all 64 Triplets: The major part of the reconstruction of the evolutionary history of the triplet code consists of consecutive steps similar to those described in the previous section: the strongest codon of the next amino acid in the consensus amino-acid chronology always turns out to be complementary to one of the already engaged codons, mutated in the third position. This is displayed in Figure 1 as a series of 32 lines (32 complementary codon pairs), each line representing one step in the evolution of the triplet code. Thus, the 4th step is the engagement of the (strongest) codon UCC for serine, complementary to GGA (glycine). The choice of UCC as the earliest of the six codons of the two serine sets (UCX and AGY) is supported by an analysis of serine codons in conserved and non-conserved positions of protein-coding genes (51). Then comes glutamate (GAG), complementary to CUC for leucine. These two amino acids, again, appear simultaneously. The GAG triplet is, apparently, the mutated version of GAC (already present) for aspartate. An alternative origin of this pair of codons could be the same as for the GAC/GUC pair before: via transitions, GGG to GAG and/or CCC to CUC.
Table V
Consensus temporal order of amino acids (final).
Order  Amino acid  Average rank (± 0.7)  Amino acids of Miller
1      G           3.5                   +
2      A           4.0                   +
3      D           6.0                   +
4      V           6.3                   +
5      P           7.3                   +
6      S           7.6                   +
7      E           8.1                   +
8      T           9.4                   +
9      L           9.9                   +
10     R           11.0
11     N           11.3
12     I           11.4                  +
13     Q           11.4
14     H           13.0
15     K           13.3
16     C           13.8
17     F           14.2
18     Y           15.2
19     M           15.4
20     W           16.5
Codon capture cases (marks, top to bottom): (+), (+), (+), +, +, +, +, +, +
Figure 1: Reconstruction of codon chronology based on the consensus amino-acid chronology (Table V) and the thermostability, complementarity and processivity rules. The amino-acid chronology is shown at the top of the Figure. The ascending numbers from 1 to 20 correspond to the temporal order of amino acids. The column on the left shows the melting enthalpy values (see Methods) for the 32 complementary pairs of codons. The presumed initial, earlier assignments of some codons are shown in lower case. Two major stages in the evolution of the code are shown, separated by a vertical line.
Note that the order of (numbered) amino acids at the top of Figure 1 is not an exact copy of the consensus order shown in Table V. The calculated values of the average ranks given in the Table are rather approximate. The typical error bar for the values is 0.7 rank units (see Methods). This means that many of the neighboring ranks in the presented order are actually indistinguishable, and the order can be changed locally while remaining as valid as the one originally calculated. More than 12 such changes would be allowed. In Figure 1 only two local swaps had to be introduced to make the reconstruction fully consistent: the order T, L is changed to L, T (the difference in average ranks is 0.5), which puts E and L together; and N, I, Q is changed to I, Q, N (difference 0.1), to make it consistent with the thermostability rule.
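Which neighbors in Table V are indistinguishable within the error bar can be listed directly. The sketch below is an illustration under one assumption: a local swap is taken to be admissible when the two average ranks differ by no more than 0.7. It considers only adjacent pairs, which gives 12 candidates; non-adjacent pairs such as N and Q (difference 0.1) add further ones.

```python
ERROR = 0.7  # typical error of an average rank, in rank units (see Methods)

# Final consensus order and average ranks, copied from Table V.
ORDER = "GADVPSETLRNIQHKCFYMW"
RANKS = [3.5, 4.0, 6.0, 6.3, 7.3, 7.6, 8.1, 9.4, 9.9, 11.0,
         11.3, 11.4, 11.4, 13.0, 13.3, 13.8, 14.2, 15.2, 15.4, 16.5]

# Adjacent pairs whose ranks differ by no more than the error bar.
swappable = [
    (ORDER[i], ORDER[i + 1])
    for i in range(len(ORDER) - 1)
    if RANKS[i + 1] - RANKS[i] <= ERROR
]
```

Both swaps used in Figure 1, T with L and N with its neighbors I and Q, appear in this list.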
After the 5th step, threonine (ACC), arginine (CGC) and serine (AGC) are accommodated consecutively in the same manner, using glycine and alanine codons as complementary counterparts. Step 9 (triplet ugc) is suggested by the descending order of thermostability. Since cysteine, which is encoded today by this triplet, had not yet appeared according to the amino-acid chronology, the ugc triplet, as well as the ugg and ugu triplets, presumably all served originally as terminators.
Steps 10 to 18 correspond to codons of amino acids that are already engaged. These steps are arranged in descending order of thermostability of the respective complementary codon pairs. This descending order is also characteristic of all previous steps, within error bars, as can be seen from the column on the left of Figure 1 (see Methods). Step 19 introduces AUC (isoleucine).
Histidine appeared later than glutamine; therefore the thermodynamically next step (line 20), codon cac, must have been associated originally with glutamine, to be captured by histidine at a later stage.
All other steps, 21 to 32, are self-explanatory. Here the codons uuc (originally leucine), uac (originally terminator) and a few other codons also shown in smaller type apparently correspond to later codon capture events (see below). The codons uuu (originally for leucine and then for phenylalanine) and aaa (originally for asparagine and later for lysine) represent the last, least stable, complementary codon pair. These complete the consecutive filling of all 64 vacancies.
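That 64 triplets fall into exactly 32 complementary pairs follows from the fact that no triplet can be its own reverse complement (the middle base would have to pair with itself). A short enumeration, given here only as an illustration with hypothetical names, confirms this.

```python
from itertools import product

COMPLEMENT = str.maketrans("ACGU", "UGCA")


def reverse_complement(codon: str) -> str:
    """Codon read 5'->3' on the complementary strand."""
    return codon.translate(COMPLEMENT)[::-1]


codons = ["".join(p) for p in product("ACGU", repeat=3)]

# Each unordered pair {codon, partner} is one line of the reconstruction.
pairs = {frozenset((c, reverse_complement(c))) for c in codons}

# A self-complementary triplet would form a "pair" of size one.
self_complementary = [c for c in codons if reverse_complement(c) == c]
```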
C. Codon Capture Stage: The last, still continuing (see below) period of the evolution of the table of codons is the codon capture period, when all 64 triplets are already engaged and codons for every new amino acid had (have) to be captured from the established codon repertoires. The line numbers on the left of Figure 1 correspond to the time course during the earlier exhaustion period, whereas the time course of the codon capture period is given rather by the succession of amino acids in the consensus chronology, from histidine on, no longer in any relation to the thermostability of codons.
The UGG codon (for tryptophan) is likely to have been borrowed from the original repertoire UGN for terminators. In Mycoplasma capricolum, for example, not only UGG serves for tryptophan, but the UGA codon (normally a stop codon) as well (52). Similarly, the cysteine codons UGC and UGU apparently have the same origin, from the initial UGN repertoire. In the ciliate E. octacarinatus the universal UGA terminator encodes cysteine as well (53).
Another original set of stop codons, UAN, apparently yielded to tyrosine, losing the respective UAC and UAU codons (13). The codon AUG, for methionine, an exception in the otherwise all-isoleucine tetrad AUN, most likely served originally for isoleucine. In yeast, the AUA codon is utilized for methionine as well (31).
There are several documented cases where AAA is found to encode asparagine rather than lysine (13), e.g., in some animal mitochondria (54).
No similar evidence is available for the presumed capture of UUY codons by phenylalanine from the original UUN set for leucine, or for the capture of CAY codons by histidine from the CAN codons for glutamine. Placement of these amino acids and codons into the codon capture category is consistent with the idea of T. Jukes (55) on archetypal anticodons. Also, according to the coevolution theory of Wong (34), histidine is a newcomer to the original CAN glutamine quartet, causing its splitting into CAR for glutamine and CAY for histidine.
The latest newcomers at this stage of the evolution of the code, apparently, are selenocysteine (56) and pyrrolysine (57), amino acids number 21 and 22. Like cysteine and tryptophan before, selenocysteine captured the last UGA codon from the presumed early stop-codon repertoire UGN. Pyrrolysine, in turn, captured the termination codon UAG, following the example of tyrosine, which preyed on the second set, UAN, of early stop codons. A few additional natural amino acids may be discovered later or appear de novo in millennia to come. Their respective codons will have to be borrowed from other amino acids. This process would probably not go to its limit of 63 different amino acids in total, considering the importance of the degeneracy of the triplet code, which allows more than just the protein-coding message to be accommodated in the same sequence (58, 59).
Discussion
Merits of the Reconstruction
The most striking characteristic of the reconstructed codon chronology is that it follows the very basic rules of complementarity, thermostability and processivity, and follows them strictly. The order of engagement of the amino acids, the consensus chronology, is in full accord with the rules. The amino-acid chronology itself is a quintessence of natural simplicity and opportunism: use first those amino acids that are available. When done with all codons, take from those amino acids that have too many.
Another important feature of the reconstruction is the persistence of the code. The reader has perhaps noticed that at no point in the build-up of the evolutionary history of the codons was it assumed that, billions of years ago, the triplet code was different from the present-day code (except for a smaller repertoire). This apparent persistence can be explained, but only partially, by the fact that some of the chronological criteria for amino acids are based on the properties of modern codons. The earliest assignments, at steps 1 to 8, are fully conserved. The other stages are actually as conserved, only somewhat compromised by the later intruders that had to change some of the earlier assignments through the inevitable codon capture events. Such persistence of the code gives additional credibility to the reconstruction. With the derived consensus chronology of amino acids and the three very basic rules at hand (thermostability, complementarity and processivity), the reconstruction turned out to be a smooth and enjoyable journey.
Yet one should not forget that the whole construction is still an elaborate speculation. To become a theory it has to lead to hard new results, along the classical route from speculation to predictions and to confirmations. A few published studies based on earlier versions of the codon genealogy (8, 60) have already put the described reconstruction on that path.
Glycine Clock
Inspection of the first six steps of the codon genealogy suggests that the amino-acid composition of the earliest proteins was dominated by glycine. Other early amino acids, alanine, proline, serine and threonine, are all complemented by glycine rather than by any new amino acid. Later steps result in a gradual dilution of the glycine. The prediction is that, should we have a sample of the most ancient proteins, it would be glycine-rich. Comparison of related protein sequences from eukaryotes and prokaryotes (separated about 3.5 billion years ago) shows that those patches of sequence that have not changed since then are, indeed, glycine-rich. Moreover, the glycine content of such unchanged sequences in pairwise comparisons between other kingdoms makes possible the construction of a rooted tree of kingdoms consistent with current knowledge (8).
Binary Alphabet
The complementarity of the simultaneously appearing codons splits the growing amino-acid alphabet into two almost independent groups: those amino acids that replace glycine and those that replace alanine. One simple possibility is that the complementary strands of the presumed earliest coding RNA duplex were, respectively, a Gly-strand and an Ala-strand. First to join glycine (GGC), according to Figure 1, was aspartate, with the mutated codon GAC. Next was glutamate (GAG). It was followed by arginine, with its CGC codon complementary to the alanine codon GCG of the Ala-strand. Up to this step (No. 7) the Gly-strand would encode gly, asp, glu and arg, while the Ala-strand would encode ala and val. The respective codons are all of the structure N-purine-N in the Gly-strand, and N-pyrimidine-N in the Ala-strand. Since all later stages in the codon evolution involve only changes in the third positions and (complementary) in the first positions of the codons, the structure of the codons in the Gly-strand would stay the same, N-purine-N, while the N-pyrimidine-N codons are all carried by the Ala-strand. Hence, two alphabets: G, D, E, R, S, Q, N, H, K, C, Y, W (Gly-alphabet), and A, V, P, S, L, T, I, F, M (Ala-alphabet).
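The two alphabets can be recovered from the present-day codon table by sorting amino acids according to the middle base of their codons. The sketch below uses the standard genetic code written as a 64-letter string (codons enumerated with U, C, A, G in each position, `*` for stop). The variable names are illustrative.

```python
from itertools import product

BASES = "UCAG"
# Standard genetic code; codon order is UUU, UUC, UUA, UUG, UCU, ... GGG.
AMINO = ("FFLLSSSSYY**CC*W" "LLLLPPPPHHQQRRRR"
         "IIIMTTTTNNKKSSRR" "VVVVAAAADDEEGGGG")
CODE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}

# N-purine-N codons (middle A or G) versus N-pyrimidine-N codons (middle U or C).
gly_alphabet = {aa for codon, aa in CODE.items()
                if codon[1] in "AG" and aa != "*"}
ala_alphabet = {aa for codon, aa in CODE.items() if codon[1] in "UC"}

# Amino acids present in both alphabets.
shared = gly_alphabet & ala_alphabet
```

The resulting sets are the twelve-letter Gly-alphabet and the nine-letter Ala-alphabet listed above, with serine as the only amino acid common to both, because its two codon sets, UCN and AGY, have middle bases of different classes.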
The earliest minigenes presumably carried messages encoding miniproteins of two independent alphabets (with the exception of the common serine): the Gly-alphabet for the Gly-strand, and the Ala-alphabet for the Ala-strand. A prediction follows: after the minigenes fused at some later stage into longer sequences, the resulting protein sequences had to have a mosaic structure with short patches of the two alphabets. Extant sequences may still have remnants of this mosaic sequence arrangement. In particular, the alternation of patches of the two kinds, or rather of heavily mutated descendants of these patches, all within their respective alphabets, would be expected. Autocorrelation analysis of prokaryotic protein sequences in a two-letter alphabet presentation does, indeed, demonstrate such alternation (60). The size of the ancient miniproteins is estimated as well: 6 amino-acid residues.
Conservation of Binary Protein Sequences
The binary code of protein sequences, as described above, developed during the
evolution of the triplet code to its completion some billions years ago. The
sequence patterns that have become established during that time and later, may be
expressed as binary sequences. If subsequent mutations would not occur, the
binary presentation would describe the patterns in their ancient form.
Coincidentally, the most frequent mutations, transitions, that is purine-to-purine
and pyrimidine-to-pyrimidine replacements, keep the N-purine-N and N-pyrimidine-N triplet structures unchanged, no matter what happens to the 1st and 3rd
positions. In other words, the ancient binary pattern should have stayed rather
conserved, at least for some time. This appears, indeed, to be the case. Analysis
of the amino-acid replacement matrices, PAM (61) and BLOSUM (21), shows that the
replacements almost exclusively occur within the respective two alphabets, the
swaps between the alphabets being rather rare (62). The binary presentation of
modern sequences would have an evolutionary connotation that may help to establish distant relatedness between the sequences, especially those that appear unrelated in traditional 20-letter presentation.
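The statement that transitions leave the N-purine-N and N-pyrimidine-N structures intact can be checked by enumerating all single-base changes of a codon. The sketch below is only an illustration under simple assumptions: codons are written in the RNA alphabet, and the helper names are hypothetical. It lists the mutations that move a codon from one structure to the other.

```python
PURINES = frozenset("AG")
BASES = "ACGU"
# A transition swaps a purine for the other purine,
# or a pyrimidine for the other pyrimidine.
TRANSITION = {"A": "G", "G": "A", "C": "U", "U": "C"}


def middle_class(codon):
    """Return 'purine' for N-purine-N codons, 'pyrimidine' otherwise."""
    return "purine" if codon[1] in PURINES else "pyrimidine"


def single_base_mutants(codon):
    """Yield (position, mutant, is_transition) for all nine single-base changes."""
    for pos, old in enumerate(codon):
        for new in BASES:
            if new != old:
                mutant = codon[:pos] + new + codon[pos + 1:]
                yield pos, mutant, TRANSITION[old] == new


def class_changing_mutations(codon):
    """List the single-base changes that switch the middle-base class."""
    return [
        (pos, mutant, is_transition)
        for pos, mutant, is_transition in single_base_mutants(codon)
        if middle_class(mutant) != middle_class(codon)
    ]
```

For the codon GCC, for example, only the two middle-position transversions (to GAC and GGC) change the class; every transition, and every change in the 1st or 3rd position, leaves it as it was.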
Origin of Life Connection
Before early Life reached the protein-coding stage it had to go through many
other stages on the way from fully abiotic syntheses to the most primitive life
forms, and further on. The reconstructed onset of the triplet code may well represent the earliest picture of Life to explore, described in only a few but very important details. Using that picture as an advanced outpost to the beginnings and expanding it further back may help to outline even earlier stages of Life.
One exciting proposal is to actually model this stage. For example, one may consider and treat experimentally the four oligomers (GCC)6, (GGC)6, (Ala)6 and
(Gly)6 as one multipart replicator, such that the oligomers supplemented by appropriate catalysts and supplies would, perhaps, condition each other’s replication
(http://research.haifa.ac.il/~genom/Trifonov/Origin/).
Methods
Stability of codon pairs is calculated as melting enthalpies from the latest data on
sequence-specific melting of RNA duplexes (63), by summing the respective values
for the component dinucleotide stacks.
Error limits for average rank values can be estimated from differences between rank
values calculated for subsets of the 60 criteria, or from comparison with earlier calculations. The average absolute difference between rank values calculated on the basis of 40
criteria (5) and 60 criteria (this work) is 0.7 rank units. With this uncertainty the calculated temporal orders can be locally changed without compromising their validity.
Correlations between chronology vectors are calculated as in (5), from the sums S
of absolute differences between respective vector components for the two criteria
compared. In case of complete identity of the vectors (full correlation) the sum S equals
0. The maximal possible value of S, for fully anticorrelating vectors, is 200. Thus, the correlation coefficient C = 1 - S/100 takes the values 1 and -1 for fully correlating and anticorrelating vectors, respectively. Since the components of the chronology vector are
not independent (being always sequences of natural integers from 1 to 20), the correlations between randomized vectors do not exactly cluster around zero. This, however, does not influence the calculated ranks in the consensus amino-acid chronology.
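The correlation measure described above can be written out in a few lines. This is a minimal sketch under the assumption that each chronology vector is a permutation of the ranks 1 to 20 (no tied ranks); the function names are illustrative and do not come from (5).

```python
N_AMINO_ACIDS = 20


def _check_chronology(vector):
    """Require a permutation of the ranks 1..20 (assumption: no tied ranks)."""
    if sorted(vector) != list(range(1, N_AMINO_ACIDS + 1)):
        raise ValueError("chronology vector must be a permutation of 1..20")


def rank_distance(u, v):
    """S: sum of absolute differences between respective vector components."""
    _check_chronology(u)
    _check_chronology(v)
    return sum(abs(a - b) for a, b in zip(u, v))


def correlation(u, v):
    """C = 1 - S/100: 1 for identical vectors, -1 for fully reversed ones."""
    return 1 - rank_distance(u, v) / 100


def mean_random_distance(n=N_AMINO_ACIDS):
    """Exact mean of S between a fixed vector and a uniformly random permutation.

    Each rank j is equally likely in every position, so the mean is the
    sum of |i - j| over all pairs (i, j), divided by n.
    """
    return sum(abs(i - j) for i in range(1, n + 1) for j in range(1, n + 1)) / n
```

For a vector and its reverse, `rank_distance` gives 200 and `correlation` gives -1. If randomized vectors are modeled as uniformly random permutations, `mean_random_distance()` gives 133, that is, a mean C of about -0.33, which is in line with the remark that correlations between randomized vectors do not cluster around zero.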
Acknowledgements
Discussions with Drs. I. N. Berezovsky, A. Elitzur, M. Lahav, N. Lahav, L.
Leizerowitz and S. Nir, as well as comments of an anonymous reviewer are highly
appreciated.
References and Footnotes
1. S. L. Miller. Science 117, 528-529 (1953).
2. S. L. Miller. Cold Spr. Harb. Symp. Quant. Biol. 52, 17-27 (1987).
3. W. Löb. Ber. 46, 684-697 (1913).
4. M. Eigen and P. Schuster. Naturwissenschaften 65, 341-369 (1978).
5. E. N. Trifonov. Gene 261, 139-151 (2000).
6. D. Haig and L. D. Hurst. J. Molec. Evol. 33, 412-417 (1991).
7. R. V. Eck and M. O. Dayhoff. Science 152, 363-366 (1966).
8. E. N. Trifonov. Gene Therapy and Mol. Biol. 4, 313-322 (1999). www.gtmb.org
9. D. J. Brooks, J. R. Fresco, A. M. Lesk, and M. Singh. Mol. Biol. Evol. 19, 1645-1655 (2002).
10. D. J. Brooks and J. R. Fresco. Molec. Cell Proteomics 1, 125-131 (2002).
11. S. H. White. J. Mol. Biol. 227, 991-995 (1992).
12. B. S. Berlett and E. R. Stadtman. J. Biol. Chem. 272, 20313-20316 (1997).
13. S. Osawa, T. H. Jukes, K. Watanabe, and A. Muto. Microb. Rev. 56, 229-264 (1992).
14. E. N. Trifonov and T. Bettecken. Gene 205, 1-6 (1997).
15. F. H. C. Crick, S. Brenner, A. Klug, and C. Pieczenik. Origins Life 7, 389-397 (1976).
16. R. Ferreira and K. R. Coutinho. J. Theor. Biol. 164, 291-305 (1993).
17. G. L. Nelsestuen. J. Mol. Evol. 11, 109-120 (1978).
18. S. W. Fox and T. V. Waehneldt. Bioch. Bioph. Acta. 160, 246-249 (1968).
19. W. Möller and G. M. C. Janssen. Biochimie. 72, 361-368 (1990).
20. M. B. Chaley, E. V. Korotkov and D. A. Phoenix. J. Mol. Evol. 48, 168-177 (1999).
21. S. Henikoff and J. G. Henikoff. Proc. Natl. Acad. Sci. USA 89, 10915-10919 (1992).
22. J. E. M. Hornos and Y. M. M. Hornos. Phys. Rev. Lett. 71, 4401-4404 (1993).
23. M. Eigen and R. Winkler-Oswatitsch. Naturwissenschaften 68, 217-228 (1981).
24. M. Eigen and R. Winkler-Oswatitsch. Naturwissenschaften 68, 282-292 (1981).
25. K. A. Kvenvolden, J. G. Lawless, and C. Ponnamperuma. Proc. Natl. Acad. Sci. USA 68, 486-490 (1971).
26. D. S. Riddle, J. V. Santiago, S. T. Bray-Hall, N. Doshi, V. P. Grantcharova, Q. Yi, and D. Baker. Nature Struct. Biol. 4, 805-809 (1997).
27. L. F. Luo. Origin Life Evol. Biosphere. 18, 65-70 (1988).
28. D. G. Arques and C. J. Michel. J. Theor. Biol. 182, 45-58 (1996).
29. W. M. Fitch and K. Upper. Cold Spr. Harb. Symp. Quant. Biol. 52, 759-767 (1987).
30. M. Yarus. Biochemistry 28, 980-988 (1989).
31. B. G. Barrell, S. Anderson, A. T. Bankier, M. H. L. de Bruijn, E. Chen, A. R. Coulson, J. Drouin, I. V. C. Eperon, D. P. Nierlich, B. A. Roe, F. Sanger, P. H. Schreier, A. J. H. Smith, R. Staden, and I. G. Yang. Proc. Natl. Acad. Sci. USA 77, 3164-3166 (1980).
32. T. H. Jukes. Nature 246, 22-26 (1973).
33. L. S. Dillon. The Genetic Mechanism and The Origin of Life. Plenum Press, New York and London (1978).
34. J. T-F. Wong. Proc. Natl. Acad. Sci. USA 73, 2336-2340 (1976).
35. J. T-F. Wong. Trends Bioch. Sc. 6, 33-36 (1981).
36. J. T-F. Wong. Microbiol. Sc. 5, 174-181 (1988).
37. G. Wächtershäuser. Microbiol. Rev. 52, 452-484 (1988).
38. H. Hartman. J. Mol. Evol. 40, 541-544 (1995).
39. A. D. Altshtein and A. V. Efimov. Mol. Biol. 22, 1133-1149 (1988).
40. A. Jimenez-Sanchez. J. Mol. Evol. 41, 712-716 (1995).
41. B. K. Davis. Evolution of the Genetic Code. Prog. Biophys. Mol. Biol. 72, 157-243 (1999).
42. F. J. R. Taylor and D. Coates. BioSystems 22, 177-187 (1989).
43. M. A. Jimenez-Montano. BioSystems 54, 47-64 (1999).
44. K. Ikehara, Y. Omori, R. Arai, and A. Hirose. J. Mol. Evol. 54, 530-538 (2002).
45. T. Cavalier-Smith. J. Mol. Evol. 53, 555-595 (2001).
46. R. C. Guimaraes, 13th Intl Conference on Origin of Life, Oaxaca, Mexico (abstract 1) (2002).
47. R. C. Guimaraes, 13th Intl Conference on Origin of Life, Oaxaca, Mexico (abstract 2) (2002).
48. S. Y. Maslov. Biophysics 26, 642-645 (1981).
49. U. Baumann and J. Oro. BioSystems 29, 133-141 (1993).
50. A. D. Yoder, R. Vilgalys, and M. Ruvolo. Mol. Biol. Evol. 13, 1339-1350 (1996).
51. Y. Diaz-Lazcoz, A. Henaut, P. Vigier, and J.-L. Risler. J. Mol. Biol. 250, 123-127 (1995).
52. F. Yamao, A. Muto, Y. Kawauchi, M. Iwami, S. Iwagami, Y. Azumi, and S. Osawa. Proc. Natl. Acad. Sci. USA 82, 2306-2309 (1985).
53. F. M. Meyer, H. J. Schmidt, E. Plümper, A. Hasilik, G. Mersmann, H. E. Meyer, A. Engström, and K. Heckmann. Proc. Natl. Acad. Sci. USA 88, 3758-3761 (1991).
54. T. Ohama, S. Osawa, K. Watanabe, and T. H. Jukes. J. Mol. Evol. 30, 329-332 (1990).
55. T. H. Jukes. J. Mol. Evol. 18, 15-17 (1981).
56. I. J. Chambers, J. Frampton, P. Goldfarb, N. Affara, W. McBain, and P. R. Harrison. EMBO J. 5, 1221-1227 (1986).
57. G. Srinivasan, C. M. James, and J. A. Krzycki. Science 296, 1459-1462 (2002).
58. E. N. Trifonov. Bull. Math. Biol. 51, 417-432 (1989).
59. E. N. Trifonov. Molecular Biology 31, 759-767 (1997).
60. E. N. Trifonov, A. Kirzhner, V. M. Kirzhner, and I. N. Berezovsky. J. Mol. Evol. 53, 394-401 (2001).
61. S. F. Altschul. J. Mol. Biol. 219, 555-565 (1991).
62. E. N. Trifonov. In Discovering Biomolecular Mechanisms with Computational Biology. Ed. F. Eisenhaber. Landes Bioscience, Georgetown (2004) in press.
63. T. Xia, J. SantaLucia, M. E. Burkard, R. Kierzek, S. J. Schroeder, X. Jiao, C. Cox, and D. H. Turner. Biochemistry 37, 14719-14735 (1998).
Date Received: May 10, 2004
Communicated by the Editor Valery Ivanov