Nucleotide Sequence Complex Representation

Paul Dan CRISTEA, Aurora Rodica TUDUCE
University "Politehnica" of Bucharest, Spl. Independentei 313, Romania, [email protected]
Abstract. The paper presents recent developments in the nucleotide genomic signal (NuGS)
methodology. The NuGS approach is based on the conversion of symbolic nucleotide sequences
into numeric genomic signals. The NuGS method has proved adequate both for large scale analysis
of genomic sequences, by revealing hidden regularities and symmetries, and for local analysis,
important for the study of gene dynamics.
1 Introduction
Nucleotides are the monomers of the nucleic acids, DNA and RNA. The nucleobases are the key
components that give nucleotides their identity, and thus their function as letters in the genetic
alphabet. The primary nucleobases are adenine (A), cytosine (C) and guanine (G), shared by DNA
and RNA; thymine (T), found only in DNA; and uracil (U), which replaces T in RNA. The order of
nucleotides along a strand stores the genetic information.
A triplet of successive nucleotides forms a codon, which encodes an amino acid according to the
so-called standard Genetic Code [1], valid for the coding segments (exons) of most nuclear
genomes. There are 4^3 = 64 codons, which encode the 20 amino acids building the proteins, and
the 'stop' or 'terminator' signal marking the end of a gene and of protein synthesis. A gene can
comprise several exons. The protein-encoding part of the genome totals only 2-3% in Homo sapiens,
but reaches over 95% in archaea and bacteria. The intergenic part of the human genome contains
repetitive, quasi-random sequences and a large amount of transposable elements that bear a close
resemblance to the DNA of some independent entities like viruses and bacteria. A significant part
of the inter-gene chromosomal DNA plays an important role in the control of protein synthesis,
conjointly with other gene regulatory systems.
There is a sharp contrast between the deceptively simple structure of DNA nucleotide chains,
an unbranched linear code written in a four-letter alphabet, and the overwhelming complexity of
the three-dimensional structure of proteins. Correspondingly, the complexity of the proteome (the
set of proteins existing in a cell, tissue, organ or the whole body) far exceeds the complexity of the
genome. The total number of genes in the human genome is only about 32,000, while the proteome
comprises more than one million proteins, many of them transitory. The key to organism
complexity is not in the number of genes, but in the way parts of the genes are expressed and
combined to build different proteins by using alternative splicing. The regulatory mechanisms that
control these processes in every cell are sensitive to signals from the environment and from the
other cells, tissues and organs in the organism. Nevertheless, the nucleotide chains in DNA
molecules and the amino acid chains in various proteins are the bearers of the same genetic
information, despite the many changes that can be introduced by pre- and post-translational
processes and the essential role of the way proteins are 3D packed.
2 Representation of Genomic Information
The representation of genomic information can be either symbolic or numerical.
The symbolic representation uses directly the four symbols of the nucleotides, A, C, G and T
(or U). The analysis and computational methods are applied directly to the symbolic nucleotide
sequences, as they are obtained at the output of the sequencing machines, or as they are retrieved
from the genomic databases. The symbolic approach has the advantage of simplicity, important for
storage, retrieval and transmission of genomic data, but it also has significant disadvantages in
what concerns the processing of data, as it restricts the methodology to pattern matching and
statistical analysis [2].
The numerical representation uses an additional conversion step to map each individual
nucleotide symbol of a sequence into a corresponding numerical value [3, 4]. The numerical
representation allows the effective use of powerful digital signal processing techniques, initially
developed in the domains of communications and networking, but which can be applied
successfully to genomic research as well. The numeric approach is able to reveal many features of
genome structure, symmetries and regularities, which are sometimes hidden and would be difficult
to identify by symbolic representation and processing techniques [5, 6].
We have developed and used a specific approach, the nucleotide genomic signal (NuGS)
methodology, which can be applied at various scales, going from the global to the local analysis
of nucleotide sequences. The large scale analysis of genomic sequences reveals not only
ostensive features of extant DNA sequences, at the scale of entire chromosomes, or even entire
genomes, but also ancestral features, which existed in DNA sequences but disappeared during
subsequent evolution, under the pressure of species separation [4, 5].
3 Binary Indicator Sequences and their Fourier Transforms
A straightforward method to numerically characterize symbolic DNA nucleotide sequences is
to use binary indicator sequences. The method attaches four binary indicator sequences uA[k],
uC[k], uG[k], and uT[k], k ∈ {0, 1, …, N-1}, to a symbolic DNA string Nu[k] of N nucleotides,
seen as characters in the alphabet {A, C, G, T}. Each indicator sequence uX[k] keeps track of
the occurrence of a symbol X in the sequence:
$$u_X[k] = \begin{cases} 1, & \text{when } \mathrm{Nu}[k] = X, \\ 0, & \text{otherwise,} \end{cases} \qquad X \in \{A, C, G, T\},\ k \in \{0, 1, \dots, N-1\}. \tag{1}$$
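As a minimal sketch of definition (1), the four indicator sequences can be computed in a few lines of Python. The function name `indicator_sequences` and the sample string are illustrative assumptions, not part of the NuGS methodology itself.

```python
def indicator_sequences(seq):
    """Return the binary indicator sequences u_X[k] of equation (1) as lists of 0/1."""
    seq = seq.upper()
    return {x: [1 if s == x else 0 for s in seq] for x in "ACGT"}


# Hypothetical sample string, chosen only for illustration.
u = indicator_sequences("ACGTTA")
for x in "ACGT":
    print(x, u[x])
```

At every position exactly one of the four sequences equals 1, so the four lists always sum to an all-ones sequence.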
The Discrete Fourier Transforms (DFTs) of the binary sequences uA[k], uC[k], uG[k] and
uT[k], k = 0, …, N-1, can be defined as for any other digital sequence of N entries. The
corresponding DFT sequences, UA[h], UC[h], UG[h], and UT[h], h = 0, …, N-1, respectively,
also have N entries each, given by the equations:
$$U_X[h] = \sum_{k=0}^{N-1} u_X[k]\, e^{-j 2\pi h k / N}, \qquad X \in \{A, C, G, T\},\ h \in \{0, 1, \dots, N-1\}, \tag{2}$$
which together provide the four-dimensional frequency spectrum of a nucleotide sequence. The
cumulated power spectrum of the DFTs of the four binary sequences,
$$S[h] = \sum_X |U_X[h]|^2 = |U_A[h]|^2 + |U_C[h]|^2 + |U_G[h]|^2 + |U_T[h]|^2, \qquad h \in \{0, 1, \dots, N-1\}, \tag{3}$$
has the interesting property that, for most coding regions (exons), there is a period-three peak in
S[.] (see [7]). Unfortunately, there are many exceptions, and the resolution is not good enough
to allow exact exon location.
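The cumulated power spectrum (3) can be illustrated with a short, standard-library-only sketch. It evaluates the DFT (2) directly, which costs O(N^2) operations, so it is suitable only for short strings; an FFT routine would normally be used instead. The names `dft` and `power_spectrum` and the toy sequence are assumptions made for this example. The toy string is strictly periodic with period three, so it is not a real exon; it only shows where a period-three component appears in S[h], namely at h = N/3.

```python
import cmath


def dft(u):
    """Direct evaluation of the DFT in equation (2)."""
    n = len(u)
    return [
        sum(u[k] * cmath.exp(-2j * cmath.pi * h * k / n) for k in range(n))
        for h in range(n)
    ]


def power_spectrum(seq):
    """Cumulated power spectrum S[h] of equation (3)."""
    seq = seq.upper()
    spectra = [dft([1 if s == x else 0 for s in seq]) for x in "ACGT"]
    return [sum(abs(U[h]) ** 2 for U in spectra) for h in range(len(seq))]


# Toy sequence with exact period three (illustrative assumption, not real data).
toy = "ATG" * 20
n = len(toy)
S = power_spectrum(toy)

# Search the non-zero frequencies up to N/2 for the strongest component.
peak = max(range(1, n // 2 + 1), key=lambda h: S[h])
print("N =", n, "peak at h =", peak)
```

For real coding regions the peak is far less clean than in this toy case, which is consistent with the exceptions and limited resolution noted above.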
The distribution of symbols along a sequence can also be described by cumulative
occurrence sequences, defined in terms of the binary indicator sequences. For a symbolic
DNA string Nu[k] of N nucleotides, one can define the four cumulative occurrence sequences:
$$c_X[k] = \sum_{h=0}^{k} u_X[h], \qquad X \in \{A, C, G, T\},\ k \in \{0, 1, \dots, N-1\}. \tag{4}$$
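In several representations of genomic sequences, bipolar binary imbalance sequences are
used to describe the nucleotide distribution. An imbalance sequence uX-Y[.] marks, at each
position, the occurrence of the symbol X against that of the symbol Y in the symbolic sequence
Nu[.], so that its running sum gives the difference in the number of occurrences of X and Y:
$$u_{X-Y}[k] = u_X[k] - u_Y[k], \qquad X, Y \in \{A, C, G, T\},\ k \in \{0, 1, \dots, N-1\}. \tag{5}$$
The values are bipolar, in {-1, 1}, when X and Y are complementary classes that together cover
the whole alphabet; for two single nucleotides the value 0 also occurs, at positions holding
neither symbol.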
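The cumulative occurrence sequences (4) and the imbalance sequences (5) follow directly from the indicator sequences. The sketch below uses hypothetical helper names (`indicator`, `cumulative_occurrence`, `imbalance`) and an arbitrary sample string.

```python
from itertools import accumulate


def indicator(seq, x):
    """Binary indicator sequence u_X[k], equation (1)."""
    return [1 if s == x else 0 for s in seq.upper()]


def cumulative_occurrence(seq, x):
    """c_X[k]: number of occurrences of x in entries 0..k, equation (4)."""
    return list(accumulate(indicator(seq, x)))


def imbalance(seq, x, y):
    """u_{X-Y}[k] = u_X[k] - u_Y[k], equation (5)."""
    return [a - b for a, b in zip(indicator(seq, x), indicator(seq, y))]


seq = "GATTACA"  # arbitrary sample string, for illustration only
c_a = cumulative_occurrence(seq, "A")
u_at = imbalance(seq, "A", "T")
print(c_a)
print(u_at)
```

In this sample the A-T imbalance is 0 at the G and C positions, since neither A nor T occurs there.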
Binary indicator sequences can also be used to describe nucleotide sequences in terms of
classes comprising more than one nucleotide (poly-nucleotides). For instance, it can be
convenient to use binary indicator sequences for the $\binom{4}{2} = 6$ di-nucleotide classes,
which form three dichotomic pairs:
- weak (uW[.] = uA[.] + uT[.]) and strong bonds (uS[.] = uC[.] + uG[.]), (6)
- purines (uR[.] = uA[.] + uG[.]) and pyrimidines (uY[.] = uC[.] + uT[.]), (7)
- amino (uM[.] = uA[.] + uC[.]) and keto (uK[.] = uG[.] + uT[.]). (8)
Definitions (2, 4, 5) can be easily extended to the di-nucleotide indicator sequences (6 - 8).
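The class indicator sequences (6 - 8) can be sketched as follows. The dictionary `CLASSES` and the function `class_indicator` are names introduced only for this example.

```python
# Membership of the six di-nucleotide classes, as listed in (6) - (8).
CLASSES = {
    "W": "AT",  # weak bonds
    "S": "CG",  # strong bonds
    "R": "AG",  # purines
    "Y": "CT",  # pyrimidines
    "M": "AC",  # amino
    "K": "GT",  # keto
}


def class_indicator(seq, cls):
    """Indicator sequence of a class, e.g. u_W[k] = u_A[k] + u_T[k]."""
    members = CLASSES[cls]
    return [1 if s in members else 0 for s in seq.upper()]


seq = "ACGT"  # arbitrary sample string
for cls in CLASSES:
    print(cls, class_indicator(seq, cls))
```

Each dichotomic pair (W/S, R/Y, M/K) partitions the alphabet, so the two indicator sequences of a pair sum to 1 at every position.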
4 Nucleotide Complex Representation
We define the nucleotide complex representation [4, 5, 6] by the mapping Nu → C{Nu},
which establishes a one-to-one correspondence between the nucleotides Nu ∈ {A, C, G, T} and
their complex representations C{Nu}, belonging to the set of complex numbers {a, c, g, t}:
$$a = \mathcal{C}\{A\} = 1 + j, \quad c = \mathcal{C}\{C\} = -1 - j, \quad g = \mathcal{C}\{G\} = -1 + j, \quad t = \mathcal{C}\{T\} = 1 - j, \tag{9}$$
represented in Figure 1.
Figure 1: Complex representation of single-nucleotide classes.
The representations in (9) and Figure 1 are complex numbers that all have the same absolute
value, √2, but different canonic arguments:
$$\arg(a) = \frac{\pi}{4}, \quad \arg(c) = -\frac{3\pi}{4}, \quad \arg(g) = \frac{3\pi}{4}, \quad \arg(t) = -\frac{\pi}{4}. \tag{10}$$
The phase of a complex number is a periodic quantity, as a complex number does not
change when its phase varies by an arbitrary integer multiple of 2π, the phase period. The
arguments given in (10) are the phases of the nucleotide representations {a, c, g, t} belonging to
the canonic domain (-π, π].
The edges of the square in Figure 1 correspond to the di-nucleotide classes:
Re = +1 ↔ weak bonds, Re = -1 ↔ strong bonds,
Im = +1 ↔ purines, Im = -1 ↔ pyrimidines. (11)
Thus, the real and imaginary parts of the nucleotide complex representation describe the
membership of nucleotides to the {W, S, R, Y} di-nucleotide classes.
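The mapping (9), the arguments (10) and the class memberships (11) can be checked numerically with Python's built-in complex type. The names `COMPLEX_MAP` and `describe` are assumptions of this sketch.

```python
import cmath
import math

# Mapping (9): each nucleotide is a vertex of the square in Figure 1.
COMPLEX_MAP = {"A": 1 + 1j, "C": -1 - 1j, "G": -1 + 1j, "T": 1 - 1j}


def describe(nu):
    """Return modulus, canonic argument and the two class labels of a nucleotide."""
    z = COMPLEX_MAP[nu]
    bond = "weak" if z.real > 0 else "strong"        # sign of Re, see (11)
    ring = "purine" if z.imag > 0 else "pyrimidine"  # sign of Im, see (11)
    return abs(z), cmath.phase(z), bond, ring


for nu in "ACGT":
    modulus, arg, bond, ring = describe(nu)
    print(f"{nu}: |z| = {modulus:.4f}, arg = {arg / math.pi:+.2f}*pi, {bond}, {ring}")
```

`cmath.phase` returns values in the canonic domain, so the printed arguments can be compared directly with (10).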
The mapping (9) can be applied to a symbolic sequence of nucleotides, Nu[k], k ∈ {0, 1,
…, N-1}, which is transformed into a complex digital signal C{Nu[k]}, k ∈ {0, 1, …, N-1}.
The real and imaginary parts of the signal samples are given by:
$$\mathcal{C}\{\mathrm{Nu}[k]\} = u_{W-S}[k] + j\, u_{R-Y}[k], \qquad k \in \{0, 1, \dots, N-1\}, \tag{12}$$
where uW-S[k] and uR-Y[k] are the imbalance sequences between the weak-strong and
purine-pyrimidine di-nucleotide classes, respectively.
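Relation (12) can be verified on a sample string: the complex signal obtained from the mapping (9) coincides with the signal rebuilt from the two class imbalance sequences. The helper names `complex_signal` and `class_imbalance` are hypothetical.

```python
# Mapping (9).
COMPLEX_MAP = {"A": 1 + 1j, "C": -1 - 1j, "G": -1 + 1j, "T": 1 - 1j}


def complex_signal(seq):
    """Complex genomic signal: one complex sample per nucleotide."""
    return [COMPLEX_MAP[s] for s in seq.upper()]


def class_imbalance(seq, plus, minus):
    """Imbalance sequence between two classes: +1 for `plus`, -1 for `minus`."""
    return [(s in plus) - (s in minus) for s in seq.upper()]


seq = "GATTACA"  # arbitrary sample string
z = complex_signal(seq)
u_ws = class_imbalance(seq, "AT", "CG")  # weak minus strong
u_ry = class_imbalance(seq, "AG", "CT")  # purine minus pyrimidine
rebuilt = [complex(w, r) for w, r in zip(u_ws, u_ry)]
print(z == rebuilt)
```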
5 Phase Nucleotide Genomic Signals
The argument sequence resulting from (10) is the simplest phase signal. Using the binary
nucleotide indicator sequences uA[k], uC[k], uG[k], and uT[k], k ∈ {0, 1, …, N-1}, defined in (1),
and the arguments of the nucleotide representation given in (10), the argument sequence results:
$$\arg(\mathcal{C}\{\mathrm{Nu}[k]\}) = \frac{\pi}{4}\left(u_A[k] - 3u_C[k] + 3u_G[k] - u_T[k]\right), \qquad k = 0, 1, \dots, N-1. \tag{13}$$
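A small numerical check of (13), assuming the mapping (9): the phase computed from the weights of the indicator sequences matches the argument of each complex sample. `WEIGHTS` and `phase_signal` are names chosen for this sketch.

```python
import cmath
import math

COMPLEX_MAP = {"A": 1 + 1j, "C": -1 - 1j, "G": -1 + 1j, "T": 1 - 1j}

# Coefficients of u_A, u_C, u_G, u_T in equation (13), in units of pi/4.
WEIGHTS = {"A": 1, "C": -3, "G": 3, "T": -1}


def phase_signal(seq):
    """Argument sequence of equation (13)."""
    return [math.pi / 4 * WEIGHTS[s] for s in seq.upper()]


seq = "GATTACA"  # arbitrary sample string
theta = phase_signal(seq)
direct = [cmath.phase(COMPLEX_MAP[s]) for s in seq]
print([round(t / math.pi, 2) for t in theta])
```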
This is a one-dimensional one-to-one representation of nucleotide sequences as complex
number sequences, useful for the local analyses of nucleotide sequences. For the global study of
the distribution of nucleotides and of di-nucleotides along a sequence, there are other phase
signals which can be adapted from Signal Theory for the analysis of any sequences having
complex entries. Two phase signals are particularly useful for this purpose, the cumulative and
the unwrapped phases.
The cumulative phase is defined as the sum of the arguments of the complex representations
of the samples in the sequence, from the first (index 0) to the current, kth, sample:
$$\theta_c[k] = \sum_{h=0}^{k} \arg(\mathcal{C}\{\mathrm{Nu}[h]\}), \qquad k = 0, 1, \dots, N-1, \tag{14}$$
where Nu[h] is the hth nucleotide, h = 0, …, k, in the sequence of N nucleotides, and C{Nu[h]}
is its complex representation, with the arguments belonging to the canonic domain (-π, π].
Using the cumulative occurrence sequences cA[k], cC[k], cG[k], cT[k], k ∈ {0, 1, …, N-1},
defined in (4), the cumulative phase (14) can be written as:
$$\theta_c[k] = \frac{\pi}{4}\left[3\left(c_G[k] - c_C[k]\right) + \left(c_A[k] - c_T[k]\right)\right] = \frac{\pi}{4} N[k], \qquad k = 0, 1, \dots, N-1, \tag{15}$$
where N[k] is the nucleotide imbalance,
$$N[k] = 3\left(c_G[k] - c_C[k]\right) + \left(c_A[k] - c_T[k]\right), \qquad k = 0, 1, \dots, N-1, \tag{16}$$
a signature of the distribution of nucleotides in the sequence. The values cA[k], cC[k], cG[k], and
cT[k] give the number of occurrences of the A, C, G, and T nucleotides, respectively, in the
entries 0 to k of the sequence. If the distribution of nucleotides in each pair of the
single-nucleotide classes weak (A-T) and strong (G-C) were balanced, the nucleotide imbalance would
be close to zero along the sequence, which is not the case for prokaryotes [5].
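The nucleotide imbalance and the cumulative phase can be sketched as a running sum of integer weights, with weight +1 for A, -1 for T, +3 for G and -3 for C, which is the form the imbalance takes when the weak pair is A-T and the strong pair is G-C. The function names below are assumptions of this example.

```python
import math
from itertools import accumulate

# Weight of each nucleotide, equal to its argument in (10) in units of pi/4.
WEIGHTS = {"A": 1, "C": -3, "G": 3, "T": -1}


def nucleotide_imbalance(seq):
    """N[k] = 3(c_G[k] - c_C[k]) + (c_A[k] - c_T[k]), as a running sum."""
    return list(accumulate(WEIGHTS[s] for s in seq.upper()))


def cumulative_phase(seq):
    """Cumulative phase: pi/4 times the nucleotide imbalance."""
    return [math.pi / 4 * n for n in nucleotide_imbalance(seq)]


seq = "GATTACA"  # arbitrary sample string
n_imb = nucleotide_imbalance(seq)
theta_c = cumulative_phase(seq)
print(n_imb)
```

Working with the integer sequence N[k] avoids accumulating floating-point error over long sequences; the phase is recovered by a single multiplication.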
The unwrapped phase is defined as the phase of the elements in the sequence, corrected by
adding or subtracting a multiple of 2π (i.e., 2mπ, m ∈ Z, Z being the set of integers), in such a way
that the absolute value of the phase difference between any two successive entries of the
sequence becomes smaller than or equal to π:
$$\begin{aligned} \theta_u[0] &= \arg(\mathcal{C}\{\mathrm{Nu}[0]\}), \\ \theta_u[k] &= \arg(\mathcal{C}\{\mathrm{Nu}[k]\}) + 2m\pi, \quad m \in \mathbb{Z},\ k = 1, \dots, N-1, \quad \text{so that } \left|\theta_u[k] - \theta_u[k-1]\right| \le \pi. \end{aligned} \tag{17}$$
The inequality |θu[k] - θu[k-1]| ≤ π has two solutions whenever arg(C{Nu[k]}) and
arg(C{Nu[k-1]}) differ in absolute value by exactly π. For the complex representation in
Figure 1 and the arguments (10), this happens for AC, CA, GT, and TG. The standard
MATLAB® unwrap function returns θu[k] = θu[k-1] + π if arg(C{Nu[k]}) > arg(C{Nu[k-1]}),
and θu[k] = θu[k-1] - π in the opposite case. This is not convenient for the analysis of nucleotide
sequences, where diametrically opposed pairs of nucleotides (such as AC and CA) should be
equivalent as far as the phase difference between the complex representations of their
components is concerned. The two solutions must be considered together and equally valid, so that
the unwrapped phase should vary equiprobably by +π and -π, and the net result should be zero.
As a consequence, we classified the 16 di-nucleotides in only three classes:
- neutral, comprising both the di-nucleotides that have the complex representations of their
components superposed (AA, CC, GG, TT), and those with the components diametrically
opposed with respect to the origin (AC, CA, GT, TG), as shown in Figure 2a;
- positive, for which the complex representations of the components are rotated in the
trigonometric (counter-clockwise) direction with respect to each other (AG, GC, CT, TA), Figure 2b;
- negative, for which the complex representations of the components are rotated clockwise
with respect to each other (AT, TC, CG, GA), Figure 2c.
For neutral di-nucleotides the unwrapped phase remains the same (Δθu[k] = 0), for positive
di-nucleotides it increases by Δθu[k] = +π/2, and for negative di-nucleotides it decreases
by Δθu[k] = -π/2.
Using (10) and (17), the unwrapped phase results:
$$\theta_u[k] = \theta_u[0] + \sum_{h=1}^{k} \Delta\theta_u[h] = \theta_u[0] + \frac{\pi}{2}\left(n_+[k] - n_-[k]\right) = \theta_u[0] + \frac{\pi}{2} P[k], \qquad k \in \{1, \dots, N-1\}, \tag{18}$$
where n+[k] is the number of positive di-nucleotides (AG, GC, CT, TA) and n-[k] is the number of
negative di-nucleotides (AT, TC, CG, GA) occurring up to entry k of the sequence.
P[k] is the di-nucleotide imbalance, a signature of the distribution of di-nucleotides (pairs of
nucleotides) in the sequence, given by:
$$P[k] = n_+[k] - n_-[k], \qquad k \in \{1, \dots, N-1\}. \tag{19}$$
In all practical cases, θu[0] is negligible.
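The di-nucleotide imbalance and the resulting unwrapped phase can be sketched with a simple table lookup over successive pairs. The function names, the sample strings and the convention P[0] = 0 (no pair ends at entry 0) are assumptions made for this example.

```python
import math

# Di-nucleotide classes as listed in the text; all other pairs are neutral.
POSITIVE = {"AG", "GC", "CT", "TA"}  # counter-clockwise rotation, +pi/2
NEGATIVE = {"AT", "TC", "CG", "GA"}  # clockwise rotation, -pi/2

# Canonic arguments of the nucleotide representations, from (10).
ARG = {"A": math.pi / 4, "C": -3 * math.pi / 4, "G": 3 * math.pi / 4, "T": -math.pi / 4}


def dinucleotide_imbalance(seq):
    """P[k] = n+[k] - n-[k]; P[0] = 0 is a convention assumed here."""
    seq = seq.upper()
    p = [0]
    for prev, cur in zip(seq, seq[1:]):
        pair = prev + cur
        step = (pair in POSITIVE) - (pair in NEGATIVE)  # +1, 0 or -1
        p.append(p[-1] + step)
    return p


def unwrapped_phase(seq):
    """Unwrapped phase: theta_u[0] plus pi/2 times the di-nucleotide imbalance."""
    theta0 = ARG[seq[0].upper()]
    return [theta0 + math.pi / 2 * p for p in dinucleotide_imbalance(seq)]


print(dinucleotide_imbalance("GATTACA"))
print(dinucleotide_imbalance("AGCTA"))
```

In the second sample every step is a positive di-nucleotide, so P[k] grows by one at each position and the unwrapped phase completes a full counter-clockwise turn of 2π over the four steps.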
As they have a direct statistical significance and are expressed by integer numbers, it is
convenient to use the nucleotide imbalance (N) and the di-nucleotide imbalance (P) in the
genomic signal analysis, instead of the cumulated phase (θc) and the unwrapped phase (θu),
respectively.
6 Conclusion
The Nucleotide Genomic Signal (NuGS) methodology that we have developed and used [4, 5,
6] is based on the conversion of symbolic nucleotide sequences into numeric genomic signals.
The representation we adopted is unbiased, i.e., it is adequate for a large variety of problems
related to DNA analysis. To achieve such versatility, the representation does not use the
cardinality of numbers (their capacity to express and handle quantities), but their
ordinality (their capacity to label and classify objects in order, and to handle the order of objects).
The large scale analysis of genomic sequences reveals not only ostensive features of extant
DNA sequences, at the scale of entire chromosomes or genomes, but also ancestral features,
which existed in DNA sequences and disappeared during evolution, under the pressure of
species separation.
The local analysis of nucleotide sequences is important for the study of gene dynamics [8],
especially for tracking the development of pathogen resistance to treatment. Such results can
help fast diagnosis and early assessment of drug efficiency, allowing a systematic use of the
advances in molecular medicine in support of clinical decisions [6, 9, 10].
Genome regularities also allow the prediction of nucleotides in a DNA sequence, using a
methodology similar to time series prediction, and to estimate the cell self repair potential in
processes such as replication, transcription or crossover.
The striking regularities revealed with the NuGS approach correspond to surprising
symmetries in the distribution of nucleotides and pairs of nucleotides along DNA sequences in
archaea, bacteria and eukarya. As a consequence, a genome appears to be – from the structural
point of view – more than a plain text, as it also satisfies regularities evoking rhythm and rhyme
in poems.
References
[1] http://www.ncbi.nlm.nih.gov/Taxonomy/Utils/wprintgc.cgi
[2] R. Durbin, S. Eddy, A. Krogh, and G. Mitchison, Biological Sequence Analysis:
Probabilistic Models of Proteins and Nucleic Acids, Cambridge University Press, Cambridge, UK, 1998.
[3] I. Cosic, Macromolecular bioactivity: Is it Resonant Interaction between Molecules? - Theory and
Applications, IEEE Trans. on BME, 41, pp. 1101-1114, 1994.
[4] P. D. Cristea. Conversion of Nitrogenous Base Sequences into Genomic Signals. Journal of
Cellular and Molecular Medicine 6(2) 279–303, 2002.
[5] P. D. Cristea. Representation and analysis of DNA sequences. In: E. Dougherty, I. Shmulevich, J.
Chen, and Z. J. Wang (eds.), Genomic Signal Processing and Statistics, Eurasip Book Series on
Signal Processing and Communications, Hindawi Publ. Corp., 2005, pp. 15–65.
[6] P. D. Cristea, Rodica Tuduce, J. Cornelis, R. Deklerck, I. Nastac, M. Andrei, Signal Representation
and Processing of Nucleotide Sequences, Proc. of the 7th IEEE Intl. Conf. on Bioinformatics and
Bioengineering, Harvard Medical School, Boston, USA (2007) 1214-1219.
[7] J. A. Berger, S. K. Mitra, and J. Astola, Power spectrum analysis for DNA sequences, Proc. of the
7th Intl. Symp. on Signal Processing and its Applications, 2, pp. 29-32, 2003.
[8] P. D. Cristea, Building Phylogenetic Trees by Using Gene nucleotide Genomic Signals, 34th
Annual International IEEE EMBS Conference, August 28 - September 1, 2012, Hilton, San Diego,
California, USA.
[9] S. T. Cole, R. Brosch, J. Parkhill, et al. Deciphering the biology of Mycobacterium tuberculosis
from the complete genome sequence. Nature, 393(6685), pp. 537-544, 1998.
[10] A. Telenti et al., Detection of rifampin-resistance mutations in Mycobacterium tuberculosis,
Lancet, 341, pp. 647-650, 1993.