Nucleotide Sequence Complex Representation

Paul Dan CRISTEA, Aurora Rodica TUDUCE
University "Politehnica" of Bucharest, Spl. Independentei 313, Romania, [email protected]

Abstract. The paper presents recent developments in the nucleotide genomic signal (NuGS) methodology. The NuGS approach is based on the conversion of symbolic nucleotide sequences into numeric genomic signals. The NuGS method has proved adequate both for the large scale analysis of genomic sequences, where it reveals hidden regularities and symmetries, and for local analysis, which is important for the study of gene dynamics.

1 Introduction

Nucleotides are the monomers of the nucleic acids, DNA and RNA. The nucleobases are the key components that give nucleotides their identity, and thus their function as letters in the genetic alphabet. The primary nucleobases are adenine (A), cytosine (C) and guanine (G), shared by DNA and RNA, thymine (T), found only in DNA, and uracil (U), which replaces T in RNA. The order of nucleotides along a strand stores the genetic information. A triplet of successive nucleotides forms a codon, which encodes an amino acid according to the so-called standard Genetic Code [1], valid for the coding segments (exons) of most nuclear genomes. There are $4^3 = 64$ codons, which encode the 20 amino acids building the proteins, and the 'stop' or 'terminator' signal marking the end of a gene and of protein synthesis. A gene can comprise several exons. The protein-encoding part of the genome amounts to only 2-3% in Homo sapiens, but reaches over 95% in archaea and bacteria. The intergenic part of the human genome contains repetitive, quasi-random sequences and a large number of transposable elements that bear a close resemblance to the DNA of independent entities such as viruses and bacteria. A significant part of the intergenic chromosomal DNA plays an important role in the control of protein synthesis, jointly with other gene regulatory systems.
There is a sharp contrast between the deceptively simple structure of DNA nucleotide chains, an unbranched linear code written in a four-letter alphabet, and the overwhelming complexity of the three-dimensional structure of proteins. Correspondingly, the complexity of the proteome (the set of proteins existing in a cell, tissue, organ or the whole body) far exceeds the complexity of the genome. The total number of genes in the human genome is only about 32,000, while the proteome comprises more than one million proteins, many of them transitory. The key to organism complexity lies not in the number of genes, but in the way parts of the genes are expressed and combined to build different proteins by alternative splicing. The regulatory mechanisms that control these processes in every cell are sensitive to signals from the environment and from the other cells, tissues and organs of the organism. Nevertheless, the nucleotide chains in DNA molecules and the amino acid chains in the various proteins are bearers of the same genetic information, despite the many changes that can be introduced by pre- and post-translational processes and the essential role played by the 3D packing of proteins.

2 Representation of Genomic Information

The representation of genomic information can be either symbolic or numerical. The symbolic representation directly uses the four symbols of the nucleotides, A, C, G and T (or U). The analysis and computational methods are applied directly to the symbolic nucleotide sequences, as they are obtained at the output of the sequencing machines or retrieved from the genomic databases. The symbolic approach has the advantage of simplicity, which is important for the storage, retrieval and transmission of genomic data, but it also has significant disadvantages for data processing, as it restricts the methodology to pattern matching and statistical analysis [2].
The numerical representation uses an additional conversion step to map each individual nucleotide symbol of a sequence into a corresponding numerical value [3, 4]. The numerical representation allows the effective use of powerful digital signal processing techniques, initially developed in the domains of communications and networking, but which can be applied successfully to genomic research as well. The numeric approach is able to reveal many features of genome structure, symmetries and regularities that are sometimes hidden and would be difficult to identify by symbolic representation and processing techniques [5, 6]. We have developed and used a specific approach, the nucleotide genomic signal (NuGS) methodology, which can be applied at various scales, from the global to the local analysis of nucleotide sequences. The large scale analysis of genomic sequences reveals not only manifest features of extant DNA sequences, at the scale of entire chromosomes or even entire genomes, but also ancestral features, which existed in DNA sequences but disappeared during subsequent evolution, under the pressure of species separation [4, 5].

3 Binary Indicator Sequences and their Fourier Transforms

A straightforward method to characterize symbolic DNA nucleotide sequences numerically is to use binary indicator sequences. The method attaches four binary indicator sequences $u_A[k]$, $u_C[k]$, $u_G[k]$, and $u_T[k]$, $k \in \{0, 1, \ldots, N-1\}$, to a symbolic DNA string $Nu[k]$ of N nucleotides, seen as characters of the alphabet {A, C, G, T}. Each indicator sequence $u_X[k]$ keeps track of the occurrences of a symbol X in the sequence:

$$u_X[k] = \begin{cases} 1, & \text{when } Nu[k] = X, \\ 0, & \text{otherwise,} \end{cases} \qquad X \in \{A, C, G, T\},\; k \in \{0, 1, \ldots, N-1\}. \tag{1}$$

The Discrete Fourier Transforms (DFTs) of the binary sequences $u_A[k]$, $u_C[k]$, $u_G[k]$ and $u_T[k]$, $k = 0, \ldots, N-1$, can be defined as for any other digital sequence of N entries.
The corresponding DFT sequences, $U_A[h]$, $U_C[h]$, $U_G[h]$, and $U_T[h]$, $h = 0, \ldots, N-1$, also have N entries each, given by the equations:

$$U_X[h] = \sum_{k=0}^{N-1} u_X[k]\, e^{-j 2\pi h k / N}, \qquad X \in \{A, C, G, T\},\; h \in \{0, 1, \ldots, N-1\}, \tag{2}$$

which together provide the four-dimensional frequency spectrum of a nucleotide sequence. The cumulated power spectrum of the DFTs of the four binary sequences,

$$S[h] = \sum_{X} |U_X[h]|^2 = |U_A[h]|^2 + |U_C[h]|^2 + |U_G[h]|^2 + |U_T[h]|^2, \qquad h \in \{0, 1, \ldots, N-1\}, \tag{3}$$

has the interesting property that, for most coding regions (exons), S[.] shows a period-three peak, i.e., a peak near h = N/3 (see [7]). Unfortunately, there are many exceptions, and the resolution is not good enough to allow exact exon location.

The distribution of symbols along a sequence can also be described by cumulative occurrence sequences, defined in terms of the binary indicator sequences. For a symbolic DNA string $Nu[k]$ of N nucleotides, one can define the four cumulative occurrence sequences:

$$c_X[k] = \sum_{h=0}^{k} u_X[h], \qquad X \in \{A, C, G, T\},\; k \in \{0, 1, \ldots, N-1\}. \tag{4}$$

In several representations of genomic sequences, imbalance sequences are used to describe the nucleotide distribution. An imbalance sequence $u_{X-Y}[.]$ gives, at each position, the difference between the indicators of the symbols X and Y in the symbolic sequence Nu[.]:

$$u_{X-Y}[k] = u_X[k] - u_Y[k], \qquad X, Y \in \{A, C, G, T\},\; k \in \{0, 1, \ldots, N-1\}. \tag{5}$$

For two single nucleotides the sequence takes the values +1, -1, and 0 (where neither X nor Y occurs); for two complementary classes that together cover the alphabet, as in (6)-(8) below, it becomes a bipolar binary sequence with values in {-1, 1}.

Binary indicator sequences can also be used to describe nucleotide sequences in terms of classes comprising more than one nucleotide (poly-nucleotides). For instance, it can be convenient to use binary indicator sequences for the $\binom{4}{2} = 6$ di-nucleotide classes, which form three dichotomic pairs:

weak ($u_W[.] = u_A[.] + u_T[.]$) and strong bonds ($u_S[.] = u_C[.] + u_G[.]$), (6)
purines ($u_R[.] = u_A[.] + u_G[.]$) and pyrimidines ($u_Y[.] = u_C[.] + u_T[.]$), (7)
amino ($u_M[.] = u_A[.] + u_C[.]$) and keto ($u_K[.] = u_G[.] + u_T[.]$). (8)

Definitions (2), (4) and (5) can easily be extended to the di-nucleotide indicator sequences (6)-(8).
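The definitions of this section can be sketched in a few lines of code. The following is a minimal illustration, assuming Python with NumPy; the function names and the toy sequence are hypothetical and not part of the NuGS methodology itself. It builds the indicator sequences (1), the cumulated power spectrum (3) through NumPy's FFT (which uses the same sign convention as (2)), and the cumulative occurrence sequences (4).

```python
import numpy as np

ALPHABET = "ACGT"

def indicator_sequences(seq):
    """Return {X: u_X}, with u_X[k] = 1 where seq[k] == X and 0 elsewhere (eq. 1)."""
    symbols = np.array(list(seq.upper()))
    return {x: (symbols == x).astype(int) for x in ALPHABET}

def cumulated_power_spectrum(seq):
    """S[h] = sum over X of |U_X[h]|^2 (eqs. 2-3); np.fft.fft computes the DFT."""
    u = indicator_sequences(seq)
    return sum(np.abs(np.fft.fft(u[x])) ** 2 for x in ALPHABET)

def cumulative_occurrences(seq):
    """c_X[k] = number of occurrences of X in entries 0..k (eq. 4)."""
    u = indicator_sequences(seq)
    return {x: np.cumsum(u[x]) for x in ALPHABET}

# Hypothetical toy input: a perfectly 3-periodic string, not real genomic data.
toy = "ATG" * 20
N = len(toy)
S = cumulated_power_spectrum(toy)

# Strongest non-DC component in the first half of the spectrum.
peak = int(np.argmax(S[1:N // 2 + 1])) + 1
print("peak at h =", peak, "; N/3 =", N // 3)
```

On this artificial input the strongest component falls at h = N/3, which is the period-three peak mentioned above; real exons show the effect far less cleanly, as the text notes.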
4 Nucleotide Complex Representation

We define the nucleotide complex representation [4, 5, 6] by the mapping $Nu \mapsto \mathcal{C}\{Nu\}$, which establishes a one-to-one correspondence between the nucleotides $Nu \in \{A, C, G, T\}$ and their complex representations $\mathcal{C}\{Nu\}$, belonging to the set of complex numbers $\{a, c, g, t\}$:

$$a = \mathcal{C}\{A\} = 1 + j, \quad c = \mathcal{C}\{C\} = -1 - j, \quad g = \mathcal{C}\{G\} = -1 + j, \quad t = \mathcal{C}\{T\} = 1 - j, \tag{9}$$

represented in Figure 1.

Figure 1: Complex representation of single-nucleotide classes.

The representations in (9) and Figure 1 are complex numbers that all have the same absolute value $\sqrt{2}$, but different canonic arguments:

$$\arg(a) = \frac{\pi}{4}, \quad \arg(c) = -\frac{3\pi}{4}, \quad \arg(g) = \frac{3\pi}{4}, \quad \arg(t) = -\frac{\pi}{4}. \tag{10}$$

The phase of a complex number is a periodic quantity, as a complex number does not change when its phase varies by an arbitrary integer multiple of $2\pi$, the phase period. The arguments given in (10) are the phases of the nucleotide representations {a, c, g, t} belonging to the canonic domain $(-\pi, \pi]$. The edges of the square in Figure 1 correspond to the di-nucleotide classes:

Re = +1 ↔ weak bonds, Re = -1 ↔ strong bonds, Im = +1 ↔ purines, Im = -1 ↔ pyrimidines. (11)

Thus, the real and imaginary parts of the nucleotide complex representation describe the membership of nucleotides in the {W, S, R, Y} di-nucleotide classes. The mapping (9) can be applied to a symbolic sequence of nucleotides, $Nu[k]$, $k \in \{0, 1, \ldots, N-1\}$, which is transformed into a complex digital signal $\mathcal{C}\{Nu[k]\}$, $k \in \{0, 1, \ldots, N-1\}$. The real and imaginary parts of the signal samples are given by:

$$\mathcal{C}\{Nu[k]\} = u_{W-S}[k] + j\, u_{R-Y}[k], \qquad k \in \{0, 1, \ldots, N-1\}, \tag{12}$$

where $u_{W-S}[k]$ and $u_{R-Y}[k]$ are the imbalance sequences between the weak and strong, and between the purine and pyrimidine di-nucleotide classes, respectively.

5 Phase Nucleotide Genomic Signals

The argument sequence resulting from (10) is the simplest phase signal.
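As a minimal sketch of the mapping (9), assuming Python with NumPy (the names `COMPLEX_MAP`, `complex_signal` and `class_imbalance`, and the example string, are illustrative and not taken from the paper), the code below builds the complex signal of a short sequence, checks relation (12) against the weak-strong and purine-pyrimidine imbalance sequences, and evaluates the argument sequence with the canonic arguments (10).

```python
import numpy as np

# Mapping (9): A -> 1+j, C -> -1-j, G -> -1+j, T -> 1-j.
COMPLEX_MAP = {"A": 1 + 1j, "C": -1 - 1j, "G": -1 + 1j, "T": 1 - 1j}

def complex_signal(seq):
    """Complex genomic signal C{Nu[k]} of a nucleotide string."""
    return np.array([COMPLEX_MAP[n] for n in seq.upper()])

def class_imbalance(seq, plus, minus):
    """Bipolar sequence: +1 for nucleotides in `plus`, -1 for those in `minus`."""
    values = {**{n: 1 for n in plus}, **{n: -1 for n in minus}}
    return np.array([values[n] for n in seq.upper()])

seq = "ACGTTGCA"  # hypothetical example string
z = complex_signal(seq)

u_ws = class_imbalance(seq, plus="AT", minus="CG")  # weak minus strong
u_ry = class_imbalance(seq, plus="AG", minus="CT")  # purine minus pyrimidine

# Relation (12): real part = W-S imbalance, imaginary part = R-Y imbalance.
assert np.array_equal(z.real, u_ws)
assert np.array_equal(z.imag, u_ry)

# Argument sequence in the canonic domain (-pi, pi], shown in units of pi/4.
args_in_quarters = np.angle(z) * 4 / np.pi
print(np.round(args_in_quarters).astype(int))
```

`np.angle` returns values in $(-\pi, \pi]$, so the printed multiples of $\pi/4$ correspond directly to the arguments in (10).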
Using the binary nucleotide indicator sequences $u_A[k]$, $u_C[k]$, $u_G[k]$, and $u_T[k]$, $k \in \{0, 1, \ldots, N-1\}$, defined in (1), and the arguments of the nucleotide representation given in (10), the argument sequence results:

$$\arg \mathcal{C}\{Nu[k]\} = \frac{\pi}{4}\left(u_A[k] - 3u_C[k] + 3u_G[k] - u_T[k]\right), \qquad k = 0, 1, \ldots, N-1. \tag{13}$$

This is a one-dimensional, one-to-one representation of nucleotide sequences by the arguments of their complex representations, useful for the local analysis of nucleotide sequences. For the global study of the distribution of nucleotides and of di-nucleotides along a sequence, there are other phase signals that can be adapted from Signal Theory for the analysis of any sequence with complex entries. Two phase signals are particularly useful for this purpose: the cumulative phase and the unwrapped phase.

The cumulative phase is defined as the sum of the arguments of the complex representations of the sequence, seen as a signal, from the first sample (0) to the current, kth, sample:

$$\theta_c[k] = \sum_{h=0}^{k} \arg(\mathcal{C}\{Nu[h]\}), \qquad k = 0, 1, \ldots, N-1, \tag{14}$$

where $Nu[h]$ is the hth nucleotide, $h = 0, \ldots, k$, in the sequence of N nucleotides, and $\mathcal{C}\{Nu[h]\}$ is its complex representation, with the arguments belonging to the canonic domain $(-\pi, \pi]$. Using the cumulative occurrence sequences $c_A[k]$, $c_C[k]$, $c_G[k]$, $c_T[k]$, $k \in \{0, 1, \ldots, N-1\}$, defined in (4), the cumulative phase (14) can be written as:

$$\theta_c[k] = \frac{\pi}{4}\left[3\left(c_G[k] - c_C[k]\right) + \left(c_A[k] - c_T[k]\right)\right] = \frac{\pi}{4}\, N[k], \qquad k = 0, 1, \ldots, N-1, \tag{15}$$

where N[k] is the nucleotide imbalance,

$$N[k] = 3\left(c_G[k] - c_C[k]\right) + \left(c_A[k] - c_T[k]\right), \qquad k = 0, 1, \ldots, N-1, \tag{16}$$

a signature of the distribution of nucleotides in the sequence. The values $c_A[k]$, $c_C[k]$, $c_G[k]$, and $c_T[k]$ give the number of occurrences of the A, C, G, and T nucleotides, respectively, among the entries 0 through k of the sequence. If the distribution of nucleotides in each pair of the single-nucleotide classes weak (A-T) and strong (G-C) were balanced, the nucleotide imbalance would be close to zero along the sequence, which is not the case for prokaryotes [5].
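The cumulative phase and the nucleotide imbalance can be illustrated with a short sketch, again assuming Python with NumPy and hypothetical names (`ARG_QUARTERS`, `nucleotide_imbalance`, `imbalance_from_counts`, `cumulative_phase`). The sum is taken from sample 0, in agreement with the cumulative occurrence sequences (4); the code computes N[k] once as a running sum of the arguments (10) expressed in units of $\pi/4$, and once from the cumulative counts as in (16), and checks that the two agree.

```python
import numpy as np

# Arguments (10) expressed in units of pi/4.
ARG_QUARTERS = {"A": 1, "C": -3, "G": 3, "T": -1}

def nucleotide_imbalance(seq):
    """N[k] as the running sum of the arguments, in units of pi/4."""
    return np.cumsum([ARG_QUARTERS[n] for n in seq.upper()])

def imbalance_from_counts(seq):
    """N[k] = 3(c_G[k] - c_C[k]) + (c_A[k] - c_T[k]), following eq. (16)."""
    symbols = np.array(list(seq.upper()))
    c = {x: np.cumsum(symbols == x) for x in "ACGT"}
    return 3 * (c["G"] - c["C"]) + (c["A"] - c["T"])

def cumulative_phase(seq):
    """theta_c[k] = (pi/4) N[k], following eq. (15)."""
    return np.pi / 4 * nucleotide_imbalance(seq)

seq = "AGGCTA"  # hypothetical example string
N_k = nucleotide_imbalance(seq)
assert np.array_equal(N_k, imbalance_from_counts(seq))
print(N_k.tolist())
```

Working with the integer-valued N[k] rather than with $\theta_c[k]$ avoids floating-point rounding and keeps the direct counting interpretation of the signal.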
The unwrapped phase is defined as the phase of the elements in the sequence, corrected by adding or subtracting a multiple of $2\pi$ (i.e., $2\pi m$, $m \in \mathbb{Z}$, with $\mathbb{Z}$ the set of integers), in such a way that the absolute value of the phase difference between any two successive entries of the sequence becomes smaller than or equal to $\pi$:

$$\theta_u[0] = \arg(\mathcal{C}\{Nu[0]\}), \quad \theta_u[k] = \arg(\mathcal{C}\{Nu[k]\}) + 2\pi m,\; m \in \mathbb{Z},\; k = 1, \ldots, N-1, \quad \text{so that } |\theta_u[k] - \theta_u[k-1]| \le \pi. \tag{17}$$

The inequality $|\theta_u[k] - \theta_u[k-1]| \le \pi$ gives two solutions whenever $\arg(\mathcal{C}\{Nu[k]\})$ and $\arg(\mathcal{C}\{Nu[k-1]\})$ differ by exactly $\pi$ in absolute value. For the complex representation in Figure 1 and the arguments (10), this happens for AC, CA, GT, and TG. The standard MATLAB® unwrap function returns $\theta_u[k] = \theta_u[k-1] + \pi$ if $\arg(\mathcal{C}\{Nu[k]\}) > \arg(\mathcal{C}\{Nu[k-1]\})$, and $\theta_u[k] = \theta_u[k-1] - \pi$ in the opposite case. This is not convenient for the analysis of nucleotide sequences, where diametrically opposed pairs of nucleotides (such as AC and CA) should be equivalent with respect to the phase difference between the complex representations of their components. The two solutions must be considered together and as equally valid, so that the unwrapped phase should vary with $+\pi$ and $-\pi$ with equal probability, and the net result should be zero. As a consequence, we classify the 16 di-nucleotides into only three classes:

- neutral, comprising both the di-nucleotides whose components have superposed complex representations (AA, CC, GG, TT) and those whose components are diametrically opposed with respect to the origin (AC, CA, GT, TG), as shown in Figure 2a;
- positive, for which the complex representation of the second component is rotated in the trigonometric (counterclockwise) direction with respect to the first (AG, GC, CT, TA), Figure 2b;
- negative, for which the complex representation of the second component is rotated clockwise with respect to the first (AT, TC, CG, GA), Figure 2c.
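The three classes can be derived mechanically from the arguments (10). The sketch below is a minimal illustration in plain Python with NumPy, using hypothetical names (`PHASE_EIGHTHS`, `dinucleotide_step`, `running_step_sum`). It expresses each argument as an integer multiple of $\pi/4$ taken modulo 8, assigns every di-nucleotide a signed step of +1, -1 or 0 quarter-turns (treating the diametrically opposed pairs AC, CA, GT, TG as neutral, as argued above), and accumulates these steps along a sequence.

```python
import numpy as np

# Arguments (10) in units of pi/4, taken modulo 8 (-3 -> 5 for C, -1 -> 7 for T).
PHASE_EIGHTHS = {"A": 1, "G": 3, "C": 5, "T": 7}

# Phase difference (mod 8, units of pi/4) -> signed step in units of pi/2.
# 0: same point, 4: diametrically opposed; both are treated as neutral.
STEP = {0: 0, 2: +1, 4: 0, 6: -1}

def dinucleotide_step(pair):
    """Return +1 (positive), -1 (negative) or 0 (neutral) for a di-nucleotide."""
    x, y = pair.upper()
    return STEP[(PHASE_EIGHTHS[y] - PHASE_EIGHTHS[x]) % 8]

# Classify all 16 di-nucleotides.
classes = {"neutral": [], "positive": [], "negative": []}
for x in "ACGT":
    for y in "ACGT":
        name = {0: "neutral", 1: "positive", -1: "negative"}[dinucleotide_step(x + y)]
        classes[name].append(x + y)

def running_step_sum(seq):
    """Running sum of signed steps over successive di-nucleotides of seq."""
    steps = [dinucleotide_step(seq[k - 1:k + 1]) for k in range(1, len(seq))]
    return np.cumsum(steps)

for name, members in classes.items():
    print(name, sorted(members))
print(running_step_sum("AGCTA").tolist())
```

Multiplying the running sum by $\pi/2$ gives the change of the unwrapped phase relative to its initial value under this convention; integer arithmetic modulo 8 is used so that no floating-point comparison of angles is needed.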
For neutral di-nucleotides the unwrapped phase remains the same ($\Delta\theta_u[k] = 0$), for positive di-nucleotides it increases by $\Delta\theta_u[k] = +\pi/2$, and for negative di-nucleotides it decreases by $\Delta\theta_u[k] = -\pi/2$. Using (10) and (17), the unwrapped phase results:

$$\theta_u[k] = \theta_u[0] + \sum_{h=1}^{k} \Delta\theta_u[h] = \theta_u[0] + \frac{\pi}{2}\left(n_+[k] - n_-[k]\right) = \theta_u[0] + \frac{\pi}{2}\, P[k], \qquad k \in \{1, \ldots, N-1\}, \tag{18}$$

where $n_+[k]$ is the number of positive di-nucleotides (AG, GC, CT, TA) and $n_-[k]$ is the number of negative di-nucleotides (AT, TC, CG, GA) occurring among the first k di-nucleotides of the sequence (entries 0 through k). P[k] is the di-nucleotide imbalance, a signature of the distribution of di-nucleotides (pairs of nucleotides) in the sequence, given by:

$$P[k] = n_+[k] - n_-[k], \qquad k \in \{1, \ldots, N-1\}. \tag{19}$$

In all practical cases, $\theta_u[0]$ is negligible. As they have a direct statistical significance and are expressed by integer numbers, it is convenient to use the nucleotide imbalance (N) and the di-nucleotide imbalance (P) in genomic signal analysis, instead of the cumulative phase ($\theta_c$) and the unwrapped phase ($\theta_u$), respectively.

6 Conclusion

The Nucleotide Genomic Signal (NuGS) methodology that we have developed and used [4, 5, 6] is based on the conversion of symbolic nucleotide sequences into numeric genomic signals. The representation we adopted is unbiased, i.e., it is adequate for a large variety of problems related to DNA analysis. To achieve such versatility, the representation does not use the cardinality of numbers (their capacity to express and handle quantities), but their ordinality (their capacity to label and classify objects in order, and to handle the order of objects). The large scale analysis of genomic sequences reveals not only manifest features of extant DNA sequences, at the scale of entire chromosomes or genomes, but also ancestral features, which existed in DNA sequences and disappeared during evolution, under the pressure of species separation.
The local analysis of nucleotide sequences is important for the study of gene dynamics [8], especially for tracking the development of pathogen resistance to treatment. Such results can help fast diagnosis and early assessment of drug efficiency, allowing a systematic use of the advances in molecular medicine in support of clinical decisions [6, 9, 10]. Genome regularities also make it possible to predict the nucleotides in a DNA sequence, using a methodology similar to time series prediction, and to estimate the self-repair potential of the cell in processes such as replication, transcription or crossover. The striking regularities revealed by the NuGS approach correspond to surprising symmetries in the distribution of nucleotides and pairs of nucleotides along DNA sequences in archaea, bacteria and eukarya. As a consequence, a genome appears to be, from the structural point of view, more than a plain text, as it also satisfies regularities evoking rhythm and rhyme in poems.

References

[1] http://www.ncbi.nlm.nih.gov/Taxonomy/Utils/wprintgc.cgi
[2] R. Durbin, S. Eddy, A. Krogh, and G. Mitchison, Biological Sequence Analysis: Probabilistic Models of Proteins and Nucleic Acids, Cambridge University Press, Cambridge, UK, 1998.
[3] I. Cosic, Macromolecular bioactivity: Is it Resonant Interaction between Molecules? - Theory and Applications, IEEE Trans. on BME, 41, pp. 1101-1114, 1994.
[4] P. D. Cristea, Conversion of Nitrogenous Base Sequences into Genomic Signals, Journal of Cellular and Molecular Medicine, 6(2), pp. 279-303, 2002.
[5] P. D. Cristea, Representation and analysis of DNA sequences, in: E. Dougherty, I. Shmulevich, J. Chen, and Z. J. Wang (eds.), Genomic Signal Processing and Statistics, Eurasip Book Series on Signal Processing and Communications, Hindawi Publ. Corp., 2005, pp. 15-65.
[6] P. D. Cristea, Rodica Tuduce, J. Cornelis, R. Deklerck, I. Nastac, M. Andrei, Signal Representation and Processing of Nucleotide Sequences, Proc. of the 7th IEEE Intl. Conf. on Bioinformatics and Bioengineering, Harvard Medical School, Boston, USA, pp. 1214-1219, 2007.
[7] J. A. Berger, S. K. Mitra, and J. Astola, Power spectrum analysis for DNA sequences, Proc. of the 7th Intl. Symp. on Signal Processing and its Applications, 2, pp. 29-32, 2003.
[8] P. D. Cristea, Building Phylogenetic Trees by Using Gene Nucleotide Genomic Signals, 34th Annual International IEEE EMBS Conference, August 28 - September 1, 2012, Hilton, San Diego, California, USA.
[9] S. T. Cole, R. Brosch, J. Parkhill, et al., Deciphering the biology of Mycobacterium tuberculosis from the complete genome sequence, Nature, 393(6685), pp. 537-544, 1998.
[10] A. Telenti et al., Detection of rifampin-resistance mutations in Mycobacterium tuberculosis, Lancet, 341, pp. 647-650, 1993.