Journal of Medical and Biological Engineering, 33(5): 504-512

Choi-Williams Distribution to Describe Coding and Non-coding Regions in Primary Transcript Pre-mRNA

Umberto S. P. Melia1,2,3,*, Juan J. Gallardo3, Montserrat Vallverdú1,2,3, Alexandre Perera1,2,3, Francesc Clarià4, Pere Caminal1,2,3

1 Department of Automatic Control, Barcelonatech, 08028 Barcelona, Spain
2 Biomedical Engineering Research Centre, Barcelonatech, 08028 Barcelona, Spain
3 Centers of Biomedical Research Network of Bioengineering, Biomaterials and Nanomedicine, 08028 Barcelona, Spain
4 Department of Computer Science and Industrial Engineering, Lleida University, 25003 Lleida, Spain

Received 4 Nov 2011; Accepted 18 May 2012; doi: 10.5405/jmbe.1060

Abstract

Deoxyribonucleic acid (DNA) information is discrete in both “time” (sequence positions) and “amplitude” (nucleotide values). This permits the use of signal processing techniques for its characterization. The conversion of DNA nucleotide symbols into discrete numerical values enables signal processing to be employed to solve problems related to sequence analysis, such as finding coding sequences. In this work, a numerical conversion method was chosen based on the thermodynamic data of free energy changes (ΔG°) of the formation of a duplex structure of DNA or ribonucleic acid (RNA), associated with the nucleotide sequence of the pre-mRNA (messenger RNA). The aim of this work was to distinguish coding regions (exons) from non-coding regions (introns) using a methodology based on time-frequency representation (TFR). This permits the observation of the evolution of the periodicity and frequency components with time, introducing more variables related to the gene sequences than those used in traditional fast Fourier transform analysis. The parameters calculated from TFR are instantaneous frequency and instantaneous power.
It was found that instantaneous frequency and power variables in different frequency bands allowed the correct classification of exons and introns with a prediction accuracy of more than 85%.

Keywords: Bioinformatics (genome) databases, Classification and feature extraction, Stochastic processes, Time series analysis

* Corresponding author: Umberto S. P. Melia. Tel: +34-93-4017160; Fax: +34-93-4017045; E-mail: [email protected]

1. Introduction

The genetic information necessary to reproduce, develop, and keep alive an organism is coded in the deoxyribonucleic acid (DNA) sequences that are mostly inside the cell nucleus. DNA comprises a succession of the nucleotides adenine (A), thymine (T), cytosine (C), and guanine (G). This sequence contains sections called genes that contain the information for building proteins, the functional units of a living organism. When a gene is expressed, the sequence of DNA is used as a pattern to synthesize ribonucleic acid (RNA) during a process called transcription. The result is a single-stranded copy, the primary transcript pre-mRNA (precursor messenger RNA), in which thymine is replaced with uracil (U). In eukaryotes, genes are further divided into relatively small protein-coding segments known as exons, interrupted by non-coding spacers known as introns, which are removed during the splicing process for the formation of mature RNA. This is the final chain made of continuous sequences of coding regions that can be translated into proteins. In complex organisms, the primary RNA transcript can be alternatively edited, so that the set of exons finally expressed can vary in response to specific biological signals.

In these organisms, two levels of molecular machinery are involved in the splicing of pre-mRNA transcripts: the basal machinery and the gene regulation system. The basal machinery, which is found in all organisms whose genome contains introns, consists of five small nuclear RNA molecules (snRNA) [1].
These molecules, which are formed by a few nucleotides, bind to certain proteins to form the spliceosome complex that performs the two primary functions of splicing: recognition of the intron/exon boundaries and catalysis of the cut-and-paste reactions that remove introns and join exons [2]. For each intron, four nucleotide sequences work as signals that tell the spliceosome where to cut: the 5' cut point at the start of the intron, the 3' cut point at its end, the branching area in the middle, and the polypyrimidine stretch [3]. Moreover, the gene regulation system controls the process of splicing and cutting by driving the basal machinery to these cut points [4,5].

The classification of human gene sequences into exons and introns is a difficult problem in DNA sequence analysis. The prediction of genes and the classification of coding and non-coding DNA sequences are popular research areas. In the past twenty years, numerous advanced statistical gene-finding algorithms have been developed using two approaches for computational gene prediction. One is gene structure and signal-based searches (ab initio prediction of coding or non-coding regions) and the other uses algorithms based on knowledge of other organisms. Several complex systems for predicting gene structure have been developed. Many gene-finding systems are mainly based on a statistical or neural network approach [6-11], or use an optimization algorithm. In one study, a dynamic program was applied to discriminate between exons and pseudo-exons and between introns and pseudo-introns of yeast DNA [12]. A two-dimensional (2D) graphical representation was used by Nandy [13] in order to provide an indication of possible exon regions within a long DNA sequence. Wu et al.
[14] proposed a set of statistical features, called SZ features, and combined them with other features as the information of the stop codon (a nucleotide triplet within mRNA that indicates a termination of translation). A significant improvement in the recognition accuracy was achieved with short sequences of DNA. Chang introduced various visualization and graphical representation schemes of symbolic DNA sequences [15]. Several visualization schemes were reviewed and the techniques compared by analyzing possible applications for sequence alignment, feature extraction, and sequence clustering. Particularly, a study by Lo et al. [16] used a three-dimensional trajectory method for DNA sequence visualization. Significant pattern disparity, helpful in monitoring the deliberate deposition of false sequences in public domains, was revealed. Using a representation in which the four bases correspond to four vertices of a tetrahedron, they observed local differences and similarities between two DNA sequences. Since DNA information is discrete in both time and amplitude, digital signal processing techniques are an efficient approach for its investigation. Several studies have tried to demonstrate that the DNA periodicity in the coding region is contributed by codon usage frequencies, as well as the lack of this periodicity in introns. This periodicity is introduced by the coding biases in the translation of codons into amino acids [17,18]. Thus, Fourier transform analysis has been widely used for sequence processing [19-21]. A method was developed based on auto-regressive modeling using forward-backward linear prediction and the singular value decomposition algorithm [22]. One study [23] found that a frequency-domain approach can help the extraction of exon and intron characteristics. In fact, it was demonstrated [24] that the frequency spectrum has some perturbations when an exon is present. 
The Fourier transform poses problems of windowing or data truncation artifacts and spurious spectral peaks, and thus can only reveal the global periodicity of stationary signals, losing the time dependence. In contrast, a time-frequency representation (TFR) permits the observation of the evolution of the periodicity and frequency components with time, allowing the analysis of non-stationary signals [25]. Moreover, this representation, which keeps the time dependence of signal features, makes it possible to introduce more variables related to the gene sequences than a traditional analysis allows.

The conversion into numerical values of the symbols associated with the nucleotide sequence of DNA permits problems related to the localization and annotation of genes and coding sequences to be addressed using signal processing tools. It is worth mentioning that in this representation, time is associated with the position of nucleotides along the gene sequence.

In the present work, an analysis based on the TFR, from a Choi-Williams distribution (CWD), of a large number of gene sequences was performed in order to find which variables, related to pre-mRNA, can better discriminate coding regions from non-coding regions. The nucleotide sequences were translated into numerical sequences by Gibbs energy conversion and then TFR analysis was applied. From the TFR, several variables were proposed and statistically analyzed in order to differentiate exon sequences, intron sequences, and their partial sequences.

This work was conducted in order to study the reliability of constructing a detector of intron and exon sequences based on TFR for future research. There are various detectors that can approximate the position of the start and end of exons by applying statistical tools to DNA sequences in FASTA string format. One of the detectors with high accuracy uses a three-periodic fifth-order hidden Markov model [26,27].
The present study tests the capability of time-frequency functions to differentiate exons and introns in order to apply them in a future detector instead of FASTA string DNA.

2. Materials and methods

2.1 Analyzed database

About 1000 human gene sequences were processed, taken from two chromosomes (chromosome 1 and chromosome 2) with the corresponding annotation of the positions of exons and introns. These sequences with their respective annotation were taken from the public database Ensembl (2009) [28] using the R package biomaRt [29]. Software was developed in order to organize and manage gene sequence data and particular information. About 4000 exons and 10000 introns were contained in the analyzed datasets. The minimum length taken into account for introns and exons was 100 nucleotides.

2.2 Analysis of sequences

In order to characterize coding and non-coding regions in primary transcript pre-mRNA, different analyses were performed, taking into account different parts of the gene. In this way, exon and intron sequences were analyzed with different approaches:

1. Study A: Discrimination between the complete sequences of exons and the complete sequences of introns.
2. Study B: Classification of fixed-length windows sliding on the complete gene sequences.
3. Study C: Discrimination between the complete sequences of exons and the beginning part or the ending part of introns.

Study B was used to define whether a selected window belongs to an exon or an intron sequence. Several lengths and several overlaps of a window were taken into account. In this way, a window belongs to one of three kinds of region depending on which part of the gene is studied:

1. Pure exon region: all nucleotides of the window belong to an exon sequence.
2. Pure intron region: all nucleotides of the window belong to an intron sequence.
3. Transition region between exon and intron nucleotides: one part of the window contains exon (or intron) nucleotides and the other part contains intron (or exon) nucleotides. In this case, the window contains the point where the exon (or intron) ends and the intron (or exon) starts.

The classes considered for the classification were exon region and intron region; therefore, the transition regions were considered exon or intron depending on the prevalence of nucleotides belonging to an exon or an intron region. In the calculation of the results, pure exons, pure introns, and also the influence of the transition regions were taken into account.

Finally, the results of all the studies were obtained by looking for the best single variable and the best combined variables using a genetic algorithm (GA).

2.3 Numerical representation of genomic sequences

The thermodynamic values of changes of enthalpy (ΔH°), entropy (ΔS°), and free energy (ΔG°) of the formation of a duplex structure of DNA or RNA can be calculated from the thermodynamic data library [30]. They are based on the nearest-neighbor interactions between 10 base pairs of nucleotides: AA/UU, AU/UA, UA/AU, CG/GC, GC/CG, CU/GA, GA/UC, GU/AC, AC/GU, and GG/CC. These thermodynamic values are also useful for predicting the energy required to break the bond between complementary nucleotides of the double helix. The values of the parameters of these Watson-Crick base pairs are shown in Table 1 [31].

Table 1. Energy values for pairs of nucleotides.

Sequence    ΔH° (kcal/mol)    ΔS° (cal/mol/K)    ΔG° (kcal/mol)
dAA/dUU     –8.0              –21.9              –1.2
dAU/dUA     –5.6              –15.2              –0.9
dUA/dAU     –6.6              –18.4              –0.9
dCA/dGU     –8.2              –21.0              –1.7
dCU/dGA     –6.6              –16.4              –1.5
dGA/dCU     –8.8              –23.5              –1.5
dGU/dCA     –9.4              –25.5              –1.5
dCG/dGC     –11.8             –29.0              –2.8
dGC/dCG     –10.5             –26.4              –2.3
dGG/dCC     –10.9             –28.4              –2.1

Thermodynamic parameters for DNA-RNA obtained at 310.15 K with an estimated error below 5-6%.

The thermodynamic variables include information about the stability of the sequence.
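The conversion from nucleotide symbols to ΔG° values can be sketched as follows. This is a minimal illustration, not the authors' software: the function name `gibbs_signal`, the one-nucleotide sliding step, and the use of reverse-complement symmetry to cover the six dinucleotides that Table 1 does not list directly are all assumptions made here for the example.

```python
# Free energies (kcal/mol) copied from the ΔG° column of Table 1, keyed by the
# first dinucleotide of each "dXY/dZW" label (assumed to be the strand of interest).
DELTA_G = {
    "AA": -1.2, "AU": -0.9, "UA": -0.9, "CA": -1.7, "CU": -1.5,
    "GA": -1.5, "GU": -1.5, "CG": -2.8, "GC": -2.3, "GG": -2.1,
}
COMPLEMENT = {"A": "U", "U": "A", "C": "G", "G": "C"}


def reverse_complement(pair):
    """Return the reverse complement of a dinucleotide, e.g. 'UG' -> 'CA'."""
    return "".join(COMPLEMENT[base] for base in reversed(pair))


def gibbs_signal(seq):
    """Map a nucleotide string to a list of ΔG° values, one per adjacent pair.

    Assumption: dinucleotides absent from Table 1 take the value of their
    reverse complement (the same stack read from the opposite strand).
    Characters other than A, C, G, T, U raise KeyError.
    """
    seq = seq.upper().replace("T", "U")  # DNA input is read as transcript
    signal = []
    for i in range(len(seq) - 1):
        pair = seq[i:i + 2]
        if pair not in DELTA_G:
            pair = reverse_complement(pair)
        signal.append(DELTA_G[pair])
    return signal


print(gibbs_signal("ATGC"))
```

A sequence of N nucleotides yields N-1 values under this sliding scheme; whether the original study used the same step and boundary handling is not stated in the text.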
The Gibbs free energy (ΔG°) was used in order to convert into numeric values the symbols associated with the nucleotide sequence of pre-mRNA. Thus, the obtained sequence is a numerical sequence with the advantage of having a physical meaning. It accounts for the ease with which a group of nucleotides binds to its complementary set, a circumstance that occurs continuously during the editing process in the pre-mRNA sequence. This happens during the cutting of introns and the union of exons in order to generate the mRNA sequence that carries the protein code.

The obtained sequences are a succession of thermodynamic values that can be treated as a non-stationary stochastic process with a non-zero mean value. Thus, the numerical representation of the nucleotide sequence can be considered a signal sampled at intervals of 1 second. Consequently, the maximum bandwidth frequency is 0.5 Hz in accordance with the Nyquist theorem.

2.4 Time-Frequency representation tool

For continuous time and frequency variables, the Wigner distribution (WD) for signal x(t) is defined as:

$$W_x(t,f) = \int x\left(t + \frac{\tau}{2}\right) x^{*}\left(t - \frac{\tau}{2}\right) e^{-j 2 \pi f \tau}\, d\tau \tag{1}$$

There are many other quadratic TFRs with an energetic interpretation [32]. The class of all time-frequency shift-invariant quadratic TFRs is known as the quadratic Cohen’s class. Prominent members of Cohen’s class are the spectrogram and the WD. Every member of Cohen’s class may be interpreted as a 2D filtered WD. In fact, it can be shown that T_x(t, f) is a member of Cohen’s class if and only if it can be derived from the WD of the signal x(t) via a time-frequency convolution:

$$T_x(t,f) = \iint h(t - t', f - f')\, W_x(t', f')\, dt'\, df' \tag{2}$$

Each member T_x of Cohen’s class is associated with a unique, signal-independent kernel function h(t, f) (or 2D filter). The convolution is transformed into a simple multiplication in the Fourier transform domain. For the analysis of each gene, a particular Cohen’s class distribution was chosen, namely the CWD [33].
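As an illustration of Eq. (1), the sketch below evaluates a discrete Wigner distribution of a complex (analytic) sequence on a chosen frequency grid, using the substitution τ = 2m so that only integer sample indices are needed. It is the plain WD, not the CWD: the kernel smoothing of Eq. (2) is not applied, and the function name `wigner` and the truncation of the lag range at the sequence edges are assumptions of this example.

```python
import cmath


def wigner(z, freqs):
    """Discrete Wigner distribution of a complex sequence z.

    Returns a list of rows; row n holds W(n, f) for every f in freqs,
    where f is in cycles per sample (0 <= f < 0.5 for an analytic signal).
    """
    n_samples = len(z)
    tfr = []
    for n in range(n_samples):
        # Largest lag that keeps both n + m and n - m inside the sequence.
        max_lag = min(n, n_samples - 1 - n)
        row = []
        for f in freqs:
            acc = 0.0
            for m in range(-max_lag, max_lag + 1):
                term = z[n + m] * z[n - m].conjugate() * cmath.exp(-4j * cmath.pi * f * m)
                # Terms at +m and -m are complex conjugates, so the sum is real.
                acc += term.real
            row.append(2.0 * acc)
        tfr.append(row)
    return tfr


# Usage: a single complex tone at 0.2 cycles/sample concentrates at f = 0.2.
tone = [cmath.exp(2j * cmath.pi * 0.2 * n) for n in range(64)]
grid = [0.05 * k for k in range(10)]
middle_row = wigner(tone, grid)[32]
print(grid[max(range(len(grid)), key=lambda k: middle_row[k])])
```

For a real-valued sequence such as the ΔG° signal, the analytic signal would have to be computed first, as the text notes; that step is left out here.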
The CWD was applied to the numerical sequences, after the calculation of the analytic signal, using Eq. (2) with the WD in Eq. (1) and the following Choi-Williams exponential kernel:

$$h(t,f) = \sqrt{\frac{4\pi}{c}}\; e^{-\frac{4 \pi^{2} (t f)^{2}}{c}} \tag{3}$$

The distribution obtained with the kernel in Eq. (3) preserves the properties of the WD [33,34], such as the marginal properties and instantaneous frequency. Moreover, it is able to reduce the WD interference by estimating an adequate c parameter for the working band. The proposed estimation criterion was:

$$1 - \frac{A_{mb}}{A_{mt}} \geq 0.95 \tag{4}$$

where A_mb represents the mean value of the spectral amplitude between test tones and A_mt is the mean value of the spectral amplitude at the test tones [34]. The parameter value of the CW exponential was estimated at 0.005 in order to eliminate the interferences produced by a sinusoidal test signal made of four frequency components at 0.05, 0.15, 0.3, and 0.45 Hz. The WD of this test signal and the TFR without interferences obtained by the CWD are shown in Figs. 1(a) and 1(b), respectively.

Figure 1. (a) WD interferences of a simulated signal and (b) WD without interferences obtained using CWD.

Figure 2 shows an example of the CWD of a gene (FH gene), normalized by its maximum value, with the exon and intron edges indicated. For the same gene, Fig. 3 shows the instantaneous frequency along the nucleotide positions in the TF band of an intron and an exon. The dynamics of the intron and exon are different, with the behavior observed for the intron being more regular. Another characterization of these sequences is shown in Fig. 4, where the instantaneous power is divided into the four bands defined in this study. The dynamics of the evolution of the intron and exon sequences are different. In order to describe these differences, variables that are able to quantify them should be considered.
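The instantaneous power and instantaneous frequency curves used in this study (Eqs. (5) and (6)) can be sketched for one time instant of a sampled TFR as shown below. The band limits are those given in the paper; the function name `band_features`, the half-open band intervals, and the normalization of the band power by the total power at the same instant are assumptions of this example.

```python
# Frequency bands (Hz) as defined in the paper.
BANDS = {
    "VLF": (0.0, 0.1),
    "LF": (0.1, 0.2),
    "HF": (0.2, 0.4),
    "VHF": (0.4, 0.5),
    "TF": (0.0, 0.5),
}


def band_features(tfr_row, freqs, f1, f2):
    """Return (normalized band power, mean frequency in the band).

    tfr_row holds the TFR values at one time instant, sampled on the
    uniformly spaced frequencies in freqs. Integrals are approximated by
    sums; the frequency step cancels in both ratios. The mean frequency
    is None when the band carries no power.
    """
    total = sum(tfr_row)
    in_band = [(f, p) for f, p in zip(freqs, tfr_row) if f1 <= f < f2]
    band_power = sum(p for _, p in in_band)
    if band_power == 0:
        inst_freq = None
    else:
        inst_freq = sum(f * p for f, p in in_band) / band_power
    norm_power = band_power / total if total != 0 else None
    return norm_power, inst_freq


# Usage on a toy row: most of the power sits in the highest bin.
freqs = [0.05, 0.15, 0.25, 0.35, 0.45]
row = [1.0, 1.0, 2.0, 2.0, 4.0]
for name, (lo, hi) in BANDS.items():
    print(name, band_features(row, freqs, lo, hi))
```

Applying the function to every row of a TFR gives one power curve and one frequency curve per band, which are the sequences from which the statistical variables are later derived.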
In order to calculate variables that might reflect characteristics of the signal in different frequency ranges, the spectrum was divided into the following frequency bands: very low frequency (VLF), 0-0.1 Hz; low frequency (LF), 0.1-0.2 Hz; high frequency (HF), 0.2-0.4 Hz; very high frequency (VHF), 0.4-0.5 Hz; total frequency (TF), 0-0.5 Hz. Instantaneous power was calculated for each gene signal as the integral of the CWD over frequency in each of the bands (Eq. (5)). Subsequently, for each band, the instantaneous frequency function was calculated [35] as the average frequency of the spectrum with respect to time (Eq. (6)).

$$E_{norm}(t) = \frac{\int_{f_1}^{f_2} T_x(t,f)\, df}{\int T_x(t,f)\, df} \tag{5}$$

$$f_i(t) = \frac{\int_{f_1}^{f_2} f\, T_x(t,f)\, df}{\int_{f_1}^{f_2} T_x(t,f)\, df} \tag{6}$$

where f1 and f2 are the limits of the frequency bands.

Figure 2. CWD of FH gene (blue lines divide exon and intron regions).

Figure 3. Instantaneous frequency of (a) intron and (b) exon.

Figure 4. Instantaneous power of (a) intron and (b) exon.

2.5 Shannon and Rényi entropies

In addition to traditional variables (mean, standard deviation, and maximum-minimum value), Shannon and Rényi entropies were used to quantify the regularity, uncertainty, or randomness of the sequences. They were applied to each sequence of instantaneous frequency and instantaneous power calculated from each CWD and to the original sequences of the Gibbs energy for each gene.

Let X be a discrete random variable which takes a finite number of possible values x1, x2, x3, ..., xn with probabilities P(1), P(2), P(3), ..., P(n), respectively, such that P(i) ≥ 0, i = 1, 2, 3, ..., n, and $\sum_{i=1}^{n} P(i) = 1$. Shannon (ShaEn) and Rényi (REq) entropies are respectively defined as:

$$ShaEn = -\sum_{i=1}^{Q} P(i) \log_2 (P(i)) \tag{7}$$

$$RE_q = \frac{1}{1-q} \log_2 \left( \sum_{i=1}^{Q} P^{q}(i) \right) \tag{8}$$

where Q = 8 represents the number of levels in which the signals are quantized. Values of Rényi entropies for q = {0.15,0.25,0.5,2,3} were calculated. In Eq. (8), the largest probabilities most influence the Rényi entropy when q > 1 and the smallest probabilities most influence the value of Rényi entropy when 0 < q < 1. The Rényi entropy converges to the Shannon entropy when q → 1 [35].

2.6 Definition of variables

From the calculated parameters (instantaneous frequency and instantaneous spectral energy) and the original sequences of the Gibbs energy along nucleotide position, a total of 316 variables were defined. In the analysis, the original sequences of the Gibbs energy were considered in order to determine whether this representation is sufficient to obtain acceptable results without introducing TFR parameters.

For each sequence, the following parameters were calculated: mean, standard deviation, value of the peak, peak position, minimum value, and Shannon and Rényi entropies. All these variables were obtained from the complete sequence and from the n = {25,50,75,100} nucleotides at the start and at the end of each intron and exon. However, Shannon and Rényi entropies were only calculated for the complete sequence in order to avoid uncertainty in the calculation of the probability of short signals, since their consistency decreases with decreasing number of data points. The position of the peak was calculated only for the n samples at the start and at the end of intron and exon sequences, because on the complete sequence this value depends on the length of the sequence, which could influence the statistical classification. In these n samples, it was considered that there is a peak if the peak value exceeds 75% of the maximum value of the entire sequence; otherwise it was assumed that no maximum value exists in these regions. In this way, all variables become independent of the length of the sequence.

2.7 Statistical analysis

Two classes are considered for the statistical analysis: exons and introns. The Mann-Whitney U test was calculated for each defined variable in order to find the variables that best classify introns versus exons. A statistical significance level of p < 0.05 was considered for the analysis. The Mann-Whitney test is a non-parametric test, independent of the kind of distribution, used to compare two independent classes of sampled data and to assess whether two independent samples of observations come from the same distribution. The considered populations were exons and introns.

A discriminant analysis between both classes was then performed. The aim of the discriminant analysis was to study how well the defined variables were able to classify the data into the two classes, exons and introns. For this purpose, classification functions were constructed for each variable and also for combined variables. Each function allows classification scores to be computed for each variable value calculated for each class (exons and introns). A quadratic function, Eq. (9), was used for this classification. In this formula, the subscript i denotes the respective class, the subscripts j = 1, 2, ..., m denote the variable under consideration taken from the total number of calculated variables, c_i is a constant for each class, w_ij is the weight for the jth variable in the computation of the classification score for the ith class, and x_j is the observed value for the respective jth variable. The resultant classification score is S_i.

$$S_i = (c_i + w_{i1} x_1 + w_{i2} x_2 + \dots + w_{im} x_m)^2 \tag{9}$$

The classification function is used to directly compute classification scores for new observations. The first step for classification is to build a discriminant function by estimating the coefficients w_in; the aim is to make S_i as different as possible between the classes (introns and exons). The optimal coefficients are those that maximize the ratio in Eq. (10).

$$R = \frac{\text{quadratic sum of between-class differences}}{\text{quadratic sum of within-class differences}} \tag{10}$$

where the numerator contains the differences of the terms w_i'n x_n and w_i''n x_n, with i' ≠ i'', and the denominator contains the differences of the terms w_i'n x_n and w_i''n x_n, with i' = i'' [36].

The classification algorithm uses half of the genes for training to build a classification function, and the other half for validating its accuracy. The percentage of correctly classified exons and introns was respectively calculated as the ratio of the number of exons or introns correctly classified to the total number of exons or introns, respectively.

A feature selection based on a GA was used in order to find an optimal combination of variables, chosen from all variables, which provides high accuracy in the classification of exons and introns. This allows a set of variables with good percentages of correct classification to be obtained without exploring all combinations. This algorithm performs a number of iterations using Eq. (9) with different variables randomly chosen in an attempt to get a cost function that maximizes the percentages of correctly classified exons and introns. A broader exploration of the combinations of the defined variables is thus achieved.

3. Results

After obtaining the classification function constructed using half of the dataset by the training procedure, the best classification variables were validated using the other half of the dataset. Results presented in Table 2 show the best percentages of the validated variables.

For study A, only entropies of instantaneous frequency and spectral energy were able to characterize (p < 0.0005) exons and introns with a prediction accuracy higher than 60%. Table 2 shows the mean value, the standard deviation, and the percentages of correctly classified exons and introns for the best variables. It can be noted that the mean value of entropy of exons is always higher than that of introns. Furthermore, Rényi entropy with q > 2 was not able to discriminate exons and introns with a prediction accuracy higher than 60%. The best prediction accuracy was obtained with the variable of Rényi entropy (q = 0.25) of instantaneous power (Inst_Pw_HF_RE025).

Combining variables using the GA led to higher percentages of correctly classified exon and intron sequences. The GA was applied to three datasets of variables:

1. Only the best 20 single variables with statistical significance level of p < 0.0005 and prediction accuracy higher than 60%.
2. All the variables calculated from chromosome 1.
3. All the variables calculated from chromosome 2.

Table 3 shows the best combination of variables obtained from these datasets.
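The entropy variables that rank highest in Table 2 follow Eqs. (7) and (8) applied to a sequence quantized into Q = 8 levels. The sketch below is one possible reading: the uniform quantization between the minimum and maximum of the sequence, and the function names, are assumptions of this example, since the text specifies only the number of levels.

```python
import math


def quantize(signal, levels=8):
    """Map each value to an integer level 0..levels-1 (uniform bins, assumed)."""
    lo, hi = min(signal), max(signal)
    if hi == lo:
        return [0] * len(signal)
    width = (hi - lo) / levels
    # The maximum value falls exactly on the upper edge, so clamp it.
    return [min(int((v - lo) / width), levels - 1) for v in signal]


def level_probabilities(symbols, levels=8):
    """Relative frequency of each occupied level (empty levels are dropped)."""
    counts = [0] * levels
    for s in symbols:
        counts[s] += 1
    return [c / len(symbols) for c in counts if c > 0]


def shannon_entropy(probs):
    """Eq. (7): Shannon entropy in bits."""
    return -sum(p * math.log2(p) for p in probs)


def renyi_entropy(probs, q):
    """Eq. (8): Rényi entropy of order q in bits (q must differ from 1)."""
    if q == 1:
        raise ValueError("q = 1 is the Shannon limit; use shannon_entropy")
    return math.log2(sum(p ** q for p in probs)) / (1.0 - q)


# Usage: a sequence that visits all 8 levels equally reaches the maximum, 3 bits.
probs = level_probabilities(quantize(list(range(8))))
print(shannon_entropy(probs))
print([round(renyi_entropy(probs, q), 6) for q in (0.15, 0.25, 0.5, 2, 3)])
```

With Q = 8 the entropy cannot exceed log2(8) = 3 bits, which is consistent with the entropy means of roughly 2.6 to 3.0 reported in Table 2.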
The best accuracy was obtained with the variables obtained from chromosome 2. It should be noted that the single variables with the highest accuracy do not give the best accuracy when combined. However, the accuracy increased when those variables were combined with the remaining variables.

Table 2. Best single variables for study A.

Chromosome 1 (Nex = 1960, Nintr = 3557)

Variable               Mean (std) for exons   Mean (std) for introns   Correctly classified exons (%)   Correctly classified introns (%)   Accuracy (%)
Inst_Freq_VHF_RE015    2.7869 (0.1638)        2.7016 (0.1526)          61.0                             65.1                               64.1
Inst_Freq_VHF_RE025    2.7457 (0.1530)        2.6240 (0.1554)          64.7                             67.1                               66.4
Inst_Freq_VHF_RE05     2.7548 (0.1547)        2.6265 (0.1588)          67.9                             66.0                               67.3
Inst_Freq_VHF_ShaEn    2.8997 (0.1239)        2.8149 (0.1215)          67.1                             62.4                               64.2
Inst_Freq_VHF_RE2      2.7168 (0.1490)        2.6276 (0.1348)          61.5                             66.6                               65.8
Inst_Freq_HF_RE05      0.7722 (0.0209)        0.7505 (0.0261)          77.2                             62.1                               67.3
Inst_Freq_HF_ShaEn     2.9351 (0.1231)        2.8186 (0.1395)          73.0                             62.6                               66.5
Inst_Freq_LF_ShaEn     2.9289 (0.1214)        2.8158 (0.1372)          71.3                             62.7                               66.2
Inst_Freq_VLF_ShaEn    2.9649 (0.1258)        2.8834 (0.1283)          67.5                             61.2                               63.7
Inst_Freq_ShaEn        2.8669 (0.1650)        2.7867 (0.1473)          60.8                             64.6                               63.0
Inst_Pw_HF_RE025       0.6354 (0.0600)        0.5949 (0.0555)          60.0                             72.0                               68.3

Chromosome 2 (Nex = 2116, Nintr = 5761)

Variable               Mean (std) for exons   Mean (std) for introns   Correctly classified exons (%)   Correctly classified introns (%)   Accuracy (%)
Inst_Freq_VHF_RE015    2.8236 (0.1481)        2.6854 (0.1521)          67.9                             69.1                               69.5
Inst_Freq_VHF_RE025    2.7655 (0.1486)        2.6050 (0.1555)          73.6                             69.0                               70.8
Inst_Freq_VHF_RE05     2.7682 (0.1500)        2.5961 (0.1588)          73.1                             70.4                               71.9
Inst_Freq_VHF_ShaEn    2.9212 (0.1135)        2.7894 (0.1256)          77.1                             65.9                               69.7
Inst_Freq_VHF_RE2      2.7428 (0.1367)        2.6009 (0.1381)          71.9                             68.7                               70.2
Inst_Freq_HF_RE05      0.7747 (0.0200)        0.7450 (0.0270)          82.9                             65.5                               70.9
Inst_Freq_HF_ShaEn     2.9480 (0.1190)        2.7900 (0.1421)          78.4                             68.3                               71.8
Inst_Freq_LF_ShaEn     2.9447 (0.1184)        2.7967 (0.1381)          77.7                             66.9                               70.7
Inst_Freq_VLF_ShaEn    2.9921 (0.1124)        2.8680 (0.1308)          74.8                             65.2                               68.6
Inst_Freq_ShaEn        2.9071 (0.1346)        2.7656 (0.1514)          76.0                             64.0                               67.5
Inst_Pw_HF_RE025       0.6412 (0.0584)        0.5901 (0.0536)          66.7                             72.4                               71.4

Variables with statistical significance level p < 0.0005 for the classification of exons and introns. Inst: instantaneous; Freq: frequency; Pw: power; std: standard deviation; REq: Rényi entropy with q = {0.15,0.25,0.5,2,3}; ShaEn: Shannon entropy; Nex: number of exons; Nintr: number of introns.

Table 3. Best combination of variables for study A (percentages are correctly classified exons and introns).

Combination *
Variables: Inst_Freq_VHF_RE015, Inst_Freq_VHF_RE025, Inst_Freq_VHF_RE05, Inst_Freq_VHF_ShaEn, Inst_Freq_VHF_RE2, Inst_Freq_HF_RE025, Inst_Freq_HF_RE05, Inst_Freq_HF_ShaEn, Inst_Freq_LF_RE025, Inst_Freq_LF_RE05, Inst_Freq_LF_ShaEn, Inst_Freq_VLF_RE025, Inst_Freq_VLF_RE05, Inst_Freq_VLF_ShaEn, Inst_Freq_ShaEn
Chromosome 1: exon 73.9%, intron 73.1%, accuracy 73.4%
Chromosome 2: exon 73.6%, intron 77.2%, accuracy 76.2%

Combination **
Variables: Inst_Freq_HF_mean_End50, Inst_Freq_VHF_mean_End25, Inst_Freq_LF_std_End75, Inst_Freq_HF_std_End100, Inst_Freq_HF_RE05, Inst_Freq_LF_RE2, Inst_Freq_RE3, Inst_Pw_mean_End100, Inst_Pw_HF_mean_End75, Inst_Pw_HF_std_End50, Inst_Pw_VLF_peak_Start100, Inst_Pw_LF_peak_Start75, Inst_Pw_VLF_min_End100, Inst_Pw_VLF_tauMx_Start100, Inst_Pw_VHF_ShaEn, Inst_Pw_HF_RE015
Chromosome 1: exon 78.7%, intron 78.2%, accuracy 78.5%
Chromosome 2: exon 79.8%, intron 76.3%, accuracy 77.9%

Combination ***
Variables: Inst_Freq_HF_mean_Start25, Inst_Freq_VHF_mean_End50, Inst_Freq_LF_std, Inst_Freq_LF_std_Start100, Inst_Freq_VHF_std_Start25, Inst_Freq_LF_RE025, Inst_Freq_VLF_RE05, Inst_Freq_mean, Inst_Freq_ShaEn, Inst_Pw_VLF_std_End25, Inst_Pw_VLF_std_End75, Inst_Pw_HF_std_End50, Inst_Pw_HF_peak_Start25, Inst_Pw_VLF_min_End75, Inst_Pw_HF_tauMx_Start100, Inst_Pw_VLF_tauMx_End100, Inst_Pw_VHF_RE025
Chromosome 1: exon 75.8%, intron 80.1%, accuracy 79.7%
Chromosome 2: exon 82.8%, intron 82.4%, accuracy 83.0%

*Best combination found for the 20 best single variables
**Best combination found for chromosome 1
***Best combination found for chromosome 2
Variables with statistical significance level p < 0.0005 for the classification of exons and introns. Inst: instantaneous; Freq: frequency; Pw: power; std: standard deviation; REq: Rényi entropy with q = {0.15,0.25,0.5,2,3}; ShaEn: Shannon entropy; min: minimum value; tauMx: normalized position of maximum value; Start/End_n: first/last n nucleotides.

For study B (Table 4), the best classification for chromosome 1 was obtained with a window length of 300 nucleotides. The prediction accuracy was higher when only the pure regions of introns and exons were taken into account. Furthermore, the prediction accuracy was low when only the transition region was taken into account. This means that if the window is placed on the transition region, the characteristics of both exons and introns are mixed, and the difference is not statistically significant because the features that can differentiate the two classes are hidden.

Table 5 shows the subset of variables that gives the highest prediction accuracy for both studies A and B, for both chromosomes. These results were obtained by combining the variables of instantaneous frequency at HF and VHF. The best results were obtained using the entire sequences, but it can be noted that sequences of 300 nucleotides were sufficient for characterizing exons and introns with an accuracy of about 65%.

In study C, the best subset of variables for discrimination between the complete sequence of exons and the 100 nucleotides at the start (AEIS, all exon versus intron start) and at the end (AEIE, all exon versus intron end) of introns is shown in Table 6 for both chromosomes. This study had the highest prediction accuracy, higher than 85%.

The obtained results seem to show that using only the original sequences of Gibbs energy is insufficient to discriminate exons and introns with acceptable accuracy. In fact, only one variable related to Gibbs energy is included in the subsets of variables that give the best prediction accuracy, and this occurred only for study B. All other selected variables are related to the parameters obtained from TFR, such as instantaneous frequency and power.
The best results were obtained by combining fewer than 20 TFR variables. Adding more variables did not significantly increase the prediction accuracy. Furthermore, the variables related to the first or last 100 nucleotides in study A gave higher prediction accuracy than variables related to the first and last 75, 50, and 25 nucleotides. Therefore, for study C, only variables related to the first and last 100 nucleotides were used to classify the complete sequences of exons versus the beginning or the end parts of introns.

In this preliminary evaluation, the sequences had very different lengths, especially in study A, since introns are generally longer than exons. Of note, all proposed variables are independent of the length of the sequence. Consequently, the results are not influenced by differences in sequence length.

The prediction accuracy could also be influenced by the presence of unknown alternative splicing in the intron sequences. This was confirmed by Pallejà et al. [37], who identified the main cause of incorrect annotation of genes as the error in the prediction of initial codons. In particular, a frameshift at the 3'-end or a point mutation at the stop codon may cause the loss of the stop codon, thus extending the reading frame to the next in-frame stop codon. For this reason, some of the introns could have features of exons, which could cause these introns to be misclassified and thus decrease the accuracy. Facchiano [38] and Chung et al. [39] confirmed the existence of overlapping regions between canonical and alternative reading frames in the exons of eukaryotic genomes. In the future, TFR methodology could be applied to the characterization of such dual-coding regions, which are common in bacteriophages and viruses.

For both chromosomes, a fairly high accuracy was obtained by selecting 10 to 15 variables from the 316 variables.
Furthermore, it is possible to isolate many subsets of 10-15 variables with a fairly high accuracy for both chromosomes. This suggests that these variables are sufficiently robust for the characterization of intron and exon sequences for all chromosomes of the human genome.

4. Conclusion

A methodology based on TFR was proposed for analyzing genomic sequences. It was applied to the Gibbs energy of the boundary between nucleotides in order to explore the information that this energy carries in the transcription process of genomic sequences. Variables related to Gibbs energy sequences are insufficient for correctly classifying exons and introns; their prediction accuracies are low compared to those of TFR variables.

Table 4. Best combination of variables for the best window length (300 nucleotides) for study B (percentages are correctly classified exons and introns)

VARIABLES Inst_Freq_HF_mean_Start75 Inst_Freq_HF_mean_End100 Inst_Pw_VLF_mean Inst_Pw_VLF_std Inst_Pw_VHF_std_Start25 Inst_Pw_VHF_std_End100 Inst_Pw_LF_min Inst_Pw_VHF_tauMx_Start100 Inst_Pw_HF_RE2 RESULTS FOR PURE EXON/INTRON REGIONS RESULTS FOR TRANSITION REGIONS OVERLAP:10 exon: 73.2% intron: 60.4% OVERLAP:25 exon: 72.1% intron: 61.8% OVERLAP:50 exon: 70.5% intron: 64.1% accuracy: 63.5% accuracy: 64.7% accuracy: 65.8% exon: 76.7% intron: 62.9% accuracy: 66.5% exon: 63.2% intron: 48.8% accuracy: 52.1% exon: 65.3% intron: 68.3% exon: 76.1% intron: 64.1% accuracy: 66.7% exon: 61.5% intron: 51.2% accuracy: 54.3% exon: 65.4% intron: 68.3% exon: 74.0% intron: 66.3% accuracy: 68.9% exon: 60.8% intron: 53.8% accuracy: 56.9% exon: 65.5% intron: 68.4% accuracy: 68.0% accuracy: 68.0% exon: 70.5% intron: 70.8% accuracy: 70.7% exon: 51.2% intron: 56.6% accuracy: 55.5% exon: 71.1% intron: 61.7% exon: 70.7% intron: 70.7% accuracy: 70.7% exon: 50.9% intron: 57.2% accuracy: 55.5% exon: 71.0% intron: 64.2% accuracy: 63.8% accuracy: 65.6% exon: 72.8% intron: 65.0% accuracy:
66.9% exon: 66.4% intron: 46.3% accuracy: 51.8% exon: 71.7% intron: 65.8% accuracy: 67.8% exon: 65.7% intron: 46.2% accuracy: 51.7% Inst_Freq_HF_mean Inst_Freq_VLF_std_Start25 Inst_Freq_HF_RE015 Inst_Freq_VLF_RE2 Inst_Freq_LF_RE2 accuracy: 68.0% Inst_Pw_VLF_mean_Start100 Inst_Pw_LF_min Inst_Pw_HF_tauMx_End100 RESULTS FOR PURE EXON/INTRON REGIONS exon: 70.5% intron: 70.8% accuracy: 70.7% RESULTS FOR TRANSITION REGIONS exon: 51.0% intron: 56.7% accuracy: 55.5% GEnergy_mean Inst_Freq_LF_mean_Start50 exon: 70.8% Inst_Freq_HF_mean_Start50 intron: 61.7% Inst_Freq_HF_mean_End100 Inst_Freq_std_End50 accuracy: 63.8% Inst_Freq_LF_std_Start100 Inst_Freq_VLF_std_End25 Inst_Pw_VLF_peak_Start75 Inst_Pw_VLF_min RESULTS FOR PURE EXON/INTRON REGIONS exon: 72.6% intron: 65.0% accuracy: 66.9% RESULTS FOR TRANSITION REGIONS exon: 65.7% intron: 46.2% accuracy: 51.7%

Variables with statistical significance level p < 0.0005 for the classification of exons and introns. Inst: instantaneous; Freq: frequency; Sp: spectral; En: energy; std: standard deviation; REq: Rényi entropy with q = {0.15, 0.25, 0.5, 2, 3}; ShaEn: Shannon entropy; min: minimum value; tauMx: normalized position of maximum value; Start/End_n: first/last n nucleotides.

Table 5. Best combination of variables for studies A and B and chromosomes 1 and 2

STUDY A, CHR 1: exon: 75.0%; intron: 73.8%; accuracy: 74.6%
STUDY A, CHR 2: exon: 72.4%; intron: 77.0%; accuracy: 76.2%
STUDY B (DETECTION, WINDOW: 300, OVERLAP: 25), CHR 1: exon: 69.9%; intron: 62.6%; accuracy: 64.8%
STUDY B (DETECTION, WINDOW: 300, OVERLAP: 25), CHR 2: exon: 74.1%; intron: 64.7%; accuracy: 66.8%
Four variables: Inst_Freq_HF_mean; Inst_Freq_HF_ShaEn; Inst_Freq_VHF_mean_End5; Inst_Freq_VHF_min_Start100

The start and end parts of introns (the 5' and 3' cut points, respectively) and the polypyrimidine tract (toward the end of an intron) contain nucleotide sequences that act as signals indicating where the spliceosome has to cut.
For this reason, the signals obtained in these regions after the conversion of nucleotides into numerical values must be more regular than the signals obtained from exons, which do not need to notify the spliceosome about their position. This hypothesis is supported by the fact that the mean value of the entropy of the sequences of instantaneous frequency and instantaneous power is always higher for exons than for introns, indicating that introns are structurally less complex (i.e., more regular).

This work was a preliminary study on the application of TFR to the characterization of nucleic acid sequences. Further validation should be performed by applying this methodology to different mammalian genomes.

Table 6. Best combination of variables for study C (percentages are correctly classified exons and introns).

Variables: Inst_Freq_VLF_mean, Inst_Freq_LF_mean, Inst_Freq_HF_mean, Inst_Freq_VHF_mean, Inst_Freq_VLF_std, Inst_Freq_LF_std, Inst_Freq_HF_std, Inst_Freq_VHF_std, Inst_Freq_mean, Inst_Freq_std, Inst_Pw_VLF_mean, Inst_Pw_LF_mean, Inst_Pw_HF_mean, Inst_Pw_VHF_mean, Inst_Pw_VLF_std, Inst_Pw_LF_std, Inst_Pw_HF_std, Inst_Pw_VHF_std
CHROMOSOME 1, AEIS: exon: 80.8%; intron: 89.4%; accuracy: 86%
CHROMOSOME 1, AEIE: exon: 80.5%; intron: 92.3%; accuracy: 88%
CHROMOSOME 2, AEIS: exon: 85.5%; intron: 92.3%; accuracy: 90%
CHROMOSOME 2, AEIE: exon: 85.2%; intron: 93.6%; accuracy: 91%
Variables with statistical significance level p < 0.0005 for the classification of exons and introns.

Acknowledgment

This work was supported within the framework of the CICYT grants TEC2010-20886-C02-01 and TEC2010-20886-C02-02, the Ramón y Cajal Program from the Spanish Government, and the Research Fellowship Grant FPU AP20090858 from the Spanish Government. CIBER of Bioengineering, Biomaterials and Nanomedicine is an initiative of ISCIII.

References

[1] J. Valcárcel, R. K. Gaur, R. Singh and M. R. Green, “Interaction of U2AF65 RS region with pre-mRNA branch point and promotion of base pairing with U2 snRNA,” Science, 273: 1706-1709, 1996.
[2] N. A. Faustino and T. A. Cooper, “Pre-mRNA splicing and human disease,” Genes Dev., 17: 419-437, 2003.
[3] B. R. Graveley, K. J. Hertel and T. Maniatis, “The role of U2AF35 and U2AF65 in enhancer-dependent splicing,” RNA-Publ. RNA Soc., 7: 806-818, 2001.
[4] B. J. Blencowe, “Exonic splicing enhancers: Mechanism of action, diversity and role in human genetic disease,” Trends Biochem. Sci., 25: 106-110, 2000.
[5] J. L. Kan and M. R. Green, “Pre-mRNA splicing of IgM exons M1 and M2 is directed by a juxtaposed splicing enhancer and inhibitor,” Genes Dev., 13: 462-471, 1999.
[6] E. E. Snyder and G. D. Stormo, “Identification of coding regions in genomic DNA sequences: an application of dynamic programming and neural networks,” Nucleic Acids Res., 21: 607-613, 1993.
[7] Y. Xu, R. J. Mural and E. C. Uberbacher, “Constructing gene models from accurately predicted exons: an application of dynamic programming,” Comput. Appl. Biosci., 10: 613-623, 1994.
[8] R. Lopez, F. Larsen and H. Prydz, “Evaluation of the exon predictions of the GRAIL software,” Genomics, 24: 133-136, 1994.
[9] V. V. Solovyev, A. A. Salamov and C. B. Lawrence, “Predicting internal exons by oligonucleotide composition and discriminant analysis of spliceable open reading frames,” Nucleic Acids Res., 22: 5156-5163, 1994.
[10] M. Q. Zhang and T. G. Marr, “Fission yeast gene structure and recognition,” Nucleic Acids Res., 22: 1750-1759, 1994.
[11] M. Q. Zhang, “Identification of protein coding regions in the human genome by quadratic discriminant analysis,” Proc. Natl. Acad. Sci. U. S. A., 94: 565-568, 1997.
[12] T. Chen and M. Q. Zhang, “Pombe: a gene-finding and exon-intron structure prediction system for fission yeast,” Yeast, 14: 701-710, 1998.
[13] A. Nandy, “Two-dimensional graphical representation of DNA sequences and intron-exon discrimination in intron-rich sequences,” Comput. Appl. Biosci., 12: 55-62, 1996.
[14] Y. Wu, A. W. Chung Liew, H. Yan and M. Yang, “Classification of short human exons and introns based on statistical features,” Phys. Rev. E, 1: 1-7, 2006.
[15] H. T. Chang, “DNA sequence visualization,” in Dr. Hui-Huang Hsu (Ed.), Advanced Data Mining Technologies in Bioinformatics, Idea Group Publishing, 4: 63-84, 2006.
[16] N. W. Lo, T. H. Chang, S. W. Xiao and C. J. Kuo, “Global visualization of DNA sequences by use of three-dimensional trajectories,” J. Inf. Sci. Eng., 23: 1723-1736, 2007.
[17] D. Kotlar and Y. Lavner, “Gene prediction by spectral rotation measure: a new method for identifying protein-coding regions,” Genome Res., 13: 1930-1937, 2003.
[18] S. T. Eskesen, F. N. Eskesen, B. Kinghorn and A. Ruvinsky, “Periodicity of DNA in exons,” BMC Mol. Biol., 5: 12-23, 2004.
[19] S. Tiwari, S. Ramachandran, A. Bhattacharya, S. Bhattacharya and R. Ramaswamy, “Prediction of probable genes by Fourier analysis of genomic sequences,” Comput. Appl. Biosci., 13: 263-270, 1997.
[20] M. Yan, Z. S. Lin and C. T. Zhang, “A new Fourier transform approach for protein coding measure based on the format of the Z curve,” Bioinformatics, 14: 685-690, 1998.
[21] M. Akhtar, J. Epps and E. Ambikairajah, “Signal processing in sequence analysis: advances in eukaryotic gene prediction,” IEEE J. Sel. Top. Signal Process., 2: 310-321, 2008.
[22] M. K. Choong and H. Yan, “Multi-scale parametric spectral analysis for exon detection in DNA sequences based on forward-backward linear prediction and singular value decomposition of the double-base curves,” Bioinformation, 2: 273-278, 2008.
[23] P. P. Vaidyanathan and B. J. Yoon, “The role of signal processing concepts in genomics and proteomics,” J. Frankl. Inst.-Eng. Appl. Math., 341: 111-135, 2004.
[24] D. Anastassiou, “Genomic signal processing,” IEEE Signal Process. Mag., 18: 8-20, 2001.
[25] J. Ning, C. N. Moore and J. C. Nelson, “Preliminary wavelet analysis of genomic sequences,” Proc. of IEEE Int. Conf. Computer Society Bioinformatics (CSB'03), 1: 509-510, 2003.
[26] J. W. Fickett and C. S. Tung, “Assessment of protein coding measures,” Nucleic Acids Res., 20: 6441-6450, 1992.
[27] C. Burge and S. Karlin, “Prediction of complete gene structures in human genomic DNA,” J. Mol. Biol., 268: 78-94, 1997.
[28] Ensembl release 57, 2010. Available: http://www.ensembl.org.
[29] S. Durinck, W. Huber and S. Davis, “Interface to BioMart databases,” biomaRt, Available: http://www.bioconductor.org/packages/2.2/bioc/html/biomaRt.html
[30] K. J. Breslauer, R. Frank, H. Blöcker and L. A. Marky, “Predicting DNA duplex stability from the base sequence,” Proc. Natl. Acad. Sci. U. S. A., 83: 3746-3750, 1986.
[31] N. Sugimoto, S. Nakano, M. Yoneyama and K. Honda, “Improved thermodynamic parameters and helix initiation factor to predict stability of DNA duplexes,” Nucleic Acids Res., 24: 4501-4505, 1996.
[32] F. Hlawatsch and G. F. Boudreaux-Bartels, “Linear and quadratic time-frequency signal representations,” IEEE Signal Process. Mag., 9: 21-67, 1992.
[33] F. Clariá, M. Vallverdú, R. Baranowski, L. Chojnowska and P. Caminal, “Time-frequency analysis of the RT and RR variability to stratify hypertrophic cardiomyopathy patients,” Comput. Biomed. Res., 33: 416-430, 2000.
[34] L. Cohen, Time-frequency analysis, Prentice Hall Signal Processing Series, 1995.
[35] F. Clariá, M. Vallverdú, R. Baranowski, L. Chojnowska and P. Caminal, “Heart rate variability analysis based on time-frequency representation and entropies in hypertrophic cardiomyopathy patients,” Physiol. Meas., 29: 401-416, 2008.
[36] S. Mika, G. Rätsch, J. Weston, B. Schölkopf and K. R. Müller, “Fisher discriminant analysis with Kernels,” Proc. IEEE Int. Workshop on Neural Networks for Signal Process., 9: 41-48, 1999.
[37] A. Pallejà, E. D. Harrington and P. Bork, “Large gene overlaps in prokaryotic genomes: result of functional constraints or mispredictions?” BMC Genomics, 9: 335-345, 2008.
[38] A. Facchiano, “Coding in noncoding frames,” Trends Genet., 12: 168-169, 1996.
[39] W. Y. Chung, S. Wadhawan, R. Szklarczyk, S. Kosakovsky and A. Nekrutenko, “A first look at ARFome: Dual-coding genes in mammalian genomes,” PLoS Comput. Biol., 3: e91, 2007.