Journal of Medical and Biological Engineering, 33(5): 504-512
Choi-Williams Distribution to Describe Coding and
Non-coding Regions in Primary Transcript Pre-mRNA
Umberto S. P. Melia1,2,3,*
Juan J. Gallardo3
Montserrat Vallverdú1,2,3
Alexandre Perera1,2,3
Francesc Clarià4
Pere Caminal1,2,3
1 Department of Automatic Control, Barcelonatech, 08028 Barcelona, Spain
2 Biomedical Engineering Research Centre, Barcelonatech, 08028 Barcelona, Spain
3 Centers of Biomedical Research Network of Bioengineering, Biomaterials and Nanomedicine, 08028 Barcelona, Spain
4 Department of Computer Science and Industrial Engineering, Lleida University, 25003 Lleida, Spain
Received 4 Nov 2011; Accepted 18 May 2012; doi: 10.5405/jmbe.1060
Abstract
Deoxyribonucleic acid (DNA) information is discrete in both “time” (sequence positions) and “amplitude”
(nucleotide values). This permits the use of signal processing techniques for its characterization. The conversion of
DNA nucleotide symbols into discrete numerical values enables signal processing to be employed to solve problems
related to sequence analysis, such as finding coding sequences. In this work, a numerical conversion method was
chosen based on the thermodynamic data of free energy changes (ΔG°) of the formation of a duplex structure of DNA
or ribonucleic acid (RNA), associated with the nucleotide sequence of the pre-mRNA (precursor messenger RNA). The aim of this
work was to discriminate coding regions (exons) from non-coding regions (introns) using a methodology based on
time-frequency representation (TFR). This permits the observation of the evolution of the periodicity and frequency
components with time, introducing more variables related to the gene sequences compared to those used in traditional
fast Fourier transform analysis. The parameters calculated from TFR are instantaneous frequency and instantaneous
power. It was found that instantaneous frequency and power variables in different frequency bands allowed the correct
classification between exons and introns with a prediction accuracy of more than 85%.
Keywords: Bioinformatics (genome) databases, Classification and feature extraction, Stochastic processes, Time series
analysis
1. Introduction
The genetic information necessary to reproduce, develop,
and keep alive an organism is coded in the deoxyribonucleic
acid (DNA) sequences that are mostly inside the cell nucleus.
DNA comprises a succession of the nucleotides adenine (A),
thymine (T), cytosine (C), and guanine (G). This sequence
contains sections called genes that contain the information for
building proteins, the functional units of a living organism.
When a gene is expressed, the sequence of DNA is used as
a pattern to synthesize ribonucleic acid (RNA) during a process
called transcription. The result is a single-stranded
copy, the primary transcript or pre-mRNA (precursor messenger RNA),
in which thymine is replaced with uracil (U).
* Corresponding author: Umberto S. P. Melia
Tel: +34-93-4017160; Fax: +34-93-4017045
E-mail: [email protected]
In eukaryotes, genes are further divided into relatively
smaller protein coding segments known as exons, interrupted
by non-coding spacers known as introns, which are removed
during the splicing process for the formation of mature RNA.
This is the final chain made by continuous sequences of coding
regions that can be translated into proteins. In complex
organisms, the primary RNA transcript could be alternatively
edited, so that the set of exons finally expressed can vary in
response to specific biological signals. In complex organisms,
two levels of molecular machinery are involved in the splicing
of pre-mRNA transcripts: the basal machinery and gene
regulation system. The basal machinery, which is found in all
organisms whose genome contains introns, consists of five
small nuclear RNA molecules (snRNA) [1]. These molecules,
which are formed by a few nucleotides, bind to certain proteins
to form the spliceosome complex that performs the two primary
functions of splicing: recognition of the intron/exon boundaries
and catalysis of the cut-and-paste reactions that remove introns
and join exons [2]. For each intron, four nucleotide sequences
work as signals that report to the spliceosome where to cut: at
the start of the intron or cut point 5', at the end or cut point 3',
in the middle or area of branching, or at the polypyrimidine
stretch [3]. Moreover, the gene regulation system controls the
process of splicing and cutting by driving the basal machinery
to these cut points [4,5].
J. Med. Biol. Eng., Vol. 33 No. 5 2013
The classification of human gene sequences into exons
and introns is a difficult problem in DNA sequence analysis.
The prediction of genes and the classification of coding and
non-coding DNA sequences are popular research areas. In the
past twenty years, numerous advanced statistical gene-finding
algorithms have been developed using two approaches for
computational gene prediction. One is gene structure and
signal-based searches (ab initio prediction of coding or non-coding regions) and the other uses algorithms based on
knowledge of other organisms. Several complex systems for
predicting gene structure have been developed. Many gene-finding systems are mainly based on a statistical or neural
network approach [6-11], or use an optimization algorithm. In
one study, dynamic programming was applied to discriminate
between exons and pseudo-exons and between introns and
pseudo-introns of yeast DNA [12]. A two-dimensional (2D)
graphical representation was used by Nandy [13] in order to
provide an indication of possible exon regions within a long
DNA sequence. Wu et al. [14] proposed a set of statistical
features, called SZ features, and combined them with other
features, such as information on the stop codon (a nucleotide
triplet within mRNA that signals the termination of translation).
A significant improvement in recognition accuracy was
achieved for short DNA sequences.
Chang introduced various visualization and graphical
representation schemes of symbolic DNA sequences [15].
Several visualization schemes were reviewed and the techniques
compared by analyzing possible applications for sequence
alignment, feature extraction, and sequence clustering.
In particular, a study by Lo et al. [16] used a three-dimensional
trajectory method for DNA sequence visualization. Significant
pattern disparity, helpful in monitoring the deliberate deposition
of false sequences in public domains, was revealed. Using a
representation in which the four bases correspond to four
vertices of a tetrahedron, they observed local differences and
similarities between two DNA sequences.
Since DNA information is discrete in both time and
amplitude, digital signal processing techniques are an efficient
approach for its investigation. Several studies have tried to
demonstrate that the periodicity of DNA in coding regions
arises from codon usage frequencies, and that introns lack
this periodicity. The periodicity is introduced by the
coding biases in the translation of codons into amino acids
[17,18]. Thus, Fourier transform analysis has been widely used
for sequence processing [19-21]. A method was developed
based on auto-regressive modeling using forward-backward
linear prediction and the singular value decomposition
algorithm [22]. One study [23] found that a frequency-domain
approach can help the extraction of exon and intron
characteristics. In fact, it was demonstrated [24] that the
frequency spectrum has some perturbations when an exon is
present.
The Fourier transform poses problems of windowing or
data truncation artifacts and spurious spectral peaks, and thus
can only reveal the global periodicity of stationary signals,
losing the time dependence. In contrast, a time-frequency
representation (TFR) permits the observation of the evolution of
the periodicity and frequency components with time, allowing
the analysis of non-stationary signals [25]. Moreover, this
representation, which keeps the time dependence of signal
features, gives the possibility of introducing more variables
related to the gene sequences that a traditional analysis does not
allow.
The conversion into numerical values of the symbols
associated with the nucleotide sequence of DNA permits
problems related to the localization and annotation of genes and
coding sequences to be addressed using signal processing tools.
It is worth mentioning that in this representation, time is
associated with the position of nucleotides along the gene
sequence.
In the present work, an analysis based on the TFR, from a
Choi-Williams distribution (CWD), of a large number of gene
sequences was performed in order to find which variables,
related to pre-mRNA, can better discriminate coding regions
from non-coding regions. The nucleotide sequences were
translated into numerical sequences by Gibbs energy conversion
and then TFR analysis was applied. From TFR, several variables
were proposed and statistically analyzed in order to differentiate
exon sequences, intron sequences, and their partial sequences.
This work was conducted in order to study the reliability of
constructing a detector of intron and exon sequences based on
TFR for future research. There are various detectors that can
approximate the position of the start and end of exons by
applying statistical tools to DNA sequences in FASTA string
format. One of the detectors with high accuracy uses a three-periodic
fifth-order hidden Markov model [26,27]. The present
study tests the capability of time-frequency functions to
differentiate exons and introns so that they can be applied in a future
detector in place of the FASTA string DNA.
2. Materials and methods
2.1 Analyzed database
About 1000 human gene sequences were processed, taken
from two chromosomes (chromosome 1 and chromosome 2),
together with the corresponding annotation of the positions of exons and
introns. These sequences with their respective annotation were
taken from the public database Ensembl (2009) [28] using the
R package biomaRt [29]. Software was developed in order to
organize and manage gene sequence data and particular
information. About 4000 exons and 10000 introns were
contained in the analyzed datasets. The minimum length taken
into account for introns and exons was 100 nucleotides.
2.2 Analysis of sequences
In order to characterize coding and non-coding regions in
primary transcript pre-mRNA, different analyses were
performed, taking into account different parts of the gene. In
this way, exon and intron sequences were analyzed with
different approaches:
Time-Frequency Representation of Pre-mRNA
1. Study A: Discrimination between the complete sequences of
exons and the complete sequences of introns.
2. Study B: Classification of fixed-length windows sliding on
the complete gene sequences.
3. Study C: Discrimination between the complete sequences of
exons and the beginning part or the ending part of introns.
Study B was used to define whether a selected window
belongs to an exon or an intron sequence. Several lengths and
several overlaps of a window were taken into account. In this
way, a window belongs to one of three kinds of region,
depending on which part of the gene is studied:
1. Pure exon region: all nucleotides of the window belong to an
exon sequence.
2. Pure intron region: all nucleotides of the window belong to
an intron sequence.
3. Transition region between exon and intron nucleotides: one
part of the window contains exon (or intron) nucleotides and
the other part contains intron (or exon) nucleotides. In this
case, the window contains the point where the exon (or
intron) ends and the intron (or exon) starts.
The classes considered for the classification were exon
region and intron region; therefore, the transition regions were
considered exon or intron depending on the prevalence of
nucleotides belonging to an exon or an intron region. In the
calculation of the results, pure exons, pure introns and also the
influence of the transition regions were taken into account.
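The window labelling described for study B can be sketched as follows. This is only an illustration, not the software used in the study: the function names, the per-nucleotide boolean mask `is_exon`, and the tie-breaking rule (a window with exactly half exon nucleotides is labelled intron) are assumptions introduced here.

```python
def label_window(is_exon, start, length):
    """Return (region_kind, assigned_class) for one window.

    is_exon: list of booleans, True where the nucleotide belongs to an exon
             (hypothetical input format, assumed for this sketch).
    """
    window = is_exon[start:start + length]
    n_exon = sum(window)
    if n_exon == len(window):
        kind = "pure exon"
    elif n_exon == 0:
        kind = "pure intron"
    else:
        kind = "transition"
    # Class by prevalence; ties go to "intron" (an assumption of this sketch).
    assigned = "exon" if 2 * n_exon > len(window) else "intron"
    return kind, assigned


def sliding_windows(is_exon, length, step):
    """Label every full-length window; step = length minus overlap."""
    last_start = len(is_exon) - length
    return [(s,) + label_window(is_exon, s, length)
            for s in range(0, last_start + 1, step)]


# Toy gene: 6 exon nucleotides followed by 6 intron nucleotides.
mask = [True] * 6 + [False] * 6
labels = sliding_windows(mask, length=4, step=2)
```

Keeping the region kind next to the assigned class makes it possible to report results for pure regions and transition regions separately.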
Finally, the results of all the studies were obtained by
looking for the best single variable and the best combined
variables using a genetic algorithm (GA).
2.3 Numerical representation of genomic sequences
The thermodynamic values of changes of enthalpy (ΔH°),
entropy (ΔS°), and free energy (ΔG°) of the formation of a
duplex structure of DNA or RNA can be calculated from the
thermodynamic data library [30]. They are based on the
nearest-neighbor interactions between 10 base pairs of
nucleotides: AA/UU, AU/UA, UA/AU, CG/GC, GC/CG,
CU/GA, GA/UC, GU/AC, AC/GU, and GG/CC. These
thermodynamic values are also useful for predicting the energy
required to break the bond between complementary nucleotides
of the double helix. The values of the parameters of these
Watson-Crick base pairs are shown in Table 1 [31].
Table 1. Energy values for pairs of nucleotides.
Sequence    ΔH° (kcal/mol)    ΔS° (cal/mol/K)    ΔG° (kcal/mol)
dAA/dUU     –8.0              –21.9              –1.2
dAU/dUA     –5.6              –15.2              –0.9
dUA/dAU     –6.6              –18.4              –0.9
dCA/dGU     –8.2              –21.0              –1.7
dCU/dGA     –6.6              –16.4              –1.5
dGA/dCU     –8.8              –23.5              –1.5
dGU/dCA     –9.4              –25.5              –1.5
dCG/dGC     –11.8             –29.0              –2.8
dGC/dCG     –10.5             –26.4              –2.3
dGG/dCC     –10.9             –28.4              –2.1
Thermodynamic parameters for DNA-RNA obtained at 310.15 K with an
estimated error below 5-6%.
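As a consistency check on Table 1, the three columns should satisfy ΔG° = ΔH° − TΔS° at the stated temperature of 310.15 K. The short sketch below recomputes ΔG° from the tabulated ΔH° and ΔS° values; the dictionary and function names are illustrative and not part of the original work.

```python
T_KELVIN = 310.15  # temperature stated in the table footnote

# (dH in kcal/mol, dS in cal/mol/K, dG in kcal/mol), copied from Table 1.
TABLE_1 = {
    "dAA/dUU": (-8.0, -21.9, -1.2),
    "dAU/dUA": (-5.6, -15.2, -0.9),
    "dUA/dAU": (-6.6, -18.4, -0.9),
    "dCA/dGU": (-8.2, -21.0, -1.7),
    "dCU/dGA": (-6.6, -16.4, -1.5),
    "dGA/dCU": (-8.8, -23.5, -1.5),
    "dGU/dCA": (-9.4, -25.5, -1.5),
    "dCG/dGC": (-11.8, -29.0, -2.8),
    "dGC/dCG": (-10.5, -26.4, -2.3),
    "dGG/dCC": (-10.9, -28.4, -2.1),
}


def gibbs_from_enthalpy_entropy(dh, ds, temperature=T_KELVIN):
    """dG = dH - T*dS, with dS converted from cal to kcal."""
    return dh - temperature * ds / 1000.0


# Absolute difference between the recomputed and the tabulated dG values.
deviations = {pair: abs(gibbs_from_enthalpy_entropy(dh, ds) - dg)
              for pair, (dh, ds, dg) in TABLE_1.items()}
max_deviation = max(deviations.values())
```

With these inputs every recomputed value agrees with the tabulated ΔG° to within the one-decimal rounding of the table.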
The thermodynamic variables include information about
the stability of the sequence. The Gibbs free energy (ΔG°) was
used to convert the symbols associated with the nucleotide
sequence of pre-mRNA into numerical values. Thus,
the obtained sequence is a numerical sequence with the
advantage of having a physical meaning. It accounts for the
ease with which a set of nucleotides binds to its
complementary set, a circumstance that occurs continuously
during the editing process in the pre-mRNA sequence. This
happens during the cutting of introns and the joining of exons in
order to generate the mRNA sequence that carries the protein
code. The obtained sequences are a succession of
thermodynamic values that can be regarded as a non-stationary
stochastic process with a non-zero mean value. Thus,
the numerical representation of the
nucleotide sequence can be treated as a signal sampled at intervals of 1 second.
Consequently, the maximum frequency of the bandwidth is 0.5 Hz, in
accordance with the Nyquist theorem.
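A minimal sketch of the symbol-to-number conversion is given below. The exact mapping rule is not spelled out in the text, so several points are assumptions made for illustration: each pair of adjacent nucleotides is replaced by the ΔG° of its nearest-neighbor pair from Table 1, the six dinucleotides not listed there are looked up through their reverse complement, T is read as U, and the resulting signal has one sample fewer than the sequence. The names `DELTA_G` and `gibbs_signal` are hypothetical.

```python
# dG values (kcal/mol) for the ten nearest-neighbor pairs of Table 1,
# keyed by the 5'->3' dinucleotide of the first strand.
DELTA_G = {"AA": -1.2, "AU": -0.9, "UA": -0.9, "CA": -1.7, "CU": -1.5,
           "GA": -1.5, "GU": -1.5, "CG": -2.8, "GC": -2.3, "GG": -2.1}
COMPLEMENT = {"A": "U", "U": "A", "C": "G", "G": "C"}


def reverse_complement(step):
    """Reverse complement of a short RNA string, e.g. 'UG' -> 'CA'."""
    return "".join(COMPLEMENT[base] for base in reversed(step))


def gibbs_signal(sequence):
    """Map a nucleotide string to a list of dG values, one per adjacent pair."""
    seq = sequence.upper().replace("T", "U")
    signal = []
    for i in range(len(seq) - 1):
        step = seq[i:i + 2]
        if step not in DELTA_G:
            # Assumed symmetry: a step and its reverse complement
            # describe the same duplex.
            step = reverse_complement(step)
        signal.append(DELTA_G[step])
    return signal


example = gibbs_signal("AAUU")
```

Symbols other than A, C, G, T, and U (for example N) raise a `KeyError` in this sketch; how such positions were handled in the study is not stated.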
2.4 Time-Frequency representation tool
For continuous time and frequency variables, the Wigner
distribution (WD) for signal x(t) is defined as:
$$ W_x(t,f) = \int_{-\infty}^{\infty} x\left(t+\frac{\tau}{2}\right)\, x^{*}\left(t-\frac{\tau}{2}\right)\, e^{-j 2\pi f \tau}\, d\tau \qquad (1) $$
There are many other quadratic TFRs with an energetic
interpretation [32]. The class of all time-frequency
shift-invariant quadratic TFRs is known as the quadratic
Cohen’s class. Prominent members of Cohen’s class are the
spectrogram and the WD. Every member of Cohen’s class may
be interpreted as a 2D filtered WD. In fact, it can be shown that
T_x(t, f) is a member of Cohen’s class if and only if it can be
derived from the WD of the signal x(t) via a time-frequency
convolution:
$$ T_x(t,f) = \int_{-\infty}^{\infty}\int_{-\infty}^{\infty} h(t-t', f-f')\, W_x(t',f')\, dt'\, df' \qquad (2) $$
Each member T_x of Cohen’s class is associated with a
unique, signal-independent kernel function h(t, f) (or 2D
filter). The convolution is transformed into a simple
multiplication in the Fourier transform domain.
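The remark that the convolution becomes a multiplication in the Fourier domain can be illustrated with a small NumPy sketch. It computes a discrete Wigner distribution from the local autocorrelation of an analytic signal and, optionally, multiplies its time-Fourier transform by an exponential kernel before returning to the time-frequency plane. Everything here is an assumption for illustration: the function names, the kernel form exp(−(θτ)²/σ) (the textbook Choi-Williams exponential in Doppler-lag variables), the value σ = 1, the lag window, and the circular handling of the time axis. In particular, σ in this sketch is not on the same scale as the kernel parameter reported in this paper.

```python
import numpy as np


def analytic_signal(x):
    """Analytic signal of a real sequence, built from its one-sided spectrum."""
    x = np.asarray(x, dtype=float)
    n = x.size
    gain = np.zeros(n)
    gain[0] = 1.0
    if n % 2 == 0:
        gain[1:n // 2] = 2.0
        gain[n // 2] = 1.0
    else:
        gain[1:(n + 1) // 2] = 2.0
    return np.fft.ifft(np.fft.fft(x) * gain)


def cohen_tfr(z, sigma=None, max_lag=32, n_freq=128):
    """Discrete Wigner distribution (sigma=None) or kernel-smoothed TFR.

    Returns (tfr, freqs): tfr has shape (len(z), n_freq); freqs are in
    cycles per sample and run from 0 to just under 0.5.
    """
    z = np.asarray(z, dtype=complex)
    n = z.size
    assert n_freq >= 2 * max_lag + 1, "n_freq must hold all lags"
    t = np.arange(n)
    # Local autocorrelation z[t+m] * conj(z[t-m]); lag m is stored in FFT
    # order along axis 1 (negative lags wrap to the end).
    acf = np.zeros((n, n_freq), dtype=complex)
    for m in range(-max_lag, max_lag + 1):
        ok = (t + m >= 0) & (t + m < n) & (t - m >= 0) & (t - m < n)
        acf[ok, m % n_freq] = z[t[ok] + m] * np.conj(z[t[ok] - m])
    if sigma is not None:
        # FFT over time, multiply by the kernel, transform back.
        theta = 2.0 * np.pi * np.fft.fftfreq(n)             # rad/sample
        tau = 2.0 * np.fft.fftfreq(n_freq, d=1.0 / n_freq)  # lag = 2m samples
        kernel = np.exp(-(theta[:, None] * tau[None, :]) ** 2 / sigma)
        acf = np.fft.ifft(np.fft.fft(acf, axis=0) * kernel, axis=0)
    tfr = np.fft.fft(acf, axis=1).real
    freqs = np.arange(n_freq) / (2.0 * n_freq)
    return tfr, freqs


# Two-tone test signal: the Wigner distribution shows an oscillating
# cross-term midway between the tones, which the kernel attenuates.
n = 256
t = np.arange(n)
x = np.cos(2 * np.pi * 0.125 * t) + np.cos(2 * np.pi * 0.3125 * t)
z = analytic_signal(x)
wd, freqs = cohen_tfr(z)
cwd, _ = cohen_tfr(z, sigma=1.0)
mid = int(np.argmin(np.abs(freqs - 0.21875)))  # bin midway between the tones
cross_wd = np.mean(np.abs(wd[:, mid]))
cross_cwd = np.mean(np.abs(cwd[:, mid]))
```

Because the kernel equals 1 whenever θ = 0 or τ = 0, the smoothing leaves the time-averaged spectrum and the instantaneous energy untouched while damping terms that oscillate in time, which is where the interference lives.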
For the analysis of each gene, a particular Cohen’s class
distribution was chosen, namely the CWD [33]. It was applied
to the numerical sequences, after the calculation of the analytic
signal, using Eq. (2) with the WD in Eq. (1) and the following
Choi-Williams exponential:
$$ h(t,f) = \sqrt{\frac{4\pi}{c}}\; e^{-\frac{4\pi^{2}(tf)^{2}}{c}} \qquad (3) $$
Equation (3) preserves the properties of the WD [33,34],
such as the marginal properties and instantaneous frequency.
Moreover, it is able to reduce the WD interference by
estimating an adequate c parameter for the working band. The
proposed estimation criterion was:
$$ 1 - \frac{A_{mb}}{A_{mt}} \geq 0.95 \qquad (4) $$
where Amb represents the mean value of the spectral amplitude
between test tones and Amt is the mean value of the spectral
amplitude at the test tones [34]. The parameter value of the CW
exponential was estimated at 0.005 in order to eliminate the
interferences produced by a sinusoidal test signal composed of four
frequency components at 0.05, 0.15, 0.3, and 0.45 Hz. The WD
of this test signal and the TFR without interferences obtained
by the CWD are shown in Figs. 1(a) and 1(b), respectively.
Figure 1. (a) WD interferences of a simulated signal and (b) WD
without interferences obtained using CWD.
In order to calculate variables that might reflect
characteristics of the signal in different frequency ranges, the
spectrum was divided into the following frequency bands: very
low frequency (VLF), 0-0.1 Hz; low frequency (LF), 0.1-0.2 Hz;
high frequency (HF), 0.2-0.4 Hz; very high frequency
(VHF), 0.4-0.5 Hz; total frequency (TF), 0-0.5 Hz.
Instantaneous power was calculated for each gene signal
as the integral of the CWD over frequency in each of the bands (Eq.
(5)). Subsequently, for each band, the instantaneous frequency
function was calculated [35] as the average frequency of the
spectrum with respect to time (Eq. (6)).
$$ E_{norm}(t) = \frac{\int_{f_1}^{f_2} T_x(t,f)\, df}{\int T_x(t,f)\, df} \qquad (5) $$
$$ f_i(t) = \frac{\int_{f_1}^{f_2} f\, T_x(t,f)\, df}{\int_{f_1}^{f_2} T_x(t,f)\, df} \qquad (6) $$
where f1 and f2 are the limits of the frequency bands.
Figure 2 shows an example of a CWD normalized by the
maximum value of a gene (FH gene), indicating exon and
intron edges. For the same gene, Fig. 3 shows the instantaneous
frequency along the nucleotide positions in the TF band of an
intron and an exon. The dynamics of the intron and exon are
different, with the behavior observed for the intron being more
regular. Another characterization of these sequences is shown
in Fig. 4, where the instantaneous power is divided into the four
bands defined in this study. The dynamics of the evolution of
the intron and exon sequences are different. In order to describe
these differences, variables that are able to quantify them
should be considered.
Figure 2. CWD of FH gene (blue lines divide exon and intron regions).
Figure 3. Instantaneous frequency of (a) intron and (b) exon.
2.5 Shannon and Rényi entropies
In addition to traditional variables (mean, standard
deviation, and maximum-minimum value), Shannon and Rényi
entropies were used to quantify the regularity, uncertainty, or
randomness of the sequences. They were applied to each
sequence of instantaneous frequency and instantaneous power
calculated from each CWD and to the original sequences of the
Gibbs energy for each gene.
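The instantaneous power and instantaneous frequency series to which these measures are applied follow Eqs. (5) and (6). A discrete sketch for a TFR sampled on a uniform frequency grid is shown below; on such a grid the frequency step cancels in both ratios. The names, the half-open band convention, the normalization of the band power by the power over the whole grid, and the use of NaN for empty bands are assumptions made for this illustration.

```python
BANDS = {"VLF": (0.0, 0.1), "LF": (0.1, 0.2), "HF": (0.2, 0.4),
         "VHF": (0.4, 0.5), "TF": (0.0, 0.5)}


def band_variables(tfr_row, freqs, band):
    """Normalized power and mean frequency of one time slice in one band."""
    f1, f2 = BANDS[band]
    # Half-open band [f1, f2); the 0.5 Hz bin itself is kept in bands
    # that end at 0.5 Hz (an assumption of this sketch).
    in_band = [f1 <= f < f2 or (f2 == 0.5 and f == 0.5) for f in freqs]
    total = sum(tfr_row)
    band_power = sum(v for v, keep in zip(tfr_row, in_band) if keep)
    moment = sum(f * v for f, v, keep in zip(freqs, tfr_row, in_band) if keep)
    e_norm = band_power / total if total else float("nan")
    f_inst = moment / band_power if band_power else float("nan")
    return e_norm, f_inst


def band_series(tfr, freqs, band):
    """Apply band_variables to every time slice of a TFR (list of rows)."""
    pairs = [band_variables(row, freqs, band) for row in tfr]
    return [p for p, _ in pairs], [f for _, f in pairs]


# Toy TFR with two time slices on a grid 0, 0.05, ..., 0.5 Hz.
freqs = [i / 20 for i in range(11)]
row_a = [0.0] * 6 + [1.0] + [0.0] * 4           # all energy at 0.3 Hz
row_b = [0.0, 0.5] + [0.0] * 7 + [0.5, 0.0]     # split between 0.05 and 0.45 Hz
tfr = [row_a, row_b]
hf_power, hf_freq = band_series(tfr, freqs, "HF")
vlf_power, vlf_freq = band_series(tfr, freqs, "VLF")
```

Statistics such as the mean, standard deviation, or an entropy can then be taken over each returned series.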
Let X be a discrete random variable which takes a finite
number of possible values x1, x2, x3, ..., xn with probabilities
P(1), P(2), P(3), ..., P(n), respectively, such that P(i) ≥ 0 for i = 1, 2,
3, ..., n and \sum_{i=1}^{n} P(i) = 1. Shannon (ShaEn) and Rényi (REq) entropies
are respectively defined as:
$$ ShaEn = -\sum_{i=1}^{Q} P(i)\,\log_2 P(i) \qquad (7) $$
$$ RE_q = \frac{1}{1-q}\,\log_2 \sum_{i=1}^{Q} P^{q}(i) \qquad (8) $$
where Q = 8 represents the number of levels into which the
signals are quantized. Values of Rényi entropies for
q = {0.15,0.25,0.5,2,3} were calculated.
In Eq. (8), the largest probabilities most influence the
Rényi entropy when q > 1 and the smallest probabilities most
influence the value of Rényi entropy when 0 < q < 1. The
Rényi entropy converges to the Shannon entropy when q → 1
[35].
Figure 4. Instantaneous power of (a) intron and (b) exon.
2.6 Definition of variables
From the calculated parameters (instantaneous frequency
and instantaneous spectral energy) and the original sequences
of the Gibbs energy along nucleotide position, a total of 316
variables were defined. In the analysis, the original sequences
of the Gibbs energy were considered in order to determine
whether this representation is sufficient to obtain acceptable
results without introducing TFR parameters. For each sequence,
the following parameters were calculated: mean, standard
deviation, value of the peak, peak position, minimum value,
and Shannon and Rényi entropies. All these variables were
obtained from the complete sequence and from the
n = {25,50,75,100} nucleotides at the start and at the end of
each intron and exon. However, Shannon and Rényi entropies
were only calculated for the complete sequence in order to
avoid uncertainty in the calculation of the probability of short
signals, since their consistency decreases with decreasing
number of data points. The position of the peak was calculated
only for the n samples at the start and at the end of intron and
exon sequences. Since this value depends on the length of the
sequence, this could influence the statistical classification. In
these n samples, it was considered that there is a peak if the
peak value exceeds 75% of the maximum value of the entire
sequence; otherwise it was assumed that no maximum value
exists in these regions. In this way, all variables become
independent of the length of the sequence.
2.7 Statistical analysis
Two classes are considered for the statistical analysis:
exons and introns. The Mann-Whitney U test was calculated
for each defined variable in order to find the variables that best
classify introns versus exons. A statistical significance level of
p < 0.05 was considered for the analysis.
The Mann-Whitney test is a non-parametric test,
independent of the kind of distribution, used to compare two
independent classes of sampled data and to assess whether
two independent samples of observations come from the same
distribution. The considered populations were exons and
introns.
A discriminant analysis between both classes was then
performed. The aim of the discriminant analysis was to study
how well the defined variables were able to correctly classify the
data into the two classes, exons and introns. For this purpose,
classification functions were constructed for each variable and
also for combined variables. Each function allows classification
scores to be computed for each variable value calculated for
each class (exons and introns).
A quadratic function, Eq. (9), was used for this
classification. In this formula, the subscript i denotes the
respective class, the subscripts j = 1, 2, ..., m denote the
variable under consideration taken from the total number of
calculated variables, ci is a constant for each class, wij is the
weight for the jth variable in the computation of the
classification score for the ith class, and xj is the observed
value for the respective jth variable. The resultant classification
score is Si.
$$ S_i = (c_i + w_{i1} x_1 + w_{i2} x_2 + \ldots + w_{im} x_m)^2 \qquad (9) $$
The classification function is used to directly compute
classification scores for some new observations. The first step
for classification is to build a discriminant function by
estimating the coefficients w_in; the aim is to make Si as different
as possible between the classes (introns and exons). The
optimal ratio is the one that maximizes Eq. (10).
$$ R = \frac{\text{quadratic sum of ``between class differences''}}{\text{quadratic sum of ``within class differences''}} \qquad (10) $$
where the numerator contains the differences of the terms w_i'n x_n
and w_i''n x_n, with i' ≠ i'', and the denominator contains the
differences of the terms w_i'n x_n and w_i''n x_n, with i' = i'' [36].
The classification algorithm uses half of the
genes for training to build a classification function, and the
other half for validating its accuracy. The percentages of
correctly classified exons and introns were calculated as the
ratio of the number of correctly classified exons (or introns)
to the total number of exons (or introns).
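The hold-out evaluation described above can be sketched as follows. The way the two halves are drawn is not stated in the text, so the random shuffle with a fixed seed is an assumption, as are the function names and the definition of overall accuracy as the fraction of all sequences that are correctly classified.

```python
import random


def split_in_halves(gene_ids, seed=0):
    """Shuffle gene identifiers and split them into training/validation halves."""
    ids = list(gene_ids)
    random.Random(seed).shuffle(ids)
    half = len(ids) // 2
    return ids[:half], ids[half:]


def classification_percentages(true_labels, predicted_labels):
    """Percent of exons and of introns correctly classified, plus accuracy."""
    counts = {"exon": [0, 0], "intron": [0, 0]}  # [correct, total] per class
    for truth, pred in zip(true_labels, predicted_labels):
        counts[truth][1] += 1
        counts[truth][0] += truth == pred
    pct = {c: 100.0 * ok / n for c, (ok, n) in counts.items() if n}
    correct = sum(ok for ok, _ in counts.values())
    total = sum(n for _, n in counts.values())
    pct["accuracy"] = 100.0 * correct / total
    return pct


train_ids, valid_ids = split_in_halves(range(10))
truth = ["exon"] * 4 + ["intron"] * 6
preds = ["exon", "exon", "exon", "intron"] + ["intron"] * 3 + ["exon"] * 3
result = classification_percentages(truth, preds)
```

Splitting by gene rather than by individual exon or intron keeps all sequences of one gene on the same side of the split.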
A feature selection based on a GA was used in order to
find an optimal combination of variables, chosen from all
variables, which provides high accuracy in the classification of
exons and introns. This allows a set of variables with good
percentages of correct classification to be obtained without
exploring all combinations. This algorithm performs a number
of iterations using Eq. (9) with different variables randomly
chosen, in an attempt to maximize a cost function given by the
percentages of correctly classified exons and introns. A more
thorough analysis of the combinations of the defined variables is
thus achieved.
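A generic GA-based feature selection of this kind can be sketched as below. The text does not give the GA configuration, so every design choice here is an assumption made for illustration: the bit-mask encoding of variable subsets, tournament selection, one-point crossover, bit-flip mutation, elitism, and all parameter values. The toy fitness function merely stands in for the classification percentages that would be obtained from Eq. (9) on the training data.

```python
import random


def ga_select(n_vars, fitness, pop_size=30, generations=40, p_mut=0.05, seed=0):
    """Search for a boolean mask over n_vars variables that maximizes fitness.

    Returns (best_mask, history), where history holds the best fitness of
    each generation followed by the fitness of the returned mask.
    """
    rng = random.Random(seed)
    pop = [[rng.random() < 0.5 for _ in range(n_vars)] for _ in range(pop_size)]
    history = []
    for _ in range(generations):
        best = max(pop, key=fitness)
        history.append(fitness(best))
        next_pop = [list(best)]  # elitism: the best mask survives unchanged
        while len(next_pop) < pop_size:
            # Tournament selection of two parents.
            parent_a = max(rng.sample(pop, 3), key=fitness)
            parent_b = max(rng.sample(pop, 3), key=fitness)
            # One-point crossover followed by bit-flip mutation.
            cut = rng.randrange(1, n_vars)
            child = parent_a[:cut] + parent_b[cut:]
            child = [(not g) if rng.random() < p_mut else g for g in child]
            next_pop.append(child)
        pop = next_pop
    best = max(pop, key=fitness)
    history.append(fitness(best))
    return best, history


# Toy fitness: reward three "useful" variables, penalize the others.
USEFUL = {0, 3, 5}


def toy_fitness(mask):
    hits = sum(1 for i, g in enumerate(mask) if g and i in USEFUL)
    extras = sum(1 for i, g in enumerate(mask) if g and i not in USEFUL)
    return hits - 0.5 * extras


best_mask, best_history = ga_select(8, toy_fitness)
```

With elitism the best score never decreases from one generation to the next, which makes the search history easy to inspect.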
3. Results
After obtaining the classification function constructed using
half of the dataset by the training procedure, the best
classification variables were validated using the other half of
the dataset. Results presented in Table 2 show the best
percentages of the validated variables. For study A, only
entropies of instantaneous frequency and spectral energy were
able to characterize (p < 0.0005) exons and introns with a
prediction accuracy higher than 60%. Table 2 shows the mean
value, the standard deviation, and the percentages of correctly
classified exons and introns for the best variables. It can be
noted that the mean value of entropy of exons is always higher
than that of introns. Furthermore, Rényi entropy with q > 2 was
not able to discriminate exons and introns with a prediction
accuracy higher than 60%. The best prediction accuracy was
obtained with the variable of Rényi entropy (q = 0.25) of
instantaneous power (Inst_Pw_HF_RE025).
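For reference, an entropy variable of this kind can be obtained from a series as sketched below, following Eqs. (7) and (8) with Q = 8 levels. The text states only the number of levels, so the equal-width quantization between the minimum and maximum of the series is an assumption, as are the function names.

```python
import math


def quantize(series, levels=8):
    """Map each value to one of `levels` equal-width bins between min and max."""
    lo, hi = min(series), max(series)
    if hi == lo:
        return [0] * len(series)
    width = (hi - lo) / levels
    return [min(int((v - lo) / width), levels - 1) for v in series]


def level_probabilities(series, levels=8):
    """Relative frequency P(i) of each quantization level."""
    q = quantize(series, levels)
    return [q.count(k) / len(q) for k in range(levels)]


def shannon_entropy(p):
    """Eq. (7): ShaEn = -sum P(i) log2 P(i), skipping empty levels."""
    return -sum(pi * math.log2(pi) for pi in p if pi > 0)


def renyi_entropy(p, q):
    """Eq. (8): RE_q = log2(sum P(i)^q) / (1 - q), defined for q != 1."""
    if q == 1:
        raise ValueError("q = 1 is the Shannon limit; use shannon_entropy")
    return math.log2(sum(pi ** q for pi in p if pi > 0)) / (1.0 - q)


uniform = level_probabilities(list(range(8)))  # every level used once
skewed = [0.5, 0.25, 0.25]
```

For the uniform case both entropies reach their maximum of log2(8) = 3 bits; for a skewed distribution the Rényi entropy with q < 1 lies above the Shannon entropy and the one with q > 1 lies below it.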
Combining variables using the GA led to higher
percentages of correctly classified exon and intron sequences.
The GA was applied to three datasets of variables:
1. Only the best 20 single variables with statistical significance
level of p < 0.0005 and prediction accuracy higher than
60%.
2. All the variables calculated from chromosome 1.
3. All the variables calculated from chromosome 2.
Table 3 shows the best combination of variables obtained
from these datasets. The best accuracy was obtained with the
variables obtained from chromosome 2. It should be noted that
the single variables with the highest accuracy do not give the
best accuracy when combined. However, the accuracy
increased when those variables were combined with the
remaining variables.
Table 2. Best single variables for study A.
Variable               Mean (std) for exons   Mean (std) for introns   Correctly classified exons (%)   Correctly classified introns (%)   Accuracy (%)
Chromosome 1 (Nex = 1960, Nintr = 3557)
Inst_Freq_VHF_RE015    2.7869 (0.1638)        2.7016 (0.1526)          61.0                             65.1                               64.1
Inst_Freq_VHF_RE025    2.7457 (0.1530)        2.6240 (0.1554)          64.7                             67.1                               66.4
Inst_Freq_VHF_RE05     2.7548 (0.1547)        2.6265 (0.1588)          67.9                             66.0                               67.3
Inst_Freq_VHF_ShaEn    2.8997 (0.1239)        2.8149 (0.1215)          67.1                             62.4                               64.2
Inst_Freq_VHF_RE2      2.7168 (0.1490)        2.6276 (0.1348)          61.5                             66.6                               65.8
Inst_Freq_HF_RE05      0.7722 (0.0209)        0.7505 (0.0261)          77.2                             62.1                               67.3
Inst_Freq_HF_ShaEn     2.9351 (0.1231)        2.8186 (0.1395)          73.0                             62.6                               66.5
Inst_Freq_LF_ShaEn     2.9289 (0.1214)        2.8158 (0.1372)          71.3                             62.7                               66.2
Inst_Freq_VLF_ShaEn    2.9649 (0.1258)        2.8834 (0.1283)          67.5                             61.2                               63.7
Inst_Freq_ShaEn        2.8669 (0.1650)        2.7867 (0.1473)          60.8                             64.6                               63.0
Inst_Pw_HF_RE025       0.6354 (0.0600)        0.5949 (0.0555)          60.0                             72.0                               68.3
Chromosome 2 (Nex = 2116, Nintr = 5761)
Inst_Freq_VHF_RE015    2.8236 (0.1481)        2.6854 (0.1521)          67.9                             69.1                               69.5
Inst_Freq_VHF_RE025    2.7655 (0.1486)        2.6050 (0.1555)          73.6                             69.0                               70.8
Inst_Freq_VHF_RE05     2.7682 (0.1500)        2.5961 (0.1588)          73.1                             70.4                               71.9
Inst_Freq_VHF_ShaEn    2.9212 (0.1135)        2.7894 (0.1256)          77.1                             65.9                               69.7
Inst_Freq_VHF_RE2      2.7428 (0.1367)        2.6009 (0.1381)          71.9                             68.7                               70.2
Inst_Freq_HF_RE05      0.7747 (0.0200)        0.7450 (0.0270)          82.9                             65.5                               70.9
Inst_Freq_HF_ShaEn     2.9480 (0.1190)        2.7900 (0.1421)          78.4                             68.3                               71.8
Inst_Freq_LF_ShaEn     2.9447 (0.1184)        2.7967 (0.1381)          77.7                             66.9                               70.7
Inst_Freq_VLF_ShaEn    2.9921 (0.1124)        2.8680 (0.1308)          74.8                             65.2                               68.6
Inst_Freq_ShaEn        2.9071 (0.1346)        2.7656 (0.1514)          76.0                             64.0                               67.5
Inst_Pw_HF_RE025       0.6412 (0.0584)        0.5901 (0.0536)          66.7                             72.4                               71.4
Variables with statistical significance level p < 0.0005 for the classification of
exons and introns. Inst: instantaneous; Freq: frequency; Pw: power; std: standard
deviation; REq: Rényi entropy with q = {0.15,0.25,0.5,2,3}; ShaEn: Shannon
entropy; Nex: Number of exons; Nintr: Number of introns.
For study B (Table 4), the best classification for
chromosome 1 was obtained with a window length of 300
nucleotides. The prediction accuracy was higher when only the
pure zones of introns and exons were taken into account.
Furthermore, the prediction accuracy was low when only the
transition zone was taken into account. This means that if the
window is placed on the transition zone, the characteristics of
both exons and introns are mixed, and no statistically significant
difference is found because the features that can differentiate the
two classes are hidden.
Table 3. Best combination of variables for study A (percentages are
correctly classified exons and introns).
Combination *
Variables: Inst_Freq_VHF_RE015, Inst_Freq_VHF_RE025, Inst_Freq_VHF_RE05,
Inst_Freq_VHF_ShaEn, Inst_Freq_VHF_RE2, Inst_Freq_HF_RE025,
Inst_Freq_HF_RE05, Inst_Freq_HF_ShaEn, Inst_Freq_LF_RE025,
Inst_Freq_LF_RE05, Inst_Freq_LF_ShaEn, Inst_Freq_VLF_RE025,
Inst_Freq_VLF_RE05, Inst_Freq_VLF_ShaEn, Inst_Freq_ShaEn
Chromosome 1: exon: 73.9%; intron: 73.1%; accuracy: 73.4%
Chromosome 2: exon: 73.6%; intron: 77.2%; accuracy: 76.2%
Combination **
Variables: Inst_Freq_HF_mean_End50, Inst_Freq_VHF_mean_End25,
Inst_Freq_LF_std_End75, Inst_Freq_HF_std_End100, Inst_Freq_HF_RE05,
Inst_Freq_LF_RE2, Inst_Freq_RE3, Inst_Pw_mean_End100,
Inst_Pw_HF_mean_End75, Inst_Pw_HF_std_End50, Inst_Pw_VLF_peak_Start100,
Inst_Pw_LF_peak_Start75, Inst_Pw_VLF_min_End100,
Inst_Pw_VLF_tauMx_Start100, Inst_Pw_VHF_ShaEn, Inst_Pw_HF_RE015
Chromosome 1: exon: 78.7%; intron: 78.2%; accuracy: 78.5%
Chromosome 2: exon: 79.8%; intron: 76.3%; accuracy: 77.9%
Combination ***
Variables: Inst_Freq_HF_mean_Start25, Inst_Freq_VHF_mean_End50,
Inst_Freq_LF_std, Inst_Freq_LF_std_Start100, Inst_Freq_VHF_std_Start25,
Inst_Freq_LF_RE025, Inst_Freq_VLF_RE05, Inst_Freq_mean, Inst_Freq_ShaEn,
Inst_Pw_VLF_std_End25, Inst_Pw_VLF_std_End75, Inst_Pw_HF_std_End50,
Inst_Pw_HF_peak_Start25, Inst_Pw_VLF_min_End75,
Inst_Pw_HF_tauMx_Start100, Inst_Pw_VLF_tauMx_End100, Inst_Pw_VHF_RE025
Chromosome 1: exon: 75.8%; intron: 80.1%; accuracy: 79.7%
Chromosome 2: exon: 82.8%; intron: 82.4%; accuracy: 83.0%
*Best combination found for the 20 best single variables
**Best combination found for chromosome 1
***Best combination found for chromosome 2
Variables with statistical significance level p < 0.0005 for the classification of
exons and introns. Inst: instantaneous; Freq: frequency; Pw: power; std:
standard deviation; REq: Rényi entropy with q = {0.15,0.25,0.5,2,3}; ShaEn:
Shannon entropy; min: minimum value; tauMx: normalized position of
maximum value; Start/End_n: first/last n nucleotides.
Table 5 shows the subset of variables that gives the
highest prediction accuracy for both studies A and B, for both
chromosomes. These results were obtained by combining the
variables of instantaneous frequency at HF and VHF. The best
results were obtained using the entire sequences, but it can be
noted that sequences of 300 nucleotides were sufficient for
characterizing exons and introns with an accuracy of about
65%.
In study C, the best subset of variables for discrimination
between the complete sequence of exons and the 100
nucleotides at the start (AEIS, all exon versus intron start) and
at the end (AEIE, all exon versus intron end) of introns is
shown in Table 6 for both chromosomes. This study had the
highest prediction accuracy, higher than 85%.
The obtained results seem to show that using only the
original sequences of Gibbs energy is insufficient to
discriminate exons and introns with acceptable accuracy. In fact,
only one variable related to Gibbs energy is included in the
subsets of variables that give the best prediction accuracy, and
this occurred only for study B. The best results were obtained by
combining fewer than 20 variables related to the parameters
obtained from TFR, such as instantaneous frequency and power.
Adding more variables did not significantly
increase the prediction accuracy. Furthermore, the variables
related to the first or last 100 nucleotides in study A gave higher
prediction accuracy than variables related to the first and last 75,
50, and 25 nucleotides. Therefore, for study C, only variables
related to the first and last 100 nucleotides were taken in order
to classify the complete sequences of exons versus the
beginning or the end parts of introns.
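As an illustration of how variables restricted to the first or last n nucleotides might be computed from a series of instantaneous frequency or power, consider the sketch below. It is not the authors' implementation: the histogram-based probability estimate, the number of bins, the normalization used for tauMx, and all function names are assumptions made for the example.

```python
import math
import statistics


def histogram_probs(values, bins=10):
    """Estimate a probability distribution from a series (assumed estimator)."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [1.0]  # constant series: all mass falls in one bin
    counts = [0] * bins
    for v in values:
        idx = min(int((v - lo) / (hi - lo) * bins), bins - 1)
        counts[idx] += 1
    return [c / len(values) for c in counts if c > 0]


def shannon_entropy(probs):
    """Shannon entropy in nats."""
    return -sum(p * math.log(p) for p in probs)


def renyi_entropy(probs, q):
    """Rényi entropy of order q (q != 1), in nats."""
    return math.log(sum(p ** q for p in probs)) / (1.0 - q)


def tau_max(values):
    """Position of the maximum, normalized to [0, 1] (assumed normalization)."""
    if len(values) < 2:
        return 0.0
    return values.index(max(values)) / (len(values) - 1)


def segment_features(series, n, q_values=(0.15, 0.25, 0.5, 2, 3)):
    """Compute statistics on the first (Start) and last (End) n samples."""
    feats = {}
    for tag, seg in (("Start", list(series[:n])), ("End", list(series[-n:]))):
        probs = histogram_probs(seg)
        suffix = f"{tag}{n}"
        feats[f"mean_{suffix}"] = statistics.fmean(seg)
        feats[f"std_{suffix}"] = statistics.pstdev(seg)
        feats[f"min_{suffix}"] = min(seg)
        feats[f"tauMx_{suffix}"] = tau_max(seg)
        feats[f"ShaEn_{suffix}"] = shannon_entropy(probs)
        for q in q_values:
            feats[f"RE{q}_{suffix}"] = renyi_entropy(probs, q)
    return feats
```

The dictionary keys only loosely mirror the naming in the tables (for example `RE0.15_Start100` rather than `RE015_Start100`); band-specific variables would require applying the same function to each frequency band separately.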
In this preliminary evaluation, there were sequences with
very different lengths, especially in study A, since introns
are generally longer than exons. Of note, all
proposed variables are independent of the length of the sequence.
Consequently, the results are not influenced by the differences in
the length of sequences. The prediction accuracy could also be
influenced by the presence of unknown alternative splicing in
the intron sequences. This was confirmed by Pallejà et al. [37],
who identified the main cause of incorrect annotation of genes
as the error in the prediction of initial codons. In particular, a
frameshift at the 3'-end or a point mutation at the stop codon
may cause the loss of the stop codon, thus extending the reading
frame to the next in-frame stop codon. For this reason, some of
the introns could have features of exons, which could cause
misclassification of these introns and decrease accuracy.
Facchiano [38] and Chung et al. [39] confirmed the existence of
overlapping regions between canonical and alternative reading
frames in the exons of a eukaryotic genome. In the future, TFR methodology
could be applied to the characterization of such dual-coding regions, which are
common in bacteriophages and viruses.
For both chromosomes, fairly high accuracy was obtained
by selecting 10 to 15 variables from the 316 variables.
Furthermore, it is possible to isolate many subsets of 10-15
variables with fairly high accuracy for both chromosomes. This
suggests that these variables are sufficiently robust for the
characterization of intron and exon sequences for all
chromosomes of the human genome.
4. Conclusion
A methodology based on TFR was proposed for analyzing
genomic sequences. It was applied to the Gibbs energy of the
boundary between nucleotides for exploring the information in
this energy in the transcription process of genomic sequences.
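The conversion of a nucleotide sequence into a numerical signal, with one value per boundary between adjacent nucleotides, can be sketched as follows. The function name `energy_signal` is hypothetical, and the lookup table must be supplied by the user from published nearest-neighbor thermodynamic data; the values used in the usage example are placeholders, not real free-energy parameters.

```python
def energy_signal(sequence, pair_energy):
    """Map each pair of adjacent nucleotides to a numerical value.

    `pair_energy` is a dict keyed by dinucleotide (e.g. "AC"); a sequence of
    length L yields a signal of length L - 1, one value per boundary.
    """
    seq = sequence.upper()
    return [pair_energy[seq[i:i + 2]] for i in range(len(seq) - 1)]


# Usage with placeholder values (NOT real thermodynamic data):
placeholder_table = {"AC": 1.0, "CG": 2.0, "GT": 3.0}
signal = energy_signal("acgt", placeholder_table)
```

A dinucleotide missing from the table raises a `KeyError`, which makes incomplete tables or unexpected symbols in the sequence easy to detect.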
Variables related to Gibbs energy sequences are insufficient
for correctly classifying exons and introns; their prediction
accuracies are low compared to those of TFR variables.
Table 4. Best combination of variables for the best window length (300 nucleotides) for study B (percentages are correctly classified exons and
introns)
VARIABLES
Inst_Freq_HF_mean_Start75
Inst_Freq_HF_mean_End100
Inst_Pw_VLF_mean
Inst_Pw_VLF_std
Inst_Pw_VHF_std_Start25
Inst_Pw_VHF_std_End100
Inst_Pw_LF_min
Inst_Pw_VHF_tauMx_Start100
Inst_Pw_HF_RE2
RESULTS FOR PURE EXON/INTRON REGIONS
RESULTS FOR TRANSITION REGIONS
OVERLAP:10
exon: 73.2 %
intron: 60.4 %
OVERLAP:25
exon: 72.1 %
intron: 61.8 %
OVERLAP:50
exon: 70.5 %
intron: 64.1 %
accuracy: 63.5 %
accuracy: 64.7 %
accuracy: 65.8 %
exon: 76.7%
intron: 62.9%
accuracy: 66.5%
exon: 63.2%
intron: 48.8%
accuracy: 52.1%
exon: 65.3%
intron: 68.3%
exon: 76.1%
intron: 64.1%
accuracy: 66.7%
exon: 61.5%
intron: 51.2%
accuracy: 54.3%
exon: 65.4%
intron: 68.3%
exon: 74.0%
intron: 66.3%
accuracy: 68.9%
exon: 60.8%
intron: 53.8%
accuracy: 56.9%
exon: 65.5%
intron: 68.4%
accuracy: 68.0%
accuracy: 68.0%
exon: 70.5%
intron: 70.8%
accuracy: 70.7%
exon: 51.2%
intron: 56.6%
accuracy: 55.5%
exon: 71.1%
intron: 61.7%
exon: 70.7%
intron: 70.7%
accuracy: 70.7%
exon: 50.9%
intron: 57.2%
accuracy: 55.5%
exon: 71.0%
intron: 64.2%
accuracy: 63.8%
accuracy: 65.6%
exon: 72.8%
intron: 65.0%
accuracy: 66.9%
exon: 66.4%
intron: 46.3%
accuracy: 51.8%
exon: 71.7%
intron: 65.8%
accuracy: 67.8%
exon: 65.7%
intron: 46.2%
accuracy: 51.7%
Inst_Freq_HF_mean
Inst_Freq_VLF_std_Start25
Inst_Freq_HF_RE015
Inst_Freq_VLF_RE2
Inst_Freq_LF_RE2
Inst_Pw_VLF_mean_Start100
Inst_Pw_LF_min
Inst_Pw_HF_tauMx_End100
accuracy: 68.0%
RESULTS FOR PURE EXON/INTRON REGIONS
exon: 70.5%
intron: 70.8%
accuracy: 70.7%
RESULTS FOR TRANSITION REGIONS
exon: 51.0%
intron: 56.7%
accuracy: 55.5%
GEnergy_mean
Inst_Freq_LF_mean_Start50
Inst_Freq_HF_mean_Start50
Inst_Freq_HF_mean_End100
Inst_Freq_std_End50
Inst_Freq_LF_std_Start100
Inst_Freq_VLF_std_End25
Inst_Pw_VLF_peak_Start75
Inst_Pw_VLF_min
exon: 70.8%
intron: 61.7%
accuracy: 63.8%
RESULTS FOR PURE EXON/INTRON REGIONS
exon: 72.6%
intron: 65.0%
accuracy: 66.9%
RESULTS FOR TRANSITION REGIONS
exon: 65.7%
intron: 46.2%
accuracy: 51.7%
Variables with statistical significance level p < 0.0005 for the classification of exons and introns. Inst: instantaneous; Freq: frequency; Sp: spectral; En: energy; std:
standard deviation; REq: Rényi entropy with q = {0.15, 0.25, 0.5, 2, 3}; ShaEn: Shannon entropy; min: minimum value; tauMx: normalized position of maximum value;
Start/End_n: first/last n nucleotides.
Table 5. Best combination of variables for studies A and B and
chromosomes 1 and 2

            STUDY A               STUDY B (DETECTION; WINDOW: 300, OVERLAP: 25)
            CHR 1      CHR 2      CHR 1      CHR 2
exon        75.0%      72.4%      69.9%      74.1%
intron      73.8%      77.0%      62.6%      64.7%
accuracy    74.6%      76.2%      64.8%      66.8%

Four variables: Inst_Freq_HF_mean; Inst_Freq_HF_ShaEn;
Inst_Freq_VHF_mean_End5; Inst_Freq_VHF_min_Start100
The start and end parts of introns (cut points 5' and 3',
respectively) and the polypyrimidine tract (toward the end of an
intron) contain nucleotide sequences that work as signals
indicating where the spliceosome has to cut. For this reason, the
signals obtained in these regions, after the conversion of
nucleotides into numerical signals, are expected to be more regular
than the signals obtained from exons, which do not
need to notify the spliceosome about their position. This
Table 6. Best combination of variables for study C (percentages are
correctly classified exons and introns).

VARIABLES
Inst_Freq_VLF_mean
Inst_Freq_LF_mean
Inst_Freq_HF_mean
Inst_Freq_VHF_mean
Inst_Freq_VLF_std
Inst_Freq_LF_std
Inst_Freq_HF_std
Inst_Freq_VHF_std
Inst_Freq_mean
Inst_Freq_std
Inst_Pw_VLF_mean
Inst_Pw_LF_mean
Inst_Pw_HF_mean
Inst_Pw_VHF_mean
Inst_Pw_VLF_std
Inst_Pw_LF_std
Inst_Pw_HF_std
Inst_Pw_VHF_std

                AEIS                            AEIE
                exon     intron   accuracy      exon     intron   accuracy
CHROMOSOME 1    80.8%    89.4%    86%           80.5%    92.3%    88%
CHROMOSOME 2    85.5%    92.3%    90%           85.2%    93.6%    91%

Variables with statistical significance level p < 0.0005 for the classification of
exons and introns.
hypothesis is supported by the fact that the mean value of the
entropy of the sequences of instantaneous frequency and
instantaneous power is always higher for exons than
for introns, indicating that introns are structurally less
complex (i.e., more regular). This work was a preliminary
study on the application of TFR to the characterization of
nucleic acid sequences. Further validation should be
performed by applying this methodology to different
mammalian genomes.
Acknowledgment
This work was supported within the framework of the
CICYT grants TEC2010-20886-C02-01 and TEC2010-20886-C02-02, the Ramón y Cajal Program from the Spanish
Government, and the Research Fellowship Grant FPU AP2009-0858 from the Spanish Government. CIBER of Bioengineering,
Biomaterials and Nanomedicine is an initiative of ISCIII.
References
[1] J. Valcárcel, R. K. Gaur, R. Singh and M. R. Green, “Interaction
of U2AF65 RS region with pre-mRNA branch point and
promotion of base pairing with U2 snRNA,” Science, 273:
1706-1709, 1996.
[2] N. A. Faustino and T. A. Cooper, “Pre-mRNA splicing and
human disease,” Genes Dev., 17: 419-437, 2003.
[3] B. R. Graveley, K. J. Hertel and T. Maniatis, “The role of
U2AF35 and U2AF65 in enhancer-dependent splicing,”
RNA-Publ. RNA Soc., 7: 806-818, 2001.
[4] B. J. Blencowe, “Exonic splicing enhancers: Mechanism of
action, diversity and role in human genetic disease,” Trends
Biochem. Sci., 25: 106-110, 2000.
[5] J. L. Kan and M. R. Green, “Pre-mRNA splicing of IgM exons
M1 and M2 is directed by a juxtaposed splicing enhancer and
inhibitor,” Genes Dev., 13: 462-471, 1999.
[6] E. E. Snyder and G. D. Stormo, “Identification of coding regions
in genomic DNA sequences: an application of dynamic
programming and neural networks,” Nucleic Acids Res., 21:
607-613, 1993.
[7] Y. Xu, R. J. Mural and E. C. Uberbacher, “Constructing gene
models from accurately predicted exons: an application of
dynamic programming,” Comput. Appl. Biosci., 10: 613-623,
1994.
[8] R. Lopez, F. Larsen and H. Prydz, “Evaluation of the exon
predictions of the GRAIL software,” Genomics, 24: 133-136,
1994.
[9] V. V. Solovyev, A. A. Salamov and C. B. Lawrence, “Predicting
internal exons by oligonucleotide composition and discriminant
analysis of spliceable open reading frames,” Nucleic Acids Res.,
22: 5156-5163, 1994.
[10] M. Q. Zhang and T. G. Marr, “Fission yeast gene structure and
recognition,” Nucleic Acids Res., 22: 1750-1759, 1994.
[11] M. Q. Zhang, “Identification of protein coding regions in the
human genome by quadratic discriminant analysis,” Proc. Natl.
Acad. Sci. U. S. A., 94: 565-568, 1997.
[12] T. Chen and M. Q. Zhang, “Pombe: a gene-finding and
exon-intron structure prediction system for fission yeast,” Yeast,
14: 701-710, 1998.
[13] A. Nandy, “Two-dimensional graphical representation of DNA
sequences and intron-exon discrimination in intron-rich
sequences,” Comput. Appl. Biosci., 12: 55-62, 1996.
[14] Y. Wu, A. W. Chung Liew, H. Yan and M. Yang, “Classification
of short human exons and introns based on statistical features,”
Phys. Rev. E, 1: 1-7, 2006.
[15] H. T. Chang, “DNA sequence visualization,” in Dr. Hui-Huang
Hsu (Ed.), Advanced Data Mining Technologies in
Bioinformatics, Idea Group Publishing, 4: 63-84, 2006.
[16] N. W. Lo, T. H. Chang, S. W. Xiao and C. J. Kuo, “Global
visualization of DNA sequences by use of three-dimensional
trajectories,” J. Inf. Sci. Eng., 23: 1723-1736, 2007.
[17] D. Kotlar and Y. Lavner, “Gene prediction by spectral rotation
measure: a new method for identifying protein-coding regions,”
Genome Res., 13: 1930-1937, 2003.
[18] S. T. Eskesen, F. N. Eskesen, B. Kinghorn and A. Ruvinsky,
“Periodicity of DNA in exons,” BMC Mol. Biol., 5: 12-23, 2004.
[19] S. Tiwari, S. Ramachandran, A. Bhattacharya, S. Bhattacharya
and R. Ramaswamy, “Prediction of probable genes by Fourier
analysis of genomic sequences,” Comput. Appl. Biosci., 13:
263-270, 1997.
[20] M. Yan, Z. S. Lin and C. T. Zhang, “A new Fourier transform
approach for protein coding measure based on the format of the
Z curve,” Bioinformatics, 14: 685-690, 1998.
[21] M. Akhtar, J. Epps and E. Ambikairajah, “Signal processing in
sequence analysis: advances in eukaryotic gene prediction,”
IEEE J. Sel. Top. Signal Process., 2: 310-321, 2008.
[22] M. K. Choong and H. Yan, “Multi-scale parametric spectral
analysis for exon detection in DNA sequences based on
forward-backward linear prediction and singular value
decomposition of the double-base curves,” Bioinformation, 2:
273-278, 2008.
[23] P. P. Vaidyanathan and B. J. Yoon, “The role of signal processing
concepts in genomics and proteomics,” J. Frankl. Inst.-Eng.
Appl. Math., 341: 111-135, 2004.
[24] D. Anastassiou, “Genomic signal processing,” IEEE Signal
Process. Mag., 18: 8-20, 2001.
[25] J. Ning, C. N. Moore and J. C. Nelson, “Preliminary wavelet
analysis of genomic sequences,” Proc. of IEEE Int. Conf.
Computer Society Bioinformatics (CSB'03), 1: 509-510, 2003.
[26] J. W. Fickett and C. S. Tung, “Assessment of protein coding
measures,” Nucleic Acids Res., 20: 6441-6450, 1992.
[27] C. Burge and S. Karlin, “Prediction of complete gene structures
in human genomic DNA,” J. Mol. Biol., 268: 78-94, 1997.
[28] Ensembl release 57, 2010. Available: http://www.ensembl.org.
[29] S. Durinck, W. Huber and S. Davis, “Interface to BioMart
databases,” biomaRt, Available:
http://www.bioconductor.org/packages/2.2/bioc/html/biomaRt.html
[30] K. J. Breslauer, R. Frank, H. Blöcker and L. A. Marky,
“Predicting DNA duplex stability from the base sequence,” Proc.
Natl. Acad. Sci. U. S. A., 83: 3746-3750, 1986.
[31] N. Sugimoto, S. Nakano, M. Yoneyama and K. Honda,
“Improved thermodynamic parameters and helix initiation factor
to predict stability of DNA duplexes,” Nucleic Acids Res., 24:
4501-4505, 1996.
[32] F. Hlawatsch and G. F. Boudreaux-Bartels, “Linear and quadratic
time-frequency signal representations,” IEEE Signal Process.
Mag., 9: 21-67, 1992.
[33] F. Clariá, M. Vallverdú, R. Baranowski, L. Chojnowska and P.
Caminal, “Time-frequency analysis of the RT and RR variability
to stratify hypertrophic cardiomyopathy patients,” Comput.
Biomed. Res., 33: 416-430, 2000.
[34] L. Cohen, Time-frequency analysis, Prentice Hall Signal
Processing Series, 1995.
[35] F. Clariá, M. Vallverdú, R. Baranowski, L. Chojnowska and P.
Caminal, “Heart rate variability analysis based on time-frequency
representation and entropies in hypertrophic
cardiomyopathy patients,” Physiol. Meas., 29: 401-416, 2008.
[36] S. Mika, G. Rätsch, J. Weston, B. Schölkopf and K. R. Müller,
“Fisher discriminant analysis with kernels,” Proc. IEEE Int.
Workshop on Neural Networks for Signal Process., 9: 41-48,
1999.
[37] A. Pallejà, E. D. Harrington and P. Bork, “Large gene overlaps
in prokaryotic genomes: result of functional constraints or
mispredictions?” BMC Genomics, 9: 335-345, 2008.
[38] A. Facchiano, “Coding in noncoding frames,” Trends Genet., 12:
168-169, 1996.
[39] W. Y. Chung, S. Wadhawan, R. Szklarczyk, S. Kosakovsky and
A. Nekrutenko, “A first look at ARFome: Dual-coding genes in
mammalian genomes,” PLoS Comput. Biol., 3: e91, 2007.