International Journal of Computers and Applications, Vol. 33, No. 3, 2011

A NOVEL APPROACH FOR COMPRESSING DNA SEQUENCES USING SEMI-STATISTICAL COMPRESSOR

Ashutosh Gupta∗ and Suneeta Agarwal∗∗

∗ Institute of Engineering and Technology, M. J. P. Rohilkhand University, Bareilly, UP, India; e-mail: ashutosh333@rediffmail.com
∗∗ Motilal Nehru National Institute of Technology, Allahabad, UP 211004, India; e-mail: [email protected]
Recommended by Dr. L. C. Monticone
(DOI: 10.2316/Journal.202.2011.3.202-3114)

Abstract
In this paper, we present an algorithm for DNA sequence compression that uses a replacement method. The replacement method introduces words, and a word-based compression scheme is used for encoding. The encoder uses ranks to assign codes to words. The developed statistical compression algorithm is efficient and useful for DNA sequence compression. We show experimentally that the designed algorithm performs better than existing compressors on typical DNA sequence datasets.

Key Words
DNA sequences, DNA compression, word-based tagged code

1. Introduction
There is an overabundance of data of particular categories that ought to be compressed for easier storage and communication [1]. Among them are textual documents (plain text, programming languages, etc.), pictures, audio, etc. In this study, we focus on the compression of one specific type of text only, namely biological sequences. DNA is the physical medium in which all properties of living organisms are encoded. Understanding DNA sequences is a primary concern in molecular biology [2]. Some of the important molecular biology databases are designed to store nucleotide sequences and amino acid sequences of proteins [2]. The size of these databases is currently growing exponentially [3]. The compression of genetic information is therefore a very significant task.
A DNA sequence consists of four types of nucleotides: adenine (abbreviated A), cytosine (C), guanine (G), and thymine (T). DNA has a double-helix structure in which two opposite strands are joined by hydrogen bonds connecting T with A and G with C. In whole genomes, these sequences of symbols are very long. The human genome, for example, consists of about three billion symbols over 23 pairs of chromosomes. As the number of genome sequences is growing very fast, the complexity of storing them has to be addressed. The significance of mutual compression for identifying interesting patterns in genomes is documented by Stern et al. [4]. In [5], it is shown that compression is a good measure for establishing relationships between sequences. Traditional text compression methods are not suitable for DNA sequences. As DNA sequences consist of only four symbols, A, T, C, and G, each symbol can be represented by 2 bits. On DNA data, however, typical compression tools such as gzip, bzip2, and compress need more than 2 bits per symbol. This makes DNA compression a challenge. Algorithms such as GenCompress [6], Biocompress [1], and Biocompress-2 [7] use the distinctive properties of DNA and obtain a compression rate of around 1.76 bits per symbol [8]. Many compression schemes have been proposed for DNA sequence compression [6], [9]–[15]. All these schemes take advantage of the fact that DNA sequences consist of only four symbols, along with methods that exploit the repetitive nature of DNA [16].
This paper presents a method for compressing DNA sequences. The compressor has two phases. In the first phase, the original sequence is transformed into a sequence consisting of words. We have observed experimentally that converting the four base symbols (i.e., a sequence of characters) into words gives an excellent compression ratio. In the second phase, the encoding is done with WBTC [17]. The designed algorithm obtains nearly 0.6574 bits per symbol on average. The results given in Section 5 show that the proposed algorithm performs better than existing compressors on standard DNA sequences.
This paper is organized as follows. In Section 2, we review existing research on DNA compression. The proposed algorithm is illustrated in Section 3. In Section 4, we describe the actual encoding and decoding algorithms, followed by experimental results in Section 5. Finally, we conclude our work in Section 6.
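The 2-bits-per-symbol baseline mentioned in the Introduction can be illustrated with a minimal sketch. The bit assignment (A=00, C=01, G=10, T=11) and the function names `pack_2bit` and `unpack_2bit` are assumptions made for illustration; they are not part of DNACompact.

```python
# Illustrative 2-bit baseline: four bases per byte.
# The mapping below is an arbitrary choice for this sketch, not taken from the paper.
CODE = {"A": 0b00, "C": 0b01, "G": 0b10, "T": 0b11}
BASES = "ACGT"

def pack_2bit(seq):
    """Pack a DNA string into bytes; a short final chunk is zero-padded."""
    out = bytearray()
    for i in range(0, len(seq), 4):
        chunk = seq[i:i + 4]
        byte = 0
        for base in chunk:
            byte = (byte << 2) | CODE[base]
        byte <<= 2 * (4 - len(chunk))  # left-align a short final chunk
        out.append(byte)
    return bytes(out)

def unpack_2bit(data, length):
    """Recover the first `length` bases from packed bytes."""
    bases = []
    for byte in data:
        for shift in (6, 4, 2, 0):
            bases.append(BASES[(byte >> shift) & 0b11])
    return "".join(bases[:length])

if __name__ == "__main__":
    seq = "AATTCCGGGATACGACATA"
    packed = pack_2bit(seq)
    print(len(seq), "bases ->", len(packed), "bytes")
    print(unpack_2bit(packed, len(seq)) == seq)
```

The sequence length has to be stored separately in this sketch, because the padding in the last byte is otherwise indistinguishable from trailing A symbols.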
2. Related Work
A DNA chain is represented by four symbols (A, C, G, T). If the chain were totally random (i.e., incompressible [18]), 2 bits would be required to encode each symbol. DNA sequences, however, are known to carry significant information across different generations of organisms. Furthermore, from the perspective of compression and sequence analysis, the repetitions inherent in DNA sequences involve redundancies that offer an avenue for significant compression. Recognizing these dependencies is the basis for compressing DNA sequences.

2.1 Standard Text Compression
Text compression methods fall into three categories, namely substitution-, dictionary-, and context-based methods [19], [20]. In the first method [21], every symbol is substituted with a new code such that symbols that occur more often receive shorter codes, which achieves compression. In the second method, a vocabulary of frequently occurring symbols or groups of symbols is constructed from the input string. Compression is achieved by substituting occurrences of these symbols with pointers to their positions in the vocabulary. This dictionary (or vocabulary) may be constructed off-line [22] or on-line. In on-line dictionary methods [17], the text itself is used as the vocabulary, and symbols that have previously been observed in the sequence are substituted with pointers to the positions of their preceding occurrence. Examples of on-line dictionary methods are the compressors of the LZ family [23], [24]. In off-line dictionary methods, the input sequence is compressed in two passes. In the first pass, the compressor builds the lexicon by identifying the recurring sequences, and the second pass encodes the recurring symbols with pointers into the lexicon. Rubin [25] and Wolff [26] describe a variety of issues in dictionary-based methods. Storer and Szymanski [27] give a general framework that explains different substitution-based methods.
Finally, the third method makes use of the fact that the probability of a symbol may be affected by its neighbouring symbols. The contexts are normally expressed in terms of symbol neighbourhoods in the input sequence. Prediction by partial match (PPM) [28] makes use of contexts of different sizes. In this scheme, symbols are encoded by relating them to the preceding occurrences of their present context and then selecting the best matching contexts [29], [30].

2.2 Biological Sequence Compression
The first DNA compression scheme was given by Grumbach and Tahi [1]. This algorithm, named BioCompress, detects exact repeats in DNA using an automaton and uses Fibonacci coding to encode the length and position of the preceding occurrence. If a substring is not a repeat, it is encoded with 2 bits per symbol. The improved version of BioCompress is named BioCompress-2 [7]. The Off-line method of DNA compression was developed by Apostolico and Lonardi [31]; this algorithm iteratively selects repeated subsequences for encoding such that compression is maximized. A similar approach, which exploits approximate repeats, is given by Chen et al. [6], [32]. Rivals et al. [16] developed another DNA compressor named Cfact. It builds the suffix tree of the string in the first pass, and encoding is done in the second pass.
A good number of compression schemes use methods similar to GenCompress to encode approximate repeats. These methods differ only in how they encode non-repeat regions and how they identify repeats. The DNAPack algorithm, developed by Behzadi and Fessant [33], uses a dynamic programming approach to locate repeats. Non-repeat regions are coded by the best choice among an order-2 Markov model, context tree weighting, and a 2-bits-per-symbol scheme. Matsumoto et al. [8] developed the CTW+LZ algorithm, which encodes considerably long repeats by the replacement method and encodes short repeats and non-repeat regions by context tree weighting [34].
Many DNA compression schemes combine statistical and substitution methods. An approximate repeat is coded with a pointer to a preceding occurrence together with the probabilities of characters being copied, changed, inserted, or deleted. The NML and GeMNL algorithms are given by Tabus et al. [35] and Korodi and Tabus [36]. In these methods, the DNA string is divided into fixed-size blocks. Each block is coded by a method that searches the past for a regressor. The bit mask is encoded using a probability distribution estimated by NML from the similarity between the regressor and the block.
CDNA and ARM were developed by Loewenstern and Yianilos [37] and Allison et al. [38], respectively. These compressors are based on statistical compression methods. In CDNA, the probability distribution of each character is obtained from approximate partial matches in the past. Each approximate match is with a preceding substring that has a small Hamming distance to the context preceding the symbol to be encoded. These predictions are combined using a set of weights. The ARM algorithm forms the probability of a substring by summing the probabilities over all explanations.
In this paper, a statistical algorithm named DNACompact is presented. The algorithm works in two phases. In the first phase, the original sequences are transformed, and in the second phase the actual encoding is done.

3. Overview of DNACompact
As already pointed out in Section 1, DNA sequences consist of four symbols, namely A, C, G, and T. The working of DNACompact is divided into two phases. The transformation phase transforms the original DNA string into a string over three symbols, namely A, C, and space. For simplicity, the space symbol is written as ψ throughout the paper. The steps of the transformation process are described in Section 3.1. Let the transformed sequence be S1. Next, in the encoding phase, the Word-Based Tagged Code (WBTC) method [17] is used to encode the transformed sequence S1. We originally developed WBTC for text compression. These two phases of DNACompact are described in the following subsections.
3.2.1 WBTC Method
The WBTC method was given by Gupta and Agarwal [2], [17]. One property of this code is that every codeword ends with one of the 2-bit patterns 01 or 10, which act as the end-of-code mark. In WBTC, the source text is read and the vocabulary statistics are gathered; here, the source text is the transformed sequence S1. After this first pass is complete, the vocabulary is generated and sorted by decreasing frequency. The codewords in WBTC are built from the 2-bit patterns 00, 11, 01, and 10. The code assignment procedure is given below:
1. At level $l = 1$, the first $2^l$ words of the vocabulary, having ranks 0 to $2^0$, are assigned the codes 01 and 10, respectively.
2. At the next level, $l = 2$, the $2^l$ words at ranks $2^1 + 0$ to $2^2 + 2^0$ are coded with 4 bits by prefixing 00 and 11 to all the codes of the preceding level.
3. In general, at any level $l$, the next $2^l$ words, at ranks $2^l - 2$ to $2^{l+1} - 3$ (the range given by (1)), are encoded using $2l$ bits by prefixing 00 and 11 to all the codes generated at the previous level.
4. Steps 1–3 are repeated until all the N words in the transformed sequence S1 are coded.
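The code assignment steps can be sketched as follows. This is a minimal illustration, not the authors' C implementation; the function name `wbtc_codewords` is hypothetical, and the order in which codes are handed out inside one level (all 00-prefixed codes first, then all 11-prefixed codes) is an assumption, since the text only states that 00 and 11 are prefixed to the codes of the preceding level.

```python
def wbtc_codewords(n_words):
    """Return codewords (as bit strings) for ranks 0 .. n_words - 1."""
    codes = []
    level = ["01", "10"]  # level 1: ranks 0 and 1
    while len(codes) < n_words:
        codes.extend(level)
        # Next level: prefix 00, then 11, to every code of the current level.
        # The ordering within a level is an assumption of this sketch.
        level = [prefix + code for prefix in ("00", "11") for code in level]
    return codes[:n_words]

if __name__ == "__main__":
    for rank, code in enumerate(wbtc_codewords(8)):
        print(rank, code)
```

Every codeword produced this way is a run of 00/11 pairs closed by 01 or 10, so a decoder can find codeword boundaries by reading two bits at a time.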
An important feature of WBTC is that the code of a word depends only on its rank. Therefore, neither the frequencies nor the codewords need to be stored along with the compressed stream.
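Because codes depend only on rank, a decoder that receives the vocabulary in rank order can rebuild every codeword on its own. The sketch below illustrates this round trip under stated assumptions: the helper names (`wbtc_codewords`, `encode`, `decode`) are hypothetical, ties in frequency are broken by first appearance, and a dictionary lookup stands in for the level computation and binary search that the paper's decoder uses.

```python
from collections import Counter

def wbtc_codewords(n_words):
    """Codewords for ranks 0 .. n_words - 1 (ordering within a level is assumed)."""
    codes, level = [], ["01", "10"]
    while len(codes) < n_words:
        codes.extend(level)
        level = [prefix + code for prefix in ("00", "11") for code in level]
    return codes[:n_words]

def encode(words):
    """Return (vocabulary sorted by decreasing frequency, bit string)."""
    counts = Counter(words)
    # sorted() is stable, so equally frequent words keep first-appearance order.
    vocab = sorted(counts, key=lambda w: -counts[w])
    code_of = dict(zip(vocab, wbtc_codewords(len(vocab))))
    return vocab, "".join(code_of[w] for w in words)

def decode(vocab, bits):
    """Rebuild the word sequence from the rank-ordered vocabulary and the bits."""
    word_of = dict(zip(wbtc_codewords(len(vocab)), vocab))
    words, start = [], 0
    for i in range(0, len(bits), 2):
        if bits[i:i + 2] in ("01", "10"):  # end-of-code tag reached
            words.append(word_of[bits[start:i + 2]])
            start = i + 2
    return words

if __name__ == "__main__":
    words = ["AA", "A", "ACC", "C", "C", "CA", "AAC", "CACA", "AA"]
    vocab, bits = encode(words)
    print(vocab)
    print(bits)
    print(decode(vocab, bits) == words)
```

Only `vocab` and `bits` cross from encoder to decoder here, which mirrors the statement that frequencies and codewords are not stored.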
3.1 First Phase (Transformation)
In the first phase, DNACompact converts the original DNA sequence into a sequence over three symbols: A, C, and space. The space symbol is written as ψ for simplicity. The following example demonstrates the working of the transformer.
Let the original sequence be
T = AATTCCGGGATACGACATA
The length of sequence T is 19 characters. The transformer replaces every occurrence of T by ψA (i.e., a space followed by the symbol A) and every occurrence of G by ψC (i.e., a space followed by the symbol C). Let the transformed sequence be S1.
The sequence S1 is:
AAψAψACCψCψCψCAψAACψCACAψAA
The size of S1 is 27 characters. Note that the size of S1 increases because of the added space (ψ) symbols. This transformation is necessary because we want to convert the original sequence of four symbols into a sequence of words. As spaces are treated as word separators, it is easy to distinguish between two consecutive words. Since base symbols repeat frequently in DNA sequences, a large number of space symbols are introduced. Next, encoding is performed on the transformed sequence S1.
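The transformation and its inverse can be sketched in a few lines. The function names `transform` and `inverse_transform` are hypothetical, and an ordinary space character stands in for ψ; the replacement rules themselves (T becomes ψA, G becomes ψC) are those stated in the text.

```python
PSI = " "  # the separator written as ψ in the paper

def transform(dna):
    """Replace T by ψA and G by ψC, leaving A and C unchanged."""
    return dna.replace("T", PSI + "A").replace("G", PSI + "C")

def inverse_transform(s1):
    """Undo the transformation: ψA becomes T and ψC becomes G."""
    return s1.replace(PSI + "A", "T").replace(PSI + "C", "G")

if __name__ == "__main__":
    t = "AATTCCGGGATACGACATA"
    s1 = transform(t)
    print(repr(s1), len(s1))
    print(s1.split(PSI))  # the words handed to the second phase
    print(inverse_transform(s1) == t)
```

The inverse is unambiguous because every ψ in S1 is immediately followed by A or C. Splitting S1 on ψ yields the word sequence that the second phase encodes.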
3.2 Encoding Scheme
The transformed sequence S1 consists of words separated by the space symbol. As the four base symbols repeat frequently in DNA sequences, many of the symbols in S1 are spaces. This necessitates a model for the transformed text, as the text consists not merely of words but also of spaces. Witten [39] and Bell et al. [40] use two different alphabets: one for words and one for separators. As words and separators strictly alternate, there is no ambiguity about which alphabet to use once it is known whether the text starts with a separator or a word. de Moura et al. [41] introduce the spaceless model for words. If a word is followed by a single space, only the word is encoded; otherwise, the word is encoded followed by the code of the separator. During decoding, a word is decoded as usual and the decoder assumes that a space follows it. We have used the spaceless model for the transformed sequence S1, since S1 consists of words each followed by a space. Information about the words present in the transformed sequences is presented in Section 5. The Word-Based Tagged Code encoding method is explained in Section 3.2.1.

3.2.2 Some Useful Results
The observations presented in this section are useful in the encoding and decoding process.
Result 1 (Relation between Level (l) and Range of Ranks of Words (R)): Let $l$ be the level and $R_l$ be the range of ranks of words in level $l$. Then the range $R_l$ is expressed as
\[ \sum_{j=1}^{l-1} 2^j \;\le\; R_l \;\le\; \sum_{j=1}^{l} 2^j - 1 \qquad (1) \]
Result 2 (Relationship between Level (l) and Rank (r) of Word): For a word of given rank $r$, its level $l$ is expressed in terms of $r$ as:
\[ l = \lceil \log_2 (r + 3) \rceil - 1 \qquad (2) \]
Result 3 (Estimation of Maximum Codeword Length for a given Source Message): Let $r$ be the highest rank of a word; then the maximum codeword length $b_{\max}$ is expressed in terms of $r$ as:
\[ b_{\max} = 2 \times \left( \lceil \log_2 (r + 3) \rceil - 1 \right) \qquad (3) \]
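Results 1 to 3 can be checked numerically with a short sketch. It assumes that (2) is read with a ceiling, i.e. l = ⌈log2(r + 3)⌉ − 1, and that a codeword at level l has 2l bits, as stated in the code assignment procedure; the function names are hypothetical.

```python
import math

def level_of_rank(r):
    """Level of the word with rank r; equals ceil(log2(r + 3)) - 1."""
    return (r + 2).bit_length() - 1

def codeword_bits(r):
    """Codeword length in bits for rank r (two bits per level)."""
    return 2 * level_of_rank(r)

def rank_range(level):
    """Smallest and largest rank coded at the given level, as in (1)."""
    return 2 ** level - 2, 2 ** (level + 1) - 3

if __name__ == "__main__":
    for r in range(8):
        print(r, level_of_rank(r), codeword_bits(r))
    # Cross-check the integer formula against the logarithmic form of (2).
    print(all(level_of_rank(r) == math.ceil(math.log2(r + 3)) - 1
              for r in range(2000)))
```

For example, Table 1 lists at most 1977 distinct words in a file, so the highest rank is 1976, which falls in level 10 and needs a 20-bit codeword under this reading.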
Result 4 (Estimation of Total Number of Bits (Loose Upper Bound) up to Level l): Let $l$ be the level. Then the total number of bits ($B$) used for the text up to the highest level is given by:
\[ B = 2 \sum_{i=1}^{l} \sum_{j=0}^{N-1} p_j \cdot 2^i \cdot i, \quad \text{where } i \ne j,\; i \ge 1,\; j \ge 0 \qquad (4) \]
where $N = 2^{(l+1)} - 2$; $l > 0$.
Result 5 (Estimation of Total Number of Bits up to Rank (r) of Word): If the nth word is the last word, having rank $r$, then the total number of bits up to rank $r$ ($B_r$) = total number of bits up to level $(l - 1)$ + number of bits up to rank $r$ at level $l$:
\[ B_r = 2 \sum_{i=1}^{l-1} \sum_{j=0}^{2^l - 2} p_j \cdot i \cdot 2^i \;+\; \sum_{k=R_1}^{r} p_k \left[ \left\{ 2 \times \left( \lceil \log_2 (r + 3) \rceil - 1 \right) \right\} \times (r - R_1 + 1) \right] \qquad (5) \]
where
\[ R_1 = \sum_{j=1}^{l-1} 2^j, \qquad (6) \]
and $i \ne j$, $i \ge 1$, $j \ge 0$.
The above results are quite useful in the encoding and decoding process.

Table 1
Details of Test Data

| Family | Filename | Size (bytes) | N | n |
|---|---|---|---|---|
| Historical | CHMPXX | 121024 | 60819 | 667 |
| | CHNTXX | 155939 | 78070 | 731 |
| | HEHCMVCG | 229354 | 114968 | 1013 |
| | HUMDYSTROP | 38770 | 19608 | 374 |
| | HUMGHCSA | 66495 | 32913 | 406 |
| | HUMHBB | 73308 | 37094 | 535 |
| | HUMHDABCD | 58864 | 30596 | 439 |
| | HUMHPRTB | 56737 | 29767 | 425 |
| | MPOMTCG | 186609 | 94189 | 906 |
| | MTPACGA | 100314 | 51083 | 442 |
| | Vaccg | 191737 | 95806 | 667 |
| Mouse | mm19.chr | 134180 | 64222 | 956 |
| | mmy.chr | 711108 | 363119 | 1855 |
| Yeast | SporAll2x.fasta | 444906 | 206054 | 913 |
| | SporAll.fasta | 222453 | 103027 | 913 |
| | SporEarlyI.fasta | 31039 | 14524 | 328 |
| | SporEarlyII.fasta | 25008 | 11575 | 286 |
| | SporMiddle.fasta | 54325 | 24933 | 469 |
| | Y14.chr | 784328 | 391117 | 1923 |
| | Y1.chr | 230203 | 115733 | 1010 |
| | Y4.chr | 131056 | 65897 | 708 |
| | Ymit.chr | 85779 | 42747 | 293 |
| | AllUp1M | 1001002 | 476102 | 1977 |
| | AllUp400k | 399615 | 189780 | 1228 |
| | HeldenAll | 112507 | 51273 | 670 |
| | HeldenCGN | 32871 | 14995 | 345 |

Figure 1. Encoding algorithm.
Figure 2. Decoding algorithm.

4. Encoding and Decoding Algorithm
4.1 Encoding Algorithm
In this algorithm, the codes of all the words are computed and stored in a separate data structure named CodeVector. Its pseudo-code is presented in Fig. 1.
The encoder reads a word and assigns the corresponding code from the CodeVector. Doing this until S1 is exhausted yields the compressed file (C1). The vocabulary is kept along with C1 so that it is available for decompression. Another interesting point is that there is no requirement
to store either the frequencies or the codewords in C1. It is sufficient to store the plain words sorted by frequency. This keeps the vocabulary very small.

Table 2
Comparison of DNA Compression (bits per symbol)

| Sequence | CHMPXX | CHNTXX | HEHCMVCG | HUMDYSTR | HUMGHCSA | HUMHBB | HUMHDAB | HUMHPRTB | MPOMTCG | MTPACG | VACCG | Average |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Length | 121024 | 155844 | 229354 | 33770 | 66495 | 73308 | 58864 | 56737 | 186609 | 100314 | 191737 | – |
| gzip | 2.2818 | 2.3345 | 2.3275 | 2.3618 | 2.0648 | 2.245 | 2.2389 | 2.2662 | 2.3288 | 2.2919 | 2.2518 | 2.2721 |
| bzip | 2.1218 | 2.1845 | 2.1685 | 2.1802 | 1.7289 | 2.1481 | 2.0678 | 2.0944 | 2.1701 | 2.1225 | 2.0949 | 2.0983 |
| ac-o2 | 1.8364 | 1.9333 | 1.9647 | 1.9235 | 1.9377 | 1.9176 | 1.9422 | 1.9283 | 1.9654 | 1.8723 | 1.904 | 1.9205 |
| ac-o3 | 1.8425 | 1.9399 | 1.9619 | 1.9446 | 1.9416 | 1.9305 | 1.9466 | 1.9352 | 1.9689 | 1.8761 | 1.9064 | 1.9267 |
| gzip-4 | 1.8635 | 1.9519 | 1.9817 | 1.9473 | 1.7372 | 1.8963 | 1.9141 | 1.9207 | 1.9727 | 1.8827 | 1.8741 | 1.9038 |
| bzip-4 | 1.9667 | 2.009 | 2.0091 | 2.0678 | 1.8697 | 1.9957 | 1.9921 | 2.0045 | 2.0117 | 1.9847 | 1.952 | 1.9875 |
| dna2(12) | 1.6733 | 1.6162 | 1.8487 | 1.9326 | 1.3668 | 1.8677 | 1.9036 | 1.9104 | 1.9275 | 1.8696 | 1.7634 | 1.7891 |
| Off-line(2) | 1.9022 | 1.9985 | 2.0157 | 2.0682 | 1.5993 | 1.9697 | 1.974 | 1.9836 | 1.9867 | 1.9155 | 1.9075 | 1.9383 |
| BioCompress(7) | 1.6848 | 1.6172 | 1.848 | 1.9262 | 1.3074 | 1.88 | 1.877 | 1.9066 | 1.9378 | 1.8752 | 1.7614 | 1.7837 |
| GenCompress(5) | 1.673 | 1.6146 | 1.847 | 1.9231 | 1.0969 | 1.8204 | 1.8192 | 1.8466 | 1.9058 | 1.8624 | 1.7614 | 1.7428 |
| CTW+LZ(13) | 1.669 | 1.6129 | 1.8414 | 1.9175 | 1.0972 | 1.8082 | 1.8218 | 1.8433 | 1.9 | 1.8555 | 1.7616 | 1.7389 |
| DNACompress(6) | 1.6716 | 1.6127 | 1.8492 | 1.9116 | 1.0272 | 1.7897 | 1.7951 | 1.8165 | 1.892 | 1.8556 | 1.758 | 1.7254 |
| DnaPack(3) | 1.6602 | 1.6103 | 1.8346 | 1.9088 | 1.039 | 1.7771 | 1.7394 | 1.7886 | 1.8932 | 1.8535 | 1.7583 | 1.7148 |
| CDNA(11) | – | 1.65 | – | 1.93 | 0.95 | 1.77 | 1.67 | 1.72 | 1.87 | 1.85 | 1.81 | – |
| GeMNL(10) | 1.6617 | 1.6101 | 1.842 | 1.9085 | 1.0089 | – | 1.7059 | 1.7639 | 1.8822 | 1.844 | 1.7644 | – |
| Expert Model(4) | 1.6575 | 1.6086 | 1.8404 | 1.9031 | 0.9845 | 1.7524 | 1.6696 | 1.7378 | 1.8783 | 1.8466 | 1.7639 | 1.6946 |
| DNACompact | 0.641 | 0.6331 | 0.6335 | 0.742 | 0.6532 | 0.699 | 0.6946 | 0.6946 | 0.6422 | 0.6038 | 0.5944 | 0.6574 |
4.2 Decoding Algorithm
In the first step, the decoder loads the words that make up the vocabulary into a separate data structure. As the words are kept in sorted order (by frequency) along with the compressed text, the vocabulary is retrieved directly. In the second step, the decoding of codes starts. At each step k = 1, 2, …, the decoder reads a code from the compressed text until it ends with 01 or 10. Let the code read be $C_i$ and its length be $b_i$ bits. Next, the decoder computes the level of $C_i$ and the range of word ranks at that level with the help of (3) and (1), respectively. Then, binary search is applied at level $l$ to retrieve the exact rank $r$ of the word $w_i$ corresponding to the code $C_i$ by inspecting the decimal value of $C_i$. The pseudo-code for the decoding algorithm is shown in Fig. 2.

5. Experimental Results
The encoder and decoder of DNACompact are implemented in C. The experiments were conducted on a machine with 1 GB of RAM and a 2.8 GHz Pentium IV CPU running Fedora Core 2 Linux. Information about the original sequences after the transformation is shown in Table 1. The transformer converts the original sequences into separators and words. The total number of words in a transformed sequence is denoted by N and the number of distinct words by n. Table 1 lists the family of each DNA sequence, the name of the test file and its size in bytes, the total number of words (N), and the number of distinct words (n).
From Table 1 it is evident that words repeat heavily in all test data. This redundancy, created in the transformation phase, makes the sequences amenable to compression.
We applied DNACompact to a standard dataset of DNA sequences for comparison. The DNA corpus consists of DNA sequences (available at http://www.cs.ucr.edu/~stelo/Offline/) belonging to two different organisms: yeast (Saccharomyces cerevisiae) and mouse (Mus musculus). A collection of "historical" sequences is also included for comparison. The dataset contains 26 sequences: two genomes (CHMPXX and CHNTXX), five human genes (HUMHPRTB, HUMHDABCD, HUMHBB, HUMDYSTROP, and HUMGHCSA), two mitochondrial genomes (MPOMTCG and MTPACG), two virus genomes (HEHCMVCG and VACCG), two mouse sequences, and 13 sequences from yeast.
Table 2 compares the compression results (in bits per symbol) of DNACompact with those of other DNA compressors and standard compressors: gzip, bzip2, gzip-4, bzip-4, ac, dna2, Off-line, CTW+LZ, BioCompress-2 (BioC) [7], GenCompress (GenC) [6], DNACompress (DNAC) [32], DNAPack (DNAP) [33], CDNA [37], GeMNL [36], and XM (Expert Model). The results of CDNA on CHMPXX and HEHCMVCG are not available. The results of GeMNL are reported without the HUMHBB sequence. The average compression result of each algorithm is given in the last column of Table 2.
Table 3 shows the compression results (bps) of DNACompact for the rest of the dataset. As results of other compressors for the files in Table 3 are not available in the literature, we show only the performance of our compressor. Tables 2 and 3 show that DNACompact obtains better results than all the other algorithms on the standard dataset. The complete compression time for 26
sequences is nearly 0.5 s and the decoding time is just 0.35 s. The experimental results show that DNACompact obtains a better compression ratio (bps) than the other DNA compressors.

Table 3
Compression Performance for other Data Set

| Filename | Size (bytes) | bps |
|---|---|---|
| AllUp1M | 1001002 | 0.5380 |
| AllUp400k | 399615 | 0.5693 |
| HeldenAll | 112507 | 0.6332 |
| HeldenCGN | 32871 | 0.7245 |
| mm19.chr | 134180 | 0.7083 |
| Mmy.chr | 711108 | 0.5794 |
| SporAll2x.fasta | 444906 | 0.5230 |
| SporAll.fasta | 222453 | 0.5856 |
| SporEarlyI.fasta | 31039 | 0.7258 |
| SporEarlyII.fasta | 25008 | 0.7274 |
| SporMiddle.fasta | 54325 | 0.6976 |
| y14.chr | 784328 | 0.5758 |
| y1.chr | 230203 | 0.6299 |
| y4.chr | 131056 | 0.6551 |
| ymit.chr | 85779 | 0.5141 |
| Average | | 0.6258 |

6. Conclusions
In this paper, a simple statistical compressor for DNA sequences, DNACompact, is presented. The proposed compression algorithm is efficient and useful for DNA compression. It works in two phases and uses statistical properties of the biological sequence. Our algorithm is shown to obtain better results than the DNA compressors it was compared against while maintaining a practical running time.

References
[1] S. Grumbach & F. Tahi, Compression of DNA sequences, DCC, 1993, 340–350.
[2] A. Gupta, V. Rishiwal, & S. Agarwal, Efficient storage of massive biological sequences in compact form, Proc. of 3rd Intl. Conf. Contemporary Computing-Part II, Communications in Computer and Information Science Series, Springer, 95, Noida, India, Aug 9–11, 2010.
[3] N. Kamel, Panel: Data and knowledge bases for genome mapping: What lies ahead?, Proc. Intl. Very Large Databases, 1991.
[4] L. Stern, L. Allison, R.L. Coppel, & T.I. Dix, Discovering patterns in Plasmodium falciparum genomic DNA, Molecular & Biochemical Parasitology, 118, 2001, 175–186.
[5] D.R. Powell, L. Allison, & T.I. Dix, Modelling-alignment for non-random sequences, Advances in Artificial Intelligence, 2004, 203–214.
[6] X. Chen, S. Kwong, & M. Li, A compression algorithm for DNA sequences and its applications in genome comparison, RECOMB, 2000, 107.
[7] S. Grumbach & F. Tahi, A new challenge for compression algorithms: genetic sequences, Information Processing & Management, 30(6), 1994, 875–886.
[8] T. Matsumoto, K. Sadakane, & H. Imai, Biological sequence compression algorithms, Genome Informatics, 11, 2000, 43–52.
[9] D. Adjeroh & F. Nan, On compressibility of protein sequences, DCC, 2006, 422–434.
[10] D.M. Boulton & C.S. Wallace, The information content of a multistate distribution, Journal of Theoretical Biology, 23(2), 1969, 269–278.
[11] J.G. Cleary & I.H. Witten, Data compression using adaptive coding and partial string matching, IEEE Transactions on Communications, COM-32(4), 1984, 396–402.
[12] T.I. Dix, D.R. Powell, L. Allison, S. Jaeger, J. Bernal, & L. Stern, Exploring long DNA sequences by information content, Probabilistic Modeling and Machine Learning in Structural and Systems Biology, Workshop Proc., 2006, 97–102.
[13] T.I. Dix, D.R. Powell, L. Allison, S. Jaeger, J. Bernal, & L. Stern, Comparative analysis of long DNA sequences by per element information content using different contexts, BMC Bioinformatics, 2007.
[14] A. Hategan & I. Tabus, Protein is compressible, NORSIG, 2004, 192–195.
[15] C.G. Nevill-Manning & I.H. Witten, Protein is incompressible, DCC, 1999, 257–266.
[16] E. Rivals, J.-P. Delahaye, M. Dauchet, & O. Delgrange, A guaranteed compression scheme for repetitive DNA sequences, DCC, 1996, 453.
[17] A. Gupta & S. Agarwal, A scheme that facilitates searching and partial decompression of textual documents, International Journal of Advanced Computer Engineering, 1(2), 2008.
[18] M. Li & P. Vitányi, An Introduction to Kolmogorov Complexity and its Applications (Springer Verlag, 1993).
[19] T.C. Bell, J.G. Cleary, & I.H. Witten, Text Compression (Prentice Hall, Englewood Cliffs, NJ, 1990).
[20] I.H. Witten, A. Moffat, & T.C. Bell, Managing Gigabytes: Compressing and Indexing Documents and Images (Morgan Kaufmann, 1999).
[21] A. Gupta & S. Agarwal, Transforming the natural language text for improving compression performance, Lecture Notes in Electrical Engineering, Trends in Intelligent Systems and Computer Engineering (ISCE), Springer, 6, 2008, 637–644.
[22] A. Gupta & S. Agarwal, A novel approach of data compression for dynamic data, Proc. of IEEE 3rd Int. Conf. System of Systems Engineering, June 2–4, 2008, California, USA.
[23] J. Ziv & A. Lempel, A universal algorithm for sequential data compression, IEEE Transactions on Information Theory, 23(3), 1977, 337–342.
[24] J. Ziv & A. Lempel, Compression of individual sequences via variable-rate coding, IEEE Transactions on Information Theory, 24(5), 1978, 530–536.
[25] F. Rubin, Experiments in text file compression, Communications of the ACM, 19(11), 1976, 617–623.
[26] J.G. Wolff, Recoding of natural language for economy of transmission or storage, The Computer Journal, 21(1), 1978, 42–44.
[27] J.A. Storer & T.G. Szymanski, Data compression via textual substitution, Journal of the Association for Computing Machinery, 29(4), 1982, 928–951.
[28] J.G. Cleary & W.J. Teahan, Unbounded length contexts for PPM, The Computer Journal, 40(2/3), 1997, 67–75.
[29] M. Burrows & D.J. Wheeler, A block sorting lossless data compression algorithm, Technical Report, Digital Equipment Corporation, Palo Alto, CA, 1994.
[30] P. Fenwick, The Burrows-Wheeler Transform for block sorting text compression, The Computer Journal, 39(9), 1996, 731–740.
[31] A. Apostolico & S. Lonardi, Compression of biological sequences by greedy off-line textual substitution, DCC, 2000, 143–152.
[32] X. Chen, M. Li, B. Ma, & J. Tromp, DNACompress: fast and effective DNA sequence compression, Bioinformatics, 18(12), 2002, 1696–1698.
[33] B. Behzadi & F.L. Fessant, DNA compression challenge revisited: a dynamic programming approach, CPM, 2005, 190–200.
[34] F.M.J. Willems, Y.M. Shtarkov, & T.J. Tjalkens, The context-tree weighting method: Basic properties, IEEE Transactions on Information Theory, 41(3), 1995, 653–664.
[35] I. Tabus, G. Korodi, & J. Rissanen, DNA sequence compression using the normalized maximum likelihood model for discrete regression, DCC, 2003, 253.
[36] G. Korodi & I. Tabus, An efficient normalized maximum likelihood algorithm for DNA sequence compression, ACM Transactions on Information Systems, 23(1), 2005, 3–34.
[37] D. Loewenstern & P.N. Yianilos, Significantly lower entropy estimates for natural DNA sequences, Journal of Computational Biology, 6(1), 1999, 125–142.
[38] L. Allison, T. Edgoose, & T.I. Dix, Compression of strings with approximate repeats, ISMB, 1998, 8–16.
[39] I.H. Witten, R.M. Neal, & J.G. Cleary, Arithmetic coding for data compression, Communications of the ACM, 30(6), 1987, 520–540.
[40] T.C. Bell, J.G. Cleary, & I.H. Witten, Text Compression (Prentice Hall, 1990).
[41] E.S. de Moura, G. Navarro, N. Ziviani, & R. Baeza-Yates, Fast and flexible word searching on compressed text, ACM Transactions on Information Systems, 18(2), 2000, 113–139.
[42] A. Moffat, Word based text compression, Software Practice and Experience, 19(2), 1989, 185–198.
[43] O. Bat, M. Kimmel, & D.E. Axelrod, Computer simulation of expansions of DNA triplet repeats in the Fragile-X Syndrome and Huntington's disease, Journal of Theoretical Biology, 188, 1997, 53–67.
[44] M.D. Cao, T.I. Dix, L. Allison, & C. Mears, A simple statistical algorithm for biological sequence compression, Data Compression Conference, 2007, 43–52.
[45] D. Loewenstern & P.N. Yianilos, Significantly lower entropy estimates for natural DNA sequences, Journal of Computational Biology, 6(1), 1999, 125–142.
[46] G. Manzini & M. Rastero, A simple and fast DNA compressor, Software: Practice and Experience, 34(14), 2004, 1397–1411.

Biographies
Ashutosh Gupta received his bachelor's degree in Electronics & Communication Engineering in 1998 from the Institution of Engineers (India) and his M.E. degree in Computer Science and Engineering in 2001 from Motilal Nehru Regional Engineering College, Allahabad (India). He is a Ph.D. candidate in Computer Science and Engineering at Motilal Nehru National Institute of Technology and a lecturer at the Institute of Engineering and Rural Technology, Allahabad (India). His current research interests include data compression, information retrieval, and algorithm design. He is an associate member of the Institution of Engineers (India). He is also a member of ICEB and IAENG.

Suneeta Agarwal received her bachelor's degree in Science in 1973 from Allahabad University (India) and her master's degree in Mathematics from the same University in 1975. She received her Ph.D. degree in Operations Research from the Indian Institute of Technology, Kanpur in 1980. She has a teaching experience of around years. She is an assistant Professor at Motilal Nehru National Institute of Technology, Allahabad (India). Her current research interests include data compression, information retrieval, fingerprint recognition, signature verification, and algorithm design. She is a member of IEEE, ISTE, CSE and IAENG.