International Journal of Computers and Applications, Vol. 33, No. 3, 2011

A NOVEL APPROACH FOR COMPRESSING DNA SEQUENCES USING SEMI-STATISTICAL COMPRESSOR

Ashutosh Gupta∗ and Suneeta Agarwal∗∗

genomes, these sequences are very long. The human genome, for example, consists of about three billion symbols spread over 23 pairs of chromosomes. As the number of sequenced genomes is growing very fast, the cost of storing them has to be addressed. The significance of mutual compression for identifying interesting patterns in genomes is documented by Stern et al. [4]. In [5], it is shown that compression is a good measure for establishing a relationship between sequences.

Traditional text compression methods are not suitable for DNA sequences. As DNA sequences consist of only four symbols (A, T, C, G), each symbol can be represented by 2 bits. Yet typical compression tools such as gzip, bzip2, and compress need more than 2 bits per symbol on DNA data. This makes DNA compression a challenge. Algorithms such as GenCompress [6], BioCompress [1], and BioCompress-2 [7] exploit the distinctive properties of DNA and obtain a compression rate of around 1.76 bits per symbol [8]. Many compression schemes have been proposed for DNA sequences [6], [9]–[15]. All of them take advantage of the fact that DNA sequences consist of only four symbols, together with methods that exploit the repetitive nature of DNA [16].

This paper presents a method for the compression of DNA sequences. The compressor has two phases. In the first phase, the original sequence is transformed into a sequence of words. We have observed experimentally that converting the four base symbols (i.e., a sequence of characters) into words gives an excellent compression ratio. In the second phase, the encoding is done with WBTC [17]. The designed algorithm obtains nearly 0.6574 bits per symbol.
The results in Section 5 show that the proposed algorithm performs better than existing compressors on standard DNA sequences.

This paper is organized as follows. In Section 2, we review existing research on DNA compression. The proposed algorithm is outlined in Section 3. In Section 4, we describe the actual encoding and decoding algorithms, followed by experimental results in Section 5. Finally, we conclude our work in Section 6.

Abstract

In this paper, we present an algorithm for DNA sequence compression that uses a replacement method. The replacement method introduces words, and a word-based compression scheme is used for encoding. The encoder uses ranks to assign codes to words. The developed statistical compression algorithm is efficient and useful for DNA sequence compression. We have shown experimentally that the designed algorithm performs better than existing compressors on typical DNA sequence datasets.

Key Words

DNA sequences, DNA compression, word-based tagged code

1. Introduction

Many categories of data need to be compressed for easier storage and communication [1]. Among them are textual documents (plain text, programming languages, etc.), pictures, audio, etc. In this study, we focus on the compression of one specific type of text only, namely biological sequences.

DNA is the physical medium in which all properties of living organisms are encoded. Understanding DNA sequences is a primary concern in molecular biology [2]. Some of the important molecular biology databases are designed to store nucleotide sequences and the amino acid sequences of proteins [2]. The size of these databases is now growing exponentially [3]. The compression of genetic information is therefore a very significant task.

A DNA sequence consists of four types of nucleotides: adenine (abbreviated A), cytosine (C), guanine (G), and thymine (T).
DNA has a double-helix structure in which two opposite strands are held together by hydrogen bonds connecting T with A and G with C. In whole

∗ Institute of Engineering and Technology, M. J. P. Rohilkhand University, Bareilly, UP, India; e-mail: ashutosh333@rediffmail.com
∗∗ Motilal Nehru National Institute of Technology, Allahabad, UP 211004, India; e-mail: [email protected]
Recommended by Dr. L. C. Monticone
(DOI: 10.2316/Journal.202.2011.3.202-3114)

rithm iteratively selects repeated subsequences for encoding so as to achieve maximum compression. A similar approach is given by Chen et al. [6], [32], which exploits approximate repeats. Rivals et al. [16] developed another DNA compressor named Cfact. It builds the suffix tree of the sequence in the first pass, and the encoding is done in the second pass. Many compression schemes use methods similar to GenCompress to encode approximate repeats. These methods differ only in how they encode non-repeat regions and how they identify repeats. The DNAPack algorithm, developed by Behzadi and Fessant [33], uses a dynamic programming approach to locate repeats. Non-repeat regions are coded by the best choice among an order-2 Markov model, context tree weighting, and 2 bits per symbol. Matsumoto et al. [8] developed the CTW+LZ algorithm, which encodes sufficiently long repeats by the substitution method and encodes short repeats and non-repeat regions by context tree weighting [34].

Many DNA compression schemes combine statistical and substitution methods. An inexact repeat is coded as a pointer to a previous occurrence together with the probabilities of characters being copied, changed, inserted, or deleted. The NML and GeMNL algorithms are given by Tabus et al. [35] and Korodi and Tabus [36]. In these methods, the DNA sequence is divided into fixed-size blocks. Each block is coded by a method that searches the past for a regressor.
The bit mask is encoded using a probability distribution estimated by the NML model from the similarity between the regressor and the block. CDNA and ARM were developed by Loewenstern and Yianilos [37] and Allison et al. [38], respectively. These compressors are based on statistical compression methods. In CDNA, the probability distribution of each character is obtained from approximate partial matches in the past. Each approximate match is with a preceding substring that has a small Hamming distance to the context preceding the symbol to be encoded. These predictions are combined using a set of weights. The ARM algorithm computes the probability of a substring by summing the probabilities over all explanations.

In this paper, a statistical algorithm named DNACompact is presented. The algorithm works in two phases. In the first phase, the original sequences are transformed, and in the second phase the actual encoding is done.

2. Related Work

A DNA sequence is represented by four symbols (A, C, G, T). If the sequence were totally random (i.e., incompressible [18]), 2 bits would be required to encode each symbol. DNA sequences are, however, known to carry significant information across different generations of organisms. Furthermore, from the perspective of compression and sequence understanding, the repetitions inherent in DNA sequences imply redundancies that offer an avenue for significant compression. Identifying these dependencies is the basis for compressing DNA sequences.

2.1 Standard Text Compression

Text compression methods fall into three categories, namely substitution-, dictionary-, and context-based methods [19], [20]. In the first method [21], every symbol is substituted with a new code such that symbols that occur more often are given shorter codes, thereby achieving a large compression.
In the second method, a vocabulary of frequently occurring symbols or groups of symbols is constructed from the input string. Compression is achieved by replacing the occurrences of these symbols with pointers to their positions in the vocabulary. This dictionary (or vocabulary) may be constructed off-line [22] or on-line. In on-line dictionary methods [17], the text itself is used as the vocabulary, and symbols that have already been seen in the sequence are replaced with pointers to the positions of their previous occurrences. Examples of on-line dictionary methods are the compressors of the LZ family [23], [24]. In off-line dictionary methods, the input sequence is compressed in two passes. In the first pass, the compressor builds the dictionary by identifying the repeated sequences, and the second pass encodes the repeated symbols with pointers into the dictionary. Rubin [25] and Wolff [26] discuss a variety of issues in dictionary-based methods. Storer and Szymanski [27] give a general framework for describing different substitution-based methods.

Finally, the third method makes use of the fact that the probability of a symbol may be affected by its neighbouring symbols. The contexts are normally expressed in terms of symbol neighbourhoods in the input sequence. Prediction by partial match (PPM) [28] makes use of contexts of different sizes. In this scheme, symbols are predicted from the preceding occurrences of their current context, and the best-matching contexts are then selected [29], [30].

2.2 Biological Sequence Compression

3. Overview of DNACompact

As already pointed out in Section 1, DNA sequences consist of four symbols, namely A, C, G, and T. DNACompact works in two phases. The transformation phase converts the original DNA string into a string over three symbols, namely A, C, and space. For simplicity, the space symbol is written as ψ throughout the paper.
The transformation process is described in the next section. Let the transformed sequence be S1. Next, in the encoding phase, the Word-Based Tagged Code (WBTC) method [17] is used to encode the transformed sequence S1. We developed WBTC for text compression. In the subsequent sections, these two phases of DNACompact are described.

The first DNA compression scheme was given by Grumbach and Tahi [1]. This algorithm, named BioCompress, detects exact repeats in DNA using an automaton and uses Fibonacci coding to encode the length and position of the previous occurrence. If a substring is not a repeat, it is encoded with 2 bits per symbol. The improved version of BioCompress is named BioCompress-2 [7]. The Off-line method of DNA compression was developed by Apostolico and Lonardi [31], in which the algo-

3.2.1 WBTC Method

The WBTC method is given by Gupta and Agarwal [2], [17]. One property of this code is that every codeword ends with the 2-bit pattern 01 or 10; these patterns act as the end-of-code marker. In WBTC, the source text is read and the vocabulary statistics of the sequence are gathered. Here, the transformed sequence S1 is the source text. After completion of the first phase, the vocabulary is generated and sorted by decreasing frequency. The codewords in WBTC are built from the 2-bit patterns 00, 11, 01, and 10. The code assignment procedure is given below:

1. At level l = 1, the first 2^l words of the vocabulary, with ranks 0 to 2^0, are assigned the codes 01 and 10, respectively.
2. At the next level, l = 2, the 2^l words at positions 2^1 + 0 to 2^2 + 2^0 are coded with 4 bits by prefixing 00 and 11 to all the codes of the preceding level.
3. In general, at level l, the next 2^l words, at positions 2^l − 2 to 2^(l+1) − 3, are encoded using 2l bits by prefixing 00 and 11 to all the codes generated at the previous level.
4.
Steps 1–3 are repeated until all the N words in the transformed sequence S1 are coded. An important feature of WBTC is that the code of a word depends only on its rank. Hence, neither the frequencies nor the codewords need to be stored along with the compressed stream.

3.1 First Phase (Transformation)

In the first phase, DNACompact converts the original DNA sequence into a sequence over three symbols: A, C, and space. The space symbol is written as ψ for simplicity. The following example demonstrates the working of the transformer. Let the original sequence be

T = AATTCCGGGATACGACATA

The length of sequence T is 19 characters. The transformer replaces every occurrence of T by ψA (i.e., a space followed by the symbol A) and every occurrence of G by ψC (i.e., a space followed by the symbol C). Let the transformed sequence be S1. The transformed symbols are indicated in bold. The sequence S1 is:

AAψAψACCψCψCψCAψAACψCACAψAA

The size of S1 is 27 characters. Note that S1 is longer than T because of the extra space (ψ) symbols. This transformation is necessary because we want to convert the original sequence of four symbols into a sequence of words. As spaces are treated as word separators, it is easy to distinguish between two consecutive words. Since base symbols repeat frequently in a DNA sequence, many space symbols are introduced. Next, encoding is performed on the transformed sequence S1.

3.2.2 Some Useful Results

The observations presented in this section are useful in the encoding and decoding process.

Result 1 (Relation between Level (l) and Range of Ranks of Words (R)): Let l be the level and R_l be the range of ranks of the words in level l. Then the range R_l is given by (1).

3.2 Encoding Scheme

The transformed sequence S1 consists of words separated by the space symbol. Because the four base symbols repeat frequently in a DNA sequence, many of the symbols in S1 are spaces.
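As an illustration of the transformation in Section 3.1, the following Python sketch maps T to ψA and G to ψC and splits the result into words. It is not the authors' C implementation: the function names `transform`, `inverse_transform`, and `split_words` are hypothetical, and a plain ASCII space stands in for ψ.

```python
PSI = " "  # stand-in for the separator symbol written as ψ in the paper


def transform(seq: str) -> str:
    """Map a DNA string over {A, C, G, T} to a string over {A, C, PSI}."""
    table = {"A": "A", "C": "C", "T": PSI + "A", "G": PSI + "C"}
    return "".join(table[base] for base in seq)


def inverse_transform(s1: str) -> str:
    """Undo transform(): every PSI is followed by exactly one A or C."""
    return s1.replace(PSI + "A", "T").replace(PSI + "C", "G")


def split_words(s1: str) -> list:
    """Split the transformed sequence into words at each separator."""
    return s1.split(PSI)


if __name__ == "__main__":
    t = "AATTCCGGGATACGACATA"
    s1 = transform(t)
    print(s1.replace(PSI, "ψ"))
    print(split_words(s1))
```

Applied to the example sequence T, `transform` yields the 27-character sequence S1 given in Section 3.1 (with spaces in place of ψ), and `inverse_transform` recovers T.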
This calls for a model of the transformed text, as the text consists not only of words but also of spaces. Witten [39] and Bell et al. [40] use two separate alphabets: one for words and one for separators. Since a strict alternating property holds, there is no ambiguity about which alphabet to use once it is known whether the text starts with a separator or a word. de Moura et al. [41] introduce the spaceless model for words. If a word is followed by a single space, the encoder codes only the word; otherwise, the word is coded, followed by the code of the separator. During decoding, a word is decoded as usual and the decoder assumes that a space follows it. We have used the spaceless model for the transformed sequence S1. The sequence S1 so generated consists of words, each followed by a space. Information about the words present in the transformed sequence S1 is given in Section 5. Next, the Word-Based Tagged Code encoding method is explained.

Figure 2. Decoding algorithm.

The range of ranks R_l at level l (Result 1) is

\[ \sum_{j=1}^{l-1} 2^{j} \le R_l \le \sum_{j=1}^{l} 2^{j} - 1 \tag{1} \]

Result 2 (Relationship between Level (l) and Rank (r) of Word): For a word of given rank r, its level l is expressed in terms of r as:

\[ l = \lceil \log_2 (r + 3) \rceil - 1 \tag{2} \]

Result 3 (Estimation of Maximum Codeword Length for a given Source Message): Let r be the highest rank of a word. Then the maximum codeword length b_max is expressed in terms of r as:

\[ b_{\max} = 2 \times \left( \lceil \log_2 (r + 3) \rceil - 1 \right) \tag{3} \]

Result 4 (Estimation of Total Number of Bits (Loose Upper Bound) up to Level l): Let l be the level. Then the total number of bits (B) used for the text up to the highest level is given by:

\[ B = 2 \sum_{i=1}^{l} \sum_{j=0}^{N} p_j \cdot i \cdot 2^{i}, \qquad i \neq j,\ i \ge 1,\ j \ge 0 \tag{4} \]

where N = 2^(l+1) − 2 and l > 0.

Result 5 (Estimation of Total Number of Bits up to Rank (r) of Word): If the nth word is the last word, having rank r, then the total number of bits up to rank r (B_r) = total number of bits up to level (l − 1) + number of bits up to rank r at level l.
\[ B_r = 2 \sum_{i=1}^{l-1} \sum_{j=0}^{2^{l}-2} p_j \cdot i \cdot 2^{i} + \sum_{k=R_1}^{r} p_k \left[ 2 \left( \lceil \log_2 (r + 3) \rceil - 1 \right) \times (r - R_1 + 1) \right] \tag{5} \]

where

\[ R_1 = \sum_{j=1}^{l-1} 2^{j} \tag{6} \]

and i ≠ j, i ≥ 1, j ≥ 0. The above results are quite useful in the encoding and decoding process.

Table 1
Details of Test Data

Family      Filename            Size (bytes)        N      n
Historical  CHMPXX                    121024    60819    667
Historical  CHNTXX                    155939    78070    731
Historical  HEHCMVCG                  229354   114968   1013
Historical  HUMDYSTROP                 38770    19608    374
Historical  HUMGHCSA                   66495    32913    406
Historical  HUMHBB                     73308    37094    535
Historical  HUMHDABCD                  58864    30596    439
Historical  HUMHPRTB                   56737    29767    425
Historical  MPOMTCG                   186609    94189    906
Historical  MTPACG                    100314    51083    442
Historical  VACCG                     191737    95806    667
Mouse       mm19.chr                  134180    64222    956
Mouse       mmy.chr                   711108   363119   1855
Yeast       SporAll2x.fasta           444906   206054    913
Yeast       SporAll.fasta             222453   103027    913
Yeast       SporEarlyI.fasta           31039    14524    328
Yeast       SporEarlyII.fasta          25008    11575    286
Yeast       SporMiddle.fasta           54325    24933    469
Yeast       Y14.chr                   784328   391117   1923
Yeast       Y1.chr                    230203   115733   1010
Yeast       Y4.chr                    131056    65897    708
Yeast       Ymit.chr                   85779    42747    293
Yeast       AllUp1M                  1001002   476102   1977
Yeast       AllUp400k                 399615   189780   1228
Yeast       HeldenAll                 112507    51273    670
Yeast       HeldenCGN                  32871    14995    345

Figure 1. Encoding algorithm.

4. Encoding and Decoding Algorithm

4.1 Encoding Algorithm

In this algorithm, the codes of all the words are computed and stored in a separate data structure named CodeVector. Its pseudo-code is presented in Fig. 1. The encoder reads a word and assigns the corresponding code from the CodeVector. Repeating this until S1 is exhausted yields the compressed file (C1). The vocabulary is kept along with C1 so that it can be used in decompression. Another interesting point is that there is no requirement
Table 2
Comparison of DNA Compression

Sequence             CHMPXX    CHNTXX  HEHCMVCG  HUMDYSTR  HUMGHCSA    HUMHBB   HUMHDAB  HUMHPRTB   MPOMTCG    MTPACG     VACCG   Average
Length               121024    155844    229354     33770     66495     73308     58864     56737    186609    100314    191737         –
gzip                 2.2818    2.3345    2.3275    2.3618    2.0648     2.245    2.2389    2.2662    2.3288    2.2919    2.2518    2.2721
bzip                 2.1218    2.1845    2.1685    2.1802    1.7289    2.1481    2.0678    2.0944    2.1701    2.1225    2.0949    2.0983
ac-o2                1.8364    1.9333    1.9647    1.9235    1.9377    1.9176    1.9422    1.9283    1.9654    1.8723     1.904    1.9205
ac-o3                1.8425    1.9399    1.9619    1.9446    1.9416    1.9305    1.9466    1.9352    1.9689    1.8761    1.9064    1.9267
gzip-4               1.8635    1.9519    1.9817    1.9473    1.7372    1.8963    1.9141    1.9207    1.9727    1.8827    1.8741    1.9038
bzip-4               1.9667     2.009    2.0091    2.0678    1.8697    1.9957    1.9921    2.0045    2.0117    1.9847     1.952    1.9875
dna2(12)             1.6733    1.6162    1.8487    1.9326    1.3668    1.8677    1.9036    1.9104    1.9275    1.8696    1.7634    1.7891
Off-line(2)          1.9022    1.9985    2.0157    2.0682    1.5993    1.9697     1.974    1.9836    1.9867    1.9155    1.9075    1.9383
BioCompress(7)       1.6848    1.6172     1.848    1.9262    1.3074      1.88     1.877    1.9066    1.9378    1.8752    1.7614    1.7837
GenCompress(5)        1.673    1.6146     1.847    1.9231    1.0969    1.8204    1.8192    1.8466    1.9058    1.8624    1.7614    1.7428
CTW+LZ(13)            1.669    1.6129    1.8414    1.9175    1.0972    1.8082    1.8218    1.8433       1.9    1.8555    1.7616    1.7389
DNACompress(6)       1.6716    1.6127    1.8492    1.9116    1.0272    1.7897    1.7951    1.8165     1.892    1.8556     1.758    1.7254
DnaPack(3)           1.6602    1.6103    1.8346    1.9088     1.039    1.7771    1.7394    1.7886    1.8932    1.8535    1.7583    1.7148
CDNA(11)                  –      1.65         –      1.93      0.95      1.77      1.67      1.72      1.87      1.85      1.81         –
GeMNL(10)            1.6617    1.6101     1.842    1.9085    1.0089         –    1.7059    1.7639    1.8822     1.844    1.7644         –
Expert Model(4)      1.6575    1.6086    1.8404    1.9031    0.9845    1.7524    1.6696    1.7378    1.8783    1.8466    1.7639    1.6946
DNACompact            0.641    0.6331    0.6335     0.742    0.6532     0.699    0.6946    0.6946    0.6422    0.6038    0.5944    0.6574

to store either the frequencies or the codewords in C1. It is sufficient to store the plain words sorted by frequency. This keeps the vocabulary very small.
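The rank-based code assignment of Section 3.2.1 and the encoding loop described above can be sketched in Python as follows. This is an illustrative sketch, not the authors' CodeVector implementation. The function names are hypothetical, and codewords are kept as strings of '0' and '1' rather than packed bits. The ordering of codes within a level (all 00-prefixed codes before the 11-prefixed ones) is an assumption, as is breaking frequency ties by first occurrence. The decoder computes the rank directly from the code instead of using the binary search of Section 4.2.

```python
from collections import Counter


def wbtc_code(rank: int) -> str:
    """Return the codeword (as a bit string) for a 0-based rank."""
    level = (rank + 2).bit_length() - 1   # equals ceil(log2(rank + 3)) - 1
    offset = rank - (2 ** level - 2)      # position of the rank inside its level
    bits = format(offset, "0{}b".format(level))
    # Every bit but the last becomes a 00/11 prefix pair (assumed ordering).
    prefix = "".join("11" if b == "1" else "00" for b in bits[:-1])
    # The last bit selects the terminating pair 01 or 10.
    return prefix + ("10" if bits[-1] == "1" else "01")


def build_vocabulary(words):
    """Distinct words sorted by decreasing frequency (ties: first seen)."""
    return [w for w, _ in Counter(words).most_common()]


def encode(words):
    """Return (vocabulary, bit string) for a list of words."""
    vocab = build_vocabulary(words)
    rank = {w: r for r, w in enumerate(vocab)}
    return vocab, "".join(wbtc_code(rank[w]) for w in words)


def decode(vocab, bitstring):
    """Read 2-bit pairs until a pair 01 or 10 closes each codeword."""
    out, level, offset = [], 0, 0
    for i in range(0, len(bitstring), 2):
        pair = bitstring[i:i + 2]
        level += 1
        offset = offset * 2 + (1 if pair in ("11", "10") else 0)
        if pair in ("01", "10"):          # end-of-code marker reached
            out.append(vocab[2 ** level - 2 + offset])
            level, offset = 0, 0
    return out


if __name__ == "__main__":
    words = ["AA", "A", "ACC", "C", "C", "CA", "AAC", "CACA", "AA"]
    vocab, bits = encode(words)
    print(vocab, bits)
    print(decode(vocab, bits) == words)
```

Because a codeword depends only on the rank, the decoder in this sketch needs just the frequency-sorted word list and the bit stream, in line with the observation that neither frequencies nor codewords have to be stored.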
From Table 1 it is clear that words are repeated very often in all the test data. This redundancy, created in the transformation phase, makes the data amenable to compression.

We applied DNACompact to a standard dataset of DNA sequences for comparison. The DNA corpus consists of DNA sequences (available at http://www.cs.ucr.edu/~stelo/Offline/) belonging to two different organisms: yeast (Saccharomyces cerevisiae) and mouse (Mus musculus). A collection of "historical" sequences is also included for comparison. The dataset contains 26 sequences: two chloroplast genomes (CHMPXX and CHNTXX), five human genes (HUMHPRTB, HUMHDABCD, HUMHBB, HUMDYSTROP, and HUMGHCSA), two mitochondrial genomes (MPOMTCG and MTPACG), two virus genomes (HEHCMVCG and VACCG), two mouse sequences, and 13 sequences from yeast.

Table 2 compares the compression results (in bits per symbol) of DNACompact with those of other DNA compressors and standard compressors: gzip, bzip2, gzip-4, bzip-4, ac, dna2, Off-line, CTW+LZ, BioCompress-2 (BioC) [7], GenCompress (GenC) [6], DNACompress (DNAC) [32], DNAPack (DNAP) [33], CDNA [37], GeMNL [36], and XM. The results of CDNA on CHMPXX and HEHCMVCG are not available. The results of GeMNL are reported without the HUMHBB sequence. The average compression result of each algorithm is given in the Average entry of Table 2. Table 3 shows the compression results (bps) of DNACompact for the rest of the dataset. As the results of other compressors on the files in Table 3 are not given in the literature, we only show the performance of our compressor. It is clearly seen from Tables 2 and 3 that DNACompact obtains better results than all other algorithms on the standard dataset. The complete compression time for 26

4.2 Decoding Algorithm

In the first step, the decoder loads the words that make up the vocabulary into a separate data structure.
As the words are kept sorted (by frequency) along with the compressed text, the vocabulary is easily retrieved. In the second step, the decoding of the codes can begin. At any step k, k = 1, 2, . . ., the decoder reads bits from the compressed text until the code ends with 01 or 10. Let the code read be Ci and its length be bi bits. Next, the decoder computes the level of Ci and the range of words with the help of (3) and (1), respectively. Then, binary search is applied at level l to retrieve the exact rank r of the word wi corresponding to the code Ci by inspecting the decimal value of Ci. The pseudo-code of the decoding algorithm is shown in Fig. 2.

5. Experimental Results

The encoder and decoder of DNACompact are implemented in C. The experiments were conducted on a machine with 1 GB of RAM and a Pentium IV 2.8 GHz CPU, running Fedora Core 2 Linux. Information about the original sequences after the transformation is shown in Table 1. The transformer converts the original sequences into separators and words. The total number of words in the transformed sequence is denoted by N and the number of distinct words by n. The first column gives the family of the DNA sequences. The second column of Table 1 gives the names of the test files and their sizes in bytes. The total number of words (N) and the number of distinct words (n) are shown in columns three and four.

Table 3
Compression Performance for other Data Set

Filename           Size (bytes)     bps
AllUp1M                 1001002  0.5380
AllUp400k                399615  0.5693
HeldenAll                112507  0.6332
HeldenCGN                 32871  0.7245
mm19.chr                 134180  0.7083
Mmy.chr                  711108  0.5794
SporAll2x.fasta          444906  0.5230
SporAll.fasta            222453  0.5856
SporEarlyI.fasta          31039  0.7258
SporEarlyII.fasta         25008  0.7274
SporMiddle.fasta          54325  0.6976
y14.chr                  784328  0.5758
y1.chr                   230203  0.6299
y4.chr                   131056  0.6551
ymit.chr                  85779  0.5141
Average                          0.6258

[6] X. Chen, S. Kwong, & M.
Li, A compression algorithm for DNA sequences and its applications in genome comparison, RECOMB, 2000, 107.
[7] S. Grumbach & F. Tahi, A new challenge for compression algorithms: genetic sequences, Information Processing & Management, 30(6), 1994, 875–886.
[8] T. Matsumoto, K. Sadakane, & H. Imai, Biological sequence compression algorithms, Genome Informatics, 11, 2000, 43–52.
[9] D. Adjeroh & F. Nan, On compressibility of protein sequences, DCC, 2006, 422–434.
[10] D.M. Boulton & C.S. Wallace, The information content of a multistate distribution, Journal of Theoretical Biology, 23(2), 1969, 269–278.
[11] J.G. Cleary & I.H. Witten, Data compression using adaptive coding and partial string matching, IEEE Transactions on Communications, COM-32(4), 1984, 396–402.
[12] T.I. Dix, D.R. Powell, L. Allison, S. Jaeger, J. Bernal, & L. Stern, Exploring long DNA sequences by information content, Probabilistic Modeling and Machine Learning in Structural and Systems Biology, Workshop Proc., 2006, 97–102.
[13] T.I. Dix, D.R. Powell, L. Allison, S. Jaeger, J. Bernal, & L. Stern, Comparative analysis of long DNA sequences by per element information content using different contexts, BMC Bioinformatics, 2007.
[14] A. Hategan & I. Tabus, Protein is compressible, NORSIG, 2004, 192–195.
[15] C.G. Nevill-Manning & I.H. Witten, Protein is incompressible, DCC, 1999, 257–266.
[16] E. Rivals, J.-P. Delahaye, M. Dauchet, & O. Delgrange, A guaranteed compression scheme for repetitive DNA sequences, DCC, 1996, 453.
[17] A. Gupta & S. Agarwal, A scheme that facilitates searching and partial decompression of textual documents, International Journal of Advanced Computer Engineering, 1(2), 2008.
[18] M. Li & P. Vitányi, An Introduction to Kolmogorov Complexity and Its Applications (Springer Verlag, 1993).
[19] T.C. Bell, J.G. Cleary, & I.H. Witten, Text Compression (Prentice Hall, Englewood Cliffs, NJ, 1990).
[20] I.H. Witten, A. Moffat, & T.C.
Bell, Managing Gigabytes: Compressing and Indexing Documents and Images (Morgan Kaufmann, 1999).
[21] A. Gupta & S. Agarwal, Transforming the natural language text for improving compression performance, Lecture Notes in Electrical Engineering, Trends in Intelligent Systems and Computer Engineering (ISCE), Springer, 6, 2008, 637–644.
[22] A. Gupta & S. Agarwal, A novel approach of data compression for dynamic data, Proc. of IEEE 3rd Int. Conf. System of Systems Engineering, June 2–4, 2008, California, USA.
[23] J. Ziv & A. Lempel, A universal algorithm for sequential data compression, IEEE Transactions on Information Theory, 23(3), 1977, 337–343.
[24] J. Ziv & A. Lempel, Compression of individual sequences via variable-rate coding, IEEE Transactions on Information Theory, 24(5), 1978, 530–536.
[25] F. Rubin, Experiments in text file compression, Communications of the ACM, 19(11), 1976, 617–623.
[26] J.G. Wolff, Recoding of natural language for economy of transmission or storage, The Computer Journal, 21(1), 1978, 42–44.
[27] J.A. Storer & T.G. Szymanski, Data compression via textual substitution, Journal of the Association for Computing Machinery, 29(4), 1982, 928–951.
[28] J.G. Cleary & W.J. Teahan, Unbounded length contexts for PPM, The Computer Journal, 40(2/3), 1997, 67–75.
[29] M. Burrows & D.J. Wheeler, A block sorting lossless data compression algorithm, Technical Report, Digital Equipment Corporation, Palo Alto, CA, 1994.
[30] P. Fenwick, The Burrows-Wheeler Transform for block sorting text compression, The Computer Journal, 39(9), 1996, 731–740.
[31] A. Apostolico & S. Lonardi, Compression of biological sequences by greedy off-line textual substitution, DCC, 2000, 143–152.
[32] X. Chen, M. Li, B. Ma, & J. Tromp, DNACompress: fast and effective DNA sequence compression, Bioinformatics, 18(12), 2002, 1696–1698.

Average (Table 3): 0.6258

sequences is nearly 0.5 s, and the decoding time is just 0.35 s.
The experimental results show that DNACompact obtains a better compression ratio (bps) than other DNA compressors.

6. Conclusions

In this paper, a simple statistical compressor for DNA sequences, DNACompact, is presented. The proposed compression algorithm is efficient and useful for DNA compression. It works in two phases and uses statistical properties of the biological sequence. Our algorithm is shown to obtain better results than all existing DNA compressors while maintaining a practical running time.

Suneeta Agarwal received her bachelor's degree in Science in 1973 from Allahabad University (India) and her master's degree in Mathematics from the same university in 1975. She received her Ph.D. degree in Operations Research from the Indian Institute of Technology, Kanpur, in 1980. She has a teaching experience of around years. She is an assistant professor at Motilal Nehru National Institute of Technology, Allahabad (India). Her current research interests include data compression, information retrieval, fingerprint recognition, signature verification, and algorithm design. She is a member of IEEE, ISTE, CSE, and IAENG.

References

[1] S. Grumbach & F. Tahi, Compression of DNA sequences, DCC, 1993, 340–350.
[2] A. Gupta, V. Rishiwal, & S. Agarwal, Efficient storage of massive biological sequences in compact form, Proc. of 3rd Intl. Conf. Contemporary Computing-Part II, Communications in Computer and Information Science Series, Springer, Noida, India, 95, Aug 9–11, 2010.
[3] N. Kamel, Panel: Data and knowledge bases for genome mapping: What lies ahead?, Proc. Intl. Very Large Databases, 1991.
[4] L. Stern, L. Allison, R.L. Coppel, & T.I. Dix, Discovering patterns in Plasmodium falciparum genomic DNA, Molecular & Biochemical Parasitology, 118, 2001, 175–186.
[5] D.R. Powell, L. Allison, & T.I. Dix, Modelling-alignment for non-random sequences, Advances in Artificial Intelligence, 2004, 203–214.
[33] B. Behzadi & F.L.
Fessant, DNA compression challenge revisited: a dynamic programming approach, CPM, 2005, 190–200.
[34] F.M.J. Willems, Y.M. Shtarkov, & T.J. Tjalkens, The context-tree weighting method: Basic properties, IEEE Transactions on Information Theory, 1995, 653–664.
[35] I. Tabus, G. Korodi, & J. Rissanen, DNA sequence compression using the normalized maximum likelihood model for discrete regression, DCC, 2003, 253.
[36] G. Korodi & I. Tabus, An efficient normalized maximum likelihood algorithm for DNA sequence compression, ACM Transactions on Information Systems, 23(1), 2005, 3–34.
[37] D. Loewenstern & P.N. Yianilos, Significantly lower entropy estimates for natural DNA sequences, Journal of Computational Biology, 6(1), 1999, 125–142.
[38] L. Allison, T. Edgoose, & T.I. Dix, Compression of strings with approximate repeats, ISMB, 1998, 8–16.
[39] I.H. Witten, R.M. Neal, & J.G. Cleary, Arithmetic coding for data compression, Communications of the ACM, 30(6), 1987, 520–540.
[40] T.C. Bell, J.G. Cleary, & I.H. Witten, Text Compression (Prentice Hall, 1990).
[41] E.S. de Moura, G. Navarro, N. Ziviani, & R. Baeza-Yates, Fast and flexible word searching on compressed text, ACM Transactions on Information Systems, 18(2), 2000, 113–139.
[42] A. Moffat, Word based text compression, Software: Practice and Experience, 19(2), 1989, 185–198.
[43] O. Bat, M. Kimmel, & D.E. Axelrod, Computer simulation of expansions of DNA triplet repeats in the Fragile-X Syndrome and Huntington's disease, Journal of Theoretical Biology, 188, 1997, 53–67.
[44] M.D. Cao, T.I. Dix, L. Allison, & C. Mears, A simple statistical algorithm for biological sequence compression, Data Compression Conference, 2007, 43–52.
[45] D. Loewenstern & P.N. Yianilos, Significantly lower entropy estimates for natural DNA sequences, Journal of Computational Biology, 6(1), 1999, 125–142.
[46] G. Manzini & M. Rastero, A simple and fast DNA compressor, Software: Practice and Experience, 34(14), 2004, 1397–1411.
Biographies

Ashutosh Gupta received his bachelor's degree in Electronics & Communication Engineering in 1998 from the Institution of Engineers (India) and his M.E. degree in Computer Science and Engineering in 2001 from Motilal Nehru Regional Engineering College, Allahabad (India). He is a Ph.D. candidate in Computer Science and Engineering at Motilal Nehru National Institute of Technology and a lecturer at the Institute of Engineering and Rural Technology, Allahabad (India). His current research interests include data compression, information retrieval, and algorithm design. He is an associate member of the Institution of Engineers (India). He is also a member of ICEB and IAENG.