International Language Character Code with DNA Molecules

Advanced Science and Technology Letters
Vol.81 (CST 2015), pp.161-166
http://dx.doi.org/10.14257/astl.2015.81.33
Wei Wang, Zhengxu Zhao, Qian Xu
School of Information Science and Technology, Shijiazhuang Tiedao University,
Shijiazhuang, Hebei, 050043, China
{wangwei, zhaozx, xuqian}@stdu.edu.cn
Abstract. In 1994, Adleman solved an instance of a combinatorial problem using DNA
as the computational mechanism, demonstrating in principle that DNA computing can
be used to solve computationally complex problems. Over the past 20 years, with the
rapid development of biomolecular computing, researchers have established a series
of theoretical models and succeeded in biochemical experiments. DNA computing has
become an important research direction in computer science and molecular biology.
This paper presents a novel approach in which characters are encoded by
permutations and combinations of the four nitrogenous bases (adenine, guanine,
cytosine and thymine) in DNA molecules. The character encoding supports multiple
languages and assigns each character a unique identifier.
Keywords: DNA Storage, Character Encoding, DNA Computing
1 Introduction
With the rapid development of science and the information industry, especially
multimedia technology, cloud computing and computer networks, computer storage
equipment is expected not only to offer larger capacity, higher data transmission
rates and more reliable storage quality, but also to store data more economically
and safely and to scale in both time and space. The inherent limitations of current
computer storage systems have become apparent, and their lack of momentum for
further development has become one of the bottlenecks in the advancement of
computers. Neither hard disk drives nor optical storage technology can cope with
future storage demand. It is estimated that the storage density of semiconductor,
magnetic disk and CD-ROM media will reach its physical limit [1], so a new
generation of alternative storage technology is urgently needed.
Meanwhile, biomolecular computing, which Adleman [2] first verified experimentally,
has developed rapidly. Over nearly two decades, a variety of theoretical models and
experimental methods have emerged, such as the Adleman model, the Splicing System
model, the Insertion-Deletion System model and the DNA-EC model [3]. DNA storage is
an important branch of biomolecular computing because it offers high storage
density, low hardware cost, parallelizable access, good scalability and
integration, and long-term storage. In the foreseeable future, DNA storage systems
are likely to replace traditional storage systems [4] [5].
ISSN: 2287-1233 ASTL
Copyright © 2015 SERSC
The DNA molecule is a powerful and effective natural information storage medium,
and it has been widely used since 1985, when a DNA molecule was synthesized for the
first time. There are obvious similarities between DNA storage systems and
traditional storage systems: both are sequential storage devices, both use special
symbols to mark the beginning and end of a single information section, and both use
error-correction coding to ensure the integrity of their information. As a result,
DNA molecules can be used as an information storage medium, and DNA storage
technology is built on this medium. The four nitrogenous bases (adenine, guanine,
cytosine and thymine) contained in DNA molecules can be used to encode information.
With existing biochemical methods, it is easy to clone DNA molecules and to modify
the nitrogenous bases encoded in them; these operations are analogous to the read
and write operations of a traditional storage system. Because of its advantages,
such as stable and reliable operation, no wear, huge information capacity, long
life, high quality, low cost per bit and parallelizable access, a DNA storage
system is regarded as a high-density, large-capacity form of storage.
Although the DNA molecule has been proposed as a data storage medium, how to encode
the information to be stored in it has not yet been settled. Character encoding is
one of the most important foundations of a computer system. This paper is an
exploratory study that uses permutations and combinations of the four nitrogenous
bases of the DNA molecule to encode character information. It addresses two major
problems: the selection of the storage medium and the coding rules.
2 Storage Medium
A DNA molecule used as an information storage medium can take many forms. It can be
single-stranded or double-stranded, and it can be a long (linear) chain or a
circular strand; some circular strands with special biological meaning are called
plasmids [6]. Each form has its own advantages and disadvantages as an information
storage medium, so these factors must be considered when choosing the medium in
order to make full use of the strengths of DNA storage and keep operation simple.
The DNA storage system described here uses circular single-stranded DNA molecules
as its storage medium.
Single-stranded and double-stranded DNA each have advantages and disadvantages.
Double-stranded DNA is more stable than single-stranded DNA, which is one of the
most important reasons why most living organisms use double-stranded DNA as their
genetic material, but data stored in double-stranded DNA are difficult to read: the
two paired chains must be unzipped into single strands before reading and cloning.
Single-stranded DNA can be read using the Watson-Crick complement principle, but it
is less stable; it not only breaks more easily than double-stranded DNA but also
readily folds into a self-complementary hairpin structure. We choose
single-stranded DNA because it is easier to read and clone than double-stranded
DNA. In addition, hairpin formation can be avoided by designing the single-stranded
sequences appropriately.
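The Watson-Crick complement principle and the hairpin structure mentioned above can be illustrated with a minimal Python sketch. The function names and the stem and loop thresholds are illustrative assumptions, not parameters from this paper; the check is a naive string search on a strand treated as linear, not a thermodynamic model.

```python
# Watson-Crick pairing: A pairs with T, G pairs with C.
COMPLEMENT = {"A": "T", "T": "A", "G": "C", "C": "G"}


def reverse_complement(strand: str) -> str:
    """Return the sequence that would pair with `strand`, read in the same direction."""
    return "".join(COMPLEMENT[base] for base in reversed(strand))


def has_hairpin(strand: str, stem: int = 4, min_loop: int = 3) -> bool:
    """Naive check: does any `stem`-base window have its reverse complement
    further downstream, separated by at least `min_loop` bases?

    `stem` and `min_loop` are hypothetical thresholds chosen for illustration.
    """
    for i in range(len(strand) - 2 * stem - min_loop + 1):
        window = strand[i:i + stem]
        target = reverse_complement(window)
        # Search only the region that leaves room for the loop.
        if strand.find(target, i + stem + min_loop) != -1:
            return True
    return False
```

A sequence such as GAAGTTTTCTTC is flagged because CTTC is the reverse complement of GAAG, so the two ends could pair and fold the strand back on itself.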
Comparing long-chain (linear) DNA with circular DNA, a single cut by an
endonuclease splits a long chain into two independent segments, whereas a circular
strand remains in one piece and, under certain conditions, can even be rejoined
into a circle. Moreover, a long chain is easily degraded from its ends by certain
exonucleases, and a circular strand is less susceptible to this degradation.
3 Coding Rules
The DNA molecule is composed of four nitrogenous bases, therefore the permutation
and combination of the four nitrogenous bases can be used to encode information
which will be stored in the DNA storage system.
The coding rules are as follows:
3.1 Unique Code
To be compatible with different countries, languages and multi-language
environments, each character must be assigned a unique code. The coding deals with
characters abstractly, as combinations of adenine, guanine, cytosine and thymine
(A, G, C and T for short), and leaves visual presentation, such as font, size,
shape, form and style, to application software such as a web browser or word
processor.
3.2 Permutation and Combination of Nitrogenous Bases
The coding rule is built from permutations and combinations of the four nitrogenous
bases. To cover the characters of as many countries and languages as possible, the
encoding adopts the Unicode range from 0 to 0x10FFFF, a total of 1,114,112 code
points. To give each of these code points a unique code, 11 nitrogenous bases are
needed per code point, since 4^10 = 1,048,576 combinations are too few and
4^11 = 4,194,304 are sufficient. To save storage space, redundant leading bases are
omitted, reading from the high-order end to the low-order end. Adenine (A)
represents '00', guanine (G) '01', cytosine (C) '10' and thymine (T) '11'.
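The rule above can be sketched in Python as follows. This is a minimal illustration under stated assumptions: the names `min_bases`, `encode_code_point` and the `simplified` flag are hypothetical, and "omitting leading bases" is read here as dropping leading A (00) bases. The paper does not say how adjacent variable-length codes are delimited, so the sketch handles one code point at a time.

```python
# Two-bit mapping from this section: A=00, G=01, C=10, T=11.
BASES = "AGCT"
MAX_CODE_POINT = 0x10FFFF


def min_bases(count: int) -> int:
    """Smallest n such that 4**n >= count."""
    n = 1
    while 4 ** n < count:
        n += 1
    return n


# Number of bases needed for all 1,114,112 code points.
WIDTH = min_bases(MAX_CODE_POINT + 1)


def encode_code_point(code_point: int, simplified: bool = True) -> str:
    """Encode one code point as bases, high-order pair of bits first."""
    if not 0 <= code_point <= MAX_CODE_POINT:
        raise ValueError("code point out of range")
    digits = []
    for _ in range(WIDTH):
        digits.append(BASES[code_point & 0b11])  # lowest two bits
        code_point >>= 2
    full = "".join(reversed(digits))
    if not simplified:
        return full
    # Drop leading 'A' (00) bases, keeping one base for code point 0.
    return full.lstrip("A") or "A"
```

With these definitions, `WIDTH` evaluates to 11 and `encode_code_point(0xAF)` yields CCTT.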
Table 1 shows the mapping between code points and nitrogenous base sequences.
Table 1. The mapping table of nitrogenous bases

Unicode     Binary                        Sequence
0           0                             A
0x1         1                             G
0x2         10                            C
0x3         11                            T
0xA         1010                          CC
0xAF        1010 1111                     CCTT
0x10FFFF    1 0000 1111 1111 1111 1111    GAATTTTTTTT

3.3 Latin Letters
Computer systems support the basic Latin letters. ISO 8859-1 defines 256 commonly
used characters, such as digits and uppercase and lowercase Latin letters. The
first 256 positions of the character encoding are therefore reserved for the
characters of ISO 8859-1, in order to improve encoding efficiency and
compatibility.
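A short Python sketch shows why reserving the first 256 positions is efficient: those positions fit in 8 bits, so each character needs at most 4 bases. The helper name `encode_simplified` is an assumption for illustration, and the mapping A=00, G=01, C=10, T=11 is taken from Section 3.2.

```python
BASES = "AGCT"  # A=00, G=01, C=10, T=11, as defined in Section 3.2


def encode_simplified(code_point: int) -> str:
    """Base-4 digits of the code point, high-order first, no leading 'A'."""
    out = ""
    while code_point > 0:
        out = BASES[code_point % 4] + out
        code_point //= 4
    return out or "A"


# Every ISO 8859-1 byte decodes to the Unicode code point with the same value.
latin1_chars = bytes(range(256)).decode("latin-1")
assert all(ord(ch) == i for i, ch in enumerate(latin1_chars))

# Length of the longest code in the reserved range 0..255.
longest = max(len(encode_simplified(ord(ch))) for ch in latin1_chars)
```

Here `longest` is 4, whereas the first position outside the reserved range, 0x100, already needs 5 bases.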
3.4 Multi-Language Environment
To improve efficiency and compatibility across languages, the character encoding
provides an independent zone for each language. The Unicode planes are a good
reference for this design.
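If Unicode planes are taken as the reference, one possible reading is that the zone is visible directly in the base sequence. A plane holds 0x10000 code points (16 bits, or 8 bases), so in the 11-base form the 3 high-order bases identify the plane. The sketch below illustrates this; the function names are hypothetical and the fixed 11-base form is assumed.

```python
BASES = "AGCT"  # A=00, G=01, C=10, T=11 (Section 3.2)
WIDTH = 11      # bases per code point in the unsimplified form


def encode_full(code_point: int) -> str:
    """Fixed-width encoding: 11 bases, high-order pair of bits first."""
    return "".join(
        BASES[(code_point >> shift) & 0b11]
        for shift in range(2 * (WIDTH - 1), -1, -2)
    )


def split_plane(sequence: str) -> tuple:
    """Split an 11-base sequence into (plane bases, in-plane offset bases).

    The low 8 bases carry the 16-bit offset inside the plane, and the
    remaining 3 high-order bases carry the plane number.
    """
    return sequence[:3], sequence[3:]
```

For example, U+4E2D lies in plane 0 and splits into AAA and GATCACTG, while U+1F600 lies in plane 1 and splits into AAG and TTGCAAAA.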
5 Algorithm
The algorithm describes how to encode characters with nitrogenous bases. First, the
text file to be transformed is loaded into memory. Following the order of the
characters in the text, the Unicode code point of each character is obtained one by
one and transcoded to nitrogenous bases according to the coding rules. The result
is output as the DNA sequence to be stored. For example, the Unicode code point of
the character "A" is 0x41 (binary 01000001); the corresponding 11-base sequence is
AAAAAAAGAAG, and the simplified sequence is GAAG. In an encryption round, the DNA
sequence undergoes the add round key, sub bytes, shift rows and mix columns
operations, and the final ciphertext is stored.
Initialization
Import the plaintext file
for each character do
    Get the Unicode code point of the character, C_unicode
    Transcode C_unicode to C_DNA
    Output C_DNA to the stored DNA sequence
end for
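The listing above can be sketched in Python as follows. This is an illustrative reading, not the authors' implementation: the names `transcode` and `encode_text` are assumptions, the plaintext is passed as a string rather than read from a file, and the result is kept as one simplified sequence per character because the paper does not specify how variable-length codes are separated on the strand. The encryption round is not covered.

```python
BASES = "AGCT"  # A=00, G=01, C=10, T=11 (Section 3.2)


def transcode(code_point: int) -> str:
    """Convert a Unicode code point to its simplified base sequence."""
    sequence = ""
    while code_point > 0:
        sequence = BASES[code_point & 0b11] + sequence
        code_point >>= 2
    return sequence or "A"  # code point 0 is the single base A


def encode_text(plaintext: str) -> list:
    """Encode each character in order and collect the sequences."""
    dna = []
    for character in plaintext:
        c_unicode = ord(character)    # get the Unicode code point
        c_dna = transcode(c_unicode)  # transcode it to bases
        dna.append(c_dna)             # output to the stored sequence
    return dna


if __name__ == "__main__":
    # Importing a file is replaced by a literal for this sketch.
    print(encode_text("A"))  # ['GAAG']
```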
6 Verification of Algorithm
A text file containing Latin letters, Chinese characters, Japanese characters,
digits and symbols is imported. The application software (Fig. 1 shows an example)
first obtains the Unicode code point of each character in binary and then,
following the coding rules, transcodes it to nitrogenous bases. By inverting this
operation, the application software can also recover the original text from the DNA
sequence.
Fig. 1. Example of the Character encoding
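The round trip described in this section can be reproduced with a small Python sketch. It is not the application software shown in Fig. 1. Because simplified codes have variable length and the paper does not define a delimiter, the sketch assumes the fixed 11-base form so that character boundaries are known; the function names and the sample text are illustrative.

```python
BASES = "AGCT"  # A=00, G=01, C=10, T=11 (Section 3.2)
WIDTH = 11      # assumed fixed width so that character boundaries are known


def encode(text: str) -> str:
    """Concatenate the 11-base encoding of every character."""
    strand = []
    for ch in text:
        cp = ord(ch)
        strand.append("".join(
            BASES[(cp >> shift) & 0b11]
            for shift in range(2 * (WIDTH - 1), -1, -2)
        ))
    return "".join(strand)


def decode(strand: str) -> str:
    """Inverse operation: read the strand 11 bases at a time."""
    if len(strand) % WIDTH != 0:
        raise ValueError("strand length is not a multiple of 11")
    chars = []
    for i in range(0, len(strand), WIDTH):
        cp = 0
        for base in strand[i:i + WIDTH]:
            cp = (cp << 2) | BASES.index(base)
        chars.append(chr(cp))
    return "".join(chars)


# Mixed Latin, Chinese, Japanese, digits and symbols.
sample = "DNA 存储 ストレージ 2015 ©"
assert decode(encode(sample)) == sample
```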
7 Conclusions
This paper proposes a character encoding for DNA storage systems. The encoding
converts characters to sequences of nitrogenous bases and back, implementing the
encoding and decoding of character information. It is compatible with
multi-language environments, and every character has a unique code.
Acknowledgment. Dr. Yang Guo is gratefully acknowledged for supporting this study.
The Laboratory of Complex Network and Visualization made the publication of this
article possible.
References
1. Wei Dan, "Review of magnetic information storage technology," in Physics, vol. 33(9),
2004, pp. 646-651.
2. Adleman L. M., "Molecular Computation of Solutions to Combinatorial Problems," in Science,
vol. 266(11), 1994, pp. 1021-1023.
3. Zingel T., "Formal models of DNA computing: a survey," in Proc Estonian Acad Sci Phys
Math, vol. 49(2), 2000, pp. 90-99.
4. Dietrich A. and Been W., "Memory and DNA," in J theor Biol, vol. 208, 2001, pp. 145-149.
5. Garzon M. H., Neel A., Chen H., "Efficiency and Reliability of DNA Based Memories," in
GECCO, 2003, pp. 379-389.
6. Robert F. W., Molecular Biology, 2nd ed., Beijing: Science Press, 2003, pp. 642-682.