Multi-Language Character Encoding Technique for DNA Storage

Wei Wang*1, Zhengxu Zhao2, Wei Zhang3
1,2 School of Information Science and Technology, Shijiazhuang Tiedao University, Shijiazhuang, China
3 Beijing Aerospace Control Center, Beijing, China
*1 [email protected]; 2zhaozx@stdu.edu.cn; [email protected]
Abstract
In 1994, Dr. Adleman solved a problem using DNA as the
computational mechanism, proving the principle that
DNA computing can be used to solve computationally
complex problems. Over the past 20 years, with the rapid
development of biological molecular computers, scientists
have established a series of theoretical models and
succeeded in biochemical experiments. DNA computing has
become an important research direction in computer science
and molecular biology. This research presents a novel
approach in which characters are encoded by permutations
and combinations of the four nitrogenous bases (adenine,
guanine, cytosine and thymine) in DNA molecules. The
character encoding supports multiple languages and gives
each character a unique identifier.
Keywords
DNA Storage; Character Encoding; DNA Computing
Introduction
With the rapid development of science and the information
industry, especially multimedia technology, cloud
computing and computer networks, computer storage
equipment is expected not only to offer larger capacity,
higher data transmission rates and more reliable storage
quality, but also to store data more economically and
safely and to scale in both time and space. The inherent
defects of current computer storage systems have been
revealed, and their lack of momentum for further
development has become one of the bottlenecks in the
advancement of computers. Neither hard disk drives (HDD)
nor optical storage technology can cope with future
storage demand. It is estimated that the storage density
of semiconductor, disk and CD-ROM media will reach its
physical limit in the future, so a new generation of
alternative storage technology urgently needs to be
developed.
On the other hand, biological molecular computing, for
which Adleman completed the first experimental
verification, has developed rapidly. Over nearly two
decades, a variety of theoretical models and
experimental methods have emerged, such as the
Adleman model, the Splicing System model, the
Insertion-Deletion System model and the DNA-EC model.
DNA storage is an important branch of biological
molecular computing because it offers high storage
density, low hardware cost, parallelizable access, good
scalability and integration, and long-term storage. In
the foreseeable future, DNA storage systems are likely
to replace traditional storage systems.
The DNA molecule is a powerful and effective natural
information storage medium, and it has been widely used
since 1985, when a DNA molecule was synthesized for
the first time. There are obvious similarities between
DNA storage systems and traditional storage systems:
both are sequential storage devices, both use special
symbols to indicate the beginning and end of a single
information section, and both use error correction
coding to ensure the integrity of their information. As
a result, DNA molecules can be used as a medium in
which information is stored. DNA storage technology is
based on the DNA molecule as the storage medium. The
four nitrogenous bases (adenine, guanine, cytosine and
thymine) contained in the DNA molecule can be used to
encode information. With existing biochemical
experimental methods, it is easy to clone DNA molecules
and to modify the nitrogenous bases encoded in them;
these operations are similar to the read and write
operations of a traditional storage system. Because of
its advantages, such as stable and reliable operation,
no wear, huge information capacity, long life, high
quality, low cost per bit and parallelizable access, the
DNA storage system is seen as a high-density,
large-capacity form of storage.
International Journal of Automation and Control Engineering, Vol. 4, No. 1, April 2015
2325-7407/15/01 019-3 © 2015 DEStech Publications, Inc.
doi:10.12783/ijace.2015.0401.05
Although the DNA molecule has been proposed as a data
storage method, how to encode the information to be
stored in the DNA molecule has not yet been determined.
Character encoding is one of the most important
foundations of a computer system. This paper presents
exploratory research that uses permutations and
combinations of the four nitrogenous bases of the DNA
molecule to encode character information. The research
addresses two major problems: the selection of the
storage medium and the coding rules.
Storage Medium
A DNA molecule used as an information storage medium
can take many forms. It can be single-stranded or
double-stranded, and it can be a long (linear) chain or
a circular strand. Some circular strands with special
biological meaning are called plasmids [6]. These forms
have different advantages and disadvantages as
information storage media, so these factors must be
considered when choosing the storage medium in order to
bring the advantages of DNA storage and simplicity of
operation into full play. The DNA storage system
described here uses a circular single-stranded DNA
molecule as its storage medium.
Single-stranded and double-stranded DNA each have
advantages and disadvantages. Double-stranded DNA is
more stable than single-stranded DNA, which is one of
the most important reasons why most living organisms
use double-stranded DNA as their genetic material, but
data stored in double-stranded DNA are difficult to
read: the two attached chains must be unzipped into
single strands before reading and cloning.
Single-stranded DNA can be read using the Watson-Crick
complementarity principle, but it is not stable: it not
only breaks more easily than double-stranded DNA but
also readily forms self-complementary hairpin
structures. We choose single-stranded DNA because it is
easier to read and clone than double-stranded DNA. In
addition, hairpin structures can be avoided through
special design of the single strand.
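As a rough illustration of the Watson-Crick pairing and the hairpin risk mentioned above, the following Python sketch computes the complementary strand and applies a naive hairpin check. It is not taken from the paper: the function names, the stem length of 4 and the minimum loop length of 3 are arbitrary assumptions chosen only for illustration, and real hairpin prediction depends on thermodynamics that this check ignores.

```python
# Watson-Crick pairing: A pairs with T, G pairs with C.
COMPLEMENT = str.maketrans("AGCT", "TCGA")


def reverse_complement(strand: str) -> str:
    """Return the strand that pairs with `strand`, read in the opposite direction."""
    return strand.translate(COMPLEMENT)[::-1]


def has_hairpin_stem(strand: str, stem: int = 4, min_loop: int = 3) -> bool:
    """Naive check (illustrative assumption, not the paper's method).

    Reports True when some `stem`-base window has its reverse complement
    at least `min_loop` bases further along the same strand, so the strand
    could fold back and pair with itself.
    """
    for i in range(len(strand) - 2 * stem - min_loop + 1):
        window = strand[i:i + stem]
        target = reverse_complement(window)
        if strand.find(target, i + stem + min_loop) != -1:
            return True
    return False
```

A strand such as GGGGAAACCCC is flagged by this check, because its first four bases can pair with its last four; a designer could reject or re-code such sequences before synthesis.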
Compared with circular DNA, a long-chain (linear) DNA
molecule is cut into two independent segments by a
single endonuclease cleavage, whereas a circular strand
remains in one piece and, under certain conditions, can
even be rejoined into a circle. Moreover, a long chain
is easily degraded from its ends by certain
exonucleases, while a circular strand is less
susceptible to this kind of degradation.
Coding Rules
The DNA molecule is composed of four nitrogenous
bases; therefore, permutations and combinations of the
four bases can be used to encode the information to be
stored in the DNA storage system. The coding rules are
as follows:
Unique Code
To be compatible with different countries, languages
and multi-language environments, each character must be
defined by a unique code. The coding deals with
characters in an abstract way, as combinations of
adenine, guanine, cytosine and thymine (A, G, C and T
for short), and leaves visual rendering, such as font
size, shape, font, form and style, to application
software such as a web browser or word processor.
Permutation and Combination of Nitrogenous Bases
The coding rule is composed of permutations and
combinations of the four nitrogenous bases. To include
as many characters of all countries and languages as
possible, the encoding covers the Unicode range from 0
to 0x10FFFF, a total of 1114112 code points. To give
each character a unique code, 11 nitrogenous bases are
needed to represent every code point: each base carries
two bits, and 4^10 = 1048576 < 1114112 <= 4^11 =
4194304. To economize on storage space, redundant
leading bases are dropped, working from the high-order
end toward the low-order end. Adenine (A) represents
'00', guanine (G) '01', cytosine (C) '10' and thymine
(T) '11'. Table 1 is the mapping table of nitrogenous
bases.
TABLE 1. MAPPING TABLE OF NITROGENOUS BASES
No.  Unicode    Binary                      Sequence
1    0x0        0                           A
2    0x1        1                           G
3    0x2        10                          C
4    0x3        11                          T
5    0xA        1010                        CC
6    0xAF       1010 1111                   CCTT
7    0x10FFFF   1 0000 1111 1111 1111 1111  GAATTTTTTTT
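The coding rules above can be sketched in a few lines of Python. This is a minimal illustration rather than the authors' implementation: the names `BASES`, `FULL_WIDTH`, `encode_code_point` and `decode_sequence` are hypothetical, and the sketch assumes that the space-saving step means dropping leading A bases (which encode '00'), with code point 0 written as a single A.

```python
BASES = "AGCT"             # index is the 2-bit value: A=00, G=01, C=10, T=11
MAX_CODE_POINT = 0x10FFFF  # last Unicode code point
FULL_WIDTH = 11            # bases needed to cover 0..0x10FFFF

# Ten bases are too few and eleven are enough for 1114112 code points.
assert 4 ** 10 < MAX_CODE_POINT + 1 <= 4 ** FULL_WIDTH


def encode_code_point(cp: int, simplify: bool = True) -> str:
    """Map one code point to bases, two bits per base, high-order first."""
    if not 0 <= cp <= MAX_CODE_POINT:
        raise ValueError(f"code point out of range: {cp!r}")
    digits = []
    for _ in range(FULL_WIDTH):
        digits.append(BASES[cp & 0b11])  # take the lowest two bits
        cp >>= 2
    seq = "".join(reversed(digits))
    if simplify:
        # Assumed simplification: drop leading A bases, keep at least one base.
        seq = seq.lstrip("A") or "A"
    return seq


def decode_sequence(seq: str) -> int:
    """Inverse mapping: read the bases back into a code point."""
    cp = 0
    for base in seq:
        cp = (cp << 2) | BASES.index(base)  # ValueError on a non-AGCT symbol
    return cp
```

For instance, `encode_code_point(0xAF)` gives CCTT and `encode_code_point(0xA)` gives CC, and `decode_sequence` maps each sequence back to its code point. Note that simplified codes have variable length, so a stored strand would need some way of marking character boundaries; the sketch does not address that.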
Latin Letters
Computer systems support the basic Latin letters.
ISO 8859-1 defines a 256-position set of commonly used
characters, such as digits, uppercase Latin letters and
lowercase Latin letters. The first 256 positions of the
character encoding are therefore reserved for the
characters included in ISO 8859-1, in order to improve
encoding efficiency and compatibility.
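The efficiency argument can be checked with a short Python sketch. It relies on the fact that the first 256 Unicode code points coincide with ISO 8859-1, so a byte value n decodes to code point n. The helper names are hypothetical, and returning one sequence per character (instead of one concatenated strand) is an assumption made here because the text does not say how simplified codes are delimited.

```python
BASES = "AGCT"  # A=00, G=01, C=10, T=11, as in the coding rules


def encode_code_point(cp: int) -> str:
    """Simplified base sequence for one code point (leading A bases dropped)."""
    seq = ""
    while cp > 0:
        seq = BASES[cp & 0b11] + seq  # prepend the base for the lowest two bits
        cp >>= 2
    return seq or "A"  # code point 0 is written as a single A


def latin1_to_bases(data: bytes) -> list[str]:
    """Encode ISO 8859-1 bytes one character at a time."""
    text = data.decode("latin-1")  # byte value n becomes code point n
    return [encode_code_point(ord(ch)) for ch in text]


# Length of the longest code among the 256 reserved positions.
longest = max(len(seq) for seq in latin1_to_bases(bytes(range(256))))
```

Because 4^4 = 256, every character in the reserved range needs at most four bases, compared with eleven for an arbitrary code point.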
Multi-Language Environment
To improve efficiency and multi-language compatibility,
the character encoding provides an independent zone for
each language. The Unicode planes are a good reference
for the character encoding.
Algorithm
The algorithm describes how to encode characters with
nitrogenous bases. First, import the text file to be
transformed into memory. Following the order of the
characters in the text, get the Unicode code point of
each character one by one. Following the coding rules,
transcode the code point to nitrogenous bases. Output
the final result to the stored DNA sequence. For
example, the Unicode code point of the character "A" is
0x41 (01000001); the corresponding 11 nitrogenous bases
are AAAAAAAGAAG, and the simplified nitrogenous bases
are GAAG.
Initialization
Import the text file
for each character do
    Get the Unicode code point of the character, Cunicode
    Transcode Cunicode to CDNA
    Output CDNA to the stored DNA sequence
end for
Verification of Algorithm
Import a text file that includes Latin letters,
Chinese characters, Japanese characters, numbers and
symbols. The application software (Figure 1 is an
example) first gets the Unicode code point of each
character in binary. Then, following the coding rules,
the application software transcodes the code point to
nitrogenous bases. By inverting this operation, the
application software can also recover the raw text from
the DNA sequence.
Conclusions
This paper puts forward a character encoding for DNA
storage systems. The encoding converts characters to
sequences of nitrogenous bases and thereby implements
the encoding and decoding of character information. The
character encoding is compatible with multi-language
environments, and every character's code is unique.
ACKNOWLEDGMENT
Dr. Qian Xu and Dr. Yang Guo are gratefully acknowledged
for supporting this study. The Laboratory of Complex
Network and Visualization made the publication of this
article possible.
REFERENCES
Adleman LM., "Molecular Computation of Solutions to
Combinatorial Problems," in Science, vol. 266(11), 1994,
pp. 1021-1023.
Dietrich A. and Been W., "Memory and DNA," in J Theor
Biol, vol. 208, 2001, pp. 145-149.
Garzon MH., Neel A., Chen H., "Efficiency and Reliability of
DNA Based Memories," in GECCO, 2003, pp. 379-389.
ROBERT F W., Molecular Biology, 2nd ed., Beijing: Science
Press, 2003, pp. 642-682.
Wei Dan, "Review of magnetic information storage
technology," in Physics, vol. 33(9), 2004, pp. 646-651.
Zhengxu Zhao, Yang Guo, Scale-free Model in Software
Engineering: a New Design Method, 2013 International
Conference on Geo-Informatics in Resource Management
& Sustainable Ecosystem, 2013. ISSN: 1865-0929. Print
ISBN: 978-3-642-41907-2. Online ISBN: 978-3-642-41908-9.
Conference location: Wuhan, China.
ZINGEL T., "Formal models of DNA computing: a survey,"
in Proc Estonian Acad Sci Phys Math, vol. 49(2), 2000, pp.
90-99.
FIGURE 1. EXAMPLE OF THE CHARACTER ENCODING