Study of Data Localities in Suffix-Tree Based Genetic
Algorithms
Carl I. Bergenhem, Michael T. Smith
Abstract. This paper studies the cache locality of two genetic algorithms based on the
Suffix Tree structure, and describes the cache performance of the Suffix Tree itself.
Keywords. Suffix Tree, SimpleScalar, REPuter, Probe Selection Problem
Algorithm, Cache Aware
1. Introduction
Suffix Trees are a well-known data structure for algorithms that require string
comparisons. A Suffix Tree can be used for various problems such as suffix matching,
substring matching, index-at, longest common substring, and genome-related
applications such as string merging. A Suffix Tree can solve most of these
problems in O(m) time, where m is the length of the query substring. It is the defining
structure of the Suffix Tree that enables this quick search time.
One of the most basic implementations of this structure is a Suffix Trie. This
implementation starts by defining a root node and then attaches the whole input string,
which is the suffix of length n (n is the length of the input string), beginning at its
first character. It then adds the suffix of length n-1, obtained by removing a character
from the beginning of the string. The algorithm continues through the entire string until
the suffix consisting only of the terminating character ‘$’ has been added, which yields a
complete Suffix Trie.
Fig 1. Suffix Trie Generated from Cocoa
This algorithm gives each character its own node, except where a previously
attached node can be re-used for the suffix that is currently being attached to the Trie.
To find a matching substring within this structure, one starts from the
root node and matches the first character of the input substring against the children of
the root. When a match is found, one matches the second character of the input
string against the children of the node that matched the previous character, and so on
until either a full match has been found or a mismatch occurs. When a full match has
occurred, one traverses the subtree rooted at the last matching node
until all of its leaf nodes (nodes associated with the
terminating character ‘$’) have been found. Each leaf node contains the index at
which its suffix starts in the initial string. This
implementation has the O(m) search time desired for a Suffix Trie (plus the time needed
to collect the matching leaves); however, building it can take as long as O(n^2) time.
The memory efficiency of this implementation is also very low, with a worst-case space
requirement of O(n^2).
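The construction and search just described can be sketched as follows. This is a minimal illustration, not the implementation used in this study: the names TrieNode, build_suffix_trie and search are hypothetical, and the input is assumed not to contain the terminator ‘$’.

```python
class TrieNode:
    """One node per character; leaves (reached via '$') store a suffix index."""
    def __init__(self):
        self.children = {}   # character -> TrieNode
        self.index = None    # start position of the suffix, set on '$' nodes only


def build_suffix_trie(text):
    """Insert every suffix of text + '$' character by character (O(n^2) time)."""
    text += "$"
    root = TrieNode()
    for i in range(len(text)):
        node = root
        for ch in text[i:]:
            # Re-use an existing child when one matches, otherwise create it.
            node = node.children.setdefault(ch, TrieNode())
        node.index = i
    return root


def search(root, pattern):
    """Return the sorted start positions of pattern in the original text."""
    node = root
    for ch in pattern:
        node = node.children.get(ch)
        if node is None:
            return []            # mismatch: the pattern does not occur
    # Full match: collect every leaf below the last matching node.
    found, stack = [], [node]
    while stack:
        current = stack.pop()
        if current.index is not None:
            found.append(current.index)
        stack.extend(current.children.values())
    return sorted(found)


trie = build_suffix_trie("cocoa")
print(search(trie, "co"))
```

The matching phase follows one child per pattern character, which is the O(m) part; the leaf collection afterwards depends on the size of the matched subtree.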
A more efficient version of the Suffix Tree algorithm is the Compressed Suffix
Tree (alternatively: Suffix Tree). This implementation removes the redundancy that
comes with the Suffix Trie and grants more efficiency in runtime and space
requirement. The most obvious difference between the compressed and uncompressed
tree structure is the number of nodes.
Fig 2. Compressed Suffix Tree generated from Cocoa
Within the compressed structure each node has a label whose length is between
1 and n, where n is the length of the input string. To achieve this during
construction, the algorithm searches the existing labels of the relevant
nodes in the current, incomplete tree until a partial or full match is found
or no match is found. When a partial match is found, a node is created that
separates the characters the existing branch shares with the current suffix from
the characters that do not match. The current suffix can then reuse the matched
characters and attach its remaining characters to this new node.
*Insert picture of this*.
This implementation reduces the run time and space requirement from O(n^2) to
O(n). The Compressed Suffix Tree was not perfected until Esko Ukkonen published
his algorithm for the construction of a Suffix Tree. Previous Suffix Tree constructions
were not online algorithms; in other words, they had to know the entire input before
construction could start. Ukkonen’s algorithm not only builds the Suffix Tree
character by character, it also reads the input from left to right (earlier
constructions such as Weiner's processed the input backwards).
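The edge-splitting step described above can be illustrated with a short sketch. It inserts the suffixes one at a time and therefore takes quadratic time in the worst case; it is not Ukkonen's algorithm. The names Node, build_suffix_tree, find and count_nodes are hypothetical, and the input is assumed not to contain ‘$’.

```python
class Node:
    """Edge label is text[start:end]; leaves also store their suffix index."""
    def __init__(self, start=0, end=0):
        self.start = start
        self.end = end
        self.children = {}        # first character of child label -> Node
        self.suffix_index = None  # set on leaves only


def build_suffix_tree(text):
    """Naive construction: insert each suffix, splitting an edge on a partial match."""
    text += "$"
    n = len(text)
    root = Node()
    for i in range(n):
        node, pos = root, i       # pos: next unmatched character of suffix i
        while True:
            child = node.children.get(text[pos])
            if child is None:     # no match: hang the rest of the suffix here
                leaf = Node(pos, n)
                leaf.suffix_index = i
                node.children[text[pos]] = leaf
                break
            k = child.start
            while k < child.end and text[k] == text[pos]:
                k += 1
                pos += 1
            if k == child.end:    # whole label matched: descend
                node = child
                continue
            # Partial match: the new node keeps the shared characters.
            mid = Node(child.start, k)
            node.children[text[mid.start]] = mid
            child.start = k
            mid.children[text[k]] = child
            leaf = Node(pos, n)
            leaf.suffix_index = i
            mid.children[text[pos]] = leaf
            break
    return root


def find(root, text, pattern):
    """Return sorted start positions of pattern; text is the string given to the builder."""
    text += "$"
    node, p = root, 0
    while p < len(pattern):
        child = node.children.get(pattern[p])
        if child is None:
            return []
        k = child.start
        while k < child.end and p < len(pattern):
            if text[k] != pattern[p]:
                return []
            k += 1
            p += 1
        node = child
    found, stack = [], [node]
    while stack:
        current = stack.pop()
        if current.suffix_index is not None:
            found.append(current.suffix_index)
        stack.extend(current.children.values())
    return sorted(found)


def count_nodes(node):
    return 1 + sum(count_nodes(child) for child in node.children.values())


tree = build_suffix_tree("cocoa")
print(find(tree, "cocoa", "co"), count_nodes(tree))
```

Storing start and end indexes instead of copied substrings is what keeps the space linear: each node has constant size regardless of how long its label is.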
Even though the Suffix Tree has reached these theoretical run
times and space requirements, practice can differ. When
applied to real workloads, the running time can be far worse than these
estimates for several reasons. The one we focus on is the
degradation of the Suffix Tree due to poor cache performance. In
all of these Suffix Tree implementations, nodes are stored in memory in the order in
which they are created during construction, which is generally not the order in which a
search visits them. Thus, when a search is
performed, there is a low probability that every node traversed will produce a cache
hit.
Fig 3. Example hits and misses throughout a simple search
As seen in Fig 3, when a search pattern is followed through the tree there
can be a large number of misses (assuming the cache block size is large enough to hold a
single node) if the nodes on the search path are scattered across different cache
blocks. Each miss costs a number of CPU cycles to
fetch the data from another memory source, which delays the execution
of the instructions that depend on that cache block. This, along with
other factors, can result in actual runtimes that are far worse than the runtimes
computed theoretically.
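As a rough illustration of why the hit rate matters, the usual average memory access time model can be applied with assumed numbers. The 1-cycle hit time and 100-cycle miss penalty are illustrative assumptions, not measurements from this study; h is the hit rate.

```latex
\begin{align*}
T_{\text{avg}} &= t_{\text{hit}} + (1 - h)\, t_{\text{miss}} \\
h = 0.98:\quad T_{\text{avg}} &= 1 + 0.02 \times 100 = 3 \text{ cycles} \\
h = 0.50:\quad T_{\text{avg}} &= 1 + 0.50 \times 100 = 51 \text{ cycles}
\end{align*}
```

Under these assumptions, a drop from a 98% to a 50% hit rate makes the average data access 17 times slower even though the asymptotic complexity of the algorithm is unchanged.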
To bring the actual runtime closer to the theoretical values, one can modify
algorithms to become either cache aware or cache oblivious. A cache-aware algorithm
is tuned to the parameters of the cache on the system running the
algorithm, and reduces the number of cache misses by
adhering to that specific system. A cache-oblivious algorithm is designed to use the
cache efficiently without knowing those parameters, so that its runtime stays
consistent with the theoretical values regardless of what kind of cache the
host system is utilizing.
2. Previous Work
Many researchers have studied cache performance with regard to
algorithms and data structures. As our work is based on two genetic
algorithms and the supporting Suffix Tree data structure, the most relevant prior work
covers the memory performance of tree data structures and implementations of the
Suffix Tree algorithm and the genetic algorithms. Our work could not
have been accomplished without the previous work of our referenced authors. The
primary resources we used were guides dealing with the installation and use of
SimpleScalar and standard implementations of the Suffix Tree algorithm. Our future
work will rely more on the cited thesis work dealing with cache-aware data
structures.
3. Methodology
To observe and track the performance of our algorithms we used
the SimpleScalar suite. The suite is a collection of programs that allows a user to
specify a CPU architecture and then simulate that
architecture running code written in FORTRAN or C. This allows a user to
write a program and then measure how well the code would perform on a specific
architecture. For this research we used the program sim-cache. Sim-cache is a cache
simulator that simulates the L1 instruction and data
caches, as well as the L2 cache. It also allows the user to specify the level of
associativity along with the replacement algorithm to use on cache misses. To
simulate the CPU running the given code, a cross-compiler in the Linux
environment is needed to compile the code specifically for SimpleScalar. Once the
code is compiled, running it under sim-cache generates an output file that can be read
with any Linux text editor. This output file contains detailed information such as the
total number of cache references, misses and hits. A guide to SimpleScalar installation,
along with the commands needed to use sim-cache, is included in section
4.2.
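To make the roles of cache size, block size, associativity and replacement policy concrete, the following is a minimal model of a set-associative cache with least-recently-used replacement. It is only a sketch of the kind of bookkeeping a cache simulator performs; the class name Cache and its parameters are hypothetical and do not correspond to sim-cache's own options or code.

```python
from collections import OrderedDict


class Cache:
    """Set-associative cache model with LRU replacement (illustrative only)."""

    def __init__(self, num_sets, block_size, assoc):
        self.num_sets = num_sets      # number of sets
        self.block_size = block_size  # bytes per block
        self.assoc = assoc            # blocks per set (1 = direct mapped)
        self.sets = [OrderedDict() for _ in range(num_sets)]
        self.hits = 0
        self.misses = 0

    def access(self, address):
        """Record one data reference; return True on a hit."""
        block = address // self.block_size
        lines = self.sets[block % self.num_sets]
        if block in lines:
            lines.move_to_end(block)       # mark as most recently used
            self.hits += 1
            return True
        if len(lines) >= self.assoc:
            lines.popitem(last=False)      # evict the least recently used block
        lines[block] = True
        self.misses += 1
        return False


cache = Cache(num_sets=1, block_size=16, assoc=2)
for address in (0, 16, 0, 32, 16, 32):
    cache.access(address)
print(cache.hits, cache.misses)
```

The total capacity in this model is num_sets × block_size × assoc bytes, so a 1 kilobyte direct-mapped cache with 32-byte blocks would be Cache(32, 32, 1).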
3.1 REPuter Algorithm
The REPuter algorithm is used by genetic researchers to find maximal repeats in a
given genomic sequence. For the purposes of this paper, a maximal repeat is any
sub-section of the genome that
appears in multiple locations within the genome. A simple example would be the
string “an” within “banana”. A valid maximal repeat is any substring of the given
text that appears at least twice and has a length of at least a set threshold. For the
above example to be valid, that length threshold would have to be at most 2. A
slightly larger example is as follows. Take the sequence “banabana” and a threshold
of two. This sequence has several maximal repeats, as the conditions are that the
substring appears at least twice and its length is at least the threshold. For example,
“bana” appears twice, “ana” appears twice, “an” appears twice, and “na” appears twice.
The use of this algorithm on the genome allows researchers to find recurring motifs
within a DNA sequence, or it can become part of a larger algorithm that finds
recurring sequences with minor mutations.
What makes this algorithm so powerful is the data structure it is implemented with,
the Suffix Tree. The Suffix Tree allows all repeating substrings, and their locations,
to be found efficiently. This means that the algorithm runs in time and space linear
in the length of the genome sequence being operated on. The
algorithm operates in the following way: starting from the root, and for every node
thereafter, proceed as follows.
REPUTER(Node current node)
    If the current node is a leaf node
        Return 1 as a counter (marks an occurrence)
    If the current node is not a leaf
        Keep a sum starting at 0
        For each child node/path
            Add to the sum the result of calling REPUTER on the child node
        If the sum is at least 2 (the number of occurrences)
        and the length of the common string is at least the threshold
            A maximal repeat has been found
        Return the sum to keep a tally for the parent nodes
The above over-simplified algorithm will find all the maximal repeats; however, a
few details have been left out for ease of understanding. What is clear from the
description, though, is that the whole tree must be traversed. The operation at each
node takes constant time: finding the length is a simple operation using the start
and end indexes stored within the node, and no character comparisons need to be
performed, since an internal node means all of its children share that substring. That
is the power that the Suffix Tree offers the REPuter algorithm: the ability to find all
maximal repeats while performing a number of steps linearly proportional to the
size of the input sequence.
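The leaf-counting recursion in REPUTER can be sketched in a few lines. For brevity this sketch runs on an uncompressed suffix trie built from nested dictionaries rather than on the compressed Suffix Tree, so it is quadratic in the worst case and reports every repeated substring of sufficient length, matching the simplified definition of a repeat used in this section. The function name find_repeats is hypothetical, and the input is assumed not to contain ‘$’.

```python
def find_repeats(text, threshold):
    """Return {substring: occurrences} for substrings seen at least twice
    whose length is at least threshold."""
    data = text + "$"

    # Build an uncompressed suffix trie: each node is a dict of children.
    root = {}
    for i in range(len(data)):
        node = root
        for ch in data[i:]:
            node = node.setdefault(ch, {})

    repeats = {}

    def visit(node, prefix):
        if not node:
            return 1                       # leaf: marks one occurrence
        total = 0
        for ch, child in node.items():
            total += visit(child, prefix + ch)
        if total >= 2 and len(prefix) >= threshold:
            repeats[prefix] = total        # a repeat has been found
        return total                       # tally for the parent node

    visit(root, "")
    return repeats


print(sorted(find_repeats("banabana", 2)))
```

On a compressed tree the prefix would not be rebuilt character by character; its length would be read from the stored start and end indexes, which is what makes the per-node work constant. The recursion depth here grows with the length of the input, so very long sequences would need an explicit stack.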
The problem encountered while trying to measure the cache hit ratio of the
algorithm with SimpleScalar was the configuration of the sim-cache tool. Using a
file with a sample sequence 1 million characters in length consisting of A, C, G, and T
resulted in hit rates that were implausibly high. Building the tree itself
had a hit rate in the upper 90s percent, and the REPuter algorithm running
on top of that was only slightly lower. We did not expect this, because the
Suffix Tree sprawls across many memory blocks as it is created:
nodes are inserted ‘out of order’, meaning the last inserted node could be the first node
below the root. We ran our tests with a 1 kilobyte level-1
data cache and a tree size of around 20 MB (based on a 20-byte node).
When the tree is constructed, different paths of the tree are accessed in
varying orders, which alone should yield a low hit rate as the algorithm has low
spatial locality. The REPuter function should not perform much better, as a full
tree traversal must be performed. In this traversal, the nodes of each path
lie in different memory blocks, all of which must be loaded for that path to be
evaluated. When the next path is traversed, different blocks must be loaded,
or the same blocks in a different order, leading to constant replacements
within the data cache.
3.2 Probe Selection Problem (PSP) Algorithm
DNA arrays are widely used to identify viruses that cause diseases and to control the
quality of items in the food industry, because they allow fast identification of
biological agents present in a given sample. A large part of this is the selection of
oligos that are to be attached to the array surface. Given a set A of genomic
sequences, one has to find at least one oligonucleotide (probe) for each sequence
S. This probe must be chosen so that it does not hybridize with any
sequence other than its target. Also, all probes must hybridize to their specific
targets under the same reaction conditions. The most important condition is the
temperature T under which the experiment is conducted. The Probe Selection
Problem Algorithm, using the Suffix Tree structure, allows the
temperature T to be computed efficiently.
Before modifying any aspect of the Suffix Tree, we needed to understand and
implement it. Initially a simple program
was written in Java. Through a graphical
interface, this program allowed a user to load the string for the suffix tree from a
text file. It also allowed a search to be done on the tree, giving as output all
occurrences of the substring within the original string. Another feature generated a
random string of length L consisting only of A, T, C, and G, along with
a substring of length K over the same letters, in order to
randomize the experiments. Unfortunately the later use of SimpleScalar
forced the use of the C language. The installation of SimpleScalar raised
another string of problems: the latest cross-compiler
designed for SimpleScalar turned out to be severely outdated, and thus an old version
of Linux was required in order to configure SimpleScalar. Once set up on Red Hat Linux 9,
SimpleScalar was configured and sim-cache was tested on a simple program. Once an
implementation of the Suffix Tree was written, it was run through the sim-cache
utility with cache sizes ranging from 0.5 to 8 kilobytes and 1- to 8-way
associativity. All CPU configurations had direct-mapping as the replacement
structure. The results, however, were not what we expected. According to the
output files generated by sim-cache, the hit rate of the Suffix Tree implementation
ranged from 97% to 99% during creation and from 95% to
97% during a search. As seen in the earlier example, when searching for a substring
within the suffix tree the expected hit rate should be around 50% or 60%. To confirm
that the SimpleScalar suite was working correctly, a simple program was written that
generated a two-dimensional array and filled each entry with a number. The array was
then traversed both row-wise and column-wise. These two forms
of traversal should have yielded a large difference in the hit ratio:
for the row-wise traversal, the next index in the array is most likely in the same
block or the next block in the cache, making the miss ratio fairly small, whereas for
the column-wise traversal there should be a cache miss for almost every index that is
traversed.
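The expected gap between the two traversals can be reproduced with a small direct-mapped cache model. This is a sketch under stated assumptions, not the check program used in the study: the 64 by 64 array, 4-byte elements, 32-byte blocks and 1 kilobyte cache are illustrative values, and the names hit_rate and traversal are hypothetical. The model counts only the array element accesses, whereas a simulator such as sim-cache counts every data reference the compiled program makes.

```python
def hit_rate(addresses, cache_bytes=1024, block_bytes=32):
    """Hit rate of a direct-mapped cache over a sequence of byte addresses."""
    num_lines = cache_bytes // block_bytes
    lines = [None] * num_lines           # block currently held by each line
    hits = 0
    total = 0
    for address in addresses:
        block = address // block_bytes
        slot = block % num_lines
        if lines[slot] == block:
            hits += 1
        else:
            lines[slot] = block          # miss: replace whatever was there
        total += 1
    return hits / total


def traversal(rows, cols, elem_bytes=4, row_wise=True):
    """Byte addresses touched when walking a row-major 2D array."""
    if row_wise:
        order = ((r, c) for r in range(rows) for c in range(cols))
    else:
        order = ((r, c) for c in range(cols) for r in range(rows))
    return [(r * cols + c) * elem_bytes for r, c in order]


print(hit_rate(traversal(64, 64, row_wise=True)))
print(hit_rate(traversal(64, 64, row_wise=False)))
```

With these assumed sizes, eight consecutive elements share a block, so the row-wise walk misses once per block, while the column-wise walk jumps 256 bytes per step and evicts each block before it is reused.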
4. Conclusion
Although our theoretical estimates gave a hit ratio around 50%, when run
through sim-cache the implementations of the Suffix Tree had hit ratios around 98%.
Even the check program, a simple two-dimensional array traversed
row-wise as well as column-wise, produced high hit ratios, including for the
column-wise traversal, which theoretically should have a lower hit rate than
the row-wise traversal. This hints at the conclusion that there is an
issue with the SimpleScalar suite. Whether this issue stems from our usage of sim-cache
or from sim-cache itself remains to be investigated. The fact that both programs
yielded much higher results than expected is at least consistent, and thus we can
still claim that the implementations of the Suffix Tree data structure are correct
and can be used for future research on the Suffix Tree.
4.1 Future Work
The value of the work presented in this paper is that it serves as a launch pad
for exploring modifications to the algorithms and the runtime impact
of those modifications. As the installation procedure and commands have now been
documented for more up-to-date systems, the SimpleScalar suite can be used easily and
effectively to monitor cache performance, along with the many other tools it offers. A
last hurdle is understanding why the sim-cache simulator yielded such high hit
rates when we expected them to be much lower. Once that is resolved, serious
modifications and improvements can be made to the Suffix Tree creation
algorithm and the two genetic algorithms, allowing for decreased actual runtime and
increased productivity for the researchers who rely on these tools.
One of the larger modifications that could be made to the Suffix Tree is
a reorganization of how the tree is allocated in memory. Although nodes
are created in a different order from the one in which they may later be accessed, a
simple mechanism that allocates related nodes (parents and children)
to the same block in memory in constant time could substantially improve the
performance of tree traversals.
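One simple way to explore this idea is to renumber an already built tree in depth-first order, so that a parent and its first child become neighbours in memory. This is an offline relayout, swapped in here for illustration, and not the constant-time allocation mechanism proposed above. The names preorder_layout and same_block_edges are hypothetical, and the tree and block size in the example are made up.

```python
def preorder_layout(children, root):
    """Map each node id to its position in a depth-first (preorder) walk.

    children maps a node id to the ordered list of its child ids.
    """
    position = {}
    stack = [root]
    while stack:
        node = stack.pop()
        position[node] = len(position)
        # Reverse so the first child is visited (and placed) next.
        stack.extend(reversed(children.get(node, [])))
    return position


def same_block_edges(children, position, nodes_per_block):
    """Count parent-child pairs whose positions share a memory block."""
    count = 0
    for parent, kids in children.items():
        for kid in kids:
            if position[parent] // nodes_per_block == position[kid] // nodes_per_block:
                count += 1
    return count


# Node ids stand for creation order; related nodes were created far apart.
tree = {0: [3, 6], 3: [1, 5], 6: [2, 4]}
creation_order = {node: node for node in range(7)}

print(same_block_edges(tree, creation_order, nodes_per_block=2))
print(same_block_edges(tree, preorder_layout(tree, 0), nodes_per_block=2))
```

In this toy tree no parent shares a block with a child under creation order, while the depth-first layout co-locates two of the six parent-child pairs. Whether such a layout helps on real suffix trees would have to be measured.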
4.2 SimpleScalar Installation Guide
As a major complication arose in understanding the use of the SimpleScalar toolset,
the following is a brief guide on how to use the largely unmaintained SimpleScalar
simulator package. The following was tested on an Ubuntu 7.04 system. The source files
and 'installer' were created by Cameron Palmer and are hosted by csrl.unt.edu.
From a terminal window run the following commands:
• sudo apt-get install subversion
• svn co http://csrl.unt.edu/svn/simplescalar
• sudo apt-get install bison
• sudo apt-get install g++-3.3 gcc-3.3
• cd simplescalar/
• sudo sh simpleinstaller-little.sh
The above takes care of the installation. To compile your programs and run them,
the following two commands run from the simplescalar/ directory will work.
• bin/sslittle-na-sstrix-gcc -o simple_program simple_program.c
• simplesim-3.0/sim-cache simple_program
5. Acknowledgements
This project was supported in part by National Science Foundation Grant CCF-0755373
and was supervised by Professor Chun-Hsi Huang.
6. References
1. SimpleScalar 3.0 – BBSWiki
http://wiki.bigbuddysociety.net/index.php?title=SimpleScalar_3.0
2. Data Structures, Algorithms, & Applications in Java Suffix Trees,
http://www.cise.ufl.edu/~sahni/dsaaj/enrich/c16/suffix.htm
3. Thomas B. Puzak, B.S.: The Effects of Spatial Locality on the Cache Performance
of Binary Search Trees, MS Thesis, University of Connecticut – Department of
Computer Science
4. Stefan Kurtz, Chris Schleiermacher: REPuter: fast computation of maximal repeats
in complete genomes. Bioinformatics Applications. Vol 15, 426-427
5. ANSI C implementation of a Suffix Tree,
http://mila.cs.technion.ac.il/~yona/suffix_tree/
6. SimpleScalar LLC, http://www.simplescalar.com
7. SimpleScalar Evolved: Archived Mail, http://csrl.unt.edu/pipermail/research/2007August/000070.html
8. Growing A Suffix Tree, http://pauillac.inria.fr/~quercia/documents-info/Luminy98/albert/JAVA+html/SuffixTreeGrow.html
9. Fast String Searching with Suffix Trees, http://marknelson.us/1996/08/01/suffixtrees/
10. dynamicSimpleScalar, http://www-ali.cs.umass.edu/DSS/
11. Suffix Tree, http://www.allisons.org/ll/AlgDS/Tree/Suffix/