Obtaining Provably Good
Performance from Suffix Trees
in Secondary Storage
Pang Ko & Srinivas Aluru
Department of Electrical and Computer
Engineering
Iowa State University.
Motivation



Large amounts of biological sequence data.
An index for a text is usually larger than the text itself.
Efficient ways to store and query these data are required.
Related Works

String B-tree
    Has the best worst-case performance in secondary storage, allowing updates.
    However, most existing programs still use suffix trees instead of string B-trees.

Farach, odd-even tree construction
    Optimal construction time in secondary storage.
    The performance of search and update operations has not been studied.

Clark and Munro. “Efficient suffix trees on secondary storage”
    Focuses on reducing the space usage of suffix trees.
    Performance depends on the height of the tree.

Many other works focus only on the construction of suffix trees and give no worst-case bounds
    S.J. Bedathur and J.R. Haritsa. Search-optimized suffix-tree storage for biological applications.
    E. Hunt, M.P. Atkinson, and R.W. Irving. Database indexing for large DNA and protein sequence collections.

We show that suffix trees can achieve the same level of efficiency for constant-size alphabets.
Definitions





Let v be an internal node of a suffix tree.
size(v) is the number of leaves in the subtree
rooted at v.
rank(v) = i iff C^i ≤ size(v) < C^{i+1}.
Internal nodes u and v belong to the same
partition, iff u is the parent of v and
rank(v)=rank(u).
The rank of a partition P, rank(P) is the rank of
the internal nodes in the partition.
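The definitions above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the tree is a hypothetical toy given as a dictionary from node to children (not the example from the slides), and the helper names rank_of, subtree_sizes and partition_roots are made up for this sketch.

```python
def rank_of(size, C):
    """Return i such that C**i <= size < C**(i+1), using integer arithmetic."""
    i = 0
    while C ** (i + 1) <= size:
        i += 1
    return i


def subtree_sizes(children, root):
    """Map each node to the number of leaves in the subtree rooted at it."""
    sizes = {}

    def visit(v):
        kids = children.get(v, [])
        sizes[v] = 1 if not kids else sum(visit(k) for k in kids)
        return sizes[v]

    visit(root)
    return sizes


def partition_roots(children, root, C):
    """Map each internal node to the root of the partition that contains it."""
    sizes = subtree_sizes(children, root)
    part = {root: root}
    stack = [root]
    while stack:
        u = stack.pop()
        for v in children.get(u, []):
            if not children.get(v):
                continue  # leaves of the tree are not internal nodes
            same = rank_of(sizes[v], C) == rank_of(sizes[u], C)
            # v joins its parent's partition only when the ranks are equal.
            part[v] = part[u] if same else v
            stack.append(v)
    return part


# Hypothetical toy tree: node -> list of children (nodes without an entry are leaves).
children = {
    "r": ["a", "b"],
    "a": ["a1", "a2"],
    "b": ["c", "b3"],
    "c": ["c1", "c2", "c3"],
}
C = 2
sizes = subtree_sizes(children, "r")
part = partition_roots(children, "r", C)
```

In this toy tree with C = 2, nodes r and b both have rank 2 and share a partition, while a and c have rank 1 and each start a partition of their own.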
A Suffix Tree Partitioned
[Figure: a suffix tree divided into partitions, labeled with ranks 2, 1, and 0.]
Each root-to-leaf path goes through at most log_C n partitions.
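Suffix Tree & Partition Example
[Figure: example suffix tree with C = 3, showing the partitions of rank 0.]
Suffix Tree & Partition Example
[Figure: example suffix tree with C = 3, showing the partitions of rank 1.]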
Properties of a Partition



Nodes in a partition without any child in the
same partition are referred to as leaves.
The node whose parent is in another partition
is referred to as the root.
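There are at most C-1 leaves in each partition.
    size(root) ≥ size(u) for all leaves u of the partition.
    C^{i+1} - 1 ≥ size(root) ≥ size(u) ≥ C^i, where i is the rank of the partition.
    C · C^i = C^{i+1}, so C leaves with disjoint subtrees would already force size(root) ≥ C^{i+1}.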
Properties of a Partition

If a node v has more than 1 child in the same partition as v, it is
referred to as a branching node.

There can be at most C-2 branching nodes, because there are at
most C-1 leaves.
A skeleton partition tree for a partition P contains the root, all the
leaves and branching nodes of a partition.
    There are at most 2C-2 nodes in a skeleton partition tree.
    With a suitable choice of C, it can be stored in one disk page.

Partition and Skeleton Partition Tree
Store a representative suffix in each
node of the skeleton partition tree.
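As a rough illustration of which nodes end up in a skeleton partition tree, the sketch below classifies the nodes of one partition into leaves and branching nodes. The partition, the node names and the function skeleton_nodes are hypothetical; the input is assumed to list, for each node, only the children that lie in the same partition.

```python
def skeleton_nodes(part_children, root):
    """Return (leaves, branching, skeleton) for one partition.

    part_children maps each node of the partition to its children that lie
    in the same partition; children in other partitions are omitted.
    """
    leaves, branching = set(), set()
    stack = [root]
    while stack:
        v = stack.pop()
        kids = part_children.get(v, [])
        if not kids:
            leaves.add(v)          # no child in the same partition
        elif len(kids) > 1:
            branching.add(v)       # more than one child in the same partition
        stack.extend(kids)
    skeleton = {root} | leaves | branching
    return leaves, branching, skeleton


# Hypothetical partition, assuming C = 4 (so at most C-1 = 3 leaves).
part_children = {
    "root": ["x"],
    "x": ["y", "l1"],
    "y": ["z"],
    "z": ["l2", "l3"],
}
leaves, branching, skeleton = skeleton_nodes(part_children, "root")
```

Here the skeleton keeps the root, the three leaves and the two branching nodes x and z, which is 2C-2 = 6 nodes for C = 4; the non-branching node y is left out.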
Searching for an Exact Match (1)
p = TTAATGAT
Load the representative
suffix and compare to p.
Suppose the representative
suffix is TTATTAGGA……
The longest common prefix (lcp) of p and the
representative suffix has length 3.
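The comparison step can be checked with a minimal sketch. The function name lcp_length is an assumption made for this illustration, and only the visible prefix of the representative suffix from the slide is used.

```python
def lcp_length(p, s):
    """Length of the longest common prefix of strings p and s."""
    n = 0
    for a, b in zip(p, s):
        if a != b:
            break
        n += 1
    return n


pattern = "TTAATGAT"
representative_prefix = "TTATTAGGA"  # only the visible prefix from the slide
k = lcp_length(pattern, representative_prefix)
print(k)  # 3
```

The first three characters (TTA) agree and the fourth differs (A versus T), which gives the value 3 used in the example.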
Searching for an Exact Match (2)
p = TTAATGAT
The lcp between p and the
representative suffix is 3.
Move to the appropriate
next partition.
Total number of disk accesses:
O(p/B + log_B n)
Supporting Update Operations


Insertions and deletions change the sizes of nodes and, as a result, the partitions.
During the insertion of a suffix:
    size(v) changes if and only if node v is an ancestor of the newly inserted leaf.
    rank(v) may change only if size(v) changes and node v is the root of a partition.
    If rank(v) changes, node v becomes either a new partition by itself or a leaf in its parent’s partition.
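The effect of an insertion on sizes and ranks can be sketched as follows. This is a simplified illustration, not the update procedure from the talk: it assumes the new leaf is attached directly under an existing internal node (a real suffix insertion may also split an edge), and the tree, the node names and the function insert_leaf are hypothetical.

```python
def rank_of(size, C):
    """Return i such that C**i <= size < C**(i+1)."""
    i = 0
    while C ** (i + 1) <= size:
        i += 1
    return i


def insert_leaf(parent, size, attach_to, new_leaf, C):
    """Attach new_leaf under attach_to and update the sizes of its ancestors.

    Returns a list of (node, was_partition_root) for nodes whose rank changed.
    """
    parent[new_leaf] = attach_to
    size[new_leaf] = 1
    changed = []
    v = attach_to
    while v is not None:
        u = parent[v]
        old_rank = rank_of(size[v], C)
        # size[u] has not been updated yet, so this test uses the old ranks:
        # v was a partition root if it had no parent or a parent of another rank.
        was_root = u is None or rank_of(size[u], C) != old_rank
        size[v] += 1
        if rank_of(size[v], C) != old_rank:
            changed.append((v, was_root))
        v = u
    return changed


# Hypothetical tree with C = 2: r has children a (3 leaves) and b (2 leaves).
parent = {"r": None, "a": "r", "b": "r",
          "a1": "a", "a2": "a", "a3": "a", "b1": "b", "b2": "b"}
size = {"r": 5, "a": 3, "b": 2,
        "a1": 1, "a2": 1, "a3": 1, "b1": 1, "b2": 1}
changed = insert_leaf(parent, size, "a", "a4", 2)
```

In this example only node a changes rank (from 1 to 2), and a was the root of its own partition before the insertion; afterwards its rank equals that of r, so it becomes a leaf in its parent's partition.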
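Only the Rank of the Root of a Partition
Changes
[Figure: a partition with its root marked.]
Suppose rank(v) increased by one for a node v that is not the root of its partition.
    ⇒ size(v) was C^{rank(v)+1} - 1
    ⇒ size(root) was ≥ C^{rank(v)+1}
    ⇒ the root was not in the same partition as v, a contradiction.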
Insertion and Deletion


By the same argument, only the rank of a leaf of a partition
can change during the deletion of a suffix.
Store size(v) and keep it up to date for node v
if
1. node v is the root of a partition, or
2. node v is connected to the root by a
chain of branching nodes, or
3. node v is a non-branching node and is the child
of a node u that satisfies one of the conditions
above.
The Root of a Partition is Removed

Let v be a child of the old root in the partition.
    If v is a branching node, nothing needs to be done; the new partition with v as its root has all sizes set correctly.
    If v is a non-branching node, we can calculate the size of its only child in the partition by subtracting the sizes of all its other children from size(v).
After the updates, all size values are set
correctly, as stated previously.
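The subtraction for a non-branching node amounts to one line of arithmetic. The sketch below uses a hypothetical helper name and made-up numbers, and assumes the sizes of the other children are available (for example because they are roots of other partitions).

```python
def in_partition_child_size(size_v, other_child_sizes):
    """Size of v's only in-partition child, given size(v) and the sizes
    of all other children of v."""
    return size_v - sum(other_child_sizes)


# Made-up numbers: size(v) = 20, and v's other children have sizes 3 and 2.
child_size = in_partition_child_size(20, [3, 2])
print(child_size)  # 15
```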
The Root of a Partition is Removed
[Figure: a partition with the old root and the resulting new roots marked.]
The Leaf of a Partition is Removed

If a leaf is removed from a partition:
    The removed leaf becomes the root of a partition. Its size can be calculated as the sum of the sizes of all its children, which are all roots of other partitions.
    Either a previously branching node becomes a non-branching node, in which case no update of sizes is necessary, or
    a previously non-branching node becomes a new leaf, in which case the size of the new leaf can be calculated by adding the sizes of all its children.
The Leaf of a Partition is Removed
[Figure: a partition with a node marked as a leaf from another partition.]
Results



Let B be the size of a disk block.
Let n be the total length of the strings.
Let m be the length of the string being
inserted or deleted.
    Construction takes O(n log_B n) disk accesses.
    Insertion and deletion take O(m log_B (n+m)) and O(m log_B n) disk accesses, respectively.
Let p be the length of a pattern.
    Searching takes O(p/B + log_B n) disk accesses.