Saving Space #3

Tries & Suffix Trees
Storing Collections of Strings
•
All our dictionary data structures can be used to
store strings, of course.
•
But strings have several properties that might make
you want a different data structure:
-
Comparing strings can be slow (not constant time)
Strings can “overlap” or be similar: e.g. cat, car, cars, cart
Special kinds of queries are important: e.g. “Does string S
contain “marmoset” as a substring?”
Tries
•
A trie, pronounced “try”, is a tree that exploits some
structure in the keys
-
e.g. looking at their individual characters
ace, add, added,
cede, dad, deed
Alphabet = ACDE,
special $ “end of
string” character
A
C
D
E
$
A
A
C
D
E
C
D
E
A
$
C
D
E
$
C
D
E
$
A
C
D
E
A
C
D
E
$
A
C
D
E
$
A
C
D
C
D
E
E
C
D
E
A
$
C
D
E
$
add$
A
C
$
A
C
D
E
$
D
$
A
C
D
E
$
A
C
D
E
$
$
dad$
ace$
E
$
A
A
D
$
A
A
C
E
$
deed$
A
C
D
E
$
added$
cede$
Saving Space #1
•
Store nodes in a 2-d
table.
•
table size = |∑| = size of
the alphabet by number
of nodes m
•
Each entry contains the
index of the node it
“points to”
•
Uses O(log m) space
instead of the size of the
pointer (e.g. 32 or 64 bits)
•
General trick: you ensure
nodes are contained in
some region of memory
of size M.
A
C
D
E
$
Saving Space #2
•
Replace
•
with
•
at each node
A
A
C
D
E
$
D
$
NULL
Array at each node becomes a linked list.
Saves space when the branching factor is low
don’t need to store an entry for each character in
the alphabet)
Also imagine a hybrid method, using arrays at
nodes with high branching factors
de la Brandais tree
Saving Space #3
A lot of nodes were non-discriminatory: they didn’t
discriminate between two keys:
A C D E $
A C D E $
A C D E $
A C D E $
A C D E $
A C D E $
A C D E $
A C D E $
A C D E $
A C D E $
A C D E $
A C D E $
A C D E $
A C D E $
dad$
A C D E $
A C D E $
A C D E $
add$
ace$
deed$
A C D E $
cede$
added$
Idea: only store the discriminatory nodes
Practical Algorithm to
Retrieve Information Coded
in Alphanumeric
Saving Space #3 – Patricia Tries
Same tree, but only storing the discriminatory nodes.
BUT: now have to store the index of the character
position the node is testing
before, a node at depth d tested position d, but now
that isn’t true: we can skip over positions)
1
2
cede$
A C D E $
ace$
4
A C D E $
2
dad$
A C D E $
deed$
A C D E $
added$
add$
Also: now you must CHECK whether
a leaf you reach matches the query.
E.g. what if we searched for cedar.
Suffix Tree Variant
1
abbracadabbra$
suffix i is the part of the
string from i to the end
2
bbracadabbra$
3
bracadabbra$
4
racadabbra$
5
acadabbra$
suffix identifier = the
prefix of the suffix that
is unique within the
string
cadabbra$
adabbra$
Every suffix has a
identifier because of the
$ character
dabbra$
Suffix tree variant: store
the n+1 suffix identifiers
in a trie
bra$
ra$
a$
abbra$
bbra$
abbracadabbra$
a
b
c
b
b
d
a
$
5 7 13
r
b
r
c
a
a
c
1
2
$
9
r
$
a
c
6
4
14
$
12
3 11
$
c
$
d
8
d
a
r
c
10
Store the suffix number
(start position of the
suffix) at each leaf
Searching
abbracadabbra$
a
b
c
b
b
d
a
$
5 7 13
r
b
r
c
a
r
$
a
c
6
4
14
$
12
3 11
$
c
$
d
8
d
a
r
c
a
c
1
2
$
9
10
Search for abbr:
follow the path for abbr
leaves under end node
correspond to the locations of
abbr.
Space usage:
•
•
Use Patricia trees to store the suffix tree
•
Every internal node is at least a binary split.
•
Therefore, # number of internal nodes is about equal
to the number of leaves
•
Hence, linear space.
Then: # leaves = O(n) [one leaf for each position in
the string]
Topics for Final
•
Pre-midterm:
C++, Stacks, queues, decks, graphs, BFS, DFS, sparse matrices, selforganizing lists, trees, threaded trees, tree traversals, bitvectors for sets,
full binary tree thm, array implementation of trees, general trees as
binary trees
B-trees, AVL-trees, Splay trees, heaps (leftist, skew), skip lists
-
•
Post-midterm:
-
PR, MX, Point quadtrees
kd-trees, range searching
incremental nearest neighbor
2-d range trees, “1-d range trees”, fractional cascading
Interval trees
Segment trees
Binary Space Partitions (painters algorithm, BSP trees)
Doubly-connected edge lists
Grid files
tries, patricia trees, de la Brandais tres, O(log m) pointer space, suffix
trees