Tries - CS @ Purdue

Tries
e
mize
i
nimize
4/18/2016 6:48 AM
mi
ze
nimize
Tries
nimize
ze
ze
1
Outline
Standard tries
Compressed tries
Suffix tries
Huffman encoding tries
4/18/2016 6:48 AM
Tries
2
Where does “trie” come from?
From the word retrieval
Introduced in the 1960’s
4/18/2016 6:48 AM
Tries
3
Preprocessing Strings
Preprocessing the pattern speeds up pattern matching
queries

After preprocessing the pattern, KMP’s algorithm performs
pattern matching in time proportional to the text size
If the text is large, immutable and searched for often
(e.g., works by Shakespeare), we may want to
preprocess the text instead of the pattern

Thus do better than O(n+m) for text of size n and pattern of
size m
A trie is a compact data structure for representing a
set of strings, such as all the words in a text

A tries supports pattern matching queries in time
proportional to the pattern size (~O(m))!
4/18/2016 6:48 AM
Tries
4
Standard Trie
The standard trie for a set of strings S is an ordered tree such that:



Each node but the root is labeled with a character
The children of a node are alphabetically ordered
The paths from the external nodes to the root yield the strings of S
Example: standard trie for the set of strings
S = { bear, bell, bid, bull, buy, sell, stock, stop }
b
e
i
a
l
r
l
4/18/2016 6:48 AM
s
d
u
l
y
l
e
t
l
o
l
Tries
c
k
p
5
Standard Trie
What space does the trie use?
What is the maximum height of the tree?
b
e
i
a
l
r
l
4/18/2016 6:48 AM
s
d
u
l
y
l
e
t
l
o
l
Tries
c
k
p
6
Standard Trie
A standard trie uses O(n) space and supports
searches, insertions and deletions in time O(dm),
where:
n total size of the strings in S
m size of the (maximum) string parameter of the operation
d size of the alphabet
b
e
i
a
l
r
l
4/18/2016 6:48 AM
s
d
u
l
y
l
e
t
l
o
l
Tries
c
k
p
7
Standard Trie
When is n, and thus space, maximum?

When S consists of mutually unique words with no letters in
common
What type of word(s) produces the largest search time?



Short word? Long word?
Answer: long words, especially those whose prefix is very common
 will do improvements later…
b
e
i
a
l
r
l
4/18/2016 6:48 AM
s
d
u
l
y
l
e
t
l
o
l
Tries
c
k
p
8
Word Matching with a Trie
We insert the
words of the
text into a
trie
Each leaf
stores the
occurrences
of the
associated
word in the
text e
a
l
r
l
6
78
4/18/2016 6:48 AM
s e e
a
b e a r ?
s e l
l
s t o c k !
0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23
s e e
a
b u l
l ?
b u y
s t o c k !
24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46
b i d
s t o c k !
b i d
s t o c k !
47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68
h e a r
t h e
b e l
l ?
s t o p !
69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88
b
h
i
d
47, 58
u
l
s
e
y
36
a
l
r
30
69
Tries
e
e
t
l
0, 24
l
12
o
c
k
17, 40,
51, 62
p
84
9
Standard Trie Construction
How do you build it?
b
e
i
a
l
r
l
4/18/2016 6:48 AM
s
d
u
l
y
l
e
t
l
o
l
Tries
c
k
p
10
Standard Trie Construction
Assuming the input strings are words in the
English language, how many children does
the root node have?
b
e
i
a
l
r
l
4/18/2016 6:48 AM
s
d
u
l
y
l
e
t
l
o
l
Tries
c
k
p
11
Standard Trie Construction
The number of children of the root node
equals to the maximum number of distinct
first letters all the words in the input string

2 in this example, maximum of 26 in English
b
e
i
a
l
r
l
4/18/2016 6:48 AM
s
d
u
l
y
l
e
t
l
o
l
Tries
c
k
p
12
Standard Trie Construction
What does the tree for “bid” look like?
b
i
d
4/18/2016 6:48 AM
Tries
13
Standard Trie Construction
What does the tree for “bid” and “sell” look
like?
b
s
i
e
d
l
l
4/18/2016 6:48 AM
Tries
14
Standard Trie Construction
What is the tree after adding “buy” and
“stop”?
b
s
i
e
d
l
l
4/18/2016 6:48 AM
Tries
15
Standard Trie Construction
What is the tree after adding “buy” and
“stop”?
4/18/2016 6:48 AM
b
s
u
i
e
t
y
d
l
o
l
p
Tries
16
Standard Trie Construction
What is the tree after adding “stock”?
4/18/2016 6:48 AM
b
s
u
i
e
t
y
d
l
o
l
p
Tries
17
Standard Trie Construction
What is the tree after adding “stock”?
b
s
u
i
e
t
y
d
l
o
l
c
p
k
4/18/2016 6:48 AM
Tries
18
Standard Trie Construction
After “bear, bell, bid, bull, buy, sell, stock,
stop”…
b
e
s
i
a
l
r
l
d
u
l
y
l
e
t
l
o
l
c
p
k
4/18/2016 6:48 AM
Tries
19
Improvements
What comes to mind for tries?

e.g., is this entire tree really necessary?
4/18/2016 6:48 AM
b
s
u
i
e
t
y
d
l
o
l
c
Tries
k
p
20
Improvements
Two types of compression

Compress internal single-children node sequences
 Also called “PATRICIA Tries” – Why?
 = Practical AlgoriThm to Retrieve Information Coded In
Alphanumeric (also called a “radix tree”)

Compress external single-children leaf-node sequences
4/18/2016 6:48 AM
b
s
u
i
e
t
y
d
l
o
l
c
Tries
k
p
21
Compressed Trie
A compressed trie has
internal nodes of degree
at least two
It is obtained from
standard trie by
compressing chains of
“redundant” nodes
b
e
ar
id
ll
l
r
l
4/18/2016 6:48 AM
d
ell
y
to
ck
p
s
i
a
u
ll
b
e
s
u
l
y
l
e
t
l
o
l
Tries
c
k
p
22
Compact Representation
Compact representation of a compressed trie for an array of strings:


Stores at the nodes ranges of indices instead of substrings
Serves as an auxiliary index structure
0 1 2 3 4
0 1 2 3
S[4] =
S[1] =
s e e
b e a r
S[2] =
s e l l
S[3] =
s t o c k
S[0] =
S[7] =
S[5] =
b u l l
b u y
S[8] =
h e a r
b e l l
S[6] =
b i d
S[9] =
s t o p
1, 0, 0
1, 1, 1
1, 2, 3
4/18/2016 6:48 AM
0, 0, 0
7, 0, 3
4, 1, 1
6, 1, 2
8, 2, 3
0 1 2 3
4, 2, 3
0, 1, 1
5, 2, 2
Tries
0, 2, 2
3, 1, 2
2, 2, 3
3, 3, 4
9, 3, 3
23
More Improvements
We can build a standard trie and then
compress it
But, can we build some sort of
compressed trie directly?
Ideas?
4/18/2016 6:48 AM
Tries
24
Suffix Trie
The suffix trie of a string X is the compressed trie of all the
suffixes of X
m i n i m i z e
0 1 2 3 4 5 6 7
e
mize
i
nimize
4/18/2016 6:48 AM
mi
ze
nimize
Tries
nimize
ze
ze
25
Suffix Trie
Compact representation of the suffix trie for a string X of size n from
an alphabet of size d



Uses O(n) space
Supports arbitrary pattern matching queries in X in O(dm) time, where m is
the size of the pattern
Repetitive words not stored repetitively
m i n i m i z e
0 1 2 3 4 5 6 7
7, 7
4, 7
4/18/2016 6:48 AM
1, 1
2, 7
0, 1
6, 7
2, 7
Tries
2, 7
6, 7
6, 7
26
More Improvements
There is still some repetition in this tree

e.g., “mize” appears several times
How can we further compress the trie, thus
reducing space and improving query time?
Ideas?
e
mize
4/18/2016 6:48 AM
i
nimize
mi
ze
nimize
Tries
nimize
ze
ze
27
Encoding Trie
A code is a mapping of each character of an alphabet to a binary
code-word
A prefix code is a binary code such that no code-word is the prefix
of another code-word
An encoding trie represents a prefix code


Each leaf stores a character
The code word of a character is given by the path from the root to
the leaf storing the character (0 for a left child and 1 for a right child)
00
010
011
10
11
a
b
c
d
e
4/18/2016 6:48 AM
a
Tries
d
b
c
e
28
Encoding Trie
Given a text string X, we want to find a prefix code for the
characters of X that yields a small encoding for X



Frequent characters should have short code-words
Rare characters should have long code-words
Why?
Example


T1
X = abracadabra
T1 encodes X into 29 bits
 29=3+2+3+3+2+3+2+3+2+3+3

c
T2 encodes X into how many bits?
a
 24=2+2+2+2+3+2+3+2+2+2+2
b
b
r
r
T2
How can we build a good
encoding trie?
a
4/18/2016 6:48 AM
d
Tries
c
d
29
Huffman’s Algorithm
Given a string X,
Huffman’s algorithm
constructs a prefix
code that minimizes
the size of the
encoding of X
It runs in time
O(n + d log d), where
n is the size of X
and d is the number
of distinct characters
of X
A heap-based
priority queue is
used as an auxiliary
structure
4/18/2016 6:48 AM
Algorithm HuffmanEncoding(X)
Input string X of size n
Output optimal encoding trie for X
C  distinctCharacters(X)
computeFrequencies(C, X)
Q  new empty heap
for all c  C
T  new single-node tree storing c
Q.insert(getFrequency(c), T)
while Q.size() > 1
f1  Q.minKey()
T1  Q.removeMin()
f2  Q.minKey()
T2  Q.removeMin()
T  join(T1, T2)
Q.insert(f1 + f2, T)
return Q.removeMin()
Tries
30
Example
11
a
5
2
a
b
c
d
r
5
2
1
1
2
b
2
c
1
d
1
6
a
X = abracadabra
Frequencies
c
b
2
c
4/18/2016 6:48 AM
d
b
2
r
2
a
5
c
2
d
r
2
r
6
2
a
5
4
a
5
Tries
c
4
d
b
r
4
d
b
r
31
Summary of Pattern Matching
Algorithm
Search Time
Brute force
O(nm)
Boyer-Moore
O(nm+s)
KMP
O(n+m)
Standard Trie
O(dm)
O(n) preprocessing, d = size of alphabet
fast
Suffix Trie
O(dm)
O(n) preprocessing
faster in practice because “compressed”
O(dm)
O(n+dlogd) preprocessing
fastest and smallest in practice
leads to lossless compression: ZIP
Huffman-Encoding
Trie
4/18/2016 6:48 AM
Notes
simple, no preprocessing
slow (good for small inputs)
O(m) preprocessing
significantly faster than previous in practice
O(m+s) preprocessing
more complex, but ideal very fast
Tries
32