Using Tries to Eliminate Pattern
Collisions in Perfect Hashing
Marshall D. Brain and Alan L. Tharp
Abstract-Many current perfect hashing algorithms suffer from
the problem of pattern collisions. In this paper, a perfect hashing
technique that uses array-based tries and a simple sparse matrix
packing algorithm is introduced. This technique eliminates all
pattern collisions, and because of this it can be used to form
ordered minimal perfect hash functions on extremely large word
lists.
This algorithm is superior to other known perfect hashing
functions for large word lists in terms of function building
efficiency, pattern collision avoidance, and retrieval function
complexity. It has been successfully used to form an ordered
minimal perfect hashing function for the entire 24,481-element
Unix word list without resorting to segmentation. The item lists
addressed by the perfect hashing function formed can be ordered
in any manner, including alphabetically, to easily allow other
forms of access to the same list.
Index Terms-Perfect hashing, minimal perfect hashing, hashing, tries, sparse array packing.
I. INTRODUCTION
HASHING is a technique used to build modifiable data
structures that maintain efficient retrieval times for all
items in the structure. As items are added to the structure,
the attempt is made to maintain O(1) access time at retrieval.
This efficiency is achieved by placing each unique item that
is added at a unique address in the structure. A number
of algorithms have been suggested, but no current hashing
algorithm can maintain O(1) retrieval times over the lifetime
of the data structure: Eventually multiple items collect at
identical addresses, and these collections must be handled
in some way. All current algorithms used to resolve these
collections have a retrieval penalty, which causes the retrieval
efficiency to rise above the ideal O(1) efficiency.
A number of perfect hashing algorithms have been created
recently that do allow the O(1) retrieval of all items in the
data structure as long as the items are static. These algorithms
all start with an original and static list of items and use various
techniques to build a hash function that is specific to that set
of items. A perfect hashing function may perform multiple
computations or table look-ups to produce the unique retrieval
address, but each item in the list can be retrieved with O(1)
efficiency.
Perfect hashing functions were used as early as 1977 to
allow O(1) retrieval of items such as month names [23].
Manuscript received October 26, 1990, revised September 30, 1991, and
March 16, 1992.
The authors are with the Computer Science Department, North Carolina
State University, Raleigh, NC, 27695-8206 USA.
IEEE Log Number 9212669.
Algorithms such as Cichelli's algorithm [6], which is capable
of handling word lists of up to about 50 words, have been
used for tasks such as the O(1) retrieval of Pascal reserved
words within a Pascal compiler.
Interest in perfect hashing algorithms has risen recently with
the advent of extremely large static databases, particularly
those held on CD-ROM’s. CD-ROM’s are a nonwriteable
medium, which guarantees static data. CD-ROM’s also have
slow track seek times, which makes multiple accesses to the
disk extremely frustrating for the user of the CD-ROM. Perfect
hashing functions have been used in this environment to
provide quick access to large medical dictionaries [9] and other
word lists. The perfect hashing function allows any word to be
accessed from the disk with only one disk access operation.
Unfortunately, many current perfect hashing algorithms are
severely limited in the number of words that can be processed.
Future application of perfect hashing functions to large
static databases will continue. On-line periodical indexes and
card catalogs, especially those committed to CD-ROM, will
continue to demand the high-efficiency retrieval capabilities
offered by perfect hashing. Perfect hashing algorithms must
expand to handle immense item lists of this type.
A new perfect hashing technique based on compressed tries
has been developed. It can be used to form ordered minimal
perfect hashing functions for extremely large word lists. The
new technique is unique in its speed and efficiency, as well as
in its immunity to pattern collisions.
II. AVAILABLE PERFECT HASHING ALGORITHMS
Sprugnoli [23] introduced the first perfect hashing algorithm. Two techniques-the quotient reduction method and the
remainder reduction method-were developed to allow perfect
hashing of relatively small sets of words (10-12 words). These
methods both rely on the algorithm's ability to create a simple
formula of the form H = (W - C1)/C2 (quotient reduction
method) or H = FLOOR(((C1 + W * C2) mod C3)/C4)
(remainder reduction method), where H is the hash value,
W is the word to be hashed, and C1, C2, C3, and C4
are constants calculated by the algorithm for the word set
in question. For larger sets of words, Sprugnoli suggests a
segmentation process to break the list into smaller subsets.
Sprugnoli’s methods are unique in the fact that they do not
require auxiliary tables of information at retrieval time-only
the four constants must be stored.
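As an illustration of the remainder reduction form only (not of Sprugnoli's actual search procedure), the following C sketch evaluates H = FLOOR(((C1 + W * C2) mod C3)/C4) for a few numeric keys. The constants C1-C4 and the keys below are invented for the example; Sprugnoli's algorithm would instead search for constants that give every key in the small static set a distinct address.

#include <stdio.h>

/* Remainder reduction hash of the form H = FLOOR(((C1 + W*C2) mod C3)/C4).
 * The constants are hypothetical; a real build step would search for values
 * that give each key in the static set a unique address. */
#define C1 7
#define C2 3
#define C3 31
#define C4 2

static long hash(long w)
{
    return ((C1 + w * C2) % C3) / C4;   /* integer division supplies the FLOOR */
}

int main(void)
{
    long keys[] = { 12, 19, 23, 41 };   /* illustrative numeric keys */
    for (int i = 0; i < 4; i++)
        printf("W = %ld -> H = %ld\n", keys[i], hash(keys[i]));
    /* For these four keys the addresses happen to be distinct (6, 1, 7, 3). */
    return 0;
}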
Jaeschke [14] created a technique called reciprocal hashing.
This technique is similar to Sprugnoli’s in that the perfect
hashing functions are relatively simple, but like Sprugnoli’s
TABLE I
BUILD ORDER OF DIFFERENT ALGORITHMS

Algorithm   Reference   Build Order
Cichelli    [6]         O(c^r)
Karplus     [15]        O(r^1.5)*
Sager       [22]        O(r^6)
Fox         [9]         O(r^3)

*Derived from experimental results rather than theoretical analysis.
algorithm, Jaeschke's algorithm cannot handle more than about
20 words without segmentation.
Cichelli [6] developed an extremely straightforward minimal perfect hashing algorithm that can be used on word lists
up to about 50 words in size. Cichelli’s method makes use of
a table of 26 integers that is used at retrieval time. The perfect
hashing algorithm uses a brute-force technique to find a set
of values for the table that allows each word to be hashed
to a unique address. For larger word sets Cichelli suggests
segmentation. Cichelli's algorithm is extremely sensitive to
pattern collisions. We introduce the term pattern collisions to
describe the situation that occurs when two words cannot be
handled simultaneously by a perfect hashing function. If the
technique used by the perfect hashing function to determine the
address for a word causes two words to be indistinguishable, a
pattern collision has occurred. For example, a perfect hashing
function formed by Cichelli’s algorithm uses the first and
last characters of the word being hashed, along with the
length of the word, to decide on the hash address. Because
of this, any two words with the same first and last character
and length cannot be distinguished, and therefore cannot be
handled simultaneously by Cichelli’s algorithm (e.g., “dig”
and “dug”).
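The following C sketch, using an illustrative (all-zero) value table g, shows the Cichelli-style computation and why "dig" and "dug" form a pattern collision: because the hash sees only the first letter, last letter, and length, the two words receive the same address no matter what values the build algorithm places in g.

#include <stdio.h>
#include <string.h>

/* Cichelli-style hash: h(w) = length(w) + g[first letter] + g[last letter].
 * The table g is hypothetical (all zero here); Cichelli's algorithm searches
 * for g values that make every address in the static word set unique. */
static int g[26];   /* one value per lowercase letter */

static int cichelli_hash(const char *w)
{
    size_t len = strlen(w);
    return (int)len + g[w[0] - 'a'] + g[w[len - 1] - 'a'];
}

int main(void)
{
    /* "dig" and "dug" share first letter, last letter, and length, so no
     * choice of g can separate them: a pattern collision. */
    printf("dig -> %d\n", cichelli_hash("dig"));
    printf("dug -> %d\n", cichelli_hash("dug"));
    return 0;
}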
Karplus and Haggard [16] generalized Cichelli's algorithm
using graph theory methods to hash word sets of up to almost
700 words. Karplus also notes experimental results that seem
to indicate an extremely good efficiency for the algorithm (see
Table I). However, the algorithm has not been successfully
applied to word lists greater than 700 words, and the words
in the sample lists [15] seem to be carefully chosen. For
example, one list contains the words "Tannarive," "thugee,"
and “thyroglobulin,” but omits more common words such as
“the,” “there,” and “through.”
Chang [7] built an ordered minimal perfect hashing algorithm using the Chinese remainder theorem. Ordered functions
allow items in the static list to be arranged in any order (e.g.,
alphabetically) as the perfect hashing function is built. Chang’s
algorithm is unique in this respect.
Du, Hsieh, Jea, and Shieh [8] created a perfect hashing
function based on segmentation and random hashing functions.
Word lists of up to 300 words were successfully hashed, though
no execution times are revealed.
Cercone, Krause, and Boates [5] started with Cichelli’s
algorithm and created three different perfect hashing algorithms. One forms segments based on word length and then
applies Cichelli's algorithm to each segment. The results
for the subsets are then combined into a single table. The
remaining two algorithms are more esoteric. The best of the
three algorithms is able to handle up to 500 words, but the
results are not minimal (one-third of the final table is unused).
Sager [22], also trying to improve on Cichelli’s algorithm,
used a graph theoretic approach to create minimal perfect
hashing functions of sets of words as large as 512. Sager’s
work was later improved by Fox, Chen, Heath, and Datta [9] to
form minimal perfect hashing functions of lists containing up
to 1000 words. Both of these algorithms expand on Cichelli’s
table size considerably to handle large word sets. To increase
the number of words hashed, Fox suggests segmentation of
the word list.
Larson and Ramakrishna [21], [19] created a perfect hashing
algorithm for external files. The technique uses a segmentation
strategy to divide a large file into smaller groups. Each group
is then perfectly hashed using standard techniques.
Fox, Chen, Heath, and Daoud [10] have demonstrated a
perfect hashing algorithm capable of handling large word lists.
The algorithm is O(n) at build time and uses very little space
to store the perfect hashing tables needed at retrieval time.
However, this algorithm does not produce an ordered perfect
hashing function, and it is also fairly expensive at retrieval
time. See subsequent sections for discussions of these issues.
Further discussion of perfect hashing functions can be found
in [25] and [20]. Berman [2] presents theoretical analysis
on the space tradeoffs involved in creating perfect hashing
functions.
III. PERFECT HASHING ISSUES
Perfect hashing algorithms are used to build perfect hashing
functions. The perfect hashing function normally consists of a
set of values (often stored in a table or array) that are calculated
by the perfect hashing algorithm. The formation of the perfect
hashing function by the perfect hashing algorithm occurs once
for a given, static item list. The perfect hashing function is
then used for each retrieval of all items from the item list.
Several issues have become prominent in perfect hashing
research. These issues include the following.
1) The ability of the perfect hashing algorithm to form
minimal perfect hashing functions.
2) The ability of the algorithm to form ordered perfect
hashing functions.
3) The memory requirements of the perfect hashing function at retrieval time.
4) The amount of CPU time required by the perfect hashing
algorithm while the perfect hashing function is being
built.
5) The amount of CPU time required by the perfect hashing
function for each retrieval.
6) The sensitivity of the perfect hashing algorithm and
function to pattern collisions.
A hash function’s ability to form “minimal” item lists is
extremely important because of space considerations. In a
minimal list, the items addressable by the perfect hashing
function are arranged contiguously. Minimal lists save considerable storage space if the items addressed by the perfect
hashing function are large records of information. Most of the
algorithms discussed previously produce minimal functions,
but most also allow nonminimal functions to be formed as
well. In general, less time is required to build nonminimal
perfect hashing functions.
The order of the item list is sometimes important. For
example, the item list must be accessible by the perfect hashing
function but may also need to be accessible alphabetically or
sequentially. Perfect hashing algorithms that allow arbitrary
and specific orderings (e.g., alphabetical) of the item list once
the perfect hashing function is formed are called ordered
functions. Chang’s algorithm [7] creates an ordered perfect
hashing function, while most others do not.
Another issue important to perfect hashing functions is the
amount of memory required by the function at retrieval time.
All perfect hashing functions require the use of a precalculated
set of values to handle retrieval. The values are determined
when the perfect hashing function is built. For example,
Sprugnoli [23] calculates four constants (C1, C2, C3, and C4)
that are used at retrieval. Cichelli [6] calculates a table of 26
values that are accessed for retrieval. In Cichelli’s algorithm,
the first letter and last letter of the word to be retrieved are
used as indexes into the table. The two values found in the
table are summed together with the length of the word to form
the hash address. The purpose of Cichelli’s algorithm is to find
values for the table that cause no address collisions for all of
the words in the static item list. The amount of space required
for the table tends to grow with the number of items hashed.
For example, Sager’s algorithm [22] requires 4 bytes of storage
space in the table for each item in the static list of items. This
table is built by the perfect hashing algorithm once, and then
stored in memory and used for each retrieval performed by
the perfect hashing function.
Each perfect hashing algorithm has two completely different
time efficiency aspects that must be monitored. One of these
aspects is the retrieval efficiency. Because a perfect hashing
function maps a key to a unique address, it has O(1) retrieval
efficiency by definition. The data are stored on auxiliary
memory, but the hashing function is processed in primary
memory. Because the access time for auxiliary memory is
several orders of magnitude greater than the access time for
primary memory or the execution time for an instruction,
the operations performed by the perfect hashing function are
insignificant compared with an auxiliary memory access. In a
perfect hashing function, each item in the list can be retrieved
with O(1) accesses of auxiliary memory.
The second time efficiency aspect involves actually building
the perfect hashing function. For any algorithm, this is the
one-time cost of constructing a perfect hashing function for
a given set of items. For example, Cichelli's algorithm has
an O(c^r) build time (where c is some constant and r is
the number of words to be hashed), while Sager's minicycle
algorithm has an O(r^6) build time (where r is the number of
words hashed). Table I shows the build order of some of the
algorithms mentioned.
Many of the algorithms described above are sensitive to
pattern collisions to a certain degree. As discussed in an earlier
example (and in [3]), any two words with the same first and last
character and length (e.g., "dig" and "dug") cannot be handled
simultaneously by Cichelli's algorithm. Pattern collisions are
normally handled by segmenting the original word list into a
set of sublists, using a hashing function to divide the words
between the different segments. The hashing function used
to form the segments is chosen so that words with pattern
collisions are placed in different segments. This segmentation
is accompanied by a performance penalty.
Finally, the complexity of the functions used for retrieval
is important. Although all perfect hashing functions have
an O(1) (constant) retrieval time, the amount of CPU time
required for each retrieval should be minimized. Equations
(1)-(6) show some of the different functions used for retrieval
in perfect hashing functions. Cichelli’s algorithm has an extremely simple and efficient retrieval function, as shown in (2).
Sager uses somewhat more complicated functions, as shown
in (3)-(6). (Note that this set of equations comes from [22]
and is required for large word lists. For smaller lists, simpler
equation sets are possible. The listed equation set, or one of
similar complexity, is necessary to eliminate pattern collisions
in large word sets because many characters must be examined
to avoid the collisions.) As pointed out by Sager [22], Chang
[7] uses retrieval functions that are extremely time consuming,
making them impractical.
Sprugnoli:

    H = FLOOR(((C1 + W * C2) mod C3)/C4)                            (1)

Cichelli:

    H = LENGTH(W) + G[FIRST-CHAR-OF-WORD]
        + G[LAST-CHAR-OF-WORD]                                       (2)

Sager:

    H0 = LENGTH(W) + SUM(ORD(W[I])), I = 1 TO LENGTH(W) BY 3        (3)
    H1 = (SUM(ORD(W[I])), I = 1 TO LENGTH(W) BY 2) MOD R            (4)
    H2 = (SUM(ORD(W[I])), I = 2 TO LENGTH(W) BY 2) MOD R            (5)
    H  = (H0 + G[H1] + G[H2]) MOD N                                  (6)
In the equations shown above, W represents the word being
retrieved. LENGTH(W) is the number of characters in the word.
W[I] is the Ith character of the word. G represents a table
of values calculated by the given perfect hashing algorithm,
indexed using the value shown. C1, C2, C3, C4, and R are
constants. H represents the hash address calculated by the
perfect hashing function.
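For comparison, the following C sketch evaluates the Sager-style functions (3)-(6) for a single word. The table G and the constants R and N are placeholders chosen only so that the example runs; Sager's algorithm computes G so that H is unique for every word in the static list. The point of the sketch is simply how many characters and operations a retrieval of this kind must touch.

#include <stdio.h>
#include <string.h>

/* Illustrative evaluation of the retrieval computations in (3)-(6).
 * R, N, and the table G are hypothetical placeholders; the build
 * algorithm would compute G so that H is unique for every word. */
#define R 64
#define N 1000

static int G[R];   /* values would be chosen by the build algorithm */

static int sager_hash(const char *w)
{
    int len = (int)strlen(w);
    int h0 = len, h1 = 0, h2 = 0;
    for (int i = 1; i <= len; i += 3) h0 += (unsigned char)w[i - 1];   /* (3) */
    for (int i = 1; i <= len; i += 2) h1 += (unsigned char)w[i - 1];   /* (4) */
    for (int i = 2; i <= len; i += 2) h2 += (unsigned char)w[i - 1];   /* (5) */
    h1 %= R;
    h2 %= R;
    return (h0 + G[h1] + G[h2]) % N;                                   /* (6) */
}

int main(void)
{
    printf("CORN -> %d\n", sager_hash("CORN"));
    return 0;
}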
IV. A SIMPLE PERFECT HASHING FUNCTION
A typical application of a perfect hashing function is the
O(1) retrieval of Pascal reserved words by a Pascal compiler.
The Pascal reserved words are static and are known in advance,
and this knowledge is used to build a specific perfect hashing
function for the reserved word set. When a word is to
be retrieved, the perfect hashing function is applied to the
word. The use of perfect hashing functions can make token
recognition for a compiler efficient.
Fig. 1. A perfect hashing function using a 2-D array.

TABLE II
THE PROBABILITY OF PATTERN COLLISIONS BASED ON
THE NUMBER OF WORDS IN FIG. 1'S 2-D ARRAY

Number of Words   Probability of Collision
 2                 0.15%
 3                 0.45%
10                 7%
20                26%
30                50%
40                71%
50                85%
60                93%
70                97%

Many of the techniques described above can be used to
build perfect hashing functions for this application. These
algorithms cover a wide range of complexities, though the
simpler algorithms often have significant disadvantages. In
this section a simple algorithm will be described and used
to demonstrate perfect hashing, and also to demonstrate the
six issues described above.

An extremely simple perfect hashing function for Pascal
reserved words could be implemented using a 2-D array of
integers, as shown in Fig. 1. The first and last characters
are used as indexes into the 2-D array, and the integer at
the indexed location points to the word in the word list
(normally the word list would contain information needed by
the compiler to process/compile the given reserved word).
In Fig. 1, a list of seven Pascal reserved words has been
hashed. To look up the word AND using this arrangement,
the first letter, A, is used as the row index into the array, while
the last letter, D, is used as the column index. At the location
addressed by A and D is the value 1, which is used to access
the record referenced by the word AND in a word list. As
can be seen, the retrieval time for any of the words in the
word list is constant, regardless of the number of words in the
array, giving this perfect hashing function an O(1) retrieval
time. When using this algorithm on a very large word list,
the array would be stored in primary memory and the word
records would be stored on disk, allowing the word records to
be accessed with a single disk probe.

The job of the perfect hashing algorithm is to build the
2-D array (a one-time operation). The perfect hashing function
(used many times) would use this 2-D array along with the
function shown in (7).

    Hash-address = array[first-letter, last-letter]    (7)

In (7), first-letter and last-letter are the first and last letters of
the word being hashed. This is an extremely efficient retrieval
function compared with other available algorithms.

This perfect hashing algorithm forms ordered minimal
perfect hashing functions. The word list contains no gaps between
the words in the list; they are all contiguous. The word list
shown in Fig. 1 can also be ordered in any way.

This particular perfect hashing algorithm would also have an
O(n) build time. When the perfect hashing function is being
built, a single value is placed in the 2-D array for each word
in the word list.

The memory usage for this perfect hashing function is very
inefficient because many entries in the 2-D array are empty. In
the example shown in Fig. 1, 669 of the 676 entries in the 2-D
array are unused (about 1% memory efficiency). For the seven
words handled by this perfect hashing function, 194 bytes of
memory are required per word in the list. As more words are
added to the list, the memory efficiency improves, however.

This algorithm is prone to pattern collisions. For example,
the word CUT cannot be added to the list because it collides
with CONST. Table II shows the probability of pattern collisions
based on the number of words in the word list. Knuth
[17] calls this phenomenon the birthday paradox [23]. For
example, in Fig. 1 the array contains 676 positions. If two
random words are to be placed into the array, the probability of
them not colliding at the same location is 675/676 = 99.85%.
Therefore, the probability of collision is 0.15%. If three
random words are placed in the array, then the probability of
no collision decreases to (674/676) * (675/676) = 99.55%.
Therefore the probability of collision in the array containing
three words is 0.45%.

From this example, it can be seen that the perfect hashing
algorithm described has certain advantages and disadvantages.
The advantages include the ability of the algorithm to form
ordered minimal perfect hashing functions in O(n) time. The
perfect hashing functions formed are also extremely efficient.
The disadvantages include extreme memory inefficiency and
a high sensitivity to pattern collisions. The memory efficiency
disadvantage would tend to rule out the perfect hashing
function formed by this algorithm.
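A minimal C sketch of this 2-D array scheme appears below. The word list used here is illustrative rather than the exact list of Fig. 1; the build loop performs one assignment per word, the lookup implements (7), and the final lines show the CUT/CONST pattern collision described above.

#include <stdio.h>
#include <string.h>
#include <ctype.h>

/* Sketch of the 2-D array scheme of Fig. 1 and (7): the first and last
 * letters of a word index a 26 x 26 array whose entry points into the
 * word list.  The word list below is illustrative. */
static const char *words[] = { "", "AND", "BEGIN", "CHAR", "CONST", "CASE", "END", "FOR" };
static int table[26][26];   /* 0 means empty */

static int row(const char *w) { return toupper((unsigned char)w[0]) - 'A'; }
static int col(const char *w) { return toupper((unsigned char)w[strlen(w) - 1]) - 'A'; }

/* Build step: one assignment per word, so the build is O(n).
 * Returns 0 on a pattern collision (two words map to the same cell). */
static int insert(int index)
{
    const char *w = words[index];
    if (table[row(w)][col(w)] != 0) return 0;
    table[row(w)][col(w)] = index;
    return 1;
}

static int lookup(const char *w) { return table[row(w)][col(w)]; }

int main(void)
{
    for (int i = 1; i <= 7; i++)
        if (!insert(i)) printf("could not place %s\n", words[i]);

    printf("AND is word %d\n", lookup("AND"));
    /* CUT collides with CONST: both begin with C and end with T,
     * the pattern collision described in the text. */
    printf("CUT hashes to word %d (%s)\n", lookup("CUT"), words[lookup("CUT")]);
    return 0;
}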
V. THE TRIE DATA STRUCTURE

In any set of words to be perfectly hashed, no two words
have exactly the same characters in all positions. This fact is
used in the trie data structure [25], [11] to allow word lists of
any size to be accessed.
To retrieve a word from a trie data structure, each letter in
the word (starting with the first) is examined until the word
can be uniquely differentiated from others in the list of words
known to the trie. For example, if the word DOG is known
to the trie, and if DOG is the only word in the list that begins
with a D, then when trying to retrieve the word DOG using
the trie, only the first letter must be examined.
Fig. 2. An array-based trie of the word list shown (1 CAT, 2 COP, 3 CORN,
4 CORK, 5 DOG, 6 FLAG, 7 FROG). Negative values point to the next column,
while positive values point to the word list. EOW stands for "End of Word,"
which is used to differentiate, for example, DOG from DOGS.

Fig. 3. Pseudocode of the compression algorithm as applied to tries:
- Build the 2-D trie array.
- Sort the columns of the 2-D array. The columns with the most entries
  should be at the beginning of the sorted list, and other columns should
  fill the list in decreasing order.
- Loop through all columns containing 1 or more values.
  - Attempt to place the current column at location one of the 1-D array.
  - While the new column's values collide with values already in the 1-D
    array, move the column down one position in the 1-D array and check
    for collisions again.
  - Once a non-colliding position for the new column is found, place the
    column in the 1-D array and update the CLT.
- Store the final CLT formed.

Fig. 2 shows an array-based trie for a word list containing
the words CAT, COP, CORN, CORK, DOG, FLAG, and
FROG. To access the word DOG from the trie, the first letter
of the word (D) is used as an index into the first column of
the array (using the first letter of the word to be retrieved to
index the first column of the trie is the starting point for any
retrieval). The value 5 found at location D of the first column
indicates that the word DOG can be found in location 5 of the
word list. Because DOG is the only word in the list starting
with D , the first letter is sufficient to differentiate it from all
other words in the list.
To access the word FLAG from the trie, the first letter of
the word is used as an index into the first column of the array.
The value -5 is found. The negative sign is a flag used to
indicate that the first letter is not sufficient to differentiate the
word, and that the second letter should be used as an index into
another column, in this case column 5, of the array. At location
L in column 5, the value 6 is found. This is the location of
the word FLAG in the word list.
Words that are prefixes to other words (e.g., cat is a prefix of
cats) are handled in a trie by the addition of an “end of word”
character to the character set (see eow in Fig. 2). The eow
character, because it is unique in the character set, can be used
to distinguish words that have the same prefix by appending it
to the end of the shorter word.
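The following C sketch is one possible realization of such an array-based trie; it is not the paper's implementation. Each column holds 27 slots, one per letter plus an end-of-word slot; positive entries index the word list, negative entries name the next column to consult, and the words are assumed to be distinct and uppercase. Column 0 here plays the role of the paper's column 1.

#include <stdio.h>
#include <string.h>

/* Minimal array-based trie.  A positive entry is an index into the word
 * list, a negative entry is the (negated) number of the next column, and
 * 0 means empty.  The word list and sizes are illustrative. */
#define ALPHA 27             /* 'A'..'Z' plus the end-of-word (eow) slot */
#define EOW   26
#define MAXCOLS 64

static const char *words[] = { "", "CAT", "COP", "CORN", "CORK", "DOG", "FLAG", "FROG" };
static int trie[MAXCOLS][ALPHA];
static int ncols = 1;        /* column 0 is the starting column */

static int slot(const char *w, int i) { return w[i] ? w[i] - 'A' : EOW; }

static void insert(int wordno)   /* words are assumed distinct and uppercase */
{
    const char *w = words[wordno];
    int col = 0, i = 0;
    for (;;) {
        int s = slot(w, i);
        int v = trie[col][s];
        if (v == 0) { trie[col][s] = wordno; return; }   /* free slot */
        if (v < 0)  { col = -v; i++; continue; }          /* follow link */
        /* v > 0: another word shares this prefix; add a column and push
         * the displaced word one letter deeper. */
        int other = v;
        trie[col][s] = -ncols;
        trie[ncols][slot(words[other], i + 1)] = other;
        ncols++;
        col = -trie[col][s];
        i++;
    }
}

static int lookup(const char *w)
{
    int col = 0, i = 0;
    for (;;) {
        int v = trie[col][slot(w, i)];
        if (v >= 0) return v;    /* word-list index, or 0 if unknown */
        col = -v;
        i++;
    }
}

int main(void)
{
    for (int n = 1; n <= 7; n++) insert(n);
    printf("DOG  -> %d\n", lookup("DOG"));   /* 5: the first letter suffices */
    printf("FLAG -> %d\n", lookup("FLAG"));  /* 6: two letters are needed */
    printf("CORN -> %d\n", lookup("CORN"));
    return 0;
}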
The trie data structure shown in Fig. 2 would work well as
a perfect hashing algorithm because it completely eliminates
the pattern collision problem, and also because it produces
minimal word lists that can be ordered in any way (the
list shown is alphabetical). If the trie array were stored in
memory, the CPU time needed to index a word through the
trie would actually be less than that of other perfect hashing
functions currently in use for handling large word sets [22],
[9]. For example, to look up the word CORN using Sager’s
algorithm [22] (or Fox’s improvement of it [9]) requires that
seven characters be indexed from the word, along with 11
additions, three MOD operations, and three array indexings.
The worst-case situation for a memory based trie using the
word CORN requires that five letters be indexed from the
word, and that the trie array be indexed five times. Retrieval
time from the memory-based trie array is superior to retrieval
using Sager’s minicycle algorithm. It is again important to note
that time spent manipulating in-memory tables is considered
insignificant compared to the time required to do a disk probe.
A long word such as electroencephlocardiogram could
require a number of in-memory operations to generate the hash
address using either Sager’s algorithm or the trie. This does
not make the algorithm O(n) or O(maximum word length),
however, because only one disk probe is required to access the
record on disk as a result of these calculations. It is the number
of disk accesses that determines the order in any hashing
algorithm. The importance of a single disk probe cannot be
overstressed, especially on slow devices such as CD-ROM’s.
VI. COMPRESSION OF THE TRIE ARRAY
If the trie array were stored in memory, it could provide a
simple scheme for perfect hashing that is superior in terms of
retrieval CPU time to other perfect hashing algorithms [22],
[9]. Unfortunately, the memory utilization in the trie array
shown in Fig. 2 is only 8%, and this inefficiency makes the
algorithm impractical in most cases involving large word sets.
One way to make the use of a trie practical for perfect
hashing would be to compress the trie array to improve its
memory utilization. A simple and efficient O(r^2) algorithm
described in [4] as the SMP algorithm and in [24] as the double
displacement method can be used here to compress the trie.
The array to be compressed is separated into its individual
columns (or rows). The columns are rearranged based on
the number of elements in each column so that the column
containing the most elements is at the top of the sorted list.
The columns are then copied into a 1-D array, starting with
the column containing the most elements. The relationship
between the elements in each column is preserved as the
column is copied into the 1-D array. A column lookup table
(CLT) is maintained to remember the starting location of each
column copied into the 1-D array. Each column is placed into
the 1-D array at the first available position that causes no
collisions between filled elements of the new column being
placed into the 1-D array and the already filled elements of
the 1-D array. A pseudocode version of this algorithm, as it
would be applied to tries, is shown in Fig. 3.
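A small C sketch of this first-fit column packing is shown below; the 2-D array, its dimensions, and its values are invented for the example. The columns are sorted by fill count, each column is slid down the 1-D array until its non-zero entries meet no occupied positions, and the resulting offsets form the CLT.

#include <stdio.h>

/* Sketch of the column-packing step described above: columns of a sparse
 * 2-D array are dropped, densest first, into the first offset of a 1-D
 * array at which their non-zero entries meet no entries already placed.
 * The sample array and sizes are illustrative. */
#define ROWS 8
#define COLS 5
#define ONED (ROWS * COLS)

static int grid[COLS][ROWS] = {
    { 1, 0, 2, 0, 3, 0, 0, 0 },
    { 0, 4, 0, 0, 0, 0, 0, 0 },
    { 5, 0, 0, 6, 0, 0, 0, 0 },
    { 0, 0, 7, 0, 0, 0, 0, 0 },
    { 0, 8, 0, 0, 0, 0, 9, 0 },
};

static int packed[ONED];   /* the 1-D array; 0 means empty */
static int clt[COLS];      /* column lookup table: starting offsets */

static int count(int c)
{
    int n = 0;
    for (int r = 0; r < ROWS; r++) if (grid[c][r]) n++;
    return n;
}

static int fits(int c, int off)
{
    for (int r = 0; r < ROWS; r++)
        if (grid[c][r] && packed[off + r]) return 0;
    return 1;
}

int main(void)
{
    int order[COLS];
    for (int i = 0; i < COLS; i++) order[i] = i;

    /* Sort columns so that the fullest column is packed first. */
    for (int i = 0; i < COLS; i++)
        for (int j = i + 1; j < COLS; j++)
            if (count(order[j]) > count(order[i])) {
                int t = order[i]; order[i] = order[j]; order[j] = t;
            }

    /* First-fit placement of each column into the 1-D array. */
    for (int i = 0; i < COLS; i++) {
        int c = order[i], off = 0;
        while (!fits(c, off)) off++;       /* slide down until no collision */
        for (int r = 0; r < ROWS; r++)
            if (grid[c][r]) packed[off + r] = grid[c][r];
        clt[c] = off;
    }

    for (int c = 0; c < COLS; c++)
        printf("column %d starts at offset %d\n", c, clt[c]);
    for (int i = 0; i < 3 * ROWS; i++) printf("%d ", packed[i]);
    printf("\n");
    return 0;
}

In this toy example the single-value columns end up filling the gaps left by the larger columns, which is the behavior the paper reports for large word lists.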
Applying this sparse array packing algorithm to the trie
array shown in Fig. 2 would result in the 1-D array shown
in Fig. 4.
Fig. 4. The packed 1-D array and column lookup table (CLT) formed from
the trie of Fig. 2.

Fig. 5. The packed trie with CLT modified to contain all positive values.

Fig. 6. The packed trie with CLT incorporated into the trie and eliminated
(starting location of column 1 = 9).

The CLT in Fig. 4 indicates the starting location of each
column that has been packed into the 1-D array. To retrieve
the word FLAG using the setup shown in Fig. 4, the first
letter of the word FLAG (F) is used to index into column 1
of the trie array. To find column 1 in the packed trie array, the
CLT is used. The value for column 1 in the CLT is -2. This
value is added to 6 (since F is the 6th letter in the alphabet)
to yield 4. In location 4 of the 1-D array is the value -5. The
-5 indicates, as it did in a normal trie array, that the second
letter of the word FLAG should be used as an index into the
5th column of the trie array. The letter L acts as the index into
the 5th column of the array. The starting location for column 5
is retrieved from the CLT, where the value -4 is found. This
-4 is added to 12 (since L is the 12th letter of the alphabet)
to yield the index 8 into the 1-D array. The value at location
8 of the 1-D array is 6. As in the unpacked trie array, the 6
indicates that the word FLAG can be found at location 6 in
the word list.
The CLT can also be eliminated to improve space efficiency.
First, the values in the CLT are all increased by 11 to remove
all negative values from the CLT, and the indexing of the 1-D
array is modified accordingly, as shown in Fig. 5.
The technique for retrieving values from the array shown
in Fig. 5 is identical to the technique used in Fig. 4. The CLT
can then be eliminated by remembering the starting location
for column 1 (since all words must index through column 1)
and replacing the negative values in the 1-D array with their
CLT values, as shown in Fig. 6.
Using the arrangement shown in Fig. 6, words can be
retrieved in a manner very similar to the technique used for
Figs. 4 and 5. For example, to retrieve the word FLAG, start
by adding 9 (the starting location of column 1) to 6 (F is the
first letter in FLAG, and F is the 6th letter of the alphabet):
9 + 6 = 15. Use 15 as an index into the 1-D array and retrieve
the value -7. The fact that -7 is negative indicates, as usual,
that the second letter of FLAG should be used. L is the 12th
letter of the alphabet. Add the absolute value of -7 to 12:
ABS(-7) + 12 = 19. Using 19 as an index into the 1-D array
yields the value 6. Since 6 is positive, it is used as an index
into the word list of Fig. 2, where the word FLAG is found.
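A C sketch of retrieval from such a packed trie follows. The tiny hand-packed array, word list, and column offsets are illustrative rather than the paper's figures; the loop adds the letter position to the current column offset, follows a negative entry to the next column, and returns a positive entry as the word-list index.

#include <stdio.h>

/* Retrieval from a packed trie in which the CLT has been folded into the
 * 1-D array (the Fig. 6 arrangement): a positive entry is a word-list
 * index, a negative entry is the starting offset of the next column.
 * The packed array, offsets, and word list below are illustrative. */
static const char *words[] = { "", "DOG", "FLAG", "FROG" };
static int packed[32];
static int column1 = 1;      /* starting offset of column 1 */

static int letter(char c) { return c - 'A' + 1; }   /* A = 1 ... Z = 26 */

static int retrieve(const char *w)
{
    int base = column1;
    for (int i = 0; w[i] != '\0'; i++) {
        int v = packed[base + letter(w[i])];
        if (v > 0) return v;     /* index into the word list */
        if (v == 0) return 0;    /* word not known to the trie */
        base = -v;               /* negative: follow to the next column */
    }
    return 0;
}

int main(void)
{
    /* Hand-packed trie for DOG, FLAG, FROG: column 1 holds D and F, and a
     * second column (also at offset 1; its slots do not clash) splits FL/FR. */
    packed[column1 + letter('D')] = 1;    /* DOG  */
    packed[column1 + letter('F')] = -1;   /* go to the column at offset 1 */
    packed[1 + letter('L')] = 2;          /* FLAG */
    packed[1 + letter('R')] = 3;          /* FROG */

    printf("FLAG -> %s\n", words[retrieve("FLAG")]);
    printf("FROG -> %s\n", words[retrieve("FROG")]);
    printf("DOG  -> %s\n", words[retrieve("DOG")]);
    return 0;
}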
Once the array shown in Fig. 6 has been created, the
memory efficiency of the trie array has been improved to 64%,
while retrieval times are almost as efficient as those for the
unpacked trie array (an extra addition operation is needed for
each access to the 1-D array).
The packed trie's memory utilization can be improved to
100% using the MSMP sparse array packing technique shown
in [4]. This algorithm adds a MOD function to retrievals in
the 1-D array to eliminate unused gaps in the array. This extra
step is not necessary, however. On large word lists (any list
of over 100 words) the packing algorithm shown in Figs. 4
through 6 yields utilizations of 100% in every test case that
has been attempted.
This phenomenon occurs because of the way the 2-D trie is
constructed and packed into the 1-D array. For example, on a
3000-word list used in testing, the 2-D trie had 2122 columns.
Each column contains 128 values (one value for each of the
128 ASCII codes). In this 2-D array, 5121 positions contain
values. The 1-D array will therefore contain 5121 positions if
100% memory utilization is achieved. It can be seen that each
column placed into the 1-D array is very small in relation to
the total 1-D array size.
Examination of the composition of the 2-D trie shows that
a majority of the columns contain only one or two values:
911 columns contain two values, while 680 columns contain
one value. This column distribution occurs because many
words must be separated from their plural forms in the trie.
For example, if the words MARSHA, MARSHALL, and
MARSHALLS occur in the list, two single value columns
(each containing an L) will be created in the trie to get to
the point where MARSHALL and MARSHALLS can be
distinguished. Then a two value column (containing S and
an end-of-word mark) will be created to cause the actual
distinguishing between MARSHALL and MARSHALLS.
The columns containing two values pack very densely. The
columns that contain only one value can be placed anywhere
in the 1-D array, so they are used by the algorithm to fill all
unused gaps left in the 1-D array. The result has been 100%
memory utilization in the 1-D array for every test case that
has been attempted. The general organization of a trie does
not guarantee 100% utilization of memory when the trie is
packed, but it makes it extremely likely.
VII. RESULTS
Table III shows the amount of time and space required
to form ordered minimal perfect hashing functions using the
compressed trie algorithm described above. The final entry
in this table shows the performance of the algorithm when it
is used to form an ordered minimal perfect hashing function
for the entire Unix word list (including such words as 10th,
I'll, and electroencephalography) normally found in the file
/usr/dict/words. This word list contains letters (both upper and
lower case), digits, and punctuation. To handle the entire list,
the columns of the trie were increased to 128 locations to
accommodate the entire ASCII character set.
TABLE III
TOTAL TIME REQUIRED TO CREATE ORDERED MINIMAL PERFECT
HASHING FUNCTIONS FOR WORD LISTS OF THE SIZE INDICATED

# Words   Formation Time (in s)   Packing Time (in s)   Total Time (in s)   # Columns   # Values
 1000              30                      31                   61               795        1794
 1500              45                      52                   97              1223        2722
 3000              88                     128                  216              2122        5121
 6000             183                     360                  543              4250       10249
12000             385                    1141                 1526              8592       20591
24481             789                    4231                 5020             16962       41442
In Table III, word lists of the size indicated were taken from
the standard Unix word list (e.g., the first 1000 words were
taken for the 1000-word test). The Formation Time column
indicates the amount of time taken to form the 2-D trie in
seconds. The Packing Time column indicates the amount of
time required to form the packed 1-D array. Total time is the
sum of formation time and packing time. # Columns indicates
the number of columns in the 2-D array formed, while # Values
indicates the number of filled values in the 2-D array. This
column also indicates the size of the 1-D array formed, because
in all cases the 2-D array packed perfectly into the 1-D array,
with no unused values present in the 1-D array. This table was
formed using one node of a Sequent Balance 8000 (each node
uses a 12-MHz 32032 processor, which is roughly equivalent
to a Macintosh II computer (a 68020 at 16 MHz)).
Determining the computational complexity of the trie packing algorithm is straightforward. It can be seen in Table III
that the size of the 2-D trie, both in terms of the number of
columns and in terms of the number of values held in the
trie, is directly proportional to the number of words. Call the
total number of elements held in the 2-D trie r, and assume
in the worst case that there are r columns in the 2-D trie,
each containing 1 element. Because the 2-D trie contains r
values, a minimally packed 1-D array containing the contents
of this 2-D trie will also require r elements. As each column
is packed, every element in the column must be examined
to determine if it collides with any existing values in the 1-D
array. Also, in the worst case, there are r possible positions for
each column to occupy in the 1-D array. Since each column
can be considered to contain a single element, and since in
the worst case r positions for this column in the 1-D array
must be examined, r comparisons must be made. This same
operation must be performed for all r columns in the 2-D trie.
Therefore, r^2 comparisons must be performed to pack the
2-D trie. The algorithm is O(r^2) in the worst case. The O(r^2)
nature of this algorithm can also be seen in the Total Time
column of Table III.
The time required to construct the trie to be packed is O(n),
as can be seen empirically in Table III. This is a one-time
cost, like the packing of the trie. For more information on trie
construction, see [25].
Table IV compares the new algorithm with other current
minimal perfect hashing algorithms. It can be seen that the
new algorithm can be used to create perfect hashing functions
for extremely large word lists. It should also be noted that
the packed trie technique is the only algorithm in Table IV that
TABLE IV
COMPARISON OF THIS ALGORITHM TO OTHERS

Algorithm Name   Reference   Build Order   List Size   CPU Time (in s)**   Machine     Space (bytes/entry)
Cichelli         [6]         O(c^r)             40             35          IBM 4341     0.65
Karplus          [15]        O(r^1.5)*         667            150          IBM 4341     N.A.
Sager            [22]        O(r^6)            256             45          IBM 4341     4.0
Fox              [9]         O(r^3)           1000            600          Mac II       4.0
Brain            [3]         O(r^2)*          1696            733          PS/2/50      2.4
MSMP             [4]         O(r^2)           5000            283          PS/2/70      2.0
Packed Trie      -           O(r^2)          24481           5020          SB 8000      3.4

N.A. = Not available.
*Derived from experimental results rather than theoretical analysis.
**These times are included for completeness rather than comparison
purposes, although the machines listed are all roughly equivalent to one
another in performance. The times demonstrate that each of the algorithms
listed can complete its task in a reasonable amount of time.
forms an ordered perfect hashing function, and it is also the
only algorithm that is immune to pattern collisions.
In Table IV, the Space column indicates the amount of
memory space (in bytes) required to hold the table used by
the given algorithm, per item stored in the item list. The space
requirement for the packed trie algorithm was determined by
dividing the total number of elements in the packed trie (at 2
bytes per element) by the total number of words recognizable
by the packed trie for the 24481-word trie (see Table III):
41442 * 2 / 24481 = 3.4. The elements in the packed trie will
be stored in memory and used by the perfect hashing function
during retrieval. This process is typical of all of the perfect
hashing functions shown in Table IV, including Cichelli’s
algorithm and Sager’s algorithm.
It is also important to recognize that this table of elements
is not a hash table. It is a unique table of values formed for the
data set, and it allows every item in the data set to be accessed
on the disk using a single probe of the storage medium. Normal
hashing algorithms would have poor performance compared to
the implementation shown here.
The problem involved with using a normal hash function in
a perfect hashing application can be clarified with an example.
Double hashing is a well understood and effective hashing
algorithm that could be used to allow word searches equivalent
to the perfect hashing function’s. One of two techniques might
be used. In the first, a table of pointers is held in memory, as
shown in Fig. 7. To look up a word, the word is hashed and
the pointer value held at the hash location in memory is used
to probe the disk. This approach would yield an in-memory
table of the same size as the trie, but because of collisions
in the hash table multiple disk probes would be required for
retrieval of some data items.
Fig. 7. A comparison of the way a perfect hashing function and a normal
hash function use an in-memory table.

Fig. 8. Double hashing used with an in-memory table of words.

The second way to use double hashing would be to place all
of the keys in an in-memory table so that multiple disk probes
are not required. This approach is shown in Fig. 8. Double
hashing is used to find the word in the table. Once the word
is found, the associated pointer is used to access the word on
the disk in one probe. While this would yield the desired disk
probe depth of 1, it requires a great deal of memory space
for the table. Our test word list had a maximum word size of
22 characters. Each entry also requires a 2-byte pointer onto
the disk. For a table of 24481 words, assuming the best-case
scenario of a 100% packing factor, 24481 * 24 = 587,544
bytes are required to hold the table. Using a compressed trie
requires only 82,884 bytes. Either way, double hashing does
not compare well to the trie approach.
It can be seen in Table IV that the new algorithm is superior
to all other algorithms currently in use in terms of build time,
with the possible exception of the MSMP algorithm. The
MSMP algorithm suffers only in its vulnerability to pattern
collisions. Forming the array-based trie is also significantly
faster than Sager's algorithm [22]. Fox's algorithm [9] (an
improvement of Sager's) requires 600 s of Macintosh II CPU
time to process 1000 words. The algorithm described here
handles 1000-word lists in 61 s on a machine of similar
computing power.
Another benefit of the new algorithm is its superior retrieval
performance. As already discussed, retrieval of the word
CORN using Sager's algorithm requires that seven characters
be indexed from the word, along with 11 additions, three
MOD operations, and three array indexings. The worst-case
situation for a memory-based packed trie of any size using
the word CORN would require that five letters be indexed
from the word, that the 1-D array be indexed five times,
and that five extra additions be performed for the indexings.
It can be seen that a packed trie is superior to Sager's
algorithm in space requirements and retrieval time. Even for
long words, however, it must be recognized that the calculations
performed by the trie algorithm or Sager's algorithm are
considered insignificant when compared to the time required
for a disk access. The fact that these algorithms can access
any data item in a single disk probe once these calculations
have been performed is their desirable characteristic.
VIII. EXTENSIONS OF THE NEW ALGORITHM
The new algorithm offers a number of advantages over other
current algorithms as long as the packed trie can be held in
memory. On extremely long word lists (e.g., a 100000-word
list found in a medical dictionary located on a CD-ROM),
however, it may be infeasible to store the entire packed trie in
memory. This would cause multiple reads to be performed on
the packed trie stored on disk (a maximum of five reads, as
already discussed in the case of the word CORN), and these
multiple reads will severely degrade performance.
To solve this problem, the word list can be segmented as
discussed in [9]. For example, a 100000-word list could be
subdivided into 26 separate lists using the first character of
each word to determine the contents of each sublist (similarly,
the first two characters could be used to subdivide the lists
into 676 sublists, etc.). To retrieve a word, the word would
first be examined to determine which subtrie is appropriate,
and that subtrie would be loaded into memory in a single read
operation. The normal retrieval process would then continue
using the subtrie stored in memory. Since individual sectors on
CD-ROM’s are extremely large (2 Kbytes [27]), a significant
number of words can be stored in each subtrie.
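A sketch of this segmentation step is shown below; the routines only select the subtrie (by the first character, or by the first two characters for 676 sublists), and the loading and lookup of the chosen subtrie are assumed to be handled by the packed-trie retrieval already described.

#include <stdio.h>

/* Illustrative segment selection: the first character (or the first two
 * characters) of a word chooses the subtrie that must be read into memory
 * before the normal packed-trie retrieval proceeds.  Loading and searching
 * the chosen subtrie are assumed to be done by routines not shown here. */
static int subtrie_1(const char *w)            /* 26 sublists  */
{
    return w[0] - 'A';
}

static int subtrie_2(const char *w)            /* 676 sublists */
{
    return (w[0] - 'A') * 26 + (w[1] - 'A');
}

int main(void)
{
    printf("CORN -> subtrie %d (or %d of 676)\n", subtrie_1("CORN"), subtrie_2("CORN"));
    printf("FLAG -> subtrie %d (or %d of 676)\n", subtrie_1("FLAG"), subtrie_2("FLAG"));
    return 0;
}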
IX. CONCLUSION
A packed trie data structure can be used to create extremely
fast and space-efficient ordered minimal perfect hashing functions for extremely large word lists. The performance of the
algorithm, both in terms of access and creation times, is
superior to other perfect hashing functions currently available.
REFERENCES

[1] R. Bayer and C. McCreight, "Organization and maintenance of large
ordered indexes," Acta Inform., vol. 1, no. 3, pp. 173-189, 1972.
[2] F. Berman, M. E. Bock, E. Dittert, M. J. O'Donnell, and D. Plank,
"Collections of functions for perfect hashing," SIAM J. Comput., vol.
15, no. 2, pp. 604-618, May 1986.
[3] M. Brain and A. Tharp, "Near-perfect hashing of large word sets,"
Software Practice and Experience, vol. 19, no. 10, pp. 967-978, Oct.
1989.
[4] M. Brain and A. Tharp, "Perfect hashing using sparse matrix packing,"
Inform. Syst., vol. 15, no. 3, pp. 281-290, Fall 1990.
[5] N. Cercone, M. Krause, and J. Boates, "Minimal and almost minimal
perfect hash function search with application to natural language lexicon
design," Comput. Math. Appl., vol. 9, no. 1, pp. 215-231.
[6] R. J. Cichelli, "Minimal perfect hashing made simple," Commun. Assoc.
Comput. Mach., vol. 23, no. 1, pp. 17-19, Jan. 1980.
[7] C. C. Chang, "The study of an ordered minimal perfect hashing scheme,"
Commun. Assoc. Comput. Mach., vol. 27, no. 4, pp. 384-387, Apr. 1984.
[8] M. W. Du, T. M. Hsieh, K. F. Jea, and D. W. Shieh, "The study of
a new perfect hash scheme," IEEE Trans. Software Eng., vol. 9, no. 3,
pp. 305-313, May 1983.
[9] E. A. Fox, Q. F. Chen, L. S. Heath, and S. Datta, "A more cost effective
algorithm for finding minimal perfect hash functions," ACM Conf. Proc.,
1989, pp. 114-122.
[10] E. A. Fox, L. S. Heath, Q. F. Chen, and A. M. Daoud, "Practical minimal
perfect hash functions for large databases," Commun. Assoc. Comput.
Mach., vol. 35, no. 1, pp. 105-121, Jan. 1992.
[11] E. Fredkin, "Trie memory," Commun. Assoc. Comput. Mach., vol. 3,
no. 9, pp. 490-500, Sept. 1960.
[12] M. L. Fredman and J. Komlos, "Storing a sparse table with O(1) worst
case access time," J. Assoc. Comput. Mach., vol. 31, no. 3, pp. 538-544,
July 1984.
[13] Y. S. Hsiao and A. L. Tharp, "Adaptive hashing," Inform. Syst., vol.
13, no. 1, pp. 111-127, 1988.
[14] G. Jaeschke, "Reciprocal hashing: A method for generating minimal
perfect hash functions," Commun. Assoc. Comput. Mach., vol. 24, no.
12, pp. 829-833, Dec. 1981.
[15] K. Karplus and G. Haggard, "Finding minimal perfect hash functions,"
Comput. Sci. Dept., Cornell Univ., TR 84-437, Sept. 1984.
[16] K. Karplus and G. Haggard, "Finding minimal perfect hash functions,"
in Proc. SIGCSE, 1986, pp. 191-193.
[17] D. E. Knuth, The Art of Computer Programming, vols. 1-3. Reading,
MA: Addison-Wesley, 1968-73, vol. 3, p. 506.
[18] P. Larson and A. Kajla, "File organization: Implementation of a method
guaranteeing retrieval in one access," Commun. Assoc. Comput. Mach.,
vol. 27, no. 7, pp. 670-677, July 1984.
[19] P. Larson and M. V. Ramakrishna, "External perfect hashing," in Proc.
ACM SIGMOD Conf., 1985, pp. 190-199.
[20] T. Lewis and C. Cook, "Hashing for dynamic and static internal tables,"
Computer, pp. 45-56, Oct. 1988.
[21] M. V. Ramakrishna, "Perfect hashing for external files," Ph.D. dissertation,
Univ. of Waterloo, Waterloo, Canada, 1986.
[22] T. J. Sager, "A polynomial time generator for minimal perfect hash
functions," Commun. Assoc. Comput. Mach., vol. 28, no. 5, pp. 523-532,
May 1985.
[23] R. Sprugnoli, "Perfect hashing functions: A single probe retrieval
method for static sets," Commun. Assoc. Comput. Mach., vol. 20, no.
11, pp. 841-850, Nov. 1977.
[24] R. E. Tarjan and A. C. Yao, "Storing a sparse table," Commun. Assoc.
Comput. Mach., vol. 22, no. 11, pp. 606-611, Nov. 1979.
[25] A. L. Tharp, File Organization and Processing. New York: Wiley,
1988.
[26] F. A. Williams, "Handling identifiers as internal symbols in language
processors," Commun. Assoc. Comput. Mach., vol. 2, no. 6, pp. 21-24,
June 1959.
[27] B. Zoellick, "CD-ROM software development," Byte, vol. 11, no. 5,
pp. 177-188, May 1986.
Marshall Brain received the B.S. degree in computer engineering from
Rensselaer Polytechnic Institute, Troy, NY, and the M.S. degree in computer
science from North Carolina State University, Raleigh, NC.
He is a member of the Academy of Outstanding
Teachers at North Carolina State University. He
currently works for Interface Technologies, Inc. in
Wake Forest, NC.
Mr. Brain is the author of Motif Programming: The Essentials... and More
(Digital Press). He is the lead author of Prentice Hall's five-book Bridge
series on Windows NT.
Alan L. Tharp received the B.S. degree in science
engineering and the M.S. and Ph.D. degrees in
computer science from Northwestern University,
Evanston, IL.
He is currently Head of the Computer Science
Department at North Carolina State University,
Raleigh, NC. His research interests include file and
data structures, database management systems, and
human-computer interfaces.
Dr. Tharp is a member of the Association for
Computing Machinery. He is also the author of File
Organization and Processing (Wiley, 1988).