Using Tries to Eliminate Pattern Collisions in Perfect Hashing

Marshall D. Brain and Alan L. Tharp

Abstract-Many current perfect hashing algorithms suffer from the problem of pattern collisions. In this paper, a perfect hashing technique that uses array-based tries and a simple sparse matrix packing algorithm is introduced. This technique eliminates all pattern collisions, and because of this it can be used to form ordered minimal perfect hash functions on extremely large word lists. This algorithm is superior to other known perfect hashing functions for large word lists in terms of function building efficiency, pattern collision avoidance, and retrieval function complexity. It has been successfully used to form an ordered minimal perfect hashing function for the entire 24 481-element Unix word list without resorting to segmentation. The item lists addressed by the perfect hashing functions formed can be ordered in any manner, including alphabetically, to easily allow other forms of access to the same list.

Index Terms-Perfect hashing, minimal perfect hashing, hashing, tries, sparse array packing.

Manuscript received October 26, 1990; revised September 30, 1991, and March 16, 1992. The authors are with the Computer Science Department, North Carolina State University, Raleigh, NC 27695-8206 USA. IEEE Log Number 9212669.

I. INTRODUCTION

Hashing is a technique used to build modifiable data structures that maintain efficient retrieval times for all items in the structure. As items are added to the structure, the attempt is made to maintain O(1) access time at retrieval. This efficiency is achieved by placing each unique item that is added at a unique address in the structure. A number of algorithms have been suggested, but no current hashing algorithm can maintain O(1) retrieval times over the lifetime of the data structure: eventually multiple items collect at identical addresses, and these collections must be handled in some way. All current algorithms used to resolve these collections carry a retrieval penalty, which causes the retrieval efficiency to rise above the ideal O(1).

A number of perfect hashing algorithms have been created recently that do allow the O(1) retrieval of all items in the data structure as long as the items are static. These algorithms all start with an original and static list of items and use various techniques to build a hash function that is specific to that set of items. A perfect hashing function may perform multiple computations or table look-ups to produce the unique retrieval address, but each item in the list can be retrieved with O(1) efficiency.

Perfect hashing functions were used as early as 1977 to allow O(1) retrieval of items such as month names [23]. Algorithms such as Cichelli's algorithm [6], which is capable of handling word lists of up to about 50 words, have been used for tasks such as the O(1) retrieval of Pascal reserved words within a Pascal compiler. Interest in perfect hashing algorithms has risen recently with the advent of extremely large static databases, particularly those held on CD-ROMs. CD-ROMs are a nonwriteable medium, which guarantees static data. CD-ROMs also have slow track seek times, which makes multiple accesses to the disk extremely frustrating for the user. Perfect hashing functions have been used in this environment to provide quick access to large medical dictionaries [9] and other word lists.
The perfect hashing function allows any word to be accessed from the disk with only one disk access operation. Unfortunately, many current perfect hashing algorithms are severely limited in the number of words that can be processed.

Future application of perfect hashing functions to large static databases will continue. On-line periodical indexes and card catalogs, especially those committed to CD-ROM, will continue to demand the high-efficiency retrieval capabilities offered by perfect hashing. Perfect hashing algorithms must expand to handle immense item lists of this type.

A new perfect hashing technique based on compressed tries has been developed. It can be used to form ordered minimal perfect hashing functions for extremely large word lists. The new technique is unique in its speed and efficiency, as well as in its immunity to pattern collisions.

II. AVAILABLE PERFECT HASHING ALGORITHMS

Sprugnoli [23] introduced the first perfect hashing algorithm. Two techniques, the quotient reduction method and the remainder reduction method, were developed to allow perfect hashing of relatively small sets of words (10-12 words). These methods both rely on the algorithm's ability to create, for the word set in question, a simple formula of the form H = (W - C1)/C2 (quotient reduction method) or H = FLOOR(((C1 + W * C2) mod C3)/C4) (remainder reduction method), where H is the hash value, W is the word to be hashed, and C1, C2, C3, and C4 are constants calculated by the algorithm. For larger sets of words, Sprugnoli suggests a segmentation process to break the list into smaller subsets. Sprugnoli's methods are unique in that they do not require auxiliary tables of information at retrieval time; only the four constants must be stored.

TABLE I
BUILD ORDER OF DIFFERENT ALGORITHMS

    Algorithm   Reference   Build
    Cichelli    [6]         O(c^r)
    Karplus     [15]        O(r^1.5)*
    Sager       [22]        O(r^6)
    Fox         [9]         O(r^3)

    *Derived from experimental results rather than theoretical analysis.

Jaeschke [14] created a technique called reciprocal hashing. This technique is similar to Sprugnoli's in that the perfect hashing functions are relatively simple, but like Sprugnoli's
algorithm, Jaeschke's algorithm cannot handle more than about 20 words without segmentation.

Cichelli [6] developed an extremely straightforward minimal perfect hashing algorithm that can be used on word lists of up to about 50 words. Cichelli's method makes use of a table of 26 integers that is consulted at retrieval time. The perfect hashing algorithm uses a brute-force technique to find a set of values for the table that allows each word to be hashed to a unique address. For larger word sets Cichelli suggests segmentation.

Cichelli's algorithm is extremely sensitive to pattern collisions. We introduce the term pattern collision to describe the situation that occurs when two words cannot be handled simultaneously by a perfect hashing function. If the technique used by the perfect hashing function to determine the address for a word causes two words to be indistinguishable, a pattern collision has occurred. For example, a perfect hashing function formed by Cichelli's algorithm uses the first and last characters of the word being hashed, along with the length of the word, to decide on the hash address. Because of this, any two words with the same first and last characters and the same length cannot be distinguished, and therefore cannot be handled simultaneously by Cichelli's algorithm (e.g., "dig" and "dug").

Karplus and Haggard [16] generalized Cichelli's algorithm using graph theory methods to hash word sets of up to almost 700 words. Karplus also reports experimental results that seem to indicate an extremely good efficiency for the algorithm (see Table I). However, the algorithm has not been successfully applied to word lists of more than 700 words, and the words in the sample lists [15] seem to be carefully chosen. For example, one list contains the words "Tannarive," "thugee," and "thyroglobulin," but omits more common words such as "the," "there," and "through."

Chang [7] built an ordered minimal perfect hashing algorithm using the Chinese remainder theorem. Ordered functions allow items in the static list to be arranged in any order (e.g., alphabetically) as the perfect hashing function is built. Chang's algorithm is unique in this respect.

Du, Hsieh, Jea, and Shieh [8] created a perfect hashing function based on segmentation and random hashing functions. Word lists of up to 300 words were successfully hashed, though no execution times are reported.

Cercone, Krause, and Boates [5] started with Cichelli's algorithm and created three different perfect hashing algorithms. One forms segments based on word length and then applies Cichelli's algorithm to each segment; the results for the subsets are then combined into a single table. The remaining two algorithms are more esoteric. The best of the three algorithms is able to handle up to 500 words, but the results are not minimal (one-third of the final table is unused).

Sager [22], also trying to improve on Cichelli's algorithm, used a graph-theoretic approach to create minimal perfect hashing functions for sets of words as large as 512. Sager's work was later improved by Fox, Chen, Heath, and Datta [9] to form minimal perfect hashing functions for lists containing up to 1000 words. Both of these algorithms expand considerably on Cichelli's table size to handle large word sets. To increase the number of words hashed, Fox suggests segmentation of the word list.

Larson and Ramakrishna [21], [19] created a perfect hashing algorithm for external files. The technique uses a segmentation strategy to divide a large file into smaller groups. Each group is then perfectly hashed using standard techniques.

Fox, Chen, Heath, and Daoud [10] have demonstrated a perfect hashing algorithm capable of handling large word lists. The algorithm is O(n) at build time and uses very little space to store the perfect hashing tables needed at retrieval time. However, this algorithm does not produce an ordered perfect hashing function, and it is also fairly expensive at retrieval time. See subsequent sections for discussions of these issues.

Further discussion of perfect hashing functions can be found in [25] and [20]. Berman [2] presents theoretical analysis of the space tradeoffs involved in creating perfect hashing functions.

III. PERFECT HASHING ISSUES

Perfect hashing algorithms are used to build perfect hashing functions. The perfect hashing function normally consists of a set of values (often stored in a table or array) that are calculated by the perfect hashing algorithm. The formation of the perfect hashing function by the perfect hashing algorithm occurs once for a given, static item list.
The perfect hashing function is then used for each retrieval of items from the item list. Several issues have become prominent in perfect hashing research. These issues include the following.

1) The ability of the perfect hashing algorithm to form minimal perfect hashing functions.
2) The ability of the algorithm to form ordered perfect hashing functions.
3) The memory requirements of the perfect hashing function at retrieval time.
4) The amount of CPU time required by the perfect hashing algorithm while the perfect hashing function is being built.
5) The amount of CPU time required by the perfect hashing function for each retrieval.
6) The sensitivity of the perfect hashing algorithm and function to pattern collisions.

A hash function's ability to form "minimal" item lists is extremely important because of space considerations. In a minimal list, the items addressable by the perfect hashing function are arranged contiguously. Minimal lists save considerable storage space if the items addressed by the perfect hashing function are large records of information. Most of the algorithms discussed previously produce minimal functions, but most also allow nonminimal functions to be formed as well. In general, less time is required to build nonminimal perfect hashing functions.

The order of the item list is sometimes important. For example, the item list must be accessible by the perfect hashing function but may also need to be accessible alphabetically or sequentially. Perfect hashing algorithms that allow arbitrary and specific orderings (e.g., alphabetical) of the item list once the perfect hashing function is formed are said to produce ordered functions. Chang's algorithm [7] creates an ordered perfect hashing function, while most others do not.

Another important issue is the amount of memory required by the perfect hashing function at retrieval time. All perfect hashing functions require the use of a precalculated set of values to handle retrieval; these values are determined when the perfect hashing function is built. For example, Sprugnoli [23] calculates four constants (C1, C2, C3, and C4) that are used at retrieval. Cichelli [6] calculates a table of 26 values that is accessed for retrieval. In Cichelli's algorithm, the first letter and last letter of the word to be retrieved are used as indexes into the table, and the two values found in the table are summed together with the length of the word to form the hash address. The purpose of Cichelli's algorithm is to find values for the table that cause no address collisions for any of the words in the static item list. The amount of space required for the table tends to grow with the number of items hashed. For example, Sager's algorithm [22] requires 4 bytes of storage space in the table for each item in the static list of items. This table is built by the perfect hashing algorithm once, and then stored in memory and used for each retrieval performed by the perfect hashing function.

Each perfect hashing algorithm has two completely different time efficiency aspects that must be monitored. The first is retrieval efficiency. Because a perfect hashing function maps a key to a unique address, it has O(1) retrieval efficiency by definition. The data are stored in auxiliary memory, but the hashing function is processed in primary memory.
Because the access time for auxiliary memory is several orders of magnitude greater than the access time for primary memory or the execution time of an instruction, the operations performed by the perfect hashing function are insignificant compared with an auxiliary memory access. In a perfect hashing function, each item in the list can be retrieved with O(1) accesses of auxiliary memory.

The second time efficiency aspect involves actually building the perfect hashing function. For any algorithm, this is the one-time cost of constructing a perfect hashing function for a given set of items. For example, Cichelli's algorithm has an O(c^r) build time (where c is some constant and r is the number of words to be hashed), while Sager's minicycle algorithm has an O(r^6) build time (where r is the number of words hashed). Table I shows the build order of some of the algorithms mentioned.

Many of the algorithms described above are sensitive to pattern collisions to some degree. As discussed in an earlier example (and in [3]), any two words with the same first and last characters and the same length (e.g., "dig" and "dug") cannot be handled simultaneously by Cichelli's algorithm. Pattern collisions are normally handled by segmenting the original word list into a set of sublists, using a hashing function to divide the words among the different segments. The hashing function used to form the segments is chosen so that words with pattern collisions are placed in different segments. This segmentation is accompanied by a performance penalty.

Finally, the complexity of the functions used for retrieval is important. Although all perfect hashing functions have an O(1) (constant) retrieval time, the amount of CPU time required for each retrieval should be minimized. Equations (1)-(6) show some of the different functions used for retrieval in perfect hashing functions. Cichelli's algorithm has an extremely simple and efficient retrieval function, as shown in (2). Sager uses somewhat more complicated functions, as shown in (3)-(6). (Note that this set of equations comes from [22] and is required for large word lists. For smaller lists, simpler equation sets are possible. The listed equation set, or one of similar complexity, is necessary to eliminate pattern collisions in large word sets because many characters must be examined to avoid the collisions.) As pointed out by Sager [22], Chang [7] uses retrieval functions that are extremely time consuming, making them impractical.

Sprugnoli:
    H = FLOOR(((C1 + W * C2) mod C3) / C4)                            (1)

Cichelli:
    H = LENGTH(W) + G[FIRST_CHAR_OF_WORD] + G[LAST_CHAR_OF_WORD]      (2)

Sager:
    H0 = LENGTH(W) + SUM(ORD(W[I])), I = 1 TO LENGTH(W) BY 3          (3)
    H1 = (SUM(ORD(W[I])), I = 1 TO LENGTH(W) BY 2) MOD R              (4)
    H2 = ((SUM(ORD(W[I])), I = 2 TO LENGTH(W) BY 2) MOD R) + R        (5)
    H  = (H0 + G[H1] + G[H2]) MOD N                                   (6)

In the equations shown above, W represents the word being retrieved. LENGTH(W) is the number of characters in the word, and W[I] is the Ith character of the word. G represents a table of values calculated by the given perfect hashing algorithm, indexed using the value shown. C1, C2, C3, C4, and R are constants. H represents the hash address calculated by the perfect hashing function.
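To make the shape of these retrieval computations concrete, the sketch below shows one way (2) and (3)-(6) might be coded. It is illustrative only: the table G, the constants R and N, and the character ordering are assumed to be supplied by the corresponding perfect hashing algorithm at build time and are placeholders here, not values taken from the cited papers.

    # Sketch of the retrieval computations in (2) and (3)-(6).
    # G, R, and N are assumed to come from the build step of the
    # corresponding perfect hashing algorithm.

    def cichelli_hash(word, G):
        # Equation (2): word length plus table values for the first
        # and last characters of the word.
        return len(word) + G[word[0]] + G[word[-1]]

    def sager_hash(word, G, R, N):
        # Equations (3)-(6): three intermediate values H0, H1, H2,
        # combined through the table G and reduced modulo N.
        h0 = len(word) + sum(ord(c) for c in word[0::3])   # (3)
        h1 = sum(ord(c) for c in word[0::2]) % R           # (4)
        h2 = sum(ord(c) for c in word[1::2]) % R + R       # (5)
        return (h0 + G[h1] + G[h2]) % N                    # (6)

The point of the comparison in the text is visible in the sketch: (2) touches two characters and two table entries, while (3)-(6) walk over most of the word and perform several additions, MOD operations, and table accesses per retrieval.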
IV. A SIMPLE PERFECT HASHING FUNCTION

A typical application of a perfect hashing function is the O(1) retrieval of Pascal reserved words by a Pascal compiler. The Pascal reserved words are static and are known in advance, and this knowledge is used to build a specific perfect hashing function for the reserved word set. When a word is to be retrieved, the perfect hashing function is applied to the word. The use of perfect hashing functions can make token recognition for a compiler efficient.

Many of the techniques described above can be used to build perfect hashing functions for this application. These algorithms cover a wide range of complexities, though the simpler algorithms often have significant disadvantages. In this section a simple algorithm is described and used to demonstrate perfect hashing, as well as the six issues described above.

An extremely simple perfect hashing function for Pascal reserved words could be implemented using a 2-D array of integers, as shown in Fig. 1.

Fig. 1. A perfect hashing function using a 2-D array.

The first and last characters are used as indexes into the 2-D array, and the integer at the indexed location points to the word in the word list (normally the word list would contain information needed by the compiler to process the given reserved word). In Fig. 1, a list of seven Pascal reserved words has been hashed. To look up the word AND using this arrangement, the first letter, A, is used as the row index into the array, while the last letter, D, is used as the column index. At the location addressed by A and D is the value 1, which is used to access the record referenced by the word AND in a word list. As can be seen, the retrieval time for any of the words in the word list is constant, regardless of the number of words in the array, giving this perfect hashing function an O(1) retrieval time. When using this algorithm on a very large word list, the array would be stored in primary memory and the word records would be stored on disk, allowing the word records to be accessed with a single disk probe.

The job of the perfect hashing algorithm is to build the 2-D array (a one-time operation). The perfect hashing function (used many times) would use this 2-D array along with the function shown in (7):

    Hash_address = array[first_letter, last_letter]                   (7)

In (7), first_letter and last_letter are the first and last letters of the word being hashed. This is an extremely efficient retrieval function compared with other available algorithms. This perfect hashing algorithm forms ordered minimal perfect hashing functions. The word list contains no gaps between the words; they are all contiguous. The word list shown in Fig. 1 can also be ordered in any way. This particular perfect hashing algorithm would also have an O(n) build time: when the perfect hashing function is being built, a single value is placed in the 2-D array for each word in the word list.
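A short sketch of this scheme follows, showing both the O(n) build step and the single-access lookup of (7). The word list used here is a small illustrative set of Pascal reserved words, not necessarily the exact seven shown in Fig. 1.

    # Sketch of the first-letter/last-letter perfect hashing function
    # of Fig. 1 and (7). Illustrative example list only.

    words = ["AND", "BEGIN", "CHAR", "CASE", "DO", "END", "FOR"]

    # Build step: a 26 x 26 array of indexes into the word list (0 = empty).
    array = [[0] * 26 for _ in range(26)]
    for i, w in enumerate(words, start=1):
        row = ord(w[0]) - ord('A')     # first letter selects the row
        col = ord(w[-1]) - ord('A')    # last letter selects the column
        if array[row][col] != 0:
            raise ValueError("pattern collision: " + w)   # e.g., CUT vs. CONST
        array[row][col] = i

    # Retrieval step, equation (7): one array access per lookup.
    def hash_address(word):
        return array[ord(word[0]) - ord('A')][ord(word[-1]) - ord('A')]

    print(hash_address("AND"))   # -> 1, the position of AND in the word list

The explicit collision check in the build step is exactly where the pattern collision problem discussed next shows up: two words with the same first and last letters map to the same cell and cannot both be stored.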
The memory usage for this perfect hashing function is very inefficient because many entries in the 2-D array are empty. In the example shown in Fig. 1, 669 of the 676 entries in the 2-D array are unused (about 1% memory efficiency). For the seven words handled by this perfect hashing function, 194 bytes of memory are required per word in the list. As more words are added to the list, the memory efficiency improves, however.

This algorithm is also prone to pattern collisions. For example, the word CUT cannot be added to the list because it collides with CONST. Table II shows the probability of pattern collisions based on the number of words in the word list. Knuth [17] calls this phenomenon the birthday paradox [23]. For example, in Fig. 1 the array contains 676 positions. If two random words are to be placed into the array, the probability of them not colliding at the same location is 675/676 = 99.85%; therefore, the probability of collision is 0.15%. If three random words are placed in the array, the probability of no collision decreases to (674/676) * (675/676) = 99.55%, so the probability of collision in an array containing three words is 0.45%.

TABLE II
THE PROBABILITY OF PATTERN COLLISIONS BASED ON THE NUMBER OF WORDS IN FIG. 1'S 2-D ARRAY

    Number of Words   Probability of Collision
          2                  0.15%
          3                  0.45%
         10                  7%
         20                 26%
         30                 50%
         40                 71%
         50                 85%
         60                 93%
         70                 97%
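The collision probabilities in Table II follow directly from this birthday-paradox argument. A short sketch of the calculation, assuming words fall uniformly at random among the 676 first/last-letter cells, is:

    # Sketch of the birthday-paradox calculation behind Table II:
    # the probability that at least two of n random words land in the
    # same cell of a 676-position (26 x 26) array.

    def collision_probability(n, cells=676):
        p_no_collision = 1.0
        for k in range(n):
            p_no_collision *= (cells - k) / cells
        return 1.0 - p_no_collision

    for n in (2, 3, 10, 20, 30, 40, 50, 60, 70):
        print(n, round(100 * collision_probability(n), 2), "%")
    # n = 2 gives 0.15% and n = 3 gives about 0.44%, in line with the
    # first rows of Table II; larger n approach the tabulated values.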
From this example, it can be seen that the perfect hashing algorithm described has certain advantages and disadvantages. The advantages include the ability of the algorithm to form ordered minimal perfect hashing functions in O(n) time; the perfect hashing functions formed are also extremely efficient at retrieval. The disadvantages include extreme memory inefficiency and a high sensitivity to pattern collisions. The memory efficiency disadvantage alone would tend to rule out the perfect hashing function formed by this algorithm.

V. THE TRIE DATA STRUCTURE

In any set of words to be perfectly hashed, no two words have exactly the same characters in all positions. This fact is used in the trie data structure [25], [11] to allow word lists of any size to be accessed. To retrieve a word from a trie data structure, each letter in the word (starting with the first) is examined until the word can be uniquely differentiated from all others in the list of words known to the trie. For example, if the word DOG is known to the trie, and if DOG is the only word in the list that begins with a D, then when trying to retrieve the word DOG using the trie, only the first letter must be examined.

Fig. 2 shows an array-based trie for a word list containing the words CAT, COP, CORN, CORK, DOG, FLAG, and FROG.

Fig. 2. An array-based trie of the word list shown. Negative values point to the next column, while positive values point to the word list. EOW stands for "End of Word," which is used to differentiate, for example, DOG from DOGS.

To access the word DOG from the trie, the first letter of the word (D) is used as an index into the first column of the array (using the first letter of the word to be retrieved to index the first column of the trie is the starting point for any retrieval). The value 5 found at location D of the first column indicates that the word DOG can be found in location 5 of the word list. Because DOG is the only word in the list starting with D, the first letter is sufficient to differentiate it from all other words in the list.

To access the word FLAG from the trie, the first letter of the word is used as an index into the first column of the array. The value -5 is found. The negative sign is a flag used to indicate that the first letter is not sufficient to differentiate the word, and that the second letter should be used as an index into another column, in this case column 5, of the array. At location L in column 5, the value 6 is found. This is the location of the word FLAG in the word list.

Words that are prefixes of other words (e.g., CAT is a prefix of CATS) are handled in a trie by the addition of an "end of word" character to the character set (see EOW in Fig. 2). The EOW character, because it is unique in the character set, can be used to distinguish words that have the same prefix by appending it to the end of the shorter word.

The trie data structure shown in Fig. 2 would work well as a perfect hashing algorithm because it completely eliminates the pattern collision problem, and also because it produces minimal word lists that can be ordered in any way (the list shown is alphabetical). If the trie array were stored in memory, the CPU time needed to index a word through the trie would actually be less than that of other perfect hashing functions currently in use for handling large word sets [22], [9]. For example, looking up the word CORN using Sager's algorithm [22] (or Fox's improvement of it [9]) requires that seven characters be indexed from the word, along with 11 additions, three MOD operations, and three array indexings. The worst-case situation for a memory-based trie using the word CORN requires that five letters be indexed from the word and that the trie array be indexed five times. Retrieval time from the memory-based trie array is therefore superior to retrieval using Sager's minicycle algorithm.

It is again important to note that time spent manipulating in-memory tables is considered insignificant compared to the time required to do a disk probe. A long word such as electroencephlocardiogram could require a number of in-memory operations to generate the hash address using either Sager's algorithm or the trie. This does not make the algorithm O(n) or O(maximum word length), however, because only one disk probe is required to access the record on disk as a result of these calculations. It is the number of disk accesses that determines the order in any hashing algorithm. The importance of a single disk probe cannot be overstressed, especially on slow devices such as CD-ROMs.
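A sketch of an array-based trie of this kind follows, using the representation described above: each column has one slot per character, a positive entry is a position in the word list, and a negative entry names the next column to consult with the next letter of the word. The construction routine is illustrative; the column numbering it happens to produce need not match Fig. 2 exactly.

    # Sketch of an array-based trie in the style of Fig. 2.
    EOW = '#'  # stand-in for the "end of word" character

    def build_trie(words):
        columns = [dict()]                      # column 1 is columns[0]
        def key(word, depth):
            return word[depth] if depth < len(word) else EOW
        def fill(column, entries, depth):
            groups = {}
            for idx, word in entries:
                groups.setdefault(key(word, depth), []).append((idx, word))
            for ch, group in groups.items():
                if len(group) == 1:
                    column[ch] = group[0][0]    # unique: point at the word list
                else:
                    columns.append(dict())      # shared prefix: add a new column
                    column[ch] = -len(columns)  # negative = next column number
                    fill(columns[-1], group, depth + 1)
        fill(columns[0], list(enumerate(words, start=1)), 0)
        return columns

    def trie_lookup(columns, word):
        col = 1
        for depth in range(len(word) + 1):
            ch = word[depth] if depth < len(word) else EOW
            value = columns[col - 1][ch]
            if value > 0:
                return value                    # position in the word list
            col = -value                        # follow pointer to next column
        raise KeyError(word)

    words = ["CAT", "COP", "CORN", "CORK", "DOG", "FLAG", "FROG"]
    trie = build_trie(words)
    print(trie_lookup(trie, "DOG"))   # -> 5, found from the first letter alone
    print(trie_lookup(trie, "FLAG"))  # -> 6, found after two letters

As in the worked examples above, DOG is resolved from its first letter, while FLAG requires a second column to be consulted.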
VI. COMPRESSION OF THE TRIE ARRAY

If the trie array were stored in memory, it could provide a simple scheme for perfect hashing that is superior in terms of retrieval CPU time to other perfect hashing algorithms [22], [9]. Unfortunately, the memory utilization in the trie array shown in Fig. 2 is only 8%, and this inefficiency makes the algorithm impractical in most cases involving large word sets. One way to make the use of a trie practical for perfect hashing is to compress the trie array to improve its memory utilization. A simple and efficient O(r^2) algorithm, described in [4] as the SMP algorithm and in [24] as the double displacement method, can be used here to compress the trie.

The array to be compressed is separated into its individual columns (or rows). The columns are rearranged based on the number of elements in each column so that the column containing the most elements is at the top of the sorted list. The columns are then copied into a 1-D array, starting with the column containing the most elements. The relationship between the elements in each column is preserved as the column is copied into the 1-D array. A column lookup table (CLT) is maintained to remember the starting location of each column copied into the 1-D array. Each column is placed into the 1-D array at the first available position that causes no collisions between the filled elements of the new column being placed and the already filled elements of the 1-D array. A pseudocode version of this algorithm, as it would be applied to tries, is shown in Fig. 3.

Fig. 3. Pseudocode of the compression algorithm as applied to tries:
    - Build the 2-D trie array.
    - Sort the columns of the 2-D array. The columns with the most entries should be at the beginning of the sorted list, and the other columns should fill the list in decreasing order.
    - Loop through all columns containing one or more values:
        - Attempt to place the current column at location one of the 1-D array.
        - While the new column's values collide with values already in the 1-D array, move the column down one position in the 1-D array and check for collisions again.
        - Once a noncolliding position for the new column is found, place the column in the 1-D array and update the CLT.
    - Store the final CLT formed.

Applying this sparse array packing algorithm to the trie array shown in Fig. 2 results in the 1-D array shown in Fig. 4. The CLT in Fig. 4 indicates the starting location of each column that has been packed into the 1-D array. To retrieve the word FLAG using the setup shown in Fig. 4, the first letter of the word FLAG (F) is used to index into column 1 of the trie array. To find column 1 in the packed trie array, the CLT is used. The value for column 1 in the CLT is -2. This value is added to 6 (since F is the 6th letter of the alphabet) to yield 4. In location 4 of the 1-D array is the value -5. The -5 indicates, as it did in a normal trie array, that the second letter of the word FLAG should be used as an index into the 5th column of the trie array. The letter L acts as the index into the 5th column of the array. The starting location for column 5 is retrieved from the CLT, where the value -4 is found. This -4 is added to 12 (since L is the 12th letter of the alphabet) to yield the index 8 into the 1-D array. The value at location 8 of the 1-D array is 6. As in the unpacked trie array, the 6 indicates that the word FLAG can be found at location 6 in the word list.

The CLT can also be eliminated to improve space efficiency. First, the values in the CLT are all increased by 11 to remove all negative values from the CLT, and the indexing of the 1-D array is modified accordingly, as shown in Fig. 5.

Fig. 5. The packed trie with the CLT modified to contain all positive values.

The technique for retrieving values from the array shown in Fig. 5 is identical to the technique used in Fig. 4. The CLT can then be eliminated by remembering the starting location for column 1 (since all words must index through column 1) and replacing the negative values in the 1-D array with their CLT values, as shown in Fig. 6.

Fig. 6. The packed trie with the CLT incorporated into the trie and eliminated.

Using the arrangement shown in Fig. 6, words can be retrieved in a manner very similar to the technique used for Figs. 4 and 5. For example, to retrieve the word FLAG, start by adding 9 (the starting location of column 1) to 6 (F is the first letter in FLAG, and F is the 6th letter of the alphabet): 9 + 6 = 15. Use 15 as an index into the 1-D array and retrieve the value -7. The fact that -7 is negative indicates, as usual, that the second letter of FLAG should be used. L is the 12th letter of the alphabet. Add the absolute value of -7 to 12: ABS(-7) + 12 = 19. Using 19 as an index into the 1-D array yields the value 6. Since 6 is positive, it is used as an index into the word list of Fig. 2, where the word FLAG is found.
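The sketch below, continuing the trie sketch given earlier (it reuses EOW and the column representation produced by build_trie), implements the packing loop of Fig. 3 and the corresponding retrieval against the packed 1-D array and CLT, i.e., the Fig. 4 arrangement before the CLT is folded into the array. The offsets it chooses, and therefore the exact packed layout, will in general differ from the figures.

    # Sketch of the Fig. 3 packing algorithm and of retrieval from the
    # packed 1-D array via the column lookup table (CLT).

    def slot(ch):
        # Position of a character within a column; EOW gets slot 27 here.
        return 27 if ch == EOW else ord(ch) - ord('A') + 1

    def pack_trie(columns):
        packed = {}                                   # the 1-D array (sparse dict)
        clt = {}                                      # column number -> offset
        order = sorted(range(1, len(columns) + 1),
                       key=lambda c: len(columns[c - 1]), reverse=True)
        for c in order:                               # largest columns first
            entries = [(slot(ch), v) for ch, v in columns[c - 1].items()]
            offset = 0
            while any(offset + s in packed for s, _ in entries):
                offset += 1                           # slide down until no collision
            for s, v in entries:
                packed[offset + s] = v
            clt[c] = offset
        return packed, clt

    def packed_lookup(packed, clt, word):
        col = 1
        for depth in range(len(word) + 1):
            ch = word[depth] if depth < len(word) else EOW
            value = packed[clt[col] + slot(ch)]
            if value > 0:
                return value
            col = -value
        raise KeyError(word)

    packed, clt = pack_trie(trie)
    print(packed_lookup(packed, clt, "FLAG"))   # -> 6, as in the worked example

Folding the CLT into the array, as in Fig. 6, simply replaces each negative "next column" entry with the negated starting offset of that column, so that only the starting location of column 1 needs to be remembered.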
Once the array shown in Fig. 6 has been created, the memory efficiency of the trie array has been improved to 64%, while retrieval times are almost as efficient as those for the unpacked trie array (an extra addition operation is needed for each access to the 1-D array). The packed trie's memory utilization can be improved to 100% using the MSMP sparse array packing technique described in [4]. This algorithm adds a MOD function to retrievals in the 1-D array to eliminate unused gaps in the array. This extra step is not necessary, however. On large word lists (any list of over 100 words), the packing algorithm shown in Figs. 4 through 6 has yielded utilizations of 100% in every test case that has been attempted.

This phenomenon occurs because of the way the 2-D trie is constructed and packed into the 1-D array. For example, on a 3000-word list used in testing, the 2-D trie had 2122 columns. Each column contains 128 values (one value for each of the 128 ASCII codes). In this 2-D array, 5121 positions contain values. The 1-D array will therefore contain 5121 positions if 100% memory utilization is achieved. It can be seen that each column placed into the 1-D array is very small in relation to the total 1-D array size. Examination of the composition of the 2-D trie shows that a majority of the columns contain only one or two values: 911 columns contain two values, while 680 columns contain one value. This column distribution occurs because many words must be separated from their longer forms in the trie. For example, if the words MARSHA, MARSHALL, and MARSHALLS occur in the list, two single-value columns (each containing an L) will be created in the trie to reach the point where MARSHALL and MARSHALLS can be distinguished. Then a two-value column (containing an S and an end-of-word mark) will be created to perform the actual distinction between MARSHALL and MARSHALLS. The columns containing two values pack very densely. The columns that contain only one value can be placed anywhere in the 1-D array, so they are used by the algorithm to fill all unused gaps left in the 1-D array. The result has been 100% memory utilization in the 1-D array for every test case that has been attempted. The general organization of a trie does not guarantee 100% utilization of memory when the trie is packed, but it makes it extremely likely.

VII. RESULTS

Table III shows the amount of time and space required to form ordered minimal perfect hashing functions using the compressed trie algorithm described above. The final entry in this table shows the performance of the algorithm when it is used to form an ordered minimal perfect hashing function for the entire Unix word list (including such entries as 10th, I'll, and electroencephalography) normally found in the file /usr/dict/words. This word list contains letters (both upper and lower case), digits, and punctuation. To handle the entire list, the columns of the trie were increased to 128 locations to accommodate the entire ASCII character set.
TABLE III
TOTAL TIME REQUIRED TO CREATE ORDERED MINIMAL PERFECT HASHING FUNCTIONS FOR WORD LISTS OF THE SIZE INDICATED

    # Words   Formation Time (s)   Packing Time (s)   Total Time (s)   # Columns   # Values
     1000            30                   31                 61             795        1794
     1500            45                   52                 97            1223        2722
     3000            88                  128                216            2122        5121
     6000           183                  360                543            4250       10249
    12000           385                 1141               1526            8592       20591
    24481           789                 4231               5020           16962       41442

In Table III, word lists of the size indicated were taken from the standard Unix word list (e.g., the first 1000 words were taken for the 1000-word test). The Formation Time column indicates the amount of time, in seconds, taken to form the 2-D trie. The Packing Time column indicates the amount of time required to form the packed 1-D array. Total Time is the sum of formation time and packing time. # Columns indicates the number of columns in the 2-D array formed, while # Values indicates the number of filled values in the 2-D array. This column also indicates the size of the 1-D array formed, because in all cases the 2-D array packed perfectly into the 1-D array, with no unused values present in the 1-D array. This table was formed using one node of a Sequent Balance 8000 (each node uses a 12-MHz 32032 processor, roughly equivalent to a Macintosh II computer with a 68020 at 16 MHz).

Determining the computational complexity of the trie packing algorithm is straightforward. It can be seen in Table III that the size of the 2-D trie, both in terms of the number of columns and in terms of the number of values held in the trie, is directly proportional to the number of words. Call the total number of elements held in the 2-D trie r, and assume in the worst case that there are r columns in the 2-D trie, each containing one element. Because the 2-D trie contains r values, a minimally packed 1-D array containing the contents of this 2-D trie will also require r elements. As each column is packed, every element in the column must be examined to determine whether it collides with any existing values in the 1-D array. Also, in the worst case, there are r possible positions for each column to occupy in the 1-D array. Since each column can be considered to contain a single element, and since in the worst case r positions for this column in the 1-D array must be examined, r comparisons must be made. This same operation must be performed for all r columns in the 2-D trie. Therefore, r^2 comparisons must be performed to pack the 2-D trie, and the algorithm is O(r^2) in the worst case. The O(r^2) nature of this algorithm can also be seen in the Total Time column of Table III. The time required to construct the trie to be packed is O(n), as can be seen empirically in Table III. This, like the packing of the trie, is a one-time cost. For more information on trie construction, see [25].
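Stated compactly, with r denoting the total number of values held in the 2-D trie as above, the worst-case argument is simply

    \text{comparisons} \;\le\; \sum_{i=1}^{r} (\text{positions tried for column } i) \;\le\; r \cdot r \;=\; r^{2},

so the packing step is O(r^2).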
Table IV compares the new algorithm with other current minimal perfect hashing algorithms. It can be seen that the new algorithm can be used to create perfect hashing functions for extremely large word lists. It should also be noted that the packed trie technique is the only algorithm in Table IV that forms an ordered perfect hashing function, and it is also the only algorithm that is immune to pattern collisions.

TABLE IV
COMPARISON OF THIS ALGORITHM TO OTHERS

    Algorithm     Reference   Build Order   List Size   CPU Time (s)**   Machine      Space (bytes/entry)
    Cichelli      [6]         O(c^r)             40            35        --             0.65
    Karplus       [15]        O(r^1.5)*         667           150        IBM 4341       N.A.
    Sager         [22]        O(r^6)            256            45        IBM 4341       4.0
    Fox           [9]         O(r^3)           1000           600        Mac II         4.0
    Brain         [3]         O(r^2)*          1696           733        PS/2/50        2.4
    MSMP          [4]         O(r^2)           5000           283        PS/2/70        2.0
    Packed trie   --          O(r^2)          24481          5020        SB 8000        3.4

    N.A. = Not available.
    *Derived from experimental results rather than theoretical analysis.
    **These times are included for completeness rather than comparison purposes, although the machines listed are all roughly equivalent to one another in performance. The times demonstrate that each of the algorithms listed can complete its task in a reasonable amount of time.

In Table IV, the Space column indicates the amount of memory space (in bytes) required to hold the table used by the given algorithm, per item stored in the item list. The space requirement for the packed trie algorithm was determined by dividing the total number of elements in the packed trie (at 2 bytes per element) by the total number of words recognizable by the packed trie for the 24 481-word trie (see Table III): 41 442 * 2 / 24 481 = 3.4. The elements in the packed trie will be stored in memory and used by the perfect hashing function during retrieval. This process is typical of all of the perfect hashing functions shown in Table IV, including Cichelli's algorithm and Sager's algorithm. It is also important to recognize that this table of elements is not a hash table. It is a unique table of values formed for the data set, and it allows every item in the data set to be accessed on the disk using a single probe of the storage medium.

Normal hashing algorithms would have poor performance compared to the implementation shown here. The problem involved in using a normal hash function in a perfect hashing application can be clarified with an example. Double hashing is a well-understood and effective hashing algorithm that could be used to allow word searches equivalent to the perfect hashing function's. One of two techniques might be used. In the first, a table of pointers is held in memory, as shown in Fig. 7.

Fig. 7. A comparison of the way a perfect hashing function and a normal hash function use an in-memory table.

To look up a word, the word is hashed and the pointer value held at the hash location in memory is used to probe the disk. This approach would yield an in-memory table of the same size as the trie, but because of collisions in the hash table, multiple disk probes would be required for retrieval of some data items. The second way to use double hashing would be to place all of the keys in an in-memory table so that multiple disk probes are not required. This approach is shown in Fig. 8.

Fig. 8. Double hashing used with an in-memory table of words.

Double hashing is used to find the word in the table. Once the word is found, the associated pointer is used to access the word on the disk in one probe. While this would yield the desired disk probe depth of 1, it requires a great deal of memory space for the table. Our test word list had a maximum word size of 22 characters, and each entry also requires a 2-byte pointer onto the disk. For a table of 24 481 words, assuming the best-case scenario of a 100% packing factor, 24 481 * 24 = 587 544 bytes are required to hold the table. Using a compressed trie requires only 82 884 bytes. Either way, double hashing does not compare well to the trie approach.
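The memory comparison is simple arithmetic; a back-of-the-envelope sketch using the 22-character maximum word length and 2-byte entries quoted above is:

    # Rough memory comparison for the 24481-word Unix list.
    n_words = 24481
    key_table = n_words * (22 + 2)    # double hashing: 22-char key + 2-byte pointer
    trie_table = 41442 * 2            # packed trie: 41442 entries at 2 bytes each
    print(key_table)    # 587544 bytes for the in-memory key table
    print(trie_table)   # 82884 bytes for the packed trie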
It can be seen in Table IV that the new algorithm is superior to all other algorithms currently in use in terms of build time, with the possible exception of the MSMP algorithm. The MSMP algorithm suffers only in its vulnerability to pattern collisions. Forming the array-based trie is also significantly faster than Sager's algorithm [22]. Fox's algorithm [9] (an improvement of Sager's) requires 600 s of Macintosh II CPU time to process 1000 words. The algorithm described here handles 1000-word lists in 61 s on a machine of similar computing power.

Another benefit of the new algorithm is its superior retrieval performance. As already discussed, retrieval of the word CORN using Sager's algorithm requires that seven characters be indexed from the word, along with 11 additions, three MOD operations, and three array indexings. The worst-case situation for a memory-based packed trie of any size using the word CORN would require that five letters be indexed from the word, that the 1-D array be indexed five times, and that five extra additions be performed for the indexings. It can be seen that a packed trie is superior to Sager's algorithm in both space requirements and retrieval time. Even for long words, however, it must be recognized that the calculations performed by the trie algorithm or Sager's algorithm are considered insignificant when compared to the time required for a disk access. The fact that these algorithms can access any data item in a single disk probe once these calculations have been performed is their desirable characteristic.

VIII. EXTENSIONS OF THE NEW ALGORITHM

The new algorithm offers a number of advantages over other current algorithms as long as the packed trie can be held in memory. On extremely long word lists (e.g., a 100 000-word list found in a medical dictionary located on a CD-ROM), however, it may be infeasible to store the entire packed trie in memory. This would cause multiple reads to be performed on the packed trie stored on disk (a maximum of five reads, as already discussed in the case of the word CORN), and these multiple reads would severely degrade performance. To solve this problem, the word list can be segmented as discussed in [9]. For example, a 100 000-word list could be subdivided into 26 separate lists using the first character of each word to determine the contents of each sublist (similarly, the first two characters could be used to subdivide the list into 676 sublists, etc.). To retrieve a word, the word would first be examined to determine which subtrie is appropriate, and that subtrie would be loaded into memory in a single read operation. The normal retrieval process would then continue using the subtrie stored in memory. Since individual sectors on CD-ROMs are relatively large (2 Kbytes [27]), a significant number of words can be stored in each subtrie.

IX. CONCLUSION

A packed trie data structure can be used to create extremely fast and space-efficient ordered minimal perfect hashing functions for extremely large word lists. The performance of the algorithm, both in terms of access and creation times, is superior to that of other perfect hashing functions currently available.

REFERENCES

[1] R. Bayer and E. McCreight, "Organization and maintenance of large ordered indexes," Acta Inform., vol. 1, no. 3, pp. 173-189, 1972.
[2] F. Berman, M. E. Bock, E. Dittert, M. J. O'Donnell, and D. Plank, "Collections of functions for perfect hashing," SIAM J. Comput., vol. 15, no. 2, pp. 604-618, May 1986.
[3] M. Brain and A. Tharp, "Near-perfect hashing of large word sets," Software Practice and Experience, vol. 19, no. 10, pp. 967-978, Oct. 1989.
[4] M. Brain and A. Tharp, "Perfect hashing using sparse matrix packing," Inform. Syst., vol. 15, no. 3, pp. 281-290, Fall 1990.
[5] N. Cercone, M. Krause, and J. Boates, "Minimal and almost minimal perfect hash function search with application to natural language lexicon design," Comput. Math. Appl., vol. 9, no. 1, pp. 215-231.
[6] R. J. Cichelli, "Minimal perfect hashing made simple," Commun. Ass. Comput. Mach., vol. 23, no. 1, pp. 17-19, Jan. 1980.
[7] C. C. Chang, "The study of an ordered minimal perfect hashing scheme," Commun. Ass. Comput. Mach., vol. 27, no. 4, pp. 384-387, Apr. 1984.
[8] M. W. Du, T. M. Hsieh, K. F. Jea, and D. W. Shieh, "The study of a new perfect hash scheme," IEEE Trans. Software Eng., vol. 9, no. 3, pp. 305-313, May 1983.
[9] E. A. Fox, Q. F. Chen, L. S. Heath, and S. Datta, "A more cost effective algorithm for finding minimal perfect hash functions," in Proc. ACM Conf., 1989, pp. 114-122.
[10] E. A. Fox, L. S. Heath, Q. F. Chen, and A. M. Daoud, "Practical minimal perfect hash functions for large databases," Commun. Ass. Comput. Mach., vol. 35, no. 1, pp. 105-121, Jan. 1992.
[11] E. Fredkin, "Trie memory," Commun. Ass. Comput. Mach., vol. 3, no. 9, pp. 490-500, Sept. 1960.
[12] M. L. Fredman and J. Komlos, "Storing a sparse table with O(1) worst case access time," J. Ass. Comput. Mach., vol. 31, no. 3, pp. 538-544, July 1984.
[13] Y. S. Hsiao and A. L. Tharp, "Adaptive hashing," Inform. Syst., vol. 13, no. 1, pp. 111-127, 1988.
[14] G. Jaeschke, "Reciprocal hashing: A method for generating minimal perfect hash functions," Commun. Ass. Comput. Mach., vol. 24, no. 12, pp. 829-833, Dec. 1981.
[15] K. Karplus and G. Haggard, "Finding minimal perfect hash functions," Comput. Sci. Dept., Cornell Univ., Ithaca, NY, Tech. Rep. TR 84-637, Sept. 1984.
[16] K. Karplus and G. Haggard, "Finding minimal perfect hash functions," in Proc. SIGCSE, 1986, pp. 191-193.
[17] D. E. Knuth, The Art of Computer Programming, vols. 1-3. Reading, MA: Addison-Wesley, 1968-1973, vol. 3, p. 506.
[18] P. Larson and A. Kajla, "File organization: Implementation of a method guaranteeing retrieval in one access," Commun. Ass. Comput. Mach., vol. 27, no. 7, pp. 670-677, July 1984.
[19] P. Larson and M. V. Ramakrishna, "External perfect hashing," in Proc. ACM SIGMOD Conf., 1985, pp. 190-199.
[20] T. Lewis and C. Cook, "Hashing for dynamic and static internal tables," Computer, pp. 45-56, Oct. 1988.
[21] M. V. Ramakrishna, "Perfect hashing for external files," Ph.D. dissertation, Univ. of Waterloo, Waterloo, Ont., Canada, 1986.
[22] T. J. Sager, "A polynomial time generator for minimal perfect hash functions," Commun. Ass. Comput. Mach., vol. 28, no. 5, pp. 523-532, May 1985.
[23] R. Sprugnoli, "Perfect hashing functions: A single probe retrieval method for static sets," Commun. Ass. Comput. Mach., vol. 20, no. 11, pp. 841-850, Nov. 1977.
[24] R. E. Tarjan and A. C. Yao, "Storing a sparse table," Commun. Ass. Comput. Mach., vol. 22, no. 11, pp. 606-611, Nov. 1979.
[25] A. L. Tharp, File Organization and Processing. New York: Wiley, 1988.
[26] F. A. Williams, "Handling identifiers as internal symbols in language processors," Commun. Ass. Comput. Mach., vol. 2, no. 6, pp. 21-24, June 1959.
[27] B. Zoellick, "CD-ROM software development," Byte, vol. 11, no. 5, pp. 177-188, May 1986.
Marshall Brain received the B.S. degree in computer engineering from Rensselaer Polytechnic Institute, Troy, NY, and the M.S. degree in computer science from North Carolina State University, Raleigh, NC. He is a member of the Academy of Outstanding Teachers at North Carolina State University. He currently works for Interface Technologies, Inc., in Wake Forest, NC. Mr. Brain is the author of Motif Programming: The Essentials ... and More (Digital Press). He is the lead author of Prentice Hall's five-book Bridge series on Windows NT.

Alan L. Tharp received the B.S. degree in science engineering and the M.S. and Ph.D. degrees in computer science from Northwestern University, Evanston, IL. He is currently Head of the Computer Science Department at North Carolina State University, Raleigh, NC. His research interests include file and data structures, database management systems, and human-computer interfaces. Dr. Tharp is a member of the Association for Computing Machinery. He is also the author of File Organization and Processing (Wiley, 1988).