Tries & Suffix Trees

Storing Collections of Strings
• All our dictionary data structures can be used to store strings, of course.
• But strings have several properties that might make you want a different data structure:
  - Comparing strings can be slow (not constant time).
  - Strings can “overlap” or be similar: e.g. cat, car, cars, cart.
  - Special kinds of queries are important: e.g. “Does string S contain ‘marmoset’ as a substring?”

Tries
• A trie, pronounced “try”, is a tree that exploits some structure in the keys
  - e.g. looking at their individual characters
• Example keys: ace, add, added, cede, dad, deed
• Alphabet = {A, C, D, E}, plus a special $ “end of string” character
[Figure: trie for ace$, add$, added$, cede$, dad$ and deed$; every node holds one slot for each of A, C, D, E, $]

Saving Space #1
• Store nodes in a 2-d table.
• Table size = |Σ| × m, where |Σ| is the size of the alphabet and m is the number of nodes.
• Each entry contains the index of the node it “points to”.
• Each entry uses O(log m) bits instead of the size of a pointer (e.g. 32 or 64 bits).
• General trick: if you ensure nodes are contained in some region of memory of size M, an index of O(log M) bits can stand in for a full pointer.
[Figure: one table row with a column for each of A, C, D, E, $]

Saving Space #2
• At each node, replace the array with a linked list.
[Figure: an array node with slots A, C, D, E, $ next to the equivalent linked list A, D, $, NULL]
• The array at each node becomes a linked list.
• This saves space when the branching factor is low (you don’t need to store an entry for each character in the alphabet).
• Also imagine a hybrid method, using arrays at nodes with high branching factors.
• This structure is called a de la Briandais tree.

Saving Space #3
• A lot of nodes were non-discriminatory: they didn’t discriminate between two keys.
[Figure: the full trie for ace$, add$, added$, cede$, dad$ and deed$, showing the long chains of single-child nodes]
• Idea: only store the discriminatory nodes.

Saving Space #3 – Patricia Tries
• PATRICIA = Practical Algorithm to Retrieve Information Coded in Alphanumeric
• Same tree, but only storing the discriminatory nodes.
• BUT: now you have to store the index of the character position the node is testing (before, a node at depth d tested position d, but now that isn’t true: we can skip over positions).
[Figure: Patricia trie for the same six keys. The root tests position 1; under A, a node tests position 2 (ace$) and then a node tests position 4 (add$, added$); C leads directly to cede$; under D, a node tests position 2 (dad$, deed$)]
• Also: now you must CHECK whether a leaf you reach matches the query. E.g. what if we searched for cedar? The search reaches the leaf cede$ after testing only position 1, so the remaining characters still have to be compared.

Suffix Tree Variant
• Example string: abbracadabbra$
• Suffix i is the part of the string from position i to the end:
   1  abbracadabbra$
   2  bbracadabbra$
   3  bracadabbra$
   4  racadabbra$
   5  acadabbra$
   6  cadabbra$
   7  adabbra$
   8  dabbra$
   9  abbra$
  10  bbra$
  11  bra$
  12  ra$
  13  a$
  14  $
• Suffix identifier = the shortest prefix of the suffix that is unique within the string.
• Every suffix has an identifier because of the $ character.
• Suffix tree variant: store the n+1 suffix identifiers in a trie.
[Figure: trie of the suffix identifiers of abbracadabbra$, with leaves numbered 1 to 14]
• Store the suffix number (start position of the suffix) at each leaf.

Searching
[Figure: the same trie, with the path for abbr highlighted]
• Search for abbr: follow the path for abbr. The leaves under the node where the path ends correspond to the locations of abbr (here, suffixes 1 and 9).

Space usage:
• Use Patricia trees to store the suffix tree.
• Every internal node is at least a binary split.
• The number of leaves is O(n) [one leaf for each position in the string].
• Therefore, the number of internal nodes is less than the number of leaves.
• Hence, linear space.

Topics for Final
• Pre-midterm:
  - C++, stacks, queues, deques, graphs, BFS, DFS, sparse matrices, self-organizing lists, trees, threaded trees, tree traversals, bitvectors for sets, full binary tree theorem, array implementation of trees, general trees as binary trees
  - B-trees, AVL-trees, splay trees, heaps (leftist, skew), skip lists
• Post-midterm:
  - PR, MX, point quadtrees
  - kd-trees, range searching
  - incremental nearest neighbor
  - 2-d range trees, “1-d range trees”, fractional cascading
  - Interval trees
  - Segment trees
  - Binary Space Partitions (painter’s algorithm, BSP trees)
  - Doubly-connected edge lists
  - Grid files
  - tries, Patricia trees, de la Briandais trees, O(log m) pointer space, suffix trees
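
As a review aid for the “O(log m) pointer space” item, the 2-d table layout from Saving Space #1 might be sketched as follows. This is only an illustration under stated assumptions, not the course's implementation: the class name `TableTrie`, the lowercase alphabet, and the choice of index 0 as the “no child” value (safe because no entry ever points back to the root) are all made up for the example.

```python
ALPHABET = "acde$"                          # slide alphabet in lowercase; "$" ends a key
COL = {ch: i for i, ch in enumerate(ALPHABET)}
NONE = 0                                    # row 0 is the root, so 0 can mean "no child"


class TableTrie:
    """Trie stored as a table: one row per node, one column per symbol."""

    def __init__(self):
        self.table = [[NONE] * len(ALPHABET)]   # start with the root row only

    def insert(self, key):
        """Insert key; its characters must all come from ALPHABET."""
        node = 0
        for ch in key + "$":
            col = COL[ch]
            if self.table[node][col] == NONE:
                # Allocate a new row and store its index in the parent entry.
                self.table.append([NONE] * len(ALPHABET))
                self.table[node][col] = len(self.table) - 1
            node = self.table[node][col]

    def contains(self, key):
        """Return True if key was inserted as a complete string."""
        node = 0
        for ch in key + "$":
            col = COL.get(ch)
            if col is None:                 # symbol outside the alphabet
                return False
            node = self.table[node][col]
            if node == NONE:
                return False
        return True


trie = TableTrie()
for word in ["ace", "add", "added", "cede", "dad", "deed"]:
    trie.insert(word)

rows = len(trie.table)                      # number of nodes m
bits_per_entry = (rows - 1).bit_length()    # bits needed to name any row index
```

Python lists do not actually pack entries into `bits_per_entry` bits, so the sketch shows only the indexing scheme; a packed array would be needed to realize the space saving.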
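
For the suffix tree item, the search described on the Searching slide can be tried out with a deliberately simplified sketch. It is an assumption-laden illustration rather than the structure from the slides: it stores every suffix in full in a plain trie of nested dictionaries (no suffix identifiers, no Patricia compression), so it uses quadratic rather than linear space. The function names `build_suffix_trie` and `find_all` and the `"leaf"` key are hypothetical.

```python
def build_suffix_trie(text):
    """Build a naive trie of all suffixes of text + "$".

    Each node is a dict mapping a character to a child node. The node at the
    end of suffix i also stores i (1-based start position) under the key
    "leaf", which cannot clash with the single-character edge labels.
    """
    s = text + "$"
    root = {}
    for start in range(len(s)):
        node = root
        for ch in s[start:]:
            node = node.setdefault(ch, {})
        node["leaf"] = start + 1
    return root


def find_all(root, pattern):
    """Return the sorted 1-based start positions of pattern in the text."""
    # Follow the path spelled out by the pattern.
    node = root
    for ch in pattern:
        if ch not in node:
            return []
        node = node[ch]
    # Every leaf below the end node is one occurrence.
    positions = []
    stack = [node]
    while stack:
        current = stack.pop()
        for label, child in current.items():
            if label == "leaf":
                positions.append(child)
            else:
                stack.append(child)
    return sorted(positions)


suffix_trie = build_suffix_trie("abbracadabbra")
matches = find_all(suffix_trie, "abbr")     # leaves under the end of the abbr path
```

Here `matches` holds the suffix numbers 1 and 9, the same leaves the slide's example reaches.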