Unique Binary Search Tree Representations and Equality

Unique Binary Search Tree Representations and
Equality-testing of Sets and Sequences
Robert E. Tarjan t
Rajamani Sundar*
New York University
Princeton University
and
NEC Research Institute
Abstract
operations of the abstract data type can be performed
efficiently. More concretely, consider the problem of
uniquely representing a dictionary over an ordered universe by a binary search tree. The cost of searching for
an item in the dictionary equals its depth in the corresponding binary search tree. The cost of performing an
update operation (insert or delete) on the dictionary is
the time needed to transform the tree corresponding to
the initial set into the tree corresponding to the final
set. Let T(n) denote the worst-case cost of performing a dictionary operation on an n-element.dictionary
under this representation. The problem is to devise a
representation that minimizes T(n) for all values of n.
Given an ordered universe U, we study the problem
of representing each subset of U by a unique binary
search tree so that dictionary operations can be performed efficiently. While efficient randomized solutions
to the problem are known, its deterministic complexity
has remained unexplored. We exhibit representations
that permit the execution of dictionary operations in
optimal deterministic time when the dictionary is sufficiently sparse or sufficiently dense. Our results demonstrate an exponential separation between the deterministic and randomized complexities of the problem.
We apply unique representations to obtain efficient data structures for maintaining a collection of
sets/sequences under queries that test the equality
of a pair of objects.
Our data structure for set
equality-testing is based on an efficient implementation
of cascades of CONS operations on uniquely stored Sexpressions.
1
The unique representation problem arises in the contexts of incremental evaluation and implementation of
high-level programming languages. Incremental evaluation [13,14] is the technique of efficiently updating the
value of a function when the input changes. A simple
idea to speed up incremental evaluation is to remember
the results of previous function calls and thereby avoid
recomputation. If the input domain is uniquely represented and input data objects are stored uniquely,
then it becomes easy to check if the function has already been evaluated on a given input. In this context,
unique storage of data objects means that instances of
the same data object created at different times are all
represented physically by the same storage structure.
Modern programming languages such as S E T L support high level data types such as sets and sequences
and permit testing whether two such objects are equal.
Equality-testing can be implemented in constant time
by devising a unique representation for the data type
and storing concrete data values uniquely.
Introduction
The unique representation problem for an abstract data
type (e.g. a dictionary) is to design a unique representation for values of the abstract data type by values of
a concrete data type (e.g. a binary search tree) so that
*Research at BeUcore, DIMACS (NSF Science a n d Technology
center for Discrete M a t h e m a t i c s a n d Theoretical Computer Science), a n d Courant I n s t i t u t e partially supported by a n IBM graduate fellowship a n d NSF grants CCR-S702271, CCR-S902221,
CCR-8906949, a n d STCSS-09648, a n d ONR grant N00014-85-K0046.
tResearch at P r i n c e t o n University partially supported by
NSF grants DCR-8605962, a n d STC88-09648, a n d ONR grant
N00014-87-K-0467.
Before proceeding further, we define some basic concepts. A dictionary is a set of items selected from a
totally ordered universe on which membership queries
and update operations that insert or delete items can
be performed. We can represent a dictionary by a binary tree containing one item per node, with the items
arranged in symmetric order: each item in the tree is
larger than the items in its left subtree and smaller
than the items in its right subtree. This data structure
Permission to copy without fee all or part of this material is granted provided that the copies are not made or distributed for direct commercial
advantage, the ACM copyright notice and the title of the publication and
its date appear, and notice is given that copying is by permission of the
Association for Computing Machinery. To copy otherwise, or to republish,
requires a fee and/or specific permission.
© 1990 ACM 089791-361-2/90/0005/0018
$1.50
18
For each updating method, we seek representations
that permit efficient implementation of the dictionary
using that method. Since 2n - 2 rotations suffice to
transform any n-node binary tree into any other nnode binary tree (see [6,16]), any unique representation supports dictionary operations in O(n) time using
rotations. We provide a representation that supports
dictionary operations in o(vrff) time using CONS operations.
is called a binary search tree. A binary search tree is
represented in storage as a collection of records, one per
node, with a pointer to the record corresponding to the
root. Each record has a key field, left and right child
pointers, and a bounded number of additional information and pointer fields. A rotation of an edge Ix, p] in a
binary tree, where p is the parent of x, is a transformation that makes x the parent of p by transferring one
of the subtrees of x to p.
Now, we survey previous work on unique representations. Snyder [17] considers the problem of representing subsets of an ordered universe by tree-based search
structures so that all sets of equal cardinality have the
same underlying tree. For this class of representations
he showed that e(Vfff) time is necessary and sufficient
to perform a dictionary operation, where n is the dictionary size. In his model, an update operation on a
tree is implemented by changing the pointers of a set
of nodes. Munro and Suwanda [12] examine how to
implicitly represent subsets of an ordered universe by
enforcing, for each set size, a unique partial order on
the storage locations. They showed that o ( v r n ) time
is necessary and sufficient to perform dictionary operations using such a representation. Recent work has
focussed on randomized solutions to the unique representation problem. Pugh [13] and Pugh and Tietelbaum [14] have devised efficient randomized representations for sets, sequences and other abstract data types.
Aragon and Seidel [2] give a randomized binary search
tree representation of a dictionary that has the unique
representability property and allows dictionary operations to be performed in O(log n) randomized time.
In this paper we study the deterministic complexity
of uniquely representing a dictionary by a binary search
tree. We consider the following ways of updating a
binary search tree during an insertion or deletion:
We prove lower bounds on the complexity of any
unique representation that are valid if the dictionary is
sufficiently sparse (i.e., if n, the dictionary size, is small
in relation to [U[). The lower bounds are derived for
the following cost metric. The cost of an update using
one of the above methods is, respectively, the number
of rotations or the number of CONS operations. The
cost of searching for an item in a binary search tree is
1 plus the depth of the item in the tree. Under the
assumption that the dictionary is sparse, we show that
the preceding upper bounds are optimal. In the light
of Aragon and Seidel's representation, this implies an
exponential separation between the deterministic and
randomized complexities of the problem. The sparseness assumption in the lower bounds is essential since
we construct a representation that requires O(log IVl)
time per dictionary operation using any update strategy. This representation is optimal on dense dictionaries.
We apply unique representations to the problems of
equality-testing of sets and sequences:
S e q u e n c e e q u a l i t y - t e s t i n g p r o b l e m : Maintain a
collection of sequences over an ordered universe under the operations i) EQUAL(S, T) - Return true if sequences S and T are equal, ii) INSERT(S,i,x,T) - Create
a new sequence T by inserting element x into sequence
S between positions i - 1 and i, and iii) DELETE(S, i, T)
- Create a new sequence T by deleting the ith element
of sequence S. Initially, the collection consists of only
the null sequence.
. Performing a sequence of rotations: We update
the tree using the traditional algorithms for binary search tree insertion and deletion [10], but
permit arbitrary rotations to be performed on the
tree before and after so that the final tree correctly
represents the updated dictionary.
S e t e q u a l i t y - t e s t l n g p r o b l e m : Maintain a collection
of sets over an ordered universe under the operations
EQUAL(S, T), INSERT(S,X, T), and DELETE(S,x,T),
defined in the obvious fashion. Initially, the collection
consists of only the empty set.
. Nondestructive updating through CONS operations: We perform a sequence of dictionary operations on the null tree, maintaining all the trees
created so far and their subtrees. An update operation constructs a new tree by accessing some
of its subtrees from the existing collection of trees
and creating the remaining subtrees from the accessed subtrees, proceeding b o t t o m up. Also, an
update operation may create any number of additional trees that may be needed in the future.
A CONS operation is used to create a new tree in
constant time, given its left and right subtrees and
root item. This operation parallels the CONS operation on S-expressions occurring in LISP [1].
For both problems, we are interested only in data structures that test equality in constant time. Carter and
Wegman [18] proposed a randomized signature-based
scheme for set equality-testing that requires only constant time per operation but errs with a small probability. Pugh [13] and Pugh and Tietelbaum [14] gave
randomized solutions to both problems that require
O(logn) expected time and O(logn) expected space
per update operation, where n denotes the size of the
set or sequence. Their data structures also support
19
Sequence equality-testing can be solved in O(n +
logrn) amortized time and O(1) space per update operation by maintaining sequences in a lexicographic
search tree [15]. Here, n denotes the length of the
updated sequence and m denotes the total number
of updates.
Our solution to this problem requires
O(v~ff~+
log m) amortized time and O(V~) amortized space per update operation. When sequences are
free from repetitions, sequence equality-testing is equivalent to set equality-testing. This follows from the representation of sets by sorted sequences and the representation of sequences by sets of ordered pairs of adjacent elements• Therefore, sequence equality-testing
becomes truly hard only when sequences have repetitions.
The paper is organized as follows. Section 2 describes
the results on unique binary search tree representations.
Sections 3 and 4, respectively, describe the data structures for set and sequence equality-testing. Section 5
poses open problems.
more powerful operations (for instance, union and intersection of sets) efficiently. For sequence equalitytesting, however, the logarithmic bound is valid only if
the sequences do not contain repeated elements. Yellin
[21] described two deterministic solutions for equalitytesting of sets, both of which use prohibitively large
amounts of space.
We provide a data structure for set equality-testing
with O(log m) amortized time and O(log m) space per
update, where m denotes the number of updates. This
is achieved by developing an efficient implementation
of cascades of CONS operations on uniquely stored Sexpressions. An S-expression [1] is either an atom or
a pair of S-expressions. An atom is represented by a
node in storage and a pair is represented by a node with
left and right pointers that point to nodes representing component S-expressions. We store S-expressions
uniquely, i.e. all instances of an S-expression are represented by a single node. CONS(Sl,S2) returns the Sexpression (sl.s2). A cascade of CONS operations is
a sequence of CONS operations such that the result of
each CONS operation is an input to the next CONS operation. For instance,
sl
:= CoNs(s0,t0)
S2
::
CONS(Sl,tl)
sy
:=
CONS(Sy-I,t]_I)
2
2.1
tree
:= CoNs(vk,[wl,w2,...,wz])
:= CONS(Vk_l,Sl)
Result
:=
An
optimal
representation
s p a r s e dictionaries
for
We describe a representation that allows search in
O(logn) time and permits updating the binary search
tree in O(x/'ff) time using CONS operations. We represent the dictionary by an almost complete binary search
tree. Specifically, if n = 211 + 2 i~ + • • • + 2/k , such that
il > i2 > . . . > ik, then the binary search tree representing an n-element dictionary comprises a b-node
right path, with a sequence of b complete binary trees
of respective sizes 211 - 1, 212 - 1 , . . . , 2 ik - 1 hanging off
of it. In order to be able to update the tree emciently,
we need to maintain some additional information. Define a j-run to be a subset of 2j - 1 adjacent elements
in the dictionary. For each j _< [log n/2], we maintain
complete binary trees representing the j-runs of the
dictionary, called the j-trees, in a sorted, doubly linked
list. In addition, when n has the form 2 2i - 2 i + k,
where 1 < k < 2 i, we also maintain the list of/-trees
(i = [log n/2] + 1) corresponding to the first k2 i/-runs.
When an insertion causes n to increase from 2 2i - 1 to
2 2i, this list becomes the list of [log n/2J-trees.
It is obvious that this representation supports search
in O(log n) time. To insert a new element z, we first update the lists of trees, proceeding b o t t o m up, and then
create the rest of the tree from trees in the updated
lists. Updating the list of j-trees, for any j, involves
Wl]) ==-
s2
,i
search
In this section, we describe the optimal representations
for sparse and dense dictionaries and establish the lower
bounds for representing sparse dictionaries.
is a cascade of f CONS operations.
Unique storage of S-expressions makes CONS operations e x p e n s i v e ~ e
CONS operations can be implemented in O(x/log IFI) time and O(1) amortized space,
where F denotes the total number of CONS operations
performed. This implementation is based on Willard's
data structure [20] for maintaining a dictionary in a
small universe. Universal hashing [3] and dynamic perfect hashing [7] offer alternate implementations that require O(1) randomized amortized time and O(1) amortized space per CONS operation.
We develop a data structure that performs a cascade of f CONS operations in O(f + log m) amortized
time, where rn denotes the total number of cascades
performed. When sets are represented by binary tries,
an update operation translates into a cascade of at most
log m CONS operations and requires O(logrn) amortized time using this data structure. Many list-oriented
functions in functional languages (LISP, for instance)
involve cascades of CONS operations and can be implemented efficiently using this method. Function APPEND is a typical example:
APPEND([Vl, V 2 , - ' - , i)k], [Wl, W2, • • . ,
Unique
binary
representations
CONS(Vl,Sk_l)
20
theorem is optimal, upto a constant factor. The Ramsey number Rn(b,, n + 1) is at most Tn+l(cn), for some
constant c. Therefore the trade-off is valid when the
dictionary size is at most (1 - o(1))log* [U[.
The next theorem gives the lower bound on the cost
of updating the tree using CONS operations.
deleting at most 2J - 2 j-trees corresponding to old jruns that contain the predecessor as well as the successor o f x and inserting in their place at most 2 J - 1 j-trees
corresponding to new j-runs that contain x. Creating
a new j-tree is accomplished in O(1) time by performing a CONS operation on two (j -- 1)-trees that are its
subtrees and its root item. In addition, if n is of the
form 2 2 i - 2 i + k , where 1 < k < 2 i, we need to extend the size of the list of/-trees from k2 i to (k + 1)2 i
by creating at most 2 i new /-trees. This shows that
the list of j-trees, for any j , can be updated in O(2 j)
time once the updated list of (j - 1)-trees is available.
The total cost of updating the lists of trees is at most
0 ( 2 + 22 + . . . + 2[l°gn/2J +1) = O(vfff). To create
the rest of the tree, we need to create the nodes of the
tree at levels [logn/2] + 1, [log n/2J + 2 , . . . , [log nJ + 1
and the nodes on the right path of the tree. These
nodes can be created bottom-up from the updated lists
of trees by performing o(vrff) CONS operations. It follows that the cost of an insertion is O(vfff). Deletion is
analogous.
2.2
Lower
ies
bounds
for sparse
T h e o r e m 2 Let n and m > 2n be positive integers and
let U be an ordered universe of size at least Rn(bn, m).
For any unique binary search tree representation of subsets of U, there is a sequence of m update operations
that involves only subsets of size at most n and causes
f~(mv/-ff) CONS operations to be performed.
We now prove the theorems.
P r o o f o f T h e o r e m 1. By Ramsey's theorem, there
is a subset S = { x l , x 2 , . . . , X , + l } of U all of whose
n-subsets are represented by a single underlying binary
tree, say B. We claim that at least n - 1 - 2d rotations
are necessary to transform set $1 = { x l , x 2 , . . . , x n }
into set $2 = {x2, x 3 , . . . , X n + l } . The claim is proved
using Wilber's method [19] of deriving lower bounds on
rotations in binary trees.
Consider an n-node binary tree T whose nodes are
labeled from 1 to n in symmetric order. An interval
[i, j] of nodes of T is a block of T. Any block [i, j] of
T induces a binary tree, called the block tree of block
[i, j], obtained by contracting all edges of T having an
endvertex outside the block. A rotation on T propagates into a rotation on a block tree of T only if both
nodes of the rotation lie in the block tree; otherwise
the block tree is unaffected. Let Train denote the binary tree obtained by deleting the minimum element
of T (equivalently, the block tree of block [2, n] in T).
Define Tmax similarly. Let drain(T) and dmax(T) denote,
respectively, the depths of the minimum and maximum
elements of T.
dictionar-
Our lower bounds are proved using Ramsey's theorem
[9,11].
Ramsey's theorem
Let n, k and s > n be arbitrary
positive integers and let U be an arbitrary set. There
exists a number R n ( k , s) with the following property: if
IUI > R.(&, s) then, for any partition of the n-subsets
of U into k classes, there is an s-subset S all of whose
n-subsets lie in a single class.
It is known that P ~ ( k , s ) = Tn+t(O(log(ks))) [9,11],
where Tn (x) is the tower function
2 2.
"'2~/ n.
L e m m a 1 For any binary tree T, at least ITI- 1 drain(T)- dm~(T) rotations are necessary to transform
Train into Tm~x-
First, we state the lower bound for updating the tree
by means of rotations. Let bn = (2nn)/(n+ 1) denote the
number of distinct n-node binary trees [10]. Let n and
d be positive integers and let U be an ordered universe
of size at least l ~ ( b n , n + 1). For any unique binary
search tree representation of subsets of U, we have the
following trade-off between search and update times:
P r o o f . By induction on IT[.
C a s e 1. ITI _< 2: Easy.
C a s e 2. IT[ _> 3: For any pair of nodes u and v in T,
define
T h e o r e m 1 I f the n-subsets of U are represented by
binary search trees of height at most d, then there is an
n-subset on which an update operation requires f ~ ( n 2d) rotations.
Let x denote (the symmetric order number of) the highest node in T that is not the maximum or the minimum
element. Let L and R denote, respectively, the block
trees of blocks [1, x] and [z, IT[] in T. Divide the nodes
of Train and Traax into a left block [1, I L l - 1] and a right
block [ILl, IT I - 1]. Then the left block trees of Train and
Trnax are, respectively, Lmln and Lmax- By the inductive hypothesis, at least I L l - 1 - d r a i n ( L ) - dmax(L) left
block rotations must be performed to transform Lmin
6u,~ =
The theorem implies that a dictionary operation requires f~(n) time in the worst case if the dictionary is
sufficiently sparse and rotations are used to update the
tree. It is easy to show that the trade-off given by the
21
1
0
if u is an ancestor of v
otherwise.
creation of at least n new trees. T h e second part of the
claim follows from a straightforward fact:
into Lmax. Similarly, at least I R I - 1 - d m i n ( R ) - d m a x ( R )
right block rotations are needed to transform the right
block tree Rmin of Train into the right block tree Rmax
of Tm~x- Since the roots of Train and Tmax belong to
different blocks, at least one rotation involving nodes
from both blocks is performed in transforming Tmin into
Tm~x- Any single rotation on Train falls exactly into o n e
of the three categories. Hence the total number of rotations necessary to transform Train into Tm~x is at least
I L ] + [R I - 1 - drain(L) - dmax(L)
-dmin(R) -- dmax(R)
[TI-
• For all / _< V~, let vi denote the highest node
of the tree with symmetric order number in the
interval [ ( / - 1 ) v ~ + 1, iV/~. Consider any iteration of Step 1 of the cycle. The subtrees of nodes
vl, v2 . . . . , vd~ after the iteration are all different
from the subtrees of the tree at any time prior to
the iteration.
The theorem follows from the claim. []
=
2.3
(drain(T) - ~max,min) - ~min,$
--~max,~ -- (dmax(T)
- -
~min,max)
I T ] - dmin(T) - dmax(T) - 1.
D
An optimal representation
dense dictionaries
for
We describe a representation, based on binary tries
[10], that allows a dictionary operation to be performed in O(loglVl) time. Assume that V = [1,2 p]
and let S denote the set being represented. If S C
[1,2P-1], then S is represented in U by the tree representing it in the universe [1,2p-I]. Otherwise, let
z = m i n ( S A [ 2 p-1 + 1,2P]). Then, x is the root of
the tree representing S, and its left and right subtrees are, respectively, the trees representing subsets
S A[1, 2p- t] and S N[z + 1, 2p] in the universes [1, 2p- 1]
and [2p-1 + 1,2P]. It is easy to show that this representation supports dictionary operations in O(log [U[)
time using any update mechanism.
When set St is transformed into set $2, the block
tree induced by subset {x2, z 3 , . . . , x , } is transformed
from Bmin into Bmax. By the lemma, it follows that at
least n - 1 - 2d rotations are required to transform $1
into $2.
Consider transforming S1 into $2 by deleting xl
and inserting xn+t. The insertion of x , + t into set
{x2, x 3 , . . . , x , } and the deletion of z , + t from set $2
are inverse transformations and require the same number of rotations. Hence t2(n - 2d) rotations are necessary for either the deletion of Xl from set St or the
deletion of x,~+l from set $2. []
P r o o f o f T h e o r e m 2. By Ramsey's theorem, there
is a subset S = {Xl,X2,...,xm} of U all of whose nsubsets are represented by the same underlying binary
tree. Let T denote an n-subset of S, to be specified
later. The lower bound construction inserts the elements of T into the null tree and repeats the following
cycle of operations on the tree until m - n updates have
been performed:
3
Set e q u a l i t y - t e s t i n g
In this section, we describe a data structure for set
equMity-testing with O ( l o g m ) amortized time and
O(log rn) space per update operation, where m denotes
the total number of updates. The elements seen so far
are numbered in serial order and define the current universe U = [1, [U[]. Each set is represented by a binary
trie [10] in this universe. The binary trie representing a
set S is an S-expression that stores the elements of S as
atoms and is defined recursively. Let 2p < [U I _< 2p+I.
A singleton set is represented by an atom and the empty
set, by the atom NIL. If IS[ >_ 2, then S is represented
by a pair (sl.s2), where sl and s2 are, respectively,
the S-expressions representing subsets S n [1,2 p] and
S f3 [2p + 1, [UI] in their respective subuniverses. We
store S-expressions uniquely so that two sets are equal
if and only if their S-expressions are represented by the
same node. A set update operation translates into a
cascade of at most log lU[ _< logm CONS operations,
which can be implemented in O(log m) amortized time
and O(log m) space using the method described below.
We now describe an efficient data structure for
performing cascades of CONS operations on uniquely
stored S-expressions.
The data structure requires
O(f + log m) amortized time to perform a cascade of
f CONS operations, where rn denotes the total number of cascades performed. Consider the collection of
Cycle.
1. Repeat ~ 1 times:
Delete the minimum element of the tree
and insert a 'fresh element' from S that
is larger than the maximum element of
the tree. T h r o u g h o u t this construction,
by 'fresh element' we mean an element
that has never been in the tree before.
2. Replace the elements of the tree with
symmetric order numbers v/'~,2v~,...,n
by fresh elements from S that occupy the
same symmetric order positions in the
tree.
We now define the elements inserted into the tree. Execute the construction, treating these elements as blackboxes, and determine their partial ordering. Next,
choose the elements from set S to fit the partial order.
We claim that each iteration of the cycle performs
4 v f ~ - 2 update operations on the tree and causes the
22
L e m m a 3 Consider a binary tree partitioned into
blocks whose items have been assigned arbitrary nonnegalive weights such that every block has positive weight
(the weight of a block is the sum of the weights of
items in the block). Let n denote the number of nodes
in the tree and let nb denote the number of blocks.
The cost of a sequence of m splays performed on the
roots of the blocks is O(m q- n "4- Ejrn=llog(W/wj) -JE'~=I log(W/Wi)), where W = the total weight of all
the items, wj = the weight of the block of the jib accessed item, and ~i = the weight of the ith block of the
tree.
nodes representing S-expressions. Number the nodes
serially in order of creation. Say node p is a parent
of node v if p points to v. Each node, say v, maintains a set parents(v) of all its parents. Each parent
p E parents(v) is assigned a key equal to (serial#(w), b),
where w is the other node (besides v) pointed to by
p, and b equals 0 or 1 depending on whether the left
pointer of p points to v or not. To perform a CONS operation on two nodes v and w, we search set parents(v)
under key (serial#(w), 0) and return the matching parent. If there is no matching parent, we create a new
node p with pointers to v and w, set parents(p) to
empty, insert p into parents(v) and parents(w), and
return p.
We represent each set parents(v) by a binary search
tree and use the splay algorithm [15] to perform
searches. Insertion of a new item using splay is implemented as follows. Maintain a pointer to the maximum
item in the tree (in addition to the root pointer). If the
inserted item is larger than the current maximum item,
insert it as the right child of the maximum item. Otherwise, insert the item into the tree in the standard way
and splay at the item. The two types of insertions are
called passive and active, respectively. We implement
passive insertions more efficiently since they are more
numerous than active insertions.
The following theorem summarizes the performance
of the data structure.
P r o o f . Assign potentials to all nodes in the tree
as described by Cole et hi. [5] in Section 2, "Global
insertions", and analyze splays using their analysis of
global insertions. An account of this analysis can also
be found in Cole [4]. The amortized cost of splaying at
the root of a block with weight w is O(1 + log(W/w))
and the drop in potential over the entire sequence is
O ( ~ i n=b 1 log(W/t~i) + n). The result follows. I"1
Partition the tree into blocks as follows. Call the
items accessed by active and hot splays active and hot,
respectively. Every active or hot item forms a singleton
block. Each nonempty interval of nodes between consecutive singleton blocks forms a passive block. Choose
an item from each passive block and call it the block
representative. Note that na = the number of active
items. The weight of item i is defined by:
T h e o r e m 3 The amortized cost of a cascade of f
CONS operations is O ( f + log m), where rn is the total
number of cascades performed on S-expressions.
fi
F/(na + 1)
0
P r o o f . The proof uses two lemmas.
L e m m a 2 Consider a sequence of insertions and
searches performed on an (initially empty) binary
search tree using splays. Let
fi
= the number of searches of item i,
F
= the total number of searches,
na = the number of active insertions, and
n
= the total number of insertions.
The cost of this sequence is O(n + F + nalogna +
~y,>l fi log(F/fi)).
if the item is hot
if the item is active
if the item is in a passive block
but not the representative
The representatives of na + 1 passive blocks are assigned
a weight of F/(na + 1) each and the representatives of
the remaining passive blocks are placed in one-to-one
correspondence with the set of hot items and assigned
the weights of their mates. The total weight of the
tree is at most 4F. Applying L e m m a 3, the cost of the
sequence of splays is
O(no + F + , + ~ f, log(4F/f,) + , ° log(4(na + 1)) +
.h_>l
P r o o f . We modify the sequence by preinserting all
:Ltems into the initial tree according to their order of
arrival (without splaying). On this tree, we perform
she searches and simulate the insertions. Active inserrbons are simulated by splaying at the corresponding
items and passive insertions are simply ignored. We
obtain a sequence of splays corresponding to active insertions and searches (active splays and hot splays, respectively). It suffices to bound the cost of this sequence.
Recall the notion of blocks in a binary tree introduced in the proof of Theorem 1. We bound the cost of
this sequence by partitioning the tree into blocks and
applying a result implicit in the work of Cole et hi. [5].
log(4F/f,)) + (2.o + 1 ) l o g ( 4 ( . . + 1))) =
O(n + F + na log na + E f i
log(F/fi)).
fi>_l
13
L e m m a 4 Consider a digraph G = (V, E). Assign an
arbitrary positive integer Fe to each edge e and assign
the corresponding integer Fv = ~-~(~,~o)~EF(v,~o) to each
vertex v. Let {P1, P ~ , . . . , Pro} be a collection of walks
in G such that an edge e occurs exactly Fe times in the
walks. Call the start vertices of the walks, sources. Let
23
#P~ denote the number of walks starting at vertex v.
Let
idv =
the end of the sequence of cascades induces a directed
graph with the nodes as vertices and edges from nodes
to their parents. The indegree of each node in the graph
is at most 2. For each edge (v, w), define F(v,to) = the
number of searches of node w performed on parents(v).
Delete all edges e such that F~ = 0. Applying Lemma
4 to the resulting graph, the summation is bounded by
F log 3 + m log m. It follows that the cost of the sequence of cascades is O(F + m l o g m ) . The theorem
follows. []
if v is not a source
otherwise
I indegree(v)
indegree(v) + 1
Then,
F(,~,~,)log(F,,/F(,,,w)) <_
(v,w)e~
F~ log idv + Z
id~> l
# P" log Fv .
F,>_I
4
P r o o f . Let
F(v,~) ----F(v,w) - #walks with (v, w) as the last edge.
F(,,w)log(F~/F(,~,,~)) =
~_~ FvlogFv+
~
F(v,w)log(1/F(,~,w)) <
(v,w)eE
# Pv log Fv + ~
F(~,v) log Fv+
1%>_1
(~:,v)~~
T'(,~,~)l°g(1//~'(,~,~))=
(v,w)~E
(x l o g ( l / x ) is decreasing in [l/e, oo])
F,>I
F,~>I ( v , ~ ) e ~
logFo +
F,> I
logid .
idu,)_l
(entropy inequality)
[]
Consider a sequence of m cascades of CONS operations comprising F CONS operations totally. The cost
of a cascade of f CONS operations is O(f) plus the
cost of operations on parent sets. During any cascade of CONS operations, there are at most two active
insertions into parent sets. These insertions are performed when the first node is created by the cascade
and all subsequent insertions are passive. Thus, out of
at most 2 F insertions into parent sets during the entire
sequence of cascades, at most 2m are active. Applying L e m m a 2 to the sequence of insertions and searches
performed on each parent set and summing the costs
over all parent sets, we see that the total.cost of parent
set operations is
O(F+mlog m+ ~
Z
equality-testing
In this section, we describe a data structure for sequence equality-testing with O ( ~ + l o g
m) amortized time and O ( v ~ amortized space per update,
where n denotes the length of the updated sequence
and m, the total number of updates. We represent a sequence by an almost complete binary tree as described
in Section 2.1. The ith node of the tree in symmetric order stores the ith element of the sequence. A j-run is a
subsequence of 2j - 1 adjacent elements of the sequence.
As before, for each j < flog n/2J, we maintain a list of
j-trees representing the j-runs of the sequence. In this
representation a node can belong to several sequences
and hence have several left and right pointers corresponding to several lists of j-trees. This makes navigation through the list of j-trees of a particular sequence
expensive. We make the data structure persistent so
that it permits efficient navigation. A data structure
is fully persistent [8] if it maintains all versions of the
data structure created by the update operations so far
and permits accesses and updates (that create new versions) to any existing version. Driscoll et al. [8] show
how to make a purely pointer-based data structure that
has nodes of constant bounded indegree and constant
bounded number of access points fully persistent by
weakening the time and space per update operation to
amortized bounds (the original bounds must be worstcase). Their method is to split a node when it has
to maintain too many fields corresponding to different
versions of the data structure. In our data structure,
nodes have indegree at most 4 and the root is the only
access point, so it can be made fully persistent using
this technique. The persistent data structure maintains
the representations of all sequences created so far and
permits navigating and updating any of them.
Two sequences are equal if and only if their binary
trees are isomorphic. To facilitate isomorphism testing of binary trees, we serially number each class of
isomorphic binary trees in order of creation and store
the serial number of a binary tree at its root. Two sequences are equal if and only if their binary trees have
identical serial numbers. When a node of the persistent
data structure is split, its serial number is stored in the
resulting nodes. We now discuss how to compute the se-
(v,w)~
F,~_I
Sequence
f, log(F(v)/f,)),
nodes v i E parents(v)
A fi~_l
where F(v) denotes the total number of searches performed on parents(v) and fi denotes the number of
searches of item i among these. We use Lemma 4
to bound the summation. The collection of nodes at
24
[5] R.Cole, B.Mishra, J.Schmidt, and A.Siegel, "On
the dynamic finger conjecture for splay trees. Part
I: Splay sorting logn-block sequences", Courant
Institute Tech. Rept., September 1989.
rial number of a new tree created by a CONS operation.
Number all elements seen so far in serial order and store
them in a balanced search tree so that the serial number
of an element can be computed in O(logm) time. By
associating with each tree a triple (l, r, x) of serial numbers of its two subtrees and root element, we reduce the
computation of serial number of a new tree to testing
membership of its triple in the collection of triples of all
the trees. The collection of triples associated with the
trees is a subset of [1, c.m 312] X [1, c.m 312] X [1, m], for
some constant c. Willard [20] gives a data structure for
main~
a dictionary over an ordered universe U in
O(x/log [U[) time and O(1) amortized space per operation. If the collection of triples is maintained using
this data structure, the serial number of a new tree can
be computed in O ( ~ )
time and O(1) amortized
space.
The persistent data structure, in conjunction with
the data structure for serial number computation,
solves the sequence equality-testing problem. The cost
of testing equality of two sequences is O(1). The
amortized time to update a sequence of length n is
O(v/'n log m + log m); the amortized space required is
O(~/~).
5
[6] K.Culik II and D.Wood, "A note on some tree similarity measures", Info. Proc. Lett. 15 (1982), 3942.
[7] M.Dietzfelbinger, A.Karlin, K.Mehlhorn, F.Meyer
auf der Heide, H.Rohnert, and R.E.Tarjan, "Dynamic perfect hashing: Upper and lower bounds",
Proc. 29th IEEE FOCS (1988), 524-531.
[8] J.R.Driscoll,
N.Sarnak,
D.D.Sleator,
and R.E.Tarjan, "Making data structures persistent", J. Comp. Sys. Sci. 38 (1989), 86-124.
[9] R.L.Graham, B.L.Rothschild and J.Spencer,
RAMSEY THEORY, John Wiley 8, sons, 1980.
[10] D.E.Knuth, THE ART OF COMPUTER PROGRAMMING, VOL. 3:
SORTING AND SEARCHING,
Addison-Wesley Publishing Co., 1973.
[11] L.Lovfisz, COMBINATORIAL PROBLEMS AND EXERCISES, North-Holland Publishing Co., 1979.
[12] J.I.Munro and H.Suwanda, "Implicit data structures for fast search and update," J. Comp. Sys.
Sci. 21 (1980), 236-250.
Open Problems
Our work raises some interesting open problems:
[13] W.Pugh, "Incremental computation and the incremental evaluation of functional programming,"
Ph.D. Thesis, Cornell University, 1988.
1. Determine the complexity of unique binary search
tree representations when the dictionary is of intermediate density (n e < IUI < Tn+l(~n)).
[14] W.Pugh and T.Tietelbaum, "Incremental computation via function caching," Proc. 151h ACM
POPL (1989), 315-328.
2. Determine whether there is a data structure for
sequence equality-testing with O(polylog(m)) time
and space per update operation.
[15] D.D.Sleator and R.E.Tarjan, "Self-adjusting binary search trees", J. A C M 32 (1985), 652-686.
[16] D.D.Sleator, R.E.Tarjan, and W.P.Thurston, "Rotation distance, triangulations, and hyperbolic geometry," J. A M S (1988), 647-682.
Acknowledgements
We thank Mike Fredman for valuable discussions.
[17] L.Snyder, "On uniquely representable data structures," Proc. 18Lh IEEE FOCS (1977), 142-146.
References
[18] M.N.Wegman and J.L.Carter, "New hash functions and their use in authentication and set equality", J. Comp. Sys. Sci. 22 (1981), 265-279.
[1] J.Allen, ANATOMY OF LISP, McGraw Hill Publishing Co., 1978.
[19] R.Wilber, "Lower bounds for accessing binary
search trees with rotations," SIAM J. on Computing, 18 (1989), 56-67.
[2] C.R.Aragon and R.G.Seidel, "Randomized search
trees", Proc. 30¢h IEEE FOCS (1989), 540-545.
[20] D.E.Willard, "New trie data structures which support very fast search operations", J. Comp. Sys.
Sci. 28 (1984), 379-394.
[31 J.L.Carter
and M.N.Wegman, "Universal classes
of hash functions", J. Comp. Sys. Sci. 18 (1979),
143-154.
[21] D.Yellin, "Representing sets with constant time
equality testing", IBM Tech. Rept., April 1989.
[4] R.Cole, "On the dynamic finger conjecture for
splay trees", these proceedings.
25

Download Report

Unique Binary Search Tree Representations and Equality

Paperzz.com

Your Paperzz