An Approximation Algorithm for Binary Searching in Trees∗

Eduardo Laber†    Marco Molinaro‡

Abstract

We consider the problem of computing efficient strategies for searching in trees. As a generalization of the classical binary search for ordered lists, suppose one wishes to find an (unknown) specific node of a tree by asking queries to its arcs, where each query indicates the endpoint closer to the desired node. Given the likelihood of each node being the one searched for, the objective is to compute a search strategy that minimizes the expected number of queries. Practical applications of this problem include file system synchronization and software testing. Here we present a linear time algorithm which is the first constant factor approximation for this problem. This represents a significant improvement over the previous O(log n)-approximation.

Keywords: Binary search, partially ordered sets, trees, approximation algorithms

1 Introduction

Searching in ordered structures is a fundamental problem in theoretical computer science. In one of its most basic variants, the objective is to find a special element of a totally ordered set by making queries which iteratively narrow the possible locations of the desired element. This can be generalized to searching in more general structures which have only a partial order on their elements instead of a total order [5, 20, 4, 23, 21]. In this work, we focus on searching in structures that lie between totally ordered sets and the most general posets: we wish to efficiently locate a particular node in a tree. More formally, as input we are given a rooted tree T = (V, E) which has a 'hidden' marked node and a function w : V → R that gives the likelihood of a node being the one marked. For example, T could be modeling a network with one defective unit.
In order to discover which node of T is marked, we can perform edge queries: after querying the arc (i, j) of T (j being a child of i)¹, we receive an answer stating that either the marked node is a descendant² of j (called a yes answer) or that the marked node is not a descendant of j (called a no answer). A search strategy is a procedure that decides the next query to be posed based on the outcomes of previous queries. As an example, consider the strategy for searching the tree T of Figure 1.a represented by the decision tree D of Figure 1.b. A decision tree can be interpreted as a strategy in the following way: at each step we query the arc indicated by the node of D at which we are currently located. In case of a yes answer, we move to the right child of the current node, and we move to its left child otherwise. We proceed with these operations until the marked node is found. Let us assume that 4 is the marked node in Figure 1.a. We start at the root of D and query the arc (3, 4) of T, asking if the marked node is a descendant of node 4 in T. Since the answer is yes, we move to the right child of (3, 4) in D and query the arc (4, 6) in T. In this case, the outcome of the query (4, 6) is no, and we then move to node (4, 5) of D. By querying this node we conclude that the marked node of T is indeed 4.

Figure 1: (a) Tree T. (b) Example of a decision tree for T; internal nodes correspond to arcs of T and leaves to nodes of T.

∗ Preliminary version in ICALP 2008.
† [email protected]. Informatics Department, PUC-Rio. Corresponding address: Informatics Department of PUC-Rio, Rua Marquês de São Vicente 225, RDC, 4º andar, CEP 22453-900, Rio de Janeiro - RJ, Brazil.
‡ [email protected]. Tepper School of Business, Carnegie Mellon University.
¹ Henceforth, when we refer to the arc (i, j), j is a child of i.
² We assume that a node is a descendant of itself.
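The walkthrough above can be simulated in a few lines of code. The sketch below is ours, not the paper's: it encodes the tree T of Figure 1.a by its parent relation, the decision tree D of Figure 1.b as nested tuples, and answers each edge query with a simple ancestor test.

```python
# Tree T of Figure 1.a, encoded by its parent relation (node 1 is the root).
PARENT = {2: 1, 3: 1, 4: 3, 5: 4, 6: 4}

def is_descendant(v, j):
    """Answer an edge query: is the marked node v a descendant of j?
    (Recall that a node is considered a descendant of itself.)"""
    while v is not None:
        if v == j:
            return True
        v = PARENT.get(v)
    return False

# Decision tree D of Figure 1.b as nested tuples ((i, j), yes_subtree, no_subtree);
# a bare integer is a leaf u_v.  (This encoding is illustrative, not from the paper.)
D = ((3, 4),
     ((4, 6), 6, ((4, 5), 5, 4)),   # 'yes' branch: marked node is a descendant of 4
     ((1, 2), 2, ((1, 3), 3, 1)))   # 'no' branch

def search(decision_tree, marked):
    """Follow the yes/no branches of D; return (node found, number of queries)."""
    node, queries = decision_tree, 0
    while isinstance(node, tuple) and len(node) == 3:
        (i, j), yes_branch, no_branch = node
        queries += 1
        node = yes_branch if is_descendant(marked, j) else no_branch
    return node, queries

print(search(D, 4))  # (4, 3): the three queries (3,4), (4,6), (4,5) of the walkthrough
```

Running `search` with marked node 4 reproduces exactly the three queries traced in the text.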
We define the expected number of queries of a strategy S as Σ_{v∈V} s_v w(v), where s_v is the number of queries needed by S to find the marked node when v is the marked node. Therefore, our optimization problem is to find the strategy with minimum expected number of queries. We give a more formal definition of a strategy by means of decision trees in Section 2.1.

Besides generalizing a fundamental problem in theoretical computer science, searching in posets (and in particular in trees) also has practical applications in file system synchronization and software testing [21]. We remark that although these applications were considered in the 'worst case' version of this problem, taking into account the likelihood that the elements are marked (for instance via code complexity measures in the latter example) may lead to improved searches.

Apart from the above mentioned applications, strategies for searching in trees have a potential application in the context of asymmetric communication protocols [1, 2, 10, 16, 25]. In this scenario, a client has to send a binary string x ∈ {0, 1}^t to the server, where x is drawn from a probability distribution D that is only available to the server. The asymmetry comes from the fact that the client has much larger bandwidth for downloading than for uploading. In order to benefit from this discrepancy, both parties agree on a protocol to exchange bits until the server learns the string x. Such a protocol typically aims at minimizing the number of bits sent by the client (although other factors, such as the number of bits sent by the server and the number of communication rounds, should also be taken into account). In one of the first proposed protocols [2, 16], at each round the server sends the client a binary string y and the client replies with a single bit depending on whether y is a prefix of x or not. Based on the client's answer, the server updates his knowledge about x and sends another string if he has not learned x yet.
It is not difficult to realize that this protocol corresponds to a strategy for searching for a marked leaf in a complete binary tree of height t, where only the leaves of the tree have non-zero probability. In fact, the binary strings in {0, 1}^t can be represented by a complete binary tree of height t where every edge that connects a node to its left (right) child is labeled with 0 (1). This generates a 1-1 correspondence between binary strings of length at most t and the edges of the tree and, as a consequence, the message y sent by the server naturally corresponds to an edge query.

Statement of the results. Our main result is a linear-time algorithm that provides the first constant-factor approximation for the problem of searching in trees where the goal is to minimize the expected number of queries. The algorithm is based on the decomposition of the input tree into special paths. A search strategy is computed for each of these paths and these strategies are then combined to form a strategy for searching the original tree. This decomposition is motivated by the fact that the problem of binary searching in paths is easily reduced to the well-solved problem of searching in ordered lists with access probabilities [14]. We note that the complexity of this problem remains open, which contrasts with its 'worst case' version, which is polynomially solvable [4, 23, 21]. In fact, the 'average case' version studied here is one more example of problems related to searching and coding whose complexities are unknown [8, 3, 12, 11].

Related work. Searching in totally ordered sets is a very well-studied problem [14]. In addition, many variants have also been considered, such as when there is information about the likelihood of each element being the one marked [7], or when each query has a different fixed cost and the objective is to find a strategy with least total cost [13, 17, 18].
As a generalization of the latter, [22, 24] considered the variant where the cost of each query depends on the queries posed previously. The most general version of our problem, where the input is a poset instead of a tree, was first considered by Lipman and Abrahams [20]. Apart from introducing the problem, they present an optimized exponential time algorithm for solving it. In [15], Kosaraju et al. present a greedy O(log n)-approximation algorithm. In fact, their algorithm handles more general searches; see [19, 6] for other more general results. To the best of our knowledge, this O(log n)-approximation algorithm is the only available result, with a theoretical approximation guarantee, for the average case version of searching in trees. Therefore, our constant approximation represents a significant advance for this problem.

The variant of our problem where the goal is to minimize the number of queries in the worst case, instead of the average case, was first considered by Ben-Asher et al. [4]. They showed that it can be solved in O(n^4 log^3 n) time via dynamic programming. Recent advances [23, 21] have reduced the time complexity to O(n^3) and then to O(n). In contrast, the more general version of the worst case minimization where the input is a poset instead of a tree is NP-hard [5].

2 Preliminaries

2.1 Notation

For any tree T and node j ∈ T, we denote by T_j the subtree of T composed of all descendants of j. In addition, we denote the root of a tree T by r(T). We also extend the set difference operation to trees: given trees T¹ = (V¹, E¹) and T² = (V², E²), T¹ − T² is the forest of T¹ induced by the nodes V¹ − V². Furthermore, we extend the weight function to a subtree T′ in the standard way: w(T′) = Σ_{v∈T′} w(v).
Finally, for any multiset W of non-negative real values, we define its entropy as:

  H(W) = − Σ_{w∈W} w log₂ ( w / Σ_{w′∈W} w′ )

We remark that this seemingly unnatural extension of the definition of entropy to unnormalized multisets greatly simplifies the notation in the proofs. Its homogeneity is a particularly important property. We should also mention that, although not explicit in some passages, all the logarithms in this text are in base 2.

Every query strategy for a tree T can be represented by a binary decision tree D such that a path of D indicates which queries should be made at each step. In a decision tree D, each internal node corresponds to a query for a different arc of T and each leaf of D corresponds to a different node of T (these correspondences shall become clear later). In addition, each internal node u of D satisfies a search property that can be described as follows. If u corresponds to a query for the arc (i, j), then: (i) u has exactly one right subtree and one left subtree; (ii) all nodes of the right subtree of u correspond to either a query for an arc in T_j or to a node in T_j; (iii) all nodes of the left subtree of u correspond to either a query for an arc in T − T_j or to a node in T − T_j. For any decision tree D, we use u_{(i,j)} to denote the internal node of D which corresponds to a query for the arc (i, j) of T, and u_v to denote the leaf of D which corresponds to the node v of T.

From the example presented in the introduction, we can infer an important property of decision trees. Consider a tree T and a search strategy given by a decision tree D for T. If v is the marked node of T, the number of queries posed to find v is the distance (in arcs) from the root of D to u_v. For any decision tree D, we define d(u, v, D) as the distance between nodes u and v in D (when the decision tree is clear from the context, we omit the last parameter of this function).
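The unnormalized entropy defined above is straightforward to compute. A minimal Python sketch (the function name is ours), which also illustrates the homogeneity property mentioned above, H(c·W) = c·H(W):

```python
import math

def entropy(weights):
    """H(W) for an unnormalized multiset W of non-negative reals:
    H(W) = -sum_{w in W} w * log2(w / sum(W)).  Zero weights contribute 0."""
    total = sum(weights)
    return -sum(w * math.log2(w / total) for w in weights if w > 0)

# For a probability distribution this is the usual Shannon entropy...
print(entropy([0.5, 0.5]))                      # 1.0
# ...and scaling the multiset scales the entropy (homogeneity): H(3*W) = 3*H(W).
print(entropy([3, 6, 15]), 3 * entropy([1, 2, 5]))
```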
Thus, the expected (with respect to w) number of queries it takes to find the marked node using the strategy given by the decision tree D, or simply the cost of D, is given by:

  cost(D, w) = Σ_{v∈T} d(r(D), u_v, D) w(v)

Therefore, the problem of computing a search strategy for (T, w) which minimizes the expected number of queries can be recast as the problem of finding a decision tree for T with minimum cost, that is, one that minimizes cost(D, w) among all decision trees D for T. The cost of such a minimum cost decision tree is denoted by OPT(T, w).

2.2 Basic Properties

Now we present two properties of decision trees which are crucial for the analysis of the algorithm proposed in the next section. Consider a subtree T′ of T; we say that a node u is a representative of T′ in a decision tree D if the following conditions hold: (i) u is a node of D that corresponds to either an arc or a node of T′; (ii) u is an ancestor of all other nodes of D which correspond to arcs or nodes of T′. For instance, in the decision tree of Figure 1.b, u_{(3,4)} is the representative of T₃ and u_{(4,6)} is the representative of T₄. The next lemma asserts the existence of a unique representative for each subtree of T (proof in the Appendix).

Lemma 1. Consider a tree T and a decision tree D for T. For each subtree T′ of T, there is a unique node u ∈ D which is the representative of T′ in D.

We denote the representative of T′ (with respect to some decision tree) by u(T′). The second property is given by the following lemma:

Lemma 2. Consider a tree T, a weight function w and a decision tree D for T. Then for every subtree T′ of T, Σ_{x∈T′} d(u(T′), u_x, D) w(x) ≥ OPT(T′, w).

Proof. It suffices to argue that for every subtree T′ of T there is a decision tree for T′ which costs at most Σ_{x∈T′} d(u(T′), u_x, D) w(x). This is done by induction on the number of nodes of T′. When T′ is a single node the result trivially holds, so suppose T′ has at least two nodes.
Because T′ has more than one node, its representative u(T′) in D corresponds to an arc of T′, say (i, j). Let D_1 and D_2 be respectively decision trees for T′_j and T′ − T′_j that satisfy the inductive hypothesis. Then consider the following decision tree D_{T′} for T′: its root r′ corresponds to a query for the arc (i, j), the right subtree of r′ equals D_1 and the left subtree of r′ equals D_2. Clearly D_{T′} has cost:

  cost(D_{T′}, w) = w(T′) + cost(D_1, w) + cost(D_2, w)
                  ≤ Σ_{x∈T′_j} [1 + d(u(T′_j), u_x, D)] w(x) + Σ_{x∈T′−T′_j} [1 + d(u(T′ − T′_j), u_x, D)] w(x)   (1)

Now consider a path in D from its root to the node u_x, where x is an arbitrary node in T′_j. As T′_j ⊆ T′ ⊆ T, the definition of a representative implies that this path has the form (r(D) ⇝ u(T′) ⇝ u(T′_j) ⇝ u_x). Since u(T′) corresponds to the arc (i, j), which does not belong to T′_j, u(T′_j) must be a proper descendant of u(T′). Therefore, the length of the path from u(T′) to u_x satisfies:

  d(u(T′), u_x, D) ≥ 1 + d(u(T′_j), u_x, D)   (2)

The same argument shows that the path in D from its root to the node u_x, where x is an arbitrary node in T′ − T′_j, satisfies the following analogous inequality:

  d(u(T′), u_x, D) ≥ 1 + d(u(T′ − T′_j), u_x, D)   (3)

Employing the bounds provided by (2) and (3) on inequality (1), we have that the cost of D_{T′} is at most Σ_{x∈T′} d(u(T′), u_x, D) w(x). This completes the inductive step and consequently the proof of the lemma.

3 An Algorithm for Searching in Trees

In this section we present our algorithm for searching in trees. A crucial observation is that this problem can be efficiently solved when the input tree is a path. This is true because it can be easily reduced to the well-known problem of searching for a hidden marked element from a totally ordered set U in a sorted list L ⊆ U of elements, where each element of U has a given probability of being the marked one (see Section B of the appendix).
Due to this correspondence, we can employ an approximation algorithm for the latter problem in order to obtain an approximation (with the same guarantee) for searching in path-like trees. We remark that one could use the exact algorithm from [14] to obtain optimal strategies for searching in path-like trees. However, we rely on the faster approximation algorithm from [7] in order to obtain a linear running time for our algorithm.

Motivated by this observation, the algorithm decomposes the input tree into special paths, finds decision trees for each of these paths (with modified weight functions) and combines them into a decision tree for the original tree. In our analysis, we obtain a lower bound on the optimal solution and an upper bound on the returned solution in terms of the costs of the decision trees for the paths. Thus, the approximation guarantee of the algorithm is basically a constant times the guarantee of the approximation used to compute the decision trees for the paths.

Throughout the text, we present the execution and analysis of the algorithm over an instance (T, w), where T is rooted at node r. Moreover, we assume that T has more than one node, which excludes trivial instances. For every node u ∈ T, we define the cumulative weight of u as the sum of the weights of its descendants, namely w(T_u). A heavy path Q of T is defined recursively as follows: r belongs to Q; for every node u in Q, the non-leaf child of u with greatest cumulative weight also belongs to Q. This construction guarantees that a heavy path of T has the following properties:

1. If u is an internal node of T that does not belong to the heavy path, then w(T_u) ≤ w(T)/2.

2. If u is the last node of the heavy path, then all of its children are leaves of T.

Let Q = (q_1 → ... → q_{|Q|}) be a heavy path of T. We define T^{q_i} = T_{q_i} − T_{q_{i+1}} for i < |Q|, and T^{q_{|Q|}} = T_{q_{|Q|}}. In addition, we define T_{q_i}^j as the jth heaviest maximal subtree rooted at a child of q_i not in Q (Figure 2).
Finally, let n_i denote the number of children of q_i that do not belong to Q, and define e_i^j as the arc of T connecting the node q_i to the subtree T_{q_i}^j.

Figure 2: Example of the structures T^{q_i} and T_{q_i}^j, with nodes of the heavy path in black.

Now we explain the high level structure of the solution returned by the algorithm. The main observation is that we can break the task of finding the marked node into three stages: first finding the node q_i of Q such that T^{q_i} contains the marked node, then querying the arcs {e_i^j}_j to discover either which tree T_{q_i}^{j′} contains the marked node or that the marked node is q_i, and finally (if needed) locating the marked node in T_{q_i}^{j′}. The algorithm follows this reasoning: it computes a decision tree for the heavy path Q, then a decision tree for querying the arcs {e_i^j}, and recursively computes a decision tree for each tree T_{q_i}^j.

Now we present the algorithm itself, which consists of the following five steps:

(i) Find a heavy path Q of T and then for each q_i ∈ Q define w′(q_i) = w(T^{q_i})/w(T).

(ii) Calculate a decision tree D_Q for the instance (Q, w′) using the approximation algorithm presented in [7].

(iii) Calculate recursively a decision tree D_i^j for each instance (T_{q_i}^j, w).

(iv) Build a decision tree D_i for each T^{q_i} as follows. The leftmost path of D_i consists of nodes corresponding to the arcs e_i^1, ..., e_i^{n_i}, with a node u_{q_i} appended at the end. In addition, for every j ∈ {1, ..., n_i}, D_i^j is the right subtree of the node corresponding to e_i^j in D_i (Figure 3).

(v) Construct the decision tree D for T by replacing the leaf u_{q_i} of D_Q by D_i, for each q_i ∈ Q (Figure 4).

It is not difficult to see that the tree D computed by the algorithm is a decision tree for T. In the next sections we present an upper bound on the cost of D and lower bounds on the cost of an optimal decision tree for (T, w). Together these bounds will give an approximation guarantee for the algorithm.
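As a concrete illustration of Step (i), a heavy path can be computed directly from the cumulative weights. The sketch below is ours (the paper does not give an implementation; the dictionary-based tree encoding and all names are assumptions): it computes w(T_u) for every node with one bottom-up pass and then follows the recursive definition of a heavy path.

```python
def heavy_path(children, weights, root):
    """Compute a heavy path (q_1, ..., q_|Q|) of a rooted tree.

    children: dict mapping a node to the list of its children (absent = leaf);
    weights:  dict mapping a node to w(node).
    Starting from the root, repeatedly move to the non-leaf child with greatest
    cumulative weight w(T_u), stopping when every child is a leaf."""
    # Cumulative weights w(T_u): visit nodes parents-first, then fold back up.
    order, stack, cum = [], [root], {}
    while stack:
        u = stack.pop()
        order.append(u)
        stack.extend(children.get(u, []))
    for u in reversed(order):  # children are processed before their parents
        cum[u] = weights[u] + sum(cum[c] for c in children.get(u, []))

    path = [root]
    while True:
        internal = [c for c in children.get(path[-1], []) if children.get(c)]
        if not internal:  # property 2: all children of the last node are leaves
            return path
        path.append(max(internal, key=lambda c: cum[c]))

# Tree of Figure 1.a (1 -> {2, 3}, 3 -> {4}, 4 -> {5, 6}) with unit weights:
print(heavy_path({1: [2, 3], 3: [4], 4: [5, 6]}, {v: 1 for v in range(1, 7)}, 1))  # [1, 3, 4]
```

Both passes touch each node and arc a constant number of times, matching the linear-time budget of the algorithm.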
3.1 Upper Bound

As the trees {T_{q_i}^j} and Q form a partition of the nodes of T, we analyze the distance from the root of D to the nodes in each of these structures separately in order to upper bound the cost of D. First, consider a tree T_{q_i}^j and let x be a node in T_{q_i}^j. Noticing that u_x is a leaf of the tree D_i^j, the path from r(D) to u_x in D must contain the node r(D_i^j). Then, by the construction made in Step (iv), the path from r(D) to u_x in D has the form (r(D) ⇝ u_{e_i^1} → u_{e_i^2} → ... → u_{e_i^j} → r(D_i^j) ⇝ u_x).

Figure 3: Illustration of Step (iv) of the algorithm. (a) Tree T with heavy path in bold. (b) Decision tree D_i for T^{q_i}.

Figure 4: Illustration of Step (v) of the algorithm. (a) Decision tree D_Q built at Step (ii). (b) Decision tree D constructed by replacing the leaves {u_{q_i}} by the decision trees {D_i}.

Notice that the path (r(D) ⇝ u_{e_i^1}) in D is the same as the path (r(D) ⇝ u_{q_i}) in D_Q. In addition, the path from r(D_i^j) to u_x is the same in D and in D_i^j. Employing the previous observations, we have that the length of the path (r(D) ⇝ u_{e_i^1} → u_{e_i^2} → ... → u_{e_i^j} → r(D_i^j) ⇝ u_x) is:

  d(r(D), u_x, D) = d(r(D), u_{q_i}, D_Q) + j + d(r(D_i^j), u_x, D_i^j)   (4)

Now we consider a node q_i ∈ Q. Again due to the construction made in Step (iv) of the algorithm, it is not difficult to see that the path from r(D) to u_{q_i} in D traverses the leftmost path of D_i, that is, this path has the form (r(D) ⇝ u_{e_i^1} → u_{e_i^2} → ... → u_{e_i^{n_i}} → u_{q_i}).
Because the path (r(D) ⇝ u_{e_i^1}) in D is the same as the path (r(D) ⇝ u_{q_i}) in D_Q, it follows that the length of the path from the root of D to u_{q_i} is:

  d(r(D), u_{q_i}, D) = d(r(D), u_{q_i}, D_Q) + n_i   (5)

Weighting the distances to reach nodes in {T_{q_i}^j} and in Q and employing equalities (4) and (5), we find the cost of D:

  cost(D, w) = Σ_{q_i∈Q} Σ_j Σ_{x∈T_{q_i}^j} d(r(D), u_x, D) w(x) + Σ_{q_i∈Q} d(r(D), u_{q_i}, D) w(q_i)
             = Σ_{q_i∈Q} Σ_j d(r(D), u_{q_i}, D_Q) w(T_{q_i}^j) + Σ_{q_i∈Q} d(r(D), u_{q_i}, D_Q) w(q_i)
               + Σ_{q_i∈Q} Σ_j j · w(T_{q_i}^j) + Σ_{q_i∈Q} Σ_j Σ_{x∈T_{q_i}^j} d(r(D_i^j), u_x, D_i^j) w(x)
               + Σ_{q_i∈Q} n_i w(q_i)   (6)

Recall that we defined w′(q_i) = w(T^{q_i})/w(T) at Step (i) of the algorithm. Then the combined sum of the first two summations of equation (6) can be written as:

  Σ_{q_i∈Q} Σ_j d(r(D), u_{q_i}, D_Q) w(T_{q_i}^j) + Σ_{q_i∈Q} d(r(D), u_{q_i}, D_Q) w(q_i)
    = Σ_{q_i∈Q} d(r(D), u_{q_i}, D_Q) w(T^{q_i})
    = w(T) Σ_{q_i∈Q} d(r(D), u_{q_i}, D_Q) w′(q_i)
    = w(T) · cost(D_Q, w′)   (7)

Moreover, the fourth summation of equation (6) can be expressed more concisely as the sum of the costs of the decision trees {D_i^j}:

  Σ_{q_i∈Q} Σ_{j=1}^{n_i} Σ_{x∈T_{q_i}^j} d(r(D_i^j), u_x, D_i^j) w(x) = Σ_{q_i∈Q} Σ_{j=1}^{n_i} cost(D_i^j, w)   (8)

Therefore, employing equalities (7) and (8) in (6) leads to the following expression for the cost of D:

  cost(D, w) = w(T) · cost(D_Q, w′) + Σ_{q_i∈Q} Σ_{j=1}^{n_i} j · w(T_{q_i}^j) + Σ_{q_i∈Q} Σ_{j=1}^{n_i} cost(D_i^j, w) + Σ_{q_i∈Q} n_i w(q_i)

Now all we need is to upper bound the first term of the previous equality. Notice that cost(D_Q, w′) is exactly the cost of the approximation computed at Step (ii) of the algorithm. As mentioned previously, in this step we use the result from [7], which guarantees an upper bound of H({w′(q_i)}) + 2 on cost(D_Q, w′).
Substituting this bound into the last displayed equality and observing that w(T) · H({w′(q_i)}) = H({w(T^{q_i})}), we complete the bound for the cost of D:

  cost(D, w) ≤ H({w(T^{q_i})}) + 2w(T) + Σ_{q_i∈Q} Σ_{j=1}^{n_i} j · w(T_{q_i}^j) + Σ_{q_i∈Q} Σ_{j=1}^{n_i} cost(D_i^j, w) + Σ_{q_i∈Q} n_i w(q_i)   (9)

3.2 Entropy Lower Bound

In this section we present a lower bound on the cost of an optimal decision tree for (T, w). Hence, let D* be a minimum cost decision tree for (T, w), and let r* be the root of D*. Consider a tree T_{q_i}^j and let x be a node in T_{q_i}^j. By definition, the representative of T_{q_i}^j in D* (the node u(T_{q_i}^j)) is an ancestor of the node u_x in D*. Because T_{q_i}^j ⊆ T^{q_i}, it also follows from the definition of the representative that u(T^{q_i}) is an ancestor of u(T_{q_i}^j) in D*. Combining these observations, we have that the path in D* from r* to u_x has the form (r* ⇝ u(T^{q_i}) ⇝ u(T_{q_i}^j) ⇝ u_x). Now consider a node q_i ∈ Q; again by the definition of representative, the path from r* to u_{q_i} can be written as (r* ⇝ u(T^{q_i}) ⇝ u_{q_i}). Then, adding the weighted paths from r* to nodes in {T_{q_i}^j} and in Q, we can write the cost of D*, which is OPT(T, w), as:

  OPT(T, w) = Σ_{q_i∈Q} Σ_j Σ_{x∈T_{q_i}^j} d(r*, u_x, D*) w(x) + Σ_{q_i∈Q} d(r*, u_{q_i}, D*) w(q_i)
            = Σ_{q_i∈Q} Σ_j Σ_{x∈T_{q_i}^j} [d(r*, u(T^{q_i})) + d(u(T^{q_i}), u(T_{q_i}^j)) + d(u(T_{q_i}^j), u_x)] w(x)
              + Σ_{q_i∈Q} [d(r*, u(T^{q_i})) + d(u(T^{q_i}), u_{q_i})] w(q_i)
            = Σ_{q_i∈Q} d(r*, u(T^{q_i})) w(T^{q_i}) + Σ_{q_i∈Q} Σ_j d(u(T^{q_i}), u(T_{q_i}^j)) w(T_{q_i}^j)
              + Σ_{q_i∈Q} Σ_j Σ_{x∈T_{q_i}^j} d(u(T_{q_i}^j), u_x) w(x) + Σ_{q_i∈Q} d(u(T^{q_i}), u_{q_i}) w(q_i)   (10)

Now we lower bound each term of the last equation. The idea for analyzing the first term is the following: it can be seen as the cost of the decision tree D* under a cost function where the representative of T^{q_i}, for every i, has weight w(T^{q_i}) and all other nodes have weight zero.
Since D* is a binary tree, we can use Shannon's Coding Theorem to guarantee that the first term of (10) is lower bounded by H({w(T^{q_i})})/log 3 − w(T) (see Section C of the appendix for a more formal proof).

Now we bound the second term of (10). Fix a node q_i ∈ Q and consider two different trees T_{q_i}^j and T_{q_i}^{j′} such that d(r*, u(T_{q_i}^j)) = d(r*, u(T_{q_i}^{j′})). We claim that u(T_{q_i}^j) and u(T_{q_i}^{j′}) are siblings in D*. In fact, because their distances to r* are the same, u(T_{q_i}^j) is neither a descendant nor an ancestor of u(T_{q_i}^{j′}). Therefore, they have a common ancestor, say u_{(v,z)}, and one of these nodes is in the right subtree of u_{(v,z)} and the other in the left subtree of u_{(v,z)}. It is not difficult to see that u_{(v,z)} can only correspond to either the arc e_i^j or the arc e_i^{j′}. Without loss of generality suppose the latter; then the right child of u_{(v,z)} corresponds to an arc/node of T_{q_i}^{j′}, and therefore u(T_{q_i}^{j′}) must be this child. Due to their equal distance to r*, it follows that u(T_{q_i}^j) must be the other child of u_{(v,z)}.

As a consequence, we have that at any level ℓ of D* there are at most two representatives of the trees {T_{q_i}^j}. This, together with the fact that T_{q_i}^j is the jth heaviest tree rooted at a child of q_i, guarantees that

  Σ_{j=1}^{n_i} d(u(T^{q_i}), u(T_{q_i}^j)) w(T_{q_i}^j) ≥ Σ_{j=1}^{n_i} (j/2) w(T_{q_i}^j) ≥ Σ_{j=1}^{n_i} ((j − 1)/2) w(T_{q_i}^j)   (11)

and the last inequality gives a bound for the second term of (10).

Directly employing Lemma 2, we can lower bound the third term of (10) by:

  Σ_{q_i∈Q} Σ_{j=1}^{n_i} Σ_{x∈T_{q_i}^j} d(u(T_{q_i}^j), u_x, D*) w(x) ≥ Σ_{q_i∈Q} Σ_{j=1}^{n_i} OPT(T_{q_i}^j, w)

Finally, for the fourth term we fix q_i and note that the path in D* connecting u(T^{q_i}) to u_{q_i} must contain the nodes corresponding to the arcs e_i^1, ..., e_i^{n_i} (otherwise, when traversing D* and reaching u_{q_i}, we would not have enough knowledge to infer that q_i is the marked node, contradicting the validity of D*). Applying this reasoning for each q_i, we conclude that the last term of (10) is lower bounded by Σ_{q_i} n_i · w(q_i).

Therefore, applying the previous discussion to lower bound the terms of (10), we obtain:

  OPT(T, w) ≥ H({w(T^{q_i})})/log 3 − 3w(T)/2 + Σ_{q_i∈Q} Σ_{j=1}^{n_i} (j · w(T_{q_i}^j))/2 + Σ_{q_i∈Q} Σ_{j=1}^{n_i} OPT(T_{q_i}^j, w) + Σ_{q_i∈Q} n_i w(q_i)   (12)

3.3 Alternative Lower Bounds

When the value of the entropy H({w(T^{q_i})}) is large enough, it dominates the term −3w(T)/2 in inequality (12) and leads to a sufficiently strong lower bound. However, when this entropy assumes a small value we need to adopt a different strategy to devise an effective bound. For this purpose we present two different lower bounds which do not contain an entropy term but have improved c · w(T) terms.

First alternative lower bound. In order to improve the term −3w(T)/2 that appears in (12), we use almost the same derivation that leads from (10) to (12); however, we simply lower bound the first summation of (10) by zero. This gives:

  OPT(T, w) ≥ −w(T)/2 + Σ_{q_i∈Q} Σ_{j=1}^{n_i} (j · w(T_{q_i}^j))/2 + Σ_{q_i∈Q} Σ_{j=1}^{n_i} OPT(T_{q_i}^j, w) + Σ_{q_i∈Q} n_i w(q_i)   (13)

Second alternative lower bound. Now we devise a lower bound without the −w(T)/2 term; however, it also does not contain the important term Σ_{q_i∈Q} n_i w(q_i). Consider a tree T_{q_i}^j and a node x ∈ T_{q_i}^j. By the definition of representative, u(T_{q_i}^j) is an ancestor of u_x in D*, thus d(r*, u_x, D*) = d(r*, u(T_{q_i}^j), D*) + d(u(T_{q_i}^j), u_x, D*). Because {T_{q_i}^j} and Q form a partition of the nodes of T, the cost OPT(T, w) can be written as:

  OPT(T, w) = Σ_{q_i,j} d(r*, u(T_{q_i}^j)) w(T_{q_i}^j) + Σ_{q_i∈Q} d(r*, u_{q_i}) w(q_i) + Σ_{q_i,j} Σ_{x∈T_{q_i}^j} d(u(T_{q_i}^j), u_x) w(x)   (14)

Now we analyze the first two summations of the previous equation.
Because the trees {T_{q_i}^j} and the nodes {q_i} are pairwise disjoint, the nodes {u(T_{q_i}^j)} and {u_{q_i}} are all distinct, and consequently at most one of them can be equal to r*. Since we assumed that T has more than one node, r* must correspond to an arc of T. Therefore d(r*, u(T_{q_i}^j), D*) ≥ 1 when T_{q_i}^j is a leaf of T, and also d(r*, u_{q_i}, D*) ≥ 1 for all q_i ∈ Q. Then the sum of the first two terms of equation (14) is at least

  Σ_{q_i,j} d(r*, u(T_{q_i}^j)) w(T_{q_i}^j) + Σ_{q_i∈Q} d(r*, u_{q_i}) w(q_i) ≥ w(T) − w(T_{q_i}^j)

where T_{q_i}^j is a subtree of T that is not a leaf. It follows from the first presented property of heavy paths that this sum can be further lower bounded by w(T) − w(T)/2 = w(T)/2. Combining this fact with a lower bound for the last summation of equation (14) provided by Lemma 2 (with T′ = T_{q_i}^j), we have:

  OPT(T, w) ≥ w(T)/2 + Σ_{q_i∈Q} Σ_{j=1}^{n_i} OPT(T_{q_i}^j, w)   (15)

3.4 Approximation Guarantee

We proceed by induction over the number of nodes of T, where the base case is the trivial one in which T has only one node. Notice that because each tree T_{q_i}^j is properly contained in T, the inductive hypothesis asserts that D_i^j is an approximate decision tree for T_{q_i}^j, namely cost(D_i^j, w) ≤ α·OPT(T_{q_i}^j, w) for some constant α. The analysis needs to be carried out in two different cases depending on the value of the entropy (henceforth we use H as a shorthand for H({w(T^{q_i})})).

Case 1: H/log 3 ≥ 2w(T). Applying this bound to inequality (12) we have:

  OPT(T, w) ≥ H/(4 log 3) + Σ_{q_i,j} (j · w(T_{q_i}^j))/2 + Σ_{q_i,j} OPT(T_{q_i}^j, w) + Σ_{q_i∈Q} n_i w(q_i)   (16)

Employing the entropy hypothesis on the first term of inequality (9) and using the inductive hypothesis, we have:

  cost(D, w) ≤ H · (log 3 + 1)/log 3 + Σ_{q_i,j} j · w(T_{q_i}^j) + Σ_{q_i,j} α·OPT(T_{q_i}^j, w) + Σ_{q_i∈Q} n_i w(q_i)

Setting α ≥ 4(log 3 + 1), it follows from the previous inequality and from inequality (16) that cost(D, w) ≤ α·OPT(T, w).

Case 2: H/log 3 ≤ 2w(T).
Applying the entropy hypothesis and the inductive hypothesis to inequality (9), we have:

  cost(D, w) ≤ (2 log 3 + 2) w(T) + Σ_{q_i,j} j · w(T_{q_i}^j) + Σ_{q_i,j} α·OPT(T_{q_i}^j, w) + Σ_{q_i∈Q} n_i w(q_i)

Adding together (α − 2) times inequality (15) and two times inequality (13), we have:

  α·OPT(T, w) ≥ (α − 4) w(T)/2 + Σ_{q_i,j} j · w(T_{q_i}^j) + Σ_{q_i,j} α·OPT(T_{q_i}^j, w) + 2 Σ_{q_i∈Q} n_i w(q_i)

Setting α ≥ 2(2 log 3 + 2) + 4, we have that cost(D, w) ≤ α·OPT(T, w).

Therefore, the inductive step holds for both cases when α ≥ 2(2 log 3 + 2) + 4. In fact, a more careful choice for the threshold that separates the cases of the analysis achieves a slightly improved approximation guarantee. Using H/log 3 ≤ 1.86 w(T) and H/log 3 ≥ 1.86 w(T) as the hypotheses for cases 1 and 2 respectively, straightforward calculations show that the approximation factor of the algorithm is at most 14.

Theorem 1. The proposed algorithm is a 14-approximation for the problem of binary searching in trees.

4 Efficient Implementation

Notice that in order to implement Step (iv) of the algorithm, we need to sort the trees {T_{q_i}^j}. As we argue in Section 4.3, all other steps of the algorithm can be performed in linear time, and consequently this sorting operation is the algorithm's bottleneck. Nonetheless, we resort to an 'approximate sorting' procedure which allows a linear time implementation of the algorithm.

4.1 Approximate Sorting

Informally, given a set of nonnegative real numbers S = {s_1, s_2, ..., s_{|S|}}, we want to sort them such that they are roughly in non-increasing order. In the next paragraph we define more formally the concept of an approximate sorting of S.

Without loss of generality, assume that S is nonempty and that s_1 ≥ s_2 ≥ ... ≥ s_{|S|}. Let ŝ = s_1 be the largest number in S. Then we define a set of intervals {I_1, ..., I_{t+1}}, for t = ⌈log |S|²⌉, as follows: I_1 = [0, ŝ/2^t), I_i = [ŝ/2^{t−i+2}, ŝ/2^{t−i+1}) for i = 2, ..., t, and I_{t+1} = [ŝ/2, ŝ].
We say that a permutation σ of {1, ..., |S|} approximately sorts S iff for every i > j the elements of interval I_i appear before those of I_j in the sequence induced by σ.

Proposition 1. Let σ be an approximate sorting for S. For every 1 ≤ i ≤ |S|, if s_i belongs to an interval I_j then s_{σ(i)} also belongs to I_j.

The next lemma will be useful to show that we do not lose too much when employing an approximately sorted sequence instead of a sorted one in Step (iv) of our algorithm.

Lemma 3. Let σ be an approximate sorting for S. Then:

  Σ_{i=1}^{|S|} i · s_{σ(i)} ≤ 3 Σ_{i=1}^{|S|} i · s_i

Proof. Let P_i be the set of indexes of the {s_j} which belong to the interval I_i, namely P_i = {j : s_j ∈ I_i}. Because each element of S belongs to exactly one interval I_i, we can recast the inequality of the lemma as:

  Σ_{i=1}^{t+1} Σ_{j∈P_i} j · s_{σ(j)} ≤ 3 Σ_{i=1}^{t+1} Σ_{j∈P_i} j · s_j

Thus it suffices to prove the validity of the previous inequality. We analyze separately the sums Σ_{j∈P_i} j · s_{σ(j)} for each interval I_i.

First consider P_i with 2 ≤ i ≤ t + 1. It follows from Proposition 1 that if an index j belongs to P_i then both s_j and s_{σ(j)} belong to I_i. Therefore, making use of the boundaries of the interval I_i, we have:

  Σ_{j∈P_i} j · s_{σ(j)} ≤ (ŝ/2^{t−i+1}) Σ_{j∈P_i} j = 2 (ŝ/2^{t−i+2}) Σ_{j∈P_i} j ≤ 2 Σ_{j∈P_i} j · s_j   (17)

Now consider the indexes in P_1. Again, for all j ∈ P_1 we have that s_{σ(j)} ∈ I_1 and consequently s_{σ(j)} ≤ ŝ/2^t. Therefore:

  Σ_{j∈P_1} j · s_{σ(j)} ≤ (ŝ/2^t) Σ_{j∈P_1} j ≤ (ŝ/2^t) Σ_{j=1}^{|S|} j = (ŝ/2^t) · (|S|² + |S|)/2 ≤ ŝ · |S|²/2^t

Recalling that t = ⌈log |S|²⌉, we have:

  Σ_{j∈P_1} j · s_{σ(j)} ≤ ŝ · |S|²/2^{⌈log |S|²⌉} ≤ ŝ ≤ Σ_{i=1}^{t+1} Σ_{j∈P_i} j · s_j   (18)

Thus, employing inequalities (17) and (18), we complete the proof:

  Σ_{i=1}^{t+1} Σ_{j∈P_i} j · s_{σ(j)} ≤ Σ_{i=1}^{t+1} Σ_{j∈P_i} j · s_j + 2 Σ_{i=2}^{t+1} Σ_{j∈P_i} j · s_j ≤ 3 Σ_{i=1}^{t+1} Σ_{j∈P_i} j · s_j

Approximate Sorting in Linear Time. We say that an element s ∈ S has rank i if it belongs to the interval I_i.
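Combining the intervals, the ranks, and a bucketing pass gives the following sketch of an approximate sort (ours, not the paper's). For brevity it computes ranks with `math.log2`; the table-based rank computation described next avoids floating-point logarithms so as to stay strictly within linear time.

```python
import math

def approx_sort(S):
    """Approximately sort S in non-increasing order.

    Elements are bucketed into the intervals I_1, ..., I_{t+1} defined above
    (t = ceil(log |S|^2)) and emitted from the heaviest interval down.  A sketch:
    ranks are computed here via math.log2 rather than the paper's tabulation."""
    if not S:
        return []
    s_hat = max(S)
    t = max(1, math.ceil(math.log2(len(S) ** 2)))
    buckets = [[] for _ in range(t + 2)]  # buckets[i] collects the elements of I_i
    for s in S:
        if s == s_hat:
            rank = t + 1                   # I_{t+1} = [s_hat/2, s_hat]
        elif s < s_hat / 2 ** t:
            rank = 1                       # I_1 = [0, s_hat/2^t)
        else:
            # s in I_i = [s_hat/2^{t-i+2}, s_hat/2^{t-i+1})
            #   <=>  i = t + 2 + floor(log2(s / s_hat))
            rank = t + 2 + math.floor(math.log2(s / s_hat))
        buckets[rank].append(s)
    out = []
    for rank in range(t + 1, 0, -1):       # heaviest interval first
        out.extend(buckets[rank])
    return out

print(approx_sort([5, 1, 3, 8, 2]))  # [5, 8, 3, 2, 1]: interval order, not fully sorted
```

Note that the output need not be fully sorted (5 precedes 8 above, since both fall in the top interval), which is exactly the slack that Lemma 3 shows costs at most a factor of 3.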
Once we know the rank of each element of S, we can employ a simple bucketing strategy to output an approximately sorted sequence in linear time. Thus, it suffices to show how to compute the ranks of all elements of S in O(|S|) time.

For simplicity of notation, define m = |S|. Also recall that ŝ is the maximum of S and that t = ⌈log m²⌉. Define T[j] = ⌈log(j + 1)⌉ + 1 for j ≥ 0. The first step towards finding the ranks is to compute T[j] for j = 0, ..., 2m − 1. This can be done in linear time using the following strategy: tabulate all powers 2^j for j = 0, ..., ⌈log 2m⌉, then set T[0] = 1 and fill the entries T[2^j], T[2^j + 1], ..., T[2^{j+1} − 1] with the value j + 2.

In order to calculate the rank of a number s ∈ S, we check whether s < 2^{⌈log m⌉−t} ŝ. In the affirmative case, s has rank T[⌊2^t s/ŝ⌋]. Otherwise, if s < ŝ then s has rank T[⌊2^{t−⌈log m⌉} s/ŝ⌋] + ⌈log m⌉. Finally, if s = ŝ then s has rank t + 1.

It is easy to verify that this procedure computes the correct rank for elements of S smaller than ŝ/2^t or equal to ŝ. For all other values of s (i.e., when s belongs to one of the intervals {I_2, ..., I_{t+1}}), the correctness of the procedure is based on the following two observations: first, if s ≥ 2^{⌈log m⌉−t} ŝ then T[⌊2^{t−⌈log m⌉} s/ŝ⌋] + ⌈log m⌉ = T[⌊2^t s/ŝ⌋]; second, for j = 2, ..., t + 1:

\[ s \in I_j \iff \frac{\hat{s}}{2^{t-j+2}} \le s < \frac{\hat{s}}{2^{t-j+1}} \iff 2^{j-2} \le \frac{2^t s}{\hat{s}} < 2^{j-1} \iff 2^{j-2} < \left\lfloor \frac{2^t s}{\hat{s}} \right\rfloor + 1 \le 2^{j-1} \iff \left\lceil \log\left( \left\lfloor \frac{2^t s}{\hat{s}} \right\rfloor + 1 \right) \right\rceil = j - 1 \iff T\!\left[ \left\lfloor \frac{2^t s}{\hat{s}} \right\rfloor \right] = j \]

Thus, by making use of this procedure, we can find in linear time an approximately sorted sequence for any set of nonnegative real numbers.

4.2 Modified Algorithm

Based on the previous remarks, we propose the following modification to our algorithm: in Step (iv), the algorithm now uses an approximate sorting of the trees {T_i^j}_j to build the tree D_i, for each q_i ∈ Q. In order to clarify this modification, we illustrate it for a particular q_i ∈ Q. In Step (iv), the modified algorithm finds a permutation σ_i which approximately sorts the weights
{w(T_{q_i}^1), w(T_{q_i}^2), ..., w(T_{q_i}^{n_i})}. Then it constructs the decision tree D_i using this permutation: the leftmost path of D_i consists of the nodes corresponding to e_i^{σ_i(1)}, ..., e_i^{σ_i(n_i)}, u_{q_i}, and D_i^{σ_i(j)} is the right subtree of the node corresponding to e_i^{σ_i(j)}.

An upper bound for this modified version of the algorithm can be derived by the same analysis used previously. In fact, the introduction of the approximate sorting only alters the third term of the upper bound presented in inequality (9), leading to the following bound:

\[ \mathrm{cost}(D, w) \le H(\{w(T^{q_i})\}) + 2w(T) + \sum_{q_i \in Q} \sum_{j=1}^{n_i} j \cdot w(T_{q_i}^{\sigma_i(j)}) + \sum_{q_i \in Q} \sum_{j=1}^{n_i} \mathrm{cost}(D_i^j, w) + \sum_{q_i \in Q} n_i w(q_i) \]

Using the result from Lemma 3, we can bound the third term of the previous inequality and reach the following expression:

\[ \mathrm{cost}(D, w) \le H(\{w(T^{q_i})\}) + 2w(T) + 3 \sum_{q_i \in Q} \sum_{j=1}^{n_i} j \cdot w(T_{q_i}^j) + \sum_{q_i \in Q} \sum_{j=1}^{n_i} \mathrm{cost}(D_i^j, w) + \sum_{q_i \in Q} n_i w(q_i) \tag{19} \]

Again considering separate cases for low and high entropy values (i.e., H/log 3 ≤ 1.72 w(T) and H/log 3 ≥ 1.72 w(T)) and performing the same analysis as before, it can be verified that the guarantee of the modified algorithm is at most 21.5.

4.3 Running Time

Finally, we argue that the modified algorithm presented in the previous section runs in O(n) time, where n is the number of nodes of the input tree T.

Preprocessing. The preprocessing step calculates, in one linear pass, the cumulative weights w(T_u) = Σ_{v ∈ T_u} w(v) for every node u ∈ T. This can be accomplished using a bottom-up traversal. Notice that the input tree of every recursive call launched during the execution of the algorithm is a maximal subtree of T. Therefore, the cumulative weight of a node u with respect to one such tree is exactly w(T_u). As a consequence, any cumulative weight used during the algorithm can be found in constant time after the preprocessing step.

Algorithm.
A heavy path Q can be found by starting at the root of T and, at each step, adding the non-leaf child with greatest cumulative weight. This procedure takes O(deg(Q)) time, where deg(Q) is the sum of the degrees of all nodes in Q. The rest of Step (i) can also be done in O(deg(Q)) time, and Step (ii) takes O(|Q|) time. Because T_{q_i}^j = T_j, it is also easy to construct the instances for Step (iii) in time linearly proportional to the number of such trees, which is again O(deg(Q)). Assuming that the decision trees {D_i^j} have already been computed recursively, building each tree D_i takes time linearly proportional to the degree of q_i, already including the time to approximately sort the trees {T_{q_i}^j}_j. Therefore, Step (iv) can also be computed in O(deg(Q)) time. Finally, Step (v) can easily be implemented in O(deg(Q)) time.

Thus, excluding the time spent in recursive calls, an execution of the algorithm takes O(deg(Q)) time, where Q is the heavy path of the input tree of that execution. Adding up this running time over every call launched during the execution of the algorithm gives the total time complexity, which is therefore linearly proportional to the sum of the degrees of the nodes in the heavy paths found during the execution. As these paths are pairwise disjoint, the total running time is linear in the number of edges of the input tree T, which is O(n).

Theorem 2. There is a linear time algorithm which provides a constant factor approximation for the problem of binary searching in trees.

5 Conclusions

We presented a linear time algorithm that achieves a constant approximation ratio for the problem of minimizing the expected cost of binary searching in trees. This constitutes a significant improvement over the O(log n)-approximation algorithm available in the literature. The main question that remains open is whether the problem studied in this paper admits an exact polynomial time algorithm.
References

[1] M. Adler, E. Demaine, N. Harvey, and M. Patrascu. Lower bounds for asymmetric communication channels and distributed source coding. In SODA, pages 251–260, 2006.
[2] M. Adler and B. Maggs. Protocols for asymmetric communication channels. Journal of Computer and System Sciences, 63(4):573–596, 2001.
[3] A. Barkan and H. Kaplan. Partial alphabetic trees. Journal of Algorithms, 58(2):81–103, 2006.
[4] Y. Ben-Asher, E. Farchi, and I. Newman. Optimal search in trees. SIAM Journal on Computing, 28(6):2090–2102, 1999.
[5] R. Carmo, J. Donadelli, Y. Kohayakawa, and E. Laber. Searching in random partially ordered sets. Theoretical Computer Science, 321(1):41–57, 2004.
[6] V. Chakaravarthy, V. Pandit, S. Roy, P. Awasthi, and M. Mohania. Decision trees for entity identification: Approximation algorithms and hardness results. In PODS, pages 53–62, 2007.
[7] R. de Prisco and A. de Santis. On binary search trees. Information Processing Letters, 45(5):249–253, 1993.
[8] K. Douïeb and S. Langerman. Near-entropy hotlink assignments. In ESA, pages 292–303, 2006.
[9] R. Gallager. Information Theory and Reliable Communication. John Wiley & Sons, Inc., New York, NY, USA, 1968.
[10] S. Ghazizadeh, M. Ghodsi, and A. Saberi. A new protocol for asymmetric communication channels: Reaching the lower bounds. Scientia Iranica, 8(4), 2001.
[11] M. Golin, C. Kenyon, and N. Young. Huffman coding with unequal letter costs. In STOC, pages 785–791, 2002.
[12] R. Karp. Minimum-redundancy coding for the discrete noiseless channel. IRE Transactions on Information Theory, 7(1):27–38, 1961.
[13] W. Knight. Search in an ordered array having variable probe cost. SIAM Journal on Computing, 17(6):1203–1214, 1988.
[14] D. Knuth. The Art of Computer Programming, Volume 3: Sorting and Searching. Addison Wesley Longman Publishing Co., Inc., Redwood City, CA, USA, 1998.
[15] R. Kosaraju, T. Przytycka, and R. Borgstrom. On an optimal split tree problem. In WADS, pages 157–168, 1999.
[16] E. Laber and L. Holanda. Improved bounds for asymmetric communication protocols. Information Processing Letters, 83(4):205–209, 2002.
[17] E. Laber, R. Milidiú, and A. Pessoa. Strategies for searching with different access costs. In ESA, pages 236–247, 1999.
[18] E. Laber, R. Milidiú, and A. Pessoa. On binary searching with non-uniform costs. In SODA, pages 855–864, 2001.
[19] E. Laber and L. Nogueira. On the hardness of the minimum height decision tree problem. Discrete Applied Mathematics, 144(1-2):209–212, 2004.
[20] M. Lipman and J. Abrahams. Minimum average cost testing for partially ordered components. IEEE Transactions on Information Theory, 41(1):287–291, 1995.
[21] S. Mozes, K. Onak, and O. Weimann. Finding an optimal tree searching strategy in linear time. In SODA, pages 1096–1105, 2008.
[22] G. Navarro, R. Baeza-Yates, E. Barbosa, N. Ziviani, and W. Cunto. Binary searching with nonuniform costs and its application to text retrieval. Algorithmica, 27(2):145–169, 2000.
[23] K. Onak and P. Parys. Generalization of binary search: Searching in trees and forest-like partial orders. In FOCS, pages 379–388, 2006.
[24] J. Szwarcfiter, G. Navarro, R. Baeza-Yates, J. de S. Oliveira, W. Cunto, and N. Ziviani. Optimal binary search trees with costs depending on the access paths. Theoretical Computer Science, 290(3):1799–1814, 2003.
[25] J. Watkinson, M. Adler, and F. Fich. New protocols for asymmetric communication channels. In SIROCCO, pages 337–350, 2001.

Appendix

A Proof of Lemma 1

Given any tree T′ and two nodes u, v ∈ T′, a lowest common ancestor of u and v is a node x ∈ T′ such that: (i) x is an ancestor of both u and v; (ii) no proper descendant of x is an ancestor of both u and v. The following proposition is easily verified:

Proposition 2. Consider a tree T′ and two nodes u and v of it such that u is neither an ancestor nor a descendant of v. Then the lowest common ancestor x of u and v is a proper ancestor of both of them.
Furthermore, if u′ (resp. v′) is the child of x which is an ancestor of u (resp. v), then u′ ≠ v′.

Proof of Lemma 1. Consider a subtree T′ of T. We define u(T′) as the node of D closest to r(D) which corresponds to an arc or node of T′. More formally, u(T′) = u_α, where α = argmin_{α′} {d(r(D), u_{α′}, D) : α′ is a node or an arc of T′}. We claim that u(T′) is a representative of T′; to prove this it suffices to argue that u(T′) is an ancestor of all nodes of D that correspond to nodes or arcs of T′.

By means of contradiction, let β be a node or arc of T′ such that u_β is not a descendant of u(T′) = u_α. The node u_β cannot be a proper ancestor of u_α, or that would contradict the choice of u_α. Thus, by Proposition 2, u_α and u_β are in subtrees rooted at different children of their lowest common ancestor, say u_{(i,j)}. Without loss of generality, assume that u_β belongs to the right subtree of u_{(i,j)} and that u_α belongs to the left subtree of u_{(i,j)}. By definition, β is an arc/node of T_j and α is an arc/node of T − T_j. Thus, the root of T′ must be a proper ancestor of j so that both α and β belong to T′. But this implies that (i, j) ∈ T′. Because u_{(i,j)} is a proper ancestor of u_α, it is closer to the root of D, which contradicts the choice of u_α. Therefore, u(T′) must be an ancestor of all nodes of D that correspond to arcs/nodes of T′.

Notice that u(T′) is the unique representative of T′: otherwise, a different representative would have to be a proper ancestor of u(T′), contradicting the fact that u(T′) is a representative.

B Searching in Path-like Trees

We argue that the problem of searching in path-like trees can be reduced to the well-known problem of searching in ordered lists with different access probabilities. The latter problem is defined as follows. As input we have a list L = {l_1, ..., l_n} of numbers in increasing order and probabilities {p_1, ..., p_n, q_1, ..., q_{n+1}}.
One number, which may not belong to L, is marked; p_i is the probability that l_i is the marked number and q_i is the probability that the marked number lies in the interval (l_{i−1}, l_i) (using the convention that l_0 = −∞ and l_{n+1} = ∞). The goal is to find the location of the marked number, that is, to identify which l_i equals the marked number (in case it belongs to L) or which interval (l_{i−1}, l_i) contains it. For that, one may query elements of L, where querying l_i reveals whether the marked number is greater than l_i, less than l_i, or l_i itself.

The reduction from the problem of searching in path-like trees to the above problem goes as follows. Given an instance (T, w) where T = (t_1 → ... → t_n) is a path, we create a list L with the first n − 1 natural numbers and associate the arcs of T with the elements of L and the nodes of T with the intervals delimited by L. That is, we set p_i = 0 for all 1 ≤ i ≤ n − 1 and q_i = w(t_i) for all 1 ≤ i ≤ n. It is not difficult to see that a search strategy for L gives a search strategy for T (with the same expected number of queries) and vice versa. Due to this equivalence, approximation algorithms for the problem of searching in lists with access probabilities also approximate the problem of searching in paths, with the same guarantee.

C Proofs for the Entropy Lower Bound

Lemma 4.

\[ \sum_{q_i \in Q} d(r^*, u(T^{q_i}), D^*)\, w(T^{q_i}) \ge \frac{H(\{w(T^{q_i})\})}{\log 3} - w(T) \]

Proof. Construct the tree D′ by adding new leaves to D* as follows: for each node q_i ∈ Q, add a leaf l_i to u(T^{q_i}) (the relative position of siblings can be ignored in this analysis). Now define the probability distribution w′ on D′ such that w′(l_i) = w(T^{q_i})/w(T), and all other nodes of D′ have zero probability. Clearly, the distance from r(D′) to l_i in D′ equals d(r*, u(T^{q_i}), D*) + 1.
Because only the nodes {l_i} have nonzero probability, the cost of D′ is:

\[ \mathrm{cost}(D', w') = \sum_{l_i} d(r(D'), l_i, D')\, w'(l_i) = \sum_{q_i} \left( d(r^*, u(T^{q_i}), D^*) + 1 \right) \frac{w(T^{q_i})}{w(T)} \]

Because the trees {T^{q_i}} are pairwise disjoint, for q_i ≠ q_j we have u(T^{q_i}) ≠ u(T^{q_j}). Therefore at most one new leaf l_i is added to each node u(T^{q_i}), and D′ is at most a ternary tree. We can then use the Shannon Coding Theorem [9] to bound the cost of D′ as cost(D′, w′) ≥ H({w′(l_i)})/log 3. Substituting this bound into the previous displayed equality and noticing that H({w(T^{q_i})}) = H({w′(l_i)}) · w(T), we have:

\[ \frac{H(\{w(T^{q_i})\})}{\log 3} \le \sum_{q_i} \left( d(r^*, u(T^{q_i}), D^*) + 1 \right) w(T^{q_i}) \]

The result follows by reorganizing the previous inequality and noticing that w(T) = Σ_{q_i} w(T^{q_i}), as the trees {T^{q_i}} define a partition of the nodes of T.