Theoretical Computer Science 337 (2005) 217 – 239 www.elsevier.com/locate/tcs A survey on tree edit distance and related problems夡 Philip Bille∗ The IT University of Copenhagen, Rued Langgardsvej 7, DK-2300 Copenhagen S, Denmark Received 23 March 2004; received in revised form 23 December 2004; accepted 30 December 2004 Communicated by A. Apostolico Abstract We survey the problem of comparing labeled trees based on simple local operations of deleting, inserting, and relabeling nodes. These operations lead to the tree edit distance, alignment distance, and inclusion problem. For each problem we review the results available and present, in detail, one or more of the central algorithms for solving the problem. © 2005 Elsevier B.V. All rights reserved. Keywords: Tree matching; Tree edit distance; Tree alignment; Tree inclusion 1. Introduction Trees are among the most common and well-studied combinatorial structures in computer science. In particular, the problem of comparing trees occurs in several diverse areas such as computational biology, structured text databases, image analysis, automatic theorem proving, and compiler optimization [43,55,22,24,16,35,56]. For example, in computational biology, computing the similarity between trees under various distance measures is used in the comparison of RNA secondary structures [55,18]. 夡 This work is part of the DSSCV project supported by the IST Programme of the European Union (IST-200135443). ∗ Tel.: +45 72 18 50 00; fax: +45 72 18 50 01. E-mail address: [email protected]. 0304-3975/$ - see front matter © 2005 Elsevier B.V. All rights reserved. doi:10.1016/j.tcs.2004.12.030 218 P. Bille / Theoretical Computer Science 337 (2005) 217 – 239 l1 l2 l1 l1 (a) l2 (b) l1 l1 l2 (c) Fig. 1. (a) A relabeling of the node label l1 to l2 . (b) Deleting the node labeled l2 . (c) Inserting a node labeled l2 as the child of the node labeled l1 . Let T be a rooted tree. We call T a labeled tree if each node is a assigned a symbol from a fixed finite alphabet . We call T an ordered tree if a left-to-right order among siblings in T is given. In this paper we consider matching problems based on simple primitive operations applied to labeled trees. If T is an ordered tree these operations are defined as follows: Relabel: Change the label of a node v in T. Delete: Delete a non-root node v in T with parent v , making the children of v become the children of v . The children are inserted in the place of v as a subsequence in the left-to-right order of the children of v . Insert: The complement of delete. Insert a node v as a child of v in T making v the parent of a consecutive subsequence of the children of v . Fig. 1 illustrates the operations. For unordered trees the operations can be defined similarly. In this case, the insert and delete operations works on a subset instead of a subsequence. We define three problems based on the edit operations. Let T1 and T2 be labeled trees (ordered or unordered). Tree edit distance: Assume that we are given a cost function defined on each edit operation. An edit script S between T1 and T2 is a sequence of edit operations turning T1 into T2 . The cost of S is the sum of the costs of the operations in S. An optimal edit script between T1 and T2 is an edit script between T1 and T2 of minimum cost and this cost is the tree edit distance. The tree edit distance problem is to compute the edit distance and a corresponding edit script. Tree alignment distance: Assume that we are given a cost function defined on pair of labels. An alignment A of T1 and T2 is obtained as follows. First we insert nodes labeled with spaces into T1 and T2 so that they become isomorphic when labels are ignored. The resulting trees are then overlaid on top of each other giving the alignment A, which is a tree where each node is labeled by a pair of labels. The cost of A is the sum of costs of all pairs of opposing labels in A. An optimal alignment of T1 and T2 is an alignment of minimum P. Bille / Theoretical Computer Science 337 (2005) 217 – 239 219 cost and this cost is called the alignment distance of T1 and T2 . The alignment distance problem is to compute the alignment distance and a corresponding alignment. Tree inclusion: T1 is included in T2 if and only if T1 can be obtained by deleting nodes from T2 . The tree inclusion problem is to determine if T1 is included in T2 . In this paper we survey each of these problems and discuss the results obtained for them. For reference, Table 1 summarizes most of the available results. All of these and a few others are covered in the text. The tree edit distance problem is the most general of the problems. The alignment distance corresponds to a kind of restricted edit distance, while tree inclusion is a special case of both the edit and alignment distance problem. Apart from these simple relationships, interesting variations on the edit distance problem has been studied leading to a more complex picture. Both the ordered and unordered version of the problems are reviewed. For the unordered case, it turns out that all of the problems in general are NP-hard. Indeed, the tree edit distance and alignment distance problems are even MAX SNP-hard [4]. However, under various interesting restrictions, or for special cases, polynomial time algorithms are available. For instance, if we impose a structure preserving restriction on the unordered tree edit distance problem, such that disjoint subtrees are mapped to disjoint subtrees, it can be solved in polynomial time. Also, unordered alignment for constant degree trees can be solved efficiently. For the ordered version of the problems polynomial time algorithms exists. These are all based on the classic technique of dynamic programming (see, e.g., [9, Chapter 15]) and most of them are simple combinatorial algorithms. Recently, however, more advanced techniques such as fast matrix multiplication have been applied to the tree edit distance problem [8]. The survey covers the problems in the following way. For each problem and variations of it we review results for both the ordered and unordered version. This will, in most cases, include a formal definition of the problem, a comparison of the available results and a description of the techniques used to obtain the results. More importantly, we will also pick one or more of the central algorithms for each of the problems and present it in almost full detail. Specifically, we will describe the algorithm, prove that it is correct, and analyze its time complexity. For brevity, we will omit the proofs of a few lemmas and skip over some less important details. Common for the algorithms presented in detail is that, in most cases, they are the basis for more advanced algorithms. Typically, most of the algorithms for one of the above problems are refinements of the same dynamic programming algorithm. The main technical contribution of this survey is to present the problems and algorithms in a common framework. Hopefully, this will enable the reader to gain a better overview and deeper understanding of the problems and how they relate to each other. In the literature, there are some discrepancies in the presentations of the problems. For instance, the ordered edit distance problem was considered by Klein [25] who used edit operations on edges. He presented an algorithm using a reduction to a problem defined on balanced parenthesis strings. In contrast, Zhang and Shasha [55] gave an algorithm based on the postorder numbering on trees. In fact, these algorithms share many features which become apparent if considered in the right setting. In this paper we present these algorithms in a new framework bridging the gap between the two descriptions. Another problem in the literature is the lack of an agreement on a definition of the edit distance problem. The definition given here is by far the most studied and in our 220 P. Bille / Theoretical Computer Science 337 (2005) 217 – 239 opinion the most natural. However, several alternatives ending in very different distance measures have been considered [30,45,38,31]. In this paper we review these other variants and compare them to our definition. We should note that the edit distance problem defined here is sometimes referred to as the tree-to-tree correction problem. This survey adopts a theoretical point of view. However, the problems above are not only interesting mathematical problems but they also occur in many practical situations and it is important to develop algorithms that perform well on real-life problems. For practical issues see, e.g., [49,46,40]. We restrict our attention to sequential algorithms. However, there has been some research in parallel algorithms for the edit distance problem, e.g., [55,53,41]. This summarizes the contents of this paper. Due to the fundamental nature of comparing trees and its many applications several other ways to compare trees have been devised. In this paper, we have chosen to limit ourselves to a handful of problems which we describe in detail. Other problems include tree pattern matching [27,10] and [16,35,56], maximum agreement subtree [19,11], largest common subtree [2,20], and smallest common supertree [34,13]. 1.1. Outline In Section 2 we give some preliminaries. In Sections 3, 4, and 5 we survey the tree edit distance, alignment distance, and inclusion problems, respectively. We conclude in Section 6 with some open problems. 2. Preliminaries and notation In this section we define notations and definitions that we use throughout the paper. For a graph G we denote the set of nodes and edges by V (G) and E(G), respectively. Let T be a rooted tree. The root of T is denoted by root(T ). The size of T, denoted by |T |, is |V (T )|. The depth of a node v ∈ V (T ), depth(v), is the number of edges on the path from v to root(T ). The in-degree of a node v, deg(v) is the number of children of v. We extend these definitions such that depth(T ) and deg(T ) denotes the maximum depth and degree, respectively, of any node in T. A node with no children is a leaf and otherwise an internal node. The number of leaves of T is denoted by leaves(T ). We denote the parent of node v by parent(v). Two nodes are siblings if they have the same parent. For two trees T1 and T2 , we will frequently refer to leaves(Ti ), depth(Ti ), and deg(Ti ) by Li , Di , and Ii , i = 1, 2. Let denote the empty tree and let T (v) denote the subtree of T rooted at a node v ∈ V (T ). If w ∈ V (T (v)) then v is an ancestor of w, and if w ∈ V (T (v))\{v} then v is a proper ancestor of w. If v is a (proper) ancestor of w then w is a (proper) descendant of v. A tree T is ordered if a left-to-right order among the siblings is given. For an ordered tree T with root v and children v1 , . . . , vi , the preorder traversal of T (v) is obtained by visiting v and then recursively visiting T (vk ), 1 k i, in order. Similarly, the postorder traversal is obtained by first visiting T (vk ), 1 k i, and then v. The preorder number and postorder number of a node w ∈ T (v), denoted by pre(w) and post(w), is the number of nodes preceding w in the preorder and postorder traversal of T, respectively. The nodes to the left of w in T is P. Bille / Theoretical Computer Science 337 (2005) 217 – 239 221 the set of nodes u ∈ V (T ) such that pre(u) < pre(w) and post(u) < post(w). If u is to the left of w then w is to the right of u. A forest is a set of trees. A forest F is ordered if a left-to-right order among the trees is given and each tree is ordered. Let T be an ordered tree and let v ∈ V (T ). If v has children v1 , . . . , vi define F (vs , vt ), where 1 s t i, as the forest T (vs ), . . . , T (vr ). For convenience, we set F (v) = F (v1 , vi ). We assume throughout the paper that labels assigned to nodes are chosen from a finite alphabet . Let ∈ denote a special blank symbol and define = ∪ . We often define a cost function, : ( × )\(, ) → R, on pairs of labels. We will always assume that is a distance metric. That is, for any l1 ,l2 ,l3 ∈ the following conditions are satisfied: 1. (l1 , l2 ) 0, (l1 , l1 ) = 0, 2. (l1 , l2 ) = (l2 , l1 ), 3. (l1 , l3 ) (l1 , l2 ) + (l2 , l3 ). 3. Tree edit distance In this section we survey the tree edit distance problem. Assume that we are given a cost function defined on each edit operation. An edit script S between two trees T1 and T2 is a sequence of edit operations turning T1 into T2 . The cost of S is the sum of the costs of the operations in S. An optimal edit script between T1 and T2 is an edit script between T1 and T2 of minimum cost. This cost is called the tree edit distance, denoted by (T1 , T2 ). An example of an edit script is shown in Fig. 2. The rest of the section is organized as follows. First, in Section 3.1, we present some preliminaries and formally define the problem. In Section 3.2 we survey the results obtained for the ordered edit distance problem and present two of the currently best algorithms for the problem. The unordered version of the problem is reviewed in Section 3.3. In Section 3.4 we review results on the edit distance problem when various structure-preserving constraints are imposed. Finally, in Section 3.5 we consider some other variants of the problem. 3.1. Edit operations and edit mappings Let T1 and T2 be labeled trees. Following [43] we represent each edit operation by (l1 → l2 ), where (l1 , l2 ) ∈ ( × )\(, ). The operation is a relabeling if l1 = and l2 = , a deletion if l2 = , and an insertion if l1 = . We extend the notation such that (v → w) for nodes v and w denotes (label(v) → label(w)). Here, as with the labels, v or w may be . Given a metric cost function defined on pairs of labels we define the cost of an edit operation by setting (l1 → l2 ) = (l1 , l2 ). The cost of a sequence S = s1 , . . . , sk of operations is given by (S) = ki=1 (si ). The edit distance, (T1 , T2 ), between T1 and T2 is formally defined as: (T1 , T2 ) = min{(S) | S is a sequence of operations transforming T1 into T2 }. Since is a distance metric becomes a distance metric too. An edit distance mapping (or just a mapping) between T1 and T2 is a representation of the edit operations, which is used in many of the algorithms for the tree edit distance problem. 222 P. Bille / Theoretical Computer Science 337 (2005) 217 – 239 f a f e d e d c c d d a a b a b (a) b (b) (c) Fig. 2. Transforming (a) into (c) via editing operations. (a) A tree. (b) The tree after deleting the node labeled c. (c) The tree after inserting the node labeled c and relabeling f to a and e to d. a f e d c c d d a a b b Fig. 3. The mapping corresponding to the edit script in Fig. 2. Formally, define the triple (M, T1 , T2 ) to be an ordered edit distance mapping from T1 to T2 , if M ⊆ V (T1 ) × V (T2 ) and for any pair (v1 , w1 ), (v2 , w2 ) ∈ M: 1. v1 = v2 iff w1 = w2 (one-to-one condition). 2. v1 is an ancestor of v2 iff w1 is an ancestor of w2 (ancestor condition). 3. v1 is to the left of v2 iff w1 is to the left of w2 (sibling condition). Fig. 3 illustrates a mapping that corresponds to the edit script in Fig. 2. We define the unordered edit distance mapping between two unordered trees as the same, but without the sibling condition. We will use M instead of (M, T1 , T2 ) when there is no confusion. Let (M, T1 , T2 ) be a mapping. We say that a node v in T1 or T2 is touched by a line in M if v occurs in some pair in M. Let N1 and N2 be the set of nodes in T1 and T2 , respectively, not touched by any line in M. The cost of M is given by (M) = (v → w) + (v → ) + ( → w). (v,w)∈M v∈N1 w∈N2 Mappings can be composed. Let T1 , T2 , and T3 be labeled trees. Let M1 and M2 be a mapping from T1 to T2 and T2 to T3 , respectively. Define M1 ◦ M2 = {(v, w) | ∃u ∈ V (T2 ) such that (v, u) ∈ M1 and (u, w) ∈ M2 }. With this definition it follows easily that M1 ◦ M2 itself becomes a mapping from T1 to T3 . Since is a metric, it is not hard to show that a minimum cost mapping is equivalent to the edit distance: (T1 , T2 ) = min{(M) | (M, T1 , T2 ) is an edit distance mapping}. Hence, to compute the edit distance we can compute the minimum cost mapping. We extend the definition of edit distance to forests. That is, for two forests F1 and F2 , (F1 , F2 ) P. Bille / Theoretical Computer Science 337 (2005) 217 – 239 223 denotes the edit distance between F1 and F2 . The operations are defined as in the case of trees, however, roots of the trees in the forest may now be deleted and trees can be merged by inserting a new root. The definition of a mapping is extended in the same way. 3.2. General ordered edit distance The ordered edit distance problem was introduced by Tai [43] as a generalization of the well-known string edit distance problem [48]. Tai presented an algorithm for the ordered version using O(|T1 ||T2 ||L1 |2 |L2 |2 ) time and space. Subsequently, Zhang and Shasha [55] gave a simple algorithm improving the bounds to O(|T1 ||T2 | min(L1 , D1 ) min(L2 , D2 )) time and O(|T1 ||T2 |) space. This algorithm was modified by Klein [25] to get a better worst-case time bound of O(|T1 |2 |T2 | log |T2 |) 1 under the same space bounds. We present the latter two algorithms in detail below. Recently, Chen [8] has presented an algorithm 2 using O(|T1 ||T2 | + L21 |T2 | + L2.5 1 L2 ) time and O((|T1 | + L1 ) min(L2 , D2 ) + |T2 |) space. Hence, for certain kinds of trees the algorithm improves the previous bounds. This algorithm is more complex than all the above and uses results on fast matrix multiplication. Note that in the above bounds we can exchange T1 with T2 since the distance is symmetric. 3.2.1. A simple algorithm We first present a simple recursion which will form the basis for the two dynamic programming algorithms we present in the next two sections. We will only show how to compute the edit distance. The corresponding edit script can be easily obtained within the same time and space bounds. The algorithm is due to Klein [25]. However, we should note that the presentation given here is somewhat different. We believe that our framework is more simple and provides a better connection to previous work. Let F be a forest and v be a node in F. We denote by F − v the forest obtained by deleting v from F. Furthermore, define F − T (v) as the forest obtained by deleting v and all descendants of v. The following lemma provides a way to compute edit distances for the general case of forests. Lemma 1. Let F1 and F2 be ordered forests and be a metric cost function defined on labels. Let v and w be the rightmost (if any) roots of the trees in F1 and F2 , respectively. We have, (, ) = 0, (F1 , ) = (F1 − v, ) + (v → ), (, F2 ) = (, F2 − w) + ( → w), (F1 − v, F2 ) + (v → ), (F1 , F2 ) = min (F1 , F2 − w) + ( → w), (F1 (v), F2 (w)) + (F1 − T1 (v), F2 − T2 (w)) + (v → w). 1 Since the edit distance is symmetric this bound is in fact O(min(|T |2 |T | log |T |, |T |2 |T | log |T |). For 1 2 2 2 1 1 brevity we will use the short version. 224 P. Bille / Theoretical Computer Science 337 (2005) 217 – 239 Proof. The first three equations are trivially true. To show the last equation consider a minimum cost mapping M between F1 and F2 . There are three possibilities for v and w: Case 1: v is not touched by a line. Then (v, ) ∈ M and the first case of the last equation applies. Case 2: w is not touched by a line. Then (, w) ∈ M and the second case of the last equation applies. Case 3: v and w are both touched by lines. We show that this implies (v, w) ∈ M. Suppose (v, h) and (k, w) are in M. If v is to the right of k then h must be to right of w by the sibling condition. If v is a proper ancestor of k then h must be a proper ancestor of w by the ancestor condition. Both of these cases are impossible since v and w are the rightmost roots and hence (v, w) ∈ M. By the definition of mappings the equation follows. Lemma 1 suggests a dynamic programming algorithm. The value of (F1 , F2 ) depends on a constant number of subproblems of smaller size. Hence, we can compute (F1 , F2 ) by computing (S1 , S2 ) for all pairs of subproblems S1 and S2 in order of increasing size. Each new subproblem can be computed in constant time. Hence, the time complexity is bounded by the number of subproblems of F1 times the number of subproblems of F2 . To count the number of subproblems, define for a rooted, ordered forest F the (i, j )deleted subforest, 0 i +j |F |, as the forest obtained from F by first deleting the rightmost root repeatedly j times and then, similarly, deleting the leftmost root i times. We call the (0, j )-deleted and (i, 0)-deleted subforests, for 0 j |F |, the prefixes and the suffixes of |F | F, respectively. The number of (i, j )-deleted subforests of F is k=0 k = O(|F |2 ), since for each i there are |F | − i choices for j. It is not hard to show that all the pairs of subproblems S1 and S2 that can be obtained by the recursion of Lemma 1 are deleted subforests of F1 and F2 . Hence, by the above discussion the time complexity is bounded by O(|F1 |2 |F2 |2 ). In fact, fewer subproblems are needed, which we will show in the next sections. 3.2.2. Zhang and Shasha’s algorithm The following algorithm is due to Zhang and Shasha [55]. Define the keyroots of a rooted, ordered tree T as follows: keyroots(T ) = {root(T )} ∪ {v ∈ V (T ) | v has a left sibling}. The special subforests of T is the forests F (v), where v ∈ keyroots(T ). The relevant subproblems of T with respect to the keyroots is the prefixes of all special subforests F (v). In this section we refer to these as the relevant subproblems. Lemma 2. For each node v ∈ V (T ), F (v) is a relevant subproblem. It is easy to see that, in fact, the subproblems that can occur in the above recursion are either subforests of the form F (v), where v ∈ V (T ), or prefixes of a special subforest of T. Hence, it follows by Lemma 2 and the definition of a relevant subproblem, that to compute (F1 , F2 ) it is sufficient to compute (S1 , S2 ) for all relevant subproblems S1 and S2 of T1 and T2 , respectively. P. Bille / Theoretical Computer Science 337 (2005) 217 – 239 225 The relevant subproblems of a tree T can be counted as follows. For a node v ∈ V (T ) define the collapsed depth of v, cdepth(v), as the number of keyroot ancestors of v. Also, define cdepth(T ) as the maximum collapsed depth of all nodes v ∈ V (T ). Lemma 3. For an ordered tree T the number of relevant subproblems, with respect to the keyroots is bounded by O(|T |cdepth(T )). Proof. The relevant subproblems can be counted using the following expression: |F (v)| < |T (v)| = cdepth(v) |T |cdepth(T ) v∈keyroots(T ) v∈keyroots(T ) v∈V (T ) Since the number prefixes of a subforest F (v) is |F (v)| the first sum counts the number of relevant subproblems of F (v). To prove the first equality note that for each node v the number of special subforests containing v is the collapsed depth of v. Hence, v contributes the same amount to the left and right side. The other equalities/inequalities follow immediately. Lemma 4. For a tree T, cdepth(T ) min{depth(T ), leaves(T )} Thus, using dynamic programming it follows that the problem can be solved in O(|T1 ||T2 | min{D1 , L1 } min{D2 , L2 }) time and space. To improve the space complexity we carefully compute the subproblems in a specific order and discard some of the intermediate results. Throughout the algorithm we maintain a table called the permanent table storing the distances (F1 (v), F2 (w)), v1 ∈ V (F1 ) and w2 ∈ V (F2 ), as they are computed. This uses O(|F1 ||F2 |) space. When the distances of all special subforests of F1 and F2 are available in the permanent table, we compute the distance between all prefixes of F1 and F2 in order of increasing size and store these in a table called the temporary table. The values of the temporary table that are distances between special subforests are copied to the permanent table and the rest of the values are discarded. Hence, the temporary table also uses at most O(|F1 ||F2 |) space. By Lemma 1 it is easy to see that all values needed to compute (F1 , F2 ) are available. Hence, Theorem 1 (Zhang and Shasha [55]). For ordered trees T1 and T2 the edit distance problem can be solved in time O(|T1 ||T2 | min{D1 , L1 } min{D2 , L2 }) and space O(|T1 ||T2 |). 3.2.3. Klein’s algorithm In the worst case, that is for trees with linear depth and a linear number of leaves, Zhang and Shasha’s algorithm of the previous section still requires O(|T1 |2 |T2 |2 ) time as the simple algorithm. In [25] Klein obtained a better worst-case time bound of O(|T1 |2 |T2 | log |T2 |). The reported space complexity of the algorithm is O(|T1 |2 |T2 | log |T2 |) which is significantly worse than the algorithm of Zhang and Shasha. However, according to Klein [23] this algorithm can also be improved to O(|T1 ||T2 |). The algorithm is based on an extension of the recursion in Lemma 1. The main idea is to consider all of the O(|T1 |2 ) deleted subforests of T1 but only O(|T2 | log |T2 |) deleted 226 P. Bille / Theoretical Computer Science 337 (2005) 217 – 239 subforests of T2 . In total the worst-case number of subproblems is thus reduced to the desired bound above. A key concept in the algorithm is the decomposition of a rooted tree T into disjoint paths called heavy paths. This technique was introduced by Harel and Tarjan [15]. We define the size a node v ∈ V (T ) as |T (v)|. We classify each node of T as either heavy or light as follows. The root is light. For each internal node v we pick a child u of v of maximum size among the children of v and classify u as heavy. The remaining children are light. We call an edge to a light child a light edge, and an edge to a heavy child a heavy edge. The light depth of a node v, ldepth(v), is the number of light edges on the path from v to the root. Lemma 5 (Harel and Tarjan [15]). For any tree T and any v ∈ V (T ), ldepth(v) log |T |+ O(1). By removing the light edges T is partitioned into heavy paths. We define the relevant subproblems of T with respect to the light nodes below. We will refer to these as relevant subproblems in this section. First fix a heavy path decomposition of T. For a node v in T we recursively define the relevant subproblems of F (v) as follows: F (v) is relevant. If v is not a leaf, let u be the heavy child of v and let l and r be the number of nodes to the left and to the right of u in F (v), respectively. Then, the (i, 0)deleted subforests of F (v), 0 i l, and the (l, j )-deleted subforests of F (v), 0 j r are relevant subproblems. Recursively, all relevant subproblems of F (u) are relevant. The relevant subproblems of T with respect to the light nodes is the union of all relevant subproblems of F (v) where v ∈ V (T ) is a light node. Lemma 6. For an ordered tree T the number of relevant subproblems with respect to the light nodes is bounded by O(|T | ldepth(T )). Proof. Follows by the same calculation as in the proof of Lemma 3. Also note that Lemma 2 still holds with this new definition of relevant subproblems. Let S be a relevant subproblem of T and let vl and vr denote the leftmost and rightmost root of S, respectively. The difference node of S is either vr if S −vr is relevant or vl if S −vl is relevant. The recursion of Lemma 1 compares the rightmost roots. Clearly, we can also choose to compare the leftmost roots resulting in a new recursion, which we will refer to as the dual of Lemma 1. Depending on which recursion we use, different subproblems occur. We now give a modified dynamic programming algorithm for calculating the tree edit distance. Let S1 be a deleted tree of T1 and let S2 be a relevant subproblem of T2 . Let d be the difference node of S2 . We compute (S1 , S2 ) as follows. There are two cases to consider: 1. If d is the rightmost root of S2 compare the rightmost roots of S1 and S2 using Lemma 1. 2. If d is the leftmost root of S2 compare the leftmost roots of S1 and S2 using the dual of Lemma 1. It is easy to show that in both cases the resulting smaller subproblems of S1 will all be deleted subforests of T1 and the smaller subproblems of S2 will all be relevant subproblems of T2 . Using a similar dynamic programming technique as in the algorithm of Zhang and Shasha we obtain the following: P. Bille / Theoretical Computer Science 337 (2005) 217 – 239 227 Theorem 2 (Klein [25]). For ordered trees T1 and T2 the edit distance problem can be solved in time and space O(|T1 |2 |T2 | log |T2 |). Klein [25] also showed that his algorithm can be extended within the same time and space bounds to the unrooted ordered edit distance problem between T1 and T2 , defined as the minimum edit distance between T1 and T2 over all possible roots of T1 and T2 . 3.3. General unordered edit distance In the following section we survey the unordered edit distance problem. This problem has been shown to be NP-complete [58,50,57] even for binary trees with a label alphabet of size 2. The reduction is from the Exact Cover by 3-Sets problem [12]. Subsequently, the problem was shown to be MAX SNP-hard [54]. Hence, unless P = NP there is no PTAS for the problem [4]. It was shown in [58] that for special cases of the problem polynomial time algorithms exists. If T2 has one leaf, i.e., T2 is a sequence, the problem can be solved in O(|T1 ||T2 |) time. More generally, there is an algorithm running in time O(|T1 ||T2 | + L2 !3L2 (L32 + D12 )|T1 |). Hence, if the number of leaves in T2 is logarithmic the problem can be solved in polynomial time. 3.4. Constrained edit distance The fact that the general edit distance problem is difficult to solve has led to the study of restricted versions of the problem. In [51,52] Zhang introduced the constrained edit distance, denoted by c , which is defined as an edit distance under the restriction that disjoint subtrees should be mapped to disjoint subtrees. Formally, c (T1 , T2 ) is defined as a minimum cost mapping (Mc , T1 , T2 ) satisfying the additional constraint, that for all (v1 , w1 ), (v2 , w2 ), (v3 , w3 ) ∈ Mc : • nca(v1 , v2 ) is a proper ancestor of v3 iff nca(w1 , w2 ) is a proper ancestor of w3 . According to [29], Richter [37] independently introduced the structure respecting edit distance s . Similar to the constrained edit distance, s (T1 , T2 ) is defined as a minimum cost mapping (Ms , T1 , T2 ) satisfying the additional constraint, that for all (v1 , w1 ), (v2 , w2 ), (v3 , w3 ) ∈ Ms such that none of v1 , v2 , and v3 is an ancestor of the others, • nca(v1 , v2 ) = nca(v1 , v3 ) iff nca(w1 , w2 ) = nca(w1 , w3 ). It is straightforward to show that both of these notions of edit distance are equivalent. Henceforth, we will refer to them simply as the constrained edit distance. As an example consider the mappings of Fig. 4. (a) is a constrained mapping since nca(v1 , v2 ) = nca(v1 , v3 ) and nca(w1 , w2 ) = nca(w1 , w3 ). (b) is not constrained since nca(v1 , v2 ) = v4 = nca(v1 , v3 ) = v5 , while nca(w1 , w2 ) = w4 = nca(w1 , w3 ). (c) is not constrained since nca(v1 , v3 ) = v5 = nca(v2 , v3 ), while nca(w1 , w3 ) = v5 = nca(w2 , w3 ) = w4 . In [51,52] Zhang presents algorithms for computing minimum cost constrained mappings. For the ordered case he gives an algorithm using O(|T1 ||T2 |) time and for the unordered case he obtains a running time of O(|T1 ||T2 |(I1 + I2 ) log(I1 + I2 )). Both use space O(|T1 ||T2 |). The idea in both algorithms is similar. Due to the restriction on the mappings fewer subproblem need to be considered and a faster dynamic programming algorithm is obtained. In 228 P. Bille / Theoretical Computer Science 337 (2005) 217 – 239 v5 w5 v4 v1 w1 v2 (a) w4 w2 w3 v3 v5 w4 v4 w1 w2 v1 v2 (b) v3 v5 w5 v4 v1 (c) w1 v2 w3 w4 w2 w3 v3 Fig. 4. (a) A mapping which is constrained and less-constrained. (b) A mapping which is less-constrained but not constrained. (c) A mapping which is neither constrained nor less-constrained. the ordered case the key observation is a reduction to the string edit distance problem. For the unordered case the corresponding reduction is to a maximum matching problem. Using an efficient algorithm for computing a minimum cost maximum flow Zhang obtains the time complexity above. Richter presented an algorithm for the ordered constrained edit distance problem, which uses O(|T1 ||T2 |I1 I2 ) time and O(|T1 |D2 I2 ) space. Hence, for small degree, low depth trees this algorithm gives a space improvement over the algorithm of Zhang. Recently, Lu et al. [29] introduced the less-constrained edit distance, l , which relaxes the constrained mapping. The requirement here is that for all (v1 , w1 ), (v2 , w2 ), (v3 , w3 ) ∈ Ml such that none of v1 , v2 , and v3 is an ancestor of the others, • depth(nca(v1 , v2 )) depth(nca(v1 , v3 )) and also nca(v1 , v3 ) = nca(v2 , v3 ) if and only if depth(nca(w1 , w2 )) depth(nca(w1 , w3 )) and nca(w1 , w3 ) = nca(w2 , w3 ). For example, consider the mappings in Fig. 4. (a) is less-constrained because it is constrained. (b) is not a constrained mapping, however, the mapping is less-constrained since depth(nca(v1 , v2 )) > depth(nca(v1 , v3 )), nca(v1 , v3 ) = nca(v2 , v3 ), nca(w1 , w2 ) = nca(w1 , w3 ), and nca(w1 , w3 ) = nca(w2 , w3 ). (c) is not a less-constrained mapping since depth(nca(v1 , v2 )) > depth(nca(v1 , v3 )) and nca(v1 , v3 ) = nca(v2 , v3 ), while nca(w1 , w3 ) = nca(w2 , w3 ). In paper [29] an algorithm for the ordered version of the less-constrained edit distance problem using O(|T1 ||T2 |I13 I23 (I1 + I2 )) time and space is presented. For the unordered version, unlike the constrained edit distance problem, it is shown that the problem is NPcomplete. The reduction used is similar to the one for the unordered edit distance problem. It is also reported that the problem is MAX SNP-hard. Furthermore, it is shown that there P. Bille / Theoretical Computer Science 337 (2005) 217 – 239 229 is no absolute approximation algorithm 2 for the unordered less-constrained edit distance problem unless P = NP. 3.5. Other variants In this section we survey results for other variants of edit distance. Let T1 and T2 be rooted trees. The unit cost edit distance between T1 and T2 is defined as the number of edit operations needed to turn T1 into T2 . In [41] the ordered version of this problem is considered and a fast algorithm is presented. If u is the unit cost edit distance between T1 and T2 the algorithm runs in O(u2 min{|T1 |, |T2 |} min{L1 , L2 }) time. The algorithm uses techniques from Ukkonen [47] and Landau and Vishkin [28]. In [38] Selkow considered an edit distance problem where insertions and deletions are restricted to leaves of the trees. This edit distance is sometimes referred to as the 1-degree edit distance. He gave a simple algorithm using O(|T1 ||T2 |) time and space. Another edit distance measure where edit operations work on subtrees instead of nodes was given by Lu [30]. A similar edit distance was given by Tanaka in [45,44]. A short description of Lu’s algorithm can be found in [42]. 4. Tree alignment distance In this section we consider the alignment distance problem. Let T1 and T2 be rooted, labeled trees and let be a metric cost function on pairs of labels as defined in Section 2. An alignment A of T1 and T2 is obtained by first inserting nodes labeled with (called spaces) into T1 and T2 so that they become isomorphic when labels are ignored, and then overlaying the first augmented tree on the other one. The cost of a pair of opposing labels in A is given by . The cost of A is the sum of costs of all opposing labels in A. An optimal alignment of T1 and T2 , is an alignment of T1 and T2 of minimum cost. We denote this cost by (T1 , T2 ). Fig. 5 shows an example (from [18]) of an ordered alignment. The tree alignment distance problem is a special case of the tree editing problem. In fact, it corresponds to a restricted edit distance where all insertions must be performed before any deletions. Hence, (T1 , T2 ) (T1 , T2 ). For instance, assume that all edit operations have cost 1 and consider the example in Fig. 5. The optimal sequence of edit operations is achieved by deleting the node labeled e and then inserting the node labeled f. Hence, the edit distance is 2. The optimal alignment, however, is the tree depicted in (c) with a value of 4. Additionally, it also follows that the alignment distance does not satisfy the triangle inequality and hence it is not a distance metric. For instance, in Fig. 5 if T3 is T1 where the node labeled e is deleted, then (T1 , T3 ) + (T3 , T2 ) = 2 > 4 = (T1 , T2 ). It is a well-known fact that edit and alignment distance are equivalent in terms of complexity for sequences, see, e.g., Gusfield [14]. However, for trees this is not true which we will show in the following sections. In Section 4.1 and Section 4.2 we survey the results for the ordered and unordered tree alignment distance problem, respectively. 2 An approximation algorithm A is absolute if there exists a constant c > 0 such that for every instance I, |A(I ) − OPT(I )| c, where A(I ) and OPT(I ) are the approximate and optimal solutions of I, respectively [33]. 230 P. Bille / Theoretical Computer Science 337 (2005) 217 – 239 a a e b d (a, a) c c (a) (e, λ) f b (b) d (λ, f ) (b, b) (c, λ) (λ, c) (d, d ) (c) Fig. 5. (a) Tree T1 . (b) Tree T2 . (c) An alignment of T1 and T2 . 4.1. Ordered tree alignment distance In this section we consider the ordered tree alignment distance problem. Let T1 and T2 be two rooted, ordered and labeled trees. The ordered tree alignment distance problem was introduced by Jiang et al. in [18]. The algorithm presented there uses O(|T1 ||T2 |(I1 + I2 )2 ) time and O(|T1 ||T2 |(I1 +I2 )) space. Hence, for small degree trees, this algorithm is in general faster than the best known algorithm for the edit distance. We present this algorithm in detail in the next section. Recently, in [17], a new algorithm was proposed designed for similar trees. Specifically, if there is an optimal alignment of T1 and T2 using at most s spaces the algorithm computes the alignment in time O((|T1 |+|T2 |) log(|T1 |+|T2 |)(I1 +I2 )4 s 2 ). This algorithm works in a way similar to the fast algorithms for comparing similar sequences, see, e.g., Section 3.3.4 in [39]. The main idea is to speedup the algorithm of Jiang et al. by only considering subtrees of T1 and T2 whose sizes differ by at most O(s). 4.1.1. Jiang, Wang, and Zhang’s algorithm In this section we present the algorithm of Jiang et al. [18]. We only show how to compute the alignment distance. The corresponding alignment can easily be constructed within the same complexity bounds. Let be a metric cost function on the labels. For simplicity, we will refer to nodes instead of labels, that is, we will use (v, w) for nodes v and w to mean (label(v), label(w)). Here, v or w may be . We extend the definition of to include alignments of forests, that is, (F1 , F2 ) denotes the cost of an optimal alignment of forest F1 and F2 . Lemma 7. Let v ∈ V (T1 ) and w ∈ V (T2 ) with children v1 , . . . , vi and w1 , . . . , wj , respectively. Then, (, ) = 0, (T1 (v), ) = (F1 (v), ) + (v, ), (, T2 (w)) = (, F2 (w)) + (, w), (F1 (v), ) = (, F2 (w)) = i k=1 (T1 (vk ), ), j k=1 (, T2 (wk )). P. Bille / Theoretical Computer Science 337 (2005) 217 – 239 231 Lemma 8. Let v ∈ V (T1 ) and w ∈ V (T2 ) with children v1 , . . . , vi and w1 , . . . , wj , respectively. Then, (T1 (v), T2 (w)) (F1 (v), F2 (w)) + (v, w), = min (, T2 (w)) + min1 r j {(T1 (v), T2 (wr )) − (, T2 (wr )}, (T1 (v), ) + min1 r i {(T1 (vr ), T2 (w)) − (T1 (vr ), )}. Proof. Consider an optimal alignment A of T1 (v) and T2 (w). There are four cases: (1) (v, w) is a label in A, (2) (v, ) and (k, w) are labels in A for some k ∈ V (T1 ), (3) (, w) and (v, h) are labels in A for some h ∈ V (T2 ) or (4) (v, ) and (, w) are in A. Case (4) need not be considered since the two nodes can be deleted and replaced by the single node (v, w) as the new root. The cost of the resulting alignment is by the triangle inequality at least as small. Case 1: The root of A is labeled by (v, w). Hence, (T1 (v), T2 (w)) = (F1 (v), F2 (w)) + (v, w) Case 2: The root of A is labeled by (v, ). Hence, k ∈ V (T1 (ws )) for some 1 r i. It follows that, (T1 (v), T2 (w)) = (T1 (v), ) + min {(T1 (vr ), T2 (w)) − (T1 (vr ), )} 1r i Case 3: Symmetric to case 2. Lemma 9. Let v ∈ V (T1 ) and w ∈ V (T2 ) with children v1 , . . . , vi and w1 , . . . , wj , respectively. For any s, t such that 1 s i and 1 t j , (F1 (v1 , vs ), F2 (w1 , wt )) (F1 (v1 , vs−1 ), F2 (w1 , wt−1 )) + (T1 (vs ), T2 (wt )), (F1 (v1 , vs−1 ), F2 (w1 , wt )) + (T1 (vs ), ), (F1 (v1 , vs ), F2 (w1 , wt−1 )) + (, T2 (wt )), = min (, wt ) + min {(F1 (v1 , vk−1 ), F2 (w1 , wt−1 )) 1 k<s + (F (v 1 k , vs ), F2 (wk ))}, (vs , ) + min {(F1 (v1 , vs−1 ), F2 (w1 , wk−1 )) 1 k<t +(F1 (vs ), F2 (wk , wt ))}. Proof. Consider an optimal alignment A of F1 (v1 , vs ) and F2 (w1 , wt ). The root of the rightmost tree in A is labeled either (vs , wt ), (vs , ) or (, wt ). Case 1: The label is (vs , wt ). Then the rightmost tree of A must be an optimal alignment of T1 (vs ) and T2 (wt ). Hence, (F1 (v1 , vs ), F2 (w1 , wt )) = (F1 (v1 , vs−1 ), F2 (w1 , wt−1 )) + (T1 (vs ), T2 (wt )). 232 P. Bille / Theoretical Computer Science 337 (2005) 217 – 239 Case 2: The label is (vs , ). Then T1 (vs ) is a aligned with a subforest F2 (wt−k+1 , wt ), where 0 k t. The following subcases can occur: Subcase 2.1 (k = 0): T1 (vs ) is aligned with F2 (wt−k+1 , wt ) = . Hence, (F1 (v1 , vs ), F2 (w1 , wt )) = (F1 (v1 , vs−1 ), F2 (w1 , wt )) + (T1 (vs ), ). Subcase 2.2 (k = 1): T1 (vs ) is aligned with F2 (wt−k+1 , wt ) = T2 (wt ). Similar to case 1. Subcase 2.3 (k 2): The most general case. It is easy to see that: (F1 (v1 , vs ), F2 (w1 , wt )) = (vs , ) + min ((F1 (v1 , vs−1 ), F2 (w1 , wk−1 ))) 1 r<t +(F1 (vs ), F2 (wk , wt )). Case 3: The label is (, wt ). Symmetric to case 2. This recursion can be used to construct a bottom-up dynamic programming algorithm. Consider a fixed pair of nodes v and w with children v1 , . . . , vi and w1 , . . . , wj , respectively. We need to compute the values (F1 (vh , vk ), F2 (w)) for all 1 h k i, and (F1 (v), F2 (wh , wk )) for all 1 h k j . That is, we need to compute the optimal alignment of F1 (v) with each subforest of F2 (w) and, on the other hand, compute the optimal alignment of F2 (w) with each subforest of F1 (v). For any s and t, 1 s i and 1 t j , define the set: As,t = {(F1 (vs , vp ), F2 (wt , wq )) | s p i, t q j }. To compute the alignments described above we need to compute As,1 and A1,t for all 1 s i and 1 t j . Assuming that values for smaller subproblems are known it is not hard to show that As,t can be computed, using Lemma 9, in time O((i−s)·(j −t)·(i−s+j −t)) = O(ij (i + j )). Hence, the time to compute the (i + j ) subproblems, As,1 and A1,t , 1 s i and 1 t j , is bounded by O(ij (i + j )2 ). It follows that the total time needed for all nodes v and w is bounded by: O(deg(v) deg(w)(deg(v) + deg(w))2 ) v∈V (T1 ) w∈V (T2 ) v∈V (T1 ) w∈V (T2 ) O (I1 + I2 ) 2 O(deg(v) deg(w)(deg(T1 ) + deg(T2 ))2 ) deg(v) deg(w) v∈V (T1 ) w∈V (T2 ) O(|T1 ||T2 |(I1 + I2 )2 ). In summary, we have shown the following theorem. Theorem 3 (Jiang et al. [18]). For ordered trees T1 and T2 , the tree alignment distance problem can be solved in O(|T1 ||T2 |(I1 + I2 )2 ) time and O(|T1 ||T2 |(I1 + I2 )) space. P. Bille / Theoretical Computer Science 337 (2005) 217 – 239 233 4.2. Unordered tree alignment distance The algorithm presented above can be modified to handle the unordered version of the problem in a straightforward way [18]. If the trees have bounded degrees the algorithm still runs in O(|T1 |T2 |) time. This should be seen in contrast to the edit distance problem which is MAX SNP-hard even if the trees have bounded degree. If one tree has arbitrary degree unordered alignment becomes NP-hard [18]. The reduction is, as for the edit distance problem, from the Exact Cover by 3-Sets problem [12]. 5. Tree inclusion In this section we survey the tree inclusion problem. Let T1 and T2 be rooted, labeled trees. We say that T1 is included in T2 if there is a sequence of delete operations performed on T2 which makes T2 isomorphic to T1 . The tree inclusion problem is to decide if T1 is included in T2 . Fig. 6(a) shows an example of an ordered inclusion. The tree inclusion problem is a special case of the tree edit distance problem: If insertions all have cost 0 and all other operations have cost 1, then T1 can be included in T2 if and only if (T1 , T2 ) = 0. According to [7] the tree inclusion problem was initially introduced by Knuth [26, exercise 2.3.2-22]. The rest of the section is organized as follows. In Sections 5.1 we give some preliminaries and in Sections 5.2 and 5.3 we survey the known results on ordered and unordered tree inclusion, respectively. 5.1. Orderings and embeddings Let T be a labeled, ordered, and rooted tree. We define an ordering of the nodes of T given by v ≺ v iff post(v) < post(v ). Also, v v iff v ≺ v or v = v . Furthermore, we extend this ordering with two special nodes ⊥ and such that for all nodes v ∈ V (T ), ⊥ ≺ v ≺ . The left relatives, lr(v), of a node v ∈ V (T ) is the set of nodes that are to the left of v and similarly the right relatives, rr(v), are the set of nodes that are to the right of v. Let T1 and T2 be rooted labeled trees. We define an ordered embedding (f, T1 , T2 ) as an injective function f : V (T1 ) → V (T2 ) such that for all nodes v, u ∈ V (T1 ), • label(v) = label(f (v)) (label preservation condition). • v is an ancestor of u iff f (v) is an ancestor of f (u) (ancestor condition). • is to the left of u iff f (v) is to the left of f (u) (sibling condition). Hence, embeddings are special cases of mappings (see Section 3.1). An unordered embedding is defined as above, but without the sibling condition. An embedding (f, T1 , T2 ) is root-preserving if f (root(T1 )) = root(T2 ). Fig. 6(b) shows an example of a root-preserving embedding. 5.2. Ordered tree inclusion Let T1 and T2 be rooted, ordered and labeled trees. The ordered tree inclusion problem has been the attention of much research. Kilpeläinen and Mannila [22] (see also [21]) presented 234 P. Bille / Theoretical Computer Science 337 (2005) 217 – 239 f f e d c e b a (a) b f f e d c b (b) e a b Fig. 6. (a) The tree on the left is included in the tree on the right by deleting the nodes labeled d, a and c. (b) The embedding corresponding to (a). the first polynomial time algorithm using O(|T1 ||T2 |) time and space. Most of the later improvements are refinements of this algorithm. We present this algorithm in detail in the next section. In [21] a more space efficient version of the above was given using O(|T1 |D2 ) space. In [36] Richter gave an algorithm using O(|T1 ||T2 | + mT1 ,T2 D2 ) time, where T1 is the alphabet of the labels of T1 and mT1 ,T2 is the set matches, defined as the number of pairs (v, w) ∈ T1 × T2 such that label(v) = label(w). Hence, if the number of matches is small the time complexity of this algorithm improves the (|T1 ||T2 |) algorithm. The space complexity of the algorithm is O(|T1 ||T2 | + mT1 ,T2 ). In [7] a more complex algorithm was presented using O(L1 |T2 |) time and O(L1 min{D2 , L2 }) space. In [3] an efficient average case algorithm was given. 5.2.1. Kilpeläinen and Mannila’s algorithm In this section we present the algorithm of Kilpeläinen and Mannila [22] for the ordered tree inclusion problem. Let T1 and T2 be ordered labeled trees. Define R(T1 , T2 ) as the set of root-preserving embeddings of T1 into T2 . We define (v, w), where v ∈ V (T1 ) and w ∈ V (T2 ): (v, w) = min({w ∈ rr(w) | ∃f ∈ R(T1 (v), T2 (w ))} ∪ {}). ≺ Hence, (v, w) is the closest right relative of w which has a root-preserving embedding of T1 (v). Furthermore, if no such embedding exists (v, w) is . It is easy to see that, by definition, T1 can be included in T2 if and only if (v, ⊥) = . The following lemma shows how to search for root preserving embeddings. Lemma 10. Let v be a node in T1 with children v1 , . . . , vi . For a node w in T2 , define a sequence p1 , . . . , pi by setting p1 = (v1 , max≺ lr(w)) and pk = (vk , pk−1 ), for 2 k i. There is a root-preserving embedding f of T1 (v) in T2 (v) if and only if label(v) = label(w) and pi ∈ T2 (w), for all 1 k i. P. Bille / Theoretical Computer Science 337 (2005) 217 – 239 235 Proof. If there is a root-preserving embedding between T1 (v) and T2 (w) it is straightforward to check that there is a sequence pi , 1 i k such that the conditions are satisfied. Conversely, assume that pk ∈ T2 (w) for all 1 k i and label(v) = label(w). We construct a root-preserving embedding f of T1 (v) into T2 (w) as follows. Let f (v) = w. By definition of there must be a root-preserving embedding f k , 1 k i, of T1 (vk ) in T2 (pk ). For a node u in T1 (vk ), 1 k i, we set f (u) = f k (u). Since pk ∈ rr(pk−1 ), 2 k i, and pk ∈ T2 (w) for all k, 1 k i, it follows that f is indeed a root-preserving embedding. Using dynamic programming it is now straightforward to compute (v, w) for all v ∈ V (T1 ) and w ∈ V (T2 ). For a fixed node v we traverse T2 in reverse postorder. At each node w ∈ V (T2 ) we check if there is a root-preserving embedding of T1 (v) in T2 (w). If so we set (v, q) = w, for all q ∈ lr(w) such that x q, where x is the next root-preserving embedding of T1 (v) in T2 (w). For a pair of nodes v ∈ V (T1 ) and w ∈ V (T2 ) we test for a root-preserving embedding using Lemma 10. Assuming that values for smaller subproblems has been computed, the time used is O(deg(v)). Hence, the contribution to the total time for the node w is v∈V (T1 ) O(deg(v)) = O(|T1 |). It follows that the time complexity of the algorithm is bounded by O(|T1 ||T2 |). Clearly, only O(|T1 ||T2 |) space is needed to store . Hence, we have the following theorem, Theorem 4 (Kilpeläinen and Mannila [22]). For any pair of rooted, labeled, and ordered trees T1 and T2 , the tree inclusion problem can be solved in O(|T1 ||T2 |) time and space. 5.3. Unordered tree inclusion In [22] it is shown that the unordered tree inclusion problem is NP-complete. The reduction used is from the Satisfiability problem [12]. Independently, Matoušek and Thomas [32] gave another proof of NP-completeness. An algorithm for the unordered tree inclusion problem is presented in [22] using O(|T1 |I1 22I1 |T2 |) time. Hence, if I1 is constant the algorithm runs in O(|T1 ||T2 |) time and if I1 = log |T2 | the algorithm runs in O(|T1 | log |T2 ||T2 |3 ). 6. Conclusion We have surveyed the tree edit distance, alignment distance, and inclusion problems. Furthermore, we have presented, in our opinion, the central algorithms for each of the problems. There are several open problems, which may be the topic of further research. We conclude this paper with a short list proposing some directions. • For the unordered versions of the above problems some are NP-complete while others are not. Characterizing exactly which types of mappings that gives NP-complete problems for unordered versions would certainly improve the understanding of all of the above problems. • The currently best worst-case upper bound on the ordered tree edit distance problem is the algorithm of [25] using O(|T1 |2 |T2 | log |T2 |). Conversely, the quadratic lower bound for 236 Table 1 Results for the tree edit distance, alignment distance, and inclusion problem listed according to variant Type Time Tree edit distance General General General General General Constrained Constrained Constrained Less-constrained Less-constrained Unit-cost 1-degree O O O O U O O U O U O O O(|T1 ||T2 |D12 D22 ) O(|T1 ||T2 | min(L1 , D1 ) min(L2 , D2 )) O(|T1 |2 |T2 | log |T2 |) O(|T1 ||T2 | + L21 |T2 | + L2.5 1 L2 ) Tree alignment distance General General Similar O U O O(|T1 ||T2 |(I1 + I2 )2 ) O(|T1 ||T2 |(I1 + I2 )) MAX SNP-hard O((|T1 | + |T2 |) log(|T1 | + |T2 |)(I1 + I2 )4 s 2 ) O((|T1 | + |T2 |) log(|T1 | + |T2 |)(I1 + I2 )4 s 2 ) [18] [18] [17] Tree inclusion General General General General O O O U O(|T1 ||T2 |) O(|T1 ||T2 | + mT1 ,T2 D2 ) O(L1 |T2 |) [21] [36] [7] [22,32] O(|T1 ||T2 |) O(|T1 ||T2 |I1 I2 ) O(|T1 ||T2 |(I1 + I2 ) log(I1 + I2 )) O(|T1 ||T2 |I13 I23 (I1 + I2 )) O(u2 min(|T1 |, |T2 |) min(L1 , L2 )) O(|T1 ||T2 |) Space O(|T1 ||T2 |D12 D22 ) O(|T1 ||T2 |) O(|T1 ||T2 |) O((|T1 | + L21 ) min(L2 , D2 ) + |T2 |) MAX SNP-hard O(|T1 ||T2 |) O(|T1 ||D2 I2 ) O(|T1 ||T2 |) O(|T1 ||T2 |I13 I23 (I1 + I2 )) MAX SNP-hard O(|T1 ||T2 |) O(|T1 ||T2 |) O(|T1 | min(D2 L2 )) O(|T1 ||T2 | + mT1 ,T2 ) O(L1 min(D2 L2 )) NP-hard Reference [43] [55] [25] [8] [54] [51] [37] [52] [29] [29] [41] [38] Di , Li , and Ii denotes the depth, the number of leaves, and the maximum degree, respectively, of Ti , i = 1, 2. The type is either O for ordered or U for unordered. The value u is the unit cost edit distance between T1 and T2 and the value s is the number of spaces in the optimal alignment of T1 and T2 . The value T1 is set of labels used in T1 and mT1 ,T2 is the number of pairs of nodes in T1 and T2 which have the same label. P. Bille / Theoretical Computer Science 337 (2005) 217 – 239 Variant P. Bille / Theoretical Computer Science 337 (2005) 217 – 239 237 the longest common subsequence problem [1] problem is the best general lower bound for the ordered tree edit distance problem. Hence, a large gap in complexity exists which needs to be closed. • Several meaningful edit operations other than the above may be considered depending on the particular application. Each set of operations yield a new edit distance problem for which we can determine the complexity. Some extensions of the tree edit distance problem have been considered [6,5,24]. Acknowledgements Thanks to Inge Li Gørtz and Anna Östlin for proof reading and helpful discussions. References [1] A.V. Aho, J.D. Ullman, D.S. Aho, J.D. Hirschberg, Bounds on the complexity of the longest common subsequence problem, J. ACM 1 (23) (1976) 1–12. [2] T. Akutsu, M.M. Halldórsson. On the approximation of largest common point sets and largest common subtrees, in: Proc. 5th Ann. Internat. Symp. on Algorithms and Computation, Lecture Notes in Computer Science (LNCS), Vol. 834. Springer, Berlin, 1994, pp. 405–413. [3] L. Alonso, R. Schott, 1993. On the tree inclusion problem, in: Proc. Mathematical Foundations of Computer Science, 1993, pp. 211–221. [4] A. Arora, C. Lund, R. Motwani, M. Sudan, M. Szegedy, Proof verification and hardness of approximation problems, in: Proc. 33rd IEEE Symp. on the Foundations of Computer Science (FOCS), 1992, pp. 14–23. [5] S. Chawathe, H. Garcia-Molina, Meaningful change detection in structured data, in: Proc. ACM SIGMOD International Conference on Management of Data, Tuscon, Arizona, 1997, pp. 26–37. [6] S.S. Chawathe, A. Rajaraman, H. Garcia-Molina, J. Widom, Change detection in hierarchically structured information, in: Proc. ACM SIGMOD Internat. Conf. on Management of Data, Montréal, Québec, June 1996, pp. 493–504. [7] W. Chen, More efficient algorithm for ordered tree inclusion, J. Algorithms 26 (1998) 370–385. [8] W. Chen, New algorithm for ordered tree-to-tree correction problem, J. Algorithms 40 (2001) 135–158. [9] T.H. Cormen, C.E. Leiserson, R.L. Rivest, C. Stein, Introduction to Algorithms, second ed., MIT Press, Cambridge, MA, 2001. [10] M. Dubiner, Z. Galil, E. Magen, Faster tree pattern matching, in: Proc. 31st IEEE Symp. on the Foundations of Computer Science (FOCS), 1990, pp. 145–150. [11] M. Farach, M. Thorup, Fast comparison of evolutionary trees, in: Proc. 5th Ann. ACM-SIAM Symp. on Discrete Algorithms, 1994, pp. 481–488. [12] M.J. Garey, D.S. Johnson, Computers and Intractability: A Guide to the Theory of NP-completeness, Freeman, New York, 1979. [13] A. Gupta, N. Nishimura, Finding largest subtrees and smallest supertrees, Algorithmica 21 (1998) 183–210. [14] D. Gusfield, Algorithms on Strings, Trees, and Sequences, Cambridge University Press, Cambridge, 1997. [15] D. Harel, R.E. Tarjan, Fast algorithms for finding nearest common ancestors, SIAM J. Comput. 13 (2) (1984) 338–355. [16] C.M. Hoffmann, M.J. O’Donnell, Pattern matching in trees, J. ACM 29 (1) (1982) 68–95. [17] J. Jansson, A. Lingas, A fast algorithm for optimal alignment between similar ordered trees, in: Proc. 12th Ann. Symp. Combinatorial Pattern Matching (CPM), Lecture Notes of Computer Science (LNCS). Vol. 2089, Springer, Berlin, 2001. [18] T. Jiang, L. Wang, K. Zhang, Alignment of trees—an alternative to tree edit, Theoret. Comput. Sci. 143 (1995). [19] D. Keselman, A. Amir, Maximum agreement subtree in a set of evolutionary trees—metrics and efficient algorithms, in: Proc. 35th Ann. Symp. on Foundations of Computer Science (FOCS), 1994, pp. 758–769. 238 P. Bille / Theoretical Computer Science 337 (2005) 217 – 239 [20] S. Khanna, R. Motwani, F.F. Yao, Approximation algorithms for the largest common subtree problem, Technical Report, Stanford University, 1995. [21] P. Kilpeläinen, Tree matching problems with applications to structured text databases, Ph.D. Thesis, Department of Computer Science, University of Helsinki, November 1992. [22] P. Kilpeläinen, H. Mannila, Ordered and unordered tree inclusion, SIAM J. Comput. 24 (1995) 340–356. [23] P. Klein, 2002. Personal communication. [24] P. Klein, S. Tirthapura, D. Sharvit, B. Kimia, A tree-edit-distance algorithm for comparing simple, closed shapes, in: Proc. 11th Ann. ACM-SIAM Symp. on Discrete Algorithms (SODA), 2000, pp. 696–704. [25] P.N. Klein, Computing the edit-distance between unrooted ordered trees, in: Proc. 6th Ann. European Symp. on Algorithms (ESA), Springer, Berlin, 1998, pp. 91–102. [26] D.E. Knuth, The Art of Computer Programming, Vol. 1, Addison-Wesley, Reading, MA, 1969. [27] S.R. Kosaraju, Efficient tree pattern matching, in: Proc. 30th IEEE Symp. on the Foundations of Computer Science (FOCS), 1989, pp. 178–183. [28] G.M. Landau, U. Vishkin, Fast parallel and serial approximate string matching, J. Algorithms 10 (1989) 157–169. [29] S.Y. Lu, A tree-to-tree distance and its application to cluster analysis, IEEE Trans. Pattern Anal. Mach. Intell. 1 (1979) 219–224. [30] S.Y. Lu, A tree-matching algorithm based on node splitting and merging, IEEE Trans. Pattern Anal. Mach. Intell. 6 (2) (1984) 249–256. [31] C.L. Lu, Z.-Y. Su, C.Y., Tang, A new measure of edit distance between labeled trees, in: Proc. 7th Ann. Internat. Conf. on Computing and Combinatorics (COCOON), Lecture Notes in Computer Science (LNCS). Vol. 2108, Springer, Berlin, 2001. [32] J. Matoušek, R. Thomas, On the complexity of finding iso- and other morphisms for partial k-trees, Discrete Math. 108 (1992) 343–364. [33] R. Motwani, Lecture Notes on Approximation Algorithms, Vol. 1. Technical Report STAN-CS-92-1435, Department of Computer Science, Stanford University, 1992. [34] N. Nishimura, P. Ragde, D.M. Thilikos, Finding smallest supertrees under minor containment, Internat. J. Found. Comput. Sci. 11 (3) (2000) 445–465. [35] R. Ramesh, I.V. Ramakrishnan, Nonlinear pattern matching in trees, J. ACM 39 (2) (1992) 295–316. [36] T. Richter, A new algorithm for the ordered tree inclusion problem, in: Proc. 8th Ann. Symp. on Combinatorial Pattern Matching (CPM), Lecture Notes of Computer Science (LNCS), Vol. 1264, Springer, Berlin, 1997, pp. 150–166. [37] T. Richter, A new measure of the distance between ordered trees and its applications, Technical Report 85166-cs. Department of Computer Science, University of Bonn, 1997. [38] S.M. Selkow, The tree-to-tree editing problem, Inform. Process. Lett. 6 (6) (1977) 184–186. [39] J. Setubal, J. Meidanis, Introduction to Computational Biology, PWS Publishing Company, MA, 1997. [40] D. Shasha, J.T.-L. Wang, H. Shan, K. Zhang, Atreegrep: approximate searching in unordered trees, in: Proc. 14th Internat. Conf. on Scientific and Statistical Database Management, 2002, pp. 89–98. [41] D. Shasha, K. Zhang, Fast algorithms for the unit cost editing distance between trees, J. Algorithms 11 (1990) 581–621. [42] D. Shasha, K. Zhang, Approximate tree pattern matching, in: Pattern Matching in String, Trees and Arrays, Oxford University, Oxford, 1997, pp. 341–371. [43] K.-C. Tai, The tree-to-tree correction problem, J. ACM 26 (1979) 422–433. [44] E. Tanaka, A note on a tree-to-tree editing problem, Internat. J. Pattern Recogn. Artif. Intell. 9 (1) (1995) 167–172. [45] E. Tanaka, K. Tanaka, The tree-to-tree editing problem, Internat. J. Pattern Recogn. Artif. Intell. 2 (2) (1988) 221–240. [46] S. Tirthapura, D. Sharvit, P. Klein, B.B. Kimia, Indexing based on edit-distance matching of shape graphs, in: Proc. SPIE Internat. Symp. Voice, Video and Data Communications, 1998, pp. 91–102. [47] E. Ukkonen, Finding approximate patterns in strings, J. Algorithms 6 (1985) 132–137. [48] R.A. Wagner, M.J. Fischer, The string-to-string correction problem, J. ACM 21 (1974) 168–173. [49] J.T.-L. Wang, K. Zhang, K. Jeong, D. Shasha, A system for approximate tree matching, IEEE Trans. Knowl. Data Eng. 6 (4) (1994) 559–571. [50] K. Zhang, The Editing Distance Between Trees: Algorithms and Applications, Ph.D. Thesis, Department of Computer Science, Courant Institute, 1989. P. Bille / Theoretical Computer Science 337 (2005) 217 – 239 239 [51] K. Zhang, Algorithms for the constrained editing problem between ordered labeled trees and related problems, Pattern Recognition 28 (1995) 463–474. [52] K. Zhang, A constrained edit distance between unordered labeled trees, Algorithmica 15 (3) (1996) 205– 222. [53] K. Zhang, Efficient parallel algorithms for tree editing problems, in: Proc. 7th Ann. Symp. Combinatorial Pattern Matching (CPM), Lecture Notes in Computer Science, Vol. 1075, 1996, pp. 361–372. [54] K. Zhang, T. Jiang, Some MAX SNP-hard results concerning unordered labeled trees, Inform. Process. Lett. 49 (1994) 249–254. [55] K. Zhang, D. Shasha, Simple fast algorithms for the editing distance between trees and related problems, SIAM J. Comput. 18 (1989) 1245–1262. [56] K. Zhang, D. Shasha, J.T.L. Wang, Approximate tree matching in the presence of variable length don’t cares, J. Algorithms 16 (1) (1994) 33–66. [57] K. Zhang, R. Statman, D. Shasha, On the editing distance between unordered labeled trees, Technical Report 289, Department of Computer Science, The University of Western Ontario, 1991. [58] K. Zhang, R. Statman, D. Shasha, On the editing distance between unordered labeled trees, Inform. Process. Lett. 42 (1992) 133–139.
© Copyright 2026 Paperzz