Treehandling in a relational db

Representing Trees in a relational DB
Author: Lennart Jonsson
Algorithms for handling tree stucture
in apollo
2001-04-07
TOC
Abstract ............................................................................................................... 3
Definitions ........................................................................................................... 3
Operations ........................................................................................................... 4
Add node ........................................................................................................ 5
Delete node ..................................................................................................... 5
Move node ...................................................................................................... 6
Extensions ........................................................................................................... 8
Add Link ......................................................................................................... 9
Remove Link ................................................................................................ 10
)
Summary ................................................................................................ 12
Summary ........................................................................................................... 13
Add node ...................................................................................................... 13
Delete node ................................................................................................... 13
Move node .................................................................................................... 13
Add Link ....................................................................................................... 14
Remove Link ................................................................................................ 14
Implementation Considerations......................................................................... 16
Author: Lennart Jonsson
Algorithms for handling tree stucture
in apollo
2001-04-07
Abstract
This document describes an approach for handling hierarchical data in a relational
database.
Definitions
A hierarchical data structure (tree) is a set of nodes. Each node is a 2-tupel consisting
of an id and a parent id. We therefore define two functions that operates on nodes as:
id ( x, y )  x
parent ( x, y )  ( y, z )
The id of a node is unique for a tree. More formally we can express that as:
( x, y )(u, v) : x  u, y  v
We also define the root of a tree to be a node having itself as parent:
root  id ( x, y)  id ( parent ( x, y))
def
Each tree is defined to have at least one root node. Given this we can easily extend
our set of definitions to include path and sub tree. We can think of the path as if we
pick a node from a tree and recursively pick the parent until we reach a root node. The
sub tree of a node x is any node having x in its path.
 , if id ( x, y )  id ( parent ( x, y ))
path( x, y )  
id ( parent ( x, y ))  path( parent ( x, y ))
subtree( x, y)  ( z, u) x  path( z, u)
Since the id of a node is unique we will sometimes use the abbreviations:
path(x) and subtree(x)
Author: Lennart Jonsson
Algorithms for handling tree stucture
in apollo
2001-04-07
Operations
If we assume that one of the most common operation in a database that stores
hierarchical data, is to query for related data – for example, sub tree or path - it is not
a very convenient thing to recursively ask for data until some condition is fulfilled. In
the rest of this document we will focus on an approach that uses a table to remember
the path of a node, and thereby also keeps information regarding sub tree. We start by
defining the table that contains the nodes.
Figure 1: Table containing data.
In any sensible situation this table should contain some data as well, but that is of no
concern here. We can then add the table that “remembers” the path.
Figure 2:Remember table and relation to data table.
In figure 3 a tiny tree and the resulting tables are shown.
Data
1
2
3
4
5
6
Path
1
1
1
2
2
5
2
3
4
4
5
5
6
6
6
1
1
2
1
2
1
5
2
1
Figure 3: A small tree, resulting Data and Path tables
Author: Lennart Jonsson
Algorithms for handling tree stucture
in apollo
2001-04-07
To ask for the path of a given node is straightforward thing. Lets say we would like to
retrieve the path of the node with id 5:
“Select AncestorId from Path where Id = 5”
The result of this query is 1 and 2. Asking for the sub tree of a given node is almost as
easy. Assume we would like to know the sub tree of the node with id 2:
“Select Id from Path where AncestorId = 2”
The result of this question is as expected 4, 5 and 6. For more complicated questions
we probably need to join more tables in the query, but the general idea remains the
same.
Thus, given a remember table we can quite easily retrieve hierarchically information.
The trouble starts when we change the content of the data table. When this happens
we must ensure that the Path table is also changed to reflect the new content of the
Data table.
Add node
If we add a new node (x,y) to Data, we must also add the following nodes to Path:
{( x, y )}  ( x  path( parent ( x, y ))
 denotes the Cartesian product, and path and parent functions are defined in the
previous paragraph. In SQL we can define this operation as:
“Insert into Path (values (x,y) union Select x, AncestorId from Path where Id = y”
Delete node
When deleting nodes we assume that nodes that act as parents for other nodes cant be
deleted. Thus it is only leafs that can be deleted from the Data table. If we delete the
leaf (x,y) from the Data table, we need to delete the set:
x  path( x, y )
from the Path table. This is easily expressed in SQL as:
“Delete from Path where Id = x”
So inserting and deleting (with the limitations given) is fairly simple. Now we will
have a look on moving a set of nodes.
Author: Lennart Jonsson
Algorithms for handling tree stucture
in apollo
2001-04-07
Move node
A naïve approach would be to just delete all nodes that are about to move and then
insert them in their new position one by one. This is however most inefficient since
there is no need to change the structure within the sub tree we are moving. If we for
example move node 5 so that it becomes a child of node 3 instead of 2, there is no
need to change the relation between 5 and 6.
Data
1
2
3
4
5
6
Path
1
1
1
2
3
5
2
3
4
4
5
5
6
6
6
1
1
2
1
3
1
5
3
1
Figure 4: Node 5 moved from 2 to 3 and resulting tables.
If we look in the Path table above we see that there are only two rows that are
changed compared to the Path table in figure 3.
A more efficient approach is to delete only those rows where Id belongs to the sub
tree of x, and AncestorId belongs to the path of x. Formally we can express that as:
( x  path( x))  ( subtree( x)  path( x))
or simplified as:
( x  subtree( x))  path( x)
The same goes for insertion. If we assume that we are moving x so that it becomes a
child of y, it is only necessary to insert x and the sub tree of x, together with y and the
path of y. The expression goes:
( x  subtree( x))  ( y  path( y ))
The delete statement is rather straightforward to transform to SQL:
delete from Path where
((Id = x) or Id in (select Id from Path where AncestorId = x)) and
AncestorId in (select AncestorId from Path where Id = x)
Author: Lennart Jonsson
Algorithms for handling tree stucture
in apollo
2001-04-07
The insert part is a bit trickier, and we will do the transformation in two steps. First
we will construct some pseudo SQL using the functions path, and subtree:
insert into Path (
(values x, y)
union
(select x, path(y) from Path)
union
(select subtree(x), y from Path)
union
(select subtree(x), path(y) from Path)
)
which when expanded becomes:
insert into Path (
(values x, y)
union
(select x, AncestorId from Path where Id = y)
union
(select Id, y from Path where AncestorId = x)
union
(select a.Id, b.AncestorId from Path a, Path b where
a.AncestorId = x and b.Id = y)
)
One can argue that it is not necessary to change the intersection between the old path
and the new path.
x  ( path( x)  path( y ))
But it is left as an exercise for the reader to include this in the SQL statements :-)
Author: Lennart Jonsson
Algorithms for handling tree stucture
in apollo
2001-04-07
Extensions
In the previous sections we have assumed that there is a single path – that might be
empty - from any node. If we remove this assumption we are dealing with a Directed
Acyclic Graph (DAG) instead of a tree. How does this affect our implementation?
Lets start by stating that we can’t insert a new element in the Data table containing the
same Id and another ParentId as another row. Doing so would violate the primary key
constraint of the Data table. Hence we will need some kind of “Link” table between
nodes. Furthermore lets limit the functionality of links so that the FromId and ToId
may not belong to the same tree. That is, if we link a node from x to y, the intersection
of path(x) and path(y) must be the empty set. By disallowing links within a tree we
avoids the problem with cycles.
Figure 5 By allowing links we end up with a Graph.
Since a node can be linked to several other nodes, we will need an n-n relation to
represent this structure.
Figure 6 Link table and relations to other tables involved.
From now on we will denote a link from x to y with:
[ x, y ]
x is the node we are linking and y is the position we link it to. While we are at it we
might define functions to handle links.
deref [ x, y ]  x
ref [ x, y ]  y
ref [ x]  [ y, z ] x  deref [ y, z ]
Since the id of a node is unique we will sometimes use the abbreviation:
Author: Lennart Jonsson
Algorithms for handling tree stucture
in apollo
2001-04-07
deref [ x]  x
Lets start by looking at what will happen when we apply our new set of operations.
Add Link
We saw earlier that inserting a node in the Data table should be reflected in the Path
table as:
{( x, y )}  ( x  path( parent ( x, y ))
Without defining the specific operations we need, it is clear that adding a link between
x and y means that we should add not only the link, but also the path from the target
to the Path table. Thus adding a link means that we should insert
( x  subtree( x))  (ref [ x, y ]  path(ref [ x, y ])
into the Path table. Note that we contrary to the situation in “add node”, also must
take into consideration the sub tree of x. Furthermore we cant just insert (x,y), instead
we have to insert x and its sub tree together with y. Thus, we end up with the insert
part of the move node operation. This makes perfect sense since we are sort of
copying the node. We must also remember to exclude elements already in the Path
table. It is probably easiest to exclude the Path table itself from the insert expression.
Transformation to SQL is relatively simple and goes like:
insert into Path (
select * from (
(values x, y)
union
(select x, AncestorId from Path where Id = y)
union
(select Id, y from Path where AncestorId = x)
union
(select a.Id, b.AncestorId from Path a, Path b where
a.AncestorId = x and b.Id = y)
) as T (ActorId, AncestorId)
where not exists (
select * from Path P where
P.ActorId = T.ActorId and P.AncestorId = T.AncestorId
)
One might note how an extra select clause is added in order to be able to refer to the
union in the “exists” clause.
Author: Lennart Jonsson
Algorithms for handling tree stucture
in apollo
2001-04-07
Remove Link
This is the time when we need to put our thinking hats back on. In fact, this operation
is the one to blame for this document. Lets start with a simple case like the one in
Figure 7
Figure 7
If we look at the Path table for node 6 we have the elements:
{(6,5),(6,2), (6,1), (6,14), (6,12), (6,11)}
It is clear that we should remove the elements (6,14), (6,12), (6,11) when we remove
the link [6,4]. Naively we can assume that the expression
[ x, y ]  (( deref [ x, y ]  subtree(deref [ x, y ]))  path(ref [ x, y ])
satisfies our needs. However, a quick glance in Figure 8 reveals that this is not true. If
we would delete this expression from the Path table, we would also delete the nodes
(6,12) and (6,11). Since node 6 should still be linked to 16, and 12 and 11 lies in the
path of 16 we can’t remove those nodes. Do note that we have included the sub tree of
x in our set.
Figure 8 Links from node 6 to 14 and 16
Author: Lennart Jonsson
Algorithms for handling tree stucture
in apollo
2001-04-07
The solution is to exclude other link paths. In order to do this we need to introduce a
new function complement. We define the complement of a link to be all links but the
link itself.
compl ([ x, y ])  {[ u, v] | (u  x  v  y )  [u, v]  Links}
With the help of this function we can define the elements that should be deleted as the
closure of the link except the closure of all other links. For simplicity we will also
define the closure of a link [x,y] as:
closure ([ x, y ])  (deref [ x, y ]  subtree(deref [ x, y ]))  (ref [ x, y ]  path(ref [ x, y ]))
and the closure of a set of links as:
closure ( X ) 
 closure ( x )
i
xi X
Given this we are now ready to deal with the removal of a link. The elements that
shall be removed is:
closure ([ x, y ])  closure (compl ([ x, y ])
This is perhaps not trivial to transform to SQL (it wasn’t for me anyhow). We already
know how express the closure of [x,y], thus we need to think of second part of the
expression. It is clear that we must use the same kind of union construction to express
the closure, but in extent we must also join in the remaining links in each sub
expression.
select g.ActorId, g.GroupId from DataLink g where
(g.ActorId <> x or g.GroupId <> y)
union
select g.ActorId, b.AncestorId from Path b, ActorLink g where
(g.GroupId <> y) and
(b.ActorId = g.GroupId)
union
select a.ActorId, g.GroupId from Path a, ActorLink g where
(g.ActorId <> x) and
(a.AncestorId = g.ActorId)
union
select a.ActorId, b.AncestorId from Path a, Path b, DataLink g where
(g.ActorId <> x or g.GroupId <> y) and
(a.AncestorId = g.ActorId) and
(b.ActorId = g.GroupId)
Author: Lennart Jonsson
Algorithms for handling tree stucture
in apollo
2001-04-07
Therefore the final expression to remove a link becomes:
delete from Path where (ActorId, AncestorId) in
(
select * from (
(values x, y)
union
(select x, AncestorId from Path where Id = y)
union
(select Id, y from Path where AncestorId = x)
union
(select a.Id, b.AncestorId from Path a, Path b where
a.AncestorId = x and b.Id = y)
) except select * from (
select g.ActorId, g.GroupId from DataLink g where
(g.ActorId <> x or g.GroupId <> y)
union
select g.ActorId, b.AncestorId from Path b, ActorLink g where
(g.GroupId <> y) and
(b.ActorId = g.GroupId)
union
select a.ActorId, g.GroupId from Path a, ActorLink g where
(g.ActorId <> x) and
(a.AncestorId = g.ActorId)
union
select a.ActorId, b.AncestorId from Path a, Path b, DataLink g where
(g.ActorId <> x or g.GroupId <> y) and
(a.AncestorId = g.ActorId) and
(b.ActorId = g.GroupId)
)
)
Author: Lennart Jonsson
Algorithms for handling tree stucture
in apollo
2001-04-07
Summary
This section is a summary of how different operations on a tree should be reflected in
the Path table.
Add node
Insert: {( x, y )}  ( x  path( parent ( x, y ))
“Insert into Path (values (x,y) union Select x, AncestorId from Path where Id = y”
Delete node
Delete: x  path( x, y )
“Delete from Path where Id = x”
Move node
Delete: ( x  subtree( x))  path( x)
Insert: ( x  subtree( x))  ( y  path( y ))
delete from Path where
((Id = x) or Id in (select Id from Path where AncestorId = x)) and
AncestorId in (select AncestorId from Path where Id = x)
insert into Path (
(values x, y)
union
(select x, AncestorId from Path where Id = y)
union
(select Id, y from Path where AncestorId = x)
union
(select a.Id, b.AncestorId from Path a, Path b where
a.AncestorId = x and b.Id = y)
)
Optionally one might exclude the set:
x  ( path( x)  path( y ))
from both operations.
Author: Lennart Jonsson
Algorithms for handling tree stucture
in apollo
2001-04-07
Add Link
Insert: ( x  subtree( x))  (ref [ x, y ]  path(ref [ x, y ])  {( u, v) | (u, v)  D}
insert into Path (
select * from (
(values x, y)
union
(select x, AncestorId from Path where Id = y)
union
(select Id, y from Path where AncestorId = x)
union
(select a.Id, b.AncestorId from Path a, Path b where
a.AncestorId = x and b.Id = y)
) as T (ActorId, AncestorId)
where not exists (
select * from Path P where
P.ActorId = T.ActorId and P.AncestorId = T.AncestorId
)
Remove Link
Delete:
compl ([ x, y ])  {[ u, v] | (u  x  v  y )  [u, v]  Links}
closure ([ x, y ])  (deref [ x, y ]  subtree(deref [ x, y ]))  (ref [ x, y ]  path(ref [ x, y ]))
closure ( X ) 
 closure ( x )
i
xi X
closure ([ x, y ])  closure (compl ([ x, y ])
Author: Lennart Jonsson
Algorithms for handling tree stucture
in apollo
2001-04-07
delete from Path where (ActorId, AncestorId) in
(
select * from (
(values x, y)
union
(select x, AncestorId from Path where Id = y)
union
(select Id, y from Path where AncestorId = x)
union
(select a.Id, b.AncestorId from Path a, Path b where
a.AncestorId = x and b.Id = y)
) except select * from (
select g.ActorId, g.GroupId from DataLink g where
(g.ActorId <> x or g.GroupId <> y)
union
select g.ActorId, b.AncestorId from Path b, ActorLink g where
(g.GroupId <> y) and
(b.ActorId = g.GroupId)
union
select a.ActorId, g.GroupId from Path a, ActorLink g where
(g.ActorId <> x) and
(a.AncestorId = g.ActorId)
union
select a.ActorId, b.AncestorId from Path a, Path b, DataLink g where
(g.ActorId <> x or g.GroupId <> y) and
(a.AncestorId = g.ActorId) and
(b.ActorId = g.GroupId)
)
)
Author: Lennart Jonsson
Algorithms for handling tree stucture
in apollo
2001-04-07
Implementation Considerations
The operations discussed here might be implemented in a number of ways. The most
obvious is to implement the functionality within the API methods that handles the
tree. The main advantage with this approach is portability1. The drawback is that it is
not obvious what to do at all times. One example already mentioned is when we
remove a node that has links related to it. The referential constraint in the db makes
sure that the links are removed, but it does not clean up the path table. Thus before we
can remove a node we must look up what links that are related, and manually remove
them.
Another approach is to hide this functionality within the db, for example in a trigger.
Then the db itself detects the link removal, and can therefore clean up the path table
without us having to bother.
1
Even though triggers are part of SQL99, different vendors seems to have slightly
different syntax