Path Sharing and Predicate Evaluation for High-Performance
XML Filtering*
Yanlei Diao, Michael J. Franklin, Hao Zhang, Peter Fischer
EECS, University of California, Berkeley
{diaoyl, franklin, nhz, fischerp}@cs.berkeley.edu
Abstract
XML filtering systems aim to provide fast
matching of documents to large numbers of
query specifications containing both structure-based and value-based predicates. These two
types of predicates present very different
challenges for matching algorithms. Previous
work has addressed structure and value matching
to various degrees, but has not investigated how
to best integrate these two functions. In this
work, we first present a highly efficient, NFA-based structure matching approach that exploits
commonality among path expressions. We then
propose two alternative techniques for extending
this model with value-based predicates. These
approaches are then evaluated experimentally.
Our results show that there are substantial
differences in the performance of these
techniques, demonstrating that to be most
effective, XML filtering systems must indeed
address the interaction of the structural and content-based aspects of matching.
1 Introduction
The emergence of XML as a common mark-up language for data interchange on the Internet has spawned significant interest in techniques for filtering and content-based routing of XML data. In an XML filtering system, continuously arriving streams of XML documents are passed through a filter engine (see Figure 1) that matches documents to queries and routes the matched documents to users, systems, or even to other routers. Queries in these systems are expressed in a language such as XPath, which provides predicates over structure (i.e., path expressions) and value (i.e., node filters).
In the past few years, there have been a number of efforts to build efficient large-scale XML filtering systems. While most of these systems support both structure and value matching to some extent, they have tended to emphasize either the processing of path expressions (e.g., XFilter [AF00], WebFilter [PFL01], CQMC [OKA01], XTrie [CFG02]) or the processing of value-based predicates (e.g., TriggerMan [HCH99], NiagaraCQ [CDT00], Le Subscribe [FJL01]). Thus, even though most systems have included both aspects of matching, none of the prior work has focused on the fundamental issues that arise in their integration.
* This work has been supported in part by the National Science Foundation under ITR grant IIS00-86057, and by IBM, Microsoft, and the UC MICRO program.
Permission to copy without fee all or part of this material is granted provided that the copies are not made or distributed for direct commercial advantage, the VLDB copyright notice and the title of the publication and its date appear, and notice is given that copying is by permission of the Very Large Data Base Endowment. To copy otherwise, or to republish, requires a fee and/or special permission from the Endowment.
Proceedings of the 28th VLDB Conference, Hong Kong, China, 2002
Fig. 1: Architecture of an XML filtering system (data sources feed streams of XML documents into a filter engine, which matches them against user queries and routes query results to users)
1.1 Structure Matching
For structure matching, the natural approach has been to
adopt some form of Finite State Machine (FSM) in which
elements of path expressions are mapped to machine
states. Arriving XML documents are then parsed with an
event-based parser; the events raised during parsing are
used to drive the machine through its various transitions.
A query is determined to match a document if during
parsing, an accepting state for that query is reached.
We have developed a system, called YFilter, that
employs a novel structure-matching approach based on
Nondeterministic Finite Automata (NFA). In a large-scale filtering environment, NFAs have the significant
advantage that the number of machine states required to
represent even large numbers of path expressions can be
kept relatively small. The NFA-based approach naturally
supports the sharing of processing for overlapping
queries, and has other practical advantages including: 1)
the ability to support more complicated document types
(e.g., with recursive nesting) and queries (e.g., with
multiple wildcards and ancestor-descendant axes), and 2)
incremental construction and maintenance.
While in general there may be concern that an NFA-based approach could prove inefficient, our results show that YFilter is sufficiently fast. When it is used to process many (i.e., tens or hundreds of thousands of) queries, structure matching is no longer the dominant cost of XML filtering. Thus, while we make no claims that YFilter is the fastest possible structure-matching engine, we believe that from a practical standpoint, the combination of fast processing, flexibility, and ease of maintenance makes YFilter an excellent choice for large-scale XML filtering.
1.2 Value-based Predicates
Given the NFA-based structure-matching engine, an
intuitive approach to supporting value-based predicates
would be to simply extend the NFA by including
predicates as labels on additional transitions between
states. Unfortunately, such an approach would result in a
potentially huge increase in the number of states in the
NFA, and would also destroy the sharing of path
expressions that is a primary advantage of the NFA.
For this reason, we have investigated several
alternative approaches to combining structure-based and
value-based filtering in YFilter. Similar to traditional
relational query processing, the placement of predicate
evaluation in relation to the other aspects of such
processing can have a major impact on performance.
Relational systems use the heuristic of “pushing” cheap
selections as far as possible down the query plan so that
they are processed early in the execution. Following this
intuition, we have developed an approach called “Inline”,
that processes value-based predicates as soon as the
relevant state is reached during structure matching. In
addition, we have developed an alternative approach,
called Selection Postponed, that waits until an accepting
state is reached during structure matching, and at that
point applies all the value-based predicates for the
matched queries. As we will see, the tradeoffs in
predicate processing for XML filtering differ substantially
from those found in more traditional environments.
1.3 Contributions and Overview
In this paper, we present YFilter, an XML filtering engine
that integrates structure-based and value-based
processing. The contributions include the following:
• We describe a novel NFA-based structure matching
approach that provides excellent performance and
flexibility for large-scale filtering environments.
• We propose two alternative methods for integrating
value-based predicate processing with the NFA-based
engine.
• We present results of a detailed performance study of
an implementation of YFilter, focusing first on the
filtering time for path expressions, and then on the
comparison of the two methods for value-based
processing. This latter study is, to our knowledge,
the first study focused on alternative approaches to
the important issue of combined structure- and value-based filtering.
The remainder of the paper is organized as follows.
The NFA structure matching approach and our
alternatives for value-based predicates are presented in
Sections 2 and 3 respectively. We report the results of
our performance study in Sections 4 and 5. We address
related work in Section 6, and present conclusions in
Section 7.
2 NFA Processing for Path Expressions
In this work, we focus on queries that are written in a
subset of XPath [CD99]. XPath allows parts of XML
documents to be addressed according to their logical
structure. A query path expression in XPath is composed
of a sequence of location steps. Each location step
consists of an axis, a node test and zero or more
predicates. An axis specifies the hierarchical relationship
between the nodes. We focus on two common axes: the
parent-child operator '/', and the descendant-or-self operator "//".¹ We support node tests that are specified by
either an element name or the wildcard operator ‘*’,
which matches any element name. Predicates can be
applied to attributes of an element, to the contents of an
element, or may contain references to other elements in
the document. In this section, we describe our NFA-based
processing for structure matching. Value-based predicates
are discussed in detail in Section 3.
2.1 An NFA-based Model with Output
Any single path expression written using the axes and
node tests described above can be transformed into a
regular expression. Thus, there exists a Finite State
Machine (FSM) that accepts the language described by
such a path expression [HU79].
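To make the correspondence concrete, the sketch below (an illustration only; YFilter does not evaluate queries this way, and the path-string encoding and helper names are our own assumptions) translates a linear query from this XPath subset into an ordinary regular expression over root-to-leaf path strings such as "/a/x/b/c":

```java
import java.util.regex.Pattern;

// Hypothetical helper: translate a linear XPath query (axes '/', '//',
// element-name tests and '*') into a regular expression over a
// root-to-leaf path string.
public class XPathToRegex {
    private static final String NAME = "[^/]+";   // one element name

    static Pattern compile(String xpath) {
        StringBuilder re = new StringBuilder("^");
        // Mark "//" steps before splitting the query into location steps.
        String[] steps = xpath.replace("//", "/~").substring(1).split("/");
        for (String step : steps) {
            boolean descendant = step.startsWith("~");
            String test = descendant ? step.substring(1) : step;
            if (descendant) re.append("(/").append(NAME).append(")*"); // any number of levels
            re.append("/").append(test.equals("*") ? NAME : Pattern.quote(test));
        }
        return Pattern.compile(re.append("$").toString());
    }

    public static void main(String[] args) {
        Pattern p = compile("/a//b/c");                       // Q4 from Figure 2
        System.out.println(p.matcher("/a/x/b/c").matches());  // true
        System.out.println(p.matcher("/a/b/c").matches());    // true ("//" also matches a direct child)
        System.out.println(p.matcher("/a/c").matches());      // false
    }
}
```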
An obvious way to handle multiple queries in a filtering system would be to build an individual FSM for
each query, and to execute all of these machines each time
a new XML document arrives. XFilter [AF00] took this
approach, but used a sophisticated indexing scheme to
identify only potentially relevant machines. It then
executed all of these machines simultaneously. The
drawback of such an approach, however, is that it does
not exploit any commonality that may exist among the
path expressions. If two path expressions share a sub-expression, the language described by that sub-expression
can be accepted by a single FSM. For large-scale filtering
of XML data, exploiting such commonality is the key to
scalability.
For this reason, we have taken a different approach.
Rather than representing each query individually, we
combine all of them into a single NFA. This single
machine is effectively a trie over the strings representing
the structural components of the path expressions. As
¹ The "following-sibling" and "preceding-sibling" axes are not considered yet, so only unordered matching is supported currently.
Fig. 2: XPath queries and a corresponding NFA
(Q1=/a/b, Q2=/a/c, Q3=/a/b/c, Q4=/a//b/c, Q5=/a/*/c, Q6=/a//c, Q7=/a/*/*/c, Q8=/a/b/c)
such, all common prefixes of the paths are represented
only once in the structure.
Figure 2 shows an example of such an NFA,
representing eight queries (we describe the process for
constructing such a machine in the following section). A
circle denotes a state. Two concentric circles denote an
accepting state; such states are also marked with the IDs
of the queries they represent. A directed edge represents a
transition. The symbol on an edge represents the input
that triggers the transition. The special symbol “*”
matches any element. The symbol “ε” is used to mark a
transition that requires no input. In the figure, shaded
circles represent states shared by queries. Note that the
common prefixes of all the queries are shared. Also note
that the NFA contains multiple accepting states. While
each query in the NFA has only a single accepting state,
the NFA represents multiple queries. Identical (and
structurally equivalent) queries share the same accepting
state (recall that at this point in the discussion, we are not
considering predicates).
This NFA can be formally defined as a Moore
Machine [HU79]. The output function of the Moore
Machine here is a mapping from the set of accepting
states to a partitioning of identifiers of all queries in the
system, where each partition contains the identifiers of all
the queries that share the accepting state.
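As a minimal formalization of this output function (our notation; the paper states the mapping only in prose):

```latex
% Output function of the Moore machine underlying the combined NFA.
% F is the set of accepting states; id(p) denotes the identifier of query p.
\[
  \lambda : F \rightarrow 2^{\mathit{QueryIDs}}, \qquad
  \lambda(q) = \{\, \mathit{id}(p) \mid \text{the accepting state of query } p \text{ is } q \,\},
\]
\[
  \text{so that } \{\, \lambda(q) \mid q \in F \,\} \text{ is a partition of the set of all query identifiers.}
\]
```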
Some Comments on Efficiency
A key benefit of using an NFA-based approach is the
tremendous reduction in machine size it affords. Of
course, it is reasonable to be concerned that using an
NFA-based model could lead to performance problems
due to (for example) the need to support multiple
transitions from each state. A standard technique for
avoiding such overhead is to convert the NFA into an
equivalent DFA [HU79]. A straightforward conversion
could theoretically result in severe scalability problems due to an explosion in the number of states. But, as pointed
out in [OS02], this explosion can be avoided in many
cases by placing restrictions on the set of DTDs (i.e.,
document types) and queries supported, and lazily
constructing the DFA.
Our experimental results (described in Section 4),
however, indicate that such concerns about NFA
performance in this environment are unwarranted. In fact,
in the YFilter system, path evaluation (using the NFA) is
sufficiently fast that it is typically not the dominant cost
of filtering. Rather, other costs such as document parsing
and result collection are in many cases more expensive
than the basic path matching, particularly for systems
with large numbers of similar queries. Thus, while it may
in fact be possible to further improve path matching
speed, we believe that the substantial benefits of
flexibility and ease of maintenance provided by the NFA
model outweigh any marginal performance improvements
that remain to be gained by even faster path matching.
We revisit this issue in Section 4.
2.2 Constructing a Combined NFA
Having presented the basic NFA model used by YFilter,
we now describe an incremental process for NFA
construction and maintenance. The shared NFA shown in
Figure 2 was the result of applying this process to the
eight queries shown in that figure.
The four basic location steps in our subset of XPath
are “/a”, “//a”, “/*” and “//*”, where ‘a’ is an arbitrary
symbol from the alphabet consisting of all elements
defined in a DTD, and ‘*’ is the wildcard operator. Figure
3 shows the directed graphs, called NFA fragments, that
correspond to these basic location steps.
Note that in the NFA fragments constructed for
location steps with “//”, we introduce an ε-transition
moving to a state with a self-loop. This ε-transition is
needed so that when combining NFA fragments
representing “//” and “/” steps, the resulting NFA
accurately maintains the different semantics of both steps
(see the examples in Figure 4 below). The NFA for a
path expression, denoted as NFAp, can be built by
concatenating all the NFA fragments for its location steps.
The final state of this NFAp is the (only) accepting state
for the expression.
NFAps are combined into a single NFA as follows:
There is a single initial state shared by all NFAps. To
insert a new NFAp, we traverse the combined NFA until
either: 1) the accepting state of the NFAp is reached, or 2)
a state is reached for which there is no transition that
matches the corresponding transition of the NFAp. In the
first case, we make that final state an accepting state (if it
is not already one) and add the query ID to the query set
associated with the accepting state. In the second case,
we create a new branch from the last state reached in the
combined NFA. This branch consists of the mismatched
transition and the remainder of the NFAp. Figure 4
provides four examples of this process.
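The sketch below illustrates this insertion procedure over a trie-like, hash-table-based representation (a simplified sketch under our own assumptions; the class and field names, e.g. NfaState and queryIds, are hypothetical and not taken from the YFilter implementation):

```java
import java.util.*;

// A state of the combined NFA: a small transition table keyed by an element
// name, "*", or "eps", plus the IDs of queries accepted at this state.
class NfaState {
    final Map<String, NfaState> transitions = new HashMap<>();
    final Set<Integer> queryIds = new HashSet<>();       // non-empty => accepting
    boolean isSlashSlashChild = false;                   // target of an "eps" transition
}

class CombinedNfa {
    private final NfaState initial = new NfaState();

    // Insert one linear path expression, given as its location steps,
    // e.g. "/a//b/c" -> ["/a", "//b", "/c"].  A "//x" step is expanded into an
    // "eps" transition to a //-child state (whose '*' self-loop is left
    // implicit, as in Section 2.3) followed by the node test itself.
    void insert(int queryId, List<String> steps) {
        NfaState current = initial;
        for (String step : steps) {
            boolean descendant = step.startsWith("//");
            String test = step.substring(descendant ? 2 : 1);   // element name or "*"
            if (descendant) {
                current = childFor(current, "eps");
                current.isSlashSlashChild = true;
            }
            current = childFor(current, test);
        }
        current.queryIds.add(queryId);   // final state becomes (or already is) accepting
    }

    // Follow an existing matching transition; otherwise branch off a new state.
    private static NfaState childFor(NfaState state, String label) {
        return state.transitions.computeIfAbsent(label, k -> new NfaState());
    }
}
```

Inserting Q3 = /a/b/c and then Q4 = /a//b/c with this procedure shares the initial 'a' transition and branches with an ε edge afterwards, matching Figure 2.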
Figure 4(a) shows the process of merging a fragment
for location step “/a” with a state in the combined NFA
that represents a “/b” step. We do not combine the edge
marked by ‘a’ and the edge marked by ‘b’ into one
marked by “a,b” as in a standard NFA, because the states
after edge ‘a’ and edge ‘b’ differ in their outputs, so they
cannot be combined. For the same reason, this process
treats the ‘*’ symbol in the way that it treats the other
symbols in the alphabet, as shown in Figure 4(b).
Fig. 3: NFA fragments of basic location steps (the four location steps /a, //a, /*, and //* and their corresponding NFA fragments)
Figure 4(c) shows the process of merging a "//a" step with a "/b" step, while Figure 4(d) shows the merging of a "//a" step with a "//b" step. Here we see why we need the ε-transition in the NFA fragment for "//a". Without it, when we combine the fragment with the NFA fragment for "/b", the latter would be semantically changed to "//b". The merging process for "//*" with other fragments (not shown) is analogous to that for "//a".
The '*' and "//" operators introduce non-determinism into the model. At a state with a '*' transition, two edges must be followed: the one marked by the input symbol and the one marked by '*'. The descendant-or-self operator "//" means the associated node test can be satisfied at any level at or below the current document level. In the corresponding NFA model, if a matching symbol is read at a state with a self-loop, the processing must both transition to the next state and remain in the current state awaiting further input.
It is important to note that because NFA construction
in YFilter is an incremental process, new queries can
easily be added to an existing system. This ease of maintenance is a key benefit of the NFA-based approach.
2.3 Implementing the NFA Structure
The previous section described the logical construction of
the NFA model. For efficient execution we implement
the NFA using a “hash table-based” approach, which has
been shown to have low time complexity for
inserting/deleting states, inserting/deleting transitions, and
actually performing the transitions [Wat97].
In this approach, a data structure is created for each
state, containing: 1) the ID of the state, 2) type information (i.e., whether it is an accepting state or a //-child state, as described below), 3) a small hash table that contains all
the legal transitions from that state, and 4) for accepting
states, an ID list of the corresponding queries.
The transition hash table for each state contains
[symbol, stateID] pairs where the symbol, which is the
key, indicates the label of the outgoing transition (i.e.,
element name, ‘*’, or ‘ε’) and the stateID identifies the
child state that the transition leads to. Note that the child
states of the ‘ε’ transitions are treated specially. Recall
that such states have a self-loop marked with ‘*’ (see
Figure 3). For such states (called "//-child" states), we do
not index the self-loop. As described in the next section,
this is possible because transitions marked with ‘ε’ are
treated specially by the execution mechanism.
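Concretely, the state record just described might look like the following (a sketch; YFilterState and its field names are our own, not taken from the YFilter code base):

```java
import java.util.*;

// One NFA state as described in Section 2.3.  The '*' self-loop of a
// "//-child" state is not stored in the transition table; the execution
// logic of Section 2.4 re-adds the state's own ID when the flag is set.
class YFilterState {
    final int stateId;                                        // 1) the ID of the state
    boolean accepting = false;                                // 2) type information
    boolean slashSlashChild = false;                          //    ("//-child" flag)
    final Map<String, Integer> transitions = new HashMap<>(); // 3) label -> child stateID
                                                              //    (element name, "*", or "eps")
    final List<Integer> queryIds = new ArrayList<>();         // 4) queries accepted here

    YFilterState(int stateId) { this.stateId = stateId; }
}
```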
Fig. 4: Combining NFA Fragments
2.4 Executing the NFA
Having walked through the logical construction and
physical implementation we can now describe the
execution of the machine.
In YFilter, the streaming data is a sequence of XML
documents. Each XML document comprises well-formed
start-end element pairs with arbitrary nesting. As such,
simply treating the XML document as an input string is
beyond the scope of any language accepted by an FSM.
One way to solve this problem would be to “shred” each
XML document into a set of root-to-leaf paths, and to
invoke the NFA on each of these. Such an approach,
however, is likely to perform significant redundant
processing due to the commonality of the path prefixes.
Instead, similar to other approaches [AF00, CFG02], we chose to execute the NFA in an event-driven
fashion and use a stack mechanism to enable
backtracking. As an arriving document is parsed, the
events raised by the parser drive the transitions in the
NFA. In this way, an element in a document is processed
only once. On the other hand, the nesting of XML
elements requires that when an “end-of-element” event is
raised, NFA execution must backtrack to the states it was
in when the corresponding “start-of-element” was raised.
Since in an NFA, many states can be active
simultaneously, the run-time stack mechanism must be
capable of tracking multiple active paths. Finally, it is
important to note that, unlike a traditional NFA, whose
goal is to find one accepting state for an input, our NFA
execution must continue until all potential accepting
states have been reached. This is because we must find
all queries that match the input document.
When an XML document arrives to be parsed, the
execution of the NFA begins at the initial state. When a
new element name is read from the document, the NFA
execution follows all matching transitions from all
currently active states. The transitions are performed as
follows: For each active state, four checks are performed.
• First, the incoming element name is looked up in the
state’s hash table. If it is present, the corresponding
stateID is added to a set of “target states”.
• A transition marked by the ‘*’ symbol is checked in
the same way.
• Then, the type information of the state is checked. If the state itself is a "//-child" state, then its own stateID is added to the set, which effectively implements a self-loop in the combined NFA.
• Finally, to perform an ε-transition, the hash table is
checked for the “ε” symbol, and if one is present, the
//-child state indicated by the corresponding stateID is
processed recursively, according to these same rules.²
After all the currently active states have been checked
in this manner, the set of “target states” is pushed onto the
top of the run-time stack. They then become the “active”
states for the next event. When an end-of-element is
encountered, backtracking is performed by simply
popping the top set of states off the stack. Note that when
an accepting state is reached during processing, the
identifiers of all queries associated with the state are
collected and added to an output data structure.³
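A minimal sketch of this event-driven execution (our own simplification, building on the hypothetical YFilterState record sketched in Section 2.3; it assumes a table mapping state IDs to state records, and relies on the query rewriting of footnote 2 to keep the ε recursion shallow):

```java
import java.util.*;

// Event-driven NFA execution with a runtime stack of active-state sets.
class NfaRunner {
    private final Map<Integer, YFilterState> states;
    private final Deque<Set<Integer>> runtimeStack = new ArrayDeque<>();
    private final Set<Integer> matchedQueries = new HashSet<>();

    NfaRunner(Map<Integer, YFilterState> states, int initialStateId) {
        this.states = states;
        runtimeStack.push(Collections.singleton(initialStateId));
    }

    void startElement(String name) {
        Set<Integer> targets = new HashSet<>();
        for (int id : runtimeStack.peek()) {
            followTransitions(states.get(id), name, targets);
        }
        runtimeStack.push(targets);               // targets become the active states
    }

    void endElement(String name) {
        runtimeStack.pop();                       // backtrack to the enclosing element
    }

    private void followTransitions(YFilterState s, String name, Set<Integer> targets) {
        Integer byName = s.transitions.get(name); // 1) element-name transition
        if (byName != null) collect(byName, targets);
        Integer byStar = s.transitions.get("*");  // 2) '*' transition
        if (byStar != null) collect(byStar, targets);
        if (s.slashSlashChild) collect(s.stateId, targets);   // 3) implicit '*' self-loop
        Integer byEps = s.transitions.get("eps"); // 4) ε-transition: process the
        if (byEps != null)                        //    //-child state recursively
            followTransitions(states.get(byEps), name, targets);
    }

    private void collect(int stateId, Set<Integer> targets) {
        targets.add(stateId);
        YFilterState s = states.get(stateId);
        if (s.accepting) matchedQueries.addAll(s.queryIds);   // record matched queries
    }
}
```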
An example of this execution model is shown in Figure 5. On the left of the figure is the index created for the NFA of Figure 2. The number on the top-left of each hash table is a state ID, and hash tables with a bold border represent accepting states. The right of the figure shows the evolution of the contents of the runtime stack as an example XML document is parsed. In the stack, each state is represented by its ID. An underlined ID indicates that the state is a //-child.
Fig. 5: An example of NFA execution (the NFA index is run over the XML fragment <a><b><c></c></b></a>; Q1 matches when <b> is read, and Q3, Q8, Q5, Q6, and Q4 match when <c> is read)
3 Predicate Evaluation
As stated in the introduction, previous work on
information filtering has emphasized either structure-based filtering or content-based filtering. In general,
however, languages such as XPath allow the specification
of queries that contain both types of constraints. Thus, a
primary goal and contribution of our work is to explore
how to efficiently support combined structure/content
queries for large-scale filtering.
XPath predicates can impose constraints on elements
in an XML document by addressing properties of
elements, such as their content, their position, and their
attributes. Examples include:
• Existence or the value of an attribute in an element, e.g., /nitf[@id], /nitf[@id >= 5];
• The text data of an element, e.g., /nitf//title[.="XPath"];
• The position of an element, e.g., /nitf/head/meta[position()=2], which means "select the second meta child of head that is a child of nitf".
Any number of these predicates can be attached to a
location step in a query. Predicates can also reference
another path in the document. Such “nested paths” are
handled by decomposing them into separate paths in the
NFA and performing post-processing to ensure that all
paths of a query are satisfied (a similar technique is
employed in XFilter). Due to space constraints we focus
in this section on predicates that do not contain such
nested paths.
An intuitive approach to predicate evaluation is to
simply extend the NFA model described in Section
2.2 by including predicates as labels on additional
transitions from the states representing the location steps
they are associated with. Unfortunately, such an approach
would result in a potentially huge increase in the number
of states in the NFA, and would destroy the sharing of
path expressions that is a primary advantage of the NFA.
Instead, we have chosen to implement predicates
using a special operator, called Selection, that exists
outside the NFA model. We have developed two
alternative approaches to implement selection. The first
approach, called Inline, applies selection during the
execution of the NFA, while the second, called SP (for
“selection postponed”), simply runs the NFA as described
in the preceding section, and then applies the selection
predicates in a post-processing phase. Below, we discuss
these two alternatives in more detail. We compare their
performance in Section 5.
² We use query rewriting to collapse adjacent "//" operators into a single "//" operator, which is semantically equivalent. Thus, the process traverses at most one additional level, since //-child nodes do not themselves contain an "ε" symbol.
³ If predicate processing is not needed, we can also mark the accepting state as "visited" to avoid processing matched queries more than once.
3.1 Implementation of the Inline Approach
For the Inline approach, we extend the information stored
in each state of the NFA to include any predicates that are
associated with that state, as shown in Figure 6. Since
multiple path expressions may share a state, this table can
include predicates from different queries. We distinguish
among these by including a Query Id field. In a particular state, the pair (Query Id, Predicate Id) uniquely identifies a predicate.
Fig. 6: Predicate Storage for Inline (each state carries a predicate table with columns [QueryId, PredicateId, property, operator, value])
The Inline approach works as follows. When a start-of-element event is received, the NFA transitions to new
states as described in Section 2. For each of these states,
all of the predicates in the state-local predicate tables are
checked. For each query, bookkeeping information is
maintained, indicating which of the predicates of that
query have been satisfied. When an accepting state is
reached, the bookkeeping information for all of the
queries of that state is checked, and those queries for
which all predicates have been satisfied are returned as
matches.
There are several details to consider with this
approach, however. The first issue has to do with whether and
when predicate failure can reduce further work. When a
predicate test at a state fails, the execution along that path
cannot necessarily be terminated due to the shared nature
of the NFA: if other queries share the state but did not
contain a predicate that failed there, then processing must
continue for those queries. Furthermore, if a query
contains a “//” prior to a predicate, then even if the
predicate fails, the query effectively remains active due to
the non-determinism introduced by that axis. For these
reasons, the common query optimization heuristic of
“pushing selects” to earlier in the evaluation process in
order to reduce work later, is not likely to be effective in
this environment.
A second issue is that, due to the nested structure of
XML documents, it is likely that backtracking will occur
during the NFA processing. Such backtracking further
complicates the task of tracking which predicates of
which query have been satisfied.
For example, consider query q1 = "/b[@fb1=u][@fb2=w]". q1 consists of a single location step with two
predicates (on two different attributes of “b” elements). If
care is not taken during backtracking, the following error
could occur. Consider the XML fragment “<b fb1=u>
</b> <b fb2=w> </b>”. When the first “b” element is
parsed, the first predicate of q1 is set to true. When the
corresponding end-of-element tag is parsed, the NFA
returns to the previous state. Then, when the second “b”
element is parsed, the second predicate of q1 is marked as
true. Thus, it wrongly appears that q1 has been satisfied.
This problem can be solved by augmenting the
backtracking to include the “undo” of any changes to
predicate bookkeeping information made while
processing the state.
Unfortunately, the above solution does not solve a
similar problem that exists for recursively nested
elements. For example, consider query q1 when applied
to the following XML fragment: <b fb1=u> <b fb2=w>
</b> </b>. When the first ‘b’ element is parsed, the first
predicate of q1 is set to true. Before the execution backtracks
for this ‘b’, another ‘b’ is read, which sets the second
predicate to true, thus again, erroneously indicating that
query q1 has been satisfied.
In order to solve this latter problem, additional
bookkeeping information must be kept for predicate
evaluation. This additional information identifies the
particular event that caused each predicate to be set to
True. During the final evaluation for a query at its
accepting state, the query is considered to be satisfied
only if all predicates attached to the same location step are
satisfied by the same event.
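One way to realize this bookkeeping (a sketch under our own assumptions: the per-query grouping of predicate IDs by location step, and the use of a parser event counter as the event identifier, are hypothetical choices not specified in the paper):

```java
import java.util.*;

// Per-query bookkeeping for Inline: which predicates have been satisfied,
// and by which start-of-element event.
class QueryBookkeeping {
    // (queryId, predicateId) -> id of the start-element event that satisfied it
    private final Map<Long, Integer> satisfiedBy = new HashMap<>();

    private static long key(int queryId, int predicateId) {
        return ((long) queryId << 32) | (predicateId & 0xffffffffL);
    }

    // Called when a predicate evaluates to true at some state.
    void markSatisfied(int queryId, int predicateId, int eventId) {
        satisfiedBy.put(key(queryId, predicateId), eventId);
    }

    // Called during backtracking: undo changes made while processing eventId.
    void undo(int eventId) {
        satisfiedBy.values().removeIf(e -> e == eventId);
    }

    // Final check at an accepting state: all predicates attached to the same
    // location step must be satisfied, and by the same event.
    boolean querySatisfied(int queryId, Map<Integer, List<Integer>> predsByStep) {
        for (List<Integer> stepPreds : predsByStep.values()) {
            Integer commonEvent = null;
            for (int predId : stepPreds) {
                Integer e = satisfiedBy.get(key(queryId, predId));
                if (e == null) return false;                   // predicate never satisfied
                if (commonEvent == null) commonEvent = e;
                else if (!commonEvent.equals(e)) return false; // satisfied by different elements
            }
        }
        return true;
    }
}
```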
Finally, as an optimization used in our implementation
of Inline, the predicates stored at a state are partitioned by
property, operator and value. When an element is read,
the information obtained from the start-element event is
used to probe the predicate table, thus avoiding a full scan
of the table.
3.2 Implementation of Selection Postponed (SP)
Effort spent evaluating predicates with Inline will be
wasted if ultimately, the structure-based aspects of a
query are not satisfied. An alternative approach that
avoids this problem is to delay predicate processing until
after the structure matching has been completed. In this
approach, called SP, the predicates are stored with each query, as shown in Figure 7. We chose to index the predicates for a query by the field "step number", which indicates the location step where a predicate appears.
Fig. 7: Predicate Storage for SP (per-query predicates indexed by step number, with property, operator, and value fields)
In contrast to Inline, with SP structure matching and predicate checking are clearly separated.
Fig. 8: A sample query, its NFA, and the NFA execution (q2 = //a[@fa1=v]//b, evaluated over the XML fragment <a fa1=u><a fa1=v><b></b></a></a>; q2 is matched when <b> is read)
When an accepting state is reached, selection
is performed in bulk, by evaluating all of the relevant
predicates. This approach has several potential
advantages. First, there is no need to extend the NFA
backtracking logic as described for Inline. Second, since
the predicates of different location steps are treated as
conjunctions, a short-cut evaluation method is possible,
where some predicate checking may be avoided for
queries that are determined not to match a particular
document.⁴
In order to delay selection until the end of structure
matching, however, the NFA must be extended to retain
some additional history about the states visited during
structure matching. The reason for this is demonstrated
by the following example.
Consider query q2 and an XML document fragment as
shown in Figure 8. When element ‘b’ of the document is
parsed, the NFA execution arrives at the accepting state of
the NFA for this query (also shown in Figure 8). When
selection processing is performed for q2, we need to
determine which of the two ‘a’ elements encountered
during parsing to test the predicate on.
A naïve method would be to simply check all of the
'a' elements encountered. Unfortunately, with more "//" operators in a query or more recursive elements in the document, searching for matching elements for predicate evaluation could become as expensive as running the NFA again for this query. Instead, we extend the NFA to
output not only query IDs, but a list indicating the
sequences of document elements that would participate in
predicate evaluation.
For example, in the accepting state for q2, the NFA
would report the sequences “a1 b” and “a2 b”, where a1
represents the first ‘a’ element and a2 represents the
second (nested) ‘a’ element. Since predicates are indexed
by “step number”, it is easy for the selection operator to
determine which elements need to be tested. For query
q2, the first sequence does not satisfy the query because a1
does not satisfy the predicate, but the second sequence
does.
The NFA is easily extended to output these sequences
by linking the states in the runtime stack backwards towards the root (also as shown in Figure 8, which includes the content of the stack for the accepting state). For each active state that is an accepting state, we can traverse backwards to find the sequence of state visits that leads to the state. Note that elements that trigger transitions to "//-child" states can be ignored in this process, as they do not participate in predicate evaluation.
⁴ Note, however, that with predicate evaluation it becomes possible to visit a given accepting state multiple times, due to predicate failure and backtracking. Such short-cut predicate evaluation only saves work for a single visit.
Returning to the example in Figure 8, there are two routes, namely "2 3 5" and "3 4 5", that the NFA took when the elements a1, a2 and b were read. After eliminating the elements that trigger transitions to "//-child" states for each route, the two sequences of matching elements, "a1 b" and "a2 b", can be generated.
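A sketch of this backward traversal (our own assumptions: the BacktrackEntry record, its parent link, and the flag marking transitions into //-child states are one possible way to realize the backward-linked runtime stack, not the paper's implementation):

```java
import java.util.*;

// One entry recorded for an active state during NFA execution, extended with
// a link to the entry it was reached from and the document element that
// triggered the transition.
class BacktrackEntry {
    final int stateId;
    final BacktrackEntry parent;      // link back towards the root
    final String element;             // triggering element, or null for ε entries
    final boolean viaSlashSlashChild; // transition into a "//-child" state

    BacktrackEntry(int stateId, BacktrackEntry parent, String element,
                   boolean viaSlashSlashChild) {
        this.stateId = stateId;
        this.parent = parent;
        this.element = element;
        this.viaSlashSlashChild = viaSlashSlashChild;
    }

    // Walk backwards from an accepting entry and collect the sequence of
    // elements that participate in predicate evaluation (root first).
    List<String> matchingElements() {
        Deque<String> seq = new ArrayDeque<>();
        for (BacktrackEntry e = this; e != null; e = e.parent) {
            if (e.element != null && !e.viaSlashSlashChild) seq.addFirst(e.element);
        }
        return new ArrayList<>(seq);
    }
}
```

The selection operator can then test, for each location step that carries predicates, the corresponding element of each reported sequence, stopping at the first failing predicate (the short-cut evaluation mentioned above).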
4 Performance Study of the NFA Execution
Having described YFilter in some detail, we now turn to
an investigation of its performance. We proceed in two
steps. First, in this section, we examine the performance
of the NFA-based structure matching approach employed
by YFilter in the absence of support for value-based
predicates. Here, we compare YFilter to two alternatives:
XFilter, and a hybrid approach. Then, in Section 5, we
focus on YFilter, and compare the Inline and SP
approaches to content-based predicate evaluation.
4.1 Implementation of Alternative Approaches
In order to better understand the performance of YFilter,
we implemented two alternative approaches. The first is
XFilter [AF00], which was one of the early filtering
systems to represent XPath queries as FSMs and use
event-based parsing to drive their execution. XFilter has
been used as a point of comparison in several other
studies (e.g. [PFL01], [CFG02], [OS02]). XFilter also
served as the starting point of the work presented here.
Due to space limitations, we cannot present the details
of XFilter here. As an overview, XFilter works as
follows: Each query is decomposed into a sequence of
location steps and annotated with the relative distance
from the previous one in terms of document levels. These
steps become states in an FSM representing the query.
Matching parse events trigger transitions to a new state if
the current document level is equal to the expected
document level of the new state. When changing to the
new state, the expected level of the state is set using the
relative distance information.
To support simultaneous execution of all FSMs, a
hash table maps element names to a list of states that are
candidates for transition. When the transition for a state in a list actually fires, that state is deleted from the index and its successor in the query is inserted into the corresponding entry. We
implemented an improved version of XFilter, using “list
balancing” which was shown in [AF00] to provide
significant performance improvements.
As has been pointed out in the studies listed above, an
Achilles heel of XFilter is that it does not support the
sharing of representation or execution among path
expressions. In order to address this issue, we also
implemented an improved version of XFilter, that we call
“Hybrid”, which does use path sharing (although not as
aggressively as YFilter). The Hybrid approach is similar
to the “minimal decomposition” and “eager TRIE”
techniques developed independently by [CFG02].⁵
⁵ Note that this approach was not the best-performing one used in [CFG02]. However, it does provide some insight into YFilter's expected performance relative to the other algorithms in that paper, as described in the next section.
Since full path sharing requires the index scheme and
execution algorithm to handle the non-determinism
introduced by ‘*’ and “//” operators, Hybrid simply
decomposes queries into substrings containing only ‘/’.
The collection of these substrings is put into a single
index for all queries. In addition, each query maintains
relative distance information for transitions from the end
of a substring to the start of the next substring. During the
execution, the transition between two states in a substring
is shared by all queries containing this substring. The
transition across two substrings is done on a per-query
basis using document-level checking, as in XFilter.
For both XFilter and Hybrid, we added a simple
optimization that is extremely important in our workloads,
namely, that identical queries are represented in the
system only once. We did this by pre-processing the
queries and collecting the IDs of identical queries in an
auxiliary data structure. This structure is the same as that
used by YFilter to manage query IDs in accepting states.
4.2 Experimental Set-up
We implemented the three algorithms (YFilter, XFilter
(with list balancing), and Hybrid) using Java. All of the
experiments reported here were performed on a Pentium
III 850 MHz processor with 384 MB of memory running JVM 1.3.0 in server mode on Linux 2.4. We set the maximum allocation pool of Java to 200 MB, so that virtual memory and other I/O activity had no influence on
the results. This was also verified using the Linux
command vmstat.
Following the workloads of [AF00], each experiment
was generated from a single DTD. In this section, we
focus on experiments using the NITF (News Industry
Text Format) DTD [Cov99] that has been used in
previous studies [AF00, CFG02]. We did, however, run
experiments using two other DTDs: The Xmark-Auction
DTD [BCF01] from the Xmark benchmark, and the
DBLP [DBL01] bibliography DTD. Some characteristics
of these DTDs are shown in Table 1.
DTD       # element names   # attributes in total
NITF            123                 510
Auction          77                  16
DBLP             36                  14
Table 1: Characteristics of three DTDs
Given a DTD, the tools used to generate an
experiment include a DTD parser, an XML generator, a
query generator and a SAX2.0 [SAX01] event-based
XML parser. The DTD parser, which was developed using the Wutka DTD parser [Wut00], outputs information on parent-child relationships between elements, statistics for each element, etc., which is used by the query
generator and the document generator. We wrote a query
generator, which reads the output of the DTD parser and
creates a set of XPath queries based on the workload
parameters listed in Table 2. The parameters Q, D, W,
and DS are relevant to the experiments in this section. As
reported in [AF00], the average length of generated
queries or documents does not increase much when D is
larger than 6, so this value was used for all experiments.
For document generation, we used IBM’s XML Generator
[DL99]. As a default, this generator limits the number of
times that an element can be repeated under a single
parent to 3. We also used a uniform distribution to
choose among the possible elements at each level in the
document.
Par.   Range            Description
Q      1000 to 500000   Number of queries
D      6 (fixed)        Maximum depth of XML documents and XPath queries
W      0 to 1           Probability of a wildcard '*' occurring at a location step
DS     0 to 1           Probability of "//" being the operator at a location step
P      0 to 20          Number of predicates per query (avg.)
Table 2: Workload Parameters
For each DTD we generated a set of 200 XML
documents. All reported experimental results are averaged
over this set. For each experiment, queries were generated
according to the workload setting. For each algorithm,
queries were preprocessed, if necessary, and then bulk
loaded to build the index and other data structures. Then
XML documents were read from disk one after another.
The execution for each document returned a bit set, each
bit of which indicates whether or not the corresponding
query has been satisfied. For each experiment run of an
algorithm (i.e., 200 documents) we began a new process,
to avoid complications from Java’s garbage collector.
Previous work [AF00, CFG02] used “filtering time”
as the performance metric, which is the total time to
process a document including parsing and outputting
results. Noticing that Java parsers have varying parsing
costs, we instead report on a slightly different
performance metric we call “multi-query processing time
(MQPT)”. MQPT is simply the filtering time minus the
document parsing time.
MQPT consists of two primary components: path
navigation and result collection. The latter is the cost to
collect the identifiers of queries from the auxiliary data
structures and to mark them in the result bit set. We
measured these costs separately. Where appropriate, we
also report on other metrics such as the number of
transitions followed, the size of the various machines, and
the costs associated with maintenance, etc.
4.3 Efficiency and Scalability
Having described our experimental environment, we
begin our discussion of experimental results by presenting
the MQPT results for the three alternatives as the number of queries in the system is increased.
Fig. 9: Varying number of queries (NITF, D=6, W=0.2, DS=0.2; Inline omitted, comparing YFilter, Hybrid, and XFilter(lb))
Fig. 10: Varying number of queries (Auction, D=6, W=0.2, DS=0.2)
Fig. 11: Varying "//" probability (NITF, Q=500000, D=6, W=0)
4.3.1 Experiment 1: NITF
Figure 9 shows the MQPT for the three algorithms as the
number of queries in the system is increased from 1000 to
500,000, under the NITF workload, with the probability
of ‘*’ and ‘//’ operators each set to 0.2. With this setting,
there is approximately one '*' operator and one "//" operator in each query. For each data point, the bars
represent, from left to right: YFilter, Hybrid, and XFilter.
As can be seen in the figure, YFilter provides significantly better performance than the other two across
the entire range of query populations. XFilter is the
slowest here by far, and not surprisingly, Hybrid’s
performance lies between the two.
In the figure, MQPT is split into its component costs:
path navigation and result collection. In terms of path
navigation, YFilter exhibits a cost of around 20 ms when
Q is larger than 50,000. In contrast, the navigation cost of
XFilter increases dramatically as Q increases, to 633 ms
at 500,000, while Hybrid takes 328 ms at this point. Thus, YFilter exhibits an order-of-magnitude improvement in path navigation over these other schemes.⁶ The
performance benefits of YFilter come from two factors.
The first is the benefit of shared work obtained by the
NFA approach. The second is the relatively low cost of
state transition in YFilter (compared to the others) that
results from the hash-based implementation described in
Section 2.
The levelling off of YFilter’s navigation time (in fact,
all of the approaches show some degree of this) is
explained by Table 3, which shows the number of distinct
queries in the system for increasing values of Q in this
experiment. While all of the systems can exploit
completely identical queries, YFilter is the most
successful among the three at exploiting commonality
among similar, but not exactly identical, queries (recall that all common prefixes of the paths are shared).
⁶ Recall that our Hybrid approach is similar to the Eager Trie proposed in [CFG02]. The fastest algorithm studied there, called Lazy Trie, was shown to have about a 4x improvement over XFilter when used with a number of distinct queries similar to what arises with Q=500,000 here.
Q (x1K)                      1     100    200    300    400    500
# distinct queries (x1K)    0.6   17.5   27.1   34.1   40.1   45.3
Table 3: Number of Distinct Queries as Q Increases (NITF, D=6, W=0.2, DS=0.2)
In this experiment, the MQPT of YFilter is dominated
by the cost of result collection when Q is larger than
300,000. Although we coded result collection carefully,
we believe that it can be sped up somewhat. However,
when document parsing time is also considered, it becomes clear that YFilter's performance in this case is such that any further improvements in path navigation time will have, at best, a minor impact on overall performance.
The Xerces [Apa99] parser we used, set in a non-validating mode, took 168 ms on average to parse a
document. It completely dominated the NFA-based
execution. We tried other publicly available Java parsers, including the Java XML Pack [JXP01] and the Saxon XSLT processor [Kay01], both supporting SAX 2.0 [SAX01]. Saxon gave the best performance at 81 ms, still substantially more than the NFA navigation cost.
We have also experimented with C++ parsers, which
are much faster, but even with these parsers we would
expect parsing time to be similar to the cost of path
navigation with YFilter, particularly if YFilter were also
implemented in C++!
4.3.2 Experiment 2: Other DTDs
As stated above, we also ran experiments using two other
DTDs. Space precludes us from describing these results
in detail, so we summarize them here.
Figure 10 shows the MQPT results obtained for the
three algorithms using the same parameter settings as in
the previous experiment, but with the Xmark-Auction
DTD. As can be seen in the figure, the trends observed
using NITF are also seen here: YFilter performs
substantially better than the other two for all Q values
tested.
This DTD differs from NITF in that, due to its structure, it tends to generate longer documents (on average 270 start-end element pairs, 3.5 times that of NITF). Thus, all algorithms take longer to filter the documents here. XFilter, however, is particularly
sensitive to the length of documents because its FSM
representation and execution algorithm cause significant
memory management overhead, which in turn invokes
garbage collection much more frequently.
The results obtained using the DBLP DTD also told a
similar story, but the differences were not as great due to
the small documents and queries it generates. Details are
omitted due to space limitations.
4.4 Experiment 3: Varying the Non-determinism
The probabilities of the "//" and '*' operators determine the complexity of processing path expressions. Their impact on the performance of the
algorithms is studied in this subsection. We used NITF in
all the following experiments.
The MQPT was first measured with “*” probability
set to zero and “//” probability varied from 0 to 1. The
results for Q=500,000 are shown in Figure 11. As can be
seen in the figure, XFilter is extremely sensitive to this
parameter, while Hybrid is less so. YFilter shows very
little sensitivity. These results are driven largely by the
effect of “//” probability on the number of distinct queries,
and the sensitivity of the algorithms to that factor.
The effect of increasing “//” probability on the number
of distinct queries is two-fold, as shown in Table 4. When the probability is below 0.5, increasing it causes more "//" operators to occur in the queries, yielding a growing number of distinct queries. When it is larger than 0.5, most operators for location steps are "//", resulting in a decreasing number of
distinct queries. Note that there are fewer distinct queries
when the probability equals 0 (only ‘/’ operators) than
when it equals 1 (only “//” operators). This is because
every query starts from the root element in the former
case, while a query can start at any level in the latter case.
"//" probability              0     0.2    0.4    0.5    0.6    0.8     1
# distinct queries (x1K)     0.8   19.2   26.1   26.8   25.6   18.9   4.3
Table 4: Number of Distinct Queries as "//" Probability Increases (NITF, D=6, Q=500000, W=0)
It is interesting to note that when “//” probability is 0,
Hybrid is identical to YFilter because there is no
decomposition of queries. As the probability increases from 0 to 1, Hybrid's performance approaches that of XFilter. At DS = 1, every query is decomposed into single elements and its performance is very close to that of XFilter.
We also ran experiments varying the probability of the '*' operator, and the results were similar. Since '*'
operators have a smaller effect on the number of distinct
queries, the difference among algorithms was somewhat
less pronounced.
4.5 Experiment 4: Maintenance cost
The last set of experiments we report on in this section
deal with the cost of maintaining the YFilter structure,
which is expected to be one of the primary benefits of the
approach. Updates to the NFA in YFilter are handled as
follows: To insert a query, we merge its NFA
representation with the combined NFA as described in
Section 2.2. To delete a query, the accepting state of the
query is located and the query’s identifier is deleted from
the list of queries at this state. If the list becomes empty
and the state does not have a child state in the NFA, the
state is deleted by removing the link from its parent. The
deletion of this state can be propagated to its
predecessors. An update to a query is treated as a delete
of the old query followed by the insertion of the new one.
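A sketch of deletion with this upward pruning (under our own assumptions: each state keeps a link to its parent and the label of its incoming transition, which the paper does not specify):

```java
import java.util.*;

// States as in the earlier sketches, extended with back-links for deletion.
class MaintState {
    final Map<String, MaintState> transitions = new HashMap<>();
    final Set<Integer> queryIds = new HashSet<>();
    MaintState parent;          // state this one branches off from
    String incomingLabel;       // label of the transition from the parent
}

class NfaMaintenance {
    // Delete a query: remove its ID from the accepting state's list; if the
    // list becomes empty and the state has no children, remove the state and
    // propagate the deletion towards the root (Section 4.5).
    static void delete(int queryId, MaintState acceptingState) {
        acceptingState.queryIds.remove(queryId);
        MaintState state = acceptingState;
        while (state.parent != null
                && state.queryIds.isEmpty()
                && state.transitions.isEmpty()) {
            state.parent.transitions.remove(state.incomingLabel);  // unlink from parent
            state = state.parent;                                  // propagate upwards
        }
    }
}
```

Under the lazy scheme described next, this routine would simply be deferred and run asynchronously after the query ID has been added to the list of deleted queries.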
Deletion may be inefficient because of modification of
the list at the accepting state and the deletion of states. As
demonstrated in the previous sections, YFilter’s
performance is fairly insensitive to the number of queries
in the system.
Thus, instead of deleting queries
immediately, we can adopt a lazy approach where a list of
deleted queries is maintained. This list is used to filter out
such queries before results are returned. The actual
deletions can then be done asynchronously. Thus, in this
section we focus on the performance of inserting new
queries.
We measured the cost of inserting 1000 queries with varying numbers of queries already in the index. With Q
= 1000 (i.e., 1000 queries already in the NFA), it takes 80
ms to insert the new queries. At this point, the chance of a
query being new is high, requiring new states to be
created and transition functions to be expanded by adding
more hash entries to the states. However, the cost drops
dramatically as more queries are added to the system.
Beyond Q=10,000, the insertion cost remains constant at 5 to 6 ms. This is because most path expressions are already present in the index, so inserting a new query
often requires only the traversal down a single path to an
existing accepting state, and the insertion of a new query
ID into the list kept at that state. Thus, the NFA-based
structure requires a very low maintenance overhead once
a reasonable number of queries exist in the system.
5 Experiments on Selection
Having shown the efficiency of NFA-based structure
matching, we now turn to our two approaches to
integrating content-based predicate processing.
The NITF DTD was used for all experiments
presented in this section. XML documents are generated
with element data and attributes using statistics obtained
from the DTD parser. The statistics for each element
include the maximum number of values the element can
take, the probability of an attribute occurring in the
element and the maximum number of values an attribute
can take. All probabilities were chosen uniformly between
0 and 1. The maximum number of values for elements
and attributes was chosen uniformly between 1 and 20.
The XML generator was then extended to generate
documents with data values and attributes according to
these probabilities.
Fig. 12: Varying number of queries (D=6, W=0.2, DS=0.2; Inline vs. SP for P=1 and P=2)
Fig. 13: Varying number of predicates (D=6, Q=50000, W=0.2, DS=0.2)
For query generation, the parameter P (see Table 2) was used to determine the average number of predicates that appear in each query. Such predicates are distributed
among the location steps with uniform probability. In the
following experiments, P varies from 0 to 20. Note that
the addition of predicates to the queries makes the queries
substantially more selective than they were in the
experiments described in the previous section. Also, such
predicates make the occurrence of exactly identical
queries much less likely (although the probability of
structural equivalence remains the same as in the
experiments of the previous section).
5.1 Experiment 5: Efficiency and Scalability
In this experiment we examined the relative performance
of Inline and SP as the number of queries is varied from
1,000 to 500,000. Figure 12 shows the MQPT of the two
approaches for the cases P=1 and P=2. Note that when P=1, 22.5% of the queries on average were satisfied by a document. When P=2, 13.2% of the queries were
satisfied.
As can be seen in the figure, SP outperforms Inline by
a wide margin. When P=1, for example, SP runs 3.8
times faster than Inline when Q=100,000, and 4.4 times faster when Q=400,000. Inline suffers from the costs of
unnecessary and repeated predicate evaluation (as
described in Section 3) as well as from the overhead of
maintaining the bookkeeping information in evaluation
data structures. This extra overhead causes Inline to run
out of memory in this case, for Q = 400,000 and above.
When the number of predicates per query is doubled
(P=2, also shown in Figure 12) the differences between
the two approaches are even more pronounced. In this
case, SP runs 4.1 times faster than Inline when
Q=100,000, and Inline runs out of memory with roughly
half as many queries as when P=1.
Figure 13 shows the MQPT of the two approaches as
the number of predicates per query is varied from 0 to 20
for a relatively small number of queries (Q =50,000). As
can be seen in the figure, SP is much less sensitive to the
number of predicates per query than Inline. This is
because of SP’s lower overhead and its ability to avoid
many of the unnecessary predicate evaluations performed
by Inline.
When comparing these results with those of the
previous experiments, it is important to note that for
YFilter, the cost of content-based matching can easily
dominate the cost of structure matching.
Fig. 14: Effect of predicate sorting (D=6, Q=50000, W=0.2, DS=0.2)
5.2 Experiment 6: Predicate Sorting
The previous experiment demonstrated the benefits of
delaying content-based matching in YFilter. One of the
major benefits was seen to be the ability to “short-cut” the
evaluation process for a query when one predicate fails.
This observation raises the potential to further improve
the chances of such short-cut evaluation by modifying the
order of predicate checking to evaluate highly-selective
predicates first, as is done by most relational query
optimizers.
The selectivity of predicates on element data or on
element position can be estimated from the number of
their possible values. Estimating selectivity of predicates
on attributes requires additional information on the
probability of an attribute occurring. Some examples of
equality predicates on these properties for a location step
on an element “a” are given as follows:
• Sel([@attr]) = probability of the attribute occurring in
element ‘a’.
• Sel([@attr="v"]) = Sel([@attr]) / max. no. of values attribute 'attr' can take;
• Sel([.=”v”]) = 1 / max. no. of values element ‘a’ can
take.
The selectivity estimates for predicates involving
other comparison operators can be derived in a similar
way. Having a wildcard in a location step requires using
some averaged statistics. Other formulas are omitted here
due to space constraints.
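A sketch of this selectivity-based ordering (our own assumptions: the DTD-derived statistics maps and the Pred record are hypothetical names, and only the three equality predicates listed above are handled):

```java
import java.util.*;

// Estimate predicate selectivities from DTD statistics and sort each
// query's predicates in ascending order of selectivity.
class PredicateSorter {
    // A predicate on a location step over element 'element':
    // [@attr] (ATTR_EXISTS), [@attr="v"] (ATTR_EQ), or [.="v"] (TEXT_EQ).
    enum Kind { ATTR_EXISTS, ATTR_EQ, TEXT_EQ }
    record Pred(Kind kind, String element, String attr) { }

    private final Map<String, Double> attrProbability;  // (element + "@" + attr) -> P(attr occurs)
    private final Map<String, Integer> maxAttrValues;   // (element + "@" + attr) -> max #values
    private final Map<String, Integer> maxElemValues;   // element -> max #values of its text data

    PredicateSorter(Map<String, Double> attrProbability,
                    Map<String, Integer> maxAttrValues,
                    Map<String, Integer> maxElemValues) {
        this.attrProbability = attrProbability;
        this.maxAttrValues = maxAttrValues;
        this.maxElemValues = maxElemValues;
    }

    // Apply the formulas of Section 5.2 to one predicate.
    double selectivity(Pred p) {
        String key = p.element() + "@" + p.attr();
        switch (p.kind()) {
            case ATTR_EXISTS: return attrProbability.get(key);
            case ATTR_EQ:     return attrProbability.get(key) / maxAttrValues.get(key);
            case TEXT_EQ:     return 1.0 / maxElemValues.get(p.element());
            default:          throw new IllegalArgumentException();
        }
    }

    // Order the predicates of one query so that the most selective ones
    // (smallest estimated selectivity) are evaluated first.
    void sortAscending(List<Pred> queryPredicates) {
        queryPredicates.sort(Comparator.comparingDouble(this::selectivity));
    }
}
```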
We performed a simple experiment to examine the
potential performance benefits of predicate reordering in
the SP approach. In this experiment, we sorted predicates
in ascending order of selectivity for each query. Figure 14
shows the MQPT for SP with and without sorting, as P is
varied from 0 to 20 for Q=50,000. The results indicate that, as expected, additional benefits
can indeed be gained by predicate sorting, particularly for
cases with large numbers of predicates.
6 Related Work
As discussed in the introduction, previous work on XML
filtering, while often supporting both structure-based and
content-based matching, has not directly addressed the
interaction between these two very different processes. In
addition to this work, a number of other efforts are related
to the work described here.
The evaluation of path expressions on streaming data
was studied in [ILW00], where queries that include
resolving IDREFs are expressed by several individual
FSMs that are generated on the fly. [OS02] proposes an
approach to combine all path expressions into a single
DFA, resulting in good performance, but with significant
limitations. As shown in Sections 4 and 5, when using an
NFA-based algorithm such as YFilter, structure matching
is no longer the dominant cost of filtering. As a result we
do not believe that trading flexibility for any further
improvements in structure match speed is worthwhile.
Continuous query systems perform content-based filtering using the relational model and its techniques. The concept of "expression signatures" was introduced by TriggerMan [HCH99]. In NiagaraCQ [CDT00, CDN02], such expression signatures are used to incrementally group query plans for continuous queries.
CACQ [MSH02] combines adaptivity and grouping for
CQ. [LPT99] uses grouped triggers that are executed
incrementally in combination with cached views.
A number of event-based publish/subscribe systems
have been developed: among them, [NAC01], which uses
a tree of hashtables as its index structure, with event sets
as profiles.
7 Conclusions
In this paper, we studied integrated approaches to
handling both structure-based and content-based matching
for XML documents. We first proposed an NFA-based
structure matching engine, and showed that it is a flexible
approach that provides excellent performance by
exploiting overlap of path expressions. With such an
approach, structure matching is no longer the dominant
cost for XML filtering.
We then investigated two alternative techniques for
integrating content-based matching with the NFA. The
results of the Selection experiments provide a key insight
arising from our study, namely, that structure-based
matching and content-based matching cannot be
considered in isolation when designing a high-performance XML filtering system. In particular, our experiments demonstrated that, contrary to traditional database intuition, pushing even simple selections down through the query plan may not be effective and, in fact, can be quite detrimental when used in combination with an
NFA-based structure matching technique that exploits the
commonality among path expressions. This result is
particularly relevant, given that content-based matching in
such a system can very easily dominate the cost of
structure matching.
Acknowledgements
We would like to thank Raymond To for helping us
develop YFilter, and Philip Hwang for helping provide
insight into XML parsing. We would also like to thank
Mehmet Altinel for valuable comments on early drafts of
this paper.
References
[Apa99] Apache XML project. Xerces Java Parser 1.2.3 Release.
http://xml.apache.org/xerces-j/index.html, 1999.
[AF00] M. Altinel, M. Franklin. Efficient Filtering of XML
Documents for Selective Dissemination of Information. In VLDB
2000.
[BCF01] R. Busse, M. Carey, D. Florescu, et al. Benchmark DTD for
XMark, an XML Benchmark project. http://monetdb.cwi.nl/xml/
downloads/downloads.html, April, 2001.
[CD99] J. Clark, S. DeRose. XML Path Language (XPath) Version
1.0. http://www.w3.org/TR/xpath, Nov. 1999.
[CDN02] J. Chen, D. J. DeWitt, J. F. Naughton. Design and Evaluation
of Alternative Selection Placement Strategies in Optimizing
Continuous Queries. ICDE 2002, to appear.
[CDT00] J. Chen, D. J. DeWitt, F. Tian, and Y. Wang. NiagaraCQ: A
Scalable Continuous Query System for Internet Databases.
In SIGMOD 2000.
[CFG02] C. Chan, P. Felber, M. Garofalakis, R. Rastogi. Efficient
Filtering of XML Documents with XPath Expressions.
ICDE 2002, to appear.
[Cov99] R. Cover. The SGML/XML Web Page. http://www.w3.org/TR/xslt, Nov. 1999.
[DBL01] DBLP DTD. http://www.acm.org/sigmod/dblp/db/about/dblp.dtd, Sep. 2001.
[FJL01] F. Fabret, H. A. Jacobsen, F. Llirbat, et al. Filtering
Algorithms and Implementation for Very Fast Publish/Subscribe
Systems. In SIGMOD 2001.
[HCH99] E. N. Hanson, C. Carnes, L. Huang, et al. Scalable Trigger
Processing. In ICDE 1999.
[HU79] J. E. Hopcroft, J. D. Ullman. Introduction to Automata Theory,
Languages and Computation. Addison-Wesley Pub. Co., 1979.
[DL99] A. L. Diaz, D. Lovell. XML Generator. http://www.alphaworks.ibm.com/tech/xmlgenerator, Sep. 1999.
[ILW00] Z. Ives, A. Levy, D. Weld. Efficient Evaluation of Regular
Path Expressions on Streaming XML Data. Technical Report,
University of Washington, 2000.
[JXP01] Java XML Pack, winter 01 update release. http://java.sun.com/xml/downloads/javaxmlpack.html, 2001.
[Kay01] Michael Kay. Saxon: the XSLT Processor. http://users.iclway.co.uk/mhkay/saxon/, Jul. 2001.
[LPT99] L. Liu, C. Pu, W. Tang. Continual Queries for Internet Scale
Event-Driven Information Delivery. Special Issue on Web
Technologies, IEEE TKDE, Jan. 1999.
[MSH02] S. Madden, M. Shah, J. Hellerstein, V. Raman. Continuously
Adaptive Continuous Queries over Streams. SIGMOD 2002, to appear.
[NAC01] B. Nguyen, S. Abiteboul, G. Cobena, M. Preda. Monitoring
XML Data on the Web. In SIGMOD 2001.
[OKA01] B. T. Ozen, O. Kilic, M. Altinel, A. Dogac. Highly
Personalized Information Delivery to Mobile Clients. In Proc. of 2nd
ACM International Workshop on Data Engineering for Wireless and
Mobile Access (MobiDE’01), May 2001.
[OS02] M. Onizuka, D. Suciu. Processing XML Streams with Deterministic Automata and Stream Indexes. http://www.cs.washington.edu/homes/suciu/files/_F267176094.pdf, 2002.
[PFL01] J. Pereira, F. Fabret, F. Llirbat, H. A. Jacobsen. WebFilter: A
High-throughput XML-based Publish and Subscribe System. In VLDB 2001.
[SAX01] SAX: Simple API for XML. http://www.saxproject.org, 2001.
[Wat97] B. W. Watson. Practical Optimization for Automata. In
Proceedings of the 2nd International Workshop on Implementing
Automata, Sep. 1997.
[Wut00] Wutka DTD Parser. http://www.wutka.com/dtdparser.html,
Jun., 2000.