Query Containment for Conjunctive Queries with Regular Expressions
Daniela Florescu, Alon Levy, Dan Suciu. (PODS 1998)
Summary by Gala Yadgar
The paper studies the problem of query containment for StruQL0, a query
language over semi-structured data that supports regular path expressions
over the data. It shows that containment of StruQL0 queries is decidable,
and that there is a fragment of StruQL0 for which containment is
NP-complete.
Semi-structured data is irregular: its schema is not known in advance. Attributes
may be missing, the type and cardinality of an attribute may not be known, and
the set of attributes may not be known in advance.
StruQL is a query language for semi-structured data. It models databases as
graphs, and a result of a query is itself a graph. It is a known result that queries in
StruQL can be translated into datalog queries.
Query Containment is the problem of finding out whether the results of one query
are contained in the results of another query, for all databases. It is useful for
finding redundant sub-goals in a query, testing whether two formulations of a
query are equivalent, determining independence of database updates and
rewriting queries using views.
Some previous results for query containment include:
Query containment for first-order conjunctive queries is decidable (and NP-complete).
Containment of datalog programs is undecidable.
All positive results for containment so far are restricted to the case where one
of the programs is non-recursive.
The Data Model
In the data model databases are represented by labeled directed graphs. Nodes
correspond to objects and labels on the edges correspond to attributes. Formally,
a database consists of a universe of constants D, and a universe of object
identifiers I (I ∩ D = ∅). A database DB is a pair (V, E), where V ⊆ I and E ⊆ V × D × V.
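The definition above can be made concrete with a small sketch. The identifiers, labels, and the helper name is_database below are illustrative assumptions, not taken from the paper; the check simply restates V ⊆ I, E ⊆ V × D × V, and the disjointness of I and D.

```python
# Hypothetical toy universes; all names here are illustrative only.
I = {"o1", "o2", "o3"}      # object identifiers
D = {"a", "b", "title"}     # constants, used as edge labels

# A database is a pair (V, E): nodes and labeled directed edges.
V = {"o1", "o2", "o3"}
E = {("o1", "a", "o2"), ("o2", "b", "o3"), ("o2", "b", "o2")}


def is_database(V, E, I, D):
    """Return True when I and D are disjoint, V ⊆ I, and E ⊆ V × D × V."""
    return (
        I.isdisjoint(D)
        and V <= I
        and all(s in V and label in D and t in V for (s, label, t) in E)
    )


print(is_database(V, E, I, D))
```

Representing E as a set of (source, label, target) triples keeps the sketch close to the formal definition; an adjacency-list encoding would work equally well.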
A StruQL0 query is made of the following components:
Q: q(X1…Xn) :– Y1R1Z1,…, YnRnZn
nvar(Q) ≡ {Y1,…,Yn,Z1,…,Zn} - node variables - regular variables which
range over the nodes in the graph, and are denoted by capital letters
Regular path expressions {R1,…,Rn}, which are defined by the grammar:
R := ε | a | _ | L | (R1.R2) | (R1|R2) | R*,
where ε is the empty string, a is a label constant, _ denotes any label and L is
a label variable.
avar(Q) ≡ the set of arc variables, which range over the labels of edges in the
graph, occurring in R1,…,Rn
We denote nvar(Q) ∪ avar(Q) by var(Q). The YiRiZi, i = 1,…,n, are the query’s
conjuncts, and X1…Xn are the head variables.
For example: Q1: q1(X,Z) :– X L+ Z, Y a Z, X (a+|(a.b*)) Z, where R+ abbreviates R.R*.
The semantics of a StruQL0 query is defined by substitutions. A substitution is
a function φ : var(Q) → I ∪ D, where node variables are mapped to I and arc
variables are mapped to D. It is denoted φ : Q → DB.
φ(YiRiZi) is the path in DB corresponding to the conjunct (YiRiZi). Each
substitution defines a tuple in the relation RQ, and the answer to Q is the
projection of RQ on the variables in X1…Xn. The result of applying Q to a
database is denoted by Q(DB).
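As an illustration of these semantics, the following is a minimal brute-force sketch, not the paper's algorithm. It assumes a hypothetical tuple encoding of path expressions (("eps",), ("lab", a), ("any",), ("var", L), ("cat", R1, R2), ("alt", R1, R2), ("star", R)), assumes that head variables are node variables, and enumerates every substitution, so it is only practical on tiny databases. The function names pairs and answer are made up for this example.

```python
from itertools import product


def pairs(r, V, E, env):
    """Node pairs (u, v) joined by a path whose labels match expression r.

    env maps each arc variable to the label it is currently bound to.
    """
    kind = r[0]
    if kind == "eps":
        return {(v, v) for v in V}
    if kind == "lab":
        return {(s, t) for (s, label, t) in E if label == r[1]}
    if kind == "any":
        return {(s, t) for (s, _, t) in E}
    if kind == "var":
        return {(s, t) for (s, label, t) in E if label == env[r[1]]}
    if kind == "cat":
        left, right = pairs(r[1], V, E, env), pairs(r[2], V, E, env)
        return {(u, w) for (u, v) in left for (v2, w) in right if v == v2}
    if kind == "alt":
        return pairs(r[1], V, E, env) | pairs(r[2], V, E, env)
    if kind == "star":
        # Reflexive-transitive closure, computed as a fixpoint.
        step = pairs(r[1], V, E, env)
        closure = {(v, v) for v in V}
        while True:
            new = closure | {
                (u, w) for (u, v) in closure for (v2, w) in step if v == v2
            }
            if new == closure:
                return closure
            closure = new
    raise ValueError(f"unknown expression kind: {kind}")


def answer(head, conjuncts, arc_vars, V, E, D):
    """Q(DB): project every satisfying substitution onto the head variables."""
    node_vars = sorted({x for (y, _, z) in conjuncts for x in (y, z)})
    result = set()
    for arc_vals in product(sorted(D), repeat=len(arc_vars)):
        env = dict(zip(arc_vars, arc_vals))
        rels = [pairs(r, V, E, env) for (_, r, _) in conjuncts]
        for node_vals in product(sorted(V), repeat=len(node_vars)):
            phi = dict(zip(node_vars, node_vals))
            if all((phi[y], phi[z]) in rel
                   for (y, _, z), rel in zip(conjuncts, rels)):
                result.add(tuple(phi[x] for x in head))
    return result


# Toy database and the query q(X,Z) :- X (a.b*) Z.
V = {"o1", "o2", "o3"}
E = {("o1", "a", "o2"), ("o2", "b", "o3"), ("o2", "b", "o2")}
D = {"a", "b"}
a_then_b_star = ("cat", ("lab", "a"), ("star", ("lab", "b")))
result = answer(("X", "Z"), [("X", a_then_b_star, "Z")], [], V, E, D)
print(sorted(result))
```

Evaluating each path expression to a binary relation on nodes, with concatenation as relational composition and R* as reflexive-transitive closure, avoids enumerating paths explicitly, which matters because a cyclic graph has infinitely many paths.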
Containment can be defined based on the above definitions:
A query Q1 is contained in a query Q2, written Q1 ⊆ Q2,
if for all databases DB, Q1(DB) ⊆ Q2(DB).
The queries Q1 and Q2 are equivalent, written Q1 ≡ Q2, if Q1 ⊆ Q2 and Q2 ⊆ Q1.
Semantic criteria for query containment
Containment can be decided by checking a finite number of canonical databases
for a query. A canonical database for Q is a pair (DB,ξ), where ξ is a substitution.
The graph for the canonical database contains a bifurcation node for each node
variable, and a corresponding internal path for each conjunct.
For a database to be canonical, the following conditions must hold:
Each internal node belongs to one internal path, with one outgoing and one
incoming edge.
The mapping of node variables to bifurcation nodes is surjective (onto)
Each arc variable L is mapped to itself
For each conjunct YiRiZi, the path ξ(YiRiZi) is internal and the mapping
is one to one
For a query Q with head variables X1,…,Xn, and canonical database (DB, ξ),
(ξ(X1),…ξ(Xn)) is the canonical tuple. It is proven in the paper that given two
queries, Q, Q’:
Q ⊆ Q’ if and only if, for every canonical database (DB, ξ) for Q, its canonical tuple is in
the answer of Q’.
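To make the shape of a canonical database concrete, here is a small sketch under simplifying assumptions that are not in the paper: each node variable gets its own bifurcation node (the definition only requires the mapping to be onto, so variables may also be merged), and each conjunct is given a non-empty word chosen from the language of its path expression. The function name canonical_db and the node naming scheme are hypothetical.

```python
def canonical_db(conjuncts, words):
    """Build one canonical database for a query.

    conjuncts: list of (Y, Z) node-variable pairs, one per conjunct.
    words:     words[i] is a non-empty list of labels chosen from the
               language of the i-th path expression (arc variables stand
               for themselves).
    Returns (nodes, edges, xi), where xi maps node variables to nodes.
    """
    xi = {}
    for y, z in conjuncts:
        xi.setdefault(y, "n_" + y)   # bifurcation node for Y
        xi.setdefault(z, "n_" + z)   # bifurcation node for Z

    edges = set()
    for i, ((y, z), word) in enumerate(zip(conjuncts, words)):
        if not word:
            raise ValueError("this sketch does not handle empty words")
        # Fresh internal nodes, used by this conjunct's path only.
        internal = [f"p{i}_{k}" for k in range(1, len(word))]
        path = [xi[y]] + internal + [xi[z]]
        for k, label in enumerate(word):
            edges.add((path[k], label, path[k + 1]))

    nodes = set(xi.values()) | {n for (s, _, t) in edges for n in (s, t)}
    return nodes, edges, xi


# Query q(X,Z) :- X (a.b*) Z with the word "a b b" chosen for its conjunct.
conjuncts = [("X", "Z")]
words = [["a", "b", "b"]]
nodes, edges, xi = canonical_db(conjuncts, words)
canonical_tuple = tuple(xi[x] for x in ("X", "Z"))
print(sorted(edges), canonical_tuple)
```

Checking containment would then amount to evaluating Q’ on each such database and testing whether the canonical tuple appears in the answer; that step is omitted here.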
Decidability of containment
There are infinitely many canonical databases for each query: the internal
paths can be of any length, and the number of substitutions can be infinite.
However, it is sufficient to examine only databases whose internal paths are no
longer than N, where N = |nvar(Q)| × |states(Ai)| + 2 and Ai is the automaton for path
expression Ri. Moreover, a set of n × N constants is sufficient, with N as
above and n the number of conjuncts in Q (only the constants in DQ,Q’ ∪
avar(Q)).
The resulting algorithm for containment requires triple exponential space.
Containment by Query mapping
Containment of two queries can also be decided by syntactic means, based on
the structure of the queries rather than their semantics (represented earlier by
canonical databases).
A query mapping f : Q’ → DB sends each conjunct in Q’ to some path in the canonical
database of Q. There exist only finitely many mappings between two queries,
and they can be encoded in polynomial space.
A query mapping f : Q’ → Q can ‘cover’ a canonical database DB for Q.
The resulting syntactic criterion for query containment is:
Q ⊆ Q’ ⟺ all query mappings together cover all canonical databases.
All canonical databases for a query Q can be described by a regular language WQ.
For each mapping f, there is a regular expression Wf for all databases covered
by it. Containment can be decided by:
Q ⊆ Q’ iff WQ ⊆ ∪f Wf, where the union ranges over all query mappings f.
This computation requires exponential space.
Simple StruQL0 queries
Simple StruQL0 queries are queries where all path expressions are simple
regular expressions, of the form r1.r2...rn, where each ri is either * or a label
constant.
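The following sketch illustrates what a simple regular expression accepts. It assumes, as an interpretation not spelled out above, that a * component matches any sequence of labels (including the empty one), and it encodes an expression as a Python list whose items are either "*" or a label constant. The function name matches is hypothetical.

```python
def matches(simple, word):
    """Decide whether the label sequence `word` matches r1.r2...rn.

    simple: list whose items are "*" (assumed to match any sequence of
            labels) or a label constant; a constant literally named "*"
            cannot be expressed in this encoding.
    word:   list of edge labels along a path.
    """
    # reach[j] is True when the components seen so far can match word[:j].
    reach = [j == 0 for j in range(len(word) + 1)]
    for r in simple:
        if r == "*":
            # "*" may absorb any number of further labels.
            seen = False
            for j in range(len(reach)):
                seen = seen or reach[j]
                reach[j] = seen
        else:
            # A constant consumes exactly one label equal to r.
            reach = [False] + [
                reach[j] and word[j] == r for j in range(len(word))
            ]
    return reach[-1]


print(matches(["a", "*", "b"], ["a", "c", "d", "b"]))
```

The dynamic program runs in time proportional to the product of the two lengths, in contrast with general regular expressions, where an automaton would normally be built first.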
It is known that containment of two regular expressions can be checked
in polynomial space. The result presented in the paper is that the containment
problem for two simple queries is NP-complete. It can be proved by reduction to
conjunctive queries. This is the first fragment of a query language with
recursion for which deciding containment is no harder than for conjunctive
queries.