A On the Complexity of Query Result Diversification

A
On the Complexity of Query Result Diversification
TING DENG, RCBD and SKLSDE, Beihang University
WENFEI FAN, Informatics, University of Edinburgh, and RCBD and SKLSDE, Beihang University
Query result diversification is a bi-criteria optimization problem for ranking query results. Given a database
D, a query Q and a positive integer k, it is to find a set of k tuples from Q(D) such that the tuples are as
relevant as possible to the query, and at the same time, as diverse as possible to each other. Subsets of Q(D)
are ranked by an objective function defined in terms of relevance and diversity. Query result diversification
has found a variety of applications in databases, information retrieval and operations research.
This paper investigates the complexity of result diversification for relational queries. (1) We identify three
problems in connection with query result diversification, to determine whether there exists a set of k tuples
that is ranked above a bound with respect to relevance and diversity, to assess the rank of a given k-element
set, and to count how many k-element sets are ranked above a given bound based on an objective function.
(2) We study these problems for a variety of query languages and for the three objective functions proposed
in [Gollapudi and Sharma 2009]. We establish the upper and lower bounds of these problems, all matching,
for both combined complexity and data complexity. (3) We also investigate several special settings of these
problems, identifying tractable cases. Moreover, (4) we re-investigate these problems in the presence of
compatibility constraints commonly found in practice, and provide their complexity in all these settings.
Categories and Subject Descriptors: H.2.3 [DATABASE MANAGEMENT]: Languages
General Terms: Design, Algorithms, Theory
Additional Key Words and Phrases: Result diversification, relevance, diversity, recommender systems,
database queries, combined complexity, data complexity, counting problems
1. INTRODUCTION
Result diversification for relational queries is a bi-criteria optimization problem. Given
a query Q, a database D and a positive integer k, it is to find a set U of k tuples in the
query result Q(D) such that the tuples in U are as relevant as possible to query Q,
and at the same time, as diverse as possible to each other. More specifically, we want
to find a set U ⊆ Q(D) such that |U | = k, and the value F (U ) of U is maximum. Here
F (·) is called an objective function. It is defined on sets of tuples from Q(D), in terms of
a relevance function δrel (·, ·) and a distance function δdis (·, ·), where
— for each tuple t ∈ Q(D), δrel (t, Q) is a number indicating the relevance of answer t to
query Q, such that the higher δrel (t, Q) is, the more relevant t is to Q; and
— for all tuples t1 , t2 ∈ Q(D), δdis (t1 , t2 ) is the distance between t1 and t2 , such that the
larger δdis (t1 , t2 ) is, the more diverse the answers t1 and t2 are.
Deng and Fan are supported in part by 973 Program 2014CB340302. Fan is also supported in part
by NSFC 61133002, 973 Program 2012CB316200, Guangdong Innovative Research Team Program
2011D005 and Shenzhen Peacock Program 1105100030834361, and EPSRC EP/J015377/1.
Author’s addresses: T. Deng; W. Fan, School of Informatics, Laboratory for Foundations of Computer Science,
Informatics Forum, 10 Crichton Street, Edinburgh EH8 9AB, Scotland, UK. Email: [email protected]
Permission to make digital or hard copies of part or all of this work for personal or classroom use is granted
without fee provided that copies are not made or distributed for profit or commercial advantage and that
copies show this notice on the first page or initial screen of a display along with the full citation. Copyrights
for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, to republish, to post on servers, to redistribute to lists, or to use any component
of this work in other works requires prior specific permission and/or a fee. Permissions may be requested
from Publications Dept., ACM, Inc., 2 Penn Plaza, Suite 701, New York, NY 10121-0701 USA, fax +1 (212)
869-0481, or [email protected].
c YYYY ACM 1539-9087/YYYY/01-ARTA $15.00
DOI 10.1145/0000000.0000000 http://doi.acm.org/10.1145/0000000.0000000
ACM Transactions on Database Systems, Vol. V, No. N, Article A, Publication date: January YYYY.
A:2
Ting Deng and Wenfei Fan
In particular, three generic objective functions have been proposed and studied in [Gollapudi and Sharma 2009] based on an axiom system, namely, max-sum diversification,
max-min diversification and mono-objective formulation. Each of these functions is defined in terms of generic functions δrel (·, ·) and δdis (·, ·), with a parameter λ ∈ [0, 1]
specifying the tradeoff between relevance and diversity.
Query result diversification aims to improve user satisfaction by remedying the overspecification problem of retrieving too homogeneous answers. The diversity of query
answers is measured in terms of (1) contents, to include items that are dissimilar to
each other, (2) novelty, to retrieve items that contain new information not found in
previous results, and (3) coverage, to cover items in different categories [Drosou and
Pitoura 2010]. It has proven effective in Web search [Gollapudi and Sharma 2009;
Vieira et al. 2011], recommender systems [Yu et al. 2009b; Zhang and Hurley 2008;
Ziegler et al. 2005], databases [Demidova et al. 2010; Liu et al. 2009; Vee et al. 2008],
sponsored-search advertising [Feuerstein et al. 2007], and in operations research and
finance (see [Drosou and Pitoura 2010; Minack et al. 2009] for surveys).
This paper investigates the complexity of result diversification analysis for relational queries. While there has been a host of work on result diversification, the previous work has mostly focused on diversity and relevance metrics, and on algorithms for
computing diverse results [Drosou and Pitoura 2010; Minack et al. 2009]. Few complexity results have been developed for query result diversification, and the known results
are mostly lower bounds (NP-hardness) [Agrawal et al. 2009; Gollapudi and Sharma
2009; Liu et al. 2009; Stefanidis et al. 2010; Vieira et al. 2011]. Furthermore, these
results are established by assuming that query result Q(D) is already known. In other
words, the prior work conducts diversification in two steps: first compute Q(D), and
then rank k-element subsets of Q(D) and find a set with the maximum F (·) value. The
known complexity results are for the second step only, based on a specific objective function F (·). However, it is typically expensive to compute Q(D). To avoid the overhead,
we want to combine the two steps by embedding diversification in query evaluation,
and stop as soon as top-ranked results are found based on F (·) (i.e., early termination),
rather than to retrieve entire Q(D) in advance [Demidova et al. 2010]. Nonetheless,
the complexity of such a query result diversification process has not been studied.
This highlights the need for establishing the complexity of query result diversification, both upper bounds and lower bounds, when Q(D) is not provided, and for different query languages and various objective functions. Indeed, to develop practical
algorithms for computing diverse query results, we have to understand the impact of
query languages and objective functions on the complexity of result diversification.
Example 1.1. Consider a recommender system to help people find gifts for various
events or occasions, e.g., FindGift1 . Its underlying database D0 consists of two relations
specified by the following relation schemas:
catalog(item, type, price, inStock),
history(item, buyer, recipient, gender, age, rel, event, rating).
Here each catalog tuple specifies an item for present, its type (e.g., jewelry, book), price,
and the number of the item in stock. Purchase history is recorded by relation history: a
history tuple indicates that a buyer bought an item for a recipient specified by gender, age
and relationship with the buyer, for an event (e.g., birthday, wedding, holiday), as well
as rating given by the buyer in the range of [1,5].
1 http://www.findgift.com.
ACM Transactions on Database Systems, Vol. V, No. N, Article A, Publication date: January YYYY.
On the Complexity of Query Result Diversification
A:3
Peter wants to use the engine to find a Christmas gift for his 14 year-old niece Grace,
in the price range of [$20, $30]. His request can be converted to a query Q0 defined on
database D0 . The relevance δrel (t, Q0 ) of a tuple t returned by Q0 (D0 ) can be assessed
by using the information from relation history, by taking into account previous presents
purchased for girls of 12–16 year old by the girls’ relatives for holidays, as well as the
rating by those buyers. The distance (diversity) δdis (t1 , t2 ) between two items t1 and
t2 returned by Q0 (D0 ) can be estimated by considering the differences between their
types. Peter wants the system to recommend a set of 10 items from Q0 (D0 ) such that on
one hand, those items are as fit as possible as a Christmas present for a teenage girl,
and on the other hand, are as dissimilar as possible to cover a wide range of choices.
The computational complexity of processing such requests depends on both the
queries expressing users’ requests and the objective function used by the system.
(1) Query languages. Query Q0 can be expressed as a conjunctive query (CQ). Nonetheless, if Peter wants a new gift that is different from previous gifts he gave to Grace,
we need first-order logic (FO) to express Q0 , by using negation on relation history. In
practice one cannot expect that Q0 (D0 ) is already computed when Peter submits his
request. As remarked earlier, it is too costly to compute Q0 (D0 ) first and then pick a
top set of k items from Q0 (D0 ). Instead, we want to embed result diversification in the
evaluation of Q0 , and ideally, find a satisfactory set of k items without retrieving the
entire set Q0 (D0 ) and paying its cost. A question concerns what difference CQ and FO
make on the complexity of processing such requests when Q0 (D0 ) is not necessarily
available. One naturally wants to know whether the complexity is introduced by the
query languages or is inherent to result diversification.
(2) Objective functions. Consider the objective function by max-sum diversification proposed in [Gollapudi and Sharma 2009] and revised in [Vieira et al. 2011]:
X
X
FMS (U ) = (k − 1)(1 − λ) ·
δrel (t, Q) + λ ·
δdis (t, t′ ),
t∈U
t,t′ ∈U
where U is a set of tuples in Q(D). To assess the diversity, FMS (U ) only requires to
compute δdis (t, t′ ) for t and t′ in a given k-element set U ⊆ Q(D). Similarly, the objective
function by max-min diversification is defined as [Gollapudi and Sharma 2009]:
FMM (U ) = (1 − λ) · min δrel (t) + λ ·
t∈U
min
t,t′ ∈U,t6=t′
δdis (t, t′ ).
In contrast, consider the mono-objective formulation of [Gollapudi and Sharma 2009]:
X
X
λ
Fmono (U ) =
(1 − λ) · δrel (t, Q) +
δdis (t, t′ ) .
·
|Q(D)| − 1 ′
t∈U
t ∈Q(D)
It asks for δdis (t, t′ ) for each t ∈ U and for all t′ ∈ Q(D), i.e., the average dissimilarity
w.r.t. all other results in Q(D) [Minack et al. 2009]. The question is what different
impacts FMS (·), FMM (·) and Fmono (·) have on the complexity of diversification.
2
To the best of our knowledge, no prior work has answered these questions. These
issues require a full treatment for different query languages and objective functions,
to find out where the complexity of query result diversification arises.
Contributions. We study several fundamental problems in connection with result diversification for relational queries, and establish their upper bounds and lower bounds,
all matching, for a variety of query languages and objective functions.
Diversification problems. We identify three problems for query result diversification.
Given a query Q, a database D, an objective function F (·), and a positive integer k,
(1) the query result diversification problem (QRD) is a decision problem to determine
ACM Transactions on Database Systems, Vol. V, No. N, Article A, Publication date: January YYYY.
A:4
Ting Deng and Wenfei Fan
whether there exists a k-element set U ⊆ Q(D) such that F (U ) ≥ B for a given bound
B, i.e., whether there exists a set U that satisfies the users’ need at all;
(2) the diversity ranking problem (DRP) is to decide whether a given k-element set
U ⊆ Q(D) is among top-r ranked sets, such that there exist no more than r − 1 sets S ⊆
Q(D) of k elements with F (S) > F (U ); as advocated in [Jin and Patel 2011], a decision
procedure for DRP can help us assess how well a given k-element set U satisfies the
users’ request, and help vendors evaluate their products w.r.t. users’ need; and
(3) the result diversity counting problem (RDC) is to count the number of k-element sets
U ⊆ Q(D) such that F (U ) ≥ B for a given bound B. It is a counting problem that helps
us find out how many k-element sets can be extracted from Q(D) and be suggested to
the users, and provide a guidance for recommender systems to adjust their stock.
Complexity results. For all these problems we establish their combined complexity and
data complexity (i.e., when both data D and query Q may vary, and when Q is fixed
while D may vary, respectively; see [Abiteboul et al. 1995]). We parameterize these
problems with various query languages, including conjunctive queries (CQ), unions
of conjunctive queries (UCQ), positive existential FO queries (∃FO+ ) and first-order
logic queries (FO) [Abiteboul et al. 1995], all with built-in predicates =, 6=, <, ≤, >, ≥.
These languages have been used in query result diversification tools, e.g., CQ [Chen
and Li 2007], ∃FO+ [Vee et al. 2008] and FO [Demidova et al. 2010]. For each of these
query languages, we study these problems with each of the objective functions proposed by [Gollapudi and Sharma 2009], i.e., objective functions defined in terms of
max-sum diversification, max-min diversification, and mono-objective formulation.
We provide a comprehensive account of upper and lower bounds for these problems,
all matching when the problems are intractable; that is, we show that such a problem
is C-hard and is in C for a complexity class C that is NP or beyond in the polynomial
hierarchy. We also study special cases of these problems, such as when either only
diversity or only relevance is considered, when Q is an identity query, and when k is a
predefined constant. We identify practical tractable cases. It should be remarked that
all the previous complexity results (NP-hardness) are established for a special case of
QRD studied in this work only, namely, when Q is an identity query.
Compatibility constraints. We also re-investigate these problems in the presence of
compatibility constraints [Koutrika et al. 2009; Lappas et al. 2009; Parameswaran
et al. 2010; Parameswaran et al. 2011; Xie et al. 2012]. Such constraints are defined
on a set U of top-k items, to specify what items have to be taken together and what
items have conflict with each other, among other things. The need for such constraints
is evident in practice. For instance, when Peter buys a Christmas gift for Grace, he also
wants to buy a Christmas card together; when one selects a course A for an undergraduate package, she has to include all the prerequisites of A in the package [Koutrika
et al. 2009; Parameswaran et al. 2010]; and when one forms a basketball team, he
would like to get recommendation for a center, two forwards and two point guards [Lappas et al. 2009]. No matter how important, however, few previous work has studied the
impact of compatibility constraints on the analyses of query result diversification.
We propose a class Cm of compatibility constraints for query result diversification,
which suffices to express compatibility requirements we commonly encounter in practice. The constraints of Cm are of a restricted form of tuple generating dependencies
(see, e.g., [Abiteboul et al. 1995]), and can be validated in PTIME, i.e., given a set Σ of
constraints in Cm and a dataset U , it is in PTIME to decide whether U satisfies Σ. We
investigate the impact of such constraints on the combined complexity and data complexity of QRD, DRP and RDC, for query languages ranging over CQ, UCQ, ∃FO+ and FO,
ACM Transactions on Database Systems, Vol. V, No. N, Article A, Publication date: January YYYY.
On the Complexity of Query Result Diversification
A:5
and when the objective function is FMS , FMM or Fmono . We also study these problems in
all the special settings mentioned above, when a set of compatibility constraints of Cm
is additionally imposed on the selected sets of query results.
Impact. These results tell us where the complexity arises (see Tables I, II and III in
Section 10 for a detailed summary of the complexity bounds).
(1) Query languages LQ . Query languages may dominate the combined complexity of
result diversification. For objective functions defined in terms of max-sum or maxmin diversification, QRD, DRP and RDC are NP-complete, coNP-complete and #·NPcomplete, respectively, when LQ is CQ. In contrast, when it comes to FO, these problems become PSPACE-complete, PSPACE-complete and #·PSPACE-complete, respectively. This said, the presence of disjunction in LQ does not complicate the diversification analyses. Indeed, these problems remain NP-complete, coNP-complete and #·NPcomplete, respectively, when LQ is either UCQ or ∃FO+ .
In contrast, different query languages have no impact on the data complexity of
these problems, as expected. Indeed, for max-sum or max-min diversification, QRD,
DRP and RDC are NP-complete, coNP-complete and #·NP-complete, respectively, and
for mono-objective formulation, they are in PTIME (polynomial time), PTIME and #·Pcomplete, respectively, no matter whether LQ is CQ or FO. Intuitively, a naive algorithm for QRD works in two steps: first compute Q(D), and then finds whether there
exists a k-element set U from Q(D) such that F (U ) ≥ B; similarly for DRP and RDC.
When Q is fixed as in the setting of data complexity analysis, Q(D) is in PTIME regardless of what query language LQ we use to express Q. The data complexity of the
problems arises from the second step, i.e., the diversification computation.
(2) Objective functions F (·). When F (·) is Fmono , however, the objective function dominates the complexity: QRD, DRP and RDC are PSPACE-complete, PSPACE-complete and
#·PSPACE-complete, respectively, no matter whether LQ is CQ or FO. Contrast these
with their counterparts given above for FMS and FMM . The impact of F (·) is even more
evident on the data complexity. As remarked earlier, for FMS and FMM , these problems
are NP-complete, coNP-complete and #·NP-complete, respectively, for data complexity,
whereas they are in PTIME, PTIME and #·P-complete, respectively, for Fmono .
(3) Diversity vs. relevance. The complexity is mostly introduced by the diversity requirement. This is consistent with the observation of [Vieira et al. 2011], which studied a special case of QRD when F (·) is FMS . Indeed, when the relevance function δrel (·, ·)
is absent, the combined and data complexity bounds remain unchanged for all these
problems, for any of the three objective functions. In contrast, when the distance function δdis (·, ·) is dropped, QRD and DRP become tractable when data complexity is considered. Moreover, for Fmono , when δdis (·, ·) is absent, the combined complexity of QRD,
DRP and RDC becomes NP-complete, coNP-complete and #·NP-complete, down from
PSPACE-complete, PSPACE-complete and #·PSPACE-complete, respectively. In particular, for FMS (resp. FMM ), one can draw an analogy between δrel (·, ·) and sorting with a
target weight, and between δdis (·, ·) and partitioning with dispersed objects, which are
requirements of the (resp. Maximum) Dispersion Problem [Prokopyev et al. 2009].
(4) Compatibility constraints. Although the constraints of Cm are simple enough to be
validated in PTIME, their presence complicates the analyses of QRD, DRP and RDC, to
an extent. Indeed, all tractable cases of these problems in the absence of compatibility
constraints become intractable when constraints of Cm are present. These include
(a) the data complexity analyses of QRD, DRP and RDC when F (·) is Fmono , or when
F (·) is FMS or FMM defined in terms of the relevance function δrel (·, ·) only; and (b) the
combined complexity of these problems for identity queries when F (·) is Fmono . The
ACM Transactions on Database Systems, Vol. V, No. N, Article A, Publication date: January YYYY.
A:6
Ting Deng and Wenfei Fan
only exception is the case when the bound k on the number of selected tuples is a
constant; in this case, the data complexity analyses of these problems are tractable no
matter whether the constraints of Cm are present or not.
These results reveal the impacts of various factors on the complexity of query result
diversification. In particular, the results tell us that the complexity of these problems
for CQ, UCQ and ∃FO+ may be inherent to result diversification itself, rather than a
consequence of the complexity of the query languages. From the results we can see
that these problems are intricate and mostly intractable. This highlights the need for
developing efficient heuristic (approximation whenever possible) algorithms for them.
Organization. We discuss related work in Section 2, and present a general model for
query result diversification in Section 3. Problems QRD, DRP and RDC are formulated
in Section 4, and their combined complexity and data complexity are established in
Sections 5, 6 and 7, respectively. Section 8 studies special cases of these problems, and
Section 9 revisits these problems in the presence of compatibility constraints. Finally,
Section 10 identifies directions for future work. Due to the space constraint we defer
the proofs of the results of Sections 8 and 9 to the electronic appendix.
2. RELATED WORK
This paper is an extension of our earlier work [vld ] by including the following. (1)
Detailed proofs of all the results (Sections 5, 6, 7 and 8), which were not presented
in [vld ]. A variety of techniques are used to prove these results, including counting
arguments, a wide range of reductions and constructive proofs with algorithms. (2)
An extension of the query result diversification model with compatibility constraints,
and the (combined and data) complexity of all these problems in the presence of compatibility constraints (Section 9). These are among the first results for incorporating
compatibility constraints into query results diversification.
This work is also related to prior work on result diversification (for search and
queries), recommender systems and top-k query answering, discussed as follows.
Diversification. Diversification has been studied for Web search [Agrawal et al. 2009;
Borodin et al. 2012; Capannini et al. 2011; Gollapudi and Sharma 2009; Vieira et al.
2011], recommender systems [Yu et al. 2009a; 2009b; Zhang and Hurley 2008; Ziegler
et al. 2005], structured databases [Demidova et al. 2010; Fraternali et al. 2012; Liu
et al. 2009; Vee et al. 2008] and sponsored-search advertising [Feuerstein et al. 2007]
(see [Drosou and Pitoura 2010; Minack et al. 2009] for surveys). The previous work
has mostly focused on metrics for assessing relevance and diversity, and optimization
techniques for computing diverse answers. The prior work often adopts specific objective functions based on the similarity of, e.g., taxonomy [Ziegler et al. 2005], explanations [Yu et al. 2009a], features [Vee et al. 2008] or locations [Fraternali et al. 2012]. A
general model for result diversification was proposed in [Gollapudi and Sharma 2009]
based on an axiom system, along with the three objective functions mentioned earlier.
A minor revision of max-sum diversification of [Gollapudi and Sharma 2009] was presented in [Vieira et al. 2011]. This work extends the model of [Gollapudi and Sharma
2009] by incorporating queries (and compatibility constraints). Like in [Borodin et al.
2012], we focus on the objective functions proposed in [Gollapudi and Sharma 2009].
The complexity of result diversification has been studied in [Agrawal et al. 2009;
Gollapudi and Sharma 2009; Liu et al. 2009; Stefanidis et al. 2010; Vieira et al. 2011],
which differ from this work in the following.
(1) The previous work provided lower bounds (NP-hardness) but stopped short of giving
ACM Transactions on Database Systems, Vol. V, No. N, Article A, Publication date: January YYYY.
On the Complexity of Query Result Diversification
A:7
a matching upper bound. In contrast, we provide a complete picture of matching upper
and lower bounds, for both combined and data complexity.
(2) The prior work assumed that the search space Q(D) is already computed, and is
taken as input. As remarked earlier, this assumption is not very realistic in practice.
In contrast, we treat Q and D as input instead of Q(D), and investigate the impact
of query languages on the complexity of diversification. As will be seen later, the complexity bounds of these problems when Q(D) is not available is quite different from
their counterparts when Q(D) is assumed in place (i.e., when Q is an identity query).
(3) The previous work focused on a special cases of QRD, when Q is an identity query. It
is one of the special cases studied in Section 8 of this paper. Note that the intractability
of QRD for max-sum or max-min diversification given in the prior work [Drosou and
Pitoura 2009; Gollapudi and Sharma 2009; Vieira et al. 2011] may be adapted to establish the data complexity of QRD in these settings. Nonetheless, the detailed proofs are
not given in those papers. Further, for mono-objective formulation, no previous work
has studied the complexity of QRD for identity queries, which will be shown in PTIME
in this work. Moreover, we are not aware of any complexity results for DRP and RDC
published by previous work, although DRP was advocated in [Jin and Patel 2011].
(4) This work also considers several special cases of diversification (Section 8), to identify tractable cases and the impact of diversity and relevance requirements on the complexity of the diversification analyses. Moreover, we also study diversification in the
presence of compatibility constraints, about which we are not aware of any prior work.
Recommender problems. Recommender systems (a.k.a. recommender engines and recommendation platforms) are to recommend information items or social elements that
are likely to be of interest to users (see [Adomavicius and Tuzhilin 2005] for a survey).
There has been a host of work on recommender systems [Deng et al. 2012; Amer-Yahia
2011; Lappas et al. 2009; Koutrika et al. 2009; Parameswaran et al. 2011; Xie et al.
2012], studying item and package recommendation. Given a query Q, a database D
of items and a utility (scoring) function f (·) defined on items, item recommendation
is to find top-k items from Q(D) ranked by f (·), for a given positive integer k. Package recommendation takes as additional input a set Σ of compatibility constraints, two
functions cost(·) and val(·) defined on sets of items, and a bound C. It is to find top-k
packages of items such that each package satisfies Σ, its cost does not exceed C, and
its val is among the k highest. Here a package is a set of items that has a variable size.
There is an intimate connection between recommendation and diversification: both
aim to recommend top-k (sets of) items from the result Q(D) of query Q in D. Moreover,
diversification has been used in recommender systems to rectify the problem of retrieving too homogeneous results. However, there are subtle differences between them.
(1) Item recommendation is a single-criterion optimization problem based on a utility
function f (·) defined on individual items. In contrast, query result diversification is a
bi-criteria optimization problem based on a relevance function δrel (·, ·) and a distance
function δdis (·, ·) defined on sets of items. In particular, the distance function δdis (U )
assesses the diversity of elements in a set U , and is not expressible as a utility function.
(2) Package recommendation is to find top-k sets of items with variable sizes, which
are ranked by val(·), subject to compatibility constraints Σ and aggregate constraints
defined in terms of cost(·) and bound C, where cost(·) and val(·) are generic PTIME
computable functions [Deng et al. 2012]. In contrast, query result diversification is to
find a single set of k items, based on a particular objective function F (·). When F (·) is
max-sum or max-min diversification, diversification can be viewed as a special case of
ACM Transactions on Database Systems, Vol. V, No. N, Article A, Publication date: January YYYY.
A:8
Ting Deng and Wenfei Fan
package recommendation for finding a single set of a fixed size k, based on a particular
F (·), and in the absence of aggregate constraints. As a consequence of the specific restrictions of F (·), the lower bounds developed for package recommendation do not carry
over to its counterpart for diversification, and conversely, the upper bounds for diversification may not be tight for package recommendation. When F (·) is mono-objective,
F (U ) is not even expressible in the model of recommendation, since it assesses the
diversity of elements in a set U with all tuples in Q(D), and is not in PTIME in |U |.
There has been work on the complexity of recommendation analyses [Amer-Yahia
et al. 2013; Deng et al. 2012; Lappas et al. 2009; Koutrika et al. 2009; Parameswaran
et al. 2011; Xie et al. 2012]. In addition to different settings of recommendation and
diversification remarked earlier, this work differs from the prior work in the following.
(3) Problems QRD and DRP studied in this paper have not been considered in the previous work for recommendation. This said, the results of this work on these problems
may be of interest to the study of recommendation.
(4) Problem RDC considered here is similar to a counting problem studied in [Deng
et al. 2012] for recommendation. However, given the different settings remarked earlier, RDC differs from that counting problem from complexity bounds to proofs. Indeed,
the counting problem for recommendation is #·coNP-complete when LQ is CQ, UCQ or
∃FO+ [Deng et al. 2012]. In contrast, as will be seen in Section 7, for the same query
languages, (a) RDC is #·NP-complete when F (·) is FMS or FMM , while #·coNP = #·NP
if and only if P = NP [Durand et al. 2005]; and (b) RDC is #·PSPACE-complete when
F (·) is Fmono , substantially more intriguing than the problem studied in [Deng et al.
2012]. Furthermore, the proofs of this paper have to be tailored to the three objective
functions, as opposed to the proofs of [Deng et al. 2012]. Indeed, the proofs for FMS and
FMM are quite different from their counterparts for Fmono , as indicated by the different
combined complexity bounds in these settings.
Compatibility constraints have been studied for package recommendation [AmerYahia et al. 2013; Koutrika et al. 2009; Lappas et al. 2009; Parameswaran et al. 2010;
Parameswaran et al. 2011; Xie et al. 2012]. As remarked above, the prior results do not
carry over to the diversification analysis in the presence of compatibility constraints.
Top-k query answering. Top-k query answering is to retrieve top-k tuples from query
results, ranked by a scoring function. It typically assumes that the attributes of tuples
are already sorted, and studies how to combine different ratings of the attributes for
the same tuple based on a (monotonic) scoring function. A number of top-k query evaluation algorithms have been developed (e.g., [Fagin et al. 2003; Jin and Patel 2011; Li
et al. 2005; Schnaitter and Polyzotis 2008]; see [Ilyas et al. 2008] for a survey), focusing on how to achieve early termination and reduce random access. This work differs
from the prior work in the following. (a) A scoring function for top-k query answering is
defined on individual items, as opposed to the distance function δdis (·) and the objective
function F (·) defined on sets of items. (b) We focus on the complexity of diversification
problems rather than the efficiency or optimization of query evaluation.
3. DIVERSIFICATION AND OBJECTIVE FUNCTIONS
We first present a model for query result diversification, by extending the model
of [Gollapudi and Sharma 2009]. We then review the three objective functions proposed by [Gollapudi and Sharma 2009], which are used to define diversification.
ACM Transactions on Database Systems, Vol. V, No. N, Article A, Publication date: January YYYY.
On the Complexity of Query Result Diversification
A:9
3.1. Query Result Diversification
As remarked earlier, query result diversification aims to improve user satisfaction
when computing answers to a query Q in a database D. We specify database D with a
relational schema R = (R1 , . . . , Rn ), where each relation schema Ri is defined over a
fixed set of attributes. We consider query Q expressed in a query language LQ .
Diversification. Given Q, D, a positive integer k and an objective function F (·), query
result diversification aims to find a set U ⊆ Q(D) such that (a) |U | = k, and (b) F (U ) is
maximum, i.e., for all other sets U ′ ⊆ Q(D), if |U ′ | = k then F (U ) ≥ F (U ′ ). Here F (·)
is an objective function defined on sets of tuples of RQ , where RQ denotes the schema
of query result Q(D), such that given any set U of tuples of RQ , F (U ) returns a nonnegative real number. In other words, F (·) is defined on subsets U ⊆ Q(D). We write
F (·) as F when it is clear from the context.
Intuitively, query result diversification is to retrieve a set U of k answers to Q in D
such that the tuples in U are as relevant as possible to Q and meanwhile, as diverse as
possible. It extends the notion of result diversification given in [Gollapudi and Sharma
2009] by taking query Q and D as input, rather than assuming that Q(D) is already
computed. The notion of [Gollapudi and Sharma 2009] is a special case of query result
diversification, when Q is an identity query, i.e., when Q(D) = D is given as input.
Query result diversification is a bi-criteria optimization problem characterized by
objective function F which is defined in terms of a relevance function δrel (·, ·) and a
distance function δdis (·, ·); these functions are presented as follows.
Relevance functions and distance functions. A relevance function δrel (·, ·) is defined on tuples of schema RQ and queries in LQ . It specifies the relevance of a tuple t
of RQ to a query Q ∈ LQ . More specifically, δrel (t, Q) is a non-negative real number such
that the larger δrel (t, Q) is, the more relevant the answer t is to query Q.
A distance function δdis (·, ·) is a binary function defined on tuples of schema RQ . It
specifies the diversity between two tuples t, s ∈ Q(D): δdis (t, s) is a non-negative real
number such that the larger δdis (t, s) is, the more diverse (dissimilar) t and s are to
each other. We assume that δdis (·, ·) is symmetric, i.e., δdis (t, s) = δdis (s, t) for all tuples
t, s of RQ . Moreover, δdis (t, t) = 0, i.e., the distance between a tuple and itself is 0.
We simply assume that δrel (·, ·) and δdis (·, ·) are PTIME computable functions, as commonly found in practice, and focus on their generic properties. We also write δrel (·, ·)
and δdis (·, ·) as δrel and δdis , respectively, if it is clear from the context.
Example 3.1. Recall the request of Peter for shopping a gift for Grace described in
Example 1.1. The request can be expressed as a query Q0 in FO as follows:
Q0 (n) = ∃ t, p, s catalog(n, t, p, s) ∧ p ≤ 30 ∧ p ≥ 20 ∧
∀n′ , b, r, g, a, x, e, y ¬(history(n′ , b, r, g, a, x, e, y) ∧
b = idP ∧ r = “Grace” ∧ n = n′ ) ,
where idP denotes Peter’s buyer id. The query selects such gifts in the price range
[$20, $30] that have not been purchased by Peter for Grace earlier.
As remarked in Example 1.1, for each gift t ∈ Q0 (D0 ), the relevance δrel (t, Q0 ) of t to
Q0 can be assessed in terms of the rating of t if t appears in the history relation. For instance, δrel (t, Q0 ) is high if t was presented as a gift for a girl of age [11, 14] by a relative
for a holiday, and was rated high. If t is not in history, δrel (t, Q0 ) takes a default value.
For tuples t, s ∈ Q0 (D0 ), δdis (t, s) can be defined in terms of the difference between
their types, e.g., δdis (t, s) = 2 if t is in the “artsy” category and s is “educational”, and
δdis (t, s) = 1 if t is of type “jewelry” and s is of type “fashion”. The types can be classified
into various categories and brands, and δdis (t, s) is defined accordingly.
2
ACM Transactions on Database Systems, Vol. V, No. N, Article A, Publication date: January YYYY.
A:10
Ting Deng and Wenfei Fan
3.2. Objective Functions
An objective function F is defined by means of relevance function δrel and distance
function δdis . Like in [Borodin et al. 2012], we focus on the objective functions proposed
by [Gollapudi and Sharma 2009] in this work.
Consider δrel and δdis , a parameter λ ∈ [0, 1] to balance relevance and diversity, a
query Q, a database D and a positive integer k. Let U ⊆ Q(D) be a set of tuples with
|U | = k. A minor revision of max-sum diversification of [Gollapudi and Sharma 2009]
was given in [Vieira et al. 2011] by associating (1 − λ) with the relevance component,
which allows us to study two extreme cases: diversity only (i.e., when λ = 1), and relevance only (i.e., when λ = 0). Along the same line as [Vieira et al. 2011], we consider
minor variations of the max-min diversification and mono-objective functions of [Gollapudi and Sharma 2009]. We present the revised objective functions as follows.
Max-sum diversification. The first objective is to maximize the sum of the relevance
and dissimilarity of the selected set U [Gollapudi and Sharma 2009; Vieira et al. 2011]:
X
X
FMS (U ) = (k − 1)(1 − λ) ·
δrel (t, Q) + λ ·
δdis (t, t′ ).
t∈U
t,t′ ∈U
Here FMS (U ) measures both the relevance of the tuples in U to query Q, and the diversity among the k tuples in U . Following [Gollapudi and Sharma 2009], we scale up
the two components δrel and δdis by using k − 1 since the relevance sum ranges over k
numbers while the diversity sum is over k(k − 1) numbers (note that the same effect
may also be achieved by tailoring λ; we adopt k − 1 here to simplify the discussion).
As observed by [Gollapudi and Sharma 2009; Vieira et al. 2011], when the objective function is FMS , result diversification can be modeled as the Dispersion Problem
studied in operations research [Prokopyev et al. 2009], when Q is an identity query.
Max-min diversification. The second objective is to maximize the minimum relevance and dissimilarity of the selected set [Gollapudi and Sharma 2009]:
FMM (U ) = (1 − λ) · min δrel (t, Q) + λ · ′ min ′ δdis (t, t′ ).
t∈U
t,t ∈U,t6=t
Here FMM (U ) is computed in terms of both the minimum relevance of the k tuple in
U to query Q, and the minimum distance between any two tuples in U . In contrast
to FMS (U ), FMM (U ) tends to penalize U that includes a single item irrelevant to Q,
or that contains a pair of homogeneous items, although all the rest are diverse. As
shown in [Gollapudi and Sharma 2009], diversification by FMM can be expressed as
the Maxmin Dispersion Problem [Prokopyev et al. 2009] if Q is an identity query.
Mono-objective formulation. The last one is given as [Gollapudi and Sharma 2009]:
X
X
λ
Fmono (U ) =
(1 − λ) · δrel (t, Q) +
δdis (t, t′ ) .
·
|Q(D)| − 1 ′
t∈U
t ∈Q(D)
As opposed to FMS (U ) and FMM (U ) that compute intro-list diversity, Fmono (U ) measures
the “global” diversity of each tuple t ∈ U by taking the mean of its distance to all
tuples in the entire set Q(D), rather than its distances to the tuples in the selected
set U [Gollapudi and Sharma 2009]. It computes the average dissimilarity of tuples in
U by comparing them uniformly with all other results in Q(D) [Minack et al. 2009],
to assess the novelty and coverage of the items in U . In this formulation, the distance
function behaves similarly to the relevance function; hence it is named “mono”. In
contrast with FMS and FMM , Fmono (U ) does not reduce to facility dispersion.
Example 3.2. Consider the query Q0 , database D0 , and the relevance and distance
functions δrel and δdis described in Example 3.1. Assume that k = 10. Then
(1) with FMS , query result diversification aims to find a set U1 of 10 gifts from the query
ACM Transactions on Database Systems, Vol. V, No. N, Article A, Publication date: January YYYY.
On the Complexity of Query Result Diversification
A:11
result Q0 (D0 ) such that the weighted sum of the relevance values of the selected gifts
in U1 to Q0 and the dissimilarity values among the gifts in U1 is maximum.
(2) The objective function FMM is to find a set U2 of 10 gifts from Q0 (D0 ) such that the
weighted sum of the minimum relevance of the gifts in U2 to Q0 and the minimum
distance between pairs of gifts in U2 is maximum.
(3) The objective function Fmono is to find a set U3 of 10 gifts from Q0 (D0 ) such that
the weighted sum of the relevance values of the gifts in U3 to Q0 and the mean of
the distances between the selected gifts in U3 and all candidate gifts in the entire
set Q0 (D0 ) is maximized. In particular, here the diversity criterion is to assess the
coverage of various gifts in the entire set Q0 (D0 ) by the set U chosen.
2
Remarks. Observe the following.
(1) Objective functions FMS , FMM and Fmono are defined in terms of two criteria: relevance δrel and diversity δdis . The larger the parameter λ is, the more weight we place
on the diversity of the results selected. When λ = 0, FMS , FMM and Fmono measure the
relevance only. On the other hand, when λ = 1, these objective functions are defined in
terms of δdis only and assess the diversity alone.
(2) For a given set U ⊆ Q(D), FMS (U ) and FMM (U ) are PTIME computable as long
as δrel and δdis are PTIME computable. In contrast, when it comes to mono-objective,
Fmono (U ) may not be PTIME computable when Q and D are given as input but Q(D) is
not assumed available, as commonly found in practice. Indeed, for each tuple t ∈ U ,
Fmono (U ) has to compute δdis (t, t′ ) when t′ ranges over all tuples in Q(D).
4. REASONING ABOUT RESULT DIVERSIFICATION
In this section we first identify three problems in connection with query result diversification, for which the complexity will be provided in the next five sections. We then
demonstrate possible applications of the complexity analyses of these problems.
4.1. Decision and Counting Problems
Consider a database D, a query Q in a language LQ , a positive integer k, and an
objective function F defined with relevance and distance functions δrel and δdis .
The query result diversification problem. We start with a decision problem, referred to as the query result diversification problem and denoted by QRD(LQ , F ). To
formulate this problem, we need the following notations.
We call a set U ⊆ Q(D) a candidate set for (Q, D, k) if |U | = k. Given a real number B
as a bound, we refer to a candidate set U as a valid set for (Q, D, k, F, B) if F (U ) ≥ B.
That is, the F value of U is large enough to meet the objective B.
Given these, QRD(LQ , F ) is stated as follows.
QRD(LQ , F ): The query result diversification problem.
INPUT:
A database D, a query Q ∈ LQ , an objective function F , a real number
B and a positive integer k ≥ 1.
QUESTION: Does there exist a valid set for (Q, D, k, F, B)?
Observe that QRD(LQ , F ) is the decision version of the function problem for computing a top-ranked set U based on F , and is fundamental to understanding the complexity of query result diversification. As remarked earlier, we simply consider generic
PTIME functions δrel and δdis when defining F .
The diversity ranking problem. In practice, given a candidate set U picked by users
or produced by a system, we want to assess how well U meets a diversification objective and hence, satisfies the users’ need. This suggests that we study another decision
ACM Transactions on Database Systems, Vol. V, No. N, Article A, Publication date: January YYYY.
A:12
Ting Deng and Wenfei Fan
problem, referred to as the diversity ranking problem and denoted by DRP(LQ , F ), to
assess the rank of a given candidate set based on F .
To state this problem, we use the following notion of ranks. Consider a candidate set
U and a positive integer r. We say that the rank of U is r, denoted by rank(U ) = r, if
there exists a collection S of r − 1 distinct candidate sets for (Q, D, k) such that (a) for
all S ∈ S, F (S) > F (U ); and (b) for any candidate set S ′ for (Q, D, k), if S ′ 6∈ S, then
F (U ) ≥ F (S ′ ). That is, there exist exactly r − 1 candidates sets for (Q, D, k) that are
ranked above U based on F . Obviously, the less rank(U ) is, the higher U is ranked.
Assume a positive integer r that is a constant. We state DRP(LQ , F ) as follows.
DRP(LQ , F ): The diversity ranking problem.
INPUT:
A database D, a query Q ∈ LQ , an objective function F , a positive
integer k ≥ 1 and a candidate set U for (Q, D, k).
QUESTION: Does rank(U ) ≤ r?
This problem was advocated in [Jin and Patel 2011], but its complexity was not settled before. The need for studying this is evident, as will be elaborated in Section 4.2.
In practice, rank r is below a threshold (e.g., top 20) and hence, is typically treated
as a constant. One may also want to treat r as part of input rather than a constant.
We will elaborate its impact on the complexity bounds of DRP in Section 6.
The result diversity counting problem. Given an objective B, one often wants to
know how many valid sets are out there and hence, can be selected and recommended.
This suggests that we study the counting problem below, referred to as the result diversity counting problem and denoted by RDC(LQ , F ).
RDC(LQ , F ): The result diversity counting problem.
INPUT:
A database D, a query Q ∈ LQ , an objective function F , a real number
B and a positive integer k ≥ 1.
QUESTION: How many valid sets are there for (Q, D, k, F, B)?
That is, RDC(LQ , F ) is to count the number of candidate sets in D that satisfy the
users’ request. An effective counting procedure is useful in practice (see Section 4.2).
Parameters of the problems. We study these problems for (a) objective functions
F ranging over max-sum diversification FMS , max-min diversification FMM , and monoobjective formulation Fmono (Section 3), and for (b) query languages LQ ranging over
the following (see, e.g., [Abiteboul et al. 1995] for details of these languages):
(1) conjunctive queries (CQ), built up from atomic formulas with constants and variables, i.e., relation atoms in database schema R and built-in predicates (=, 6=, <, ≤,
>, ≥), by closing under conjunction ∧ and existential quantification ∃;
(2) union of conjunctive queries (UCQ) Q1 ∪ · · · ∪ Qr , where Qi is in CQ for i ∈ [1, r];
(3) positive existential FO queries (∃FO+ ), built from atomic formulas by closing under
∧, disjunction ∨ and ∃; and
(4) first-order logic queries (FO) built from atomic formulas using ∧, ∨, negation ¬, ∃
and universal quantification ∀.
That is, FO is relational algebra, CQ is the class of SPC queries supporting selection,
projection and Cartesian product, UCQ is the class of SPCU queries, and ∃FO+ is the
fragment of relational algebra with selection, projection, Cartesian product and union.
To the best of our knowledge, no prior work has studied the complexity of DRP and
RDC. When it comes to QRD, only a special case was studied, when LQ consists of
ACM Transactions on Database Systems, Vol. V, No. N, Article A, Publication date: January YYYY.
On the Complexity of Query Result Diversification
A:13
identity queries and F is FMS or FMM . No prior work has considered QRD when LQ is
CQ or beyond, or when F is Fmono .
4.2. Applications of Diversification Analyses
Before we establish the complexity of these problems, we first demonstrate how their
analyses can be used in practice. The complexity bounds of these problems are not
only of theoretical interest, but may also help practitioners when developing diversification models and algorithms. As remarked in Section 2, query result diversification is
needed for Web search, recommendation systems and advertising, among other things.
Guidance for what features should be supported in a system. When developing a system that supports query result diversification, one has to decide the following. What
query language should be supported? What diversification function should be adopted?
Would a relevance function alone suffice in the application so that one does not have to
pay the cost introduced by distance functions? Would a fixed set of queries suffice for
users to express their requests? Are compatibility constraints a must? The complexity study of diversification problems in different settings may help the system vendor
strike a balance between the cost of diversification analyses and the expressive power
needed. For instance, if users of the system are to use only a fixed set of FO queries,
adopt a mono-objective function and need no compatibility constraints, then the diversification analyses are in PTIME (see data complexity in Table I, Section 10). In
contrast, if the system allows users to issue arbitrary CQ queries, then one should be
prepared for higher cost (PSPACE-complete, Table I). In the latter case, the developers of the system should go for heuristic algorithms for diversification analyses rather
than for exact PTIME algorithms, as suggested by the lower bounds of the problems.
Assessing the stock of an e-commerce system. Suppose that a recommendation system
maintains a collection D of items for sale. When a decision procedure for QRD constantly fails to find desired item sets upon frequent users’ requests, or a procedure for
DRP finds that those items recommended by the system are often ranked low, then the
manager of the system should consider adjusting the stock in D. As another example, when a procedure for RDC finds a large stock of popular items, the manager may
consider advertising those items to particular user groups.
Assessing the choices of users. A user of an e-commerce system may employ a procedure for DRP to asses how good a set of items of her choice is, before she commits to
the purchase. Moreover, she could use a procedure for RDC to find out different variety
of choices that meet her need, and hence, choose one from them.
5. THE QUERY RESULT DIVERSIFICATION PROBLEM
We start with the decision problem QRD. We first establish its combined complexity
in Section 5.1 and data complexity in Section 5.2. We will then identify and study
its special cases in Section 8. The complexity bounds of QRD in various settings
are depicted in Fig. 1, annotated with their corresponding theorems, where each
arrow indicates how the complexity of QRD is reduced in different settings (similarly
for Figures 3 and 4 to be given in Sections 6 and 7, respectively). Both combined
complexity and data complexity are summarized, when LQ ranges over the query
languages given in Section 4 (here CQ/FO denotes either CQ or FO), objective function
F is FMS , FMM (shown in Fig. 1(a)) or Fmono (Fig. 1(b)), and when certain conditions
are imposed in special settings (here λ = 0 means that F is defined with a relevance
function only, constant k indicates the setting in which the number of items to be
ACM Transactions on Database Systems, Vol. V, No. N, Article A, Publication date: January YYYY.
A:14
Ting Deng and Wenfei Fan
PSPACE-complete
(Th. 5.1)
PSPACE-complete
(Th. 5.2)
CQ/FO, combined
FO, combined
NP-complete
(Th. 5.1)
+
CQ/∃FO ,
combined
NP-complete
(Th.5.4)
CQ/FO, data
NP-complete
(Th. 8.2)
CQ/FO, λ=0,
combined
PTIME
(Th. 8.2)
PTIME
(Cor. 8.4)
CQ/FO, constant k,
data
PTIME
(Th. 8.2)
PTIME
(Th. 5.4)
CQ/FO, λ=0, data
CQ/FO, data
CQ/FO, λ=0, data
PTIME
(Cor. 8.1)
CQ/FO, identity
queries, combined
Fig. 1. The complexity bounds of QRD
selected is a constant, and identity queries mean that LQ consists of identity queries
only). They reveal the impacts of various factors on the complexity of QRD.
5.1. The Combined Complexity of QRD
Consider a naive algorithm for QRD: first compute Q(D), and then rank k-element sets
U of Q(D) based on F (U ); after these steps we simply pick the top-ranked set U and
check whether F (U ) ≥ B. One might think that the complexity of QRD would equal
the higher complexity of the two steps. The result below tells us that this is the case
when F is FMS or FMM . However, when F is Fmono , the story is quite different.
(1) When the objective is max-sum or max-min diversification, query language LQ
dominates the combined complexity of QRD: it is NP-complete for CQ, UCQ and ∃FO+ ,
but is PSPACE-complete for FO. That is, while the presence of disjunction in UCQ and
∃FO+ does not make it harder than CQ, negation in FO complicates the analysis.
(2) When F is Fmono , the problem becomes more intricate for CQ, UCQ and ∃FO+ : it is already PSPACE-complete, the same as its complexity for FO. Note that the membership
problem for CQ is NP-complete (see the statement of the problem shortly). Moreover,
Fmono is in PTIME after Q(D) is computed (see Corollary 8.1 for details). In contrast,
QRD(CQ, Fmono ) is PSPACE-complete. Hence in this case, the complexity is inherent
to query result diversification, and is not equal to the higher complexity of the two
steps given above. This is because mono-objective formulation requires to aggregate
distances between elements in U and all tuples in Q(D), and is more costly to compute
than FMS and FMM , as remarked in Section 3.
Below we first study QRD(LQ , F ) when F is FMS or FMM .
T HEOREM 5.1. The combined complexity of QRD(LQ , FMS ) and QRD(LQ , FMM ) is
— NP-complete when LQ is CQ, UCQ or ∃FO+ ; and
— PSPACE-complete when LQ is FO.
2
To verify the lower bounds, we use reductions from the following problems.
(1) 3SAT: Given a formula ϕ = C1 ∧ . . . ∧ Cl in which each clause Ci is a disjunction of
three variables or their negations taken from X = {x1 , . . . , xm }, it is to decide whether
ϕ is satisfiable. It is known that 3SAT is NP-complete (cf. [Papadimitriou 1994]).
(2) The membership problem for FO: Given an FO query Q, a database D and a tuple
s, it is to decide whether s ∈ Q(D). This problem is PSPACE-complete [Vardi 1982].
Given these, we prove Theorem 5.1 as follows.
ACM Transactions on Database Systems, Vol. V, No. N, Article A, Publication date: January YYYY.
On the Complexity of Query Result Diversification
A:15
P ROOF. We study QRD(LQ , FMS ) and QRD(LQ , FMM ) first for CQ, UCQ and ∃FO+ , and
then for FO.
(1) When LQ is CQ, UCQ or ∃FO+ . It suffices to prove that QRD(CQ, FMS ) and QRD(CQ,
FMM ) are NP-hard and that QRD(∃FO+ , FMS ) and QRD(∃FO+ , FMM ) are in NP.
(1.1) Lower bound. We show that QRD(CQ, FMS ) and QRD(CQ, FMM ) are NP-hard by
reductions from the 3SAT problem, even when λ = 1 and Q is an identity query.
We first consider QRD(CQ, FMS ). Given an instance ϕ = C1 ∧ . . . ∧ Cl of 3SAT over
variables X = {x1 , . . . , xm }, we define a database D, a fixed CQ query Q, functions δrel ,
δdis and FMS , and constants B and k. We show that ϕ is satisfiable if and only if there
exists a valid set U for (Q, D, k, FMS , B). Assume w.l.o.g. that l > 1.
(1) The database D has a single relation IC specified by RC (cid, L1 , V1 , L2 , V2 , L3 , V3 )
and populated as follows. For each i ∈ [1, l], let clause Ci be l1i ∨ l2i ∨ l3i . For any
truth assignment µi of variables in the literals in Ci that makes Ci true, we add a
tuple (i, xj , vj , xk , vk , xl , vl ) to IC , where xj = l1i if l1i ∈ X and xj = ¯l1j otherwise; and
vj = µi (xj ); similarly for xk , xl and vk and vl . Here IC encode all satisfying truth assignments for clauses separately. This does not give rise to an exponential blow up as
each clause has only three variables. At most 8 tuples are included in IC for each Ci .
(2) We define the query Q as the identity query on RC instances.
(3) We define δrel to be a constant function that returns 1 for each tuple t of RQ . For each
pair of distinct tuples t and s of RQ , we define δdis (t, s) = 1 in case that (a) t[cid] 6= s[cid]
(i.e., t and s correspond to distinct clauses of ϕ) and (b) t and s have the same value for
each variable appearing in both t and s. For any other pair of tuples t′ and s′ of RQ , we
define δdis (t′ , s′ ) = 0. Furthermore,
we use λ = 1. Then for each set U of tuples of RQ ,
P
we have that FMS (U ) = t,s∈U δdis (t, s) for each set U .
(4) We use k = l, i.e., we consider only valid sets of l tuples, one for each clause of ϕ.
Finally, let B = l · (l − 1). Obviously, for any subset U of Q(D) with |U | = l, FMS (U ) ≥ B
if and only if every clause has at least one satisfying assignment which is encoded by
one tuple in U , and all these assignments are consistent over the variables.
We verify that ϕ is satisfiable if and only if there is a valid set U for (Q, D, k, FMS , B).
⇒ Assume that ϕ is satisfiable. Then there exists a truth assignment µ0X of X variables such that every clause Cj of ϕ is true by µ0X . Let U consist of l tuples of RQ ,
one for each clause, in which the values for the variables in X agree with µ0X . Then
FMS (U ) = l · (l − 1) ≥ B by the definition of FMS .
⇐ Conversely, assume that ϕ is not satisfiable. Suppose by contradiction that there
exists a set U ⊆ Q(D) such that |U | = l and FMS (U ) ≥ B. Then U consists of l tuples
that corresponds to l pairwise distinct clauses in ϕ and all agree with values of variables in X, by the definition of δdis . Let µX be the truth assignment of variables in X
such that for each xi ∈ X, µX (xi ) equals the value of xi in tuples in U . It is easy to see
that µX satisfies all clauses in ϕ. This leads to a contradiction.
We next show that QRD(CQ, FMM ) is NP-hard, also by reduction from 3SAT. Given an
instance ϕ = C1 ∧ . . . ∧ Cl of 3SAT, we construct the same D, Q, δrel , δdis as given above,
and set k = l. Furthermore, we let λ = 1, and hence FMM (U ) = mint,s∈U,t6=s δdis (t, s),
for each set U of tuples of RQ . Finally, we let B = 1. It is easy to see that µ0X is a
truth assignment of X variables that makes ϕ true if and only if there exists a set U
consisting of l tuples, one for each clause, in which the values for the variables agree
with µ0X , such that U ⊆ Q(D), |U | = k and FMM (U ) ≥ B.
ACM Transactions on Database Systems, Vol. V, No. N, Article A, Publication date: January YYYY.
A:16
Ting Deng and Wenfei Fan
(1.2) Upper bound. We show that QRD(∃FO+ , F ) is in NP, when F is FMS or FMM , by
giving an NP algorithm. Given a query Q in ∃FO+ , a database D, a positive integer k,
objective function F and a constant B, the algorithm checks whether there exists a set
U valid for (Q, D, k, F B), as follows:
1. guess k CQ queries from Q, and for each CQ query, guess a tableau from D (see,
e.g., [Abiteboul et al. 1995] for tableaux); these tableaux yield a set U ⊆ Q(D);
2. check whether |U | = k and whether F (U ) ≥ B; if so, return “yes”, and otherwise
reject the guess and go back to step 1.
Clearly, step 2 is in PTIME since FMS (U ) and FMM (U ) and PTIME computable. Thus
the algorithm is in NP, and so are QRD(∃FO+ , FMS ) and QRD(∃FO+ , FMM ).
(2) When LQ is FO. We next study QRD(FO, FMS ) and QRD(FO, FMM ).
(2.1) Lower bound. We first show that QRD(FO, FMS ) is PSPACE-hard even when λ = 0
and k is a constant, by reduction from the membership problem for FO. Given an instance (Q, D, s) of the membership problem for FO, we define a database D′ = (D, I01 ),
where I01 = {(1), (0)} is a unary relation specified by schema R01 (X), encoding the
Boolean domain. Moreover, we define a query Q′ in FO as follows:
Q′ (~x, c) = Q(~x) ∧ R01 (c).
Clearly, if s ∈ Q(D), Q′ must return two tuples (s, 1) and (s, 0). We define δrel ((s, 1), Q′ )
= 1, and for any other tuple t of RQ′ , δrel (t, Q′ ) = 0. Furthermore, we define δdis as a
constant functionP
that returns 0 for each pair of tuples of RQ . We use λ = 0. Then
FMS (U ) = (k − 1) · t∈U δrel (t, Q′ ) for a set U of k tuples. Finally, we let k = 2 and B = 1.
We show that s ∈ Q(D) if and only if there exists a valid set for (Q′ , D′ , k, FMS , B).
First assume that s ∈ Q(D). Then there is a valid set U = {(s, 1), (s, 0)} for (Q′ , D′ , k,
FMS , B). Conversely, if s 6∈ Q(D), then by the definition of Q′ , (s, 1) is not in Q′ (D). Thus,
by the definition of FMS , there is no set U ⊆ Q′ (D′ ) such that |U | = 2 and F (U ) ≥ B.
We next consider QRD(FO, FMM ). Given an instance (Q, D, s) of the membership
problem for FO, we define the same D′ , Q′ , δrel and δdis as given above. Furthermore,
we set λ = 0, B = 1 and k = 1. Then FMM (U ) = mint∈U δrel (t, Q′ ). It is easy to see that
t ∈ Q(D) if and only if there exists a set U valid for (Q′ , D′ , k, FMM , B), along the same
lines as the argument given above for QRD(FO, FMS ).
(2.2) Upper bound. We next provide a PSPACE algorithm for QRD(FO, F ), when F is
FMS or FMM . The algorithm works as follows:
1. guess a set U consisting of k distinct tuples of RQ ;
2. check whether U ⊆ Q(D) and F (U ) ≥ B; if so, return “yes”, and otherwise reject
the guess and go back to step 1.
The algorithm is in NPSPACE since step 2 is in PSPACE. Indeed, checking whether
U ⊆ Q(D) is in PSPACE when Q is an FO query, and checking F (U ) ≥ B is in PTIME
when F is FMS or FMM . Thus the algorithm is in NPSPACE = PSPACE. As a result,
QRD(FO, FMS ) and QRD(FO, FMM ) are in PSPACE.
2
We next establish the combined complexity of QRD(LQ , Fmono )
T HEOREM 5.2. The combined complexity of QRD(LQ , Fmono ) is PSPACE-complete,
no matter whether LQ is CQ, UCQ, ∃FO+ or FO.
2
In contrast to its counterparts for FMS and FMM , the lower bound proof for Fmono
needs a counting argument and is more involved. It is verified by reduction from
Q3SAT: given a sentence ϕ = P1 x1 . . . Pm xm ψ, where Pi is either ∀ or ∃ and ψ is an
ACM Transactions on Database Systems, Vol. V, No. N, Article A, Publication date: January YYYY.
On the Complexity of Query Result Diversification
1
1
1
0
t1
0
1
A:17
1
0
0
1
0
1
1 0 1 0 1 0 1
t2 t3 t4 t5 t6 t7 t8 t9
x1
0
0
x2
1
0
x3
0 1 0 1 0 1 0 x4
t10 t11 t12 t13 t14 t15 t16
ϕ = ∃x1 ∀x2 ∃x3 ∀x4 ψ, ψ = (x1 ∨ x2 ∨ x̄3 ) ∧ (x̄2 ∨ x̄3 ∨ x4 ).
l = 3, P4 = ∀ :
δdis (t1 , t2 ) = 0 ⇐ ψ[t1 /~x] = 1, ψ[t2 /~x] = 0,
δdis (t3 , t4 ) = 1 ⇐ ψ[t3 /~x] = 1,ψ[t4 /~x] = 1,
δdis (t5 , t6 ) = 1 ⇐ ψ[t5 /~x] = 1, ψ[t6 /~x] = 1,
δdis (t7 , t8 ) = 1 ⇐ ψ[t7 /~x] = 1, ψ[t8 /~x] = 1,
δdis (t9 , t10 ) = 0 ⇐ ψ[t9 /~x] = 1, ψ[t10 /~x] = 0,
δdis (t11 , t12 ) = 1 ⇐ ψ[t11 /~x] = 1, ψ[t12 /~x] = 1,
δdis (t13 , t14 ) = 0 ⇐ ψ[t13 /~x] = 0, ψ[t14 /~x] = 0,
δdis (t15 , t16 ) = 1 ⇐ ψ[t15 /~x] = 1, ψ[t16 /~x] = 1.
l = 2, P3 = ∃ :
δdis (ti , tj ) = 1, i ∈ [1, 2], j ∈ [3, 4] ⇐ δdis (t1 , t2 ) = 0, δdis (t3 , t4 ) = 1,
δdis (ti , tj ) = 1, i ∈ [5, 6], j ∈ [7, 8] ⇐ δdis (t5 , t6 ) = 1, δdis (t7 , t8 ) = 1,
δdis (ti , tj ) = 1, i ∈ [9, 10], j ∈ [11, 12] ⇐ δdis (t9 , t10 ) = 0, δdis (t11 , t12 ) = 1,
δdis (ti , tj ) = 1, i ∈ [13, 14], j ∈ [15, 16] ⇐ δdis (t13 , t14 ) = 0, δdis (t15 , t16 ) = 1.
l = 1, P2 = ∀ :
δdis (ti , tj ) = 1, i ∈ [1, 4], j ∈ [5, 8] ⇐ δdis (t1 , t4 ) = 1, δdis (t5 , t8 ) = 1,
δdis (ti , tj ) = 1, i ∈ [9, 12], j ∈ [13, 16] ⇐ δdis (t9 , t12 ) = 1, δdis (t13 .t16 ) = 1.
l = 0, P1 = ∃ :
δdis (ti , tj ) = 1, i ∈ [1, 8], j ∈ [9, 16] ⇐ δdis (t1 , t8 ) = 1, δdis (t9 , t16 ) = 1.
Fig. 2. Example distance function δdis when m = 4 in the lower bound proof of Theorem 5.2
instance of 3SAT over variables in X = {x1 , . . . , xm }, Q3SAT is to decide whether ϕ is
true. It is known to be PSPACE-complete (cf. [Papadimitriou 1994]).
To give the reduction, we need some notations and a lemma. Consider an instance ϕ
of Q3SAT. We want to use “Boolean” tuples t = (u1 , . . . , um ) to encode a true assignment
for variables in X of ϕ, where ui is either 0 or 1 for i ∈ [1, m]. Denote by tl the prefixes
(u1 , . . . , ul ) of length l, for l ∈ [0, m − 1]. Consider a pair of tuples t = (u1 , . . . , um )
and s = (v1 , . . . , vm ) such that tl = sl but ul+1 6= vl+1 , i.e., t and s agree on the first l
attribute values but differ in the (l + 1)-th attribute. Note that tl and sl define a truth
assignment of variables xi for i ∈ [1, l], referred to as the truth assignment encoded by
tl . In the reduction, we want to define a distance function δdis such that δdis (t, s) = 1
if and only if formula Pl+1 xl+1 . . . Pm xm ψ is satisfied by the truth assignment encoded
by prefix tl . More specifically, we define δdis inductively from l =m−1 down to l = 0.
ACM Transactions on Database Systems, Vol. V, No. N, Article A, Publication date: January YYYY.
A:18
Ting Deng and Wenfei Fan
(i) When l = m − 1, i.e., t and s differ only in their last attribute, we define δdis (t, s)= 1 if
either (a) Pm = ∀ and both the truth assignments encoded by t an s make ψ true, or
(b) Pm = ∃, and there exists at least one truth assignment µX (encoded by either t
or s) such that µX satisfies ψ. For any other tuples t′ and s′ , we define δdis (t′ , s′ ) = 0.
(ii) When m − 2 ≥ l ≥ 0, we define δdis (t, s) = 1 (a) when Pl+1 = ∀, and δdis ((tl ,
1, 1, . . . , 1), (tl , 1, 0, . . . , 0)) = 1 and δdis ((sl , 0, 1 . . . , 1), (sl , 0, 0, . . . , 0)) = 1; or (b) when
Pl+1 = ∃, and at least one of δdis ((tl , 1, 1, . . . , 1), (tl , 1, 0, . . . , 0)) and δdis ((sl , 0, 1 . . . , 1),
(sl , 0, 0, . . . , 0)) is 1. For any other tuples t′ and s′ , we define δdis (t′ , s′ ) = 0.
An example δdis is depicted in Fig. 2, for a sentence ϕ with 4 variables.
Such distance functions have the following property. That is, the cases of tl+1 and
l+1
s
(branches of truth assignments tl ) given above suffice to ensure Lemma 5.3.
L EMMA 5.3. Consider any pair of Boolean tuples t = (u1 , . . . , um ) and s = (v1 ,
. . . , vm ) that encode truth assignments of ϕ = Pl+1 xl+1 . . . Pm xm ψ. If tl = sl and
ul+1 6= vl+1 for some 0 ≥ l ≥ m − 1, then δdis (t, s) = 1 if and only if ϕ is true under µlX ,
2
where µlX is the truth assignment for variables x1 , . . . , xl encoded by tl .
P ROOF. We prove Lemma 5.3 by induction on l from l = m − 1 down to l = 0. When
l = m − 1, by the definition of δdis , we have that δdis ((tm−1 , 1), (tm−1 , 0)) = 1 if and only if
(a) when Pm = ∀, the truth assignments encoded by (tm−1 , 1) and (tm−1 , 0) both satisfy
ψ; and (b) when Pm = ∃, at least one of the truth assignments encoded by (tm−1 , 1)
and (tm−1 , 0) makes ψ true. Hence the statement holds in the base case.
To give more intuition for the inductive step, we also illustrate the case when
l = m−2. For any two tuples t = (u1 , . . . , um ) and s = (v1 , . . . , vm ) such that tm−2 = sm−2
and um−1 6= vm−1 , by the definition of δdis , we have that δdis (t, s) = 1 if and only if (a)
when Pm−1 = ∀, δdis ((tm−2 , 1, 1), (tm−2 , 1, 0)) = 1 and δdis ((tm−2 , 0, 1), (tm−2 , 0, 0)) = 1;
and (b) when Pm−1 = ∃, we have that either δdis ((tm−2 , 1, 1), (tm−2 , 1, 0)) = 1 or
δdis ((tm−2 , 0, 1),(tm−2 , 0, 0)) = 1. Then case (a) holds if and only if (i) when Pm = ∀, the
truth assignments encoded by (tm−2 , 1, 1) and (tm−2 , 1, 0) both make ψ true; similarly
for (tm−2 , 0, 1) and (tm−2 , 0, 0); and (ii) when Pm = ∃, there exists at least one truth assignment (encoded by (tm−2 , 1, 1) or (tm−2 , 1, 0)) that satisfies ψ; similarly for (tm−2 , 0, 1)
and (tm−2 , 0, 0). Clearly, case (a) (resp. case (b)) holds if and only if ∀xm−1 Pm xm ψ
(resp. ∃xm−1 Pm xm ψ) is true under the truth assignment encoded by tm−2 .
Now we prove the inductive step. Assume that when l = p and p ∈ [1, m − 3],
Lemma 5.3 holds. That is, for all tuples t = (u1 , . . . , um ) and s = (v1 , . . . , vn ) such that
tp = sp but up+1 6= vp+1 , δdis (t, s) = 1 if and only if Pp+1 xp+1 . . . Pm xm ψ is true under
the truth assignment encoded by tp for variables x1 , . . . , xp . We next consider the case
when l = p − 1. Let t and s be an arbitrary pair of tuples that agree on the first p − 1
attributes but disagree on the p-th attribute. By the definition of δdis , we have that
δdis (t, s) = 1 if and only if (c) when Pp = ∀, δdis ((tp−1 , 1, 1, . . . , 1), (tp−1 , 1, 0, . . . , 0)) = 1
and δdis ((tp−1 , 0, 1, . . . , 1), (tp−1 , 0, 0, . . . , 0)) = 1; and (d) when Pp = ∃, either
δdis ((tp−1 , 1, 1, . . . , 1), (tp−1 , 1, 0, . . . , 0)) = 1 or δdis ((tp−1 , 0, 1, . . . , 1), (tp−1 , 0, 0, . . . , 0)) = 1.
By the induction hypothesis, for l = p, we know that case (c) holds if and only if
Pp+1 xp+1 . . . Pm xm ψ is true under the truth assignments of variables x1 , . . . , xp−1
encoded by ~up−1 , no matter whether xp is 1 or 0. The latter holds if and only if
∀xp Pp+1 xp+1 . . . Pm xm ψ is true under the truth assignment of x1 , . . . , xp−1 encoded by
tp−1 . Similarly, case (d) is satisfied if and only if ∃xp Pp+1 xp+1 . . . Pm xm ψ is true under
the truth assignment of x1 , . . . , xp−1 encoded by tp−1 . This proves Lemma 5.3.
2
Leveraging Lemma 5.3, we next prove Theorem 5.2.
ACM Transactions on Database Systems, Vol. V, No. N, Article A, Publication date: January YYYY.
On the Complexity of Query Result Diversification
A:19
P ROOF. It suffices to show that QRD(CQ, Fmono ) is PSPACE-hard, even when λ = 1
and k is a constant, and that QRD(FO, Fmono ) is in PSPACE.
(1) Lower bound. We show that QRD(CQ, Fmono ) is PSPACE-hard by reduction from
the Q3SAT problem. Given an instance ϕ = P1 x1 . . . Pm xm ψ of Q3SAT with variables
in X = {x1 , . . . , xm }, we construct a CQ query Q, a database D, functions δrel , δdis and
Fmono , and let k = 1 and B = 1. We prove that ϕ is true if and only if there exists a
valid set U for (Q, D, k, Fmono , B).
(1) Database D has a single unary relation I01 = {(1), (0)} specified by schema R01 (X),
encoding the Boolean domain as in the proof of Theorem 5.1 for the FO case.
(2) We define the CQ query Q as follows:
Q(~x) = R01 (x1 ) ∧ . . . ∧ R01 (xm ),
where ~x = (x1 , . . . , xm ). Intuitively, query Q generates all truth assignments for
variables in X. Note that |Q(D)| = 2m .
(3) The function δrel is a constant function that returns 1 for each tuple of RQ . We use
the distance function δdis given above, and set λ = 1.
We next verify that there exists a valid set U for (Q, D, k, Fmono , B) if and only if ϕ
is true, by giving a counting argument.
⇒ Assume that there exists a valid set U = {t} for (Q, D, k = 1, Fmono , B = 1). Then
P
P
by λ = 1, Fmono (U )= 2m1−1 · s∈Q(D) δdis (t, s) ≥ 1. Thus s∈Q(D) δdis (t, s) ≥ 2m − 1. Since
|Q(D)| = 2m , for each tuple s ∈ Q(D), if s 6= t, δdis (t, s) must be 1 by the definition of
δdis . In particular, there exists a tuple s′ such that s′ differs from t in the first attribute
and δdis (t, s′ ) = 1. Then by Lemma 5.3, we have that ϕ = P1 x1 . . . Pm xm ψ is true.
⇐ Conversely, assume that ϕ is true. We show that there must exist a valid set U . By
Lemma 5.3, for each pair of tuples t and s such that t1 6= s1 , we have that δdis (t, s) = 1.
Moreover, since ϕ is true, no matter what P1 is, there must exist a truth assignment µx1
for variable x1 such that P2 x2 . . . Pm xm ψ is true under µx1 . Then again by Lemma 5.3,
′
for each pair of tuples t′ = (u′1 , . . . , u′m ) and s′ = (v1′ , . . . , vm
), δdis (t′ , s′ ) = 1 if t′1 = s′1 =
′
′
µx1 but u2 6= v2 . Furthermore, since P2 x2 . . . Pm xm ψ is true under the truth assignment
µx1 for variable x1 , there must exist a truth assignment µx2 of variable x2 such that
P3 x3 . . . Pm xm ψ is true under µx1 and µx2 . Similarly, we get truth assignments µxl for
variable xl , where l ∈ [3, m], such that for each pair of tuples t and s, δdis (t, s) = 1, if
tl = sl = (µx1 , µx2 , . . . , µxl ), and ul+1 6= vl+1 for l ∈ [0, m − 1]. Let t∗ = (µx1 , . . . , µxm )
and U = {t∗ }. Then
of Q, t∗ ∈ Q(D). Obviously, |U | = 1. To see that
P by the definition
1
∗
Fmono (U ) = 2m −1 s∈Q(D) δdis (t , s) ≥ B = 1, we compute the number of tuples s such
that δdis (t∗ , s) = 1. As discussed earlier, for each tuple s = (v1 , . . . , vm ), δdis (t∗ , s) = 1
if there exists l ∈ [0, m − 1] such that (t∗ )l = sl but µxl+1 6= vl+1 . It is easy to see that
there are 2m /2l+1 = 2m−l−1 tuples s such that δdis (t∗ , s) = 1. Thus the number of tuples
P
m−l−1
s such that δdis (t∗ , s) = 1 is m−1
= 2m−1
1 = 2m − 1. Recall that we
l=0 2
P + ... + 2 +
1
∗
set k = 1 and B = 1. Hence Fmono (U ) = 2m −1 s∈Q(D) δdis (t , s) = 1 ≥ B.
(2) Upper bound. We show that QRD(FO, Fmono ) is in PSPACE. Let MQ be the binary
representation of tuples t of RQ such that each of its attributes has the largest
constants from the active domain. Note that tuple t is bounded by arity(RQ ) and
adom(Q, D), where arity(RQ ) denotes the arity of the schema RQ , and adom(Q, D) is the
set of constants appearing in D or Q. We present an NPSPACE algorithm as follows:
1. guess a set U of k tuples of RQ ;
2. if U ⊆ Q(D), continue; otherwise, reject the guess and go back to step 1;
ACM Transactions on Database Systems, Vol. V, No. N, Article A, Publication date: January YYYY.
A:20
Ting Deng and Wenfei Fan
3. let n denote the number of tuples in Q(D), and variable dist keep track of the sum
of distances between each pair of tuples t ∈ Q(D) and s ∈ Q(D), both represented
in binary; initially, we set n = 0 and dist = 0;
4. for each tuple s ∈ [0, MQ ] (encoded in binary with necessary delimiters) do:
a. check whether s ∈ Q(D); if so, continue; otherwise, go back to step 4;
b. for each tuple t ∈ U , dist := dist + δdis (t, s);
c. increase n by 1;
P
5. compute Fmono (U ) = (1 − λ) · t∈U δrel (t, Q) + (λ/(n + 1)) · dist; check whether
Fmono (U ) ≥ B; if so, return “yes”.
Observe the following. (a) Step 2 is in PSPACE for FO queries. (b) Steps 4 and 5 can
also be done in polynomial space when tuples t, counter n and sum dist are encoded
in binary. Indeed, step 4(a) is in PSPACE for FO queries. Moreover, steps 4(b), 4(c)
and 5 can be done in PSPACE because we consider tuples in binary coding. Hence the
algorithm is in NPSPACE = PSPACE. Thus QRD(FO, Fmono ) is in PSPACE.
2
5.2. The Data Complexity of QRD
We next re-investigate QRD for data complexity. That is, when query Q is predefined
and fixed, but database D may vary, i.e., the complexity of evaluating a fixed query for
variable database inputs (see [Abiteboul et al. 1995] for details). Data complexity is
also of practical interest since in many real-life applications, one often uses a fixed set
of queries, e.g., Web forms, while the database may be frequently updated.
From the result below we can see that when the objective is given by FMS and FMM ,
fixing query Q does not reduce the complexity of QRD for CQ, UCQ and ∃FO+ : the
problem remains NP-complete, the same as its combined complexity. In contrast, fixed
queries do simplify the analysis of the problem.
(1) when LQ is FO and F is FMS or FMM , QRD(LQ , F ) becomes NP-complete; or
(2) when F is Fmono , the problem becomes tractable in this case.
Contrast these with the PSPACE-completeness of their combined complexity (Theorem 5.1 and 5.2). These demonstrate that when Q is fixed, objective functions
determine the data complexity, while query languages have no impact.
T HEOREM 5.4. The data complexity of QRD(LQ , FMS ) and QRD(LQ , FMM ) is NPcomplete, while that of QRD(LQ , Fmono ) is in PTIME, when LQ is CQ, UCQ, ∃FO+ or FO. 2
P ROOF. We first study the data complexity of QRD(LQ , FMS ) and QRD(LQ , FMM ) for
CQ, UCQ, ∃FO+ and FO, and then investigate it of QRD(LQ , Fmono ).
(1) When F is FMS or FMM . It suffices to prove that QRD(CQ, F ) is NP-hard and that
QRD(FO, F ) is in NP. Recall that the lower bounds of QRD(CQ, FMS ) and QRD(CQ, FMM )
of Theorem 5.1 are shown by using a fixed identity query. Thus the lower bounds
hold here, i.e., QRD(CQ, FMS ) and QRD(CQ, FMM ) are NP-hard. For the upper bound,
note that the algorithm given in the proof of Theorem 5.1 for FO can carry over here.
Clearly, its step 2 is in PTIME since Q(D) is PTIME computable for a fixed FO query
Q, and F is also in PTIME when F is FMS or FMM . Thus the algorithm is in NP for fixed
FO queries, and hence so is the data complexity of QRD(FO, FMS ) and QRD(FO, FMM ).
(2) When F is Fmono . We give a PTIME algorithm to check whether there exists a set U
valid for (Q, D, k, Fmono , B). Given a fixed Q, D, Fmono , k and B, the algorithm first
computes the entire set Q(D) of query answers. It then checks whether Q(D) contains
at least k items at all, and if so, it picks k tuples from Q(D) with the largest “score” in
terms of the relevance and diversity. Finally, it checks whether the sum of the scores of
these tuples is beyond the bound B. It returns “yes” if so and “no” otherwise. When Q
ACM Transactions on Database Systems, Vol. V, No. N, Article A, Publication date: January YYYY.
On the Complexity of Query Result Diversification
A:21
PSPACE-complete
(Th. 6.1)
PSPACE-complete
(Th. 6.2)
FO, combined
CQ/FO, combined
coNP-complete
(Th. 6.1)
CQ/∃FO+, combined
PTIME
(Th. 8.2)
CQ/FO, λ=0, data
coNP-complete
(Th. 6.4)
CQ/FO, data
coNP-complete
(Th. 8.2)
CQ/FO, λ=0,
combined
PTIME
(Cor. 8.4)
CQ/FO, constant k,
data
PTIME
(Th. 8.2)
PTIME
(Th. 6.4)
CQ/FO, λ=0, data
CQ/FO, data
(a) When F is FMS or FMM
PTIME
(Cor. 8.4)
CQ/FO, identity
queries, combined
(b) When F is Fmono
Fig. 3. The complexity bounds of DRP
is fixed, all these can be done in PTIME. In contrast to FMS and FMM , Fmono allows us to
compute the scores by inspecting each tuple in Q(D) individually, rather than by checking each k-element subset of Q(D). More specifically, the algorithm works as follows:
1. compute Q(D), in PTIME;
2. check whether |Q(D)| ≥ k; if so, continue; otherwise return
P “no”;
λ
· s∈Q(D) δdis (t, s) in PTIME;
3. for each tuple t, compute v(t) = (1−λ)·δrel (t, Q)+ |Q(D)|−1
4. let vsum be the sum of the k largest values in U , where U = {v(t) | t ∈ Q(D)}, and
check whether vsum ≥ B; if so, return “yes”; otherwise, return “no”.
The algorithm is in PTIME. Indeed, since Q is fixed, Q(D) is PTIME computable. Thus
steps 1-3 are all in PTIME. Hence so is the data complexity of QRD(FO, Fmono ).
2
6. THE DIVERSITY RANKING PROBLEM
We next study the decision problem DRP. We give its combined complexity in Section 6.1 and data complexity in Section 6.2, and will investigate its special cases in
Section 8. Along the same lines as Fig. 1, we summarize the complexity bounds of DRP
in various setting in Fig. 3, which depicts the impact of various parameters.
All the complexity bounds of this section remain intact when rank r is taken as
input of DRP instead of a constant, except the PTIME bound of the data complexity for
Fmono (Theorem 6.4; see details there).
6.1. The Combined Complexity of DRP
While DRP does not reduce to QRD (or vice versa), the two decision problems are
connected. Given an instance D, Q, F, k and U ⊆ Q(D) of DRP, one can construct an
instance D, Q, F, k and B = F (U ) of QRD such that rank(U ) ≤ r if and only if there
exist no more r − 1 subsets U ′ of Q(D) with F (U ′ ) > B. From this an algorithm for
DRP immediately follows, by using an QRD oracle: first guess r subsets U ′ of Q(D)
with k elements each, and then check whether F (U ′ ) > B by calling (a minor revision
of) a procedure for QRD; return “no” if so. In light of the connection, it is not surprising
that DRP and QRD behave similarly in their combined complexity analyses.
(1) When F is FMS or FMM , LQ dominates the combined complexity of DRP.
(2) When F is Fmono , LQ has no impact on the combined complexity of DRP.
We first study the combined complexity of DRP(LQ , FMS ) and DRP(LQ , FMM ).
T HEOREM 6.1. The combined complexity of DRP(LQ , FMS ) and DRP(LQ , FMM ) is
ACM Transactions on Database Systems, Vol. V, No. N, Article A, Publication date: January YYYY.
A:22
Ting Deng and Wenfei Fan
— coNP-complete when LQ is CQ, UCQ or ∃FO+ , and
— PSPACE-complete when LQ is FO.
2
The proof is similar to the proof of Theorem 5.1 for QRD. In particular, the lower
bounds are verified by reductions from the complements of 3SAT and the membership
problem for FO. Note that, however, the connection between QRD to DRP is not strong
enough for us to carry (the dual of) the complexity bounds of QRD to DRP: (1) DRP does
not inherit the lower bound of QRD since QRD does not reduce to DRP, and (2) for the
upper bound, the algorithm outlined above gives us Σp2 by calling the QRD, not coNP.
Here Σp2 is the class of languages recognized by a nondeterministic Turing machine
with an NP oracle, i.e., NPNP [Papadimitriou 1994].
P ROOF. We first show that DRP(LQ , FMS ) and DRP(LQ , FMM ) are coNP-complete
when LQ is CQ, UCQ or ∃FO+ , and then prove that they are PSPACE-complete for FO.
(1) When LQ is CQ, UCQ or ∃FO+ . It suffices to show that DRP(LQ , FMS ) and DRP(LQ ,
FMM ) are coNP-hard for CQ and that they are in coNP for ∃FO+ .
(1.1) Lower bound. We show that DRP(CQ, FMS ) and DRP(CQ, FMM ) are already coNPhard for fixed identity queries, λ = 1 and r = 1, by reductions from the complement
of 3SAT. We first verify this for DRP(CQ, FMS ). Given an instance ϕ of 3SAT, where
ϕ = C1 ∧ . . . ∧ Cl is defined over variables in X = {x1 , . . . , xm }, we define a database
D, a CQ query Q, functions δrel , δdis and FMS , a set U and a positive integer k. We show
that ϕ is not satisfiable if and only if rank(U ) ≤ r.
To give the reduction, we construct a new formula ϕ′ such that ϕ′ is satisfiable if and
only if ϕ is satisfiable. Let z be a fresh variable that is not in the set X of variables in ϕ.
V
We define ϕ′ = (ϕ ∨ z) ∧ z̄ = li=1 (Ci ∨ z) ∧ z̄. It is easy to see that for a truth assignment
µX of X variables, µX satisfies ϕ if and only if µX makes ϕ′ true with z = 0. Moreover,
there always exists a truth assignment that makes ϕ′ false. Indeed, when setting z to
be 1, ϕ′ is false under any truth assignments of X. We next give the reduction.
(1) The database D includes a single relation IC specified by the following schema:
RC (cid, L1 , V1 , L2 , V2 , L3 , V3 , Z, VZ , A), populated as follows. Let Ci′ = l1i ∨ l2i ∨ l3i ∨ z be the
i-th clause of ϕ′ , where i ∈ [1, l]. For any possible truth assignment µi of variables in the
literals of Ci′ , we add a tuple (i, xk , vk , xl , vl , xm , vm , z, vz , a), where xk = l1i if l1i ∈ X, and
xk = l̄1i if l1i = x̄k . We set vk = µi (xk ); similarly for xl , xm , z and vl , vm and vz . Moreover,
we set a = 1 if the truth assignment µi satisfies Ci′ ; otherwise we set a = 0. Further, for
z̄, we add two tuples (l + 1, e1 , f1 , e2 , f2 , e3 , f3 , z, 1, 0) and (l + 1, e1, f1 , e2 , f2 , e3 , f3 , z, 0, 1),
where all ei and fi are distinct constants that are not in X ∪ {z, 0, 1}.
(2) We define Q as the identity query on RC instances.
(3) Let k = l + 1. We let the set U consist of l + 1 tuples from D, one for each clause in
ϕ′ such that all variables in X and z are set to be 1. Obviously, U ⊆ Q(D).
(4) We define the relevance function δrel to be a constant function that returns 1 for
each tuple t of RQ . Moreover, for each pair of tuples t and s of RQ , we define δdis (t, s) = 1
if (i) t[cid] 6= s[cid], i.e., t and s encode two different clauses in ϕ′ ; (ii) t and s have the
same value for each variable appearing in both t and s; and (iii) t[A] = s[A] = 1, i.e., t
and s encode truth assignments that satisfy their corresponding clauses, respectively.
′ ′
Furthermore, for any other pair of tuples t′ and s′ , we define
P δdis (t , s ) = 0. Moreover,
we let λ = 1. Then for each set S of tuples of RQ , FMS (S) = t,s∈S δdis (t, s).
We next show that rank(U ) = 1 ≤ r if and only if ϕ is not satisfiable.
⇒ Assume that ϕ is satisfiable. Then there exists a truth assignment µ0X for the X
variables that satisfies ϕ. We show that rank(U ) ≥ 2 > r = 1. Let U 0 consist of l + 1
ACM Transactions on Database Systems, Vol. V, No. N, Article A, Publication date: January YYYY.
On the Complexity of Query Result Diversification
A:23
tuples, one for each clause in ϕ′ , such that the values of all variables in X agree with
µ0X and z is set to be 0. Obviously, for any two different tuples t and s in U 0 , we have
that δdis (t, s) = 1 by the definition of δdis . Then FMS (U 0 ) = (l + 1) · l. Note that for each
tuple t ∈ U , t[A] = 1 if t 6= (l + 1, e1 , f1 , e2 , f2 , e3 , f3 , z, 1, 0), and moreover, for any two
tuples t, s ∈ U such that t[A] = s[A] = 1, we have that δdis (t, s) = 1, by the definition of
δdis . Then FMS (U ) = l · (l − 1). Putting these together, we have that rank(U ) ≥ 2 > r = 1.
⇐ Conversely, assume that ϕ is not satisfiable. Then there exists no truth assignment µX of the X variables that satisfies ϕ. It is easy to see that for each candidate
set S for (Q, D, k), there exist at most l tuples t ∈ S such that t[A] = 1, and thus
FMS (S) ≤ l · (l − 1) = FMS (U ). Therefore, rank(U ) = 1 ≤ r = 1.
We show that DRP(CQ, FMM ) is also coNP-hard by reduction from the complement of
3SAT. Given an instance ϕ of 3SAT, we construct the same ϕ′ , D, Q, k, r, δrel (·, ·), U and
′
λ = 1 as above. We define distance function δdis
such that for any pair of distinct tuples
′
′
t and s of RQ , (i) δdis
(t, s) = 2 if δdis (t, s) = 1 and t, s ∈
/ U ; (ii) δdis
(t, s) = 1 if t, s ∈ U ; and
′
′
′ ′
(iii) for any other t and s , δdis (t , s ) = 0. Then for each k-element set S of tuples of
schema RQ , FMM (S) = mint,s∈S,t6=s δdis (t, s). We show that this is indeed a reduction.
′
By the definitions of δdis
and FMM , for each candidate set S for (Q, D, k), (i) FMM (S) = 2
if S encodes a truth assignment of X that satisfies ϕ with z = 0; (ii) FMM (S) = 1 if
S = U ; and otherwise (iii) FMM (S) = 0. Then along the same line as for DRP(CQ, FMS ),
one can verify that ϕ is not satisfiable if and only if rank(U ) = 1 ≤ r = 1.
(1.2) Upper bound. We show that when F is FMS or FMM , DRP(∃FO+ , F ) is in coNP, by
giving an NP algorithm for its complement. Given Q, D, F , U and k, the algorithm
checks whether rank(U ) > r, i.e., whether there exist r candidate sets for (Q, D, k) such
that for each such set S, F (S) > F (U ).
1. make r guess such that each time, guess k CQ queries from Q, and for each CQ
query, guess a tableau from D; these tableaux yield a subset of Q(D), collected in S;
2. check whether S contains r distinct sets and whether for each S ∈ S, |S| = k;
3. for each set S ∈ S, check if F (S) > F (U ); if so, return “no”.
The algorithm is in NP since step 2 is in PTIME; moreover, step 3 is also in PTIME
since F is PTIME computable when F is FMS or FMM . Hence the problem is in coNP.
(2) When LQ is FO. We show that DRP is PSPACE-complete for FMS and FMM .
(2.1) Lower bound. We show that for FO, DRP is already PSPACE-hard when λ = 0 and
r = 1, by reductions from the complement of the membership problem for FO.
We first consider DRP(FO, FMS ). Given an instance (Q, D, s) of the membership problem for FO, we define a database D′ = (D, I01 ), where I01 = {(1), (0)} is an instance of
schema R01 (X) encoding the Boolean domain. We define a query Q′ as follows:
Q′ (~x, z, c) = Q(~x) ∨ (R01 (z) ∧ z = 1) ∧ R01 (c).
Clearly, no matter whether s ∈ Q(D) or not, (s, 1, 1) and (s, 1, 0) are in Q′ (D′ ). Moreover, when s ∈ Q(D), (s, 0, 1) and (s, 0, 0) must be in Q′ (D′ ). We define δrel ((s, 0, 1), Q′ ) =
δrel ((s, 0, 0), Q′ ) = 3, δrel ((s, 1, 1), Q′ ) = δrel ((s, 1, 0), Q′ ) = 2, and for any other tuple t of
RQ′ , δrel (t, Q′ ) = 1. Furthermore, we define δdis as a constant function that returns 0 for
′
each pair of tuples of RQ . We set λ = 0 and k = 2; hence, for each set S of tuples of RQ
P
′
with k tuples, FMS (S) = (k − 1) · t∈S δrel (t, Q ). Note that FMS ({(s, 0, 1), (s, 0, 0)}) = 6,
FMS ({(s, 1, 1), (s, 1, 0)}) = 4, FMS ({t, t′ }) = 5 if t ∈ {(s, 0, 1), (s, 0, 0)} and t′ ∈
{(s, 1, 1), (s, 1, 0)}, and for any other pair of tuples t and t′ , FMS ({t, t′ }) = 2. Finally, let
r = 1 and U = {(s, 1, 1), (s, 1, 0)}. As remarked earlier, U ⊆ Q(D).
ACM Transactions on Database Systems, Vol. V, No. N, Article A, Publication date: January YYYY.
A:24
Ting Deng and Wenfei Fan
We show that s ∈
/ Q(D) if and only if rank(U ) ≤ r. First assume that s ∈
/ Q(D). Then
(s, 0, 1) and (s, 0, 0) are not in Q′ (D′ ). Thus rank(U ) = 1 ≤ r by the definition of FMS .
Conversely, if s ∈ Q(D), (s, 0, 1) and (s, 0, 0) are both in Q′ (D′ ), and rank(U ) > r = 1.
We next show that DRP(FO, FMM ) is PSPACE-hard, also by reduction from the
complement of the membership problem for FO. Given an instance (Q, D, s) of that
problem, we define the same Q′ , D′ , δrel , λ = 0 and r = 1 as above. Then for each set S of
′
tuples of RQ
, FMM (S) = mint∈S δrel (t, Q′ ). Let U = {(s, 1, 1)}. Then along the same line
as above, one can verify that s ∈
/ Q(D) if and only if rank(U ) = 1 ≤ r for (Q′ , D′ , k, FMM ).
(2.2) Upper bound. We next present a PSPACE algorithm for the complement of
DRP(FO, F ) when F is FMS or FMM . The algorithm works as follows:
1. guess r distinct sets such that each such set S consists of k different tuples of RQ ;
2. denote by S the collection of the r sets; for each S ∈ S, check whether S ⊆ Q(D);
if so, continue;
3. check whether for all S ∈ S, F (S) > F (U ); if so, return “no”.
The algorithm is in NPSPACE = PSPACE since step 2 is in PSPACE for FO queries and
step 3 is in PTIME since FMS or FMM are in PTIME; hence so is the problem.
2
We next study the combined complexity of DRP(LQ , Fmono )
T HEOREM 6.2. The combined complexity of DRP(LQ , Fmono ) is PSPACE-complete
when LQ is CQ, UCQ, ∃FO+ or FO.
2
The proof is a variation of the one for Theorem 5.2. The lower bound is also verified
by reduction from Q3SAT. Given an instance ϕ = P1 x1 . . . Pm xm ψ of Q3SAT over
∗
variables X = {x1 , . . . , xm }, we define a distance function δdis
similar to its counterpart
given for QRD. More specifically, let t = (u1 , . . . , um ) and s = (v1 , . . . , vm ) be any
pair of tuples encoding two truth assignments of X variables, such that tl = sl but
ul+1 6= vl+1 , where tl and sl are prefixes of t and s of length l, respectively. We want to
∗
∗
use δdis
to ensure that δdis
(t, s) > 0 if and only if formula Pl+1 xl+1 . . . Pm xm ψ is satisfied
by the truth assignment encoded by prefix tl , for variables x1 , . . . , xl . To do so, denote
∗
by t̂ the m-arity tuple (1, . . . , 1). We define δdis
by revising δdis given in the proof of
Theorem 5.2 for QRD(CQ, Fmono ), such that
∗
(i) δdis
(t̂, s) = (1/2) · δdis(t̂, s) for all tuples s = (1, v2 , . . . , vm ) with vi ∈ {0, 1} for i ∈ [2, m];
∗
(ii) δdis (t̂, s) = 2 · δdis (t̂, s) for all s = (0, v2 , . . . , vm ) with vi ∈ {0, 1} for i ∈ [2, m]; and
∗
(t′ , s′ ) = δdis (t′ , s′ ).
(iii) for any other pair of tuples t′ and s′ of RQ , δdis
It is easy to see that for each pair t and s of m-arity tuples given above, δdis (t, s) = 1 if
∗
and only if δdis
(t, s) > 0. Along the same line as Lemma 5.3, one can show the following.
L EMMA 6.3. Consider any pair of tuples t = (u1 , . . . , um ) and s = (v1 , . . . , vm ) that
encode truth assignments of ϕ = Pl+1 xl+1 . . . Pm xm ψ, where tl = sl and ul+1 6= vl+1 for
some 0 ≥ l ≥ m − 1. Let µlX be a truth assignment of variables x1 , . . . , xl encoded by
tl . Then the following statements are equivalent: (1) Pl+1 xl+1 . . . Pm xm ψ is satisfied by
∗
′
(t, s) > 0, and (3) there exist two tuples t′ = (u′1 , . . . , u′m ) and s′ = (v1′ , . . . , vm
)
µlX , (2) δdis
′
′
′ ′
′l
′l
l
∗
2
for ϕ such that δdis (t , s ) > 0, where t = s = µX but ul+1 6= vl+1 .
Capitalizing on Lemma 6.3, we next prove Theorem 6.2.
P ROOF. We show that DRP(CQ, Fmono ) is PSPACE-hard DRP(FO, Fmono )is in PSPACE.
The lower bound holds even when λ = 1 and k is a constant.
ACM Transactions on Database Systems, Vol. V, No. N, Article A, Publication date: January YYYY.
On the Complexity of Query Result Diversification
A:25
(1) Lower bound. We show the lower bound by reduction from Q3SAT. Given an
instance ϕ = P1 x1 . . . Pm xm ψ of Q3SAT, we construct the same query Q, database
D and relevance function δrel as their counterparts for QRD(CQ, Fmono ), and let
k = 1 and r = 1. Furthermore, let U = {t̂}, where t̂ is the m-arity tuple (1, . . . , 1)
in Q(D) given above.
Finally, we set λ = 1. Then for each set S of tuples of RQ ,
P
∗
∗
Fmono (S) = 2m1−1 t∈S,s∈Q(D) δdis
(t, s), where δdis
is defined as above.
We show that ϕ is true if and only if rank(U ) = 1 ≤ r = 1.
∗
⇒ Assume first that ϕ is true. Then by Lemma 6.3 we have that δdis
(t̂, s) = 2 for
each tuple s = (0, v2 , . . . , vm ), where
v
∈
{0,
1}
for
i
∈
[2,
m].
Note
that
there exist
i
P
∗
2m−1 such tuples s in total. Thus s∈Q(D) δdis
(t̂, s) ≥ 2 · 2m−1 = 2m . We next prove
P
∗
(t, s) ≤ 2m ,
that for any other tuple t that is distinct from t̂, we have that s∈Q(D) δdis
and thus rank(U ) = 1 by the definition of Fmono . Indeed, for any tuple t 6= t̂, there
∗
exist at most 2m − 1 tuples s 6= t such that δdis
(t, s) > 0 (recall that |Q(D)| = 2m ).
Consider the following two cases. For
t = (u
t 6= t̂, (a) if
P1 , . . . , um ) and
P any tuple
∗
∗
∗
∗
(t, s) + δdis
(t, t̂)
(t, s) = s∈Q(D)\{t̂} δdis
u1 = 0, then by the definition of δdis
, s∈Q(D) δdis
P
P
∗
∗
∗
m
m
≤ (2 −2)+2 = 2 ; and (b) otherwise, s∈Q(D) δdis (t, s) = s∈Q(D)\{t̂} δdis (t, s)+δdis (t, t̂)
≤ (2m − 2) + 1/2 = 2m − 3/2 < 2m . As a result, we have that rank(U ) = 1 ≤ r = 1.
⇐ Conversely, assume that ϕ is false. Let l0 be the minimum value in [0, m − 1]
such that there exists a tuple s = (v1 , . . . , vm ) with δdis (t̂, s) = 1, where sl0 = (1, . . . , 1)
and vl0 +1 6= 1. Then by Lemma 6.3, l0 is also the minimum value in [0, m − 1] such
′
that δdis (t′ , s′ ) = 1 for any pair of tuples t′ = (u′1 , . . . , u′m ) and s′ = (v1′ , . . . , vm
), where
′l0
′l0
′
′
m−l0
t = s = (1, . . . , 1) but ul0 +1 6= vl0 +1 . Note that there exist at most 2
− 1 tuples
∗
s such that δdis
(t̂, s) > 0, where sl0 = (1, . . . , 1) and vl0 +1 6= 1 (recall that there are
m−l0
tuples s such that sl0 = (1, . . . , 1) and vl0 +1 6= 1). Then we have that
at most 2
P
m−l0
− 1) = 2m−l0 −1 − 1/2. We next show that there exists
s∈Q(D) δdis (t̂, s) ≤ (1/2)· (2
P
a tuple t∗ that is distinct from t̂ such that s∈Q(D) δdis (t∗ , s) ≥ 2m−l0 −1 , and hence
rank(U ) > r = 1. Indeed, let t∗ = (u∗1 , . . . , u∗m ) be a tuple such that (t∗ )l0 −1 = (1, . . . , 1)
but u∗l0 6= 1. Then there exist at least 2m−l0 −1 tuples s = (v1 , . . . , vm ) such that
δdis (t∗ , s) > 0, where (t∗ )l0 = sl0 but u∗l0 +1 6= vl0 +1 , and moreover, for each such tuple s,
P
∗
∗
δdis
(t∗ , s) = 1 by the definition of δdis
. Thus we have that s∈Q(D) δdis (t∗ , s) ≥ 2m−l0 −1 .
Hence by the definition of Fmono , Fmono (U ) < Fmono (U ′ ), for each set U ′ consisting of a
single tuple t∗ given above. Hence rank(U ) > r = 1.
(2) Upper bound. We show that DRP(FO, Fmono ) is in PSPACE. Recall the algorithm
given earlier for DRP(FO, F ) for FMS and FMM . Obviously, the algorithm also works
here. We show that it is in PSPACE. Indeed, since Fmono is PSPACE computable (as
argued in the proof of Theorem 5.2), step 3 of that algorithm is in PSPACE here;
moreover, step 2 is in PSPACE. Thus the algorithm is in NPSPACE = PSPACE.
2
6.2. The Data Complexity of DRP
One might be tempted to think that fixing Q would make DRP(LQ , F ) easier. Nevertheless, DRP(LQ , F ) becomes simpler only (1) when F is Fmono , or (2) when LQ is FO,
and F is FMS or FMM . Its data complexity remains the same as its combined complexity
when LQ is CQ, UCQ or ∃FO+ , for FMS and FMM (see Theorem 6.4). This is consistent
with the data complexity analysis of QRD(LQ , F ) (Theorem 5.4).
ACM Transactions on Database Systems, Vol. V, No. N, Article A, Publication date: January YYYY.
A:26
Ting Deng and Wenfei Fan
T HEOREM 6.4. For CQ, UCQ, ∃FO+ and FO, the data complexity of DRP(LQ , FMS )
and DRP(LQ , FMM ) are coNP-complete, while that of DRP(LQ , Fmono ) is in PTIME.
2
As remarked earlier, all the results of this section remain valid even when r is treated
as part of the input for DRP rather than a constant, except the data complexity of
DRP for Fmono . Indeed, if r is not a constant and if the numeric value is encoded in
binary instead of unary, the PTIME algorithm to be presented below for the Fmono case
becomes pseudo-polynomial time rather than PTIME. The other proofs of Theorem 6.4
are quite similar to their counterparts for Theorem 5.4.
P ROOF. We first investigate the data complexity of DRP when F is FMS or FMM . We
then study it when F is Fmono .
(1) When F is FMS or FMM . It suffices to prove that DRP(CQ, FMS ) and DRP(CQ, FMM )
are coNP-hard and that DRP(FO, FMS ) and DRP(FO, FMM ) are in coNP for fixed queries.
Recall that the lower bounds of DRP(CQ, FMS ) and DRP(CQ, FMM ) for Theorem 6.1
are established by using a fixed identity query. Thus the lower bounds hold here.
Hence DRP(CQ, FMS ) and DRP(CQ, FMM ) are both coNP-hard. For the upper bound, the
algorithm given in the proof of Theorem 6.1 for FO and for FMS and FMM works here,
which is to check whether rank(U ) > r. We show that the algorithm is in NP. Indeed,
Q(D) is PTIME computable when Q is a fixed FO query. Hence its step 2 is PTIME.
Moreover, step 3 is in PTIME for FMS and FMM since F is PTIME computable. Thus the
algorithm is in NP, and as a result, DRP(FO, FMS ) and DRP(FO, FMM ) are in coNP.
(2) When the objective is mono-objective. We show that DRP(FO, Fmono ) is in PTIME by
giving a PTIME algorithm. Given Q, D, k, r, δdis , δdis , Fmono , and a set U , the algorithm
returns “yes” if rank(U ) ≤ r, and returns “no” otherwise.
Intuitively, the algorithm first finds a collection S of top-r candidate sets for (Q, D, k)
based on Fmono . Let S = {S1 , . . . , Sr }, where Fmono (S1 ) ≥ . . . ≥ Fmono (Sr ). Then we
simply need to check whether Fmono (U ) < Fmono (Sr ); the algorithm returns no if so,
and yes otherwise. To see this, observe that for each candidate set V ∈
/ S, there
exists no set S ∈ S such that Fmono (V ) > Fmono (S). After S is found, consider the
following cases. (a) If U ∈ S, then there exist at most r − 1 candidate sets V such that
Fmono (V ) > Fmono (U ), and thus rank(U ) ≤ r. (b) If U ∈
/ S but Fmono (U ) = Fmono (Sr ),
then similarly, rank(U ) ≤ r. (c) If U ∈
/ S and Fmono (U ) < Fmono (Sr ), then rank(U ) > r
since there exist at least r candidate sets V such that Fmono (V ) > Fmono (U ). Thus the
algorithm checks which condition of (a), (b) or (c) is satisfied by S and U . It returns
“yes” in both cases (a) and (b), and returns “no” in case (c).
We next show how to compute collection S. We construct S by adding sets S1 , . . . , Sr
to S, one set or multiple sets at a time. Let FindNext(Q, D, Fmono , S, S ′ , k, l, l′ ) be a
procedure that given Q, D, Fmono , k, l and a collection S that consists of top-l candidate
sets for (Q, D, k), returns a collection S ′ such that S ⊂ S ′ and S ′ is a collection of
top-l′ candidate sets for (Q, D, k), where 1 ≤ l < l′ ≤ r, by finding one or more
candidate sets S ∈
/ S. Procedure FindNext(Q, D, Fmono , S, S ′ , k, l, l′ ) works as follows.
Let S = {S1 , . . . , SP
l } (l > 0). For each tuple t ∈ Q(D), let v(t) = (1 − λ) · δrel (t, Q)
+(λ/(|Q(D)| − 1)) t′ ∈Q(D) δdis (t, t′ ). Then it carries out the following steps.
1. For each Si ∈ S, find sets V by replacing one tuple t in Si with another tuple s in
Q(D) \ Si , where v(s) ≤ v(t), such that sets V have the highest Fmono values among
all such new sets. More specifically, do the following:
a. For each t ∈ Si and s ∈ Q(D) such that s ∈
/ Si and v(s) ≤ v(t), we get the candidate set Si (s, t) by replacing t in Si with s. Obviously, Fmono (Si (s, t)) ≤ Fmono (Si ).
ACM Transactions on Database Systems, Vol. V, No. N, Article A, Publication date: January YYYY.
On the Complexity of Query Result Diversification
A:27
Denote by Ei (t) the collection of such sets Si (s, t) with the highest Fmono value
such that Si (s, t) ∈
/ S. Note that |Ei (t)| ≥ 0. When all tuples t in Si are processed, S
we get k sets Ei (t) for each t ∈ Si . Let Ei be the collection of candidate
sets in t∈Si Ei (t) with the highest Fmono value. Note that for each set V ∈ Ei ,
we have that Fmono (V ) ≤ Fmono (Si ) and V ∈
/ S.
b. Let E be the collection of sets in E1 ∪ . . . ∪ El with the highest Fmono value. We
will show that S ∪E must consist of the top-l′ candidate sets. Note that |S ∪E|
may be greater than r while |S| < r.
c. Check whether |S ∪E| > r. If so, we get S ′ by adding only r − |S| sets from E to
S, picked randomly as they have the same F value; otherwise, let S ′ be S ∪E.
2. Return S ′ .
Capitalizing on FindNext(), the algorithm for DRP with fixed FO queries is given as
follows, which returns “yes” if rank(U ) ≤ r, and returns “no” otherwise:
1. compute
Q(D); for each t ∈ Q(D), compute v(t) = (1 − λ) · δrel (t, Q) +(λ/(|Q(D)| − 1))
P
′
t′ ∈Q(D) δdis (t, t ); sort tuples in Q(D) in descending order based on their v() values;
2. let S be the collection consisting of top-l candidate sets for (Q, D, k); initially, S
contains only the set S1 that consists of the first k tuples in the sorted Q(D);
obviously, S is the top-1 candidate set; moreover, let l = 1;
3. while |S| < r, do the following:
a. let S ′ = FindNext(Q, D, Fmono , S, S ′ , k, l, l′ );
b. let S be S ′ ;
4. when |S| = r, check whether condition (a) or (b) given above is satisfied by S and
U ; if so, return “yes”; otherwise return “no”.
We show that the algorithm is correct and is in PTIME. Obviously it is correct if and
only if FindNext(Q, D, Fmono , S, S ′ , k, l, l′ ) finds top-l′ candidate sets. Assume w.l.o.g.
that |Q(D)| > k. We show that FindNext finds top-l′ candidate sets by induction on l
such that when FindNext finds top-l′ candidate sets S ′ based on the top-l candidate
sets S (where 1 < l < l′ ), then FindNext can find the top-l′′ candidate sets S ′′ based on
S ′ for some l′′ > l′ . Obviously, we need only to consider the case when l′′ ≤ r, i.e., S ′′
contains all the new candidate sets with the highest Fmono value found by FindNext.
When l = 1, S contains only the set S1 that consists of the first k tuples in the sorted
Q(D). Based on S, FindNext finds all sets V by replacing tuples t ∈ S1 one at a time with
a tuple s ∈ Q(D)\S1 , where v(s) ≤ v(t). Denote by E the collection consisting of all such
sets V that have the highest Fmono value. Let S ′ = S∪E. We show that for any candidate
set V ′ ∈
/ S ′ , Fmono (V ′ ) ≤ Fmono (S) for each set S ∈ S ′ . Note that Fmono (V ′ ) ≤ Fmono (S1 )
since S1 is the top-1 candidate set. Now we only need to prove that Fmono (V ′ ) ≤ Fmono (S)
for each set S ∈ E. Consider the following two cases. (a) There exist tuples t ∈ S1 and
s ∈ Q(D) \ S1 such that v(s) ≤ v(t) and V ′ is obtained by replacing t in S1 with s.
Recall that V ′ ∈
/ E since V ′ ∈
/ S ′ . Thus we have that Fmono (V ′ ) ≤ Fmono (S) for each set
S ∈ E, since all sets in E have the highest Fmono value in sets obtained by an one-tuple
replacement. (b) The set V ′ is not one given in (a) above. Then V ′ must contain no more
than k − 2 tuples in S1 . Assume that V ′ is obtained by replacing tuples t1 , . . . , tj in S1
with s1 . . . , sj , where j ∈ [2, k], such that s1 , . . . , sj are not in S1 and for each i ∈ [1, j],
v(si ) ≤ v(ti ). Let V ′′ be the set obtained by replacing only one tuple t1 in S1 with s1 .
Obviously, Fmono (V ′ ) ≤ Fmono (V ′′ ). Moreover, we have that Fmono (V ′′ ) ≤ Fmono (S) for
each set S ∈ E, since all the sets in E have the highest Fmono value. Thus Fmono (V ′ ) ≤
Fmono (S) for each S ∈ E. Hence S ′ contains the top-(1 + |E|) candidate sets.
Assume that when l = n and n ∈ [2, r], FindNext finds the top-l′ candidate sets S ′
based on the top-l candidate sets S for some l′ > l. For the inductive step, we consider
the case when l = l′ . Let the collection E ′ consist of all sets V with the highest Fmono
ACM Transactions on Database Systems, Vol. V, No. N, Article A, Publication date: January YYYY.
A:28
Ting Deng and Wenfei Fan
value found by FindNext based on the top-l′ candidate sets S ′ , following the one-tupleat-a-time replacement strategy. Let S ′′ = S ′ ∪ E ′ . We show that S ′′ contains the
top-(l′ + |E ′ |) candidate sets. It suffices to prove that for each candidate set V ′ ⊆ Q(D)
such that V ′ ∈
/ S ′′ , Fmono (V ′ ) ≤ Fmono (S) for each S ∈ S ′′ . Note that by the induction
hypothesis, for each set S ∈ S ′ , Fmono (V ′ ) ≤ Fmono (S) since S ′ contains the top-l′ candidate sets. Thus we need only to show that Fmono (V ′ ) ≤ Fmono (S) for each set S ∈ E ′ .
This can be verified along the same lines as the argument for expanding l = 1 above,
examining cases (a) and (b) given there. This verifies the correctness of the algorithm.
We next show that the algorithm is in PTIME. Note that procedure FindNext is called
at most r − 1 times. In each call, for each set S in S, there are at most k · |Q(D)|
replacements of tuples in S. Thus, at most O(r · k · |Q(D)|) time is needed for each
call of FindNext. Putting these together, the algorithm is in O((r − 1) · r · k · |Q(D)|)
time. Since Q is fixed, r is a constant and Q(D) is bounded by a polynomial in |D|, the
algorithm is in PTIME. Therefore, the data complexity of DRP(FO, Fmono ) is in PTIME.
We remark that the algorithm given above is in pseudo-polynomial time if r is not
a constant and if r is encoded in binary. It is in PTIME when k is a constant.
2
7. THE RESULT DIVERSITY COUNTING PROBLEM
We now study the counting problem RDC. We establish its combined complexity in
Section 7.1 and data complexity in Section 7.2; we will also identify and investigate
its special cases in Section 8. Along the same lines as Fig. 1, we depict in Fig. 4 the
connections between complexity bounds of RDC in various settings.
7.1. The Combined Complexity of RDC
We first study the combined complexity of RDC. Here we use the framework of
predicate-based counting classes introduced in [Hemaspaandra and Vollmer 1995].
For a complexity class C of decision problems, #·C is the class of all counting problems
associated with a predicate RL that satisfies the following conditions:
— RL is polynomially balanced, i.e., there exists a polynomial q such that for all strings
x and y, if RL (x, y) is true then |y| ≤ q(|x|); and
— the following decision problem is in class C: “given x and y, it is to decide whether
RL (x, y)”.
A counting problem is to compute the cardinality of the set {y | RL (x, y)}, i.e., it is
to find how many y there are such that predicate RL (x, y) is satisfied.
We show that when the objective is for max-sum or max-min diversification, the
problem becomes harder for FO than for CQ, UCQ and ∃FO+ . In contrast, when the
objective is for mono-objective formulation, Fmono has greater impact on the complexity
than LQ : it remains #·PSPACE-complete when LQ ranges over CQ, UCQ, ∃FO+ and FO.
This is consistent with its counterparts for QRD and DRP.
The results are verified by parsimonious reductions. A parsimonious reduction from
a counting problem #A to a counting problem #B is a PTIME function σ such that for
all x, |{y | (x, y) ∈ A}| = |{z | (σ(x), z) ∈ B}|, i.e., σ is a bijection [Durand et al. 2005].
Below we first study RDC(LQ , F ) when F is FMS or FMM .
T HEOREM 7.1. The combined complexity of RDC(LQ , FMS ) and RDC(LQ , FMM ) is
— #·NP-complete when LQ is CQ, UCQ or ∃FO+ , and
— #·PSPACE-complete when LQ is FO.
All the results hold under parsimonious reductions.
2
ACM Transactions on Database Systems, Vol. V, No. N, Article A, Publication date: January YYYY.
On the Complexity of Query Result Diversification
#
A:29
PSPACE-complete
(Th. 7.1)
#PSPACE-complete
(Th. 7.2)
FO, combined
# NP-comple
(Th. 7.1)
CQ/∃FO +,
combined
#P-complete
(Turing)
(Th. 8.2)
CQ/FO, λ=0,
F MS, data,
CQ/FO, combined
#P-complete
(parsimonious )
(Th. 7.4)
#P-complete
(Turing)
(Th. 7.5)
CQ/FO, data
CQ/FO, data
FP
(Th. 8.2)
FP
(Cor. 8.4)
FP
(Cor. 8.4)
CQ/FO, λ=0,
FMM, data
CQ/FO,
constant k, data
CQ/FO,
constant k, data
# NP-complete
(Th. 8.2)
CQ/∃FO+,
λ=0, combined
#P-complete
(Turing)
(Cor. 8.1)
CQ/FO, identity
queries, combined
#P-complete
(Turing)
(Th. 8.2)
CQ/FO, λ=0,
data
(b) When F is Fmono
(a) When F is FMS or FMM
Fig. 4. The complexity bounds of RDC
I01
X
= 1
0
B
0
I∨ = 1
1
1
A1
0
0
1
1
A2
0
1
0
1
B
0
I∧ = 0
0
1
A1
0
0
1
1
A2
0
1
0
1
A Ā
I¬ = 0 1
1 0
Fig. 5. Relation instances used in the lower bound proofs of Theorem 7.1.
The lower bounds are verified by reductions from the following problems.
(1) #Σ1 SAT: Given an existentially quantified Boolean formula of the form
ϕ(X, Y ) = ∃Xψ(X, Y ), where ψ(X, Y ) is of the form C1 ∧ . . . ∧ Cl , and Ci is disjunctions
of variables or negated variables taken from X = {x1 , . . . , xm } and Y = {y1 , . . . , yn },
it is to count the number of truth assignments of Y that satisfy ϕ. It is known that
#Σ1 SAT is #·NP-complete [Durand et al. 2005].
(2) #QBF: Given a Boolean formula of the form ϕ = ∃X ∀y1 P2 y2 · · · Pn yn ψ, where
Pi ∈ {∃, ∀} for i ∈ [2, n], and ψ is quantifier-free Boolean formula over the variables in
X ={x1 , . . . , xm } and Y = {y1 , . . . , yn }, it is to count the number of truth assignments
of X variables that satisfy ϕ. It is known to be #·PSPACE-complete [Ladner 1989].
Given these, we prove Theorem 7.1 as follows.
P ROOF. We start with RDC for CQ, UCQ and ∃FO+ , and then study them for FO.
(1) When LQ is CQ, UCQ or ∃FO+ . It suffices to show that RDC(CQ, FMS ) and RDC(CQ,
FMM ) are #·NP-hard and that RDC(∃FO+ , FMS ) and RDC(∃FO+ , FMM ) are in #·NP.
(1.1) Lower bound. We first show that RDC(CQ, FMS ) is #·NP-hard even when λ = 0
and k is a constant, by parsimonious reductions from #Σ1 SAT. Given an instance
ϕ(X, Y ) = ∃Xψ(X, Y ) of #Σ1 SAT, we define a database D, a CQ query Q, functions δrel ,
δdis and FMS , a positive integer k and a real number B. We show that the number of
valid sets for (Q, D, k, FMS , B) equals the number of truth assignments of Y variables
that satisfy ϕ. In particular, we set k = 2 and B = 3.
(1) The database consists of four relations I01 , I∨ , I∧ and I¬ as shown in Fig. 5, specified by schemas R01 (X), R∨ (B, A1 , A2 ), R∧ (B, A1 , A2 ) and R¬ (A, Ā), respectively. Here
ACM Transactions on Database Systems, Vol. V, No. N, Article A, Publication date: January YYYY.
A:30
Ting Deng and Wenfei Fan
I01 encodes the Boolean domain, and I∨ , I∧ and I¬ encode disjunction, conjunction and
negation, respectively, such that we can express ϕ′ below in CQ with these relations.
(2) To define the CQ query Q, we first construct a new formula ϕ′ as follows:
ϕ′ (~y ) = ∃~x, z (ψ(~x, ~y) ∨ z) ∧ z̄ ,
where z is a new variable not in X ∪ Y . It can be verified that a truth assignment µY
of Y variables satisfies ϕ if and only if µY makes ϕ′ true when z is set to be 0.
We now define the CQ query Q as follows:
Q(~y , z, a) = ∃~x QX (~x) ∧ QY (~y ) ∧ R01 (z) ∧ Q1 (~x, ~y , z, a) .
Here ~x = (x1 , . . . , xm ) and ~
y = (y1 , . . . , yn ). Queries QX and QY generate all truth
assignments of variables in X and Y , respectively, by means of Cartesian products of
R01 . Sub-query Q1 leverages R∨ , R∧ and R¬ to encode the formula ϕ′ . The semantics
of Q1 is that for a given truth assignment µX of variables in X, µY for Y and µz for z,
which are encoded by tuples tX , tY and tZ , respectively, Q1 (tX , tY , tz , a) returns a = 1
if (ψ ∨ z) ∧ z̄ is satisfied by µX , µY and µz ; and it returns a = 0 otherwise. Intuitively,
Q returns all tuples (tY , tz , a) such that Q1 returns a = 1 if the truth assignment µY
of Y variables (encoded by tY ) and µz of z (represented by tz ) satisfy ϕ′ .
(3) We define (a) δrel ((tY , 0, 1), Q) = 1, (b) δrel ((1, . . . , 1, 0), Q) = 2 and (c) δrel (t, Q) = 0
for any other tuple t of RQ . Furthermore, we define δdis as a constant function that
returns 0 for each pair of tuples of RP
Q , and let λ = 0. Then for each set U of tuples of
RQ with k tuples, FMS (U ) = (k − 1) · t∈U δrel (t, Q).
We show that the construction above is indeed a parsimonious reduction. To see
this, observe that for each set U ⊆ Q(D) such that |U | = 2, FMS (U ) ≥ 3 if and only
if U = {(tY , 0, 1), (1, . . . , 1, 0)}, by the definition of FMS and δrel given above, where tY
encodes a truth assignment of Y variable that satisfies ϕ′ with z = 0. Moreover, as
discussed earlier, a truth assignment µY makes ϕ true if and only if µY satisfies ϕ′
when z = 0. Thus, the number of truth assignments of Y variables that satisfy ϕ
equals the number of valid sets U for (Q, D, k, FMS , B).
We next show that RDC(CQ, FMM ) is #·NP-hard, also by parsimonious reduction from
#Σ1 SAT. Given an instance ϕ(X, Y ) = ∃Xψ(X, Y ) of #Σ1 SAT, we define the same
query ϕ′ , database D, query Q and function δdis as given above. Furthermore, we define
δrel ((tY , 0, 1), Q) = 1 and for any other tuple t of RQ , δrel (t, Q) = 0. Let λ = 0 and k = 1.
Then for each set U of tuples of RQ , FMM (U ) = mint∈U δrel (t, Q). Finally, we set B = 1.
We show that this also makes a parsimonious reduction. For each set U ⊆ Q(D) such
that |U | = k = 1 and FMM (U ) ≥ B = 1, U consists of a single tuple (tY , 0, 1) such that
tY encodes a truth assignment of Y variables that satisfies ϕ. So the number of valid
sets for (Q, D, k, FMM , B) equals the number of truth assignments of Y that satisfies ϕ.
(1.2) Upper bound. We show that RDC(∃FO+ , F ) is in #·NP, when F is FMS or FMM .
It suffices to show that it is in NP to verify whether a given set U is valid for
(Q, D, k, F, B), by the definition of #·NP. Indeed, given a set U , it is in NP to check
whether U ⊆ Q(D) for ∃FO+ , in PTIME to check whether |U | = k, and in PTIME to
check whether F (U ) ≥ B since F is PTIME computable when F is FMS or FMM . Thus
RDC(∃FO+ , F ) is in #·NP.
(2) When LQ is FO. We next study RDC(FO, FMS ) and RDC(FO, FMM ).
(2.1) Lower bound. We verify that RDC(FO, FMS ) and RDC(FO, FMM ) are #·PSPACE-hard
even when λ = 0 and k is a constant, by parsimonious reductions from #QBF.
We start with RDC(FO, FMS ). Given an instance ϕ of #QBF as described above, we
construct a database D, a query Q, and functions δrel , δdis and FMS , and set k = 2 and
ACM Transactions on Database Systems, Vol. V, No. N, Article A, Publication date: January YYYY.
On the Complexity of Query Result Diversification
A:31
B = 3. We show that the number of valid sets for (D, Q, k, FMS , B) equals the number
of truth assignments of X that satisfies ϕ.
(1) The database consists of the four relations I01 , I∨ , I∧ and I¬ shown in Fig. 5,
specified by schemas R01 (X), R∨ (B, A1 , A2 ), R∧ (B, A1 , A2 ) and R¬ (A, Ā), respectively.
(2) To define the query Q, we construct a new formula ϕ′ from ϕ as before:
ϕ′ = ∃X ∀y1 P2 y2 · · · Pn yn (ψ ∨ z) ∧ z̄ .
where z is a new variable not in X ∪ Y . A truth assignment µX of X satisfies ϕ if and
only if µX make ϕ′ true with z = 0. Let ψ ′ = (ψ∨z)∧z̄. We define the query Q as follows:
Q(~x, z, b) = ∀y1 P2 y2 · · · Pn yn QX (~x) ∧ QY (~y ) ∧ R01 (z) ∧ Qψ′ (~x, ~y, z, b) .
Here ~x = (x1 , . . . , xm ) and ~y = (y1 , . . . , yn ). Queries QX and QY generate all truth
assignments of variables in X and Y , respectively, by means of Cartesian products
of R01 . Furthermore, query Qψ′ (~x, ~y , z, b) encodes the truth value of ψ ′ for given truth
assignments µX of X variables, µY of Y variables and µz of variable z, such that it
returns b = 1 if ψ ′ is true under µX , µY and µz ; otherwise it returns b = 0. Intuitively,
Q returns all tuples (tX , tz , b) such that Q returns b = 1 if the truth assignments µX
(encoded by tX ) and µz (encoded by tz ) satisfy ϕ′ ; and Q returns b = 0 otherwise.
(3) We define (a) δrel ((tX , 0, 1), Q) = 1, (b) δrel ((1, . . . , 1, 0), Q) = 2 and (c) for any other
tuple t of RQ , δrel (t, Q) = 0. Furthermore, we define δdis as a constant function that
returns 0 for each pair of tuples of RQ . Finally,
P we set λ = 0. Then for each set U of
tuples of RQ with k tuples, FMS (U ) = (k − 1) · t∈U δrel (t, Q).
Then along the same line as the proof of RDC(CQ, FMS ) given earlier, one can readily
verify that the number of truth assignments of X variables that satisfy ϕ is equal to
the number of sets U valid for (Q, D, FMS , k, B).
We now show that RDC(FO, FMM ) is #·PSPACE-hard, also by parsimonious reduction
from #QBF. Given an instance ϕ of #QBF, we construct the same formula ϕ′ , database
D, query Q, and function δdis as above. Moreover, we define δrel ((tX , 0, 1), Q) = 1 for
each m-arity tuple tX that encodes a truth assignment of X variables, and δrel (t, Q) = 0
for any other tuple t of RQ . Let λ = 0, k = 1 and B = 1. Then for each set U of
consisting of k tuples of RQ , FMM (U ) = mint∈U δrel (t, Q). Then following the proof for
RDC(FO, FMS ), one can verify that the number of truth assignments of X variables
that satisfy ϕ equals the number of sets U valid for (Q, D, FMM , k, B).
(2.2) Upper bound. To verify that RDC(FO, F ) is in #·PSPACE for FMS and FMM , we
only need to show that verifying whether a given set U is valid is in PSPACE. Indeed,
given a set U , it is in PSPACE to check whether U ⊆ Q(D) and it is in PTIME to check
whether |U | = k and F (U ) ≥ B when F is FMS or FMM .
2
We next investigate the combined complexity of RDC(LQ , F ) when F is Fmono .
T HEOREM 7.2. The combined complexity of RDC(LQ , Fmono ) is #·PSPACE-complete
under parsimonious reductions, when LQ is CQ, UCQ, ∃FO+ or FO.
2
We verify the lower bound by parsimonious reduction from #QBF. The proof
extends the counting argument given in the proof of Theorem 5.2. Given an instance ϕ = ∃X∀y1 P2 y2 . . . Pn yn ψ(X, Y ) of #QBF over variables X = {x1 , . . . , xm } and
Y = {y1 , . . . , yn }, we define a distance function δ ∗∗ similar to its counterpart δdis given
for QRD. More specifically, let t = (u1 , . . . , um+n ) and s = (v1 , . . . , vm+n ) be any pair of
m + n-arity tuples that encode two truth assignments of X ∪ Y variables, such that
tm+l = sm+l and um+l+1 6= vm+l+1 , where tm+l and sm+l are prefixes of t and s of length
m + l, respectively. We want to use δ ∗∗ to ensure that δ ∗∗ (t, s) > 0 if and only if formula
ACM Transactions on Database Systems, Vol. V, No. N, Article A, Publication date: January YYYY.
A:32
Ting Deng and Wenfei Fan
Pl+1 yl+1 . . . Pn yn ψ is satisfied by the truth assignment encoded by tm+l , for variables
∗∗
∗∗
x1 , . . . , xm , y1 , . . . , yl . We define δdis
by revising δdis as follows. (a) We let δdis
(t, s) = 0 if
m
m
m
m
t 6= s , i.e., t and s encode distinct truth assignments of X variables. Otherwise,
let tm be an arbitrary m-arity tuple that encodes a truth assignment of X, and
∗∗
t̆ = (tm , 1, . . . , 1). For any tuple s with sm = tm , we define (b) δdis
(t̆, s) = (1/2) · δdis (t̆, s)
m
∗∗
m
if s = (t , 1, v2 , . . . , vn ); (c) δdis (t̆, s) = 4 · δdis (ŭ, s) if s = (t , 0, v2 , . . . , vn ); and moreover,
∗∗ ′ ′
(d) for any other pair of tuples t′ and s′ , we let δdis
(t , s ) = δdis (t′ , s′ ).
It is easy to see that for each pair of t and s of (m + n)−arity tuples given above,
∗∗
δdis (t, s) = 1 if and only if δdis
(t, s) > 0, for l ∈ [0, n − 1]. Moreover, along the same line
∗∗
as Lemma 5.3, one can readily verify the following property of δdis
(t, s).
L EMMA 7.3. Consider an instance ϕ = ∃X∀y1 P2 y2 . . . Pn yn ψ(X, Y ) of #QBF, a truth
l
assignment µm
X of X variables and a truth assignment µY of variables in {y1 , . . . , yl },
for some l ∈ [0, n − 1]. For any pair of tuples t = (u1 , . . . , um+n ) and s = (v1 , . . . , vm+n )
l
that encode truth assignments of variables in ϕ, if tm+l = sm+l = (µm
X , µY ) but
m
l
um+l+1 6= vm+l+1 , then Pl+1 yl+1 . . . Pn yn ψ is true under µX and µY if and only if
∗∗
l
δdis
(t, s) > 0, where tm+l and sm+l both encode µm
2
X and µY .
Based on Lemma 7.3, we next prove Theorem 7.2.
P ROOF. It suffices to show that RDC(CQ, Fmono ) is #·PSPACE-hard and RDC(FO,
Fmono ) is in #·PSPACE. The lower bounds holds even when λ = 1 and k = 1.
(1) Lower bound. We verify the lower bound by parsimonious reduction from #QBF.
Given an instance ∃X∀y1 P2 y2 . . . Pn yn ψ(X, Y ) of #QBF, where X = {x1 , . . . , xm } and
∗∗
Y = {y1 , . . . , yn }, we construct a database D, a CQ query Q, functions δrel , δdis
and Fmono ,
a positive integer k and a real number B, such that the number of truth assignments
of X that satisfy ϕ equals the number of valid sets U for (Q, D, k, Fmono , B). Intuitively,
the reduction assures that for each truth assignment µX of X variables that satisfies
ϕ, µX corresponds to a tuple (tX , 1 . . . , 1) such that U = {(tX , 1 . . . , 1)} is valid for
(Q, D, k, Fmono , B), where tX is an m-arity tuple encoding µX ; moreover, there exists no
other valid set for (Q, D, k, Fmono , B).
(1) The database D consists of a single relation I01 = {(0), (1)} of Fig. 5, specified by
schema R01 (X) and encoding the Boolean domain.
(2) We define the CQ query Q as follows:
^
^
R01 (yj ).
R01 (xi ) ∧
Q(~x, ~y ) =
i∈[1,m]
j∈[1,n]
Here ~x = (x1 , . . . , xm ) and ~y = (y1 , . . . , yn ). That is, Q generates all truth assignments
of variables in X ∪ Y . Obviously, |Q(D)| = 2m+n .
(3) We define relevance function δrel to be a constant function that returns 1 for any set
U of tuples of RQ , and use the distance function δ ∗∗ given above. We set λ = 1, k = 1
and B = 2n+1 /(2m+n − 1). ThenPfor any set U consisting of k tuples of RQ , we have
∗∗
(t, s).
that Fmono (U ) = (1/(2m+n − 1))· t∈U,s∈Q(D) δdis
We next show that the number of truth assignments of X that satisfy ϕ is the same
as the number of valid sets U for (Q, D, k, Fmono , B). More specifically, we verify that
m
if a truth assignment µm
X of X variables satisfies ϕ, then {(t , 1 . . . , 1)} is a valid set
m
m
for (Q, D, k, Fmono , B), where t encodes µX , and moreover, there exists no other tuple
s such that sm encodes µm
X and {s} is valid for (Q, D, k, Fmono , B). For if it holds, then
the encoding makes a parsimonious reduction from #QBF.
ACM Transactions on Database Systems, Vol. V, No. N, Article A, Publication date: January YYYY.
On the Complexity of Query Result Diversification
A:33
Assume first that the truth assignment µm
X of X variables satisfies ϕ. Let t̆ =
(t̆ , 1, . . . , 1) such that t̆m encodes µm
.
Then
by Lemma 7.3, for each tuple s =
X
∗∗
(t̆m , 0, v2 , . . . , vn ), we have that δdis
(t̆, s) = 4. Note that there exist 2n−1 such tuP
∗∗
n−1
= 2n+1 . Thus Fmono ({t̆}) ≥
ples s. Then we have that
s∈Q(D) δdis (t̆, s) ≥ 4 · 2
n+1
m+n−1
2
/(2
− 1) ≥ B. Therefore, {t̆} is a valid set for (Q, D, k, Fmono , B). We next
prove that for any other tuple t that is distinct from t̆ such that tm = t̆m (i.e., tm
∗∗
encodes µm
X ), we have that Fmono ({t}) < B. Indeed, by the definition of δdis , for each
m
n
tuple t = (t̆ , um+1 , . . . , um+n ) such that t 6= t̆, there are at most 2 − 1 tuples s such
∗∗
that δdis
(t, s) > 0, and moreover, for each such tuple s, sm = tm . Consider the followm
m
ing two cases. For any tuple t = (u1 , .P
. . , um+n ) such that
P t 6= t̆ and ∗∗t = t̆ ,∗∗(a) if
∗∗
∗∗
um+1 = 0, then by the definition of δdis , s∈Q(D) δdis (t, s) = s∈Q(D)\{t̆} δdis (t, s) + δdis (t, t̆)
n+1
≤ 2n − 2P+ 4 = 2n + 2 < 2n+1
when n = 1); and (b) othP (resp. = 2 ∗∗ ), when∗∗n > 1 (resp.
n
∗∗
erwise s∈Q(D) δdis (t, s) = s∈Q(D)\{t̆} δdis (t, s)+δdis (t, t̆) ≤ 2 −2+1/2 = 2n −3/2 < 2n+1 .
As a result, for each truth assignment µm
X that satisfies ϕ, there exists one and only one
tuple t̆ = (t̆m , 1, . . . , 1) such that {t̆} is valid for (Q, D, k, Fmono , B), where t̆m encodes µm
X.
m
(2.2) Upper bound. We show that RDC(FO, Fmono ) is in #·PSPACE. It suffices to prove
that it is in PSPACE to verify whether a given set U is valid for (Q, D, k, Fmono , B).
Indeed, consider the algorithm given in the proof of Theorem 5.2 for QRD(FO, Fmono ).
Then its steps 4 and 5 can be used to check whether a given set U is valid. As shown
there, steps 4 and 5 are in PSPACE. Therefore, RDC(FO, Fmono ) is in #·PSPACE.
2
7.2. The Data Complexity of RDC
We next show that fixing queries reduces the complexity of RDC, to an extent:
(1) when F is FMS or FMM , the problem becomes #P-complete under parsimonious reductions, down from #·NP-complete (for CQ, UCQ and ∃FO+ ) and #·PSPACE-complete
(for FO), as opposed to Theorem 7.1; and
(2) when F is Fmono , RDC(LQ , F ) is #P-complete under polynomial Turing reductions,
rather than #·PSPACE-complete, in contrast to Theorem 7.2.
Here #P is the class of functions that count the number of accepting paths of
nondeterministic PTIME Turing machines, in the machine-based framework of
[Valiant 1979]. It is known that #P = #·P [Durand et al. 2005], where #·P is the
predicate-based counting class defined with a PTIME predicate (see Section 7.1).
Recall that a counting problem #A is polynomial Turing reducible to #B if there
exists a PTIME function σ such that for all x, |{y | (x, y) ∈ A}| is PTIME computable
by making multiple calls to an oracle that computes |{z | (σ(x), z) ∈ B}|. We have so
far used parsimonious reductions (Section 7.1), which are stronger than polynomial
Turing reductions, i.e., a parsimonious reduction from #A to #B is also a polynomial
Turing reduction from #A to #B, but not necessarily vice versa.
Note that a parsimonious reduction from #A to #B is also a PTIME reduction from
its decision problem A to the decision problem B of #B. Hence if the decision problem
B of #B is in P, #B cannot be #P-complete under parsimonious reductions since for
many NP-complete problems, e.g., 3SAT, their counting problems are in #P. This is
precisely the case for RDC(LQ , Fmono ) when LQ is CQ, UCQ, ∃FO+ or FO (data complexity). Indeed, its decision problem QRD(LQ , Fmono ) is in PTIME (Theorem 5.4). Thus
RDC(LQ , Fmono ) is #P-complete under polynomial Turing reductions, but not under
parsimonious reductions. In contrast, for FMS or FMM , RDC(LQ , F ) is #P-complete
under parsimonious reductions.
ACM Transactions on Database Systems, Vol. V, No. N, Article A, Publication date: January YYYY.
A:34
Ting Deng and Wenfei Fan
We first study the data complexity of RDC(LQ , FMS ) and RDC(LQ , FMM )
T HEOREM 7.4. the data complexity of RDC(LQ , FMS ) and RDC(LQ , FMM ) is
#P-complete under parsimonious reductions for CQ, UCQ, ∃FO+ and FO.
2
The lower bound is verified by parsimonious reduction by #SAT: given an instance
ϕ(X) = C1 ∧ · · · ∧ Cl of 3SAT over variables X = {x1 , . . . , xm }, #SAT is to count
the number of truth assignments of X that satisfy ϕ. It is known that #SAT is
#P-complete (cf. [Papadimitriou 1994]). Given these, we prove Theorem 7.4 as follows.
P ROOF. It suffices to show that RDC(CQ, FMS ) and RDC(CQ, FMM ) are #P-hard
under parsimonious reductions, and that RDC(FO, FMS ) and RDC(FO, FMM ) are in #P.
(1) Lower bound. We show that RDC(CQ, FMS ) is #P-hard, even when λ = 1, by
parsimonious reduction from #SAT. Given an instance ϕ(X) of #SAT, we define the
same D, Q, λ, δrel , δdis and FMS as their counterparts given in the proof of Theorem 5.4
for QRD(CQ, FMS ), and let k = l and B = l · (l − 1). Recall that in that proof, for each
set U ⊆ Q(D) such that |U | = l, FMS (U ) ≥ l · (l − 1) if and only if tuples in U encode a
truth assignment of X variables that satisfies ϕ. Thus the number of valid sets for (D,
Q, k, FMS , B) equals the number of truth assignments of X that satisfy ϕ.
We next show that RDC(CQ, FMM ) is #P-hard, also by parsimonious reduction from
#SAT. Given an instance ϕ of #SAT, we construct the same D, Q, λ, δrel , δdis and FMM
as their counterparts used in the proof of Theorem 5.4 for QRD(CQ, FMM ), and let
k = l and B = 1. Recall that in that proof, for each set U ⊆ Q(D) such that |U | = l,
FMM (U ) ≥ 1 if and only if the tuples in U encode a truth assignment of X variables
that satisfies ϕ. Thus the number of valid sets for (D, Q, k, FMM , B) is equal to the
number of truth assignments of X variables that satisfy ϕ.
(2) Upper bound. We show that RDC(FO, F ) is in #P for max-sum and max-min diversification. To do this, we only need to show that it is in PTIME to verify whether a given
set U is valid for (Q, D, k, F, B), when F = FMS or F = FMM , by the definition of #·P
(recall that #·P = #P). Indeed, given a set U , it is in PTIME to check whether U ⊆ Q(D)
for a fixed query Q in FO, and is in PTIME to check whether |U | = k. Moreover, checking
whether F (U ) ≥ B is also in PTIME when F is FMS or FMM . Thus RDC(FO, F ) is in #P. 2
We next investigate the data complexity of RDC(LQ , Fmono ).
T HEOREM 7.5. The data complexity of RDC(LQ , Fmono ) is #P-complete under
Turing reductions for CQ, UCQ, ∃FO+ and FO.
2
To show the lower bound, we reduce from the following problems, and use a lemma.
(1) #SSP (the #subset sum problem): Given a finite set W , a function π : W → N
and aPnatural number d ∈ N, it is to count the number of subsets T ⊆ W such
that
w∈T π(w) = d. It is known that #SSP is #P-complete under parsimonious
reduction [Berbeglia and Hahn 2010].
(2) #SSPk: Given a finite set W , a function π : W → N and natural numbers
d, l ∈ N,
P
#SSPk is to count the number of subsets T ⊆ W such that |T | = l and w∈T π(w) = d.
We first verify that #SSPk is #P-hard by parsimonious reduction from #SSP and
then show that RDC(CQ, Fmono ) is #P-hard by Turing reduction from #SSPk.
L EMMA 7.6. The #SSPk problem is #P-complete under parsimonious reductions. 2
ACM Transactions on Database Systems, Vol. V, No. N, Article A, Publication date: January YYYY.
On the Complexity of Query Result Diversification
A:35
P ROOF. To see that #SSPk is in #P, observe that given a finite set W , a function
π : WP→ N, a subset T ⊆ W , d ∈ N and l ∈ N, it is in PTIME to check whether |T | = l
and w∈W π(w) = d. Thus #SSPk is in #P by the definition of #·P (recall #·P = #P).
We next show that #SSPk is #P-hard by parsimonious reduction from #SSP. Given
an instance W , π andP
d of #SSP, we construct W ′ , π ′ , d′ and l such that the number of
′
′
′
subsets
T
⊆
W
with
w∈T π(w) = d equals the number of T ⊆ W with |T | = l and
P
′
′
′
Let W = {w1 , . . . , wn } (hence |W | = n). Denote
w ′ ∈T ′ π (w ) = d . P
P by m the number
of decimal digits in w∈W π(w). Then for each subset T of W , w∈T π(w) can be also
represented as an m-digit integer. We next give the reduction.
(1) We define W ′ = {(wi , 1), (wi , 0) | i ∈ [1, n]}. That is, we include two elements (wi , 1)
and (wi , 0) in W ′ for each wi ∈ W , where i ∈ [1, n].
(2) We define π ′ as a function from W ′ to N. Intuitively, for each w′ ∈ W ′ , we define
π ′ (w′ ) to be an n + m-digit integer, where the first n digits in π ′ (w′ ) encode wi for
i ∈ [1, n], where w′ = (wi , 1) or w′ = (wi , 0); moreover, the last m-digit integer in
π ′ (w′ ) either equals π(wi ) for some wi if w′ = (wi , 1), or equals 0 when w = (wi , 0).
More specifically, for each w′ ∈ W ′ , we define π ′ (w′ ) to be an n + m-digit integer
u1 . . . un v1 . . . vm , such that (a) if w′ = (wi , 1) for some wi ∈ W , then ui = 1, and for each
j ∈ [1, n], if j 6= i, then uj = 0; and moreover, the m-digit integer v1 . . . vm equals π(wi );
and (b) if w′ = (wj , 0) for some wj ∈ W , then uj = 1, and for each i ∈ [1, n], if i 6= j, then
ui = 0, and furthermore, for all i ∈ [1, m], vi = 0.
∗
(3) We define d′ to be an n + m-digit integer u∗1 . . . u∗n v1∗ . . . vm
, where for each i ∈ [1, n],
∗
∗
∗
ui = 1, and moreover,
the
m-digit
integer
v
.
.
.
v
equals
d.
Indeed, d is an m-digit
1
m
P
integer since d ≤ w∈W π(w). Finally, we define l = n, i.e., l = |W |.
We next show that thisPis indeed a parsimonious reduction, i.e., the number of
subsets T ⊆ W such
π(w) = d is equal to the number of subsets T ′ ⊆ W ′
P that w∈T
′
′
′
with |T | = l and w′ ∈T
that there exists a set T ′ ⊆ W ′ such
P′ π(w ) ′= d′ . Assume
′
′
that |T | = l = n and w′ ∈T ′ π (w ) = d . Then by the definitions of d′ and π ′ , the
sum P
of last m-digit integers in all π ′ (w′ ) for w′ ∈ T ′ is equal to d, and moreover,
d = (wi ,1)∈T ′ π(wi ). Let T be the set consisting of all wi such that (wi , 1) ∈ T ′ . Then
P
P
(wi ,1)∈T ′ π(wi ) = d. Conversely, assume that there exists a subset
wi ∈T π(wi ) =
P
T ⊆ W such that w∈T π(w) = d. Define T ′ = {(wi , 1) | wi ∈ T } ∪ {(wj , 0) | wj ∈ W \ T }.
Obviously,
|T ′ | P
= l = n. Again by the definitions of π ′ and d′ , one can verify that
P
′
′
′
π
(w
)
=
(wi ,1)∈T ′ π(wi ) = d . Obviously, the reduction is parsimonious.
w ′ ∈T ′
Putting these together, we have that #SSPk is #P-complete.
2
Using Lemma 7.6, we next prove Theorem 7.5.
P ROOF. It suffices to show that RDC(CQ, Fmono ) is #P-hard under polynomial
Turing reductions, and that RDC(FO, Fmono ) is in #P, even when λ = 0.
(1) Lower bounds. We show that RDC(CQ, Fmono ) is #P-hard by polynomial Turing
reduction from #SSPk, which has been shown #P-complete by Lemma 7.6. Denote
by COUNTRDC (Q, D, k, Fmono , B) the oracle that given Q, D, k, Fmono and B, returns
the number of valid sets U for (Q, D, k, Fmono , B). To show that #SSPk is polynomial
time Turing reducible to RDC(CQ, Fmono ), it suffices to show that there exists a PTIME
algorithm for computing #SSPk by calling COUNTRDC (Q, D, k, Fmono , B) a polynomial
number of times, by the notion of polynomial Turing reductions given earlier.
Observe the following. Given a finite set W , a function π : WP→ N and natural
numbers d and l, let X be the number of subsets T of P
W such that w∈T π(w) = d and
|T | = l, Y be the number of subsets T ′ of W such that w∈T ′ π(w) ≥ d and |T ′ | = l, and
ACM Transactions on Database Systems, Vol. V, No. N, Article A, Publication date: January YYYY.
A:36
Ting Deng and Wenfei Fan
P
Z be the number of subsets T ′′ of W such that w∈T ′′ π(w) ≥ d + 1 and |T ′′ | = l. Then
X = Y − Z. Based on this, we construct the polynomial Turing reduction from #SSPk to
RDC(CQ, Fmono ) as follows. Given an instance W , π, l and d of #SSPk, we first construct
Q, D, δrel , δdis
P, Fmono , k and B in PTIME, such that the number of subsets T of W with
|T | = l and w∈T π(w) ≥ d equals the number of valid sets U for (Q, D, k, Fmono , B). As
discussed above, P
we can find the solution for #SSPk, i.e., the number of subsets T of W
with |T | = l and w∈T π(w) = d, by calling the oracle COUNTRDC twice, for computing
the numbers X and Y of valid sets for (Q, D, k, Fmono , B) and (Q, D, k, Fmono , B + 1),
respectively.
We next give the transformation from #SSPk to RDC(CQ, Fmono ).
(1) The database D consists of a single relation IW = {(w) | w ∈ W } of schema RW (W ).
(2) We define query Q as the identity query on RW instances.
(3) We define δrel as follows. For each tuple t = (w) ∈ Q(D), we let δrel (t, Q) = π(w).
Furthermore, for any other tuple t′ of RQ , we define δrel (t′ , Q) = 0. Moreover, we take
δdis as a constant function that returns 0 for each pair
P of tuples of RQ . We set λ = 0,
and hence for each set U of tuples of RQ , Fmono (U ) = t∈U δrel (t, Q).
(4) Finally, we set k = l and B = d.
PWe next show that the number of subsets T of W such that |T | = l and
w∈T π(w) ≥ d equals the number of valid sets U for (Q, D, k, Fmono , B). Note
that k = l and B = d.PThen by the definition of δrel , for each set U ⊆ Q(D) such that
|U | = k, Fmono (U ) = (w)∈U π(w) ≥ B if and only if for the set T = {w | (w) ∈ U },
P
we have that |T | = l and w∈T π(w) ≥ d. From this we have a PTIME algorithm
for computing #SSPk by calling COUNTRDC (Q, D, k, Fmono , B) (denoted by X) and
COUNTRDC (Q, D, k, Fmono , B + 1) (denoted by Y ). Then X − Y is the solution for #SSPk.
(2) Upper bound. We verify that RDC(FO, Fmono ) is in #P, by showing that it is in
PTIME to verify whether a given set U is valid for (Q, D, k, Fmono , B). Indeed, Q(D)
and Fmono are both PTIME computable since Q is fixed. Thus it is in PTIME to check
whether U ⊆ Q(D), |U | = k and Fmono (U ) ≥ B. Hence RDC(FO, Fmono ) is in #P.
2
Summary. Taking the results of Sections 5, 6 and 7 together, we can find the following.
(1) Both query languages and objective functions have impact on the combined
complexity of query result diversification. More specifically, (a) when F is FMS or FMM ,
the diversification problems for FO have a higher combined complexity than their
counterparts for CQ, UCQ and ∃FO+ ; and (b) when LQ is CQ, UCQ or ∃FO+ , Fmono makes
the diversification problems harder than FMS and FMM .
(2) When the objective is given by mono-objective formulation, the objective function
dominates the combined complexity. Indeed, the combined complexity bounds of these
problems are independent of whether we take CQ, UCQ, ∃FO+ or FO as LQ .
(3) When it comes to data complexity, query languages make no difference, while
objective functions determine the complexity. Indeed, the data complexity bounds of
these problems remain unchanged when LQ is CQ, UCQ, ∃FO+ or FO. Moreover, when
the objective is given by Fmono , QRD and DRP become tractable, but it is not the case
when the objective is for FMS or FMM . That is, the data complexity is inherent to result
diversification itself, rather than a consequence of the complexity of query languages.
ACM Transactions on Database Systems, Vol. V, No. N, Article A, Publication date: January YYYY.
On the Complexity of Query Result Diversification
A:37
8. SPECIAL CASES OF QUERY RESULT DIVERSIFICATION
In this section we identify and investigate several special cases of QRD, DRP and RDC.
The reason for studying these is twofold. (1) The results of Sections 5, 6 and 7 tell
us that these problems have rather high complexity. This suggests that we find their
special yet practical cases that are tractable. (2) We want to further understand the
impact of various parameters of these problems on their complexity, including query
languages with low complexity, relevance functions δrel , distance functions δdis , and the
bound k for selecting query answers.
Due to the space constraint, we refer the interested reader to the electronic appendix
for the proofs of the results to be presented in this section and Section 9.
Identity queries. We first consider the case when LQ consists of identity queries
only, i.e., when query Q is of the following form:
Q(x̄) = R(x̄),
where R is a relation atom, and |x̄| is the arity of R. Note that for any instance D of
schema R, D = Q(D). As remarked early, in this setting QRD was shown to be NP-hard
by [Gollapudi and Sharma 2009] when the objective is for max-sum diversification
and max-min diversification. No previous work has studied QRD for mono-objective
formulation, or DRP and RDC for any of the three objective functions.
We show that identity queries reduce the complexity of these problems to an extent.
(1) When the objective is given by mono-objective formulation, QRD and DRP are
tractable, as opposed to the NP-hardness of QRD for FMS and FMM [Gollapudi and
Sharma 2009], and RDC becomes #P-complete. In contrast, these problems are
PSPACE-complete, PSPACE-complete, and #·PSPACE-complete (Theorems 5.2, 6.2 and
7.2), respectively, when LQ is CQ. This further verifies that query languages have
impact on the complexity of diversification.
(2) In contrast, when the objective is for max-sum or max-min diversification, the
combined complexity and data complexity of these problems are the same as their
counterparts when LQ is CQ. In other words, in this setting, query languages with a low
complexity (for its membership problem) do not simplify the analyses of diversification.
C OROLLARY 8.1. For identity queries, the combined complexity and data complexity of QRD, DRP and RDC coincide. More specifically,
— QRD(LQ , FMS ) and QRD(LQ , FMM ) are NP-complete,
— DRP(LQ , FMS ) and DRP(LQ , FMM ) are coNP-complete, and
— RDC(LQ , FMS ) and RDC(LQ , FMM ) are #P-complete under parsimonious reductions,
for both combined complexity and data complexity, while
— QRD(LQ , Fmono ) is in PTIME,
— DRP(LQ , Fmono ) is in PTIME, and
— RDC(LQ , Fmono ) is #P-complete under polynomial Turing reductions,
for both combined complexity and data complexity, which are the same as their data
complexity given in Theorems 5.4, 6.4 and 7.5, respectively.
2
When λ = 0. We next focus on the impact of the relevance and diversity requirements
on the complexity of query result diversification analyses. We first consider the case
when λ = 0, i.e., the objective function F is defined in terms of the relevance function
δrel only. We find that the diversity requirement has higher impact on the complexity
than relevance. Indeed, dropping distance functions δdis simplifies the analyses of these
problems to an extent. This is consistent with the observation of [Vieira et al. 2011].
(1) When the objective function is FMS or FMM , QRD and DRP become tractable for
ACM Transactions on Database Systems, Vol. V, No. N, Article A, Publication date: January YYYY.
A:38
Ting Deng and Wenfei Fan
fixed Q. Moreover, RDC is in FP for FMM , where FP is the class of all functions that can
be computed in PTIME (cf. [Papadimitriou 1994]). That is, these problems have lower
data complexity.
(2) When the objective is given by Fmono , the combined complexity analyses of these
problems become simpler, when LQ is CQ, UCQ or ∃FO+ .
T HEOREM 8.2. When λ = 0, For FMS and FMM , the combined complexity bounds of
QRD, DRP and RDC remain the same as their counterparts given in Theorems 5.1, 6.1
and 7.1, respectively. In contrast, when LQ is CQ, UCQ, ∃FO+ or FO, the data complexity
bounds of these problems are
— in PTIME for QRD(LQ , FMS ) and QRD(LQ , FMM ),
— in PTIME for DRP(LQ , FMS ) and DRP(LQ , FMM ), and
— #P-complete for RDC(LQ , FMS ) under polynomial Turing reductions, but in FP for
RDC(LQ , FMM ).
For Fmono , the combined complexity becomes
— NP-complete for QRD(LQ , Fmono ) when LQ is CQ, UCQ or ∃FO+ , and PSPACE-complete
when LQ is FO;
— coNP-complete for DRP(LQ , Fmono ) when LQ is CQ, UCQ or ∃FO+ , and PSPACEcomplete for FO; and
— #·NP-complete for RDC(LQ , Fmono ) when LQ is CQ, UCQ or ∃FO+ , and #·PSPACEcomplete for FO.
The data complexity bounds of these problems remain the same as their counterparts
given in Theorems 5.4, 6.4, 7.4 and 7.5, respectively, when LQ is CQ, UCQ, ∃FO+ or FO. 2
When λ = 1. In contrast to Theorem 8.2, we show below that dropping the relevance
function δrel does not simplify the analyses. Indeed, when the objective function is
defined with only the diversity function δdis , both the combined complexity and data
complexity of QRD, DRP and RDC remain the same as their counterparts when both
relevance and diversity are taken into account. This further verifies that the diversity
requirement δdis dominates the complexity of these problems. These results, however,
need new proofs and are not corollaries of the previous results.
T HEOREM 8.3. When λ = 1, the combined complexity of Theorems 5.1, 5.2, 6.1,
6.2, 7.1 and 7.2 and the data complexity of Theorems 5.4, 6.4, 7.4 and 7.5 remain
unchanged for QRD, DRP and RDC, respectively.
2
When k is a predefined constant. Finally, we study the impact of the cardinality
|U | of selected sets U of query answers on the analyses of query result diversification.
When |U | is fixed to be a predefined constant k, the result below tells us the following.
(1) When Q is also fixed, QRD, DRP and RDC are all tractable. That is, fixing the size
of U simplifies their data complexity analyses.
(2) In contrast, fixing k does not simplify the combined complexity analyses of these
problems. Indeed, all the combined complexity bounds of these problems remain the
same as their counterparts when k is not required to be a constant.
C OROLLARY 8.4. For a predefined constant k,
— the combined complexity bounds given in Theorems 5.1, 5.2, 6.1, 6.2, 7.1 and 7.2 are
unchanged for QRD, DRP and RDC, respectively; and
— the data complexity is in
— PTIME for QRD,
ACM Transactions on Database Systems, Vol. V, No. N, Article A, Publication date: January YYYY.
On the Complexity of Query Result Diversification
A:39
— PTIME for DRP, and
— FP for RDC,
no matter whether for FMS , FMM or Fmono , and for CQ, UCQ, ∃FO+ or FO.
Summary. From the results that we have got so far, we can see the following.
(1) The impact of LQ . When the objective function is given by FMS or FMM , QRD,
DRP and RDC have higher combined complexity for FO than for CQ, UCQ and
∃FO+ (Theorems 5.1, 6.1 and 7.1). In contrast, the complexity bounds remain intact
for CQ or the class of identity queries (Corollary 8.1). When considering Fmono , the
combined complexity bounds of these problems remain unchanged for all query
languages CQ, UCQ, ∃FO+ and FO (Theorems 5.2, 6.2 and 7.2). In contrast, when for
the class of identity queries, these problems become simpler (Corollary 8.1). Note
that the query languages have no impact on the data complexity of these problems
(Theorems 5.4, 6.4, 7.4, 7.5 and Corollary 8.1).
(2) The impact of δrel and δdis . The complexity of diversification also arises from the
diversity requirement. The absence of the distance function δdis simplifies (a) the data
complexity analyses of QRD and DRP for FMS or FMM , (b) the data complexity of RDC for
FMM , and (c) the combined complexity analyses of all these problems for Fmono (Theorem 8.2). In contrast, the absence of δrel has no impact on the complexity (Theorem 8.3).
(3) The impact of k. When k is a fixed constant, the data complexity analyses of QRD,
DRP and RDC become tractable, no matter whether objective function is given by FMS ,
FMM or Fmono (Corollary 8.4).
9. INCORPORATING COMPATIBILITY CONSTRAINTS
In this section, we study the impact of compatibility constraints on the analyses of
query result diversification. We first introduce a class of compatibility constraints,
and extend the diversification model of Section 3 by incorporating these constraints.
In the presence of such constraints, we then re-investigate QRD, DRP and RDC in all
the settings of the previous sections (Sections 5, 6, 7 and 8).
Compatibility constraints. We first define a class of compatibility constraints.
Consider a database D, a query Q in a language LQ , and a predefined constant m ≥ 2.
We define a class Cm of compatibility constraints on subsets U ⊆ Q(D). Let RQ denote
the schema of query results Q(D). A constraint ϕ in Cm is of the form:
∀t1 , . . . , tl : RQ χ(t1 , . . . , tl ) → ∃s1 , . . . , sh : RQ ξ(t1 , . . . , tl , s1 , . . . , sh ) .
Here (1) l and h are in the range [0, m], (2) ti and sj are tuple variables denoting a
tuple of RQ , and (3) χ and ξ are conjunctions of predicates of the form (a) ρ[A] = ̺[B]
or ρ[A] 6= ̺[B], or (b) ρ[A] = c or ρ[A] 6= c, where A and B are attributes in RQ , ρ and
̺ range over tuples ti and sj for i ∈ [0, l] and j ∈ [0, h], and c is a constant.
We say that a set U ⊆ Q(D) satisfies ϕ, denoted by U |= ϕ, if for all tuples t1 , . . . , tl
in U that satisfy the predicates in χ following the standard semantics of first-order
logic, there must exist tuples s1 , . . . sh in U such that all the predicates in ξ are also
satisfied. We say that D satisfies a set Σ of constraints in Cm if for each ϕ ∈ Σ, U |= ϕ.
Class Cm suffices to express compatibility constraints commonly found in practice,
to specify what items should be picked together when we select top-k tuples, and what
items have conflict with each other, as illustrated by the following example.
Example 9.1. Consider a query Q1 posed on database D1 to find items for shopping.
The selected items are specified by a relation schema RQ1 with attributes item, price,
ACM Transactions on Database Systems, Vol. V, No. N, Article A, Publication date: January YYYY.
A:40
Ting Deng and Wenfei Fan
etc (see Example 1.1). The compatibility constraint ρ1 below is defined on subsets of
results Q1 (D1 ). It assures that if one buys items a and b, then she also needs to buy c.
That is, in the set U of top-k items recommended, if a and b are in U , then so is c.
ρ1 = ∀ t1 , t2 : RQ1 (t1 [item] = a ∧ t2 [item] = b) → ∃ s : RQ1 (s[item] = c) .
As another example, consider a query Q2 posed on a database D2 for course selection
[Koutrika et al. 2009; Parameswaran et al. 2010]. The schema of query result Q2 (D2 ) is
denoted by RQ2 , including attributes id and title, among other things. The compatibility
constraint ρ2 below is defined on instances of RQ2 , i.e., sets U ⊆ Q2 (D2 ). It asserts that
if course CS450 is taken, then so must be its prerequisites CS220 and CS350.
ρ2 = ∀ t1 : RQ2 t1 [id] = CS450 → ∃ s1 , s2 : RQ2 (s1 [id] = CS220 ∧ s2 [id] = CS350) .
Now consider a query Q3 posed on a database D3 for basketball team formation [Lappas et al. 2009]. The schema of Q3 (D3 ) is RQ3 , including attributes id, position, etc. We
use the following constraint ρ3 to assure that at most two centers are needed for the
team, i.e., no more than three centers may be included in any top-k sets U ⊆ Q3 (D3 ).
ρ3 = ∀ t1 , t2 , t3 : RQ3 t1 [position] = center ∧ t2 [position] = center ∧
t1 [id] 6= t2 [id] ∧ t1 [id] 6= t3 [id] ∧ t2 [id] 6= t3 [id] → t2 [position] 6= center) .
Observe that compatibility constraints ϕ1 , ϕ2 and ϕ3 may not be expressible in the
query languages LQ for Q1 , Q2 and Q3 , when, e.g., LQ is CQ.
2
One can see that constraints of Cm have a form similar to tuple generating dependencies (TGDs) that have been well studied for databases (see, e.g., [Abiteboul et al. 1995]
for TGDs), except that the number of tuples in each constraint of Cm is bounded by a
predefined constant m, such as our familiar functional dependencies and inclusion dependencies, which are bounded by constant 2 and can be expressed in Cm when m ≥ 2.
Moreover, one can readily verify that constraints of Cm are in PTIME, i.e., for any set
Σ ⊆ Cm and any set U ∈ Q(D), it takes at most PTIME in |U | and |Σ| to determine
whether U |= Σ, because the number of tuples in each ϕ ∈ Σ is bounded by m.
Query result diversification revisited. We are now ready to revise query result
diversification by incorporating constraints of Cm . Given a query Q in a query language
LQ , a database D, a positive integer k, an objective function F , and in addition, a
set Σ of compatibility constraints in Cm defined on subsets of Q(D), query result
diversification in the presence of compatibility constraints aims to find a set U ⊆ Q(D)
such that (a) |U | = k, (b) F (U ) is maximum, and moreover, (c) U |= Σ. Compared to
diversification in the absence of compatibility constraints (see Section 3), the top-k set
U selected is additionally required to satisfy all the constraints in Σ.
In the presence of Σ, we revise the following notions introduced in Section 4.
(1) Given Q, D, k and Σ, we say that a set U ⊆ Q(D) is a candidate set for (Q, D, Σ, k)
if |U | = k and U |= Σ.
(2) Given a real number B and an objective function F , we say that a set U ⊆ Q(D) is
valid for (Q, D, Σ, k, F, B) if |U | = k, U |= Σ and F (U ) ≥ B.
(3) We say that rank(U ) = r for a positive integer r if there exists a collection S of r − 1
distinct candidate sets for (Q, D, Σ, k) such that (a) for all S ∈ S, F (S) > F (U ); and (b)
for any candidate set S ′ for (Q, D, Σ, k), if S ′ 6∈ S, then F (U ) ≥ F (S ′ ).
Based on these, we revise the statements of problems QRD, DRP and RDC as follows.
(1) Problem QRD(LQ , F ) is to decide, given D, Q ∈ LQ , F , B and in addition, a set Σ of
compatibility constraints in Cm , whether there exists a valid set for (Q, D, Σ, k, F, B).
ACM Transactions on Database Systems, Vol. V, No. N, Article A, Publication date: January YYYY.
On the Complexity of Query Result Diversification
A:41
(2) Problem DRP(LQ , F ) is to decide, given D, Q ∈ LQ , F , B, Σ, and a candidate set U
for (Q, D, Σ, k), whether rank(U ) ≤ r, where r is a positive integer constant.
(3) Problem RDC(LQ , F ) is to count the number of valid sets for (Q, D, Σ, k, F, B).
We next investigate these problems in the presence of compatibility constraints.
We first establish their combined complexity, and then provide their data complexity.
Finally, we study these problems in the special cases identified in Section 8.
Combined complexity. The good news is that compatibility constraints do not
complicate the combined complexity analyses of the result diversification problems.
Indeed, constraints of Cm can be validated in PTIME, and hence all the upper bounds
given in Sections 5, 6 and 7 for combined complexity remain intact.
C OROLLARY 9.2. In the presence of compatibility constraints of Cm , the combined
complexity bounds of Theorems 5.1, 5.2, 6.1, 6.2, 7.1 and 7.2 remain unchanged for
QRD, DRP and RDC.
2
Data complexity. In the presence of compatibility constraints, we study the data
complexity of these problems, i.e., when query Q and compatibility constraints Σ are
predefined and fixed, while database D may vary. We show that the presence of Σ
makes the problems harder, to an extent.
(1) When the objective is given by mono-object formulation, QRD and DRP become
NP-complete and coNP-complete, respectively, as opposed to their tractability in
the absence of compatibility constraints (Theorem 5.4 and 6.4), and DRP becomes
#P-complete under parsimonious reductions, rather than under polynomial Turing
reductions (Theorem 7.5). That is, although compatibility constraints of Cm can be
validated in PTIME, they impose additional requirements on the selection of top-k sets
based on Fmono and hence, complicate the analyses of these problems when query Q is
fixed. These data complexity results hold no matter whether for CQ, UCQ, ∃FO+ and FO.
(2) In contrast, when the objective is for max-sum or max-min diversification, the data
complexity results of these problems remain the same as their counterparts in the
absence of compatibility constraints.
T HEOREM 9.3. In the presence of compatibility constraints of Cm , the data complexity bounds of Theorems 5.4, 6.4 and 7.4 remain unchanged for QRD, DRP and RDC,
respectively, for FMS and FMM . However, for Fmono ,
— QRD becomes NP-complete;
— DRP becomes coNP-complete; and
2
— RDC becomes #P-complete under parsimonious reductions.
Special cases. We next investigate the special cases of Section 8 in the presence of
compatibility constraints of Cm . We show that compatibility constraints make the analyses of query result diversification more complicated: all the tractable cases we have
seen in Section 8 except one (when the bound k is a constant) become intractable, although the compatibility constraints of Cm are simple enough to be validated in PTIME.
Identity queries. We first consider the case when LQ consists of identity queries (see
Section 8 for the details of identity queries). In this setting, compatibility constraints
complicate the combined and data complexity analyses of query result diversification
when F is Fmono . Indeed, QRD(LQ , Fmono ), DRP(LQ , Fmono ) and DRP(LQ , Fmono ) become NP-complete, coNP-complete and #P-complete under parsimonious reductions,
respectively, for both their combined complexity and data complexity, as opposed to
PTIME, PTIME and #P-complete under polynomial Turing reductions, respectively, in
ACM Transactions on Database Systems, Vol. V, No. N, Article A, Publication date: January YYYY.
A:42
Ting Deng and Wenfei Fan
the absence of such constraints (Corollary 8.1). This is because for an identity query
Q and a database D, while Q(D) = D and Fmono (U ) for U ⊆ Q(D) is computable
in PTIME, the additional requirements imposed by compatibility constraints make
checking and counting valid sets more intricate, similar to the complication introduced
by the constraints to the data complexity analyses of these problems (Theorem 9.3).
In contrast, when F is FMS or FMM , even the data complexity analyses of these
problems are already intractable in the absence of compatibility constraints, and the
compatibility constraints do not increase their complexity bounds.
C OROLLARY 9.4. For identity queries, in the presence of compatibility constraints
of Cm , both the combined complexity and data complexity of Corollary 8.1 remain
unchanged for QRD, DRP, and RDC for FMS and FMM .
However, when it comes to Fmono ,
— QRD becomes NP-complete;
— DRP becomes coNP-complete; and
— RDC becomes #P-complete under parsimonious reductions,
for both combined and data complexity.
2
When λ = 0. We next study the impact of the compatibility constraints on the complexity of query result diversification analyses when λ = 0, i.e., when the objective
function F is defined in terms of the relevance function δrel only. The results below
tell us the following. The presence of compatibility constraints has no impact on the
combined complexity of QRD(LQ , F ), DRP(LQ , F ) and DRP(LQ , F ), but the constraints
do make the data complexity analyses harder.
C OROLLARY 9.5. For λ = 0, in the presence of compatibility constraints of Cm , the
combined complexity bounds given in Theorem 8.2 remain unchanged for QRD, DRP
and RDC, while the data complexity becomes
— NP-complete for QRD;
— coNP-complete for DRP; and
— #P-complete for RDC under parsimonious reductions,
no matter for FMS , FMM and Fmono , and for CQ, UCQ, ∃FO+ and FO.
2
When λ = 1. Similarly, when F is defined in terms of distance function δdis only,
compatibility constraints complicate the data analyses of QRD, DRP and RDC for
Fmono . For FMS and FMM , these problems are already NP-complete, coNP-complete
and #P-complete under parsimonious reductions, respectively, in the absence of
compatibility constraints (Theorem 8.3), and these data complexity bounds remain
intact in the presence of the constraints.
C OROLLARY 9.6. For λ = 1, in the presence of compatibility constraints of Cm , the
combined complexity bounds given in Theorem 8.3 remain unchanged for QRD, DRP
and RDC.
The data complexity bounds of Theorem 8.3 remain unchanged for FMS and FMM . In
contrast, for Fmono and for CQ, UCQ, ∃FO+ and FO,
— QRD is NP-complete;
— DRP is coNP-complete; and
— RDC is #P-complete under parsimonious reductions.
2
When k is a predefined constant. In contrast, compatibility constraints do not complicate the analyses of query result diversification when the bound k is a constant.
ACM Transactions on Database Systems, Vol. V, No. N, Article A, Publication date: January YYYY.
On the Complexity of Query Result Diversification
A:43
Table I. Combined complexity and data complexity ((∗) : known for the lower bound)
Objective functions
Languages
QRD(LQ , F )
(Th. 5.1, 5.2, 5.4)
Fmono
CQ, UCQ, ∃FO+
FO
CQ, UCQ, ∃FO+ , FO
NP-complete(∗)
PSPACE-complete
PSPACE-complete
FMS and FMM
CQ,UCQ,∃FO+ ,FO
NP-complete(∗)
Fmono
CQ,UCQ,∃FO+ ,FO
PTIME
FMS and FMM
Problems
DRP(LQ , F )
RDC(LQ , F )
(Th. 6.1, 6.2, 6.4)
(Th. 7.1, 7.2, 7.4,7.5)
Combined complexity
coNP-complete
#·NP-complete
PSPACE-complete
#·PSPACE-complete
PSPACE-complete
#·PSPACE-complete
Data complexity
#P-complete
coNP-complete
(parsimonious)
#P-complete
PTIME
(Turing)
C OROLLARY 9.7. For a predefined constant k, in the presence of compatibility
constraints of Cm , the combined complexity and data complexity of Corollary 8.4
remain unchanged for QRD, DRP and RDC, respectively.
2
Summary. From the results of this section, we find the impact of compatibility
constraints of Cm on the complexity of query results diversification as follows.
(1) Although the compatibility constraints of Cm can be validated in PTIME, their
presence complicates the data complexity analyses of QRD(LQ , F ), DRP(LQ , F ) and
RDC(LQ , F ), to an extent. More specifically, when these problems are tractable in the
absence of the constraints, they become intractable when the constraints are present.
The impact is particularly evident when F is Fmono (Theorem 9.3), or when λ = 1 and
F is either FMS or FMM (Corollary 9.5).
(2) When it comes to the combined complexity, the presence of compatibility constraints makes QRD(LQ , F ), DRP(LQ , F ) and RDC(LQ , F ) harder when F is Fmono and
when LQ consists of identity queries only (Corollary 9.4). The constraints have no
impact on the combined complexity analyses when F is FMS or FMM .
(3) When the bound k on the cardinality |U | of selected sets U is a constant, the
complexity results are quite robust: both the combined complexity and data complexity of QRD(LQ , F ), DRP(LQ , F ) and RDC(LQ , F ) remain intact no matter whether
compatibility constraints of Cm are present or absent (Corollary 9.7).
10. CONCLUSIONS
We have extended the result diversification model of [Gollapudi and Sharma 2009]
by incorporating queries Q, without assuming the entire set Q(D) of query answers
as input. We have identified three decision and counting problems in connection with
query result diversification, namely, QRD(LQ , F ), DRP(LQ , F ) and RDC(LQ , F ). We
have established the upper and lower bounds of these problems, all matching, for both
combined complexity and data complexity, when the query language LQ is CQ, UCQ,
∃FO+ or FO, and when F ranges over all three objective functions FMS , FMM and Fmono
given in [Gollapudi and Sharma 2009]. We have also studied special cases of these
problems, and identified tractable cases. In addition, we have investigated the impact
of compatibility constraints on the analyses of query result diversification.
The main complexity results are summarized in Table I, annotated with their
corresponding theorems. The complexity bounds of special cases are shown in Table II,
followed by Table III for the complexity results in the presence of compatibility constraints that differ from their counterparts in the absence of constraints. The tables
ACM Transactions on Database Systems, Vol. V, No. N, Article A, Publication date: January YYYY.
A:44
Ting Deng and Wenfei Fan
Table II. Special cases
Conditions
Complexity
Identity queries;
F is Fmono
λ = 0;
F is FMS
λ = 0;
F is FMM
λ = 0; CQ, UCQ, ∃FO+ ;
F is Fmono ;
k is a constant;
F is FMS ,FMM or Fmono
Combined
(Cor. 8.1)
Data
(Th. 8.2)
Data
(Th. 8.2)
Combined
(Th. 8.2)
Data
(Cor. 8.4)
QRD(LQ , F )
Problems
DRP(LQ , F )
PTIME
PTIME
PTIME
PTIME
PTIME
PTIME
FP
NP-complete
coNP-complete
#·NP-complete
PTIME
PTIME
FP
RDC(LQ , F )
#P-complete
(Turing)
#P-complete
(Turing)
Table III. Complexity results in the presence of compatibility constraints
Conditions
F is Fmono
Identity queries;
F is Fmono
λ = 0;
F is FMS ,FMM ,Fmono
λ = 1;
F is Fmono
Complexity
Data
(Th. 9.3)
Combined / Data
(Cor. 9.4)
Data
(Cor. 9.5)
Data
(Cor. 9.6)
QRD(LQ , F )
Problems
DRP(LQ , F )
NP-complete
coNP-complete
NP-complete
coNP-complete
NP-complete
coNP-complete
NP-complete
coNP-complete
RDC(LQ , F )
#P-complete
(parsimonious)
#P-complete
(parsimonious)
#P-complete
(parsimonious)
#P-complete
(parsimonious)
tell us the impact of various factors on the complexity of diversification analyses, such
as query languages LQ , objective functions F , relevance and distance functions δrel and
δdis , bound k on the number of answers, and compatibility constraints. As annotated
in Table I, among all these results, only the NP lower bound of QRD(LQ , F ) was known
prior to this work, when F is FMS or FMM , for CQ, UCQ and ∃FO+ .
Several extensions are targeted for future work. First, diversification analyses are
mostly intractable. We need to identify more special cases that are practical and
tractable. Second, we need to develop heuristic algorithms (approximation whenever
possible) for those intractable cases. Third, the study should be extended to other
objective functions. Fourth, we have only considered a simple class Cm of compatibility
constraints that can be validated in PTIME. More expressive constraint languages
should be developed if the need for such languages emerges from practice. Finally, in
practice one may want to incorporate user preferences [Chen and Li 2007; Stefanidis
et al. 2010] into the diversification model. While we may encode certain preferences
in, e.g., the relevance and distance functions, this issue deserves a full treatment.
REFERENCES
A BITEBOUL , S., H ULL , R., AND V IANU, V. 1995. Foundations of Databases. Addison-Wesley.
A DOMAVICIUS, G. AND T UZHILIN, A. 2005. Towards the next generation of recommender systems: a survey
of the state-of-the-art and possible extensions. IEEE Trans. Knowl. and Data Eng. 17, 6, 734–749.
A GRAWAL , R., G OLLAPUDI , S., H ALVERSON, A., AND L EONG, S. 2009. Diversifying search results. In Proc.
Int. Conf. Web Search and Web Data Mining.
A MER -YAHIA , S. 2011. Recommendation projects at Yahoo! IEEE Data Eng. Bull. 34, 2, 69–77.
A MER -YAHIA , S., B ONCHI , F., C ASTILLO, C., F EUERSTEIN, E., M ÉNDEZ -D ÍAZ , I., AND Z ABALA , P. 2013.
Complexity and algorithms for composite retrieval. In Proc. Int. Conf. on World Wide Web.
ACM Transactions on Database Systems, Vol. V, No. N, Article A, Publication date: January YYYY.
On the Complexity of Query Result Diversification
A:45
B ERBEGLIA , G. AND H AHN, G. 2010. Counting feasible solutions of the traveling salesman problem with
pickups and deliveries is #P-complete. Discrete Applied Mathematics 157, 11, 2541–2547.
B ORODIN, A., L EE , H. C., AND Y E , Y. 2012. Max-sum diversification, monotone submodular functions and
dynamic updates. In Proc. ACM SIGACT-SIGMOD Symp. on Principles of Database Systems.
C APANNINI , G., N ARDINI , F. M., P EREGO, R., AND S ILVESTRI , F. 2011. Efficient diversification of Web
search results. In Proc. Int. Conf. on Very Large Data Bases.
C HEN, Z. AND L I , T. 2007. Addressing diverse user preferences in SQL-query-resul navigation. In Proc.
ACM SIGMOD Int. Conf. on Management of Data.
D EMIDOVA , E., FANKHAUSER , P., Z HOU, X., AND N EJDL , W. 2010. DivQ: Diversification for keyword
search over structured databases. In Proc. Annual Int. ACM SIGIR Conf. on Research and Development
in Information Retrieval.
D ENG, T., FAN, W., AND G EERTS, F. 2012. On the complexity of package recommendation problems. In
Proc. ACM SIGACT-SIGMOD Symp. on Principles of Database Systems.
D ROSOU, M. AND P ITOURA , E. 2009. Diversity over continuous data. IEEE Data Eng. Bull. 32, 4.
D ROSOU, M. AND P ITOURA , E. 2010. Search result diversification. SIGMOD Record 39, 1, 41–47.
D URAND, A., H ERMANN, M., AND K OLAITIS, P. G. 2005. Subtractive reductions and complete problems for
counting complexity classes. Theor. Comp. Sci. 340, 3, 496–513.
FAGIN, R., L OTEM , A., AND N AOR , M. 2003. Optimal aggregation algorithms for middleware. J. Comp. and
System Sci. 66, 4, 614–656.
F EUERSTEIN, E., H EIBER , P. A., M ART ÍNEZ -V IADEMONTE , J., AND B AEZA -YATES, R. A. 2007. New
stochastic algorithms for scheduling ads in sponsored search. In Proc. Latin American Web Congress.
F RATERNALI , P., M ARTINENGHI , D., AND T AGLIASACCHI , M. 2012. Top-k bounded diversification. In Proc.
ACM SIGMOD Int. Conf. on Management of Data.
G OLLAPUDI , S. AND S HARMA , A. 2009. An axiomatic approach for result diversification. In Proc. Int. Conf.
on World Wide Web.
H EMASPAANDRA , L. A. AND V OLLMER , H. 1995. The satanic notations: Counting classes beyond #P and
other definitional adventures. SIGACT News 26, 1, 2–13.
I LYAS, I. F., B ESKALES, G., AND S OLIMAN, M. A. 2008. A survey of top-k query processing techniques in
relational database systems. ACM Comput. Surv. 40, 4, 11:1–11:58.
J IN, W. AND PATEL , J. M. 2011. Efficient and generic evaluation of ranked queries. In Proc. ACM SIGMOD
Int. Conf. on Management of Data.
K OUTRIKA , G., B ERCOVITZ , B., AND G ARCIA -M OLINA , H. 2009. FlexRecs: expressing and combining
flexible recommendations. In Proc. ACM SIGMOD Int. Conf. on Management of Data.
L ADNER , R. E. 1989. Polynomial space counting problems. SIAM J. Comput. 18, 6, 1087–1097.
L APPAS, T., L IU, K., AND T ERZI , E. 2009. Finding a team of experts in social networks. In Proc. Int. Conf.
on Knowledge Discovery and Data Mining.
L I , C., S OLIMAN, M. A., C HANG, K. C.-C., AND I LYAS, I. F. 2005. RankSQL: Supporting ranking queries
in relational database management systems. In Proc. Int. Conf. on Very Large Data Bases.
L IU, Z., S UN, P., AND C HEN, Y. 2009. Structured search result differentiation. In Proc. Int. Conf. on Very
Large Data Bases.
M INACK , E., D EMARTINI , G., AND N EJDL , W. 2009. Current approaches to search result diversification. In
Proc. Int. Workshop on Living Web.
PAPADIMITRIOU, C. H. 1994. Computational Complexity. Addison-Wesley.
PARAMESWARAN, A. G., G ARCIA -M OLINA , H., AND U LLMAN, J. D. 2010. Evaluating, combining and
generalizing recommendations with prerequisites. In Proc. Int. Conf. on Information and Knowledge
Management.
PARAMESWARAN, A. G., V ENETIS, P., AND G ARCIA -M OLINA , H. 2011. Recommendation systems with
complex constraints: A course recommendation perspective. ACM Trans. Information Syst. 29, 4.
P ROKOPYEV, O. A., K ONG, N., AND M ARTINEZ -T ORRES, D. L. 2009. The equitable dispersion problem.
European Journal of Operational Research 197, 1, 59–67.
S CHNAITTER , K. AND P OLYZOTIS, N. 2008. Evaluating rank joins with optimal cost. In Proc. ACM
SIGACT-SIGMOD Symp. on Principles of Database Systems.
S TEFANIDIS, K., D ROSOU, M., AND P ITOURA , E. 2010. Perk: Personalized keyword search in relational
databases through preferences. In Proc. Int. Conf. on Extending Database Technology.
VALIANT, L. 1979. The complexity of computing the permanent. Theoretical Computer Science 8, 2, 189–201.
VARDI , M. Y. 1982. The complexity of relational query languages. In Proc. Annual ACM Symp. on Theory of
Computing.
ACM Transactions on Database Systems, Vol. V, No. N, Article A, Publication date: January YYYY.
A:46
Ting Deng and Wenfei Fan
V EE , E., S RIVASTAVA , U., S HANMUGASUNDARAM , J., B HAT, P., AND YAHIA , S. A. 2008. Efficient
computation of diverse query results. In Proc. Int. Conf. on Data Engineering.
V IEIRA , M. R., R AZENTE , H. L., B ARIONI , M. C. N., H ADJIELEFTHERIOU, M., S RIVASTAVA , D., J R ., C. T.,
AND T SOTRAS, V. J. 2011. On query result diversification. In Proc. Int. Conf. on Data Engineering.
X IE , M., L AKSHMANAN, L. V. S., AND W OOD, P. T. 2012. Composite recommendations: From items to
packages. Frontiers of Computer Science 6, 3, 264–277.
Y U, C., L AKSHMANAN, L., AND A MER -YAHIA , S. 2009a. It takes variety to make a world: Diversification in
recommender systems. In Proc. Int. Conf. on Extending Database Technology. 368–378.
Y U, C., L AKSHMANAN, L. V., AND A MER -YAHIA , S. 2009b. Recommendation diversification using
explanations. In Proc. Int. Conf. on Data Engineering.
Z HANG, M. AND H URLEY, N. 2008. Avoiding monotony: Improving the diversity of recommendation lists.
In Proc. ACM Conf. on Recommender Systems.
Z IEGLER , C.-N., M C N EE , S. M., K ONSTAN, J. A., AND L AUSEN, G. 2005. Improving recommendation lists
through topic diversification. In Proc. Int. Conf. on World Wide Web.
ACM Transactions on Database Systems, Vol. V, No. N, Article A, Publication date: January YYYY.
Online Appendix to:
On the Complexity of Query Result Diversification
TING DENG, RCBD and SKLSDE, Beihang University
WENFEI FAN, Informatics, University of Edinburgh, and RCBD and SKLSDE, Beihang University
A. PROOFS OF SECTION 8
C OROLLARY 8.1. For identity queries, the combined complexity and data complexity of QRD, DRP and RDC coincide. More specifically,
— QRD(LQ , FMS ) and QRD(LQ , FMM ) are NP-complete,
— DRP(LQ , FMS ) and DRP(LQ , FMM ) are coNP-complete, and
— RDC(LQ , FMS ) and RDC(LQ , FMM ) are #P-complete under parsimonious reductions,
for both combined complexity and data complexity, while
— QRD(LQ , Fmono ) is in PTIME,
— DRP(LQ , Fmono ) is in PTIME, and
— RDC(LQ , Fmono ) is #P-complete under polynomial Turing reductions,
for both combined complexity and data complexity, which are the same as their data
complexity given in Theorems 5.4, 6.4 and 7.5, respectively.
2
P ROOF. For identity queries, we first study the combined and data complexity of
QRD, DRP and RDC for FMS or FMM . We then investigate these problems for Fmono
(1) When F is FMS or FMM . The data complexity proofs of Theorems 5.4, 6.4 and 7.4
for QRD(LQ , F ), DRP(LQ , F ) and RDC(LQ , F ) when F is FMS or FMM use a fixed
identity query as Q. Hence the lower bounds hold here. Moreover, for the upper bound,
Theorems 5.1 and 6.1 tell us that QRD(LQ , F ) and DRP(LQ , F ) are in NP and coNP,
respectively, for CQ. Since CQ subsumes identity queries, QRD(LQ , F ) and DRP(LQ , F )
are in NP and coNP, respectively, for identity queries. Furthermore, the proof of
Theorem 7.1 shows that if Q(D) is PTIME computable, such as when Q is an identity
query, the problem for verifying whether a given set is valid for (Q, D, k, F, B) is in
PTIME. Hence, RDC(LQ , F ) is in #·P (i.e., #P) for identity queries.
(2) When F is Fmono . Consider the PTIME-algorithms given in the proofs of Theorems 5.4 and 6.4 for the data complexity analyses of QRD(FO, Fmono ) and
DRP(FO, Fmono ), respectively. Note that Q(D) and Fmono are PTIME computable
when Q is an identity query. Thus these PTIME algorithms also work here, and their
combined complexity and data complexity coincide to be in PTIME for identity queries.
We next consider RDC(LQ , Fmono ). It suffices to show that RDC(LQ , Fmono ) is #P-hard
for fixed identity queries, and that it is in #P for identity queries that are not
necessarily fixed. Observe the following. (a) The lower bound proof of Theorem 7.5 for
RDC(CQ, Fmono ) uses a fixed identity query as Q. Thus the lower bound holds here. (b)
It is in PTIME to check whether a given set is valid for (Q, D, k, Fmono , B) as Q(D) and
Fmono are PTIME computable for identity queries. Thus RDC(LQ , Fmono ) is in #P.
This completes the proof of Corollary 8.1.
2
c YYYY ACM 1539-9087/YYYY/01-ARTA $15.00
DOI 10.1145/0000000.0000000 http://doi.acm.org/10.1145/0000000.0000000
ACM Transactions on Database Systems, Vol. V, No. N, Article A, Publication date: January YYYY.
App–2
Ting Deng and Wenfei Fan
T HEOREM 8.2. When λ = 0, For FMS and FMM , the combined complexity bounds of
QRD, DRP and RDC remain the same as their counterparts given in Theorems 5.1, 6.1
and 7.1, respectively. In contrast, when LQ is CQ, UCQ, ∃FO+ or FO, the data complexity
bounds of these problems are
— in PTIME for QRD(LQ , FMS ) and QRD(LQ , FMM ),
— in PTIME for DRP(LQ , FMS ) and DRP(LQ , FMM ), and
— #P-complete for RDC(LQ , FMS ) under polynomial Turing reductions, but in FP for
RDC(LQ , FMM ).
For Fmono , the combined complexity becomes
— NP-complete for QRD(LQ , Fmono ) when LQ is CQ, UCQ or ∃FO+ , and PSPACE-complete
when LQ is FO;
— coNP-complete for DRP(LQ , Fmono ) when LQ is CQ, UCQ or ∃FO+ , and PSPACEcomplete for FO; and
— #·NP-complete for RDC(LQ , Fmono ) when LQ is CQ, UCQ or ∃FO+ , and #·PSPACEcomplete for FO.
The data complexity bounds of these problems remain the same as their counterparts
given in Theorems 5.4, 6.4, 7.4 and 7.5, respectively, when LQ is CQ, UCQ, ∃FO+ or FO. 2
P ROOF. When λ = 0, we first study QRD(LQ , F ), DRP(LQ , F ) and RDC(LQ , F ) for
FMS and FMM . We then investigate them when F is Fmono .
(1) When F is FMS or FMM . We start with the combined complexity analyses.
(1.1) Combined complexity. In these settings, we first prove that QRD(LQ , F ),
DRP(LQ , F ) and RDC(LQ , F ) are NP-complete, coNP-complete and #·NP-complete
for CQ, UCQ and ∃FO+ , respectively. We then show that they are PSPACE-complete,
PSPACE-complete and #·PSPACE-complete for FO, respectively.
(1.1.1) When LQ is CQ, UCQ or ∃FO+ .
(A) QRD(LQ , F ). It suffices to show that QRD(LQ , FMS ) and QRD(LQ , FMM ) are NP-hard
for CQ, and that they are in NP for ∃FO+ .
Lower bound. We show that QRD(CQ, FMS ) and QRD(CQ, FMM ) are NP-hard by
reductions from the 3SAT problem, even when λ = 0 and k is a constant.
We first consider QRD(CQ, FMS ). Given an instance ϕ of 3SAT over variables
{x1 , . . . , xm }, we define a database D, a CQ query Q, a real number B, a positive
integer k and two functions δrel and δdis (for FMS ), such that ϕ is satisfiable if and only
if there exists a valid set U for (Q, D, k, FMS , B). In particular, we take k = 2 and
B = 1. That is, U consists of two tuples only and FMS (U ) must be no less than 1.
(1) The database D is specified by a single relation schema R01 , with its corresponding
instance I01 = {(1), (0)}, encoding the Boolean domain.
(2) We define the query Q in CQ as follows:
Q(~x) = R01 (x1 ) ∧ . . . ∧ R01 (xm ).
Here ~x = (x1 , . . . , xm ) and Q generates all truth assignments of X variables. Let RQ
denote the schema of query result Q(D).
(3) For each tuple t of RQ , we define δrel (t, Q) = 1 if the truth assignment µX encoded
by tuple t makes ϕ true, and let δrel (t, Q) = 0 otherwise. Furthermore, we take δdis as
a constant function that returns 0 for each pair of tuplesP
t and t′ of RQ . We set λ = 0.
Then for each set U of k tuples of RQ , FMS (U ) = (k − 1) · t∈U δrel (t, Q) (see Section 3).
That is, FMS is defined in terms of δrel alone (and hence we use k = 2).
ACM Transactions on Database Systems, Vol. V, No. N, Article A, Publication date: January YYYY.
On the Complexity of Query Result Diversification
App–3
We show that ϕ is satisfiable if and only if there is a valid set U for (Q, D, k, FMS , B).
⇒ First assume that ϕ is satisfiable. Then there exists a truth assignment µ0X of X
variables satisfying ϕ. Let U = {t0 , t}, where tuple t0 encodes µ0X and t is an arbitrary
tuple in Q(D) that is distinct from t0 . Obviously, U ⊆ Q(D), |U | = 2, and moreover,
F (U ) ≥ 1 = B since δrel (t0 , Q) = 1. That is, U is a valid set for (Q, D, k, FMS , B).
⇐ Conversely, assume that ϕ is not satisfiable. Then for any tuple t of RQ , δrel (t, Q) =
0 by the definition of δrel . Thus for each set U of two tuples t and t′ of RQ , FMS (U ) =
0 < B by the definition of FMS . Hence there exists no valid set for (Q, D, k, FMS , B).
For QRD(CQ, FMM ), given an instance ϕ of 3SAT, we construct the same D, Q, δrel , δdis
as their counterparts given above, and let k = 1 and B = 1. Furthermore, for each set
U of tuples of RQ , we set λ = 0 and hence have that FMM (U ) = mint∈U δrel (t, Q). Then
along the same lines as above, one can readily verify that ϕ is satisfiable if and only if
there exists a valid set U for (Q, D, k, FMM , B).
Upper bound. The algorithms given in the proof of Theorem 5.1 remain intact in the
special case when λ = 0. Thus QRD(∃FO+ , FMS ) and QRD(∃FO+ , FMM ) are in NP.
(B) DRP(LQ , F ). It suffices to show that DRP(LQ , FMS ) and DRP(LQ , FMM ) are
coNP-hard for CQ and that they are in coNP for ∃FO+ .
Lower bound. We verify that DRP(CQ, FMS ) and DRP(CQ, FMM ) are coNP-hard, when
λ = 0 and k is a constant, by reductions from the complement of 3SAT, which is known
to be coNP-complete (cf. [Papadimitriou 1994]).
We first study DRP(CQ, FMS ). Given an instance ϕ = C1 ∧ . . . ∧ Cl of 3SAT over
variables X, we define a database D, a CQ query Q, functions δrel , δdis and FMS , a set
U ⊆ Q(D), and a positive integer k. We prove that rank(U ) ≤ r if and only if ϕ is not
satisfiable. In particular, we set k = 2 and r = 1. That is, the set U consists of two
tuples only and has the highest rank.
V
Before giving the reduction, we first define ϕ′ = (ϕ ∨ z) ∧ z̄ = li=1 (Ci ∨ z) ∧ z̄, where
z is a fresh variable that is not in the set X of variables in ϕ. As discussed in the
proof of Theorem 6.1 for DRP(CQ, FMS ), for a truth assignment µX of X variables, µX
satisfies ϕ if and only if µX makes ϕ′ true with z = 0, and moreover, when setting z to
be 1, ϕ′ is false under any truth assignments in X.
We next give the reduction as follows.
(1) The database consists of four relations I01 , I∨ , I∧ and I¬ as shown in Fig. 5, specified by schemas R01 (X), R∨ (B, A1 , A2 ), R∧ (B, A1 , A2 ) and R¬ (A, Ā), respectively. Here
I01 encodes the Boolean domain, and I∨ , I∧ and I¬ encode disjunction, conjunction and
negation, respectively, such that ϕ and ϕ′ can be expressed in CQ with these relations.
(2) We define the CQ query Q as follows:
Q(b, c) = ∃~x ∃z (QX (~x ) ∧ Qϕ′ (~x, z, b)) ∧ R01 (c) .
Here ~x = (x1 , . . . , xm ). Query QX (~x ) generate all truth assignments of X variables, by
means of Cartesian products of R01 . The sub-query Qϕ′ (~x, z, b) encodes the truth value
of ϕ′ (i.e., b), for a given truth assignment µX represented by ~x and truth assignment µz
for z, such that b = 1 if (µX , µz ) satisfies ϕ′ , and b = 0 otherwise. Obviously, Qϕ′ (~x, z, b)
can be expressed in CQ in terms of R∨ , R∧ , and R¬ . Observe that given D, Q(D) is a
subset of {(1, 1), (1, 0), (0, 1), (0, 0)}. Let U = {(0, 1), (0, 0)}. As remarked above, ϕ′ is
false under the truth assignments when z = 1; hence we have that U ⊆ Q(D).
(3) We define δrel ((1, 0), Q) = δrel ((1, 1), Q) = 2 and δrel ((0, 1), Q) = δrel ((0, 0), Q) = 1. Furthermore, we use a constant function δdis that returns 0 for each pair of tuples t and s of
ACM Transactions on Database Systems, Vol. V, No. N, Article A, Publication date: January YYYY.
App–4
Ting Deng and Wenfei Fan
P
RQ . We set λ = 0. Then for each set S of k tuples of RQ , FMS (S) = (k − 1) · t∈S δrel (t, Q)
(see Section 3; recall k = 2). Thus FMS ({(1, 1), (1, 0)}) = 4, FMS ({(0, 1), (0, 0)}) = 2, and
FMS ({t, s}) = 3 if t ∈ {(1, 1), (1, 0)} and s ∈ {(0, 0), (0, 1)}.
We next show that ϕ is not satisfiable if and only if rank(U ) ≤ r.
⇒ Assume that ϕ is not satisfiable. Then there exists no truth assignment µX of X
variables that satisfies ϕ. Thus, tuples (1, 1) and (1, 0) cannot be in the answer Q(D)
to query Q in D. Therefore, rank(U ) = 1 ≤ r, by the definition of FMS .
⇐ Conversely, assume that ϕ is satisfiable. Then there exists a truth assignment µ0X
of X variables that satisfies ϕ. Thus (1, 1) and (1, 0) must be in Q(D), by the definition
of query Q. Hence rank(U ) > 1 = r, by the definition of FMS .
We next show that DRP(CQ, FMM ) is coNP-hard, also by reduction from the complement of 3SAT. Given an instance ϕ of 3SAT, we construct the same ϕ′ , D, Q,
δrel , and δdis as their counterparts for FMS given above, and let U = {(0, 1)} and
k = r = 1. Furthermore, we set λ = 0. Then for each set S of tuples of RQ , FMM (S)
= mint∈S δrel (t, Q). Then along the same lines as the argument for FMS given above,
one can easily verify that rank(U ) ≤ r if and only if ϕ is not satisfiable.
Upper bound. The algorithms given in the proof of Theorem 6.1 obviously work in the
special case when λ = 0. Thus QRD(∃FO+ , FMS ) and QRD(∃FO+ , FMM ) are in coNP.
(C) RDC(LQ , F ). Recall the lower bounds of RDC(LQ , F ) given in Theorems 7.1 for FMS
and FMM when LQ ranges over CQ, UCQ or ∃FO+ . Those bounds are established by
taking λ = 0 and hence, hold here. For the upper bounds, the algorithms given there
obviously remain intact in the special case when λ = 0.
(1.1.2) When LQ is FO. The lower bounds of QRD(FO, F ), DRP(FO, F ) and RDC(FO, F )
given in Theorems 5.1, 6.1 and 7.1 for FMS and FMM are established by taking λ = 0.
As a result, those lower bounds hold here. For the upper bounds, the algorithms given
there obviously remain intact in the special case when λ = 0.
(1.2) Data complexity. It suffices to show that QRD(FO, F ) and DRP(FO, F ) are in
PTIME when F is FMS or FMM , RDC(LQ , FMS ) is #P-complete under polynomial Turing
reductions for CQ, UCQ, ∃FO+ and FO, and RDC(FO, FMM ) is in FP.
(1.2.1) QRD(FO, FMS ). We develop
P a PTIME algorithm for QRD(FO, FMS ) when λ = 0.
Recall that FMS (U ) = (k − 1) · t∈U δrel (t, Q) for each set U of k tuples of schema RQ in
this setting. Hence we develop an algorithm that works as follows:
1. compute Q(D) and sort the tuples in Q(D) in descending order based on δrel ;
2. check whether |Q(D)| ≥ k; if so, continue; otherwise, return “no”;
3. let U be the set consisting of the first k tuples in the sorted Q(D); check whether
FMS (U ) ≥ B; if so, return “yes”; otherwise, return “no”.
It is easy to verify that the algorithm is correct. Moreover, the algorithm is in PTIME.
Indeed, steps 1 and 2 are in PTIME since it is in PTIME to compute Q(D) for a fixed
FO query Q, and step 3 is in PTIME because FMS is PTIME computable in this setting.
(1.2.2) QRD(FO, FMM ). When λ = 0, FMM (U ) = mint∈U δrel (t, Q), and it is PTIME
computable. It is easy to see that the PTIME algorithm for QRD(FO, FMS ) given above
also works here. Hence QRD(FO, FMM ) is also in PTIME.
(1.2.3) DRP(FO, FMS ) and DRP(FO
P, FMM ). When λ = 0, for each k-tuple set U of schema
RQ of Q(D), FMS (U ) = (k − 1) · t∈U δrel (t, Q), and FMM (U ) = mint∈U δrel (t, Q). Recall
the PTIME-algorithm given in the proof of Theorem 6.4 for DRP(FO, Fmono ). Obviously,
ACM Transactions on Database Systems, Vol. V, No. N, Article A, Publication date: January YYYY.
On the Complexity of Query Result Diversification
App–5
if we define v(t) = δrel (t, Q) for each tuple t ∈ Q(D) and use FMS instead of Fmono in
the algorithm, the algorithm can be used for DRP(FO, FMS ), and is still in PTIME.
Similarly, when using function v given above and FMM instead of Fmono , the algorithm
also works for DRP(FO, FMM ), and is also in PTIME.
(1.2.4) RDC(FO, FMS ). It suffices to show that RDC(CQ, FMS ) is #P-hard under
polynomial Turing reductions and that RDC(FO, FMS ) is in #P.
lower bound. We show that RDC(CQ, FMS ) is #P-hard by polynomial Turing reduction
from #SSPk, which is #P-complete by Lemma 7.6. Along the same line as the proof
of Theorem 7.5 for RDC(CQ, Fmono ), we construct a polynomial Turing reduction as
follows. Given an instance W , π, l and d of #SSPk, we define the same transformation
from #SSPk to RDC(CQ, FMS ) as given in the proof of Theorem 7.5 for RDC(CQ, Fmono ),
except the following:
P (a) when λ = 0, for each set U consisting of k tuples of RQ ,
= (l − 1) · d. It is easy
FMS (U ) = (k − 1) · t∈U δrel (t, Q); and (b) we let k = l and B P
to see that the number of subsets T of W with |T | = l and w∈T π(w) ≥ d equals
the number of valid sets U for (Q, D, k, FMS , B). As discussed there, we
P can find the
solution to #SSPk, i.e., the number of subsets T of W with |T | = l and w∈T π(w) = d,
by calling the oracle COUNTRDC twice, to compute the numbers X and Y of valid sets
for (Q, D, k, FMS , B) and (Q, D, k, FMS , B + 1), respectively, and the solution to #SSPk
is simply X − Y . Here COUNTRDC (Q, D, k, FMS , B) is the oracle that given Q, D, k, FMS
and B, returns the number of valid sets U for (Q, D, k, FMS , B).
Upper bound. To see that RDC(FO, FMS ) is in #P, we only need to show that it is in
PTIME to verify whether a given set U is valid for (Q, D, k, FMS , B). Indeed, Q(D) is
PTIME computable since Q is fixed, and moreover, FMS is also PTIME computable.
(1.2.5) RDC(FO, FMM ). When λ = 0, we show that RDC(FO, FMM ) is in FP for fixed
queries Q, by giving an FP algorithm. Given Q, D, k, FMM , and B, the algorithm returns
the number of valid sets for (Q, D, k, FMM , B). Recall that FMM (U ) = mint∈U δrel (t, Q)
when λ = 0 (see Section 3). The algorithm works as follows:
1. compute Q(D) and sort tuples in Q(D) in descending order based on their δrel
values; let Q(D) = {t1 , . . . , t|Q(D)| }, where δrel (ti , Q) ≥ δrel (tj , Q) when i ≤ j;
2. check whether δrel (ti , Q) < B for all i ∈ [1, |Q(D)|]; if so, return 0; otherwise continue;
3. let ti be the tuple such that δrel (ti , Q) ≥ B and δrel (ti+1 , Q) < B; check whether
i ≥ k; if so, return = Cki , represented in binary; otherwise return 0.
Here step 1 is in PTIME since Q(D) is PTIME computable when Q is fixed. Step 2
is obviously in PTIME when λ = 0. Moreover, step 3 is in PTIME since the output is
represented in binary. Thus the algorithm is in PTIME.
(2) When F is Fmono . We first study the combined complexity of QRD(LQ , Fmono ),
DRP(LQ , Fmono ) and RDC(LQ , Fmono ). We then consider their data complexity.
P
(2.1) Combined complexity.
Note that when λ = 0, Fmono (U ) =
t∈U δrel (t, Q), and
P
FMS (U ) = (k − 1)· t∈U δrel (t, Q) for each set U of k tuples of RQ . As a result, when
λ = 0 and k = 2, Fmono and FMS are the same function. Recall that in the lower bounds
proofs of Theorem 8.2 (for QRD(LQ , FMS ) and DRP(LQ , FMS )) and Theorem 7.1 (for
RDC(LQ , FMS )), we set λ = 0 and k = 2. Thus the lower bounds given there hold here
for Fmono . Moreover, the algorithms given in those upper bound proofs carry over to
the special case for λ = 0. Hence the upper bounds hold here.
(2.2) Data complexity. We show that when λ = 0, the data complexity bounds
of QRD(LQ , Fmono ), DRP(LQ , Fmono ) and RDC(LQ , Fmono ) remain the same as their
counterparts given in Theorems 5.4, 6.4 and 7.5, respectively. Obviously, the PTIME
ACM Transactions on Database Systems, Vol. V, No. N, Article A, Publication date: January YYYY.
App–6
Ting Deng and Wenfei Fan
algorithms given for Theorems 5.4 and 6.4 carry over to the special case when λ = 0.
For RDC(LQ , Fmono ), recall that its lower bound proof for Theorem 7.5 uses λ = 0.
Hence the lower bound remains intact here. Moreover, the algorithm given there
obviously also works in the special case when λ = 0.
This completes the proof of Theorem 8.2.
2
T HEOREM 8.3. When λ = 1, the combined complexity of Theorems 5.1, 5.2, 6.1,
6.2, 7.1 and 7.2 and the data complexity of Theorems 5.4, 6.4, 7.4 and 7.5 remain
unchanged for QRD, DRP and RDC, respectively.
2
P ROOF. When λ = 1, we first study QRD, DRP and RDC for FMS and FMM . We then
investigate these problems when F is Fmono .
(1) When F is FMS or FMM . We first study the combined complexity, and then investigate the data complexity of these problems in these settings.
(1.1) Combined complexity. We first consider CQ, UCQ and ∃FO+ , and then FO.
(1.1.1) When LQ is CQ, UCQ or ∃FO+ . The lower bounds of QRD(CQ, F ) and DRP(CQ, F )
given in Theorems 5.1 and 6.1 for FMS and FMM are established by taking λ = 1. As
a result, these lower bounds hold here. For the upper bounds, the algorithms given
there obviously remain intact in the special case when λ = 0.
We next only need to show that RDC(LQ , FMS ) and RDC(LQ , FMM ) are #P-complete
for CQ, UCQ and ∃FO+ , when λ = 1.
Lower bound. We first show that RDC(CQ, FMS ) is #·NP-hard by parsimonious reduction from #Σ1 SAT (see Section 7.1). Given an instance ϕ(X, Y ) = ∃X(C1 ∧ . . . ∧ Cl )
of #Σ1 SAT, we use the same reduction given in the proof of Theorem 7.1 for
RDC(CQ, FMS ), except the following: (i) we define δdis ((tY , 0, 1), (1, . . . , 1, 0)) = 1, and
for any other pair of tuples t and s, we define δdis (t, s) =
P 0; (ii) we set λ = 1 and
k = 2; hence for each set U of tuples of RQ , FMS (U ) =
t,s∈U δdis (t, s); and (iii) we
set B = 1. Then one can verify that the number of valid sets for (Q, D, k, FMS , B) is
equal to the number of truth assignments of Y that satisfy ϕ. This follows from the
fact that for each set U ⊆ Q(D) such that |U | = k = 2, FMS (U ) ≥ B = 1 if and only if
U = {(tY , 0, 1),(1, . . . , 1, 0)}, where the truth assignment encoded by tY satisfies ϕ.
We next show that RDC(CQ, FMM ) is #·NP-hard also by parsimonious reduction from
#Σ1 SAT. Given an instance ϕ(X, Y ) of #Σ1 SAT, we use the same reduction as given
above for RDC(CQ, FMS ) except that when λ = 1, FMM (U ) = mint,s∈U,t6=s δdis (t, s) for
each set U consisting of k tuples of RQ . Then along the same line as the proof given
above, one can show that the number of valid sets for (Q, D, k, FMM , B) equals the
number of truth assignments of Y that satisfy ϕ.
Upper bound. It is easy to see that when λ = 1, it is still in NP to verify whether a
given set U is valid for (Q, D, k, F, B) for ∃FO+ , when F is FMS or FMM following the
proof of Theorem 7.1. Hence RDC(∃FO+ , F ) is in #·NP in this case.
(1.1.2) When LQ is FO. We show that when λ = 1, QRD(FO, F ) and DRP(FO, F ) are
PSPACE-complete, and RDC(FO, F ) is #·PSPACE-complete, for FMS and FMM .
(A) Lower bound. We first prove the lower bounds.
(a) QRD(FO, FMS ). We show that when λ = 1, QRD(FO, FMS ) is PSPACE-hard by
reduction from the membership problem for FO. Given an instance (Q, D, s) of the
membership problem, we use the same reduction as the one given in Theorem 5.1 for
QRD(FO, FMS ), except the following: (i) we define δdis ((s, 0), (s, 1)) = 1, and for any other
ACM Transactions on Database Systems, Vol. V, No. N, Article A, Publication date: January YYYY.
On the Complexity of Query Result Diversification
App–7
′
two tuples t and t′ of RQ , δP
dis (t, t ) = 0, and (ii) we set λ = 1, and hence, for each set U
of tuples of RQ , FMS (U ) = t,t′ ∈U δdis (t, t′ ); and (iii) we set B = 1. Recall that k = 2.
We next show that s ∈ Q(D) if and only if there exists a valid set for (Q′ , D′ , k, FMS ,
B). First assume that s ∈ Q(D). Then there exists a set U = {(s, 1), (s, 0)} valid for (Q′ ,
D′ , k, FMS , B) by the definition of δdis . Conversely, if s 6∈ Q(D), then tuples (s, 1) and
(s, 0) are not in Q(D). Thus for each pair of tuples t and t′ in Q(D), FMS ({t, t′ }) = 0 < B.
(b) QRD(FO, FMM ). We show that QRD(FO, FMM ) is PSPACE-hard also by reduction
from the membership problem for FO queries. Given an instance (Q, D, s) of the
membership problem, we use the same reduction given above for QRD(FO, FMS ) except
that when setting λ = 1, for each set U of tuples of RQ , FMM (U ) = mint,s∈U,t6=s δdis (t, s).
Then along the same line as above, one can readily verify that s ∈ Q(D) if and only if
there exists a valid set U for (Q′ , D′ , k, FMM , B).
(c) DRP(FO, FMS ). We show that DRP(FO, FMS ) is PSPACE-hard by reduction from
the complement of the membership problem for FO. Given an instance (Q, D, s) of
the membership problem, we use the same reduction given in the proof of Theorem 6.1 for DRP(FO, FMS ), except the following: (i) we define δdis ((s, 1, 1), (s, 1, 0)) = 1,
δdis ((s, 0, 1), (s, 0, 0)) = 2 and for any other two tuples t and t′ of RQ , we let δdis (t, t′ ) = 0,
and (ii) we set λ = 1. Recall that k = 2, r = 1 and U
P = {(s, 1, 1), (s, 1, 0)}. Hence for
each set S of tuples of RQ , we have that FMS (S) = t,t′ ∈S δdis (t, t′ ). It is easy to see
that s ∈
/ Q(D) if and only if rank(U ) = 1 ≤ r.
(d) DRP(FO, FMM ). We show that DRP(FO, FMM ) is PSPACE-hard also by reduction from
the complement of the membership problem for FO. Given an instance (Q, D, s) of the
membership problem for FO, we use the same reduction given above for DRP(FO, FMS ),
except that when λ = 1, FMM (S) = mint,t′ ∈S,t6=t′ δdis (t, t′ ) for each set S of tuples of RQ′ .
Then one can verify that s ∈
/ Q(D) if and only if rank(U ) = 1 ≤ r.
(e) RDC(FO, FMS ). We show that when λ = 1, RDC(FO, FMS ) is #·PSPACE-hard
by parsimonious reduction from #QBF (see Section 7.1). Given an instance
ϕ = ∃X ∀y1 P2 y2 · · · Pn yn ψ of #QBF, we use the same reduction given in the
proof of Theorem 7.1 for RDC(FO, FMS ), except the following: (i) we define
δdis ((tX , 0, 1), (1, . . . , 1, 0)) = 1, where tX is a truth assignment of the X variables
′
that satisfies ϕ, and for any other pair of tuples t and t′ , we define δdis (t,
Pt ) = 0; (ii) we
set λ = 1 and k = 2, and thus for each set U of tuples of RQ , FMS (U ) = t,t′ ∈U δdis (t, t′ );
and moreover, (iii) we set B = 1. Then along the same line as the proof of Theorem 7.1
for RDC(FO, FMS ), one can verify that the number of valid sets for (D, Q, k, FMS , B)
equals the number of truth assignments of X that satisfy ϕ.
(f) RDC(FO, FMM ). We show that RDC(FO, FMM ) is #·PSPACE-hard, also by parsimonious reduction from #QBF. Given an instance ϕ = ∃X ∀y1 P2 y2 · · · Pn yn ψ of #QBF, we
use the same reduction given above for RDC(FO, FMS ), except the following: (i) when
setting λ = 1, FMM (U ) = mint,s∈U,t6=s δdis (t, s) for each set U of tuples of RQ ; and (ii) we
set B = 1. Then one can readily verify that the number of valid sets for (D, Q, k, FMM ,
B) equals the number of truth assignments of X that satisfy ϕ.
(B) Upper bound. We show that when λ = 1, QRD(FO, F ), DRP(FO, F ) and RDC(FO, F )
are in PSPACE, PSPACE and #·PSPACE, respectively, when F is FMS or FMM . Obviously,
the algorithms given in the proofs of Theorem 5.1 and 6.1 for QRD(FO, F ) and
DRP(FO, F ), respectively, work here, where F is FMS or FMM . Thus QRD(FO, F ) and
DRP(FO, F ) are both in PSPACE. Moreover, it is easy to see that when λ = 1, it is
still in PSPACE to verify that whether a set U is a valid set for (Q, D, k, F, B) for FO,
ACM Transactions on Database Systems, Vol. V, No. N, Article A, Publication date: January YYYY.
App–8
Ting Deng and Wenfei Fan
when F is FMS or FMM . Hence RDC(FO, F ) is in #·PSPACE for max-sum or max-min
diversification, when λ = 1.
(1.2) Data complexity. The lower bounds of QRD(LQ , F ), DRP(LQ , F ) and RDC(LQ , F )
for fixed queries hold here, as their proofs for Theorems 5.4, 6.4, 7.4 and 7.5, respectively, use λ = 1. The algorithms given there for the upper bounds also work here.
(2) When F is Fmono . For the combined complexity, the lower bounds proofs of Theorems 5.2, 6.2 and 7.2 for QRD(LQ , Fmono ), DRP(LQ , Fmono ) and RDC(LQ , Fmono ),
respectively, are verified by using λ = 1. Hence, those lower bounds hold here.
Moreover, all the upper bounds given there carry over to the special case when λ = 1.
For the data complexity, the PTIME upper bounds of Theorems 5.4 and 6.4 for
QRD(LQ , Fmono ) and DRP(LQ , Fmono ), respectively, obviously carry over to their special
case when λ = 1. We next show that when λ = 1, the data complexity of RDC(LQ , Fmono )
is #P-complete for CQ, UCQ, ∃FO+ and FO, under polynomial Turing reductions. It
suffices to show that RDC(CQ, Fmono ) is #P-hard under polynomial Turing reductions,
and that RDC(FO, Fmono ) is in #P, for fixed queries.
(2.1) Lower bound. We show that when λ = 1, RDC(CQ, Fmono ) is #P-hard by polynomial Turing reduction from #SSPk, which is shown #P-complete by Lemma 7.6. Along
the same line as the proof of Theorem 7.5, we construct a polynomial Turing reduction
as follows. Given an instance W , π, l and d of #SSPk, we define Q, D,Pδrel , δdis , Fmono ,
k and B, such that the number of subsets T of W with |T | = l and w∈T π(w) ≥ d
equals the number of valid sets U for (Q, D, k, Fmono , B). As discussed there, we can
find
P the solution to #SSPk, i.e., the number of subsets T of W with |T | = l and
w∈T π(w) = d, by calling the oracle COUNTRDC twice, to compute the numbers X
and Y of valid sets for (Q, D, k, Fmono , B) and (Q, D, k, Fmono , B + 1), respectively. Here
COUNTRDC (Q, D, k, Fmono , B) is the oracle that given Q, D, k, Fmono and B, returns the
number of valid sets U for (Q, D, k, Fmono , B).
We next give the transformation from #SSPk to RDC(CQ, Fmono ), for λ = 1.
(1) For each w ∈ W , let w′ be a distinct element not in W . The database D consists of
a single relation IW = {(w), (w′ ) | w ∈ W }, specified by schema RW (W ).
(2) We define query Q as the identity query on RW instances. Then |Q(D)| = 2|W |.
(3) We define δrel as a constant function that returns 1 for each tuple of RQ . Moreover,
we define δdis ((w), (w′ )) = π(w), and for any other pair of tuples t and t′ of RQ ,
we define δdis (t, t′ ) = 0. We
P set λ = 1, and thus for each set U of tuples of RQ ,
Fmono (U ) = (1/(2|W | − 1)) · t∈U,s∈Q(D) δdis (t, s).
(4) Finally, we set k = 2l and B = d/(2|W | − 1).
PWe only need to show that the number of subsets T of W with |T | = l and
w∈T π(w)≥ d is equal to the number of valid sets U for (Q, D, k, Fmono , B). Recall
that k = 2l and B = d/(2|W | − 1). Assume that there exists a set U ⊆Q(D) with
|U | = k and Fmono (U ) ≥ B. Then by the definition
of δdis , Fmono (U ) = (1/(2|W | − 1)) ·
P
P
′
= (1/(2|W | − 1)) · (w)∈U π(w) ≥ B. Then for the set T =
(w)∈U,(w ′ )∈U δdis ((w), (w )) P
{w | (w)
P ∈ U }, we have that w∈T π(w) ≥ d. Conversely, for a subset T of W with |T | = l
and w∈T π(w) ≥ d, the set U = {(w), (w′ ) | w ∈ T } is valid for (Q, D, k, Fmono , B).
(2.2) Upper bound. When λ = 1, RDC(FO, Fmono ) is in #P, since it is in PTIME to check
whether a set U is valid for (Q, D, k, Fmono , B) for fixed FO queries.
This completes the proof of Theorem 8.3
2
ACM Transactions on Database Systems, Vol. V, No. N, Article A, Publication date: January YYYY.
On the Complexity of Query Result Diversification
App–9
C OROLLARY 8.4. For a predefined constant k,
— the combined complexity bounds given in Theorems 5.1, 5.2, 6.1, 6.2, 7.1 and 7.2 are
unchanged for QRD, DRP and RDC, respectively; and
— the data complexity is in
— PTIME for QRD,
— PTIME for DRP, and
— FP for RDC,
no matter whether for FMS , FMM or Fmono , and for CQ, UCQ, ∃FO+ or FO.
P ROOF. We study QRD, DRP and RDC when k is a predefined constant, i.e., we
consider only candidate sets U with a constant size. We first establish their combined
complexity, and then investigate their data complexity.
(1) Combined complexity. We first show the lower bounds, followed by upper bounds.
(1.1) Lower bound. Observe that lower bounds proofs of QRD(LQ , F ), DRP(LQ , F ) and
RDC(LQ , F ) given for Theorem 5.1, 5.2, 6.1, 6.2, 7.1 and 7.2 are established by using
k = 2 when F is FMS , and by setting k = 1 when F is FMM or Fmono , except those for
QRD(LQ , F ) and DRP(LQ , F ), when F is FMS or FMM for CQ, UCQ and ∃FO+ . Hence
these lower bounds hold here. Moreover, QRD(LQ , F ) and DRP(LQ , F ) are also shown
to be NP-hard and coNP-hard in Theorem 8.2, respectively, for CQ, UCQ and ∃FO+ , by
also using k = 2 when F is FMS , and by setting k = 1 when F is FMM . Hence these
lower bounds hold here.
(1.2) Upper bound. The upper bounds of QRD(LQ , F ), DRP(LQ , F ) and RDC(LQ , F )
given for Theorem 5.1, 5.2, 6.1, 6.2, 7.1 and 7.2 obviously remain intact in the special
case when k is a constant.
(2) Data complexity. We show that QRD(FO, F ) and DRP(FO, F ) are in PTIME, and
RDC(FO, F ) is in FP for fixed FO queries, when F is FMS , FMM or FMM .
(2.1) QRD(FO, ). We show that QRD(FO, F ) is in PTIME by giving a PTIME algorithm.
Given Q, D, δrel , δdis , F , k and B, the algorithm checks whether there exists a valid set
U for (Q, D, k, F, B). It works as follows:
1. compute Q(D);
2. enumerate all subsets U of Q(D) such that |U | = k; denote by S the collection of
all such sets U ;
3. check whether there exists a set U ∈ S such that F (U ) ≥ B; if so, return “yes”;
otherwise, return “no”.
We show the algorithm is in PTIME when F is FMS , FMM or Fmono . Indeed, Q(D) is
PTIME computable since Q is fixed. Moreover, there are only polynomial many sets
U in S that need to be checked in step 3 since k is a constant. Obviously, FMS (U ) and
FMM (U ) are PTIME computable; Fmono is also PTIME computable since it is in PTIME
to compute Q(D). Thus step 3 can be done in PTIME. Hence QRD(FO, F ) is in PTIME
for fixed queries and constant k, when F is FMS , FMM or Fmono .
(2.2) DRP(FO, F ). We show that DRP(FO, F ) is in PTIME by giving a PTIME algorithm.
Given Q, D, δrel , δdis , F , U , k and r, the algorithm works as follows:
1. compute Q(D), in PTIME;
2. enumerate all subsets V of Q(D) such that |V | = k; denote by S the collection of
all such sets V ; sort all sets in S in descending order based on their F values;
3. check whether rank(U ) ≤ r; if so, return “yes”; otherwise return “no”.
ACM Transactions on Database Systems, Vol. V, No. N, Article A, Publication date: January YYYY.
App–10
Ting Deng and Wenfei Fan
To see that the algorithm is in PTIME, note that it is in PTIME to compute Q(D) since
Q is fixed, and that there are only polynomial many sets in S. Moreover, F is PTIME
computable for FMS , FMM and Fmono when Q is fixed. Thus steps 2 and 3 are also in
PTIME. In particular, step 3 can be easily done by counting the number of sets S in
the sorted S that have F (S) ≥ F (U ). Hence the algorithm is in PTIME.
(2.3) RDC(FO, F ). We show that RDC(FO, F ) is in FP by giving a PTIME algorithm.
Given Q, D, δrel , δdis , F , k and B, the algorithm works as follows:
1. compute Q(D), in PTIME;
2. enumerate all subsets U of Q(D) such that |U | = k and sort them in descending
order based on their F () values;
3. count and return the number of distinct valid sets for (Q, D, k, F, B).
Its step 1 is in PTIME since Q is fixed. Moreover, there are polynomial many subsets U
in step 2 since k is a constant. Thus the algorithm is in PTIME, and the problem is in FP.
This completes the proof of Corollary 8.4.
2
B. PROOFS OF SECTION 9
C OROLLARY 9.2. In the presence of compatibility constraints of Cm , the combined
complexity bounds of Theorems 5.1, 5.2, 6.1, 6.2, 7.1 and 7.2 remain unchanged for
QRD, DRP and RDC.
2
P ROOF. Observe that the lower bounds of QRD, DRP and RDC given in Theorems 5.1, 5.2, 6.1, 6.2, 7.1 and 7.2, respectively, are established when the compatibility
constraints are absent. These bounds are obviously intact in the more setting when
compatibility constraints are present.
Below we show that the upper bounds given there also remain intact in the presence
of compatibility constraints Σ. To this end, we make minor changes to the algorithms
given there by considering candidate sets for (Q, D, Σ, k, ). We show that the revised
algorithms still work here and their complexity remain unchanged.
(1) QRD(LQ , F ). When F is FMS or FMM , we revise the algorithms given for Theorem 5.1
as follows: (a) the step 2 of the algorithm for QRD(∃FO+ , F ) checks whether |U | = k,
F (U ) ≥ B and U |= Σ; (b) in step 2, the algorithm for QRD(FO, F ) also checks whether
U ⊆ Q(D), F (U ) ≥ B and U |= Σ. When F is Fmono , we revise the algorithm given in
Theorem 5.2 such that in step 2, the algorithm checks whether U ⊆ Q(D) and U |= Σ.
Clearly, all the revised algorithms work here. Moreover, they are still in NP, PSPACE
and PSPACE, respectively, since U |= Σ can be checked in PTIME. Thus the upper
bounds given in Theorems 5.1 and 5.2 carry over here.
(2) DRP(LQ , F ). When F is FMS or FMM , we revise the algorithms given for Theorem 6.1
such that (a) in step 2, the algorithm for DRP(∃FO+ , F ) checks whether for each set
S ∈ S, |S| = k and S |= Σ; (b) in step 2, the algorithm for DRP(FO, F ) checks whether
for each set S ∈ S, S ⊆ Q(D) and S |= Σ. When F is Fmono , the step 2 of the algorithm
given for Theorem 6.2 checks whether for each S ∈ S, S ⊆ Q(D) and S |= Σ. Since
we can check whether S |= Σ is in PTIME, all the revised algorithms are still in coNP,
PSPACE and PSPACE, respectively. Thus the upper bounds given in Theorem 6.1 and
6.2 remain valid here.
(3) RDC(LQ , F ). It suffices to show the following: (a) when F is FMS or FMM , it is in NP
and in PSPACE to check if a set U is valid for (Q, D, Σ, k, F, B) for ∃FO+ queries Q and
FO queries Q, respectively, and (b) when F is Fmono , it is in PSPACE to check whether
ACM Transactions on Database Systems, Vol. V, No. N, Article A, Publication date: January YYYY.
On the Complexity of Query Result Diversification
App–11
a set U is valid for (Q, D, Σ, k, F, B) for FO queries Q. Since it is in PTIME to check
whether a set U satisfies Σ (i.e., U |= Σ), both statements (a) and (b) hold here.
This completes the proof of Corollary 9.2.
2
T HEOREM 9.3. In the presence of compatibility constraints of Cm , the data complexity bounds of Theorems 5.4, 6.4 and 7.4 remain unchanged for QRD, DRP and RDC,
respectively, for FMS and FMM . However, for Fmono ,
— QRD becomes NP-complete;
— DRP becomes coNP-complete; and
— RDC becomes #P-complete under parsimonious reductions.
2
P ROOF. The lower bounds of QRD(LQ , F ), DRP(LQ , F ) and RDC(LQ , F ) given in
Theorems 5.4, 6.4 and 7.4, respectively, are established in the absence of compatibility
constraints Σ, for FMS and FMM . These lower bounds obviously hold in the more
general setting when Σ may be present. In addition, the upper bounds given there
also carry over to the setting when compatibility constraints Σ are present. Indeed,
along the same line as the proof of Corollary 9.2, we can revise the algorithms given
in the upper bound proofs of Theorems 5.4, 6.4 and 7.4 by considering candidate sets
for (Q, D, Σ, k, ). The revised algorithms still work in the presence of Σ with the same
complexity, since one can check whether U |= Σ in PTIME. Hence all the upper bounds
still hold here.
We next show that for Fmono , the data complexity of QRD, DRP and RDC is
NP-complete, coNP-complete and #P-complete under parsimonious reductions,
respectively, in the presence of Σ, for CQ, UCQ, ∃FO+ and FO.
(1) QRD(LQ , Fmono ). It suffices to show that QRD(CQ, Fmono ) is NP-hard and that
QRD(FO, Fmono ) is in NP.
Lower bound. We verify that QRD(CQ, Fmono ) is NP-hard by reduction from 3SAT.
Given an instance ϕ = C1 ∧ . . . ∧ Cl of 3SAT defined over variables X = {x1 , . . . , xm },
we construct a database D, a fixed CQ query Q, functions δrel , δdis and Fmono , a set of
fixed compatibility constraints Σ, a positive integer k and a real number B. We show
that ϕ is satisfiable if and only if there exists a valid set U for (Q, D, Σ, k, Fmono , B).
(1) We use the database D given in the proof of Theorem 5.1 for QRD(CQ, FMS ), over
schema RC (cid, L1 , V1 , L2 , V2 , L3 , V3 ). That is, for each clause Ci , there are tuples in D
to encode all truth assignments for variables in Ci that make Ci true.
(2) The query Q is an identity query on instances of RC .
(3) For each tuple t ∈ D, we define δrel (t, Q) = 1, and for any other tuples t′ of RQ , we
let δrel (t, Q) = 0. We define δdis as a constant function that returns 0 for each pair of
tuples t and s of RQ . We set P
λ = 0. Then for each set U of tuples of RQ with k tuples,
one can see that Fmono (U ) = t∈U δrel (t, Q).
(4) We define Σ consisting of 10 constraints, given as follows:
^
t1 [A] = t2 [A] ,
ρ1 : ∀t1 , t2 : RQ t1 [cid] = t2 [cid] →
A∈RC
ρij : ∀t1 , t2 : RQ t1 [Li ] = t1 [Lj ] → t1 [Vi ] = t2 [Vj ] , i, j ∈ [1, 3].
Intuitively, constraint ρ1 states that tuples in a set U have pairwise distinct cid-values,
and the set {ρij | i, j ∈ [1, 3]} ensures that for each pair of tuples t and s in U , t
and s agree on the values of their common variables. Note that all these constraints
are defined on the fixed schema RC (since Q is an identity query on Rc ), and have
ACM Transactions on Database Systems, Vol. V, No. N, Article A, Publication date: January YYYY.
App–12
Ting Deng and Wenfei Fan
the form of functional dependencies (see, e.g., [Abiteboul et al. 1995] for functional
dependencies). Intuitively, these compatibility constraints are used to assure that
k-element sets U to be picked encode a valid truth assignment for variables X.
(5) Finally, we take k = B = l. That is, we only consider sets U that consist of l tuples,
one for each clause in ϕ.
We next show that the formula ϕ is satisfiable if and only if there exists a valid set
U for (Q, D, Σ, k, F, B), i.e., the construction given above is indeed a reduction.
⇒ Assume first that ϕ is satisfiable. Then there exists a truth assignment µ0X of X
variables such that every clause Cj of ϕ is true under µ0X . Let U consist of l tuples
of RQ , one for each clause, in which the values for the variables in X agree with
µ0X . Obviously, U |= Σ, and moreover, Fmono (U ) = l ≥ B by the definition of Fmono .
Therefore, U is a valid set for (Q, D, Σ, k, F, B).
⇐ Conversely, assume that ϕ is not satisfiable. Suppose by contradiction that there
exists a set U ⊆ Q(D) such that |U | = l, U |= Σ, and Fmono (U ) ≥ B. Then from the
tuples in U , we can construct a valid truth assignment µX of variables in X that
satisfies all clauses in ϕ, by the definition of δrel . This leads to a contradiction.
Upper bound. Observe that the algorithm for QRD(FO, Fmono ) revised in the proof of
Corollary 9.2 works here. Clearly, its step 2 is in PTIME since one can check whether
U |= Σ in PTIME, Q(D) is PTIME computable for a fixed FO query Q, and moreover,
Fmono (U ) is also in PTIME. Thus the algorithm is in NP when Q is fixed.
(2) DRP(LQ , Fmono ). It suffices to show that DRP(CQ, Fmono ) is coNP-hard and that
DRP(FO, Fmono ) is in coNP.
Lower bound. We verify the lower bound by reduction from the complement of 3SAT.
Given an instance ϕ = C1 ∧ . . . ∧ Cl of 3SAT over variables X = {x1 , . . . , xm }, we
construct a database D, a fixed CQ query Q, functions δrel , δdis and Fmono , a fixed set Σ
of compatibility constraints, a positive integer k, and a set U . We show that ϕ is not
satisfiable if and only if rank(U ) ≤ r. We use constant r = 1.
V
(1) We define the same formula ϕ′ = (ϕ ∨ z) ∧ z̄ = li=1 (Ci ∨ z) ∧ z̄, and the same
database D (a single relation) over schema RC (L1 , V1 , L2 , V2 , L3 , V3 , Z, VZ , A) as their
counterparts given in the proof of Theorem 6.1. Here D consists of tuples that encode,
for each clause Ci ∨ z, all truth assignments µi for the three variables in Ci and z,
and the truth value of the clause Ci ∨ z under µi . It also includes two extra tuples
(l + 1, e1 , f1 , e2 , f2 , e3 , f3 , z, 1, 0) and (l + 1, e1 , f1 , e2 , f2 , e3 , f3 , z, 0, 1) for z̄, where all ei
and fi are distinct constants that are not in X ∪ {z, 0, 1}.
(2) The query Q is an identity query on instances of Rc .
(3) Let U consist of l + 1 tuples from D, one for each clause in ϕ′ such that all variables
in X and z are all set to be 1.
(4) Along the same line as the proof of DRP(CQ, Fmono ) given above, we define constraints Σ to ensure that for any set U ⊆ Q(D), if U |= Σ, then the tuples in U have
pairwise distinct cid-values and moreover, for each pair of tuples t and s in U , t and s
agree on the values of their common variables.
(5) We define function δrel such that for each tuple t ∈ D, δrel (t, Q) = 1 if t[A] = 1; and
′
for any other tuple t′ of RQ , we let
Pδrel (t , Q) = 0. We set λ = 0. Then for each set U of
tuples of schema RQ , Fmono (U ) = t∈U (δrel (t, Q)).
(6) Finally, we set k = l + 1.
We next show that this is indeed a reduction.
ACM Transactions on Database Systems, Vol. V, No. N, Article A, Publication date: January YYYY.
On the Complexity of Query Result Diversification
App–13
⇒ Assume first that ϕ is satisfiable. Then there exists a truth assignment µ0X for X
variables that satisfies ϕ. We show that rank(U ) = 2 > r = 1. Let U 0 consist of l + 1
tuples, one for each clause in ϕ′ , such that the values of all the variables in X agree
with µ0X and z is set to be 0. Obviously, for any tuple t in U 0 , we have that δrel (t, Q) = 1
by the definition of δrel . Then Fmono (U 0 ) = l + 1. Note that for each tuple t ∈ U , t[A] = 1
if t 6= (l + 1, e1 , f1 , e2 , f2 , e3 , f3 , z, 1, 0). Thus Fmono (U ) = l. Putting these together, we
have that rank(U ) ≥ 2 > r = 1.
⇐ Conversely, assume that ϕ is not satisfiable. Then there exists no truth assignment µX of X variables that satisfies ϕ. It is easy to see that for each candidate
set S for (Q, D, k), there exist at most l tuples t ∈ S such that t[A] = 1, and thus
Fmono (S) ≤ l = Fmono (U ). Therefore, rank(U ) = 1 ≤ r = 1.
Upper bound. Clearly, the algorithm for DRP(FO, FMS ) given in the proof of Corollary 9.2 also works here. Indeed, its step 2 is in PTIME since it is in PTIME to check
whether U |= Σ, Q(D) is PTIME computable for a fixed FO query Q, and moreover,
Fmono (U ) is also in PTIME. Thus the algorithm is in coNP here.
(3) RDC(LQ , Fmono ). It suffices to show that RDC(CQ, Fmono ) is #P-hard under parsimonious reductions, and that RDC(FO, Fmono ) is in #P.
Lower bound. We show that RDC(CQ, Fmono ) is #P-hard by parsimonious reduction
from #SAT (see Section 7 for the details of #SAT). Given an instance ϕ(X) = C1 ∧· · ·∧Cl
of 3SAT over variables X, we construct the same D, Q, Σ, λ, k, B, and functions δrel , δdis
and Fmono as their counterparts given above for QRD(CQ, Fmono ). We let k = l. It is easy
to verify that µX is a truth assignment of X variables that satisfies ϕ(X) if and only
if there exists a valid set U for (Q, D, Σ, k, Fmono , B) encoding µX , i.e., U consists of l
tuples in D, one for each clause, in which the values for the variables in X agree with
µX . Thus this is indeed a parsimonious reduction. Hence RDC(CQ, Fmono ) is #P-hard
under parsimonious reductions.
Upper bound. We only need to show that it is in PTIME to check whether a set U is
valid for (Q, D, Σ, k, Fmono , B) for a fixed FO query Q. Indeed, for a set U ⊆ Q(D), it is in
PTIME to check whether |U | = k, U |= Σ and Fmono (U ) ≥ B; in particular, Q(D) is PTIME
computable, and thus Fmono is PTIME computable. Hence RDC(FO, Fmono ) is in #P.
This completes the proof of Theorem 9.3.
2
C OROLLARY 9.4. For identity queries, in the presence of compatibility constraints
of Cm , both the combined complexity and data complexity of Corollary 8.1 remain
unchanged for QRD, DRP, and RDC for FMS and FMM .
However, when it comes to Fmono ,
— QRD becomes NP-complete;
— DRP becomes coNP-complete; and
— RDC becomes #P-complete under parsimonious reductions,
for both combined and data complexity.
2
P ROOF. In the presence of a set Σ of compatibility constraints, for identity queries
we first study the combined and data complexity of QRD, DRP and RDC for FMS and
FMM . We then investigate these problems for Fmono .
(1) When F is FMS or FMM . Observe the following. (a) We have shown in the proof of
Corollary 8.1 that QRD, DRP and RDC are NP-hard, coNP-hard and #P-hard under
parsimonious reductions, respectively, for fixed identity queries, when Σ is absent.
ACM Transactions on Database Systems, Vol. V, No. N, Article A, Publication date: January YYYY.
App–14
Ting Deng and Wenfei Fan
Thus the lower bounds carry over to the more general setting when Σ may be present.
(b) Corollary 9.2 tells us that when Σ is present, QRD, DRP and RDC are in NP, coNP
and #P, respectively, when Q is a CQ query that is not necessarily fixed. Observe that
CQ subsumes identity queries. Hence for identity queries, QRD, DRP and RDC are in
NP, coNP and #P, respectively, in the presence of Σ. Therefore, for identity queries,
the combined complexity and data complexity of QRD, DRP and RDC are NP-complete,
coNP-complete and #P-complete under parsimonious reductions, respectively, in the
presence of Σ.
(2) When F is Fmono . Observe the following. (a) We have shown in the proof of Theorem 9.3 that QRD, DRP and RDC are NP-hard, coNP-hard and #P-hard under
parsimonious reductions, respectively, for Fmono , by using a fixed identity query Q in
the presence of Σ. Thus these lower bounds hold here. (b) The upper bounds given in
the proof of Theorem 9.3 for QRD, DRP and RDC carry over here, since Q(D) is PTIME
computable when Q is an identity query, even if it is not fixed. From these we can see
that for identity queries, the combined and data complexity of QRD, DRP and RDC
are NP-complete, coNP-complete and #P-complete under parsimonious reductions,
respectively, in the presence of Σ.
This completes the proof of Corollary 9.4.
2
C OROLLARY 9.5. For λ = 0, in the presence of compatibility constraints of Cm , the
combined complexity bounds given in Theorem 8.2 remain unchanged for QRD, DRP
and RDC, while the data complexity becomes
— NP-complete for QRD;
— coNP-complete for DRP; and
— #P-complete for RDC under parsimonious reductions,
no matter for FMS , FMM and Fmono , and for CQ, UCQ, ∃FO+ and FO.
2
P ROOF. For λ = 0, we first study the combined complexity of QRD, DRP and RDC,
and then investigate their data complexity.
(1) Combined complexity. Observe the following. (a) The lower bounds given in Theorem 8.2 for QRD, DRP and RDC, respectively, are established when Σ is absent. Hence
these lower bounds remain intact when Σ is possibly present. (b) The upper bounds
given there carry over here when Σ is present. Indeed, along the same line as the proof
of Corollary 9.2, we can revise the algorithms of Theorem 8.2 such that the revised
algorithms work in the presence of Σ, with the same complexity since whether U |= Σ
can be checked in PTIME.
(2) Data complexity. We first study QRD, DRP and RDC for FMS and FMM , and then
consider these problems for Fmono .
(2.1) When F is FMS or FMM . We show that in the presence of a set Σ of compatibility
constraints, for λ = 0, QRD, DRP and RDC are NP-complete, coNP-complete and
#P-complete under parsimonious reductions respectively, for fixed queries in CQ,
UCQ, ∃FO+ and FO.
(2.1.1) Lower bound. It suffices to verify that when F is FMS or FMM , QRD(CQ, F ),
DRP(CQ, F ) and RDC(CQ, F ) are NP-hard, coNP-hard and #P-hard under parsimonious
reductions, respectively, when λ = 0 and Σ is present.
We first show that QRD(CQ, FMS ) and QRD(CQ, FMM ) are NP-hard by reduction
from 3SAT. Given an instance ϕ of 3SAT, we construct the same D, Q, Σ and
ACM Transactions on Database Systems, Vol. V, No. N, Article A, Publication date: January YYYY.
On the Complexity of Query Result Diversification
App–15
functions δrel and δdis as their counterparts given in the proof of Theorem 9.3 for
QRD(CQ, Fmono ). Recall that δrel is defined as follows: δrel (t, Q) = 1 if t ∈ D = Q(D),
and for any other tuples t′ of RQ , δrel (t′ , Q) = 0.P
Thus, when λ = 0, for each set U
consisting of k tuples in D, FMS (U ) = (k − 1) · t∈U δrel (t, Q) = (k − 1) · Fmono (U ),
and FMM (U ) = mint∈U δrel (t, Q) = (1/k) · Fmono (U ). Moreover, the lower bound of
QRD(CQ, Fmono ) in Theorem 9.3 is verified by setting λ = 0. Thus for QRD(CQ, FMS ), we
let k = l and B = (l − 1) · l; and for QRD(CQ, FMM ), we let k = l and B = 1. Along the
same line as the proof of Theorem 9.3, we can show that ϕ is satisfiable if and only if
there exists a set U valid for (Q, D, Σ, k, F, B), when F is FMS or FMM .
We next verify that DRP(CQ, FMS ) and DRP(CQ, FMM ) are coNP-hard by reduction
from the complement of 3SAT. Given an instance ϕ of 3SAT, we use the same reduction given in the proof of Theorem 9.3 for DRP(CQ, Fmono ), except
P the following: for
each set U of k tuples of schema RQ , (a) FMS (U ) = (k − 1) · t∈U δrel (t, Q); and (b)
FMM (U ) = mint∈U δrel (t, Q). Recall that we show the lower bound of DRP(CQ, Fmono ) in
the proof of Theorem 9.3 by using λ = 0. Thus along the same line as that proof, one
can verify that ϕ is not satisfiable if and only if rank(U ) ≤ r, for FMS and FMM .
Finally, we show that RDC(CQ, FMS ) and RDC(CQ, FMM ) are #P-hard by parsimonious reductions from #SAT. We have shown in the proof of Theorem 9.3 that
RDC(CQ, Fmono ) is #P-hard under parsimonious reductions for fixed queries, when
λ = 0. Given an instance ϕ(X) of #SAT over variables X, along the same line as
the analysis in the proof of QRD(CQ, FMS ) and QRD(CQ, FMM ) given above, we can
use the same reduction given in the proof of Theorem 9.3 for RDC(CQ, Fmono ), except
the following: (a) for RDC(CQ, FMS ), we let k = l and B = (l − 1) · l; and (b) for
RDC(CQ, FMM ), we let k = l and B = 1. Similarly, we can show that the number of
truth assignments of the X variables that satisfies ϕ(X) equals the number of valid
sets for (Q, D, Σ, k, F, B), when F is FMS or FMM .
(2.1.2) Upper bound. We only need to show that QRD(FO, F ), DRP(FO, F ) and
RDC(FO, F ) are in NP, coNP and #P, respectively, when F is FMS or FMM . Clearly, the
upper bounds given in Theorem 9.3 for QRD(LQ , F ), DRP(LQ , F ) and RDC(LQ , F ) for
FMS and FMM carry over to the special case when λ = 0.
(2.2) When F is Fmono . Observe the following. (a) The lower bounds given in the Theorem 9.3 for the data complexity of QRD(LQ , Fmono ), DRP(LQ , Fmono ) and RDC(LQ , Fmono )
are established by using λ = 0. Thus the lower bounds hold here. (b) All the algorithms for these problems given there work in the special case when λ = 0. Thus
the upper bounds carry over here. Hence when Σ is present, the data complexity of
QRD(LQ , Fmono ), DRP(LQ , Fmono ) and RDC(LQ , Fmono ) are NP-complete, coNP-complete
and #P-complete under parsimonious reductions, respectively.
This completes the proof of Corollary 9.5.
2
C OROLLARY 9.6. For λ = 1, in the presence of compatibility constraints of Cm , the
combined complexity bounds given in Theorem 8.3 remain unchanged for QRD, DRP
and RDC.
The data complexity bounds of Theorem 8.3 remain unchanged for FMS and FMM . In
contrast, for Fmono and for CQ, UCQ, ∃FO+ and FO,
— QRD is NP-complete;
— DRP is coNP-complete; and
— RDC is #P-complete under parsimonious reductions.
ACM Transactions on Database Systems, Vol. V, No. N, Article A, Publication date: January YYYY.
2
App–16
Ting Deng and Wenfei Fan
P ROOF. When λ = 1 and Σ is present, we study the combined complexity of QRD,
DRP and RDC, followed by their data complexity.
(1) Combined complexity. Observe the following. The lower bounds given in Theorem 8.3 for QRD, DRP and RDC are established when Σ is absent. Thus the lower
bounds hold in the more general setting when Σ is possibly present. Furthermore,
the upper bounds given in Corollary 9.2 for QRD, DRP and RDC carry over here to the
special case when λ = 1.
(2) Data complexity. We next study the data complexity for FMS , FMM , and Fmono .
(2.1) When F is FMS or FMM . In this setting, observe the following. (a) The lower bounds
given in Theorem 8.3 for QRD(LQ , F ), DRP and RDC are established when Σ is absent.
Thus the lower bounds also hold in the presence of Σ. (b) The algorithms given in the
proof of Theorem 9.3 for QRD, DRP and RDC work in the special cases when λ = 1.
Therefore, the upper bounds given there remain valid the the special setting.
(2.2) When F is Fmono . We show that QRD(LQ , Fmono ), DRP(LQ , Fmono ) and RDC(LQ ,
Fmono ) are NP-complete, coNP-complete and #P-complete under parsimonious reductions, respectively, for fixed queries in CQ, UCQ, ∃FO+ and FO.
(2.2.1) Lower bound. We first study the lower bounds.
We show that QRD(CQ, Fmono ) is NP-hard by reduction from 3SAT. Given an instance ϕ of 3SAT, we use the same reduction given in the proof of Theorem 9.3 for
QRD(CQ, Fmono ), except the following: (a) δdis is a constant function that returns 1 for
any two different tuples of schema RQ; P
(b) when λ = 1, for any set U of tuples of
schema RQ , Fmono (U ) = λ/(|Q(D)| − 1) · t∈U,t′ ∈Q(D) (δdis (t, t′ )); and (c) k = l and B =
l·(l−1)/(|Q(D)|−1). Along the same line as the proof given there, one can readily verify
that ϕ is satisfiable if and only if there exists a valid set U for (Q, D, Σ, k, Fmono , B).
We next show that DRP(CQ, Fmono ) is coNP-hard by reduction from the complement
of 3SAT. Given an instance ϕ of 3SAT, We construct the same D, Q, U , Σ and r as their
counterparts given in the proof of Theorem 9.3 for DRP(CQ, Fmono ), and moreover, we
define the same function Fmono as the one given there when λ = 1. Along the same line
as that proof, one can verify that ϕ is not satisfiable if and only if rank(U ) ≤ r = 1.
We verify that RDC(CQ, Fmono ) is #P-hard by parsimonious reduction from #SAT.
Given an instance ϕ(X) of #SAT, we construct the same D, Q, Σ, k, B and function
Fmono as their counterparts given above for QRD(CQ, Fmono ). It is easy to verify that
µX is a truth assignment of X variables that satisfies ϕ(X) if and only if there exists
a valid set U for (Q, D, Σ, k, Fmono , B) that encodes µX , such that U consists of l tuples
in D, one for each clause, in which the values for the variables in X agree with µX .
Thus it is indeed a parsimonious reduction. From this it follows that RDC(CQ, Fmono )
is #P-hard under parsimonious reductions.
(2.2.2) Upper bound. We have already shown in Theorem 9.3 that QRD(FO, Fmono ),
DRP(FO, Fmono ) and RDC(FO, Fmono ) are in NP, coNP and #P, respectively, in the
presence of Σ. Thus the upper bounds carry over here to the special case when λ = 1.
This completes the proof of Corollary 9.6.
2
C OROLLARY 9.7. For a predefined constant k, in the presence of compatibility
constraints of Cm , the combined complexity and data complexity of Corollary 8.4
remain unchanged for QRD, DRP and RDC, respectively.
2
ACM Transactions on Database Systems, Vol. V, No. N, Article A, Publication date: January YYYY.
On the Complexity of Query Result Diversification
App–17
P ROOF. We first study the combined complexity. Observe the following. (a) The
lower bounds given in Corollary 8.4 for QRD, DRP and RDC are established for a (fixed)
constant k when compatibility constraints are absent. Thus the lower bounds remain
intact in the more general setting when compatibility constraints are present. (b) The
upper bounds given there also carry over here. This follows from the fact that the algorithms developed in the proof of Corollary 9.2 for QRD, DRP and RDC in the presence
of compatibility constraints obviously work in the special case when k is a constant.
For the data complexity, consider the algorithms given in the proof of Corollary 8.4
for QRD(FO, F ), DRP(FO, F ) and RDC(FO, F ). Along the same line as the proof of
Corollary 9.2, we revise these algorithms to inspect candidate sets for (Q, D, Σ, k), by
additionally checking whether U |= Σ. Clearly, the revised algorithms work here and
the problems are still tractable, since it is in PTIME to check whether U |= Σ.
This completes the proof of Corollary 9.7.
ACM Transactions on Database Systems, Vol. V, No. N, Article A, Publication date: January YYYY.
2