
Combining Fuzzy Information:
an Overview
Author: Ronald Fagin
Presented by: Bill Eberle
Overview






Introduction
Model
Algorithms
Turning TA into an Approximation Algorithm
Restricting Sorted Access
Restricting Random Access
Introduction






Be able to access data from a variety of data repositories.
Such data is inherently “fuzzy” (ex. the color “red”, where there
are degrees of redness).
Result is a “graded” set, or set of pairs (x,g), where x is an
object and g is the grade, a real number in [0,1].
Scoring (aggregation) function is used to handle compound
queries (ex. redness and roundness).
If x1,…,xm are the grades of an object R under each of the m
attributes, then t(x1,…,xm) is the overall grade of an object R.
A scoring function is monotone if t(x1,…,xm) <= t(x’1,…,x’m)
whenever xi <= x’i for every i. In other words: if for every
attribute, the grade of object R’ is at least as high as that of
object R, then we would expect the overall grade of R’ to be at
least as high as that of R.
(discussion restricted to monotone aggregate functions)
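A minimal sketch of the monotonicity condition, checked by brute force on a small grid of grades. The helper is_monotone_on_grid and the three sample functions are illustrative assumptions, not taken from the paper; passing the grid check is evidence of monotonicity, not a proof.

```python
from itertools import product


def is_monotone_on_grid(t, m, steps=4):
    """Check t(x) <= t(x') whenever x_i <= x'_i for every i.

    Only grades on a grid with `steps` intervals in [0, 1] are tested.
    """
    grid = [i / steps for i in range(steps + 1)]
    points = list(product(grid, repeat=m))
    for x in points:
        for y in points:
            dominated = all(a <= b for a, b in zip(x, y))
            if dominated and t(*x) > t(*y):
                return False
    return True


def t_min(*grades):
    return min(grades)


def t_avg(*grades):
    return sum(grades) / len(grades)


def t_spread(*grades):
    # Hypothetical non-monotone function: raising the lowest grade lowers it.
    return max(grades) - min(grades)
```

Here min and average pass the check, while t_spread fails it: moving from grades (0, 1) to (1, 1) raises one grade yet drops the result from 1 to 0.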
Introduction (continued)





Middleware: system “on top of” various subsystems with the
purpose of integrating results from the subsystems.
Random access: request the grade under a given attribute for
any given object.
Sorted Access: request the top k objects in sorted order, each
along with its grade.
Simplistic middleware cost = total number of objects obtained
from the database under sorted access + total number of
objects obtained from the database under random access (times
some positive constants).
This paper discusses and compares algorithms for finding the
top k objects. In other words, obtain k objects with the highest
grades on a query, along with their grades.
The Model





N is the number of objects.
Each object R has m fields x1,…,xm, where xi
is in [0,1] for each i.
Database consists of m sorted lists L1,…,Lm,
each of length N.
Each entry of Li is of the form (R,xi), where xi
is the i th field of R, and the list Li is sorted in
descending order by the xi value.
The cost model takes into account only access costs
and ignores internal computation costs.
The Naive Algorithm


Under sorted access, looks at every entry in each
of the m sorted lists, computes (using t) the
overall grade of every object, and returns the top
k answers.
Linear middleware cost (linear in the database
size), and thus not efficient for a large database.
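The naive algorithm can be sketched as follows. This is an illustration under assumptions: each sorted list is a Python list of (object, grade) pairs in descending grade order, every object appears in every list, and the names naive_top_k, t_min, list_1 and list_2 are made up for the example.

```python
import heapq


def t_min(*grades):
    return min(grades)


def naive_top_k(lists, t, k):
    """Read every entry of every list, grade every object, keep the best k."""
    m = len(lists)
    fields = {}  # object -> {list index: grade}
    sorted_accesses = 0
    for i, sorted_list in enumerate(lists):
        for obj, grade in sorted_list:
            sorted_accesses += 1
            fields.setdefault(obj, {})[i] = grade
    scored = [(t(*(f[i] for i in range(m))), obj) for obj, f in fields.items()]
    return heapq.nlargest(k, scored), sorted_accesses


# Hypothetical database with N = 4 objects and m = 2 attributes.
list_1 = [("a", 0.9), ("b", 0.8), ("c", 0.3), ("d", 0.1)]
list_2 = [("b", 0.7), ("c", 0.6), ("a", 0.2), ("d", 0.1)]
top, cost = naive_top_k([list_1, list_2], t_min, k=1)
```

The number of sorted accesses is always m times N (8 here), which is the linear cost noted above.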
Fagin’s Algorithm (FA)

Algorithm:





Do sorted access in parallel to each of the m sorted lists until there is
a set H of at least k objects such that each of these
objects has been seen in each of the m (sorted) lists.
For each object R that has been seen, do random access to
each of the lists Li to find the ith field xi of R.
Compute the grade t for each object R that has been seen,
and let Y be the set containing the k objects that have been seen
with the highest grades.
FA is correct for monotone scoring functions.
Middleware cost of FA (if N objects in the database
and the orderings in the sorted lists are
probabilistically independent):
O(N^((m-1)/m) k^(1/m))
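A minimal sketch of FA's two phases, under the same assumed data layout (lists of (object, grade) pairs in descending order). The names fagin_top_k, t_min, list_1 and list_2 are hypothetical, and random access is simulated with a dictionary per list. Random accesses are counted only for fields not already obtained under sorted access, which is a choice made for this sketch.

```python
import heapq


def t_min(*grades):
    return min(grades)


def fagin_top_k(lists, t, k):
    """lists[i] holds (object, grade) pairs sorted by grade, descending."""
    m, n = len(lists), len(lists[0])
    lookup = [dict(lst) for lst in lists]  # stands in for random access
    seen_in = {}  # object -> indices of lists where sorted access saw it
    sorted_accesses = random_accesses = 0

    # Phase 1: sorted access in parallel until k objects were seen in all m lists.
    for depth in range(n):
        for i in range(m):
            obj, _ = lists[i][depth]
            sorted_accesses += 1
            seen_in.setdefault(obj, set()).add(i)
        matches = sum(1 for where in seen_in.values() if len(where) == m)
        if matches >= k:
            break

    # Phase 2: random access for missing fields, then grade every seen object.
    scored = []
    for obj, where in seen_in.items():
        random_accesses += m - len(where)
        scored.append((t(*(lookup[i][obj] for i in range(m))), obj))
    return heapq.nlargest(k, scored), sorted_accesses, random_accesses


list_1 = [("a", 0.9), ("b", 0.8), ("c", 0.3), ("d", 0.1)]
list_2 = [("b", 0.7), ("c", 0.6), ("a", 0.2), ("d", 0.1)]
top, n_sorted, n_random = fagin_top_k([list_1, list_2], t_min, k=1)
```

On this toy database FA stops after two rounds of sorted access, because b is the first object seen in both lists. Note that seen_in grows with the number of objects seen.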
Threshold Algorithm (TA)

Algorithm:
Do sorted access in parallel to each of the m sorted lists Li. As an object R is seen under sorted
access in some list, do random access to the other lists to find the grade xi of object R in every
list Li. Then compute the grade t(R) of object R. If this grade is one of the k highest seen, then
remember object R and its grade t(R).
For each list Li, let xi be the grade of the last object seen under sorted access. Define the threshold
value T to be t(x1,…,xm). As soon as at least k objects have been seen whose grade is at least
equal to T, then halt.
Let Y be the set containing the k objects that have been seen with the highest grades.
Reason the stopping rule for TA always occurs at least as early as the stopping rule for FA:
In FA, if R is an object that has appeared under sorted access in every list, then by monotonicity,
the grade of R is at least equal to the threshold value. Therefore, when there are at least k objects,
each of which has appeared under sorted access in every list (the stopping rule for FA), there are at
least k objects whose grade is at least equal to the threshold value (the stopping rule for TA).
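The TA steps can be sketched as below, assuming the same data layout as before (lists of (object, grade) pairs in descending order, random access simulated by dictionaries). The function and variable names are hypothetical. The threshold is checked once per round of parallel sorted access, which is one reasonable reading of the stopping rule.

```python
import heapq


def t_min(*grades):
    return min(grades)


def threshold_top_k(lists, t, k):
    """lists[i] holds (object, grade) pairs sorted by grade, descending."""
    m, n = len(lists), len(lists[0])
    lookup = [dict(lst) for lst in lists]  # stands in for random access
    top = []  # min-heap of (grade, object); never more than k entries
    in_top = set()
    sorted_accesses = random_accesses = 0

    for depth in range(n):
        last_seen = []
        for i in range(m):
            obj, grade = lists[i][depth]
            sorted_accesses += 1
            last_seen.append(grade)
            if obj in in_top:
                continue
            # Random access to the other m - 1 lists for this object.
            random_accesses += m - 1
            score = t(*(grade if j == i else lookup[j][obj] for j in range(m)))
            if len(top) < k:
                heapq.heappush(top, (score, obj))
                in_top.add(obj)
            elif score > top[0][0]:
                _, evicted = heapq.heapreplace(top, (score, obj))
                in_top.discard(evicted)
                in_top.add(obj)
        # Halt once k remembered objects have grade >= T = t(last seen grades).
        if len(top) == k and top[0][0] >= t(*last_seen):
            break
    return sorted(top, reverse=True), sorted_accesses, random_accesses


list_1 = [("a", 0.9), ("b", 0.8), ("c", 0.3), ("d", 0.1)]
list_2 = [("b", 0.7), ("c", 0.6), ("a", 0.2), ("d", 0.1)]
top, n_sorted, n_random = threshold_top_k([list_1, list_2], t_min, k=1)
```

On this toy database TA halts after one round: T = min(0.9, 0.7) = 0.7 and b already has grade 0.7. Only the k-entry heap is kept, so an object that is seen again may be looked up again.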
Advantages of TA over FA:


FA is optimal in a high-probability worst-case sense under certain assumptions; TA is instance
optimal, which intuitively means it is optimal in every instance, as opposed to just the worst case or
the average case.
FA requires buffers that grow arbitrarily large as the database grows, since it must remember every
object it has seen under sorted access, in order to check for matching objects in the various lists; TA
requires only bounded buffers, whose size is independent of the size of the database.
Turning TA into an Approximation Algorithm



TA can easily be modified to be an approximation
algorithm, where we care only about the approximately
top k answers.
First define a θ-approximation to the top k answers (for t
over database D) to be a collection of k objects (and their
grades) such that for each y among these k objects and
each z not among these k objects, θ·t(y) >= t(z).
To find a θ-approximation to the top k answers, modify
the stopping rule of TA to be:
As soon as at least k objects have been seen whose grade is
at least equal to T/θ, then halt.
If θ > 1 and the aggregate function t is monotone, then
TAθ (TA with this stopping rule) correctly finds a
θ-approximation to the top k answers for t.
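Also suggests an interactive version, in which the user sees the current top k
list together with the approximation guarantee reached so far.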
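A small sketch of the modified stopping rule and of the approximation condition it guarantees. The Greek letter is written as theta, and both function names are hypothetical.

```python
def ta_theta_should_halt(seen_grades, k, threshold, theta):
    """Stopping rule: at least k seen grades are >= T / theta."""
    if theta <= 1:
        raise ValueError("theta must be greater than 1")
    qualifying = sum(1 for g in seen_grades if g >= threshold / theta)
    return qualifying >= k


def is_theta_approximation(chosen, all_grades, theta):
    """Check theta * t(y) >= t(z) for every chosen y and every other z.

    chosen: set of returned objects; all_grades: object -> overall grade.
    """
    others = [g for obj, g in all_grades.items() if obj not in chosen]
    return all(theta * all_grades[y] >= z for y in chosen for z in others)


# Hypothetical overall grades for three objects.
grades = {"a": 0.5, "b": 0.9, "c": 0.1}
```

With theta = 2, returning a instead of the true top object b is acceptable, since 2 * 0.5 >= 0.9. With theta = 1.5 it is not.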
Restricting Sorted Access

Sometimes it is not possible to access certain of the lists under sorted access.




Example: Zagat-Review web-site gives ratings of restaurants, NYT-Review web-site
gives prices, and MapQuest web-site gives distances; however, only the Zagat-Review
web-site can be accessed under sorted access.
Let Z be the set of indices i of those lists Li that can be accessed under sorted
access (assume that there is at least one list).
We take m’ to be the cardinality |Z| of Z (and m is still the total number of
sorted lists).
Modification to TA algorithm to deal with this restriction (TAZ):



Do sorted access in parallel to each of the m’ sorted lists Li with i in Z. As an object R
is seen under sorted access in some list, do random access to the other lists to find the
grade xi of object R in every list Li. Then compute the grade t(R) of object R. If this
grade is one of the k highest seen, then remember object R and its grade t(R).
For each list Li, with i in Z, let xi be the grade of the last object seen under sorted
access. For each list Li with i not in Z, let xi = 1. Define the threshold value T to be
t(x1,…,xm). As soon as at least k objects have been seen whose grade is at least equal
to T, then halt.
Let Y be the set containing the k objects that have been seen with the highest grades.
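A sketch of TAZ under the same assumptions as the earlier examples would make (lists of (object, grade) pairs in descending order, random access simulated by dictionaries, hypothetical names). The parameter sortable plays the role of Z, using 0-based list indices.

```python
import heapq


def t_min(*grades):
    return min(grades)


def taz_top_k(lists, sortable, t, k):
    """Like TA, but sorted access is allowed only on lists whose index is in Z."""
    m, n = len(lists), len(lists[0])
    lookup = [dict(lst) for lst in lists]  # stands in for random access
    top, in_top = [], set()
    last_seen = [1.0] * m  # lists outside Z keep x_i = 1 in the threshold
    sorted_accesses = 0

    for depth in range(n):
        for i in sortable:
            obj, grade = lists[i][depth]
            sorted_accesses += 1
            last_seen[i] = grade
            if obj in in_top:
                continue
            score = t(*(grade if j == i else lookup[j][obj] for j in range(m)))
            if len(top) < k:
                heapq.heappush(top, (score, obj))
                in_top.add(obj)
            elif score > top[0][0]:
                _, evicted = heapq.heapreplace(top, (score, obj))
                in_top.discard(evicted)
                in_top.add(obj)
        if len(top) == k and top[0][0] >= t(*last_seen):
            break
    return sorted(top, reverse=True), sorted_accesses


list_1 = [("a", 0.9), ("b", 0.8), ("c", 0.3), ("d", 0.1)]
list_2 = [("b", 0.7), ("c", 0.6), ("a", 0.2), ("d", 0.1)]
top, n_sorted = taz_top_k([list_1, list_2], sortable=[0], t=t_min, k=1)
```

Because the second list contributes the constant 1 to the threshold, T drops only as fast as the first list does, so this run needs three sorted accesses before T = 0.3 falls below the best grade 0.7.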
Restricting Sorted Access Example





Assume there are only 3 sorted lists L1, L2 and L3, and that only
L1 may be accessed under sorted access (Z={1}).
Let t be the aggregation function where t(x,y,z) = min{x,y} if z = 1,
and t(x,y,z) = (min{x,y,z})/2 if z != 1.
Assume we want to find the top answer (i.e. k = 1).
Looking at the tables, t(R) = 0.6. By the distinctness property, no other
object R’ has grade 1 in L3, so t(R’) = (min{x,y,z})/2 <= 0.5.
Thus, R is the top object.
L1
L2
L3
(R,1)
…
(R, 1)
…
(R,0.6)
…
(., 0.7)
…
…
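The arithmetic of this example can be checked with a few lines. This is only a sketch: the name t_example is hypothetical, and since min{x,y} is symmetric it does not matter here which of L1 and L2 holds the grade 0.6 for R.

```python
def t_example(x, y, z):
    """Aggregation function of the example: min{x,y} if z = 1, else min{x,y,z}/2."""
    if z == 1:
        return min(x, y)
    return min(x, y, z) / 2


# R has grade 1 in L3 and min{x, y} = 0.6.
grade_R = t_example(1, 0.6, 1)

# Any other object has z < 1 under distinctness; try the most favourable x and y.
best_other = max(t_example(1, 1, z / 100) for z in range(100))
```

Even with x = y = 1, an object whose grade in L3 is below 1 cannot exceed 0.5, so R is the top object.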
Restricting Random Access

Sometimes it is not possible to access certain of the lists under random access.
Example: Middleware system is a text retrieval system, and the subsystems are search
engines. There is no way to ask a major search engine on the web for its internal
score on some document of our choice under a query.
Sometimes random access is possible, but very expensive (ex. when the costs
correspond to disk access).
For these scenarios, the desired output changes to just returning the top k
objects, without their grades.
Some notions corresponding to lower bounds on the overall grade an object can
attain:
Define WS(R) to be the minimum (or worst) value the aggregation function t can attain
for object R, given the set S of known fields of R. When t is monotone, the minimum
value is obtained by substituting for each missing field the value 0, and applying t
to the result.
Some notions corresponding to upper bounds on the overall grade an object can
attain:



Best value an object can attain depends on other information we have.
Use only the bottom values in each field: the bottom value xi is the last (smallest) value
seen under sorted access in list Li.
Let S = {i1,i2,…,il} be the known fields of R, with values xi1,xi2,…,xil for these known fields.
Define BS(R) to be the maximum (or best) value the aggregation function t can attain
for object R. When t is monotone, this maximum value is obtained by substituting for
each missing field i the bottom value xi, and applying t to the result.
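The two bounds can be sketched as follows, assuming an object's known fields are stored as a dictionary from list index to grade and the bottom values as a list. The names worst_score, best_score and t_avg are hypothetical.

```python
import math


def t_avg(*grades):
    return sum(grades) / len(grades)


def worst_score(known, m, t):
    """W_S(R): substitute 0 for every missing field (t assumed monotone)."""
    return t(*(known.get(i, 0.0) for i in range(m)))


def best_score(known, bottoms, t):
    """B_S(R): substitute the bottom value of list i for every missing field i."""
    return t(*(known.get(i, bottoms[i]) for i in range(len(bottoms))))


# Hypothetical state: R has grade 1 in list 0, its field in list 1 is unknown,
# and the last grades seen under sorted access are 1/3 in both lists.
known_R = {0: 1.0}
bottoms = [1 / 3, 1 / 3]
w = worst_score(known_R, 2, t_avg)
b = best_score(known_R, bottoms, t_avg)
```

For this state the overall grade of R lies between 1/2 and 2/3, while an object with no known fields lies between 0 and 1/3.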
No Random Access (NRA)


Goal is to obtain enough partial information
about grades to know that an object is in the
top k objects without knowing its exact grade.
Example:
Aggregation function is average.
k = 1 (only top object)
L1: (R,1), (.,1/3), (.,1/3), …
L2: (.,1/3), (.,1/3), …, (R,0)
Two sorted lists L1 and L2, and the grade of every object in
both L1 and L2 is 1/3, except that object R has a grade 1 in L1
and grade 0 in L2.
After two sorted accesses to L1 and one sorted access to L2,
there is enough information to know that object R is the top
object (its average grade is at least 1/2, and every other
object has average grade at most 1/3).
If the top k objects are desired in sorted order, this can easily be
determined by finding the top object, then the top 2 objects, etc.
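A minimal NRA sketch on a four-object version of this example. It assumes sorted access proceeds in parallel rounds (one entry from each list per round), keeps worst and best bounds for every seen object, and halts when no other object, seen or unseen, can beat the current top k. The names nra_top_k, t_avg and the filler objects a, b, c are hypothetical.

```python
def t_avg(*grades):
    return sum(grades) / len(grades)


def nra_top_k(lists, t, k):
    """Return the top k objects (without grades) using sorted access only."""
    m, n = len(lists), len(lists[0])
    known = {}  # object -> {list index: grade seen under sorted access}
    bottoms = [1.0] * m  # last grade seen in each list
    accesses = 0
    ranked = []

    def worst(fields):
        return t(*(fields.get(i, 0.0) for i in range(m)))

    def best(fields):
        return t(*(fields.get(i, bottoms[i]) for i in range(m)))

    for depth in range(n):
        for i in range(m):
            obj, grade = lists[i][depth]
            accesses += 1
            known.setdefault(obj, {})[i] = grade
            bottoms[i] = grade
        ranked = sorted(
            known, key=lambda o: (worst(known[o]), best(known[o])), reverse=True
        )
        if len(ranked) < k:
            continue
        cutoff = worst(known[ranked[k - 1]])  # k-th largest worst-case grade
        others_ok = all(best(known[o]) <= cutoff for o in ranked[k:])
        if others_ok and t(*bottoms) <= cutoff:
            break
    return ranked[:k], accesses


third = 1 / 3
list_1 = [("R", 1.0), ("a", third), ("b", third), ("c", third)]
list_2 = [("a", third), ("b", third), ("c", third), ("R", 0.0)]
top, n_sorted = nra_top_k([list_1, list_2], t_avg, k=1)
```

With parallel rounds the sketch halts after two rounds (four sorted accesses) rather than the three accesses counted above, but on the same evidence: R is at least 1/2 and every other object is at most 1/3. The exact grade of R is never learned.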
Combined Algorithm (CA)





Uses random accesses, but takes their cost
(relative to sorted access) into account.
Let cS be the cost of a sorted access, and cR be
the cost of a random access.
Middleware cost of an algorithm that makes s
sorted accesses and r random ones is scS + rcR.
The optimality ratio of TA is a function of the relative
cost of a random access to a sorted access:
cR/cS.
Goal is to find an algorithm that is instance
optimal and where the optimality ratio is
independent of cR/cS.
Combined Algorithm (continued)



CA is a merge between TA (which is
instance optimal) and NRA.
Let h = cR/cS. Let’s assume that cR >= cS,
so that h >= 1.
The idea of CA is to run NRA, but every h
steps to run a random access phase and
update the information (the upper and
lower bounds B and W) accordingly.
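A rough sketch of the CA idea: NRA-style rounds, plus a random access phase every h rounds. Several details are assumptions made for the illustration and are not stated on the slide: h is rounded down to an integer, the random access phase resolves the one incomplete object with the largest upper bound B, and the halting test is the same as in a plain NRA loop. All names are hypothetical.

```python
def t_avg(*grades):
    return sum(grades) / len(grades)


def ca_top_k(lists, t, k, c_s=1.0, c_r=2.0):
    """Return (top k objects, middleware cost s*c_s + r*c_r)."""
    m, n = len(lists), len(lists[0])
    h = max(1, int(c_r // c_s))  # assumption: integer part of cR/cS
    lookup = [dict(lst) for lst in lists]  # stands in for random access
    known, bottoms = {}, [1.0] * m
    s = r = 0
    ranked = []

    def worst(fields):
        return t(*(fields.get(i, 0.0) for i in range(m)))

    def best(fields):
        return t(*(fields.get(i, bottoms[i]) for i in range(m)))

    for depth in range(n):
        for i in range(m):
            obj, grade = lists[i][depth]
            s += 1
            known.setdefault(obj, {})[i] = grade
            bottoms[i] = grade
        if (depth + 1) % h == 0:
            # Random access phase: fill in the most promising incomplete object.
            incomplete = [o for o in known if len(known[o]) < m]
            if incomplete:
                target = max(incomplete, key=lambda o: best(known[o]))
                for j in range(m):
                    if j not in known[target]:
                        known[target][j] = lookup[j][target]
                        r += 1
        ranked = sorted(
            known, key=lambda o: (worst(known[o]), best(known[o])), reverse=True
        )
        if len(ranked) < k:
            continue
        cutoff = worst(known[ranked[k - 1]])
        others_ok = all(best(known[o]) <= cutoff for o in ranked[k:])
        if others_ok and t(*bottoms) <= cutoff:
            break
    return ranked[:k], s * c_s + r * c_r


third = 1 / 3
list_1 = [("R", 1.0), ("a", third), ("b", third), ("c", third)]
list_2 = [("a", third), ("b", third), ("c", third), ("R", 0.0)]
top, cost = ca_top_k([list_1, list_2], t_avg, k=1, c_s=1.0, c_r=2.0)
```

With cS = 1 and cR = 2 the sketch makes four sorted accesses and one random access (for the missing field of R), for a middleware cost of 4·1 + 1·2 = 6.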