Tushar Kumar J and Ritesh Bagga

Database Searching and Information Retrieval
Presented by:
Tushar Kumar J
Ritesh Bagga
Background

Motivation
We chose this topic out of interest in expanding our knowledge of
databases, and because it supports our research work.

Focus
Our focus is on the algorithms used to retrieve the top few
results from a database. This has recently been one of the most
exciting fields in database research.
Introduction to Problem
• Most often we query a single database.
• At times we need to query multiple databases with
  heterogeneous data.
• It is difficult for the user to write a single SQL query that works
  on all the databases.
• Solution: develop a middleware system that works on top of these
  subsystems.
• The middleware divides the query into subqueries and runs
  them on each individual subsystem.
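Introduction to Problem
Figure: query processing through the middleware.
• User Query: (Color = “Red”) AND (Shape = “Circle”)
• Middleware System (we will study algorithms which run on this middleware):
  sends Color = “Red” and Shape = “Circle” to the two subsystems.
• Each subsystem returns a graded list of the objects R1 to R4:
  “Redness” and “Circle”.
• Grades shown: R3 (0.70), R3 (1.00), R1 (1.00), R2 (0.50), R2 (0.00),
  R4 (0.40), R1 (0.10), R4 (0.00)
• Aggregation Function (MIN): combines the grades of each object
  into the Result.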
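As a minimal sketch of what the middleware has to compute, the following Python function grades every object with the aggregation function and keeps the best k. The function name `naive_top_k` and the grade values are illustrative assumptions, not taken from the slide. This brute-force version reads every grade of every object, which is the cost the algorithms in this presentation try to avoid.

```python
def naive_top_k(grade_lists, k, t=min):
    """Grade every object with t and return the k best (object, grade) pairs."""
    # Only objects graded by every subsystem can be aggregated.
    objects = set.intersection(*(set(g) for g in grade_lists))
    overall = {obj: t(g[obj] for g in grade_lists) for obj in objects}
    # Highest overall grade first; ties broken by object name.
    return sorted(overall.items(), key=lambda item: (-item[1], item[0]))[:k]


# Illustrative grades (assumed for this sketch).
redness = {"R1": 1.00, "R2": 0.50, "R3": 0.70, "R4": 0.40}
circle = {"R1": 0.10, "R2": 0.00, "R3": 1.00, "R4": 0.00}

print(naive_top_k([redness, circle], k=1))
```

With MIN as the aggregation function, an object only scores well if it scores well in both subsystems.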
Framework of this presentation
• Basic algorithms
• Comparative study of basic algorithms
• Modifications of the TA algorithm
• Advanced algorithms
• Related work
• How web-search engines rank the web pages
• Conclusion
Basic algorithms
Fagin’s Algorithm
• The original algorithm for solving the problem was developed by
  Ron Fagin and is called Fagin’s Algorithm (FA).
• FA consists of the following steps:
  - Do sorted access in parallel to each of the ‘m’ lists.
  - For every new object R seen, do random access to every other list
    to find the i-th field x_i of R.
  - Use the aggregation function t(R) = t(x_1, x_2, ..., x_m) to calculate
    the overall grade of every object, and store it in set ‘Y’.
  - Define set ‘H’ as the objects seen in all the lists.
  - Stopping point: set ‘H’ has at least k objects.
  - Sort set ‘Y’ and output the top k values.
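The steps of FA can be sketched in Python as follows. This is a minimal sketch under stated assumptions, not a definitive implementation: each list is an in-memory sequence of (object, grade) pairs sorted best first, every object appears in every list, random access is simulated with dictionaries, and sorted access proceeds one full round (one entry per list) at a time. The name `fagin_top_k` is hypothetical.

```python
def fagin_top_k(lists, k, t):
    """Sketch of FA: return the k (object, grade) pairs with the best overall grade."""
    m = len(lists)
    lookup = [dict(lst) for lst in lists]  # simulated random access, one per list
    seen_in = {}  # object -> indices of the lists where sorted access saw it
    Y = {}        # object -> overall grade
    H = set()     # objects seen under sorted access in all m lists

    depth = 0
    while len(H) < k and depth < len(lists[0]):
        # One round of sorted access: the next entry of each list.
        for i, lst in enumerate(lists):
            obj, _ = lst[depth]
            if obj not in Y:
                # Random access to the lists to collect the fields of obj.
                Y[obj] = t([lookup[j][obj] for j in range(m)])
            seen_in.setdefault(obj, set()).add(i)
            if len(seen_in[obj]) == m:
                H.add(obj)
        depth += 1

    ranked = sorted(Y.items(), key=lambda item: item[1], reverse=True)
    return ranked[:k]


l1 = [("a", 0.9), ("b", 0.8), ("c", 0.1)]
l2 = [("b", 0.7), ("a", 0.6), ("c", 0.5)]
print(fagin_top_k([l1, l2], k=1, t=min))
```

In this small example FA needs two rounds of sorted access before some object has been seen in both lists.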
Basic algorithms
Fagin’s Algorithm
Sorted lists (object and grade, best first):
L1          L2          L3          L4
R8(0.95)    R10(1.00)   R3(0.95)    R5(1.00)
R2(0.90)    R3(0.95)    R10(0.80)   R7(0.95)
R5(0.85)    R7(0.85)    R4(0.70)    R8(0.90)
R3(0.80)    R8(0.80)    R8(0.65)    R2(0.85)
R7(0.75)    R5(0.75)    R7(0.60)    R4(0.80)
R9(0.70)    R2(0.75)    R2(0.55)    R3(0.70)
R4(0.65)    R6(0.60)    R9(0.50)    R1(0.65)
R1(0.60)    R1(0.50)    R5(0.45)    R9(0.55)
R10(0.55)   R4(0.40)    R6(0.40)    R6(0.45)
R6(0.50)    R9(0.30)    R1(0.30)    R10(0.30)

Objects Seen (with overall grade):
R1    -
R2    3.05
R3    3.40
R4    2.55
R5    3.05
R6    -
R7    3.15
R8    3.30
R9    2.05
R10   2.65

Objects seen in all 4 lists: R8, R7, R2
Basic algorithms
Threshold Algorithm
• Similar to FA, with a slight modification.
• TA consists of the following steps:
  - Do sorted access in parallel to each of the ‘m’ lists.
  - For every new object R seen, do random access to every other list
    to find the i-th field x_i of R.
  - Use the aggregation function t(R) = t(x_1, x_2, ..., x_m) to calculate
    the overall grade of every object, and store it in set ‘Y’ only if it
    is among the top k objects seen so far.
  - After every round of sorted access, calculate the threshold value ‘T’
    by applying the aggregation function to the grades last seen in
    each list.
  - Stopping point: as soon as at least k objects have been seen
    whose grade is at least equal to ‘T’.
  - Return set ‘Y’, which holds the top k objects.
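The steps of TA can be sketched in Python as follows. This is a minimal sketch under stated assumptions: the lists are in-memory sequences of (object, grade) pairs sorted best first, all lists have the same length and contain every object, and random access is simulated with dictionaries. The name `threshold_top_k` is hypothetical, and the `graded` set is a convenience of this sketch that avoids regrading an object.

```python
import heapq


def threshold_top_k(lists, k, t):
    """Sketch of TA: return the k (object, grade) pairs with the best overall grade."""
    m = len(lists)
    lookup = [dict(lst) for lst in lists]  # simulated random access, one per list
    graded = set()  # objects whose overall grade was already computed
    top = []        # min-heap of (grade, object): the set Y, at most k entries

    for depth in range(len(lists[0])):
        row = [lst[depth] for lst in lists]  # one round of sorted access
        for obj, _ in row:
            if obj in graded:
                continue
            graded.add(obj)
            grade = t([lookup[j][obj] for j in range(m)])
            if len(top) < k:
                heapq.heappush(top, (grade, obj))
            elif grade > top[0][0]:
                heapq.heapreplace(top, (grade, obj))
        # Threshold: aggregation of the grades last seen in each list.
        T = t([g for _, g in row])
        if len(top) == k and top[0][0] >= T:
            break  # k objects with grade at least T have been seen

    return sorted(((obj, g) for g, obj in top), key=lambda p: p[1], reverse=True)


l1 = [("a", 0.9), ("b", 0.8), ("c", 0.1)]
l2 = [("b", 0.7), ("a", 0.6), ("c", 0.5)]
print(threshold_top_k([l1, l2], k=1, t=min))
```

In this small example the threshold after the first round is min(0.9, 0.7) = 0.7, which object b already reaches, so TA stops after one round of sorted access.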
Basic algorithms
Threshold Algorithm
L1          L2          L3          L4          Threshold Value
R8(0.95)    R10(1.00)   R3(0.95)    R5(1.00)    3.90/4
R2(0.90)    R3(0.95)    R10(0.80)   R7(0.95)    3.60/4
R5(0.85)    R7(0.85)    R4(0.70)    R8(0.90)    3.30/4
R3(0.80)    R8(0.80)    R8(0.65)    R2(0.85)    3.10/4
R7(0.75)    R5(0.75)    R7(0.60)    R4(0.80)
R9(0.70)    R2(0.65)    R2(0.55)    R3(0.70)
R4(0.65)    R6(0.60)    R9(0.50)    R1(0.65)
R1(0.60)    R1(0.50)    R5(0.45)    R9(0.55)
R10(0.55)   R4(0.40)    R6(0.40)    R6(0.45)
R6(0.50)    R9(0.30)    R1(0.30)    R10(0.30)

Top 3 Objects: R3(3.40/4), R8(3.30/4), R7(3.15/4)
Other objects graded along the way: R5(3.05/4), R2(2.95/4), R10(2.65/4)
Basic algorithms
Comparison between TA and FA
• FA is optimal only in some cases, but TA is optimal in all cases
  (instance optimal for monotone aggregation functions).
• TA uses less buffer space: it keeps only the current top k objects,
  whereas FA requires a buffer that grows with the database size.
• TA may do m-1 random accesses for every object that is not in the
  top k set, whereas FA does these random accesses only once for
  every object newly seen under sorted access.
Modifications of TA Algorithms
• Approximation algorithm: finds the top k elements to within a degree
  of approximation ‘x’. Stops earlier than TA.
• Restricting sorted access: for when sorted access to some lists is
  not allowed, e.g. finding the best restaurant.
• Restricting random access:
  - NRA (No Random Access) was developed for when no random access is
    allowed, e.g. a text retrieval system.
  - CA (Combined Algorithm) was developed for situations where random
    accesses are allowed but very costly, e.g. random disk access.
    It is a combination of TA and NRA.
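The approximation variant in the first bullet can be sketched in Python by relaxing the stopping rule of TA. This is a minimal sketch under stated assumptions: the degree of approximation is modeled as a factor `theta` >= 1, grades are non-negative, and the algorithm halts once k objects have a grade of at least T / theta, where T is the threshold value. The names `approx_threshold_top_k` and `theta` are hypothetical, and the in-memory lists with dictionary lookups stand in for sorted and random access.

```python
import heapq


def approx_threshold_top_k(lists, k, t, theta=1.0):
    """Sketch of approximate TA: stop once k grades reach T / theta."""
    m = len(lists)
    lookup = [dict(lst) for lst in lists]  # simulated random access
    graded = set()
    top = []  # min-heap of (grade, object), at most k entries

    for depth in range(len(lists[0])):
        row = [lst[depth] for lst in lists]  # one round of sorted access
        for obj, _ in row:
            if obj in graded:
                continue
            graded.add(obj)
            grade = t([lookup[j][obj] for j in range(m)])
            if len(top) < k:
                heapq.heappush(top, (grade, obj))
            elif grade > top[0][0]:
                heapq.heapreplace(top, (grade, obj))
        T = t([g for _, g in row])
        # Relaxed stopping rule: theta = 1.0 gives the exact TA rule.
        if len(top) == k and top[0][0] >= T / theta:
            break

    return sorted(((obj, g) for g, obj in top), key=lambda p: p[1], reverse=True)


l1 = [("a", 1.0), ("b", 0.9), ("c", 0.2)]
l2 = [("c", 1.0), ("b", 0.9), ("a", 0.5)]
print(approx_threshold_top_k([l1, l2], k=1, t=min, theta=1.0))
print(approx_threshold_top_k([l1, l2], k=1, t=min, theta=2.0))
```

With `theta=2.0` the sketch stops after the first round and returns object a (grade 0.5), which is within a factor of 2 of the true best grade 0.9; with `theta=1.0` it continues to the second round and returns object b.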
Advanced algorithms
• Suppose we already have several ranked lists of objects. The
  problem here is to aggregate these lists to form a single
  ranked list.
• The problem can be solved using a median finding algorithm.
• The steps involved in the median finding algorithm are:
  - Find the rank of each object in each of the ranked lists.
  - For each object, find the median of the ranks obtained from
    these lists.
  - Sort the objects by their median ranks.
  - Retrieve the results from this sorted list.
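The median finding steps above can be sketched in Python as follows. This is a minimal sketch assuming every ranked list is an in-memory sequence of object names, best first, and that every object appears in every list; the name `median_rank_order` and the tie-breaking by object name are choices made for this illustration.

```python
from statistics import median


def median_rank_order(rankings):
    """Order objects by the median of their ranks across all ranked lists."""
    # Rank of each object in each list (1 = best).
    positions = [{obj: rank for rank, obj in enumerate(lst, start=1)}
                 for lst in rankings]
    # Median of the ranks of each object.
    med = {obj: median([pos[obj] for pos in positions]) for obj in positions[0]}
    # Sort by median rank; ties are broken by object name.
    return sorted(med, key=lambda obj: (med[obj], obj))


rankings = [["a", "b", "c"], ["b", "a", "c"], ["a", "c", "b"]]
print(median_rank_order(rankings))
```

Here object a has ranks 1, 2, 1 (median 1), b has 2, 1, 3 (median 2), and c has 3, 3, 2 (median 3).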
Advanced algorithms
• The limitation of the median finding algorithm is its large number of
  random accesses, which the MEDRANK algorithm overcomes.
• MEDRANK algorithm: access the ranked lists one element of
  every list at a time, until some element has been seen in more than
  half of the lists.
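The MEDRANK idea can be sketched in Python as follows. This is a minimal sketch assuming in-memory ranked lists of equal length, best first. Continuing the scan to collect k winners rather than one is an assumption of this sketch, and the name `medrank_top_k` is hypothetical. Only sorted access is used; no random access is needed.

```python
def medrank_top_k(rankings, k):
    """Scan the lists in parallel; an object wins once seen in more than half."""
    m = len(rankings)
    majority = m // 2 + 1  # smallest count that is more than half of m
    counts = {}
    winners = []
    for depth in range(len(rankings[0])):
        for lst in rankings:
            obj = lst[depth]  # sorted access only
            counts[obj] = counts.get(obj, 0) + 1
            if counts[obj] == majority:
                winners.append(obj)
                if len(winners) == k:
                    return winners
    return winners


rankings = [["a", "b", "c"], ["b", "a", "c"], ["a", "c", "b"]]
print(medrank_top_k(rankings, k=2))
```

In this example object a is seen in two of the three lists during the first round, so it is output without reading any list to the end.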
Related work
• In 1996, Chaudhuri and Gravano presented an algorithm
  built on Fagin’s original FA algorithm.
• In 1997 and 1998, Carey and Kossmann presented techniques
  to optimize top-k queries.
• In 1999, Nepal and Ramakrishna presented variations on Fagin’s
  TA algorithm for processing queries over multimedia databases.
• In 2000, Güntzer made a notable contribution to Fagin’s
  TA algorithm by reducing the number of random accesses.
• In 2002, Chang and Hwang presented an algorithm called
  MPro to optimize the execution of expensive predicates.
How web-search engines rank the web pages (1)
• Web-search engines rank web pages based on various
  factors.
• Some of the most commonly found web-search engines are
• Frequency of occurrence and location of the search terms are the
  primary factors.
• The two most important web-search engines: Google and AltaVista.
How web-search engines rank the web pages (2)
• AltaVista
  - Maintains a huge phrase dictionary.
  - The basic intuition behind its ranking of web pages is
    as follows:
  - It first displays all the pages containing the phrase.
  - Then it displays all the pages in which the words are
    closer to each other.
  - These are followed by the pages containing all the terms,
    and then the pages containing any of the terms.
  - Another important factor is the popularity of the search being
    performed.
How web-search engines rank the web pages (3)
• Google
  - Uses a very different technology called PageRank.
• PageRank technology
  - Measures the importance of a web page by solving an
    equation.
  - Interprets a link as a vote.
  - Assesses a page’s importance by the number of votes it receives.
  - Important pages receive a higher rank and appear at the
    top of the search results.
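The "link as a vote" idea can be sketched in Python with a simple iterative computation. This is a minimal sketch, not Google's implementation: the damping factor 0.85, the fixed number of iterations, and the handling of pages without outgoing links are assumptions of this illustration, and the name `pagerank` and the tiny link graph are hypothetical. Every link target must also appear as a key of `links`.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Iteratively estimate page importance from a dict: page -> linked pages."""
    pages = list(links)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        new = {p: (1.0 - damping) / n for p in pages}
        for p, outs in links.items():
            if outs:
                # Each link passes on an equal share of the page's rank (a vote).
                share = damping * rank[p] / len(outs)
                for q in outs:
                    new[q] += share
            else:
                # A page without links spreads its rank evenly over all pages.
                for q in pages:
                    new[q] += damping * rank[p] / n
        rank = new
    return rank


links = {"A": ["C"], "B": ["C"], "C": ["A"]}
ranks = pagerank(links)
print(sorted(ranks, key=ranks.get, reverse=True))
```

Page C receives votes from both A and B and ends up with the highest rank; B receives no votes and ends up with the lowest.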
Conclusion
• The literature studied shows that much work has been done on
  the problem of retrieving the top-k results from a database.
• We came across many algorithms which are very tricky to
  understand.
• Research in this field is still very active.
• The focus is now on devising more sophisticated algorithms for
  aggregating ranked lists.
References
[1] Ronald Fagin, “Combining Fuzzy Information from Multiple Systems”, received July 4, 1996;
revised June 22, 1998.
[2] Ronald Fagin, “Combining Fuzzy Information: an Overview”, ACM SIGMOD
Record 31, 2 (June 2002), pp. 109-118.
[3] Ronald Fagin, Amnon Lotem, and Moni Naor, “Optimal Aggregation Algorithms for
Middleware”, Journal of Computer and System Sciences 66 (2003), pp. 614-656. Extended abstract
in Proc. 2001 ACM Symposium on Principles of Database Systems (PODS '01),
pp. 102-113.
[4] Ronald Fagin, Ravi Kumar, and D. Sivakumar, “Efficient Similarity Search and Classification via
Rank Aggregation”, Proc. 2003 ACM SIGMOD Conference (SIGMOD '03), pp. 301-312.
[5] Ronald Fagin, Ravi Kumar, Mohammad Mahdian, D. Sivakumar, and Erik Vee, “Comparing
and Aggregating Rankings with Ties”, Proc. 2004 ACM Symposium on Principles of
Database Systems (PODS '04), pp. 47-58.
[6] Ronald Fagin, Ravi Kumar, and D. Sivakumar, “Comparing Top k Lists”, SIAM J. Discrete
Mathematics 17, 1 (2003), pp. 134-160. Extended abstract in 2003 ACM-SIAM
Symposium on Discrete Algorithms (SODA '03), pp. 28-36.
[7] A. Marian, N. Bruno, and L. Gravano, “Evaluating Top-k Queries over Web-Accessible
Databases”, accepted for publication in ACM Transactions on Database Systems, 2003.
[8] Martin P. Courtois and Michael W. Berry, “Results Ranking in Web Search Engines”, Online,
May 1999.
Thank you!
Any Questions?