IO-Efficient Faceted Search - Max Planck Institute for Informatics

Fast and Intelligent Search
In Very Large Amounts of Data
Kick-off meeting for Cluster of Excellence
Multimodal Computing and Interaction
November 13th, 2008
Hannah Bast
Max-Planck-Institute for Informatics
Saarbrücken
General theme of my group
Searching for information
Fancy and Fast, On Lots of Data
Terabytes of data, hundreds of millions of documents
Query times in a fraction of a second
Beyond Google-style keyword search
+ always open for other real-world algorithmic problems
currently: route planning in large transportation networks
Searching for Information

Problems we have recently worked on
– efficient prefix search
– efficient faceted search
– efficient error-tolerant search
– efficient semantic search
– efficient snippet generation
– efficient index construction
– efficient 3D shape retrieval
– planned: efficient music retrieval

Collaborations noted on the slide: joint work with the database people; planned joint work with the CL people; joint work with the graphics people
Our system: the CompleteSearch engine
– efficient 
– does all of the above (not the shapes though)
There is a demo this afternoon at 2.30 pm
Recent Output

Installations
– CompleteSearch DBLP (several million hits / month)
– www.absolventa.de uses CompleteSearch (job search)
– many more: mailing list archives, library search, …

Publications
– Conferences: SIGIR, VLDB, CIKM, CIDR, SPIRE, …
– Journals: IR, TWEB, TOIS, VLDB Journal, …

Awards
– Jan’08: Meyer-Struckmann Award (15,000 €)
– Oct’08: Alcatel-Lucent Award (20,000 €)
– big press coverage (e.g., it was on the Heise newsticker)
Faceted Search

Problem
– Data: objects with ids and labels
– Query: set of object ids
– Answer: multi-set of labels of the respective objects
– This talk: exactly one label per object
id:     1          2          3          4          5
label:  year:2001  year:1997  year:2003  year:2001  year:2008
Query: I = {1, 3, 4}  Answer: {year:2001, year:2003, year:2001}
Faceted Search

a1 a2 a3 a4 a5
Query: I = {1, 3, 4}  Answer: {a1, a3, a4}
Trivial if labels are in an array in main memory
– but if data is on disk, we have block access to the data
– each read gives us a whole block of B labels (typical: B = 10,000)
– we have to minimize the number of reads / IO operations
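To make the block-access cost concrete, here is a minimal sketch that counts how many blocks a query touches when the labels are simply stored in id order. The function name `count_block_reads` and the 1-based, id-ordered layout are assumptions made for illustration, not something stated on the slides.

```python
def count_block_reads(ids, block_size):
    """Number of distinct blocks touched when label i sits at position i (1-based)."""
    # Assumption: labels are stored in id order, so object i lives in
    # block number (i - 1) // block_size.
    return len({(i - 1) // block_size for i in ids})


# Ids that are close together share a block ...
print(count_block_reads({1, 3, 4}, 10_000))
# ... while ids that are spread out cost one read each.
print(count_block_reads({1, 10_001, 20_001}, 10_000))
```

With this naive layout, a query whose ids are scattered over the whole id range pays roughly one IO per requested label, which is the cost the following slides try to reduce.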
IO-efficient Faceted Search

Precomputation:
– given n elements a1,…,an
– organize in array of size N ≥ n
Example (n = 8, N = 24):
a1 a2 a3 a4 a5 a6 a7 a8
a4 a7 a5 a3 a1 a8 a2 a6
a3 a6 a4 a2 a7 a1 a8 a5
Query:
– given I = {i1,…, im} ⊆ {1,…,n}
– return elements a_i1,…, a_im using as few IOs as possible
Example: I = {1, 6, 8}, B = 4: get a1, a6, a8 with 1 IO
Extreme solutions:
– space: n, #IOs: min{n / B, |I|} (optimal space)
– space: B ∙ (n choose B), #IOs: |I| / B (optimal #IOs)
How much space is needed for which IO-efficiency?
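The slide's example (n = 8, N = 24, B = 4) can be checked with a small brute-force sketch. It assumes that one IO reads one aligned block of B consecutive array cells. The helper name `min_ios` is hypothetical, and the exhaustive search is only meant for toy inputs.

```python
from itertools import combinations


def min_ios(layout, block_size, query):
    """Fewest aligned blocks of `layout` whose union covers every id in `query`."""
    query = set(query)
    if not query:
        return 0
    # Assumption: an IO reads cells [p, p + block_size) with p a multiple of block_size.
    blocks = [set(layout[p:p + block_size]) for p in range(0, len(layout), block_size)]
    useful = [b & query for b in blocks if b & query]
    for k in range(1, len(query) + 1):
        for chosen in combinations(useful, k):
            if set().union(*chosen) >= query:
                return k
    raise ValueError("query contains ids that are not stored in the layout")


# The three rows of the slide, written as ids instead of a1, ..., a8.
layout = [1, 2, 3, 4, 5, 6, 7, 8,
          4, 7, 5, 3, 1, 8, 2, 6,
          3, 6, 4, 2, 7, 1, 8, 5]

print(min_ios(layout, 4, {1, 6, 8}))      # redundant layout with N = 24
print(min_ios(layout[:8], 4, {1, 6, 8}))  # plain layout with N = n = 8
```

In the redundant layout the fourth block holds a1, a8, a2, a6, so the query is answered with a single read, whereas the plain layout needs two.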
A simple lower bound

Theorem:
– if we want < |I| IOs for every query I with |I| ≥ 2
– we need ≥ n² / (4∙B) space

Proof:
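1. construct graph G with n vertices
edge {i, j} iff a_i and a_j can be read in one IO
⇒ m ≤ 2B ∙ N
[Figure: example with n = 4, N = 8, B = 2: array a1 a2 a3 a4 a1 a4 a2 a3, and graph G on vertices a1, a2, a3, a4]
2. by assumption, every I = {i, j} can be
read with 1 IO, hence edge {i, j} exists
⇒ m ≥ (n choose 2) ≈ n² / 2
3. combining 1 and 2: 2B ∙ N ≥ m ≥ (n choose 2) ≈ n² / 2, hence N ≥ n² / (4∙B) up to lower-order terms
The short queries alone make the problem hard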
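The graph G from the proof can be built directly for the small example (n = 4, N = 8, B = 2). The sketch below assumes aligned blocks, and the function name `co_readable_pairs` is made up for this illustration.

```python
from itertools import combinations


def co_readable_pairs(layout, block_size):
    """Edges {i, j} of G: pairs of ids that share at least one aligned block."""
    edges = set()
    for p in range(0, len(layout), block_size):
        block = sorted(set(layout[p:p + block_size]))
        edges.update(combinations(block, 2))
    return edges


layout = [1, 2, 3, 4, 1, 4, 2, 3]  # a1 a2 a3 a4 a1 a4 a2 a3
B, n, N = 2, 4, len(layout)

edges = co_readable_pairs(layout, B)
missing = [pair for pair in combinations(range(1, n + 1), 2) if pair not in edges]

print(len(edges), "edges; upper bound 2*B*N =", 2 * B * N)
print("pairs that still need 2 IOs:", missing)
```

Under this assumption the example layout yields 4 of the 6 possible edges, so two of the two-element queries still cost two IOs. Serving all of them with one IO would require every pair to become an edge, which is what forces the array to grow.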
Restrict to large queries

Theorem:
– if we want < |I| IOs for all queries with |I| ≥ M
– we need ≥ n² / (4∙B∙M) space

Proof sketch:
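1. construct graph G as before
⇒ m ≤ 2B ∙ N
[Figure: same example as before with n = 4, N = 8, B = 2: array a1 a2 a3 a4 a1 a4 a2 a3, and graph G on vertices a1, a2, a3, a4]
2. consider arbitrary I with |I| ≥ M
⇒ I not independent in G (otherwise |I| IOs necessary)
⇒ no independent set of size ≥ M
3. Turán’s theorem implies m ≥ (n choose 2) / M
4. combining 1 and 3: N ≥ m / (2B) ≥ (n choose 2) / (2∙B∙M) ≈ n² / (4∙B∙M)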
so there is hope for queries of size linear in n
and we indeed have a space-efficient algorithm for that case
(but no time to explain it here, sorry)
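To get a feeling for the two bounds, the following sketch plugs in numbers. The values n = 10^8 and B = 10^4 are taken loosely from the orders of magnitude mentioned earlier in the talk; the choice M = n / 100 and the function name `space_lower_bound` are purely illustrative assumptions.

```python
def space_lower_bound(n, block_size, min_query_size=1):
    """Evaluate n^2 / (4 * B * M); M = 1 gives the bound for arbitrary queries."""
    return n * n / (4 * block_size * min_query_size)


n = 10**8   # assumed number of objects
B = 10**4   # typical block size from the slides

all_queries = space_lower_bound(n, B)
large_only = space_lower_bound(n, B, min_query_size=n // 100)

print(f"all queries:      N >= {all_queries:.3g}, i.e. {all_queries / n:.0f} * n")
print(f"queries >= n/100: N >= {large_only:.3g}, i.e. {large_only / n:.4f} * n")
```

For arbitrary queries the bound is thousands of times larger than n, while for queries of size linear in n it drops below n and no longer rules out a space-efficient solution.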
Turán numbers (extremal set theory)

Definition: for n ≥ k ≥ r
T(n, k, r) = the minimal number of r-subsets of {1,…,n}
such that every k-subset of {1,…,n} contains one of the
r-subsets
For r = 2: minimal number of edges in an n-vertex
graph, where all independent sets have size < k

Turán’s theorem:
– lim n → ∞ T(n, k, r) / (n choose r) exists
– exact value of limit unknown for all k > r ≥ 3

Lower bound
– T(n, k, r) ≥ (r / k)^(r−1) ∙ (n choose r)
Very natural application in
the context of faceted search!
Paul (Pál) Turán
*1910 in Budapest
†1976 in Budapest
Erdős number 1
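For the graph case r = 2, the definition above can be checked by brute force on tiny instances. The sketch compares the brute-force value with the edge count of a disjoint union of k − 1 cliques of nearly equal size (the complement of the Turán graph), which is the extremal construction given by Turán's theorem for graphs. Both function names are hypothetical, and the search is exponential, so it is only usable for very small n.

```python
from itertools import combinations


def turan_number_r2(n, k):
    """Brute-force T(n, k, 2): fewest edges such that every k-subset contains an edge."""
    vertices = range(n)
    all_edges = list(combinations(vertices, 2))
    k_subsets = [set(s) for s in combinations(vertices, k)]
    for m in range(len(all_edges) + 1):
        for edges in combinations(all_edges, m):
            if all(any(u in s and v in s for u, v in edges) for s in k_subsets):
                return m


def balanced_cliques_edges(n, parts):
    """Edges in a disjoint union of `parts` cliques with sizes as equal as possible."""
    q, rem = divmod(n, parts)
    sizes = [q + 1] * rem + [q] * (parts - rem)
    return sum(s * (s - 1) // 2 for s in sizes)


for n, k in [(5, 3), (5, 4), (6, 3)]:
    print(n, k, turan_number_r2(n, k), balanced_cliques_edges(n, k - 1))
```

On these small cases the two numbers agree, matching the reading of T(n, k, 2) as the minimal number of edges in an n-vertex graph whose independent sets all have size < k.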
Route Planning

Route planning in road networks
– from a single source to a single target (point-to-point)
– weighted graph, edge costs = travel times
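As a point of reference for the point-to-point problem, here is a minimal sketch of the classical baseline, Dijkstra's algorithm with early termination at the target. It is not transit node routing; the toy graph, its node names, and the travel times are invented for illustration.

```python
import heapq


def travel_time(graph, source, target):
    """Shortest travel time from source to target, or None if unreachable.

    graph maps a node to a list of (neighbour, travel_time) pairs.
    """
    dist = {source: 0}
    queue = [(0, source)]
    while queue:
        d, u = heapq.heappop(queue)
        if u == target:
            return d                      # target settled: d is optimal
        if d > dist.get(u, float("inf")):
            continue                      # stale queue entry
        for v, cost in graph.get(u, []):
            nd = d + cost
            if nd < dist.get(v, float("inf")):
                dist[v] = nd
                heapq.heappush(queue, (nd, v))
    return None


# Hypothetical road network, edge costs in minutes.
roads = {
    "Start": [("A", 10), ("B", 14)],
    "A": [("Target", 14)],
    "B": [("Target", 9)],
}
print(travel_time(roads, "Start", "Target"))
```

On a large road network this baseline has to explore a large part of the graph for a long-distance query, which is the cost that precomputation-based schemes aim to avoid.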
Transit Node Routing

We invented transit node routing
– 100 times faster than previous best scheme
– Oct’08 SaarLB Award (25,000 €)
(together with Stefan Funke, now University of Greifswald)
– integration with previous best scheme published in Science
(joint work with P. Sanders and D. Schultes, Uni Karlsruhe)
– big press coverage
– we are currently trying to market the idea
(via Algorithmic Solutions, a spin-off from MPII D1)
There is a demo this afternoon at 2.00 pm
Google Transit

I am currently @ Google in Zürich
– as “visiting scientist”
– great experience; I can highly recommend it
– one of my projects there is Google Transit
– public transportation networks are completely different
from road networks
they can both be modeled as graphs
and that’s about it with the similarity
– the scale is an even bigger challenge there
one node per arrival / departure event
– will publish what I have done at the end of the year
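The remark "one node per arrival / departure event" can be illustrated with a small time-expanded graph. This is only a sketch of the general modelling idea, not the method used in Google Transit. The timetable format, the station names, and the function `time_expanded_graph` are assumptions, and events at the same station and minute are merged into one node for brevity.

```python
from collections import defaultdict


def time_expanded_graph(connections):
    """Build event nodes and edges from (from_station, dep_min, to_station, arr_min) tuples."""
    events = defaultdict(set)   # station -> set of event times
    edges = []                  # (event_from, event_to, minutes)
    for frm, dep, to, arr in connections:
        events[frm].add(dep)
        events[to].add(arr)
        edges.append(((frm, dep), (to, arr), arr - dep))            # riding a vehicle
    for station, times in events.items():
        ordered = sorted(times)
        for t1, t2 in zip(ordered, ordered[1:]):
            edges.append(((station, t1), (station, t2), t2 - t1))   # waiting at a station
    nodes = {(s, t) for s, ts in events.items() for t in ts}
    return nodes, edges


# Hypothetical timetable: three connections between three stations.
timetable = [("A", 0, "B", 10), ("A", 30, "B", 40), ("B", 15, "C", 25)]
nodes, edges = time_expanded_graph(timetable)
print(len(nodes), "event nodes,", len(edges), "edges")
```

Even this toy timetable with three stations produces six nodes, which hints at why the graph size becomes a challenge for a real network with many departures per day.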
Thank you!
Precomputation of the transit nodes

From distances to paths
[Figure: example route from start to target; labels: 24 min, 20 min, 23 min]
Overview

How I work

Information retrieval
– overview of problems & results
– our CompleteSearch engine
– recent result: faceted search

Route planning
– ultrafast routing in road networks
– public transportation routing @ Google
Recent Output

Installations
– CompleteSearch DBLP (several million hits / month)
– www.absolventa.de uses CompleteSearch (job search)
– many more: mailing list archives, library search, …

Publications
– Conferences: SIGIR, VLDB, CIKM, CIDR, SPIRE, …
– Journals: IR, TWEB, TOIS, VLDB Journal, …

Awards
– Jan’08: Meyer-Struckmann Award (15,000 €)
– Oct’08: Alcatel-Lucent Award (20,000 €)
– Jul’09 : ...... (25,000 €)
How I work

I grew up in theoretical computer science
– well-defined, standard problems
– the goal is to prove theorems
– the more difficult / original, the better
– often art for art’s sake
– good to learn the art of clear & precise thinking

Then I moved to more applied problems
– work starts with a real problem (necessity is the mother of all inventions)
– finding the right abstraction is half of the challenge
– think about it, but keep in mind the real problem
– implement + experiment
– build a system and use it / let it be used
