Modeling Query-Based Access to Text Databases

Eugene Agichtein
Panagiotis Ipeirotis
Luis Gravano
Computer Science Department
Columbia University
Extracting Structured Information
“Buried” in Text Documents
May 19 1995, Atlanta -- The Centers for Disease Control
and Prevention, which is in the front line of the world's
response to the deadly Ebola epidemic in Zaire,
is finding itself hard pressed to cope with the crisis…
Information Extraction System (e.g., NYU’s Proteus)

Date        DiseaseName       Location
Jan. 1995   Malaria           Ethiopia
July 1995   Mad Cow Disease   The U.K.
Feb. 1995   Pneumonia         The U.S.
May 1995    Ebola             Zaire
Extracting All “Tuples” of a Relation
from a Text Database
[Figure: documents flow into the Information Extraction System, which outputs Extracted Tuples]
Naïve approach: feed every document to the information extraction system. At 7 secs./document, Proteus takes over 8 days for 100K documents.
Only a tiny fraction of documents contains tuples: processing every document is inefficient.
Search engines can help with efficiency and accessibility.
Many databases are not crawlable (scannable), but are available only via a search engine.
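As a quick check of the cost figure above, the arithmetic can be reproduced directly. The variable names are illustrative only; the two input numbers are the ones quoted on the slide.

```python
# Back-of-the-envelope cost of running extraction over every document.
SECONDS_PER_DOCUMENT = 7      # per-document time quoted for Proteus
NUM_DOCUMENTS = 100_000       # "100K documents"

total_seconds = SECONDS_PER_DOCUMENT * NUM_DOCUMENTS
total_days = total_seconds / (60 * 60 * 24)

print(f"{total_seconds} seconds = {total_days:.1f} days")
# prints "700000 seconds = 8.1 days"
```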
A Query-Based Strategy for
Information Extraction
[Agichtein and Gravano, ICDE 2003]
0 Start with some seed tuples (e.g., <“May 1995”, “Ebola”, “Zaire”>)
1 While seed has an unprocessed tuple t:
2   Retrieve up to MaxResults documents using a query derived from t
3   Extract new tuples te from these documents
4   Augment seed with te
Potential problem: may run out of tuples (and queries), leaving an incomplete relation!
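The loop above can be sketched as follows. This is a minimal illustration, not the implementation from [Agichtein and Gravano, ICDE 2003]: the `search` and `extract` callables, the way a query is built from a tuple, and the toy data are all assumptions made for the example.

```python
from collections import deque
from typing import Callable, Hashable, Iterable, Set, Tuple

Row = Tuple[str, ...]


def iterative_extraction(
    seed: Iterable[Row],
    search: Callable[[str, int], Iterable[Hashable]],
    extract: Callable[[Hashable], Iterable[Row]],
    max_results: int,
) -> Set[Row]:
    """Grow a relation from seed tuples by querying, then extracting."""
    relation: Set[Row] = set(seed)
    unprocessed = deque(relation)      # step 1: tuples not yet used as queries
    processed_docs = set()             # skip documents already extracted
    while unprocessed:
        t = unprocessed.popleft()
        query = " ".join(t)            # assumption: query = attribute values of t
        for doc in search(query, max_results):   # step 2
            if doc in processed_docs:
                continue
            processed_docs.add(doc)
            for te in extract(doc):              # step 3
                if te not in relation:           # step 4
                    relation.add(te)
                    unprocessed.append(te)
    return relation


# Hypothetical stand-ins for a search engine and an extraction system.
DOCS = {
    "d1": [("May 1995", "Ebola", "Zaire"), ("Jan. 1995", "Malaria", "Ethiopia")],
    "d2": [("Jan. 1995", "Malaria", "Ethiopia"), ("Feb. 1995", "Pneumonia", "The U.S.")],
    "d3": [("July 1995", "Mad Cow Disease", "The U.K.")],
}


def toy_search(query: str, max_results: int):
    # A document matches if the query mentions one of its disease names.
    hits = [d for d in sorted(DOCS) if any(row[1] in query for row in DOCS[d])]
    return hits[:max_results]


def toy_extract(doc: str):
    return DOCS[doc]


seed = [("May 1995", "Ebola", "Zaire")]
found = iterative_extraction(seed, toy_search, toy_extract, max_results=10)
```

In this toy database the tuple for Mad Cow Disease is never reached from the seed, so `found` is an incomplete relation, which is the failure mode noted above. Lowering `max_results` shrinks the result further.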
Iterative Methods Sometimes
(but not Always) “Succeed”
[Figure: two runs of the iterative method starting from seed tuples, one labeled SUCCESS and one labeled FAIL]
Can we predict if a query-based strategy
will succeed?
Model: Querying Graph
[Figure: querying graph linking Tokens t1 to t5 with Documents d1 to d5]
Tokens: tuple attributes, e.g., the values in <“May 1995”, “Ebola”, “Zaire”>
Each token (as a query) retrieves documents
Documents contain tokens
Model: Reachability Graph
[Figure: querying graph (Tokens t1 to t5, Documents d1 to d5) and the reachability graph over t1 to t5 derived from it]
t1 retrieves document d1 that contains t2
t2, t3, and t4 are “reachable” from t1
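One way to make the relationship between the two graphs concrete is the sketch below. It assumes the querying graph is given as two dictionaries, `retrieves` (token to documents) and `contains` (document to tokens); these names, the choice to drop self-edges, and the toy data are assumptions for illustration and do not reproduce the figure's exact edges.

```python
from typing import Dict, Set


def reachability_graph(
    retrieves: Dict[str, Set[str]], contains: Dict[str, Set[str]]
) -> Dict[str, Set[str]]:
    """Add an edge t -> t2 when t retrieves a document that contains t2."""
    edges: Dict[str, Set[str]] = {t: set() for t in retrieves}
    for t, docs in retrieves.items():
        for d in docs:
            for other in contains.get(d, ()):
                if other != t:                 # assumption: ignore self-edges
                    edges[t].add(other)
                    edges.setdefault(other, set())
    return edges


def reachable_from(edges: Dict[str, Set[str]], start: str) -> Set[str]:
    """Tokens reachable from `start` by following one or more edges."""
    seen: Set[str] = set()
    stack = [start]
    while stack:
        node = stack.pop()
        for nxt in edges.get(node, ()):
            if nxt not in seen:
                seen.add(nxt)
                stack.append(nxt)
    return seen


# Toy querying graph (illustrative only).
retrieves = {"t1": {"d1"}, "t2": {"d2"}, "t3": {"d3"}, "t4": set(), "t5": {"d5"}}
contains = {"d1": {"t1", "t2"}, "d2": {"t2", "t3", "t4"}, "d3": {"t3"}, "d5": {"t5"}}

edges = reachability_graph(retrieves, contains)
from_t1 = reachable_from(edges, "t1")   # t2 directly, then t3 and t4 via t2
```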
Model: Connected Components
[Figure: tokens t1 to t4 grouped into In, Core, and Out]
Core: strongly connected component
In: tokens not in Core, but from which Core is reachable
Out: tokens not in Core, but reachable from Core
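The three definitions above can be computed with two graph traversals from any token known to lie in the Core. The sketch below assumes the reachability graph is a dictionary of successor sets and that the caller supplies such a `pivot` token; both the function name and the toy graph are hypothetical.

```python
from typing import Dict, Set


def _reach(edges: Dict[str, Set[str]], start: str) -> Set[str]:
    """Nodes reachable from `start` by following one or more edges."""
    seen: Set[str] = set()
    stack = [start]
    while stack:
        node = stack.pop()
        for nxt in edges.get(node, ()):
            if nxt not in seen:
                seen.add(nxt)
                stack.append(nxt)
    return seen


def bow_tie(edges: Dict[str, Set[str]], pivot: str) -> Dict[str, Set[str]]:
    """Split the graph into Core, In, and Out relative to the pivot's component."""
    nodes = set(edges) | {m for outs in edges.values() for m in outs}
    reverse: Dict[str, Set[str]] = {n: set() for n in nodes}
    for n, outs in edges.items():
        for m in outs:
            reverse[m].add(n)

    forward = _reach(edges, pivot)      # everything the pivot can reach
    backward = _reach(reverse, pivot)   # everything that can reach the pivot
    core = (forward & backward) | {pivot}
    return {"core": core, "in": backward - core, "out": forward - core}


# Toy reachability graph: a -> b <-> c -> d, with e isolated.
toy = {"a": {"b"}, "b": {"c"}, "c": {"b", "d"}, "d": set(), "e": set()}
parts = bow_tie(toy, "b")
```

Tokens such as `e`, which neither reach the Core nor are reached from it, fall into none of the three sets.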
Components of Reachability Graph
[Figure: reachability graph with several components, each made up of In, Core (strongly connected), and Out; seed token t0]
How many tokens are in the largest Core + Out?
Model: Power-law Graphs

Conjecture: the degree distribution in the reachability graph follows a power law:
#(nodes with degree k) ≈ O(k^{-β})
(i.e., many nodes with small degree, a few nodes with large degree)

Power-law random graphs are expected to have at most one giant connected component (~Core+In+Out).
Other connected components are small.
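A rough way to check such a conjecture on observed degrees is to fit a line to the degree histogram on log-log axes, since a power law with exponent β appears there as a line of slope -β. The sketch below uses plain least squares, which is a crude estimator and is not claimed to be the method used in this work; the function name is hypothetical.

```python
import math
from collections import Counter
from typing import Iterable


def fit_power_law_exponent(degrees: Iterable[int]) -> float:
    """Estimate beta in count(k) ~ k^(-beta) by least squares on log-log data."""
    histogram = Counter(k for k in degrees if k > 0)   # degree -> number of nodes
    if len(histogram) < 2:
        raise ValueError("need at least two distinct positive degrees")
    xs = [math.log(k) for k in histogram]
    ys = [math.log(histogram[k]) for k in histogram]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / sum(
        (x - mean_x) ** 2 for x in xs
    )
    return -slope


# Synthetic degrees whose counts are exactly 3600 / k^2 for k = 1..6.
synthetic = [k for k in range(1, 7) for _ in range(3600 // (k * k))]
beta = fit_power_law_exponent(synthetic)   # close to 2 for this synthetic input
```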
Model: Reachability
[Figure: In, Core (strongly connected), and Out, with seed token t0]
Reachability: fraction of tokens in the largest Core + Out
(The power law allows us to ignore small components)
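When the whole reachability graph is known, the metric can be computed exactly. The sketch below reads "largest Core" as the largest strongly connected component, breaking ties in favor of the larger Core + Out; that reading, the function name, and the simple quadratic-time approach are assumptions chosen for clarity rather than speed.

```python
from typing import Dict, Set


def _reach(edges: Dict[str, Set[str]], start: str) -> Set[str]:
    """Nodes reachable from `start` by following one or more edges."""
    seen: Set[str] = set()
    stack = [start]
    while stack:
        node = stack.pop()
        for nxt in edges.get(node, ()):
            if nxt not in seen:
                seen.add(nxt)
                stack.append(nxt)
    return seen


def reachability(edges: Dict[str, Set[str]]) -> float:
    """Fraction of tokens in Core + Out, for the largest Core."""
    nodes = set(edges) | {m for outs in edges.values() for m in outs}
    if not nodes:
        return 0.0
    desc = {n: _reach(edges, n) for n in nodes}
    best = (0, 0)                       # (Core size, Core + Out size)
    for n in sorted(nodes):
        # Core containing n: n plus every node on a cycle through n.
        core = {n} | {m for m in desc[n] if n in desc[m]}
        core_plus_out = {n} | desc[n]
        best = max(best, (len(core), len(core_plus_out)))
    return best[1] / len(nodes)


# Toy reachability graph: a -> b <-> c -> d, with e isolated.
toy = {"a": {"b"}, "b": {"c"}, "c": {"b", "d"}, "d": set(), "e": set()}
score = reachability(toy)   # Core = {b, c}, Out = {d}, so 3 of 5 tokens
```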
Estimating Reachability

In a power-law random graph G, a giant component C_G emerges if the average outdegree d > 1
[Figure: relative size of giant component (lower bound), from 0 to 1, plotted against average outdegree, from 0 to 10]
Graph theory results predict the relative size of C_G
[Chung and Lu, Annals of Combinatorics, 2002]
Estimate reachability as the relative size of C_G, which reduces to estimating the average outdegree of the reachability graph
Estimating Reachability Using Sampling
(estimate average outdegree)
1. Choose S random seed tokens
2. Query the database for the seed tokens
3. Extract tokens to compute the reachability graph edges for the seed tokens
4. Estimate d as the average outdegree of the seed tokens
5. Estimate reachability
[Figure: example querying graph over Tokens t1 to t5 and Documents d1 to d5; the sampled seed tokens give d = 1.5]
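Steps 1 to 4 can be sketched as below. The `search` and `extract_tokens` callables are hypothetical stand-ins for the database's search interface and the extraction system, and not counting a seed token as its own neighbor is an assumption. Step 5, mapping the estimated d to a reachability value, relies on the cited graph-theory result and is left out.

```python
import random
from typing import Callable, Iterable, Optional, Sequence


def estimate_average_outdegree(
    tokens: Sequence[str],
    search: Callable[[str, int], Iterable[str]],
    extract_tokens: Callable[[str], Iterable[str]],
    sample_size: int,
    max_results: int,
    rng: Optional[random.Random] = None,
) -> float:
    """Estimate d from the outdegrees of a random sample of seed tokens."""
    if not tokens:
        raise ValueError("no tokens to sample from")
    rng = rng or random.Random()
    seeds = rng.sample(list(tokens), min(sample_size, len(tokens)))   # step 1
    total = 0
    for t in seeds:
        neighbours = set()
        for doc in search(t, max_results):            # step 2
            neighbours.update(extract_tokens(doc))    # step 3
        neighbours.discard(t)                         # assumption: no self-edges
        total += len(neighbours)
    return total / len(seeds)                         # step 4


# Toy database (illustrative only).
retrieves = {"t1": {"d1"}, "t2": {"d2"}, "t3": {"d3"}, "t4": set()}
contains = {"d1": {"t1", "t2"}, "d2": {"t2", "t3", "t4"}, "d3": {"t3"}}

d_hat = estimate_average_outdegree(
    tokens=["t1", "t2", "t3", "t4"],
    search=lambda t, k: sorted(retrieves[t])[:k],
    extract_tokens=lambda doc: contains[doc],
    sample_size=4,          # sampling every token makes the toy result exact
    max_results=10,
)
```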
Experimental Results:
Verifying the “Power-law” Conjecture
Task 1: NYT DiseaseOutbreaks (Date, Disease, Location)
New York Times, 1995
|T| = 8,859   |D| = 137,000

Date        Disease           Location
Jan. 1995   Malaria           Ethiopia
June 1995   Ebola             Zaire
July 1995   Mad Cow Disease   The U.K.
Feb. 1995   Pneumonia         The U.S.
…           …                 …

Follows the power-law distribution
Experimental Results:
Estimating Reachability by Sampling
[Figure: reachability (0 to 1) versus MaxResults (MR = 1, 10, 50, 100, 200, 1000) for samples of S = 10, 50, 100, and 200 tokens, compared with the real graph]

Approximate reachability can be estimated with S = 50 tokens

Reachability correctly predicts the performance of the query-based information extraction strategy

If the estimated reachability is too low, we can switch to a different strategy early
Future Work


What if we have only limited access to the database?
– Limit on number of queries
– Limit on number of documents retrieved

Not modelled by the reachability graph, but can be modelled using properties of the querying graph
[Figure: querying graph linking Tokens t1 to t5 with Documents d1 to d5]
Summary

Presented graph model for query-based algorithms:
– for Information Extraction
– for Constructing Database Content Summaries

Showed that querying and reachability graphs can be used to analyze
such algorithms

Presented single reachability metric to predict success of iterative
query-based algorithms

Presented and verified conjecture that reachability graphs for these algorithms
follow the power law

Presented efficient techniques for estimating reachability by exploiting
properties of power-law random graphs