Modeling Query-Based Access to Text Databases

Eugene Agichtein, Panagiotis Ipeirotis, Luis Gravano
Computer Science Department, Columbia University


Extracting Structured Information “Buried” in Text Documents

  May 19 1995, Atlanta -- The Centers for Disease Control and Prevention, which is in the front line of the world's response to the deadly Ebola epidemic in Zaire, is finding itself hard pressed to cope with the crisis…

Information Extraction System (e.g., NYU’s Proteus)

  Date        DiseaseName       Location
  Jan. 1995   Malaria           Ethiopia
  July 1995   Mad Cow Disease   The U.K.
  Feb. 1995   Pneumonia         The U.S.
  May 1995    Ebola             Zaire


Extracting All “Tuples” of a Relation from a Text Database

[Figure: Information Extraction System producing Extracted Tuples]

- Naïve approach: feed every document to the information extraction system.
- At 7 secs./document, Proteus takes over 8 days for 100K documents.
- Only a tiny fraction of documents contains tuples, so processing every document is inefficient.
- Search engines can help: efficiency and accessibility.
- Many databases are not crawlable (scannable), but available only via a search engine.


A Query-Based Strategy for Information Extraction [Agichtein and Gravano, ICDE 2003]

0  Start with some seed tuples (e.g., <“May 1995”, “Ebola”, “Zaire”>)
1  While seed has unprocessed tuple t:
2      Retrieve up to MaxResults documents using query derived from t
3      Extract new tuples t_e from these documents
4      Augment seed with t_e

[Figure: seed with tuples t0, t1, t2]

Potential problem: may run out of tuples (and queries), leaving an incomplete relation!


Iterative Methods Sometimes (but not Always) “Succeed”

[Figure: two runs starting from a seed, one labeled SUCCESS! and one labeled FAIL]

Can we predict if a query-based strategy will succeed?
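The numbered strategy above (steps 0 to 4) can be sketched in a few lines of Python. This is only an illustrative sketch, not the authors' implementation: the in-memory `DOCS` collection, the `search` and `extract` helpers, and the rule that a query matches any document sharing an attribute value with the tuple are all assumptions made for the example. A real system would query a search engine and run an extractor such as Proteus on the raw text.

```python
from collections import deque

# Toy "database" (illustrative data only): document id -> tuples that an
# information extraction system would find in that document.
DOCS = {
    "d1": [("May 1995", "Ebola", "Zaire")],
    "d2": [("May 1995", "Ebola", "Zaire"), ("June 1995", "Ebola", "Zaire")],
    "d3": [("June 1995", "Ebola", "Zaire"), ("Jan. 1995", "Malaria", "Ethiopia")],
    "d4": [("Feb. 1995", "Pneumonia", "The U.S.")],
}


def search(t, max_results):
    """Hypothetical search interface: ids of documents that share at least
    one attribute value with tuple t, truncated to max_results."""
    hits = [doc_id for doc_id, tuples in sorted(DOCS.items())
            if any(set(t) & set(u) for u in tuples)]
    return hits[:max_results]


def extract(doc_id):
    """Hypothetical extractor: here it simply looks up the stored tuples."""
    return DOCS[doc_id]


def query_based_extraction(seed, max_results):
    relation = set(seed)              # step 0: start with the seed tuples
    unprocessed = deque(seed)
    while unprocessed:                # step 1: while seed has an unprocessed tuple t
        t = unprocessed.popleft()
        for doc_id in search(t, max_results):   # step 2: retrieve documents
            for t_e in extract(doc_id):         # step 3: extract tuples
                if t_e not in relation:         # step 4: augment seed
                    relation.add(t_e)
                    unprocessed.append(t_e)
    return relation


seed = [("May 1995", "Ebola", "Zaire")]
found = query_based_extraction(seed, max_results=10)
print(len(found), "of 4 tuples found")
```

In this toy collection the Pneumonia tuple is never reached from the Ebola seed, and with `max_results=1` the run stops with the seed alone, which mirrors the "run out of tuples" failure described above.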
Model: Querying Graph

[Figure: bipartite querying graph linking tokens t1-t5 to documents d1-d5]

- Each token (as a query) retrieves documents.
- Documents contain tokens.
- Tokens: tuple attributes, e.g., <“May 1995”, “Ebola”, “Zaire”>


Model: Reachability Graph

[Figure: querying graph over tokens t1-t5 and documents d1-d5, with the derived reachability graph over tokens t1-t5]

- t1 retrieves document d1 that contains t2.
- t2, t3, and t4 are “reachable” from t1.


Model: Connected Components

- Core: strongly connected.
- In: tokens not in Core, but from which Core is reachable.
- Out: tokens not in Core, but reachable from Core.


Components of Reachability Graph

[Figure: several In / Core (strongly connected) / Out components, with seed token t0]

How many tokens are in the largest Core + Out?


Model: Power-law Graphs

Conjecture: the degree distribution in the reachability graph follows a power law:

  #(nodes with degree k) ≈ O(k^{-β})

(i.e., many nodes with small degree, a few nodes with large degree)

Power-law random graphs are expected to have at most one giant connected component (~Core + In + Out). Other connected components are small.


Model: Reachability

[Figure: In, Core (strongly connected), and Out, with seed token t0]

Reachability: fraction of tokens in the largest Core + Out.
(The power law allows us to ignore small components.)


Estimating Reachability

- In a power-law random graph G, a giant component C_G emerges if the average outdegree d > 1.
- Graph theory results predict the relative size of C_G [Chung and Lu, Annals of Combinatorics, 2002].
- Estimate reachability as the relative size of C_G, which reduces to estimating the average outdegree of the reachability graph.

[Figure: relative size of giant component (lower bound), from 0 to 1, plotted against average outdegree, from 0 to 10]


Estimating Reachability Using Sampling (estimate average outdegree)

1. Choose S random seed tokens.
2. Query the database for the seed tokens.
3. Extract tokens to compute the reachability graph edges for the seed tokens.
4. Estimate d as the average outdegree of the seed tokens.
5. Estimate reachability.

[Figure: example over tokens t1-t5 and documents d1-d5, giving d = 1.5]


Experimental Results: Verifying the “Power-law” Conjecture

Task 1: NYT DiseaseOutbreaks (Date, Disease, Location)
New York Times, 1995: |T| = 8,859, |D| = 137,000

  Date        Disease           Location
  Jan. 1995   Malaria           Ethiopia
  June 1995   Ebola             Zaire
  July 1995   Mad Cow Disease   The U.K.
  Feb. 1995   Pneumonia         The U.S.
  …           …                 …

Follows the power-law distribution.


Experimental Results: Estimating Reachability by Sampling

[Figure: reachability (0 to 1) vs. MaxResults (MR = 1, 10, 50, 100, 200, 1000) for sample sizes S = 10, 50, 100, 200 and for the real graph]

- Approximate reachability is estimated with S = 50 tokens.
- The reachability correctly predicts the performance of the query-based information extraction strategy.
- If the estimated reachability is too low, one can switch to a different strategy early.


Future Work

What if we have only limited access to the database?
- Limit on number of queries
- Limit on number of documents retrieved

Not modelled by the reachability graph, but can be modelled using properties of the querying graph.

[Figure: querying graph linking tokens t1-t5 to documents d1-d5]


Summary

- Presented a graph model for query-based algorithms:
  - for Information Extraction
  - for Constructing Database Content Summaries
- Showed that querying and reachability graphs can be used to analyze such algorithms
- Presented a single reachability metric to predict the success of iterative query-based algorithms
- Presented and verified the conjecture that reachability graphs for these algorithms follow the power law
- Presented efficient techniques for estimating reachability by exploiting properties of power-law random graphs
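As a rough illustration of the querying-graph and reachability-graph model and of the sampling procedure for the average outdegree d, here is a minimal Python sketch. The graph in `RETRIEVES` and `CONTAINS` is hypothetical (it is not the graph drawn on the slides), and the helper names `out_edges`, `reachable_from`, and `estimate_outdegree` are invented for this example. Mapping the estimated d to the relative size of the giant component relies on the cited Chung and Lu result and is deliberately left out.

```python
import random

# Hypothetical querying graph (illustrative only).
RETRIEVES = {   # token, used as a query -> documents it retrieves
    "t1": ["d1"],
    "t2": ["d2", "d3"],
    "t3": ["d3"],
    "t4": ["d4"],
    "t5": [],
}
CONTAINS = {    # document -> tokens it contains
    "d1": ["t1", "t2"],
    "d2": ["t2", "t3"],
    "d3": ["t3", "t4"],
    "d4": ["t4"],
    "d5": ["t5"],
}


def out_edges(token, max_results):
    """Reachability-graph edges of one token: the other tokens contained in
    the (at most max_results) documents that the token retrieves."""
    reached = set()
    for doc in RETRIEVES[token][:max_results]:
        reached.update(CONTAINS[doc])
    reached.discard(token)            # ignore self-loops
    return reached


def reachable_from(token, max_results):
    """All tokens reachable from `token` by following reachability edges."""
    seen, stack = set(), [token]
    while stack:
        for nxt in out_edges(stack.pop(), max_results):
            if nxt not in seen:
                seen.add(nxt)
                stack.append(nxt)
    seen.discard(token)
    return seen


def estimate_outdegree(tokens, sample_size, max_results, rng):
    """Steps 1 to 4 of the sampling procedure: sample seed tokens, query,
    extract their edges, and average their outdegrees."""
    seed = rng.sample(tokens, sample_size)
    return sum(len(out_edges(t, max_results)) for t in seed) / sample_size


tokens = sorted(RETRIEVES)
print(reachable_from("t1", max_results=10))
print(estimate_outdegree(tokens, 3, max_results=10, rng=random.Random(0)))
```

Lowering `max_results` removes edges (for example, `t2` retrieves only `d2` when `max_results=1`), which is how the MaxResults setting affects reachability in this model.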