Presented by Avinash S Bharadwaj

How can data be extracted from the web? Execution plans for text-centric tasks follow two general paradigms for processing a text database:
- The entire web can be crawled or scanned and the text processed automatically.
- Search engine indexes can be used to retrieve the documents of interest, using carefully constructed queries that depend on the task.

Introduction

Text is ubiquitous, and many applications rely on the text present in web pages to perform a variety of tasks. Examples of text-centric tasks:
- Reputation management systems download web pages to track the buzz around companies.
- Comparative shopping agents locate e-commerce web sites and add the products offered on those pages to their own index.

Types of Text-Centric Tasks

According to the paper, there are three main types of text-centric tasks:
- Task 1: Information Extraction
- Task 2: Content Summary Construction
- Task 3: Focused Resource Discovery

Task 1: Information Extraction

Information extraction applications extract structured relations from unstructured text. For example, an information extraction system (e.g., NYU's Proteus) can process a news archive to build a table of disease outbreaks reported in The New York Times:

    "May 19 1995, Atlanta -- The Centers for Disease Control and Prevention, which is in the front line of the world's response to the deadly Ebola epidemic in Zaire, is finding itself hard pressed to cope with the crisis…"

    Date        Disease Name      Location
    Jan. 1995   Malaria           Ethiopia
    July 1995   Mad Cow Disease   U.K.
    Feb. 1995   Pneumonia         U.S.
    May 1995    Ebola             Zaire

(An information extraction tutorial was given yesterday by AnHai Doan, Raghu Ramakrishnan, and Shivakumar Vaithyanathan.)

Task 2: Content Summary Construction

- Many text databases have valuable content "hidden" behind search interfaces.
- Metasearchers search over multiple such databases through a unified query interface; deciding which databases should answer a query requires generating a content summary for each database.

Task 3: Focused Resource Discovery

- This task builds applications around web pages about a particular topic.
- The simplest approach is to crawl the entire web and classify the web pages accordingly.
- A much more efficient approach is to use a focused crawler. Focused crawlers visit only documents and hyperlinks that are on-topic, or likely to lead to on-topic documents, as determined by a number of heuristics.

An Abstract View of Text-Centric Tasks

All three tasks share one pipeline: documents flow from a text database through an extraction system, which emits output tokens.
1. Retrieve documents from the database.
2. Process the documents.
3. Extract output tokens.

What counts as a token depends on the task:

    Task                     Token
    Information Extraction   Relation tuple
    Database Selection       Word (+ frequency)
    Focused Crawling         Web page about a topic

Execution Strategies

The paper describes four execution strategies:

    Crawl-based:             Scan, Filtered Scan
    Query- (index-) based:   Iterative Set Expansion, Automatic Query Generation

Execution Strategy: Scan

- Scan processes each document in the database exhaustively until the number of extracted tokens satisfies the target recall.
- Scan needs no training and sends no queries to the database.
- Execution time = |Retrieved Docs| · (R + P), where R is the time to retrieve a document and P is the time to process it.
- Prioritizing the order in which documents are retrieved may improve efficiency.

Execution Strategy: Filtered Scan

- Filtered Scan is an improvement over basic Scan.
- Unlike Scan, Filtered Scan uses a task-specific classifier to check whether a document is likely to contribute at least one token before processing it.
- Execution time = |Retrieved Docs| · (R + P + C), where C is the additional time to classify a document. (In practice, the processing cost P is paid only for documents the classifier accepts, so this is an upper bound; a sketch of both crawl-based strategies follows below.)
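To make the contrast concrete, here is a minimal Python sketch of the two crawl-based strategies, assuming an in-memory document list and stub classify/extract functions; the function names and the toy data are illustrative, not from the paper.

    def scan(docs, extract, target_tokens):
        """Scan: process every document in order until enough tokens are found."""
        tokens = set()
        for doc in docs:                       # cost R to retrieve each document
            tokens |= extract(doc)             # cost P to process each document
            if len(tokens) >= target_tokens:   # target recall reached
                break
        return tokens

    def filtered_scan(docs, classify, extract, target_tokens):
        """Filtered Scan: a cheap classifier rejects documents that are
        unlikely to contribute any token before the expensive extraction."""
        tokens = set()
        for doc in docs:                       # cost R to retrieve each document
            if not classify(doc):              # cost C to classify each document
                continue                       # rejected: skip the processing cost
            tokens |= extract(doc)             # cost P, paid only for accepted docs
            if len(tokens) >= target_tokens:
                break
        return tokens

    # Toy run: extract disease names from three "documents".
    docs = ["ebola outbreak in zaire", "stock prices rose again", "malaria in ethiopia"]
    extract = lambda d: {w for w in d.split() if w in {"ebola", "malaria"}}
    classify = lambda d: "outbreak" in d or "malaria" in d
    print(filtered_scan(docs, classify, extract, target_tokens=2))  # both diseases found

The only behavioral difference is the classifier gate: Filtered Scan pays a small classification cost C per document in exchange for skipping the processing cost P on documents predicted to contribute nothing.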
Execution Strategy: Iterative Set Expansion

Iterative Set Expansion drives the same pipeline with queries instead of a crawl:
1. Query the database with seed tokens (e.g., [Ebola AND Zaire]).
2. Process the retrieved documents.
3. Extract tokens from the documents (e.g., <Malaria, Ethiopia>).
4. Augment the seed tokens with the newly extracted tokens, and repeat.

- Iterative Set Expansion has been successfully applied to many tasks.
- Execution time = |Retrieved Docs| · (R + P) + |Queries| · Q, where Q is the time to answer a query, R is the time to retrieve a document, and P is the time to process it.

Execution Strategy: Automatic Query Generation

- Iterative Set Expansion has a recall limitation due to the iterative nature of its query generation: it can only reach tokens that are connected to the seed tokens through retrieved documents.
- Automatic Query Generation avoids this problem by creating queries offline (using machine learning) that are designed to return documents containing tokens.
- Automatic Query Generation works in two stages:
  - In the training stage, it trains a classifier to categorize documents as useful or not for the task.
  - In the execution stage, it searches the database using queries that are expected to retrieve useful documents.

Estimating Execution Plan Costs: Scan

Modeling Scan for a token t: what is the probability of seeing t, which occurs in g(t) of the N documents in database D, after retrieving S documents?
- Scanning is a "sampling without replacement" process, so after retrieving S documents the frequency of token t follows a hypergeometric distribution.
- Recall for token t is the probability that the frequency of t in the S retrieved documents is greater than 0:

    Recall(t, S) = 1 - C(N - g(t), S) / C(N, S)

where C(n, k) denotes the binomial coefficient. This quantity is cheap to evaluate directly, as sketched below.
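The following minimal sketch evaluates this recall formula directly; the function name and example numbers are illustrative, not from the paper. Python's math.comb(n, k) returns 0 when k > n, so the recall correctly becomes 1 once S > N - g(t).

    from math import comb

    def scan_recall(g, n_docs, retrieved):
        # Recall(t, S) = 1 - C(N - g(t), S) / C(N, S): probability that a token
        # occurring in g of n_docs documents is seen at least once after
        # retrieving `retrieved` documents without replacement.
        return 1.0 - comb(n_docs - g, retrieved) / comb(n_docs, retrieved)

    # A token appearing in 10 of 10,000 documents is very likely seen
    # well before the whole database is scanned:
    print(f"{scan_recall(10, 10_000, 2_000):.3f}")  # ~0.893

Averaging this probability over all tokens gives the expected recall of Scan after S documents, which is the quantity the optimizer compares across plans.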
Estimating Execution Plan Costs: Iterative Set Expansion

The querying process can be modeled as a bipartite querying graph containing tokens and documents: each token, transformed into a keyword query, retrieves a set of documents, and each document contains a set of tokens (e.g., tokens such as <SARS, China>, <Ebola, Zaire>, <Malaria, Ethiopia>, <Cholera, Sudan>, and <H5N1, Vietnam>, each linked to the documents that contain it).

To estimate cost and recall, we need to compute:
- the number of documents retrieved after sending some number of tokens as queries (this estimates execution time), and
- the number of tokens that appear in the retrieved documents (this estimates recall).

To estimate these, we need:
- the degree distribution of the tokens discovered by retrieving documents, and
- the degree distribution of the documents retrieved by the tokens.

(These are not the same as the degree distribution of a randomly chosen token or document; it is easier to discover documents and tokens with high degrees.)

Experimental Results

[Figure: execution time in seconds (log scale, 100 to 100,000) versus recall (0.0 to 1.0) for Scan, Filtered Scan, Iterative Set Expansion, and Automatic Query Generation; a second plot additionally shows the OPTIMIZED plan.]

Conclusions

- Common execution plans for multiple text-centric tasks.
- Analytic models for predicting the execution time and recall of various crawl- and query-based plans.
- Techniques for on-the-fly parameter estimation.
- An optimization framework that picks, on the fly, the fastest plan for a target recall.