Methods for Domain-Independent Information Extraction from the Web: An Experimental Comparison [Etzioni et al., 2004]

Outline
• Introduction
• Paper structure
• KnowItAll System
• Rule Learning (RL)
• Subclass Extraction (SE)
• List Extraction (LE)
• Experiments
• Conclusion

Introduction
• Information extraction from the Web (~ web mining)
• A good prerequisite for this talk: information granularity, from fine to coarse
  – 1 piece of information, 1 location (job posting): fine
  – 10 pieces of information, 100 locations (HP digital camera)
  – 1,000 pieces of information, 100,000 locations (cities of the world): coarse

Paper structure
• Presentation of an existing web mining system
• The authors' intuition of a "recall problem"
• Proposal of three possible improvements
• Definition of a metric for the "quantification of success"
• Evaluation of the proposed improvements

KnowItAll System
• Autonomous, domain-independent system that extracts facts, concepts, and relationships from the Web
1. Focus (e.g., city)
2. Pattern instantiation:
   NP1 such as NP2 = "city such as"
   Plural(NP1) such as NP2-List = "cities such as"
3. Search + passage retrieval:
   … a city such as Sudbury, at north of the Great Lakes …
   … cities such as Chicago, New York, Atlanta and Orlando …
4. Assessor: PMI-IR
   Hits(Atlanta AND city) / Hits(Atlanta)

Rule Learning (RL)
• Goal: increase the recall of KnowItAll
• Patterns → facts (with likelihood):
  "city, such as Boston" → PMI(Boston, city) = 0.60
  "mega-city such as Mexico" → PMI(Mexico, city) = 0.56
  "within a city, such as Rice University" → PMI(Rice University, city) = 0.24

Rule Learning (RL)
• Occurrences of the most probable facts:
  of Boston College
  the Boston Globe
  a Boston Parking Space
  headquartered in Boston
  Crime in Mexico continues
  Mexico City Hotels
  headquartered in Mexico
• New pattern: headquartered in NP

Rule Learning (RL)
• Estimating rule quality
• Heuristic 1: remove all substrings that match only a single seed.
• Heuristic 2: rule precision = (c + k) / (c + n + m)
  – c is the number of times the rule matches a seed
  – n is the number of times the rule matches a known negative example
  – k / m is the prior estimate of the rule's precision (from PMI tests)

Subclass Extraction (SE)
• Goal: increase the recall of KnowItAll
• Focus: scientist
• Pattern: "scientist such as NP"
  … scientist such as Arthur Noyes
  … scientist such as Isaac Newton
  … scientist such as Sandra Steingraber

Subclass Extraction (SE)
• Using the facts found, apply the reverse pattern "N such as Arthur Noyes":
  "chemist such as Arthur Noyes"
  "biologist such as Sandra Steingraber"
• Assess candidate subclasses with the PMI test and a morphology test ("-ist" suffix)

List Extraction (LE)
• Goal: increase the recall of KnowItAll
• Find web pages that contain a set (k = 4) of random facts.
  "chicago AND boston AND mexico AND buenos aires"
• Repeat 5,000 to 10,000 times
• In each document, try to find "a list"

List Extraction (LE)
• Use a web page "wrapper", i.e., a classifier that identifies positive nodes (elements of the list) and negative nodes (all the remaining HTML markup)

List Extraction (LE)
• Quality of a new fact = number of lists in which it appears
• PMI can also be used to assess the quality (LE+A)

Experiments
• How to calculate the recall improvement?
• The true recall cannot be calculated (unknown)
• The size of the set of facts can be used instead
• But how to make sure the set is pure?
  – Sort facts by probability
  – Use only high-quality facts (e.g., prob > 0.9)
  – Manually assess a sample
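
The evaluation idea on the Experiments slide (sort by probability, keep only high-probability facts, compare set sizes) can be sketched roughly as below. This is only an illustration, not the authors' evaluation code: the function name `high_quality_facts` and all probabilities are hypothetical, and the 0.9 threshold is the example value from the slide.

```python
from typing import Dict, List, Tuple


def high_quality_facts(
    facts: Dict[str, float], threshold: float = 0.9
) -> List[Tuple[str, float]]:
    """Keep facts whose probability exceeds the threshold, most probable first."""
    kept = [(fact, prob) for fact, prob in facts.items() if prob > threshold]
    return sorted(kept, key=lambda item: item[1], reverse=True)


# Hypothetical assessor outputs (made-up probabilities, for illustration only).
baseline = {"Boston": 0.97, "Chicago": 0.95, "Rice University": 0.24}
with_le = {
    "Boston": 0.97,
    "Chicago": 0.95,
    "Atlanta": 0.93,
    "Orlando": 0.91,
    "Rice University": 0.24,
}

baseline_kept = high_quality_facts(baseline)
le_kept = high_quality_facts(with_le)

# The size of the high-quality set stands in for recall, which is unknown.
gain = len(le_kept) - len(baseline_kept)
print(f"baseline: {len(baseline_kept)} facts, with LE: {len(le_kept)} facts, gain: {gain}")
```

A manually assessed sample of the kept facts would then be needed to check that the set is actually pure; that step is not modeled here.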

Conclusion
• KnowItAll is an information extraction system (coarse IE)
• The only input is a one-word "focus" (city, scientist, movie, …)
• Pattern instantiation, passage retrieval, PMI-IR test
• RL, SE and LE improve extraction recall
• Overall, LE gives the greatest improvement
• SE was notably good on the "scientist" task

http://knowitall-1.cs.washington.edu/dbinterface/knowitall2/default.asp
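
As a recap of the pattern instantiation and PMI-IR test summarized above, here is a minimal sketch. It assumes hit counts are already available in a dictionary: the helper names, the `HYPOTHETICAL_HITS` table and its numbers are invented for illustration, and no real search engine is queried. The plural form is passed in explicitly rather than derived.

```python
from typing import Dict, List


def instantiate_patterns(focus: str, plural: str) -> List[str]:
    """Build the two 'such as' search phrases for a focus word."""
    return [f"{focus} such as", f"{plural} such as"]


def pmi_score(instance: str, focus: str, hits: Dict[str, int]) -> float:
    """Hits(instance AND focus) / Hits(instance); 0.0 when the instance has no hits."""
    alone = hits.get(instance, 0)
    if alone == 0:
        return 0.0
    return hits.get(f"{instance} AND {focus}", 0) / alone


# Made-up hit counts standing in for search-engine results.
HYPOTHETICAL_HITS = {
    "Atlanta": 1000,
    "Atlanta AND city": 600,
    "Rice University": 500,
    "Rice University AND city": 120,
}

print(instantiate_patterns("city", "cities"))
for candidate in ("Atlanta", "Rice University"):
    print(candidate, pmi_score(candidate, "city", HYPOTHETICAL_HITS))
```

Under these invented counts, "Atlanta" scores higher than "Rice University", which mirrors how the assessor on the slides ranks a true city above a spurious extraction.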