Nadeau-etzioni

Methods for Domain-Independent Information
Extraction from the Web:
An Experimental Comparison
[Etzioni et al., 2004]
Outline
• Introduction
• Paper structure
• KnowItAll System
• Rule Learning (RL)
• Subclass Extraction (SE)
• List Extraction (LE)
• Experiments
• Conclusion
Methods for Domain-Independent Information Extraction from the Web.
Introduction
• Information extraction from the web (~web mining)
• A good prerequisite for this talk: information granularity, from fine to coarse
– Fine: 1 piece of information at 1 location (e.g., a job posting)
– 10 pieces of information at 100 locations (e.g., an HP digital camera)
– Coarse: 1,000 pieces of information at 100,000 locations (e.g., cities of the world)
Paper structure
• Presentation of an existing web mining system
• Authors' intuition of a « recall problem »
• Proposal of three possible improvements
• Definition of a metric for the « quantification of success »
• Evaluation of the proposed improvements
KnowItAll System
• Autonomous, domain-independent system that extracts facts,
concepts, and relationships from the Web.
1 Focus (e.g.: city)
2 Pattern instantiation:
NP1 such as NP2 = « city such as »
Plural(NP1) such as NP2-List = « cities such as »
3 Search + passage retrieval:
… a city such as Sudbury, at north of the Great Lakes…
…cities such as Chicago, New York, Atlanta and Orlando …
4 Assessor: PMI-IR = Hits(Atlanta AND city) / Hits(Atlanta)
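Steps 2 and 4 can be sketched in a few lines of Python. This is only an illustration under stated assumptions: `instantiate_patterns` and `pmi_score` are hypothetical helper names, not part of KnowItAll, and the hit counts are made-up numbers standing in for search-engine responses.

```python
from __future__ import annotations


def instantiate_patterns(focus: str, plural: str) -> list[str]:
    """Fill the two generic templates with the focus word (step 2)."""
    return [f"{focus} such as", f"{plural} such as"]


def pmi_score(hits_both: int, hits_instance: int) -> float:
    """Hits(instance AND class) / Hits(instance); 0.0 if the instance has no hits."""
    if hits_instance == 0:
        return 0.0
    return hits_both / hits_instance


# Made-up hit counts standing in for search-engine responses (assumption).
hit_counts = {"Atlanta": 1000, "Atlanta AND city": 600}

patterns = instantiate_patterns("city", "cities")
score = pmi_score(hit_counts["Atlanta AND city"], hit_counts["Atlanta"])
print(patterns)
print(f"PMI(Atlanta, city) = {score:.2f}")
```

A higher ratio means the candidate co-occurs with the class word more often relative to how often it appears at all, which is what the assessor uses as evidence for the fact.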
Rule Learning (RL)
• Goal: increase the recall of KnowItAll
• Patterns → facts (with likelihood)
– “city, such as Boston” → PMI(Boston, city) = 0.60
– “mega-city such as Mexico” → PMI(Mexico, city) = 0.56
– “within a city, such as Rice University” → PMI(Rice University, city) = 0.24
Rule Learning (RL)
• Facts (most probable), seen in context:
– of Boston College
– the Boston Globe
– a Boston Parking Space
– headquartered in Boston
– Crime in Mexico continues
– Mexico City Hotels
– headquartered in Mexico
• New patterns: « headquartered in NP »
Rule Learning (RL)
• Estimating rule quality
• Heuristic 1: remove all substrings that appear in a single seed.
• Heuristic 2: rule precision = (c + k) / (c + n + m)
– c is the number of times the rule matches a seed
– n is the number of times the rule matches a known negative example
– k / m is the prior estimate of the rule's precision (PMI tests)
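The two heuristics can be illustrated with a small Python sketch, reading Heuristic 2 as the smoothed ratio (c + k) / (c + n + m). The function names and all counts below are assumptions chosen for illustration; only the contexts come from the slides.

```python
from __future__ import annotations


def keep_multi_seed(contexts: dict[str, set[str]]) -> dict[str, set[str]]:
    """Heuristic 1: drop candidate rules that were seen with a single seed only."""
    return {rule: seeds for rule, seeds in contexts.items() if len(seeds) > 1}


def estimated_precision(c: int, n: int, k: float, m: float) -> float:
    """Heuristic 2: (c + k) / (c + n + m), where k / m is the prior precision."""
    return (c + k) / (c + n + m)


# Candidate rules mapped to the seeds they matched (taken from the slide examples).
contexts = {
    "of NP College": {"Boston"},
    "the NP Globe": {"Boston"},
    "Crime in NP continues": {"Mexico"},
    "headquartered in NP": {"Boston", "Mexico"},
}

kept = keep_multi_seed(contexts)
print(sorted(kept))

# Illustrative counts (assumption): 8 seed matches, 2 negative matches, prior 1/2.
print(estimated_precision(c=8, n=2, k=1, m=2))
```

With few matches the estimate stays close to the prior k / m; as c and n grow, the observed counts dominate.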
Subclass Extraction (SE)
• Goal: increase the recall of KnowItAll
Focus: scientist
Pattern: « scientist such as NP »
… scientist such as Arthur Noyes
… scientist such as Isaac Newton
… scientist such as Sandra Steingraber
Subclass Extraction (SE)
• Using the facts already found, apply the reverse pattern:
« N such as Arthur Noyes »
« chemist such as Arthur Noyes »
« biologist such as Sandra Steingraber »
• Assess candidate subclasses with a PMI test and a morphology test (e.g., the suffix « ist »)
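A minimal Python sketch of the reverse pattern, assuming the passages have already been retrieved. The regular expression and the suffix check are simplifications chosen for illustration (singular nouns only, no PMI step); `candidate_subclasses` and `passes_morphology_test` are hypothetical names, and the second passage is invented to show a rejected candidate.

```python
from __future__ import annotations

import re


def candidate_subclasses(passages: list[str], instance: str) -> list[str]:
    """Collect the word N found in « N such as <instance> »."""
    pattern = re.compile(r"(\w+)\s+such as\s+" + re.escape(instance), re.IGNORECASE)
    found = []
    for text in passages:
        for match in pattern.finditer(text):
            found.append(match.group(1).lower())
    return found


def passes_morphology_test(noun: str, suffix: str = "ist") -> bool:
    """Crude stand-in for the morphology test: keep nouns ending in the suffix."""
    return noun.endswith(suffix)


passages = [
    "a chemist such as Arthur Noyes founded the lab",
    "people such as Arthur Noyes and others",
]

candidates = candidate_subclasses(passages, "Arthur Noyes")
subclasses = [noun for noun in candidates if passes_morphology_test(noun)]
print(candidates)
print(subclasses)
```

Each surviving subclass can then serve as a new focus, so that « chemist such as NP » retrieves scientists the original pattern missed.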
List Extraction (LE)
• Goal: increase the recall of KnowItAll
Find web pages that contain a set of k = 4 randomly chosen known facts:
« chicago AND boston AND mexico AND buenos aires »
Repeat 5,000-10,000 times.
In each document, try to find « a list ».
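The query-generation step might look like the following Python sketch. The fact list, the helper name `build_queries`, and the fixed random seed are assumptions for illustration; sending the queries to a search engine is left out.

```python
from __future__ import annotations

import random


def build_queries(facts: list[str], k: int = 4, n_queries: int = 3, seed: int = 0) -> list[str]:
    """Draw k distinct known facts per query and join them with AND."""
    rng = random.Random(seed)  # fixed seed only to make the sketch repeatable
    queries = []
    for _ in range(n_queries):
        sample = rng.sample(facts, k)
        queries.append(" AND ".join(sample))
    return queries


# Known facts for the focus « city » (illustrative list).
known_cities = ["chicago", "boston", "mexico", "buenos aires", "atlanta", "orlando"]

queries = build_queries(known_cities)
for query in queries:
    print(query)
```

In the setting described on the slide, `n_queries` would be in the thousands rather than 3.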
List Extraction (LE)
• Use a web page « wrapper », i.e. a classifier that identifies positive nodes (elements of the list) and negative nodes (all the remaining HTML markup)
List Extraction (LE)
• Quality of a new fact = number of lists in which it appears
• PMI can also be used to assess quality (LE+A)
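The list-count score can be sketched in Python as follows. The extracted lists are invented for illustration, and `list_frequency` is a hypothetical helper name.

```python
from __future__ import annotations

from collections import Counter


def list_frequency(extracted_lists: list[list[str]]) -> Counter:
    """Count, for each item, the number of lists in which it appears."""
    counts: Counter = Counter()
    for items in extracted_lists:
        counts.update(set(items))  # a set, so an item counts once per list
    return counts


# Lists found on three different pages (made-up data).
extracted_lists = [
    ["chicago", "boston", "toronto"],
    ["boston", "toronto", "lima"],
    ["toronto", "boston", "boston"],
]

counts = list_frequency(extracted_lists)
print(counts["toronto"], counts["boston"], counts["chicago"], counts["lima"])
```

Items that show up in many independent lists rank first; rare ones are the natural candidates for an extra PMI check, as in the LE+A variant.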
Experiments
• How to calculate the recall improvement?
• Cannot calculate the true recall (unknown)
• Can use the size of the set of facts
• But how to make sure the set is pure?
– Sort facts by probability
– Use only high-quality facts (e.g.: prob > 0.9)
– Manually assess a sample
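As a rough Python sketch of this procedure: rank the facts, keep those above the threshold, and use the size of the kept set as the recall proxy. The probabilities are invented and `high_quality_facts` is a hypothetical helper name; the manual check of a sample is not modeled.

```python
from __future__ import annotations


def high_quality_facts(facts: list[tuple[str, float]], threshold: float = 0.9) -> list[str]:
    """Sort facts by probability and keep those strictly above the threshold."""
    ranked = sorted(facts, key=lambda pair: pair[1], reverse=True)
    return [name for name, prob in ranked if prob > threshold]


# Facts with made-up probabilities.
facts = [("Boston", 0.97), ("Rice University", 0.24), ("Mexico", 0.93)]

kept = high_quality_facts(facts)
print(kept, len(kept))
```

Comparing `len(kept)` before and after adding RL, SE or LE gives the improvement measure, provided the manually checked sample confirms that the kept set is pure.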
Conclusion
• KnowItAll is an information extraction system (coarse IE)
• The only input is a one-word « focus » (city, scientist, movie, …)
• Pattern instantiation, passage retrieval, PMI-IR test
• RL, SE and LE improve extraction recall
• Overall, LE gives the greatest improvement
• SE was notably good on the « scientist » task
Conclusion
http://knowitall-1.cs.washington.edu/dbinterface/knowitall2/default.asp