
Automatic Rule Refinement for Information
Extraction
Bin Liu, University of Michigan
Laura Chiticariu, IBM Research - Almaden
H. V. Jagadish, University of Michigan
Vivian Chu, IBM Research - Almaden
Frederic R. Reiss, IBM Research - Almaden
VLDB 2010
Date: 20th Oct 2011
Presenter: Ajay Gupta
Outline
- Introduction
- Rules Representation
- Method Overview
- Experimental Setup
- Results
- Conclusion & Future Work
Information Extraction (IE)
- Distill structured data from unstructured text
- Exploit the extracted data in your applications

Example document:
"For years, Microsoft Corporation CEO Bill Gates was against open source. But today he appears to have changed his mind. "We can be open source. We love the concept of shared source," said Bill Veghte, a Microsoft VP. "That's a super-important shift for us in terms of code access." ... Richard Stallman, founder of the Free Software Foundation, countered saying..."

Extracted annotations:
Name              Title     Organization
Bill Gates        CEO       Microsoft
Bill Veghte       VP        Microsoft
Richard Stallman  Founder   Free Soft...

(Frederick Reiss et al., SIGMOD 2010 Tutorial)
Rule Based Information Extraction
Most IE systems use rules to define important patterns in the text.

Example: person name extractor
If a match of a dictionary of common first names occurs in the text, followed immediately by a capitalized word, mark the two words as a "candidate person name".
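To make the rule concrete, here is a minimal Python sketch of it; the dictionary contents and the tokenization are illustrative assumptions, not the paper's implementation.

import re

# Hypothetical dictionary of common first names (a real extractor would load
# one from a file such as first_names.dict).
FIRST_NAMES = {"anna", "james", "john", "peter", "bill"}

def candidate_person_names(text):
    # Yield (start, end, text) spans where a common first name is
    # followed immediately by a capitalized word.
    for m in re.finditer(r"\b([A-Z][a-z]+)\s+([A-Z][a-z]+)\b", text):
        if m.group(1).lower() in FIRST_NAMES:
            yield m.start(), m.end(), m.group(0)

print(list(candidate_person_names("CEO Bill Gates was against open source.")))
# -> [(4, 14, 'Bill Gates')]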
Example Extraction Rules – When Things Go Wrong
Sample text: "Anna at James St. office (555-1234), or James, her assistant - 555-7789 have the details."

[Figure: Person annotations on "Anna", on "James" (inside the street name "James St.", a false positive), and on the second "James"; Phone annotations on "555-1234" and "555-7789". Pairing "Anna" with "555-7789" yields a wrong Person-Phone result.]
Rule Development in Information Extraction
An iterative refinement process (Develop → Test → Analyze → ...):
- labor intensive
- time-consuming
- error prone
Rule Refinement Is Hard
- The number of rules can be large.
- Rule interactions can be complex.
- Analyzing side effects is hard: removing a false positive improves precision, but removing correct results decreases recall.
- Identifying the right change can take hours (the Person extractor alone has 14 complex rules).
Rules Representation
Rules Representation
- SQL is used to represent rules.
- SQL subset: Select, Project, Join, Union All, Except All
- SQL extensions:
  - Data type: span
  - Table: Document(text span)
  - Predicate functions: Follows, FollowsTok, Contains
  - Scalar functions: Merge, Between, LeftContext
  - Table functions: Regex, Dictionary
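As a rough illustration of these extensions, here is a Python sketch of the span data type and the Follows predicate, assuming a span is a pair of character offsets into the document text (illustration only, not SystemT's actual types):

from dataclasses import dataclass

@dataclass(frozen=True)
class Span:
    begin: int   # start offset into the document text
    end: int     # end offset (exclusive)
    text: str    # the document text the offsets refer to

    def covered(self):
        return self.text[self.begin:self.end]

def follows(s1, s2, min_dist, max_dist):
    # True if s2 starts between min_dist and max_dist characters after s1 ends.
    return min_dist <= s2.begin - s1.end <= max_dist

doc = "Anna at James St. office (555-1234)"
anna, phone = Span(0, 4, doc), Span(26, 34, doc)
print(anna.covered(), phone.covered())   # Anna 555-1234
print(follows(anna, phone, 0, 60))       # True: phone starts 22 chars after Anna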
Rules Examples
Dictionary file first_names.dict: anna, james, john, peter, ...

Input document t0: "Anna at James St. Office (555-1234), or James, her assistant - 555-7789 have the details."

R1: create view Phone as
    Regex('\d{3}-\d{4}', Document, text);

R2: create view FirstNameCand as
    Dictionary('first_names.dict', Document, text);

R3: create view FirstName as
    select * from FirstNameCand F
    where Not(ContainsDict('street_suffix.dict', RightContextTok(F.match, 1)));

Phone:          t1 = 555-1234,  t2 = 555-7789
FirstNameCand:  t3 = Anna,  t4 = James,  t5 = James
FirstName:      t6 = Anna,  t7 = James
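The following Python sketch mimics how R1-R3 evaluate on t0; it is a rough analogue of the Regex and Dictionary table functions, not SystemT's evaluator, and the contents of street_suffix.dict are assumed.

import re

doc = ("Anna at James St. Office (555-1234), or James, "
       "her assistant - 555-7789 have the details.")

def regex_table(pattern, text):
    # analogue of the Regex table function: all matching spans
    return [(m.start(), m.end()) for m in re.finditer(pattern, text)]

def dictionary_table(entries, text):
    # analogue of the Dictionary table function: case-insensitive word matches
    spans = []
    for entry in entries:
        spans += [(m.start(), m.end()) for m in
                  re.finditer(r"\b%s\b" % re.escape(entry), text, re.IGNORECASE)]
    return sorted(spans)

def next_token(text, pos):
    # analogue of RightContextTok(match, 1)
    toks = text[pos:].split()
    return toks[0].lower() if toks else ""

street_suffix = {"st", "st.", "ave", "rd"}                  # assumed contents
phone = regex_table(r"\d{3}-\d{4}", doc)                    # R1
first_name_cand = dictionary_table(["anna", "james"], doc)  # R2
first_name = [s for s in first_name_cand                    # R3
              if next_token(doc, s[1]) not in street_suffix]

print([doc[b:e] for b, e in phone])       # ['555-1234', '555-7789']
print([doc[b:e] for b, e in first_name])  # ['Anna', 'James']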
Rules Examples
R4: create view PersonPhoneAll as
    select Merge(F.match, P.match) as match
    from FirstName F, Phone P
    where Follows(F.match, P.match, 0, 60);

PersonPhoneAll:
t8  = Anna at James St. Office (555-1234)
t9  = James, her assistant - 555-7789
t10 = Anna at James St. Office (555-1234), or James, her assistant - 555-7789

R5: create table PersonPhone(match span);
    insert into PersonPhone
    (select * from PersonPhoneAll A)
    except all
    (select A1.* from PersonPhoneAll A1, PersonPhoneAll A2
     where Contains(A1.match, A2.match)
       and Not(Equals(A1.match, A2.match)));

PersonPhone:
t11 = Anna at James St. Office (555-1234)
t12 = James, her assistant - 555-7789
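R5's subtraction implements a consolidation step: any span strictly containing another is discarded. A small Python sketch (the offsets are hypothetical):

def consolidate(spans):
    # Drop every span that strictly contains another span.
    def strictly_contains(a, b):
        return a != b and a[0] <= b[0] and b[1] <= a[1]
    return [a for a in spans
            if not any(strictly_contains(a, b) for b in spans)]

t8, t9 = (0, 36), (38, 64)   # hypothetical offsets of the two short matches
t10 = (0, 64)                # the long match covering both
print(consolidate([t8, t9, t10]))  # t10 is dropped: [(0, 36), (38, 64)]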
Canonical Representation of Rules
Method Overview
Method Overview
Data provenance: Boris Glavic, Gustavo Alonso, ICDE 2009
Method Overview
Input: a set of correct and incorrect output examples.

Goal: generate refinements of the extractor that remove the incorrect examples while minimizing the effect on the rest of the results.

Basic idea:
- Data provenance allows one to understand the origins of an output.
- Cut any provenance link → the wrong output disappears.

[Figure: (simplified) provenance of the wrong output "Anna...555-7789": the Document feeds Dictionary(FirstNames.dict), producing the FirstNameCand tuple "Anna", and Regex(/\d{3}-\d{4}/), producing the Phone tuple "555-7789"; a Join with predicate Follows(name, phone, 0, 60) combines them into the PersonPhoneAll tuple.]
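A minimal Python sketch of the data model this suggests, assuming each tuple records the operator that produced it and the input tuples it was derived from (an assumption for illustration, not the paper's implementation):

class ProvNode:
    # One tuple plus its provenance: producing operator and input tuples.
    def __init__(self, operator, value, inputs=()):
        self.operator, self.value, self.inputs = operator, value, list(inputs)

anna = ProvNode("Dictionary(FirstNames.dict)", "Anna")
phone = ProvNode("Regex(\\d{3}-\\d{4})", "555-7789")
bad = ProvNode("Join(Follows(name, phone, 0, 60))", "Anna...555-7789",
               [anna, phone])
# Removing any node on this tree ('bad', 'anna', or 'phone') makes the
# wrong output disappear.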
Method Overview
Solution:

Stage 1: Generate high-level changes
  "Remove tuple t from the output of operator Op in the canonical representation of the extractor."
  Problems: 1) feasibility, 2) side effects

Stage 2: Generate low-level changes
  - How to modify the operator to implement a high-level change
  - Ranking
High-Level Change
DEFINITION (HIGH-LEVEL CHANGE). Let t be a tuple in an output table V. A high-level change for t is a pair (t′, Op), where Op is an operator in the canonical operator graph of V and t′ is a tuple in the output of Op, such that eliminating t′ from the output of Op by modifying Op results in eliminating t from V.
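Operationally, a high-level change is just such a pair; a minimal Python transcription (the field names are mine, not the paper's):

from dataclasses import dataclass

@dataclass(frozen=True)
class HighLevelChange:
    t_prime: str  # the tuple to eliminate from the output of Op
    op: str       # the operator in the canonical operator graph of V

# e.g., "remove Anna <-> 555-7789 from the output of the Join in R4":
hlc1 = HighLevelChange("Anna <-> 555-7789", "Join in R4")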
Computing Provenance
Algorithm to Generate HLCs
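The algorithm figure on this slide did not survive extraction; as a hedged reconstruction from the definition above and the example on the next slide, generating HLCs amounts to walking the provenance tree of each false positive and emitting one (tuple, operator) pair per node:

# Provenance of the false positive, as nested (operator, value, children)
# triples mirroring the example on the next slide.
prov = ("Join in R4", "Anna...555-7789",
        [("Select in R3", "Anna",
          [("Dictionary in R2", "Anna", [])]),
         ("Regex in R1", "555-7789", [])])

def generate_hlcs(node):
    # Emit a high-level change (t', Op) for every operator in the tree.
    op, value, children = node
    yield value, op
    for child in children:
        yield from generate_hlcs(child)

for t_prime, op in generate_hlcs(prov):
    print("remove %r from the output of %s" % (t_prime, op))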
HLC Example
[Figure: the provenance graph of the wrong output "Anna...555-7789", annotated with four high-level changes:
- HLC1: remove Anna <-> 555-7789 from the output of the Join (Follows(name, phone, 0, 60)) in R4
- HLC2: remove 555-7789 from the output of the Regex (/\d{3}-\d{4}/) operator in R1
- HLC3: remove Anna from the output of the Select (Not(ContainsDict('street_suffix.dict', ...))) in R3
- HLC4: remove Anna from the output of the Dictionary (FirstNames.dict) operator in R2]
Generating Low-Level Changes from HLCs
[Figure: the same provenance graph, with example low-level changes implementing two of the high-level changes:
- For HLC1 (remove Anna <-> 555-7789 from the output of the Join in R4): change the join predicate to Follows(name, phone, 0, 50).
- For HLC4 (remove Anna from the output of the Dictionary in R2): remove the entry 'anna' from FirstNames.dict.]
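For HLC1, one way to derive the numerical change is to pick the largest Follows() upper bound that still excludes the bad pair while keeping the good ones; the distances below are hypothetical, and the figure's choice of 50 is just one valid value:

good_pair_distances = [2, 22]  # e.g., James->555-7789 and Anna->555-1234
bad_pair_distances = [55]      # e.g., Anna->555-7789, the false positive

def tightened_max_distance(good, bad):
    # Largest max-distance that keeps all good pairs and drops all bad ones.
    candidate = min(bad) - 1
    return candidate if all(g <= candidate for g in good) else None

print(tightened_max_distance(good_pair_distances, bad_pair_distances))  # 54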
Generating Low-Level Changes from HLCs: Naive Approach
Input: set of HLCs
Output: list of LLCs, ranked based on their effects
Algorithm:
1) For each operator Op, consider all HLCs.
2) For each HLC, enumerate all possible LLCs.
3) For each LLC:
   - Compute the set of local tuples it removes from the output of Op.
   - Propagate these removals up through the provenance graph to compute the effect on the end-to-end result.
4) Rank the LLCs.
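A Python schematic of this loop (every helper is a placeholder for machinery described elsewhere in the paper):

def naive_llc_ranking(operators, hlcs_by_op, enumerate_llcs,
                      local_removals, propagate, score):
    # Enumerate every LLC for every HLC, compute end-to-end effects, rank.
    scored = []
    for op in operators:                                  # step 1
        for hlc in hlcs_by_op.get(op, []):
            for llc in enumerate_llcs(op, hlc):           # step 2
                removed = local_removals(op, llc)         # step 3, local
                final = propagate(op, removed)            # step 3, end-to-end
                scored.append((score(final), llc))
    scored.sort(key=lambda pair: pair[0], reverse=True)   # step 4
    return [llc for _, llc in scored]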
Problems with Naive Approach
Problem 1: The number of possible LLCs for an HLC can be very large.
Example: to remove an output tuple of a Dictionary operator with 1000 entries, there are 2^999 - 1 possible LLCs!
Solution: limit the LLCs considered to a set of tractable size, while still considering all feasible combinations of HLCs for a given operator:
1) Generate a single LLC for each of the k most promising combinations of HLCs for the given operator.
2) k is the number of LLCs presented to the user.
Problems with Naive Approach
Problem 2: Traversing the provenance graph is expensive: O(n²), where n is the size of the operator tree.
Solution: remember the mapping from each high-level change back to the affected output tuple.
Specific Classes of Low-Level Changes
1) Modify numerical join parameters. E.g., "Modify the max character distance of the Follows() predicate in the join operator of rule R4 from 60 to 20."
2) Remove dictionary entries. E.g., "Modify the Dictionary operator of rule R2 by removing the entry 'anna' from first_names.dict."
3) Add a filtering dictionary. E.g., "Add the predicate Not(ContainsDict('street_suffix.dict', RightContextTok(match, 1))) to the Dictionary operator of rule R3."
4) Add a filtering view (applies to an entire view). E.g., "Subtract from the result of rule R4 (PersonPhoneAll) spans that are strictly contained within another PersonPhoneAll span."
LLC Generation: Removing Dictionary Entries
[Figure (flattened table): for each entry in dictionary FirstNameDict ('anna', 'james'), the tuples it produces in the output of the Dictionary operator (Anna; James) and the corresponding tuples in the final output of the FirstName extractor (Anna ABC, Anna XYZ; James X, James Y, James Anderson), each marked correct or incorrect to show the effect of removing that entry.]

Generated LLCs: remove from dictionary FirstNameDict the following entries:
1. 'anna'
2. 'anna', 'james'
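A sketch of how the top-k entry-removal LLCs could be assembled: score each entry by the incorrect vs. correct final results it supports, then grow the removal set greedily. The per-entry effect counts below are illustrative assumptions, not values from the paper.

# entry -> (incorrect final results removed, correct final results removed)
effects = {"anna": (2, 0),   # dropping 'anna' removes 2 wrong results, 0 correct
           "james": (1, 2)}  # dropping 'james' removes 1 wrong but 2 correct

def top_k_removal_llcs(effects, k):
    # Greedily grow entry sets in order of (wrong removed - correct removed).
    ranked = sorted(effects, reverse=True,
                    key=lambda e: effects[e][0] - effects[e][1])
    llcs, chosen = [], []
    for entry in ranked[:k]:
        chosen.append(entry)
        llcs.append(tuple(chosen))
    return llcs

print(top_k_removal_llcs(effects, 2))  # [('anna',), ('anna', 'james')]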
Experiments
- Rule refinement approach implemented in the SystemT information extraction system
- Uses SystemT's AQL rule language

Goals:
- Quality evaluation of the generated refinements
- Performance evaluation

Setup:
- Ubuntu Linux 9.10, 2.26 GHz Intel Xeon CPU, 8 GB RAM
- 10-fold cross-validation
Extraction Tasks and Rule Sets
Person task
- 14 complex rules for identifying person names
  E.g., "CapitalizedWord followed by FirstName"; "LastName followed by Comma followed by CapitalizedWord"
- Rules for identifying other named entities (e.g., Organization, EmailAddress, Address)
  These can be used for filtering to enable refinement (e.g., "Morgan Stanley", "Georgia").

PersonPhone task
- 11 complex rules for identifying phone numbers
- High-quality Person extractor
- One rule to identify PersonPhone candidates: "Person followed by Phone within 0 to 60 characters"
Evaluation Datasets
          Training Set        Test Set
Dataset   #docs   #labels    #docs   #labels
ACE        273     5201        69     1220
CoNLL      946     6560       216     1842
Enron      434     4500       218     1969
EnronPP    322      157       161       46

ACE: collection of newswire reports, broadcast news, and conversations, with Person labels from the ACE05 dataset.
CoNLL: collection of news articles, with Person labels from the CoNLL 2003 Shared Task.
Enron, EnronPP: collections of emails from the Enron corpus, annotated with Person and PersonPhone labels, respectively.
Quality Evaluation
Quality Evaluation
- F1-measure improves by 6% to 26% within a few iterations.
- Recall remains stable.
- F1-measure and precision reach a plateau after the first few high-ranked refinements.
- Some types of low-level changes are not implemented yet.
Quality Evaluation: Comparison with Experts
- Two human experts
- Enron dataset, Person task
- Time budget: one hour
Performance Evaluation
- An expert takes one hour overall, 3 to 15 minutes per refinement.
- The system generates refinements in about 2 minutes.
Conclusion & Future work
- A database provenance technique for refining information extraction rules.

Future work:
- Extensions: other types of LLCs (e.g., regular expression modifications)
- Addressing false negatives
Thank You