Fully Automatic Wrapper Generation For Search Engines

VLDB 2006 Seoul
Automatic Extraction of Dynamic Record Sections from Search Engine Result Pages
Hongkun Zhao, Weiyi Meng, Clement Yu*
Department of Computer Science
State University of New York at Binghamton
* Department of Computer Science
University of Illinois at Chicago
September 15, 2006
Presentation Outline
• Background
• Dynamic section extraction
– Problem Statement
– The solution
• Experiments
• Related work
Background: Search Result Record (SRR)
Background: SRR Extraction - Motivations
• SRRs are frequently needed to feed into other Web
applications:
– Metasearch engines need to collect the SRRs from
different search engines and merge them.
– Comparison shopping services need to compare
SRRs from different search engines to find the
best deal.
Background: SRRs within Multiple Sections
Background: Main Research Issues
• Three levels of search result extraction
– Section identification
– Record extraction
– Data unit identification and annotation
• Automatic wrapper generation
Background: SRR Extraction – ViNTs
• Most current work on automatic search
result extraction focuses on record extraction,
including
– ViNTs (WWW 2005)
• ViNTs can extract records from sections
containing at least three records, including
non-result (static) records
Problem Definition: Dynamic Sections
• A typical search engine result page contains static,
semi-dynamic and dynamic contents.
– Static: query independent
– Semi-dynamic: basic structure is query
independent
– Dynamic: query dependent
• A dynamic section is a set of all SRRs that appear
consecutively and have certain common features
such as a common header and a common display
format.
Example: SRRs within Multiple Sections
Problem Definition: Dynamic Section
Extraction
Problem statement: automatically extract all dynamic
sections, as well as the SRRs within each dynamic
section, from the result pages of any search
engine.
Why dynamic section extraction:
• They correspond to search results and many applications
need them.
• Different applications may need SRRs from different
sections.
Problem Definition: Challenges in
Dynamic Section Extraction
• Non-uniform section format problem
• Section-record granularity problem
– Records versus sections
• Hidden section extraction problem
– Some sections may not appear in sample result
pages used for training
Result Page Layout Model
(Figure labels: template, sections, records)
MSE: Multiple Section Extraction
(Figure: MSE workflow. Component labels: Web Pages; MRE; DSE; Refining MRs and DSs; Checking Granularity for MRs; Record Mining From DSs; Section Instances Clustering; Refined Section Instances; Building Section Wrappers; Wrapper Building; Wrapper family)
MRE: Multi-Record section Extraction
• MRE is revised from ViNTs (WWW 2005)
• Using MRE to extract MRs has four potential
problems:
1. boundary problem, i.e., some records near the two
boundaries of an MR may be incorrectly extracted
2. sections with fewer than three records may not be
extracted
3. some extracted sections may contain static contents with
repeating patterns
4. some extracted MRs may mistakenly take consecutive
sections with the same format as records, and some large
records may be incorrectly extracted as sections.
DSE: Dynamic Section Extraction
Step 1: Identify candidate section boundary markers
(CSBM)
– Use a pair of result pages at a time
– CSBMs are usually static or semi-dynamic content lines that
appear in both result pages and have compatible tag paths
Step 2: Identify dynamic sections (DS) based on the
CSBMs
– Each (candidate) DS has a left boundary marker (LBM) and
a right boundary marker (RBM), which are CSBMs and are
not part of the DS
– Note: some DSs may be incorrect due to incorrect CSBMs
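A minimal sketch of the two DSE steps, assuming each result page has already been flattened into a list of (tag path, text) content lines. The names `find_csbms` and `candidate_sections` are hypothetical, and "compatible tag paths" is simplified here to exact equality of tag path and text, so semi-dynamic lines whose text varies with the query would need a looser comparison than the one shown.

```python
from typing import List, Tuple

Line = Tuple[str, str]  # (tag_path, text): an assumed content-line representation


def find_csbms(page_a: List[Line], page_b: List[Line]) -> List[int]:
    """Indices of lines in page_a that also occur in page_b (Step 1, simplified)."""
    seen = set(page_b)
    return [i for i, line in enumerate(page_a) if line in seen]


def candidate_sections(csbms: List[int]) -> List[Tuple[int, int]]:
    """Line ranges (start, end), end exclusive, between consecutive markers (Step 2).

    The markers act as LBM and RBM and are not part of the candidate DS.
    """
    return [(left + 1, right)
            for left, right in zip(csbms, csbms[1:])
            if right - left > 1]


page_a = [
    ("html/body/h2", "Web results"),
    ("html/body/ul/li", "apple pie recipe"),
    ("html/body/ul/li", "apple cider"),
    ("html/body/h2", "News results"),
    ("html/body/ul/li", "apple harvest report"),
    ("html/body/p", "Next page"),
]
page_b = [
    ("html/body/h2", "Web results"),
    ("html/body/ul/li", "train timetable"),
    ("html/body/h2", "News results"),
    ("html/body/ul/li", "rail strike"),
    ("html/body/p", "Next page"),
]

markers = find_csbms(page_a, page_b)
print(markers)                      # [0, 3, 5]
print(candidate_sections(markers))  # [(1, 3), (4, 5)]
```

In this toy input the two headers and the footer are query independent, so they become markers and the lines between them become candidate dynamic sections.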
MRs and DSs Refining
• Idea: Use MRs and DSs to refine each other to
– identify and discard static sections
– correct the boundaries of some MRs and DSs
• Note: To deal with the non-uniform section format
problem, neither of the two algorithms, MRE and
DSE, assumes there is a common format/pattern
among different sections when performing section
extraction
MRs and DSs Refining
1. MR = DS
2. MR ⊂ DS
3. MR ⊃ DS
4. MR ∩ DS ≠ ∅
5. MR ∩ DS = ∅
(Figure labels for the overlapping case: extra MR part, overlapping part, extra DS part)
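The relationship between an MR and a DS can be sketched by treating each as a range of content-line indices. This is an illustrative reading, assuming the five cases are equality, containment in either direction, partial overlap, and no overlap; the function name `relate` and the returned labels are hypothetical.

```python
from typing import Tuple

Span = Tuple[int, int]  # (start, end) line indices, end exclusive


def relate(mr: Span, ds: Span) -> str:
    """Classify how an MR and a DS relate as sets of line indices."""
    a, b = set(range(*mr)), set(range(*ds))
    if a == b:
        return "equal"
    if a < b:            # proper subset
        return "MR inside DS"
    if a > b:            # proper superset
        return "DS inside MR"
    if a & b:
        return "partial overlap"
    return "disjoint"


print(relate((2, 6), (2, 6)))  # equal
print(relate((1, 4), (3, 7)))  # partial overlap
```

In the partial-overlap case, the set differences `a - b` and `b - a` correspond to the extra MR part and the extra DS part.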
Mining Records from DSs
Goal: Identify records from dynamic sections that do not
match any MRs, such as those with fewer than three
records.
Method: Consider a dynamic section DS
1. Identify repeating tags within the tag forest for DS as
candidate separators
2. Use each candidate separator to partition DS into
records and select the partition with the highest
section cohesion.
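A minimal sketch of the two method steps, assuming the tag forest of a DS has been flattened into a plain sequence of tag names (a simplification of the forest structure). The helper names `candidate_separators` and `partition` are hypothetical.

```python
from collections import Counter
from typing import List


def candidate_separators(tags: List[str]) -> List[str]:
    """Tags that repeat within the section, in first-seen order."""
    counts = Counter(tags)
    return [t for t in dict.fromkeys(tags) if counts[t] > 1]


def partition(tags: List[str], separator: str) -> List[List[str]]:
    """Start a new record at every occurrence of the separator tag."""
    records: List[List[str]] = []
    current: List[str] = []
    for tag in tags:
        if tag == separator and current:
            records.append(current)
            current = []
        current.append(tag)
    if current:
        records.append(current)
    return records


tags = ["a", "br", "font", "a", "br", "font"]
print(candidate_separators(tags))  # ['a', 'br', 'font']
print(partition(tags, "a"))        # [['a', 'br', 'font'], ['a', 'br', 'font']]
print(partition(tags, "br"))       # [['a'], ['br', 'font', 'a'], ['br', 'font']]
```

Each candidate separator yields a different partition; choosing among them requires a scoring function such as section cohesion.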
Mining Records from DSs
Observations about section cohesion: records in a section
tend to be similar to each other, while the lines within
a record tend to be dissimilar to each other.
The cohesion of a section S with records r1, r2, …, rk:
$$\mathrm{cohesion}(S) = \frac{\text{average distance of the lines within each record}}{\text{average distance among the records}}$$
(Figure: example of a partition with high cohesion and a partition with low cohesion)
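The cohesion ratio can be sketched as follows. The slides do not define the two distance measures, so both are assumptions here: lines are reduced to a "line type" label with distance 0 or 1, and the distance between two records is one minus the `difflib.SequenceMatcher` ratio of their line-type sequences. The names `line_distance`, `record_distance` and `cohesion` are hypothetical.

```python
from difflib import SequenceMatcher
from itertools import combinations
from statistics import mean
from typing import List

Record = List[str]  # a record as a list of line-type labels (assumed encoding)


def line_distance(a: str, b: str) -> float:
    """Assumed line distance: 0 for the same line type, 1 otherwise."""
    return 0.0 if a == b else 1.0


def record_distance(r1: Record, r2: Record) -> float:
    """Assumed record distance: 1 minus the sequence similarity ratio."""
    return 1.0 - SequenceMatcher(None, r1, r2).ratio()


def cohesion(records: List[Record]) -> float:
    """Average within-record line distance divided by average between-record distance."""
    per_record = [mean(line_distance(a, b) for a, b in combinations(r, 2))
                  for r in records if len(r) > 1]
    intra = mean(per_record) if per_record else 0.0
    pairs = list(combinations(records, 2))
    inter = mean(record_distance(a, b) for a, b in pairs) if pairs else 0.0
    if inter == 0.0:  # identical records: avoid dividing by zero
        return float("inf") if intra > 0 else 0.0
    return intra / inter


good = [["title", "snippet", "url"], ["title", "snippet", "url"], ["title", "url"]]
bad = [["title"], ["snippet", "url", "title"], ["snippet", "url", "title", "url"]]
best = max([good, bad], key=cohesion)
print(best is good)  # True
```

Both partitions cover the same eight lines. The first cuts at every title line and scores higher because its records resemble each other while the lines inside each record differ.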
Solving Section-Record Granularity
Problem
Two subproblems:
• Oversized record problem: Some consecutive sections
are recognized as records or multiple small records
are recognized as a single large record
• Splitting record problem: Large records are
recognized as sections or large records are split into
smaller records
Solving Oversized Record Problem
• Use the record mining technique to try to find smaller
records from a candidate oversized record R.
– If no smaller records can be found, R is not an
oversized record
– If smaller records can be found, R is recognized as
an oversized record
– If smaller records can be found and they are similar
to the records mined from another (adjacent)
candidate oversized record R1, then R and R1 are
recognized as consecutive sections.
Solving Splitting Record Problem
• Let R be an MR with records (r1, …, rk), which is a
partition of R.
– We generate new partitions by merging these records in
different ways and calculate the cohesion of each partition.
– The partition with the highest cohesion will be selected and
larger records may be yielded as a result.
• If there exists a set of consecutive MRs that are
siblings under the same sub-tree of the DOM tree, and
all MRs in the set consist of only one record, then we
form a new section with each original section in the
set as a record and remove the original sections.
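The slides do not say which merges are tried. One possible reading, sketched below, enumerates every way of merging runs of consecutive records. That gives 2^(k-1) candidate partitions, so it is only practical for small k. The name `merged_partitions` is hypothetical; each candidate would then be scored with a cohesion function.

```python
from itertools import combinations
from typing import Iterator, List

Record = List[str]


def merged_partitions(records: List[Record]) -> Iterator[List[Record]]:
    """Yield every partition obtained by merging runs of consecutive records."""
    k = len(records)
    for n_cuts in range(k):
        for cuts in combinations(range(1, k), n_cuts):
            bounds = [0, *cuts, k]
            yield [
                [line for rec in records[lo:hi] for line in rec]
                for lo, hi in zip(bounds, bounds[1:])
            ]


records = [["a"], ["b"], ["c"]]
for candidate in merged_partitions(records):
    print(candidate)
# [['a', 'b', 'c']]
# [['a'], ['b', 'c']]
# [['a', 'b'], ['c']]
# [['a'], ['b'], ['c']]
```

Selecting the best candidate is then a matter of `max(merged_partitions(records), key=score)` for whatever scoring function `score` is in use.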
Certifying DSs Based on Multiple Result
Pages
• Multiple result pages are used
• If an MR on one result page matches an MR on
at least one other result page, both MRs are certified as
section instances of the same section schema.
• More than two result pages can be used to generate
section instance groups for different section schemas.
• A matching score is computed between two MRs
from two pages based on their tag path similarity,
SBM similarity and tag forest similarity.
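A hedged sketch of a matching score that combines the three similarities. How the similarities are measured and combined is not stated in the slides, so the choices here are assumptions: each similarity is a `difflib.SequenceMatcher` ratio, the three are averaged with equal weights, and each MR is a dictionary with hypothetical keys `tag_path`, `sbm` and `tag_forest`.

```python
from difflib import SequenceMatcher
from typing import Sequence, Tuple


def similarity(a: Sequence, b: Sequence) -> float:
    """Similarity in [0, 1] between two sequences."""
    return SequenceMatcher(None, a, b).ratio()


def matching_score(mr_a: dict, mr_b: dict,
                   weights: Tuple[float, float, float] = (1 / 3, 1 / 3, 1 / 3)) -> float:
    """Weighted mean of tag path, boundary marker, and tag forest similarity."""
    sims = (
        similarity(mr_a["tag_path"], mr_b["tag_path"]),
        similarity(mr_a["sbm"], mr_b["sbm"]),
        similarity(mr_a["tag_forest"], mr_b["tag_forest"]),
    )
    return sum(w * s for w, s in zip(weights, sims))


mr_page1 = {"tag_path": ["html", "body", "table", "tr"],
            "sbm": "Web results",
            "tag_forest": ["a", "br", "font"] * 2}
mr_page2 = {"tag_path": ["html", "body", "table", "tr"],
            "sbm": "Web results",
            "tag_forest": ["a", "br", "font"] * 3}
mr_other = {"tag_path": ["html", "body", "div"],
            "sbm": "Sponsored links",
            "tag_forest": ["img", "a"]}

print(matching_score(mr_page1, mr_page2) > matching_score(mr_page1, mr_other))  # True
```

Two MRs would be certified as instances of the same schema when their score exceeds some threshold; the threshold value is not given in the slides.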
Wrapper Generation
• Section wrapper format: <pref, seps, LBMs, RBMs>
– pref is the tag path that leads to the minimum subtree t that contains all records in this section
– seps is the separator set used to partition the subforest of t into records
– LBMs and RBMs are the sets of left and right
boundary markers of the section
• Page wrapper: a sequence of section wrappers
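The wrapper format maps naturally onto a small record type. The sketch below is one possible encoding, with the field types chosen as assumptions (a tuple of tag names for `pref`, sets of strings for the rest); the example values are invented for illustration.

```python
from dataclasses import dataclass
from typing import FrozenSet, List, Tuple


@dataclass(frozen=True)
class SectionWrapper:
    pref: Tuple[str, ...]  # tag path to the minimum subtree containing all records
    seps: FrozenSet[str]   # separators used to partition the subforest into records
    lbms: FrozenSet[str]   # left boundary markers of the section
    rbms: FrozenSet[str]   # right boundary markers of the section


PageWrapper = List[SectionWrapper]  # a page wrapper is a sequence of section wrappers

web = SectionWrapper(
    pref=("html", "body", "table", "tbody"),
    seps=frozenset({"tr"}),
    lbms=frozenset({"Web results"}),
    rbms=frozenset({"More web results"}),
)
page_wrapper: PageWrapper = [web]
print(page_wrapper[0].pref[-1])  # tbody
```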
Solving Hidden Section Extraction Problem
• For sections with zero or only one instance on
sample result pages, no wrapper will be generated.
• Use section family to solve this problem: A section
family represents a class of section schemas that share
some common features.
• Basic idea: Assume the schema of a hidden section is
similar to that of an existing section.
Solving Hidden Section Extraction Problem
An example of a section family: All member section
schemas have the same pref and seps, and their LBMs
(RBMs) share the same line text attribute.
<HTML>
<HEAD>
<BODY>
<TBODY>
<TR>
<TR>
<TR> LBM of Section 1
Section 1
<TR>
<TR>
<TR> RBM of Section 1
<TR>
LBM of Section 2
<TR>
<TR>
<TR>
Section 2
RBM of Section 2
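The family criterion in this example can be sketched as a comparison of schemas. The encoding is an assumption: each schema carries a single LBM and a single RBM rather than sets, and the "line text attribute" is modeled as an opaque string such as a style signature. The names `Marker`, `SectionSchema` and `same_family` are hypothetical.

```python
from typing import FrozenSet, NamedTuple, Tuple


class Marker(NamedTuple):
    text: str
    text_attr: str  # assumed stand-in for the "line text attribute"


class SectionSchema(NamedTuple):
    pref: Tuple[str, ...]
    seps: FrozenSet[str]
    lbm: Marker
    rbm: Marker


def same_family(a: SectionSchema, b: SectionSchema) -> bool:
    """Same pref and seps, and boundary markers with matching text attributes."""
    return (a.pref == b.pref
            and a.seps == b.seps
            and a.lbm.text_attr == b.lbm.text_attr
            and a.rbm.text_attr == b.rbm.text_attr)


path = ("html", "body", "table", "tbody")
section1 = SectionSchema(path, frozenset({"tr"}),
                         Marker("Web results", "bold"), Marker("More web results", "link"))
section2 = SectionSchema(path, frozenset({"tr"}),
                         Marker("News results", "bold"), Marker("More news results", "link"))
sidebar = SectionSchema(("html", "body", "div"), frozenset({"li"}),
                        Marker("Related searches", "bold"), Marker("", "plain"))

print(same_family(section1, section2))  # True
print(same_family(section1, sidebar))   # False
```

A hidden section whose markers differ in text but share these features with known sections would be extracted by the family's wrapper, even though it never appeared in the sample pages.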
Experimental Results
• Dataset
– 100 search engines from the ViNTs dataset, 19 with
multiple DSs
– 19 additional search engines that produce multiple DSs
– Total 38 search engines produce multiple DSs
– Collect 10 result pages for each search engine, 5 are used
for wrapper generation and 5 are used to test the wrappers
• Performance measures: Recall and Precision
– Perfect
– Partially correct (> 60% records are extracted)
Experimental Results
Section extraction results on all 119 search engines:
               #Actual  #Extracted  #Perfect  #Partially  Recall %  Recall %  Precision %  Precision %
                                              correct     Perfect   Total     Perfect      Total
Sample Pages   1057     1106        899       136         85.0      97.9      81.3         93.6
Test Pages     981      1028        820       134         83.6      97.2      79.8         92.8
Total          2038     2134        1719      270         84.3      97.6      80.6         93.2
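The slides do not spell out how the Perfect and Total columns are computed. The sketch below assumes that Perfect uses only perfectly extracted sections and that Total also counts partially correct ones; under that reading it reproduces the Test Pages row (981 actual, 1028 extracted, 820 perfect, 134 partially correct) to one decimal place. The function name `section_rates` is hypothetical.

```python
def section_rates(actual: int, extracted: int, perfect: int, partial: int) -> dict:
    """Recall and precision percentages under the assumed column definitions."""
    correct = perfect + partial  # "Total" counts perfect plus partially correct
    return {
        "recall_perfect": 100 * perfect / actual,
        "recall_total": 100 * correct / actual,
        "precision_perfect": 100 * perfect / extracted,
        "precision_total": 100 * correct / extracted,
    }


test_row = section_rates(actual=981, extracted=1028, perfect=820, partial=134)
for name, value in test_row.items():
    print(f"{name}: {value:.1f}")
# recall_perfect: 83.6
# recall_total: 97.2
# precision_perfect: 79.8
# precision_total: 92.8
```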
Experimental Results
Section extraction results on the 38 search engines whose
result pages have multiple dynamic sections:
               #Actual  #Extracted  #Perfect  #Partially  Recall %  Recall %  Precision %  Precision %
                                              correct     Perfect   Total     Perfect      Total
Sample Pages   652      670         538       92          82.5      96.6      80.2         94.0
Test Pages     590      611         468       95          79.3      95.4      76.6         92.1
Total          1242     1281        1006      187         81.0      96.1      78.5         93.1
Experimental Results
Record extraction results on all extracted sections:
               #Actual  #Extracted  #Correct  Recall %  Precision %
Sample Pages   9615     9597        9490      98.7      98.9
Test Pages     8248     8245        8139      98.7      98.7
Total          17863    17842       17628     98.7      98.8
Related Work
• Many existing works on record extraction from web
pages: RoadRunner, EXALG, IEPAD, DeLa, Omni,
MDR, ViPER …
• Only MDR (Liu, Grossman, Zhai, SIGKDD, 2003)
has the ability to output multiple sections but
– it does not differentiate dynamic sections from static
contents
– it does not address the non-uniform format problem and the
section-record granularity problem.
– the hidden section extraction problem does not occur for
MDR as it does not generate wrappers, which can lead to
other problems such as lower efficiency
Conclusions and Future Work
Conclusions:
– Studied the automatic section extraction problem
– Identified several interesting issues: non-uniform format
problem, section-record granularity problem and hidden
section extraction problem
– Provided solutions to the new problems
Future work
– Still room to improve: increase the accuracy of identifying
boundary markers of dynamic sections
– Section classification
– ……