
Towards Combining Web Classification and
Web Information Extraction: a Case Study
Ping LUO*, Fen LIN^, Yuhong XIONG*, Yong ZHAO*, Zhongzhi SHI^
*Hewlett-Packard Labs China
^Institute of Computing Technology, CAS
© 2008 Hewlett-Packard Development Company, L.P. The information contained herein is subject to change without notice
Web Content Analysis for Vertical Search
• Web Classification
− Identify the target pages
• Web Information Extraction
− Extract the metadata in the target pages
[Figure: Web pages after crawling → product pages (product name, model number, price …); course homepages (course title, ID, time, teacher …)]
OfCourse
• Search engine for online courses
− More than 60,000 courses from the top 50 universities in the US
Web Classification and Web Information Extraction
• WC vs. WIE
− Two sequential and separate phases
− Error accumulation
[Figure: Web Content Analysis for Vertical Search: Web Classification, then Web Information Extraction]
Contributions
• Combine Web Classification and Web Information Extraction in a probabilistic model to achieve mutual enhancement
[Figure: Web Content Analysis for Vertical Search: Web Classification and Web Information Extraction]
Motivating Examples (1)
[Figure: example Web page, labeled "WIE Oracle"]
• Lots of course-related terms on this page
• WIE helps to improve the precision of WC
Motivating Examples (2)
[Figure: example Web page, labeled "WIE Oracle"]
• Few course-related terms on this page
• WIE helps to improve the recall of WC
Problem Formulation (1)
• Notation
− x, a given Web page
− y, the class label of this page (indicating the type of the Web page for WC)
− xi (i = 1 … k), a text DOM leaf node in the page x
− yi (i = 1 … k), the class label of xi (indicating the type of the text node for WIE)
− k, the number of text DOM leaf nodes in this page
• Label assignment problem for both x and x1 … xk
Problem Formulation (2)
• Given a Web page x with k text DOM nodes x1 … xk
• Let y, y1 … yk be one possible label assignment for x, x1 … xk
• The principle of Maximum A Posteriori (MAP) for the label assignment problem
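The slide's own equation did not survive in this text. Written with the notation defined above, the MAP principle would plausibly read as follows (a reconstruction from the bullet points, not the authors' exact formula):

```latex
(y^{*}, y_1^{*}, \dots, y_k^{*})
  = \arg\max_{y,\, y_1, \dots, y_k}
    P(y, y_1, \dots, y_k \mid x, x_1, \dots, x_k)
```

The starred labels are the jointly most probable assignment for the page and all of its k text nodes.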
The Graphical Model
• Undirected graphical model for combining WC and WIE
− maximal clique on x and y
− maximal clique on each xi and yi (k such maximal cliques)
− maximal clique on all label variables y, y1 … yk
January, 2009
Expressing the Conditional Probability
• Adopting the form of CRFs
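The formula itself is missing from this text. A generic CRF-style form consistent with the three kinds of maximal cliques in the graphical model might look like the sketch below. The feature functions f_j, g_j, h_j and the weights λ_j, μ_j, ν_j are assumed names introduced for illustration only, not taken from the slides.

```latex
P(y, y_1, \dots, y_k \mid x, x_1, \dots, x_k)
  = \frac{1}{Z(x, x_1, \dots, x_k)}
    \exp\Big(
      \sum_{j} \lambda_j f_j(x, y)
      + \sum_{i=1}^{k} \sum_{j} \mu_j g_j(x_i, y_i)
      + \sum_{j} \nu_j h_j(y, y_1, \dots, y_k)
    \Big)

Z(x, x_1, \dots, x_k)
  = \sum_{y',\, y_1', \dots, y_k'}
    \exp\Big(
      \sum_{j} \lambda_j f_j(x, y')
      + \sum_{i=1}^{k} \sum_{j} \mu_j g_j(x_i, y_i')
      + \sum_{j} \nu_j h_j(y', y_1', \dots, y_k')
    \Big)
```

Here Z is the normalization factor, summed over every possible label assignment.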
Parameter Learning
Model Inference with Constrained Output (1)
• The challenge: the normalization factor in the conditional probability
− Exact computation when the structure of the elements in the vector y is simple
− Approximate computation otherwise (fully connected y, y1 … yk in our model)
Model Inference with Constrained Output (2)
• Use the domain knowledge to constrain the output label space
− A course homepage contains one and only one course title
− A non-course homepage does not contain a course title
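To illustrate why the two constraints above make exact inference tractable, here is a minimal sketch. It assumes binary labels (page: course homepage or not; node: course title or not) and hypothetical, already-computed scores `page_score` and `node_scores`. None of these names come from the slides, and the fully connected label clique is reduced here to the hard constraints alone. Under these assumptions only k + 1 label assignments remain, so the normalization factor can be summed exactly.

```python
import math

def constrained_assignments(k):
    """Yield every (y, node_labels) pair allowed by the two domain constraints."""
    # Non-course page (y = 0): no node may be labeled as a course title.
    yield 0, tuple([0] * k)
    # Course homepage (y = 1): exactly one node is the course title.
    for i in range(k):
        labels = [0] * k
        labels[i] = 1
        yield 1, tuple(labels)

def infer(page_score, node_scores):
    """Return the MAP assignment and its exact probability.

    page_score[y]      -- hypothetical score of page label y
    node_scores[i][yi] -- hypothetical score of label yi for text node i
    """
    k = len(node_scores)
    scored = []
    for y, labels in constrained_assignments(k):
        s = page_score[y] + sum(node_scores[i][yi] for i, yi in enumerate(labels))
        scored.append((s, y, labels))
    # Exact log-normalizer over the k + 1 allowed assignments (log-sum-exp).
    m = max(s for s, _, _ in scored)
    log_z = m + math.log(sum(math.exp(s - m) for s, _, _ in scored))
    best_s, best_y, best_labels = max(scored)
    return best_y, best_labels, math.exp(best_s - log_z)

best_y, best_labels, prob = infer(
    page_score=[0.0, 0.5],
    node_scores=[[0.0, -1.0], [0.0, 2.0], [0.0, -0.5]],
)
```

Without the constraints, the same binary setting would have 2^(k+1) assignments to sum over.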
Baseline Methods
• Local training and separate inference
− Train the two classifiers for WC and WIE separately
− Use these two classifiers sequentially when predicting
• Local training and joint inference
− Train the two classifiers for WC and WIE separately
− Use these two classifiers jointly when predicting
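The difference between the two prediction modes can be sketched as follows. This is an illustrative guess at how such baselines might behave, not the authors' implementation. `page_score[y]` and `node_scores[i]` (the score gain from labeling node i as the course title) are hypothetical outputs of two separately trained classifiers.

```python
def separate_inference(page_score, node_scores):
    """Sequential use: WC decides first, WIE runs only on accepted pages."""
    y = 1 if page_score[1] > page_score[0] else 0
    if y == 0:
        return 0, None  # WC rejected the page, so WIE never sees it
    title = max(range(len(node_scores)), key=lambda i: node_scores[i])
    return 1, title

def joint_inference(page_score, node_scores):
    """Joint use: the best title candidate contributes to the page decision."""
    title = max(range(len(node_scores)), key=lambda i: node_scores[i])
    course_total = page_score[1] + node_scores[title]
    if course_total > page_score[0]:
        return 1, title
    return 0, None

# WC alone slightly prefers "non-course", but one node looks strongly like a title.
page_score = [0.2, 0.0]
node_scores = [-1.0, 1.5]
sequential = separate_inference(page_score, node_scores)
joint = joint_inference(page_score, node_scores)
```

In this toy input the sequential pipeline rejects the page while the joint decision accepts it, which mirrors the recall case in Motivating Examples (2).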
Experimental Results
Conclusions and Discussion
• Tasks that are inherently joint should be addressed using only one model
− WC and WIE
• However, this increases the complexity of the statistical model
• This work shows that such a joint model is possible with tractable complexity, which is achieved by adopting the domain assumption
OfCourse
− Open search engine
• Supports interactive addition of course data
Experimental Data
• Positive data
− 530 course homepages
• Negative data
− 1200 other web pages