Towards Combining Web Classification and Web Information Extraction: a Case Study

Ping LUO*, Fen LIN^, Yuhong XIONG*, Yong ZHAO*, Zhongzhi SHI^
*Hewlett-Packard Labs China
^Institute of Computing Technology, CAS
January, 2009

© 2008 Hewlett-Packard Development Company, L.P. The information contained herein is subject to change without notice

Web Content Analysis for Vertical Search
• Web Classification (WC)
  − Identify the target pages among the Web pages collected by crawling
• Web Information Extraction (WIE)
  − Extract the metadata in the target pages
• Examples
  − Product pages: product name, model number, price …
  − Course homepages: course title, ID, time, teacher …

OfCourse
• Search engine for online courses
  − More than 60,000 courses from the top 50 universities in the US

Web Classification and Web Information Extraction
• WC vs. WIE
  − Two sequential and separate phases
  − Error accumulation
[Figure: Web Content Analysis for Vertical Search, with Web Classification followed by Web Information Extraction]

Contributions
• Combine WC and WIE in a probabilistic model to achieve mutual enhancement
[Figure: Web Content Analysis for Vertical Search, with Web Classification and Web Information Extraction combined]

Motivating Examples (1)
[Figure: example page, labeled "WIE Oracle"]
• Lots of course-related terms on this page
• WIE helps to improve the precision of WC

Motivating Examples (2)
[Figure: example page, labeled "WIE Oracle"]
• Few course-related terms on this page
• WIE helps to improve the recall of WC

Problem Formulation (1)
• Notation
  − x, a given Web page
  − y, the class label of this page (indicating the type of the Web page for WC)
  − xi (i = 1…k), a text DOM leaf node in the page x
  − yi (i = 1…k), the class label of xi (indicating the type of the text node for WIE)
  − k, the number of text DOM leaf nodes in this page
• Label assignment problem for both x and x1 … xk

Problem Formulation (2)
• Given a Web page x with k text DOM nodes x1 … xk
• Let y, y1 … yk be one possible label assignment for x, x1 … xk
• Apply the principle of Maximum A Posteriori (MAP) to the label assignment problem: choose the assignment y, y1 … yk with the highest posterior probability given x, x1 … xk

The Graphical Model
• Undirected graphical model for combining WC and WIE
  − One maximal clique on x and y
  − One maximal clique on each xi and yi (k such maximal cliques)
  − One maximal clique on all label variables y, y1 … yk

Expressing the Conditional Probability
• Adopting the form of conditional random fields (CRFs)

Parameter Learning

Model Inference with Constrained Output (1)
• The challenge: the normalization factor in the conditional probability
  − Exact computation when the structure of the elements in the vector y is simple
  − Approximate computation otherwise (y, y1 … yk are fully connected in our model)

Model Inference with Constrained Output (2)
• Use domain knowledge to constrain the output label space
  − A course homepage contains one and only one course title
  − A non-course homepage does not contain a course title

Baseline Methods
• Local training and separate inference
  − Train the two classifiers for WC and WIE separately
  − Use these two classifiers sequentially when predicting
• Local training and joint inference
  − Train the two classifiers for WC and WIE separately
  − Use these two classifiers jointly when predicting

Experimental Results

Conclusions and Discussion
• Tasks that are inherently joint should be addressed using only one model
  − WC and WIE
• However, a joint model increases the complexity of the statistical model
• This work shows that such a joint model is possible with tractable complexity, which is achieved by adopting the domain assumptions

OfCourse
• Open search engine
  − Supports interactive addition of course data

Experimental Data
• Positive data
  − 530 course homepages
• Negative data
  − 1200 other Web pages
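
Sketch: inference over the constrained label space

The two domain constraints in "Model Inference with Constrained Output (2)" shrink the output space enough that the normalization factor can be summed exactly. The Python sketch below illustrates the idea under assumptions that are not in the slides: each text node is labeled either "title" (1) or "other" (0), and the score of an assignment is a sum of per-page and per-node log-potentials. The names page_scores, node_scores, enumerate_assignments, and constrained_inference are hypothetical, and the potentials stand in for whatever feature functions the authors' model actually uses.

```python
import math


def enumerate_assignments(k):
    """Yield every (y, node_labels) pair allowed by the two domain constraints."""
    # Non-course page (y = 0): no node may be labeled as a course title.
    yield 0, tuple([0] * k)
    # Course homepage (y = 1): exactly one node is the course title.
    for i in range(k):
        labels = [0] * k
        labels[i] = 1
        yield 1, tuple(labels)


def constrained_inference(page_scores, node_scores):
    """Return (best_assignment, posterior) over the constrained label space.

    page_scores[y]      -- assumed log-potential for page label y
    node_scores[i][y_i] -- assumed log-potential for node i taking label y_i
    """
    k = len(node_scores)
    scored = []
    for y, labels in enumerate_assignments(k):
        s = page_scores[y] + sum(node_scores[i][labels[i]] for i in range(k))
        scored.append(((y, labels), s))
    # Exact normalization: only k + 1 terms, summed with log-sum-exp.
    m = max(s for _, s in scored)
    log_z = m + math.log(sum(math.exp(s - m) for _, s in scored))
    posterior = {a: math.exp(s - log_z) for a, s in scored}
    # MAP decision: the assignment with the highest posterior probability.
    best = max(posterior, key=posterior.get)
    return best, posterior


# Made-up scores for a page with three text nodes.
page_scores = [0.0, 0.5]
node_scores = [[0.0, -1.0], [0.0, 2.0], [0.0, -0.5]]
best, posterior = constrained_inference(page_scores, node_scores)
print(best)  # (1, (0, 1, 0)): course homepage, second node is the title
```

Under these assumptions a page with k text nodes has only k + 1 admissible assignments, so both the normalization factor and the MAP assignment come from a single pass over them. A richer node label set (ID, time, teacher) would enlarge this space, and the slides do not say how that case is handled.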