PEBL: Web Page Classification without Negative Examples
Hwanjo Yu, Jiawei Han, Kevin Chen-Chuan Chang
IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING, JAN 2004
Outline
• Problem statement
• Motivation
• Related work
• Main contribution
• Technical details
• Experiments
• Summary
Problem Statement
• Classify web pages into “user-interesting” classes.
  • E.g., a “Home-Page Classifier” or a “Call for Papers Classifier”.
• Negative samples are not given explicitly.
  • Only positive and unlabeled samples are available.
Motivation
• Collecting negative samples may be delicate and arduous.
  • Negative samples must uniformly represent the universal set.
  • Manually collected negative training examples could be biased.
• Predefined classes usually do not match users’ diverse and changing search targets.
Challenges
• Collecting unbiased unlabeled data from the universal set.
  • Random sampling of web pages on the Internet.
• Achieving classification accuracy from positive and unlabeled data as high as from labeled (positive and negative) data.
  • PEBL framework (Mapping-Convergence algorithm using SVM).
Related Work
• Semi-supervised learning
  • Requires a sample of labeled (+/-) data as well as unlabeled data
  • EM algorithm
  • Transductive SVM
• Single-class learning or classification
  • Rule-based (k-DNF)
    • Not tolerant of sparse, high-dimensional data.
    • Requires knowledge of the proportion of positive instances in the universal set.
  • Probability-based
    • Requires prior probabilities for each class.
    • Assumes linear separation.
  • OSVM, neural networks
Main Contribution
• Collecting only positive samples speeds up the process of building classifiers.
• The universal set of unlabeled samples can be reused for training different classifiers.
  • This supports example-based queries on the Internet.
• PEBL achieves accuracy as high as that of a typical framework (one trained on positive and negative examples), without loss of efficiency in testing.
SVM Overview
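As a reminder of the quantity an SVM maximizes, here is a minimal sketch of the geometric margin of a linear separator w·x + b. The function name `geometric_margin` and the toy points are illustrative assumptions, not taken from the paper.

```python
import math

def geometric_margin(w, b, points, labels):
    """Smallest signed distance from any labeled point to the hyperplane w.x + b = 0.

    labels are +1 / -1; a positive result means every point is on its correct side.
    """
    norm = math.sqrt(sum(wi * wi for wi in w))
    return min(
        y * (sum(wi * xi for wi, xi in zip(w, x)) + b) / norm
        for x, y in zip(points, labels)
    )

# Toy data (hypothetical): one negative and one positive point on a line.
points = [(0.0, 0.0), (4.0, 0.0)]
labels = [-1, +1]

centered = geometric_margin((1.0, 0.0), -2.0, points, labels)  # boundary at x = 2
shifted = geometric_margin((1.0, 0.0), -1.0, points, labels)   # boundary at x = 1
```

Both hyperplanes separate the two points, but the one placed midway between them has the larger margin, which is the separator an SVM would prefer.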
Mapping-Convergence Algorithm
• Mapping stage
  • A weak classifier (Φ1) draws an initial approximation of “strong” negative data.
  • Φ1 must not generate false negatives.
• Convergence stage
  • Runs iteratively, using a second base classifier (Φ2) that maximizes the margin, to make progressively better approximations of the negative data.
  • Φ2 must maximize the margin.
Mapping Stage
Comparing the frequency of each feature within the positive samples and within the unlabeled samples gives us a list of positive features.
Filtering out every unlabeled sample that contains a positive feature leaves behind just the “strong” negative samples.
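The filtering step above can be sketched as follows. This is a minimal illustration under stated assumptions: each page is represented as a set of features, and a feature counts as "positive" when its relative document frequency is higher in the positive set than in the unlabeled set. That threshold rule, the function names, and the toy pages are all assumptions for illustration.

```python
from collections import Counter

def positive_features(pos_docs, unlabeled_docs):
    """Features that appear in a larger fraction of positive docs than unlabeled docs."""
    pos_df = Counter(f for doc in pos_docs for f in set(doc))
    unl_df = Counter(f for doc in unlabeled_docs for f in set(doc))
    n_pos, n_unl = len(pos_docs), len(unlabeled_docs)
    return {f for f, c in pos_df.items() if c / n_pos > unl_df[f] / n_unl}

def strong_negatives(pos_docs, unlabeled_docs):
    """Unlabeled docs that contain none of the positive features."""
    feats = positive_features(pos_docs, unlabeled_docs)
    return [doc for doc in unlabeled_docs if not feats & set(doc)]

# Hypothetical toy pages, each reduced to a set of words.
pos_docs = [
    {"resume", "education", "skills"},
    {"resume", "experience", "skills"},
]
unlabeled_docs = [
    {"resume", "skills", "references"},
    {"recipe", "oven", "flour"},
    {"football", "score", "league"},
    {"education", "campus", "tuition"},
]

feats = positive_features(pos_docs, unlabeled_docs)
negatives = strong_negatives(pos_docs, unlabeled_docs)
```

Here the campus page is held back because it shares the feature "education" with the positive set, so only the recipe and football pages are kept as strong negatives. Being this conservative is what keeps the mapping stage from labeling a true positive as negative.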
Mapping-Convergence Algorithm
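The two stages can be sketched as a loop. This is a toy illustration, not the authors' implementation: the data are 1-D numbers, negatives are assumed to lie below positives, and the SVM is replaced by a stand-in, `train_threshold`, which is the maximum-margin separator for separable 1-D data (the midpoint between the closest negative and positive). All names and values are hypothetical.

```python
def train_threshold(pos, neg):
    """1-D stand-in for a max-margin classifier: midpoint between closest neg and pos."""
    return (max(neg) + min(pos)) / 2.0

def mapping_convergence(pos, unlabeled, strong_neg):
    """Grow the negative set until the classifier finds no more negatives."""
    neg = list(strong_neg)
    remaining = [x for x in unlabeled if x not in neg]
    while True:
        threshold = train_threshold(pos, neg)
        # Remaining unlabeled points that the current classifier calls negative.
        new_neg = [x for x in remaining if x < threshold]
        if not new_neg:
            return threshold, neg
        neg.extend(new_neg)
        remaining = [x for x in remaining if x >= threshold]

pos = [8.0, 9.0, 10.0]
unlabeled = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 8.5, 9.5]
strong_neg = [0.0, 1.0]  # assumed output of the mapping stage

threshold, neg = mapping_convergence(pos, unlabeled, strong_neg)
```

On this toy data the threshold moves from 4.5 to 6.0 to 6.5 and settles at 7.0, so the boundary tightens around the positive samples as more negatives accumulate.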
Experiments
• LIBSVM for the SVM implementation.
• Gaussian kernels, for better text categorization accuracy.
• Experiment 1: The Internet
  • 2388 pages from DMOZ as the unlabeled dataset
  • 368 personal homepages, 449 non-homepages
  • 192 college admission pages, 450 non-admission pages
  • 188 resume pages, 533 non-resume pages
Experiments
• Experiment 2: University CS Department
  • 4093 pages from WebKB as the unlabeled dataset
  • 1641 student pages, 662 non-student pages
  • 504 project pages, 753 non-project pages
  • 1124 faculty pages, 729 non-faculty pages
• The precision-recall (P-R) breakeven point is used as the performance measure.
• Compared against
  • TSVM: traditional SVM, trained on manually labeled positive and negative data (not the Transductive SVM listed under Related Work)
  • OSVM: one-class SVM
  • UN: treating unlabeled data as negative instances
Experiments
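The P-R breakeven point is the value at which precision equals recall. One common way to compute it, sketched below, is to rank the test pages by classifier score and cut the ranking at the number of true positives, since at that cutoff precision and recall are equal by construction. The slides do not say how the breakeven point was computed in the experiments, so this procedure, the function name `pr_breakeven`, and the toy scores are assumptions; ties in the scores are not handled.

```python
def pr_breakeven(scores, labels):
    """Precision (= recall) when the top-k pages are predicted positive, k = #positives.

    scores: classifier outputs, higher means more likely positive.
    labels: 1 for positive pages, 0 for negative pages.
    """
    n_pos = sum(labels)
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    hits = sum(label for _, label in ranked[:n_pos])
    return hits / n_pos

# Hypothetical scores for five test pages, three of which are truly positive.
scores = [0.9, 0.8, 0.7, 0.4, 0.2]
labels = [1, 0, 1, 1, 0]

breakeven = pr_breakeven(scores, labels)
```

In this example two of the three top-ranked pages are true positives, so both precision and recall are 2/3 at the cutoff.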
Summary
• Building a classifier for a class of interesting web pages normally requires laborious preprocessing.
• The PEBL framework eliminates the need for negative training samples.
• The M-C algorithm achieves accuracy as high as that of a traditional SVM.
• Training takes an additional multiplicative logarithmic factor in time on top of SVM training.