An Efficient Centroid Based Chinese Web Page Classifier

LIU Hui
EE Dept of Tsinghua Univ. China
Aug 28, 2003
Outline



Background
Basic Technique
Classifier Design & Implementation





Idea
Architecture
Feature
Experiment
Summary
Background of Web Page Classification
Explosive growth of information calls for organization
 Digital Library
 Search Engine
 Special (Categorized) Sites
Active research areas
 Data Mining
 Information Retrieval
 Pattern Recognition
 Automatic Text Categorization
Background of Our Classifier
Net-compass Search Engine
 An emerging large and distributed search engine
 Classifier embedded in its new version
Chinese web page categorization competition
 Held on March 14th–15th, 2003
 Ranked first
Workgroup
 EE Dept of Tsinghua Univ., 3 master's students & 1 undergraduate student
Basic Text Categorization System
Training: Training Samples -> Preprocessing -> Feature Selection -> Training -> Classifier
Testing: Testing Samples -> Preprocessing -> Feature Selection -> Classifier (Testing) -> Result
Feature Selection
Term Frequency (TF)
Term Frequency & Inverse Document Frequency (TF.IDF)
Mutual Information (MI)
\[ I(t, c) = \log \frac{\Pr(t \mid c)}{\Pr(t)} = \log \frac{\Pr(t, c)}{\Pr(t) \cdot \Pr(c)} \]
χ² Statistics
\[ \chi^2(t, c) = \frac{N \cdot (AD - CB)^2}{(A + C)(B + D)(A + B)(C + D)} \]
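The two statistical measures on this slide can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation. It assumes the usual contingency-table reading of the symbols, which the slide does not spell out: A is the number of documents in class c that contain term t, B the number outside c that contain t, C the number in c without t, D the number outside c without t, and N = A + B + C + D. The function names are hypothetical.

```python
import math

def mutual_information(a, b, c, d):
    """I(t, c) = log( Pr(t, c) / (Pr(t) * Pr(c)) ), estimated from document counts.

    Assumed meaning of the counts (not stated on the slide):
    a: docs in class with term, b: docs outside class with term,
    c: docs in class without term, d: docs outside class without term.
    """
    n = a + b + c + d
    if a == 0:
        # The term never co-occurs with the class: log(0) is minus infinity.
        return float("-inf")
    # Pr(t, c) = a/n, Pr(t) = (a+b)/n, Pr(c) = (a+c)/n
    return math.log(a * n / ((a + b) * (a + c)))

def chi_square(a, b, c, d):
    """chi^2(t, c) = N * (AD - CB)^2 / ((A+C)(B+D)(A+B)(C+D))."""
    n = a + b + c + d
    denom = (a + c) * (b + d) * (a + b) * (c + d)
    if denom == 0:
        # Degenerate table (an empty row or column): no evidence either way.
        return 0.0
    return n * (a * d - c * b) ** 2 / denom
```

A term that appears in exactly the documents of one class scores high on both measures, while a term spread evenly across classes scores zero on both.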
Training - Statistical Machine Learning
Vector Distance
 Centroid Based Method
 k-Nearest Neighbor: lazy learning
 Support Vector Machine: Structural Risk Minimization
Feedback & Combining Classifiers
 Neural Network
 Boosting method
Probability
 Naïve Bayes: Pr(Term|Class) -> Pr(Text|Class)
Idea

Large Database



Fast Speed
Tolerable Precision
Fast-changing Web resources


Net-compass Search Engine
Easy-to-build classifier
Fast Training
Supporting multi-language


Word segmentation
Easy Training Set Building & Updating
Architecture
Training: Training Samples -> Preprocessing -> Feature Selection -> Vector Generation -> Model Generation
Feedback
Testing: Testing Samples -> Formatting & Segmentation -> Vector Generation -> Classification
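The vector generation, model generation and classification steps of a centroid based classifier can be sketched as follows. This is a generic illustration under stated assumptions, not the system described in these slides: documents are already segmented into tokens, term weights are plain term frequencies rather than the adjusted TF.IDF weights, similarity is the cosine measure, and the feedback step is omitted. All function names are hypothetical.

```python
import math
from collections import Counter, defaultdict

def to_vector(tokens):
    """Vector generation: term-frequency vector scaled to unit length."""
    counts = Counter(tokens)
    norm = math.sqrt(sum(v * v for v in counts.values()))
    return {t: v / norm for t, v in counts.items()} if norm else {}

def train_centroids(samples):
    """Model generation: one unit-length centroid per class.

    samples is an iterable of (tokens, label) pairs.
    """
    sums = defaultdict(lambda: defaultdict(float))
    for tokens, label in samples:
        for term, weight in to_vector(tokens).items():
            sums[label][term] += weight
    centroids = {}
    for label, vec in sums.items():
        norm = math.sqrt(sum(w * w for w in vec.values()))
        centroids[label] = {t: w / norm for t, w in vec.items()}
    return centroids

def classify(tokens, centroids):
    """Classification: pick the class whose centroid is most similar."""
    vec = to_vector(tokens)

    def cosine(centroid):
        # Both vectors have unit length, so the dot product is the cosine.
        return sum(w * centroid.get(t, 0.0) for t, w in vec.items())

    return max(centroids, key=lambda label: cosine(centroids[label]))
```

Training is a single pass over the samples and classification costs one dot product per class, which is consistent with the speed goals stated on the Idea slide.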
Features
Chinese Word Segmentation
 Dictionary built on search engine log
  Adaptability, Manageability, Accuracy
 Maximum Matching Segmenting Method
  Fast, tolerable accuracy
Preprocessing
 Noise Filtering
  Stop words: common words, discarded words
  Advertising links: length & content
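Dictionary-based maximum matching and stop word filtering can be illustrated with a short sketch. It assumes the forward variant of maximum matching (the slide does not say which direction is used), a small in-memory dictionary instead of one built from a search engine log, and a hypothetical maximum word length of four characters.

```python
def forward_max_match(text, dictionary, max_len=4):
    """Segment text by taking the longest dictionary word at each position.

    Characters that start no dictionary word are emitted as single tokens.
    """
    tokens = []
    i = 0
    while i < len(text):
        # Try the longest candidate first, shrinking down to one character.
        for size in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + size]
            if size == 1 or candidate in dictionary:
                tokens.append(candidate)
                i += size
                break
    return tokens

def remove_stop_words(tokens, stop_words):
    """Noise filtering: drop tokens that appear in the stop word list."""
    return [t for t in tokens if t not in stop_words]
```

For example, with the dictionary {"清华", "清华大学", "大学", "学生"}, the text "清华大学的学生" is segmented into "清华大学", "的", "学生", and filtering the stop word "的" leaves the two content words.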
Features
Combined Feature Selection
 χ² Statistics: tends to choose high-frequency words
 Mutual Information: tends to choose low-frequency words
\[ W(t, c) = \lambda \cdot \chi^2(t, c) + (1 - \lambda) \cdot I(t, c), \quad 0 \le \lambda \le 1, \quad \lambda = 0.93 \]
Subspace
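The combined weight W(t, c) can be sketched as a ranking function over candidate terms. This is an illustration only: it assumes the standard contingency-table counts (A: documents in the class with the term, B: outside the class with the term, C: in the class without the term, D: outside the class without the term), it ranks terms that never occur in the class last, and it applies no rescaling to the two measures because the slide does not mention any. The function names and the top-k selection are hypothetical.

```python
import math

def chi_square(a, b, c, d):
    """chi^2(t, c) from the 2x2 contingency counts."""
    n = a + b + c + d
    denom = (a + c) * (b + d) * (a + b) * (c + d)
    return n * (a * d - c * b) ** 2 / denom if denom else 0.0

def mutual_information(a, b, c, d):
    """I(t, c) = log( Pr(t, c) / (Pr(t) * Pr(c)) ); requires a > 0."""
    n = a + b + c + d
    return math.log(a * n / ((a + b) * (a + c)))

def combined_weight(a, b, c, d, lam=0.93):
    """W(t, c) = lam * chi^2(t, c) + (1 - lam) * I(t, c)."""
    if a == 0:
        # Assumption: a term absent from the class is never selected for it.
        return float("-inf")
    return lam * chi_square(a, b, c, d) + (1 - lam) * mutual_information(a, b, c, d)

def select_features(term_counts, k, lam=0.93):
    """Keep the k terms with the highest combined weight for one class.

    term_counts maps each term to its (A, B, C, D) tuple.
    """
    ranked = sorted(
        term_counts,
        key=lambda t: combined_weight(*term_counts[t], lam),
        reverse=True,
    )
    return ranked[:k]
```

With λ = 0.93 the χ² term dominates the ranking, and the mutual information term gives a small boost to rarer terms that are strongly tied to the class.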
Features
Adaptive Factors
 Adjust the model to compensate for deficiencies of the training set
 Class Weight
 VIP word factor
\[ W'(t, d) = \mathrm{class\_weight}[\mathrm{class}] \cdot \mathrm{VIP\_factor}[\mathrm{class}] \cdot W_{TF.IDF} \]
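The adjusted weight can be sketched as a simple lookup and multiplication. The slide gives only the formula, so several details here are assumptions: the factors are stored in per-class dictionaries, missing entries default to 1.0, and the VIP factor is applied only when the term belongs to a per-class list of VIP words (the hypothetical `vip_words` argument).

```python
def adjusted_weight(tfidf, cls, term, class_weight, vip_factor, vip_words):
    """W'(t, d) = class_weight[class] * VIP_factor[class] * W_TF.IDF.

    Assumption: VIP_factor only applies to terms listed as VIP words
    for the class; other terms are scaled by the class weight alone.
    """
    weight = class_weight.get(cls, 1.0) * tfidf
    if term in vip_words.get(cls, ()):
        weight *= vip_factor.get(cls, 1.0)
    return weight
```

For example, with a class weight of 1.2 and a VIP factor of 2.0 for a "sports" class, a VIP word with a TF.IDF weight of 0.5 ends up with 1.2, while an ordinary word with the same TF.IDF weight ends up with 0.6.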

Implementation




Berkeley DB
Structured dictionary
Avoid I/O
3000 medium-sized Chinese Web pages: 50 seconds
Experiment
Corpus
 Chinese Web Page training set
 Provided by Peking University
 11 classes, 14000 samples, highly unbalanced distribution
Evaluation
 Precision, Recall, F-measure
\[ F = \frac{2 \cdot R \cdot P}{R + P} \]
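The three evaluation measures can be computed per class from the counts of true positives, false positives and false negatives. The sketch below uses the standard definitions of precision and recall, which the slide does not restate, and hypothetical function names.

```python
def f_measure(precision, recall):
    """F = 2 * R * P / (R + P); defined as 0.0 when both inputs are 0."""
    total = precision + recall
    return 2 * recall * precision / total if total else 0.0

def precision_recall_f(tp, fp, fn):
    """Per-class precision, recall and F-measure from raw counts.

    tp: pages correctly assigned to the class
    fp: pages wrongly assigned to the class
    fn: pages of the class that were assigned elsewhere
    """
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall, f_measure(precision, recall)
```

As a check against the reported results, `f_measure(0.829787, 0.772277)` evaluates to approximately 0.800000, the F-measure listed for class 1.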
Experiment
Result
#     Precision   Recall      F-measure
1     0.829787    0.772277    0.800000
2     0.259259    0.583333    0.358974
3     0.812183    0.884146    0.784314
4     0.961661    0.815718    0.882698
5     0.859873    0.823171    0.841121
6     0.802768    0.800000    0.801382
7     0.658768    0.847561    0.741333
8     0.903448    0.836170    0.868508
9     0.883978    0.695652    0.778589
10    0.735450    0.960829    0.833167
11    0.955932    0.938436    0.947103
Ave   0.862267    0.828680    0.845140
Experiment
Discussion
 More samples, more accurate
 Some classes are more difficult
 Corpus coverage is not large enough
 Open testing: 85%
[Figure: Relation between Precision and number of training samples]
Summary
An efficient Chinese Web Page Classifier
Clear Design
 Centroid based, general steps
Novel Features
 Preprocessing tricks
 Combined feature selection
 Subspace & Adaptive factors
Satisfactory Performance
 Comparatively high accuracy
 Very fast speed
 High adaptability
Thank you all!
Questions are welcome.