An Efficient Centroid Based Chinese Web Page Classifier
LIU Hui
EE Dept of Tsinghua Univ., China
Aug 28, 2003

Outline
- Background
- Basic Technique
- Classifier Design & Implementation
  - Idea
  - Architecture
  - Feature
- Experiment
- Summary

Background of Web Page Classification
- Explosive growth of information calls for organization
  - Digital Library
  - Search Engine
  - Special (Categorized) Sites
- Research hot spots
  - Data Mining
  - Information Retrieval
  - Pattern Recognition
  - Automatic Text Categorization

Background of Our Classifier
- Net-compass Search Engine
  - An emerging large, distributed search engine
  - Classifier embedded in its new version
- Chinese web page categorization competition
  - Held on March 14th-15th, 2003
  - Ranked first
- Workgroup
  - EE Dept of Tsinghua Univ., 3 master's students & 1 undergraduate student

Basic Text Categorization System
- Training: Training Samples -> Preprocessing -> Feature Selection -> Training -> Classifier
- Testing: Testing Samples -> Preprocessing -> Feature Selection -> Classifier (Testing) -> Result

Feature Selection
- Term Frequency (TF)
- Term Frequency & Inverse Document Frequency (TF.IDF)
- Mutual Information (MI)
  $I(t, c) = \log \frac{\Pr(t \mid c)}{\Pr(t)} = \log \frac{\Pr(t, c)}{\Pr(t)\,\Pr(c)}$
- $\chi^2$ Statistics
  $\chi^2(t, c) = \frac{N (AD - CB)^2}{(A + C)(B + D)(A + B)(C + D)}$

Training - Statistical Machine Learning
- Vector Distance
  - Centroid Based Method
  - k-Nearest Neighbor: lazy learning
  - Support Vector Machine: Structural Risk Minimization
- Feedback & Combining Classifiers
  - Neural Network
  - Boosting method
- Probability
  - Naïve Bayes: Pr(Term | Class) -> Pr(Text | Class)

Idea
- Large database: fast speed, tolerable precision
- Web resources: fast changing
- Net-compass Search Engine: easy building
- Classifier: fast training
- Supporting multiple languages: word segmentation
- Easy training set building & updating

Architecture
- Training: Training Samples -> Preprocessing -> Feature Selection -> Vector Generation -> Model Generation
- Feedback
- Testing: Testing Samples -> Formatting & Segmentation -> Vector Generation -> Classification

Features
- Chinese Word Segmentation
  - Dictionary built on search engine log: adaptability, manageability, accuracy
  - Maximum Matching Segmenting Method
- Preprocessing
  - Fast, tolerable accuracy
- Noise Filtering
  - Stop words: common words, discarded words
  - Advertising links: filtered by length & content

Features
- Combined Feature Selection
  - $\chi^2$ statistics: tend to choose high-frequency words
  - Mutual Information: tends to choose low-frequency words
  - $W(t, c) = \lambda\,\chi^2(t, c) + (1 - \lambda)\,I(t, c)$, $0 \le \lambda \le 1$, $\lambda = 0.93$
- Subspace

Features
- Adaptive Factors
  - Adjust the model, compensate for deficiencies of the training set
  - Class weight
  - VIP word factor
  - $W'(t, d) = \mathrm{class\_weight}[class] \times \mathrm{VIP\_factor}[class] \times W_{TF.IDF}$

Implementation
- Berkeley DB
- Structured dictionary
- Avoid I/O
- 3000 medium-sized Chinese Web pages: 50 seconds

Experiment
- Corpus
  - Chinese Web Page training set
  - Provided by Peking University
  - 11 classes, 14000 samples, highly unbalanced distribution
- Evaluation
  - Precision, Recall, F-measure
  - $F = \frac{2 R P}{R + P}$

Experiment Result

  #     Precision   Recall      F-measure
  1     0.829787    0.772277    0.800000
  2     0.259259    0.583333    0.358974
  3     0.812183    0.884146    0.784314
  4     0.961661    0.815718    0.882698
  5     0.859873    0.823171    0.841121
  6     0.802768    0.800000    0.801382
  7     0.658768    0.847561    0.741333
  8     0.903448    0.836170    0.868508
  9     0.883978    0.695652    0.778589
  10    0.735450    0.960829    0.833167
  11    0.955932    0.938436    0.947103
  Ave   0.862267    0.828680    0.845140

Experiment
- More samples, more accurate
- Some classes are more difficult
- Corpus coverage not large enough
- Open testing: 85%

Discussion
- Relation between precision and number of training samples

Summary
- An efficient Chinese Web Page Classifier
- Clear design
  - Centroid based, general steps
- Novel features
  - Preprocessing tricks
  - Combined feature selection
  - Subspace & adaptive factors
- Satisfactory performance
  - Comparatively high accuracy
  - Very fast speed
  - High adaptability

Thank you all! Questions are welcome.
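
Sketch: maximum matching segmentation. The Features slide names a "Maximum Matching Segmenting Method" driven by a dictionary but gives no details. The block below is a minimal sketch of the common forward (left-to-right) variant; the scan direction, the function name `max_match`, the `max_len` cap, and the toy dictionary are all assumptions made for illustration, not the deck's actual implementation.

```python
def max_match(text, dictionary, max_len=4):
    """Greedy forward maximum matching (assumed variant).

    At each position, take the longest dictionary word that starts there;
    if nothing matches, emit the single character as its own token.
    """
    tokens = []
    i = 0
    while i < len(text):
        # Try the longest candidate first, then shrink one character at a time.
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            if size == 1 or piece in dictionary:
                tokens.append(piece)
                i += size
                break
    return tokens


# Toy dictionary (hypothetical); the deck builds its dictionary from search engine logs.
toy_dict = {"清华", "清华大学", "大学", "网页", "分类"}
print(max_match("清华大学网页分类", toy_dict))  # ['清华大学', '网页', '分类']
```

The greedy longest-first scan needs only set lookups per position, which fits the "fast, tolerable accuracy" trade-off stated on the slide.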
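
Sketch: combined feature selection score. The slides combine the chi-square statistic and mutual information into one weight with a mixing factor of 0.93. The sketch below assumes the usual contingency-table convention, which the slides do not spell out: A = documents in class c containing term t, B = documents outside c containing t, C = documents in c without t, D = documents outside c without t. The function names and the parameter name `lam` are hypothetical, and whether the two scores are rescaled before mixing is not stated in the deck, so none is applied here.

```python
import math


def chi_square(a, b, c, d):
    """Chi-square statistic N(AD - CB)^2 / ((A+C)(B+D)(A+B)(C+D))."""
    n = a + b + c + d
    denom = (a + c) * (b + d) * (a + b) * (c + d)
    if denom == 0:
        return 0.0  # degenerate table: term or class is absent or universal
    return n * (a * d - c * b) ** 2 / denom


def mutual_information(a, b, c, d):
    """log(Pr(t,c) / (Pr(t) Pr(c))) estimated as log(A*N / ((A+B)(A+C)))."""
    n = a + b + c + d
    if a == 0:
        return float("-inf")  # term never co-occurs with the class
    return math.log(a * n / ((a + b) * (a + c)))


def combined_score(a, b, c, d, lam=0.93):
    """W(t, c) = lam * chi2 + (1 - lam) * MI, with 0 <= lam <= 1."""
    return lam * chi_square(a, b, c, d) + (1 - lam) * mutual_information(a, b, c, d)


# A term that appears in every document of the class and nowhere else.
print(combined_score(10, 0, 0, 10))
```

Terms would then be ranked per class by `combined_score` and the top-scoring ones kept as features; the cut-off is not given in the slides.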
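
Sketch: centroid based classification. The deck describes the classifier as centroid based, with vector generation followed by model generation. A minimal sketch of that general idea follows, assuming sparse term-weight vectors stored as dicts and cosine similarity as the distance; the similarity measure, the function names, and the toy samples are assumptions, and the subspace and adaptive-factor steps from the slides are not modelled.

```python
import math
from collections import defaultdict


def normalize(vec):
    """Scale a sparse {term: weight} vector to unit length."""
    norm = math.sqrt(sum(w * w for w in vec.values()))
    return {t: w / norm for t, w in vec.items()} if norm else dict(vec)


def train_centroids(samples):
    """samples: iterable of (label, {term: weight}); returns {label: unit centroid}."""
    sums = defaultdict(lambda: defaultdict(float))
    for label, vec in samples:
        for term, w in normalize(vec).items():
            sums[label][term] += w
    # The summed vector points in the same direction as the mean vector.
    return {label: normalize(total) for label, total in sums.items()}


def classify(vec, centroids):
    """Return the label whose centroid has the highest cosine similarity."""
    doc = normalize(vec)

    def cosine(centroid):
        return sum(w * centroid.get(t, 0.0) for t, w in doc.items())

    return max(centroids, key=lambda label: cosine(centroids[label]))


# Toy training data (hypothetical weights, e.g. TF.IDF values).
model = train_centroids([
    ("sports", {"球": 2.0, "赛": 1.0}),
    ("sports", {"球": 1.0}),
    ("finance", {"股": 3.0, "市": 1.0}),
])
print(classify({"球": 1.0, "市": 0.1}, model))  # sports
```

Training is a single pass that sums vectors per class, and testing compares each page against one vector per class, which is consistent with the fast training and fast classification the deck emphasizes.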
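
Sketch: F-measure. The evaluation slide defines the F-measure as 2RP / (R + P). The short helper below (the name `f_measure` is hypothetical) applies that formula to the precision and recall reported for class 1 and for the average row of the result table.

```python
def f_measure(precision, recall):
    """Harmonic mean of precision and recall: 2RP / (R + P)."""
    if precision + recall == 0:
        return 0.0
    return 2 * recall * precision / (recall + precision)


# Class 1 and the average row from the result table.
print(round(f_measure(0.829787, 0.772277), 4))  # 0.8
print(round(f_measure(0.862267, 0.828680), 4))  # 0.8451
```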