Automatic Syllabus Classification
JCDL – Vancouver – 22 June 2007
Edward A. Fox (presenting co-author), Xiaoyan Yu, Manas Tungare, Weiguo Fan, Manuel Perez-Quinones, William Cameron, GuoFang Teng, and Lillian (“Boots”) Cassel

Why Study the Syllabus Genre?
► Educational resource
► Importance to the educational community
    Educators
    Students
    Self-learners
► Thanks to NSF DUE grant 5328255 (personalization support for NSDL)

Where to look for a specific syllabus?
► Non-standard publishing mechanisms:
    Instructor’s website
    CMSs (courseware management systems, e.g., Sakai)
    Catalogs
► Limited access outside the university
► Search on the Web
    Many non-relevant links in search results

Syllabus Library
► Bootstrapping
    Identify true syllabi from search results
    Store in a repository
    Develop tools & applications
► Scaling up
    Encourage contributions from educational communities

An Essential Step towards Syllabus Library: Classification
► Classification Objects: Potential syllabi in Computer Science: search on the Web, using syllabus keywords, only in the educational domains
► Class Definition
► Feature Selection
► Model Selection
► Training and Testing

Four Classes
► Class distribution on 1020 documents manually tagged
    Partial 20%
    Full 49%
    Entry 13%
    Noise 18%
► Classes: Full Syllabus, Partial Syllabus, Entry Page, Noise

Syllabus Components
► course code
► title
► class time & location
► offering institution
► teaching staff
► course description
► objectives
► web site
► prerequisite
► textbook
► grading policy
► schedule
► assignment
► exam and resources

Features
► 84 Genre-specific Features
    the occurrences of keywords
    the positions of keywords, and
    the co-occurrences of keywords and links
► A series of keywords for each syllabus component

Classification Models
► Discriminative Models
    Support Vector Machines (SVM)
    SMO-L: Sequential Minimal Optimization, accelerating the training process of SVM
    SMO-P: SMO with a polynomial kernel
► Generative Models
    Naïve Bayes (NB)
    NB-K: Applying kernel methods to estimate the distribution of numeric attributes in NB modeling

Evaluation
► Training corpus: 1020 out of the 8000+ potential syllabi
► All in HTML, PDF, PostScript, or Text
► Manual tagging on the training corpus
    Unanimous agreement by three co-authors
► Evaluation strategy: ten-fold cross validation
► Metrics: F1 (an overall measure of classification performance)

Results w. random set
Best items are in purple boxes.
Acctr: Classification accuracy on the training set.

Results (Cont’d)
► SVM outperforms NB regarding our syllabus classification on average.
► All classifiers fail in identifying the partial syllabus class.
► The kernel settings for NB are not helpful in the syllabus classification task.
► Classification accuracy on training data is not that good.

Future Work
► Feature selection
    Add general feature selection methods on text classification, e.g., Document Frequency, Information Gain, and Mutual Information
    Hybrid: combine our genre-specific features with the general features

Future Work (Cont’d)
► Syllabus Library
    Welcome to http://doc.cs.vt.edu
    Share your favorite course resources – not limited to the syllabus genre.
► Information Extraction
    Semantic search
► Personalization

Summary
► Towards a syllabus library
    Starting from search results on the web
    Classification of the search results for true syllabi
► SVM is a better choice for our syllabus classification task.
► Towards an educational on-line community around the syllabus library

Q&A
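
Supplementary sketch: the Evaluation slide names ten-fold cross validation and F1 but does not show how either is computed. The Python below is a minimal illustration under stated assumptions, not the authors' evaluation code. The function names (`k_fold_indices`, `per_class_f1`), the lowercase class labels, and the toy label lists are all hypothetical; the folds here are contiguous and unshuffled, whereas a real run would likely shuffle or stratify first.

```python
from collections import Counter

# Hypothetical label names for the four classes listed in the deck.
CLASSES = ["full", "partial", "entry", "noise"]


def k_fold_indices(n, k=10):
    """Yield (train, test) index lists for k contiguous folds.

    Fold sizes differ by at most one. No shuffling or stratification
    is done here; that is a simplifying assumption of this sketch.
    """
    indices = list(range(n))
    sizes = [n // k + (1 if i < n % k else 0) for i in range(k)]
    start = 0
    for size in sizes:
        test = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, test
        start += size


def per_class_f1(y_true, y_pred, classes=CLASSES):
    """Return {class: F1}, where F1 = 2PR / (P + R) for each class."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for truth, pred in zip(y_true, y_pred):
        if truth == pred:
            tp[truth] += 1
        else:
            fp[pred] += 1   # predicted class gains a false positive
            fn[truth] += 1  # true class gains a false negative
    scores = {}
    for c in classes:
        predicted = tp[c] + fp[c]
        actual = tp[c] + fn[c]
        precision = tp[c] / predicted if predicted else 0.0
        recall = tp[c] / actual if actual else 0.0
        total = precision + recall
        scores[c] = 2 * precision * recall / total if total else 0.0
    return scores


# Toy labels (made up for illustration, not data from the study).
toy_true = ["full", "full", "partial", "entry", "noise", "partial"]
toy_pred = ["full", "full", "full", "entry", "noise", "partial"]
scores = per_class_f1(toy_true, toy_pred)
print(scores)

# With 1020 tagged documents, each of the ten folds holds 102 test items.
folds = list(k_fold_indices(1020, k=10))
print(len(folds), len(folds[0][1]))
```

Reporting F1 per class rather than a single accuracy figure makes a weak class visible, which matches the deck's observation that the partial syllabus class is the hard one.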