Automatic Syllabus Classification
JCDL – Vancouver – 22 June 2007
Edward A. Fox (presenting co-author),
Xiaoyan Yu, Manas Tungare, Weiguo Fan,
Manuel Perez-Quinones, William Cameron,
GuoFang Teng, and Lillian (“Boots”) Cassel
Why Study the Syllabus Genre?
► Educational resource
► Importance to the educational community
 Educators
 Students
 Self-learners
► Thanks to NSF DUE grant 5328255 (personalization support for NSDL)
Where to look for a specific syllabus?
► Non-standard publishing mechanisms:
 Instructor’s website
 CMSs (courseware management systems, e.g., Sakai)
 Catalogs
► Limited access outside the university
► Search on the Web
 Many non-relevant links in search results
Syllabus Library
► Bootstrapping
 Identify true syllabi from search results
 Store in a repository
 Develop tools & applications
► Scaling up
 Encourage contributions from educational communities
An Essential Step towards Syllabus Library: Classification
► Classification Objects:
 Potential syllabi in Computer Science: search on the Web, using syllabus keywords, only in the educational domains
► Class Definition
► Feature Selection
► Model Selection
► Training and Testing
Four Classes
Class distribution on 1020 manually tagged documents:
► Full Syllabus: 49%
► Partial Syllabus: 20%
► Entry Page: 13%
► Noise: 18%
Syllabus Components
► course code
► title
► class time & location
► offering institution
► teaching staff
► course description
► objectives
► web site
► prerequisite
► textbook
► grading policy
► schedule
► assignment
► exam and resources
Features
► 84 Genre-specific Features
 the occurrences of keywords
 the positions of keywords, and
 the co-occurrences of keywords and links
► A series of keywords for each syllabus component
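The three feature families above could be computed roughly as follows. This is a minimal sketch, not the authors' extractor: the keyword lists, the function name `genre_features`, and the 50-character window used to decide that a keyword and a link co-occur are all illustrative assumptions, and the sketch yields only a few of the 84 features.

```python
import re

# Hypothetical keyword lists, one per syllabus component (not the authors' lists).
COMPONENT_KEYWORDS = {
    "prerequisite": ["prerequisite"],
    "textbook": ["textbook", "required reading"],
    "grading": ["grading", "grade"],
}

# Start of an HTML anchor tag that carries an href attribute.
LINK_RE = re.compile(r"<a\s[^>]*href")


def genre_features(html, window=50):
    """Return occurrence, position, and keyword/link co-occurrence features."""
    text = html.lower()
    link_positions = [m.start() for m in LINK_RE.finditer(text)]
    features = {}
    for component, keywords in COMPONENT_KEYWORDS.items():
        hits = [m.start()
                for kw in keywords
                for m in re.finditer(re.escape(kw), text)]
        # Occurrence: how often any keyword of this component appears.
        features[component + "_count"] = len(hits)
        # Position: relative offset of the first hit (1.0 when absent).
        features[component + "_first_pos"] = min(hits) / len(text) if hits else 1.0
        # Co-occurrence: some link starts within `window` characters of a hit.
        features[component + "_near_link"] = int(
            any(abs(h - p) <= window for h in hits for p in link_positions)
        )
    return features
```

For example, calling `genre_features` on a page whose "Textbook:" label is followed directly by a link sets both `textbook_count` and `textbook_near_link` to 1.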
Classification Models
► Discriminative Models
 Support Vector Machines (SVM)
 SMO-L: Sequential Minimal Optimization, accelerating the training process of SVM
 SMO-P: SMO with a polynomial kernel
► Generative Models
 Naïve Bayes (NB)
 NB-K: Applying kernel methods to estimate the distribution of numeric attributes in NB modeling
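The difference between NB and NB-K is how the class-conditional density of a numeric attribute is estimated. The sketch below contrasts the two under stated assumptions: Gaussian kernels, a hand-picked bandwidth, and made-up attribute values. The slides do not specify the estimator's settings, so none of these choices should be read as the authors' configuration.

```python
import math


def gaussian_pdf(x, mean, std):
    """Density of a normal distribution with the given mean and std at x."""
    z = (x - mean) / std
    return math.exp(-0.5 * z * z) / (std * math.sqrt(2 * math.pi))


def single_gaussian_density(x, values):
    """Plain NB: fit one Gaussian to all training values of the attribute."""
    mean = sum(values) / len(values)
    var = sum((v - mean) ** 2 for v in values) / (len(values) - 1)
    return gaussian_pdf(x, mean, math.sqrt(var))


def kernel_density(x, values, bandwidth):
    """NB-K style: average one Gaussian kernel centered on each training value."""
    return sum(gaussian_pdf(x, v, bandwidth) for v in values) / len(values)


# Made-up keyword counts for one class, clustered around 1 and around 9.
counts = [0, 1, 1, 2, 8, 9, 9, 10]
```

With these bimodal values, the single Gaussian puts its peak at the mean (5), where no training value lies, while the kernel estimate places its mass near the two clusters. Whether that extra flexibility helps depends on the data.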
Evaluation
► Training corpus: 1020 out of the 8000+ potential syllabi
► All in HTML, PDF, PostScript, or Text
► Manual tagging on the training corpus
 Unanimous agreement by three co-authors
► Evaluation strategy: ten-fold cross validation
► Metrics: F1 (an overall measure of classification performance)
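For reference, F1 is the harmonic mean of precision and recall, computed here per class, and ten-fold cross validation tests on each tenth of the corpus in turn while training on the rest. A minimal sketch follows; the function names and the interleaved (unshuffled) fold assignment are assumptions made for illustration, not the authors' setup.

```python
def f1_score(true_labels, predicted_labels, target):
    """Per-class F1: harmonic mean of precision and recall for `target`."""
    pairs = list(zip(true_labels, predicted_labels))
    tp = sum(t == target and p == target for t, p in pairs)
    fp = sum(t != target and p == target for t, p in pairs)
    fn = sum(t == target and p != target for t, p in pairs)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def k_fold_indices(n_items, k=10):
    """Yield (train_indices, test_indices) pairs, one per fold."""
    # Interleaved assignment: item i goes to fold i % k.
    folds = [list(range(i, n_items, k)) for i in range(k)]
    for held_out in range(k):
        train = [idx
                 for fold_no, fold in enumerate(folds) if fold_no != held_out
                 for idx in fold]
        yield train, folds[held_out]
```

On a corpus of 1020 documents, each of the ten folds holds out 102 documents for testing and trains on the remaining 918.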
Results with random set
Best items are in purple boxes.
Acctr: Classification accuracy on the training set.
Results (Cont’d)
► On average, SVM outperforms NB on our syllabus classification task.
► All classifiers fail to identify the partial syllabus class.
► The kernel settings for NB are not helpful in the syllabus classification task.
► Classification accuracy on training data is not that good.
Future Work
► Feature selection
 Add general feature selection methods on text classification
 e.g., Document Frequency, Information Gain, and Mutual Information
 Hybrid: combine our genre-specific features with the general features
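Two of the general feature selection scores named above can be sketched as follows. Documents are assumed to be sets of tokens and the scoring is for a single term at a time. The tiny corpus in the usage note is made up, and the function names are illustrative.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((c / total) * math.log2(c / total)
                for c in Counter(labels).values())


def document_frequency(term, docs):
    """Number of documents that contain the term."""
    return sum(term in doc for doc in docs)


def information_gain(term, docs, labels):
    """Reduction in class entropy from knowing whether the term is present."""
    with_term = [lab for doc, lab in zip(docs, labels) if term in doc]
    without_term = [lab for doc, lab in zip(docs, labels) if term not in doc]
    n = len(labels)
    remainder = sum(len(part) / n * entropy(part)
                    for part in (with_term, without_term) if part)
    return entropy(labels) - remainder
```

A term that appears in most documents of every class scores a high document frequency but a low information gain, which is why the two scores are usually considered together when ranking candidate terms.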
Future Work (Cont’d)
► Syllabus Library
 Welcome to http://doc.cs.vt.edu
 Share your favorite course resources – not limited to the syllabus genre.
► Information Extraction
 Semantic search
► Personalization
Summary
► Towards a syllabus library
 Starting from search results on the web
 Classification of the search results for true syllabi
► SVM is a better choice for our syllabus classification task.
► Towards an educational on-line community around the syllabus library
Q&A