Vote/Veto Classification, Ensemble Clustering and Sequence Classification for Author Identification

Roman Kern1,2, Stefan Klampfl2, Mario Zechner2
1 Knowledge Management Institute - Graz University of Technology
2 Know-Center
[email protected] {rkern, sklampfl, mzechner}@know-center.at
PAN Workshop @ CLEF 2012 / 2012-09-20

Authorship Attribution - Approach

Vote/Veto Classification
- Same as last year ⇒ Compare data-sets
- Three different sets of feature spaces ⇒ Compare influence of unigram features vs. stylometric features

Authorship Attribution - Classification

Classification Algorithm
- Combine feature-spaces via individual base classifiers
- Based on performance in training phase
- In classification phase combine results

Base Feature Spaces
- Basic statistics, token statistics, grammar statistics
- Stop-word terms, slang terms, pronoun terms
- Intro-outro terms, bigram terms, unigram terms, terms

Feature Space Combinations
- Terms
- Stylometric
- Statistics

Authorship Attribution - Data-Sets

Basic Statistics
#  PAN 2011                  PAN 2012
1  Paragraph to lines ratio  Number of characters
2  Text to lines ratio       Number of words
3  Number of lines           Number of lines
4  Empty lines ratio         Number of stop-words
5  Number of paragraphs      Number of tokens

Token Statistics
#  PAN 2011                              PAN 2012
1  Likelihood of proper nouns            Number of tokens
2  Number of tokens                      Likelihood of proper nouns
3  Average token length                  Average verb length
4  Likelihood of infrequent word groups  Average token length
5  Likelihood of tokens of length 9      Likelihood of pronouns

Authorship Attribution - Feature Types

[Figure: Comparison of configurations. Categories: Terms, Statistics, Stylometric; value axis from 0 to 70.]

Authorship Clustering - Approach

Ensemble Clustering
- Multi-tier clustering
- Combine output of base clusters
- Only use stylometric features

Ensemble clustering is also known as consensus clustering or clustering aggregation.

Authorship Clustering - Features

Multiple feature spaces
- Basic statistics (same as for authorship attribution)
- Stylometric features (hapax-legomena, hapax-dislegomena, yules-k, simpsons-d, brunets-w, sichels-s, honores-h, ...)
- Stem-suffixes, stop-words, pronouns
- Character 1-grams, 2-grams, 3-grams
⇒ Total of 7 feature spaces

Authorship Clustering - Clustering

Base clustering
- k-means clustering
- k-means++ seed selection
- Different relatedness measures for different feature spaces
  - Cosine similarity
  - Euclidean distance (after normalising the features)

Ensemble clustering
- Create a meta-space from the individual clustering solutions
- In meta-space the distance between instances depends on the agreement of the clustering solutions
  - Give different base clusters different weight
- k-means clustering

Authorship Clustering - Evaluation

Ensemble clustering results
Feature Space          A vs B   C vs D   E vs F
1-grams                51.52%   53.98%   61.87%
2-grams                50.91%   54.46%   56.70%
3-grams                50.91%   51.33%   52.37%
Stop-Words & Pronouns  62.20%   50.72%   72.91%
Stem Suffixes          65.85%   63.01%   54.61%
Stylometry             52.74%   59.76%   64.25%
Basic Statistics       57.01%   56.87%   65.22%
Ensemble               66.10%   80.34%   78.44%

Sexual Predator Identification - Approach

Sequence classification
- Not directly classify predators
- Classify individual messages/lines in chats
- Simple features

Sexual Predator Identification - Classes

Chat message classes/labels: normal, predator, offending, reaction, post-offending

Chat #1
 1 normal
 2 predator
 3 normal
 4 normal
 5 offending
 6 reaction
 7 post-offending
 8 post-offending
 9 reaction
10 reaction

Chat #2 pre
 1 normal
 2 predator
 3 normal
 4 normal
 5 predator
 6 normal
 7 predator
 8 predator
 9 normal
post

Sexual Predator Identification - Features

Simple features
- Unigrams
- Double Metaphone
- isInitialAuthor, isLastAuthor, isMostVerboseAuthor, isFewerAuthors, hasTermFromPrevious

Classification algorithm
- Maximum entropy & beam search

Sexual Predator Identification - Training

Sexual Predator Identification - Results

Class              Count  Precision  Recall
normal             3,117  0.955      0.995
predator              29  0.3        0.103
offending             52  0          0
post-offending       216  0.871      0.847
reaction             275  0.959      0.764
Identify predators     2  0.667      1

The End

Thank you!

Open-source code: https://www.knowminer.at/svn/opensource/projects/pan2012/trunk

Corresponding Author: Roman Kern <[email protected]>
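
Illustration for the "Authorship Clustering - Clustering" slide: a minimal sketch of one way to build a meta-space in which distance reflects how much the base clusterings agree. The slides do not say how the meta-space is encoded, so the one-hot encoding, the square-root weight scaling, and the names `meta_space` and `sq_dist` are assumptions made for this example, not the authors' implementation.

```python
from typing import List, Sequence


def meta_space(labelings: Sequence[Sequence[int]],
               weights: Sequence[float]) -> List[List[float]]:
    """Build one meta-space row per instance.

    labelings[k][i] is the cluster id that base clustering k assigned to
    instance i; weights[k] is the (assumed) weight of base clustering k.
    Each clustering contributes a one-hot block scaled by sqrt(weight).
    """
    n = len(labelings[0])
    rows: List[List[float]] = [[] for _ in range(n)]
    for labels, weight in zip(labelings, weights):
        cluster_ids = sorted(set(labels))
        scale = weight ** 0.5
        for i, label in enumerate(labels):
            rows[i].extend(scale if label == c else 0.0 for c in cluster_ids)
    return rows


def sq_dist(a: Sequence[float], b: Sequence[float]) -> float:
    """Squared Euclidean distance between two meta-space rows."""
    return sum((x - y) ** 2 for x, y in zip(a, b))
```

With this encoding, the squared Euclidean distance between two instances is twice the summed weight of the base clusterings that place them in different clusters. Running k-means on the rows, as the slide describes for the ensemble step, therefore groups instances that the higher-weighted base clusterings tend to put together.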