Vote/Veto Classication, Ensemble Clustering and Sequence

Vote/Veto Classification, Ensemble Clustering and
Sequence Classification for Author Identification
Roman Kern1,2
Stefan Klampfl2 Mario Zechner2
1
Knowledge Management Institute - Graz University of Technology
2 Know-Center
[email protected]
{rkern, sklampfl, mzechner}@know-center.at
PAN Workshop @ CLEF 2012 / 2012-09-20
Graz University of Technology
Authorship Attribution - Approach
Graz University of Technology
Vote/Veto Classification
I
Same as last year
⇒ Compare data-sets
I
Three different feature-set sets
⇒ Compare influence of uni-grams features vs. stylometric
features
2 / 15
Authorship Attribution - Classification
Graz University of Technology
Classification Algorithm
I
Combine feature-spaces via individual base classifiers
I
Based on performance in training phase
I
In classification phase combine results
Base Feature Spaces
I
Basic statistics, token statistics, grammar statistics
I
Stop-word terms, slang terms, pronoun terms
I
Intro-outro terms, bigram terms, unigram terms, terms
Feature Space Combinations
I
Terms
I
Stylometric
I
Statistics
3 / 15
Authorship Attribution - Data-Sets
Graz University of Technology
Basic Statistics
1
2
3
4
5
PAN 2011
PAN 2012
Paragraph to lines ratio
Text to lines ratio
Number of lines
Empty lines ratio
Number of paragraphs
Number of characters
Number of words
Number of lines
Number of stop-words
Number of tokens
Token Statistics
1
2
3
4
5
PAN 2011
PAN 2012
Likelihood of proper nouns
Number of tokens
Average token length
Likelihood of infrequent word groups
Likelihood of tokens of length 9
Number of tokens
Likelihood of proper nouns
Average verb length
Average token length
Likelihood of pronouns
4 / 15
Authorship Attribution - Feature Types
Graz University of Technology
Comparison of configurations
70
Terms
Statistics
Stylometric
60
50
40
30
20
10
0
Terms
Statistics
Stylometric
5 / 15
Authorship Clustering - Approach
Graz University of Technology
Ensemble Clustering
I
Multi-tier clustering
I
Combine output of base clusters
I
Only use stylometric features
Ensemble clustering is also known as consensus clustering or clustering aggregation
6 / 15
Authorship Clustering - Features
Graz University of Technology
Multiple feature spaces
I
Basic statistics (same as for authorship attribution)
I
Stylometric features (hapax-legomena, hapax-dislegomena,
yules-k, simpsons-d, brunets-w, sichels-s, honores-h, ...)
I
Stem-suffixes, stop-words, pronouns
I
Character 1-grams, 2-grams, 3-grams
⇒ Total of 7 feature spaces
7 / 15
Authorship Clustering - Clustering
Graz University of Technology
Base clustering
I
k-means clustering
I
k-means++ seed selection
Different relatedness measures for different feature spaces
I
I
I
Cosine similarity
Euclidean distance (after normalising the features)
Ensemble clustering
I
I
Create a meta-space from the individual clustering solution
In meta-space the distance between instances depends on the
agreement of the clustering solutions
I
I
Give different base clusters different weight
k-means clustering
8 / 15
Authorship Clustering - Evaluation
Graz University of Technology
Ensemble clustering results
Feature Space
A vs B
C vs D
E vs F
1-grams
2-grams
3-grams
Stop-Words & Pronouns
Stem Suffices
Stylometry
Basic Statistics
Ensemble
51.52%
50.91%
50.91%
62.20%
65.85%
52.74%
57.01%
66.10%
53.98%
54.46%
51.33%
50.72%
63.01%
59.76%
56.87%
80.34%
61.87%
56.70%
52.37%
72.91%
54.61%
64.25%
65.22%
78.44%
9 / 15
Sexual Predator Identification - Approach
Graz University of Technology
Sequence classification
I
Not directly classify predators
I
Classify individual messages/line in chats
I
Simple features
10 / 15
Sexual Predator Identification - Classes
Graz University of Technology
Chat message classes/labels
normal, predator; offending; reaction, post-offending
Chat #1
1 normal
2 predator
3 normal
4 normal
5 offending
6 reaction
7 post-offending
8 post-offending
9 reaction
10 reaction
Chat #2
pre
1 normal
2 predator
3 normal
4 normal
5 predator
6 normal
7 predator
8 predator
9 normal
post
I
11 / 15
Sexual Predator Identification - Features
Graz University of Technology
Simple features
I
Unigrams
I
Double Metaphone
I
isInitialAuthor, isLastAuthor, isMostVerboseAuthor,
isFewerAuthors, hasTermFromPrevious
Classification algorithm
I
Maximum entropy & beam search
12 / 15
Sexual Predator Identification - Training
Graz University of Technology
13 / 15
Sexual Predator Identification - Results
Class
normal
predator
offending
post-offending
reaction
Identify predators
Graz University of Technology
Count
Precision
Recall
3,117
29
52
216
275
0.955
0.3
0
0.871
0.959
0.995
0.103
0
0.847
0.764
2
0.667
1
14 / 15
The End
Graz University of Technology
Thank you!
Open-source code
https://www.knowminer.at/svn/
opensource/projects/pan2012/trunk
Corresponding Author
Roman Kern <[email protected]>
15 / 15