poster

Shedding Light on Dickens’ Style Using Representativeness
and Distinctiveness
Carmen Klaussner, John Nerbonne and Çağrı Çöltekin
Introduction
The task of discovering stylistic elements of Charles Dickens has been addressed by Tabata [1], who used Random Forest (RF) classification [2] to compare Dickens to Wilkie Collins, as well as to a larger set of contemporary authors from the 18th and 19th centuries. RF is an ensemble learning technique that averages feature importance over a large number of trees generated from the data. Here, we set ourselves the same task, but use a simple statistical measure to compute stylistic elements of “Dickens vs. Collins” and “Dickens vs. World”. In the absence of a gold standard, we evaluate heuristically, using separability in clustering.
Representative and Distinctive Elements in Authorship
[Figure: schematic of Representativeness, Distinctiveness (Case 1) and Distinctiveness (Case 2)]
To detect stylistic markers of an author such as Dickens, we propose Representativeness and Distinctiveness (RD) [4], which reveals the features that Dickens uses most consistently and that are also distinctive with respect to another author.
• Representative features: used consistently by the author, whether frequently or infrequently
• Distinctive features: used differently compared to another author
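REPRESENTATIVENESS of a feature f for author document set D:
\[ \bar{d}^{D}_{f} = \frac{2}{|D|^{2} - |D|} \sum_{d, d' \in D,\; d \neq d'} d_f(d, d') \]
DISTINCTIVENESS for comparing feature f to outside documents:
\[ \bar{d}^{D'}_{f} = \frac{1}{|D|\,(|DS| - |D|)} \sum_{d \in D,\; d' \notin D} d_f(d, d') \]
Combined measure, with standardization using all distance values for f (mean \bar{d}_f and standard deviation sd(d_f)):
\[ \frac{\bar{d}^{D'}_{f} - \bar{d}_{f}}{sd(d_f)} - \frac{\bar{d}^{D}_{f} - \bar{d}_{f}}{sd(d_f)} \]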
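The three measures above can be illustrated with a short sketch. It rests on assumptions that are not stated on the poster: each document is reduced to the relative frequency of feature f, d_f(d, d') is taken to be the absolute difference of those frequencies, DS is read as the full document set, and sd is the population standard deviation. The function name `rd_score` and the example frequencies are hypothetical.

```python
from itertools import combinations
from statistics import mean, pstdev


def rd_score(inside, outside):
    """Return (representativeness, distinctiveness, combined score) for one feature.

    inside  -- relative frequencies of the feature in the author's documents (D),
               at least two values
    outside -- relative frequencies of the feature in all other documents
    """
    def dist(a, b):
        # Assumed per-feature distance d_f; the poster does not define it.
        return abs(a - b)

    # Mean distance over unordered pairs inside D: 2 / (|D|^2 - |D|) * sum.
    within = mean(dist(a, b) for a, b in combinations(inside, 2))
    # Mean distance from documents in D to documents outside D.
    between = mean(dist(a, b) for a in inside for b in outside)
    # Standardize with the mean and sd of all pairwise distances for the feature.
    all_dists = [dist(a, b) for a, b in combinations(inside + outside, 2)]
    mu, sd = mean(all_dists), pstdev(all_dists)
    if sd == 0:
        return within, between, 0.0
    combined = (between - mu) / sd - (within - mu) / sd
    return within, between, combined


# Hypothetical frequencies: the author uses the word consistently and often.
rep, dis, score = rd_score([0.9, 1.0, 1.1], [0.1, 0.2, 0.3])
print(rep, dis, score)
```

A low within-author mean combined with a high mean distance to outside documents gives a large positive score, which is the pattern expected of a marker that is both representative and distinctive.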
Evaluation
We test whether the selected features can separate authors by clustering on the test set using representative and distinctive terms (i.e. Case 1 of Distinctiveness). We evaluate clustering using the Adjusted Rand Index [3], where 0 is the expected (chance) value and 1 indicates perfect overlap with a (gold) standard.
Table 1: Clustering using discriminatory terms on test sets based on 5-fold cross-validation.
DICKENS VS. COLLINS
Test set      1     2     3     4     5
Rand Index    1     1     0.46  0.22  0.20
DICKENS VS. 'WORLD'
Test set      1     2     3     4     5
Rand Index    0.49  1     1     1     1
[Figure: MDS plot based on 1st test set; document labels shown: C_54_HS, D_37a_OEP, D_33_SB, D_37b_OT]
Stylistic Elements
The two methods are conceptually similar in that distinguishing features are chosen by considering the document space as a set of smaller comparisons.
References
[1] Tomoji Tabata. Approaching Dickens’ Style through Random Forests. In Proceedings of Digital Humanities 2012 (DH2012), 2012.
[2] Leo Breiman. Random Forests. Machine Learning, 45(1):5–32, 2001.
[3] Lawrence Hubert and Phipps Arabie. Comparing partitions. Journal of Classification, 2(1):193–218, December 1985.
[4] Jelena Prokić, Çağrı Çöltekin, and John Nerbonne. Detecting shibboleths. In Proceedings of the EACL 2012 Joint Workshop of LINGVIS & UNCLH, pages 72–80. Association for Computational Linguistics, 2012.
[5] John Nerbonne, Rinke Colen, Charlotte Gooskens, Peter Kleiweg, and Therese Leinonen. Gabmap: a web application for dialectology. Dialectologia: revista electrònica, pages 65–89, 2011.
Random Forest Feature Selection
Building a “forest” by combining trees into an ensemble:
For each tree (for chosen n, m):
• Sample n cases at random with replacement
– At each node:
- select m predictor variables at random
- split on the predictor variable providing the best split (objective function)
• new input is run down all of the trees
• result may be either an average or a weighted average
• returns variable ranking in terms of classification importance
Each tree t_n evaluates feature(f) and yields P_n(class | f); the ensemble combines the trees:
\[ P(c \mid f) = \sum_{1}^{n} P_n(c \mid f) \]
[Figure: example classification tree for Dickens vs. Collins. Node 1 splits on the frequency of 'upon' (p < 0.001) at 0.56676; node 3 splits on the frequency of 'words' (p = 0.022) at 0.49217. Terminal nodes: Node 2 (n = 23), Node 4 (n = 18), Node 5 (n = 7), each showing the proportion of documents by author (Dickens, Collins).]
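The Random Forest steps listed above (bootstrap sampling, m randomly chosen predictors per node, best split, averaging over trees) can be sketched in plain Python. This is only an illustration under stated assumptions: Gini impurity stands in for the unspecified objective function, n is set to the number of training documents, trees are limited to a small depth, the trees are averaged without weights, and the variable importance ranking is not covered. All function names and the toy frequencies for two features are hypothetical.

```python
import random
from collections import Counter


def gini(labels):
    """Gini impurity of a list of class labels (assumed objective function)."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())


def best_split(rows, labels, features):
    """Return the (feature, threshold) that most reduces weighted Gini, or None."""
    best, best_score = None, gini(labels)
    for f in features:
        for t in sorted({r[f] for r in rows}):
            left = [y for r, y in zip(rows, labels) if r[f] <= t]
            right = [y for r, y in zip(rows, labels) if r[f] > t]
            if not left or not right:
                continue
            score = (len(left) * gini(left) + len(right) * gini(right)) / len(labels)
            if score < best_score:
                best, best_score = (f, t), score
    return best


def grow_tree(rows, labels, m, rng, depth=0, max_depth=3):
    """Grow one tree; at each node only m randomly chosen predictors are tried."""
    features = rng.sample(range(len(rows[0])), m)
    split = best_split(rows, labels, features) if depth < max_depth else None
    if split is None:
        # Leaf: class proportions, playing the role of P_n(class | f).
        return {c: k / len(labels) for c, k in Counter(labels).items()}
    f, t = split
    left = [i for i, r in enumerate(rows) if r[f] <= t]
    right = [i for i, r in enumerate(rows) if r[f] > t]
    return (f, t,
            grow_tree([rows[i] for i in left], [labels[i] for i in left],
                      m, rng, depth + 1, max_depth),
            grow_tree([rows[i] for i in right], [labels[i] for i in right],
                      m, rng, depth + 1, max_depth))


def tree_predict(tree, row):
    """Run one input down a tree and return the class proportions at its leaf."""
    while isinstance(tree, tuple):
        f, t, left, right = tree
        tree = left if row[f] <= t else right
    return tree


def grow_forest(rows, labels, n_trees=25, m=1, seed=0):
    """Grow n_trees trees, each on a bootstrap sample drawn with replacement."""
    rng = random.Random(seed)
    forest = []
    for _ in range(n_trees):
        idx = [rng.randrange(len(rows)) for _ in rows]
        forest.append(grow_tree([rows[i] for i in idx],
                                [labels[i] for i in idx], m, rng))
    return forest


def forest_predict(forest, row):
    """Run the input down all trees and average their class proportions."""
    total = Counter()
    for tree in forest:
        for c, p in tree_predict(tree, row).items():
            total[c] += p / len(forest)
    return dict(total)


# Toy data (invented): relative frequencies of two words in six documents.
rows = [[0.9, 0.8], [0.8, 0.9], [0.7, 0.7], [0.2, 0.1], [0.1, 0.3], [0.3, 0.2]]
labels = ["Dickens"] * 3 + ["Collins"] * 3
forest = grow_forest(rows, labels, n_trees=25, m=1, seed=0)
print(forest_predict(forest, [0.95, 0.9]))
```

Averaging the leaf proportions keeps the ensemble output on a probability scale; a weighted average would only change the per-tree factor in `forest_predict`.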
[Figure: terms selected by RF and RD for DICKENS and COLLINS. Terms shown: very, many, upon, being, much, and, so, with, a, such, indeed, air, off, but, would, left, first, letter, words, though, only, end, discovered, later, moment, room, produced, last, advice, wait, to, answer, enough, back, future, leave, news, still]
[Figure: Dickens vs. WORLD, terms selected by RF and RD. Terms shown: eyes, hands, again, are, these, under, right, yes, up, sir, child, looked, together, here, back, till, lady, head, poor, shaking, being, less, corner, down, of, returned, window, things, legs, given, leave, streets, return, love, iron, determined, not, lying, only, from, until, promised, should, dust, without, can, air, future, last, lighted, hitherto, saw, hat, conduct, now, stopping, nor, next, heavily]
[MDS plot document labels: C_52_Basil, D_36_PP, C51_RBR]
[Figure: MDS plot based on 2nd test set; document labels shown: D_40a_MHC, C50_Antonina, D_38_NN, D_40b_OCS, D_41_BR, C_56_AD, C_59_QOH, C_57_ARL, C_60_WIW]
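The Adjusted Rand Index [3] used to score the clusterings can be computed directly from two labelings of the same documents. The sketch below follows the Hubert and Arabie pair-counting formula; the function name and the example labelings are hypothetical and do not reproduce the reported test sets.

```python
from collections import Counter
from math import comb


def adjusted_rand_index(truth, pred):
    """Adjusted Rand Index between two labelings of the same items."""
    n = len(truth)
    # Pairs of items that share a cell of the contingency table.
    sum_cells = sum(comb(c, 2) for c in Counter(zip(truth, pred)).values())
    # Pairs that share a class in each labeling separately.
    sum_truth = sum(comb(c, 2) for c in Counter(truth).values())
    sum_pred = sum(comb(c, 2) for c in Counter(pred).values())
    expected = sum_truth * sum_pred / comb(n, 2)   # agreement expected by chance
    maximum = (sum_truth + sum_pred) / 2           # agreement of a perfect match
    if maximum == expected:
        return 1.0  # degenerate case, e.g. both labelings put everything together
    return (sum_cells - expected) / (maximum - expected)


# Hypothetical example: author labels against cluster assignments.
authors = ["D", "D", "D", "C", "C", "C"]
clusters = [0, 0, 1, 1, 1, 1]
print(adjusted_rand_index(authors, clusters))
```

Because the index only counts pairs, the cluster numbers need not match the author names: any relabeling of a perfect clustering still scores 1, while a chance-level clustering scores close to 0.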