Shedding Light on Dickens’ Style Using Representativeness and Distinctiveness

Carmen Klaussner, John Nerbonne and Çağrı Çöltekin

Introduction

The task of discovering stylistic elements of Charles Dickens has been addressed by Tabata [1], using Random Forest (RF) classification [2] to compare Dickens to Wilkie Collins, as well as to a larger set of contemporary authors from the 18th and 19th centuries. RF is an ensemble learning technique that averages feature importance over a large number of trees generated from the data. Here, we set ourselves the same task, but use a simple statistical measure to compute stylistic elements of “Dickens vs. Collins” and “Dickens vs. World”. In the absence of a gold standard, we evaluate heuristically, using separability in clustering.

Representative and Distinctive Elements in Authorship

To detect stylistic markers of an author such as Dickens, we propose Representativeness and Distinctiveness (RD) [4], which reveals the features that are most consistent for Dickens while also being distinctive with respect to another author.

• Representative features: used consistently, either frequently or infrequently, by the author
• Distinctive features: used differently compared to another author

[Figure: three panels titled Representativeness, Distinctiveness Case 1 and Distinctiveness Case 2]

Representativeness of a feature f for author document set D:

\[ \bar{d}^{D}_{f} = \frac{2}{|D|^{2} - |D|} \sum_{d, d' \in D,\; d \neq d'} d_f(d, d') \]

Distinctiveness for comparing feature f to outside documents:

\[ \bar{d}^{D'}_{f} = \frac{1}{|D|\,(|D_S| - |D|)} \sum_{d \in D,\; d' \notin D} d_f(d, d') \]

Combined measure, with standardization using all distance values for f:

\[ \frac{\bar{d}^{D'}_{f} - \bar{d}_{f}}{sd(d_f)} - \frac{\bar{d}^{D}_{f} - \bar{d}_{f}}{sd(d_f)} \]

Random Forest Feature Selection

Building a “forest” by combining trees into an ensemble. For each tree (for chosen n, m):

• Sample n cases at random with replacement
• At each node:
  - select m predictor variables at random
  - split on the predictor variable providing the best split (objective function)

Using the forest:

• A new input is run down all of the trees
• The result may be either an average or a weighted average:

\[ P(c \mid f) = \sum_{n} P_n(c \mid f) \]

• RF returns a variable ranking in terms of classification importance

[Figure: example tree t_n over feature(f), Dickens vs. Collins. Node 1 splits on the frequency of 'upon' (p < 0.001) and node 3 on the frequency of 'words' (p = 0.022); split thresholds shown: 0.56676, 0.49217. Terminal nodes: Node 2 (n = 23), Node 4 (n = 18), Node 5 (n = 7), each with a Dickens/Collins bar plot of P_n(class | f).]

The two methods are conceptually similar in that distinguishing features are chosen by considering the document space as a set of smaller comparisons.

Evaluation

We test whether the selected features can separate authors by clustering on the test set using representative and distinctive terms (i.e. Case 1 of Distinctiveness). We evaluate clustering using the Adjusted Rand Index [3], where 0 is the expected (chance) value and 1 is perfect overlap with a (gold) standard.

Table 1: Clustering using discriminatory terms on test sets based on 5-fold cross-validation. Panels: Dickens vs. 'World' and Dickens vs. Collins.

Test set      1     2     3     4     5
Rand Index    0.49  1     1     1     1

Test set      1     2     3     4     5
Rand Index    1     1     0.46  0.22  0.20

[Figures: MDS plots based on the 1st and 2nd test sets. Point labels, 1st test set: C_54_HS, D_37a_OEP, D_33_SB, D_37b_OT. Point labels, 2nd test set: D_40a_MHC, C50_Antonina, D_38_NN, D_40b_OCS, D_41_BR, C_56_AD, C_59_QOH, C_57_ARL, C_60_WIW. Further labels: C_52_Basil, D_36_PP, C51_RBR.]

Authors' Stylistic Elements

Dickens vs. Collins

Dickens RF   Dickens RD   Collins RF   Collins RD
very         left         first        upon
many         letter       words        though
upon         only         only         such
being        first        end          so
much         discovered   left         only
and          later        moment       being
so           but          room         but
with         produced     last         much
a            advice       letter       many
such         wait         to           answer
indeed       upon         enough       very
air          though       back         and
off          words        answer       left
but          future       leave        to
would        news         still        first

Dickens vs. World

Dickens RF   Dickens RD   World RF     World RD
eyes         till         lady         head
hands        head         poor         shaking
again        being        less         corner
are          down         of           returned
these        window       things       legs
under        given        leave        streets
right        return       love         iron
yes          determined   not          lying
up           only         from         until
sir          promised     should       dust
child        without      can          air
looked       future       last         lighted
together     hitherto     saw          hat
here         conduct      now          stopping
back         nor          next         heavily

References

[1] Tomoji Tabata. Approaching Dickens’ Style through Random Forests. In Proceedings of the Digital Humanities 2012, DH2012, 2012.
[2] Leo Breiman. Random Forests. Machine Learning, 45(1):5–32, 2001.
[3] Lawrence Hubert and Phipps Arabie. Comparing partitions. Journal of Classification, 2(1):193–218, December 1985.
[4] Jelena Prokić, Çağrı Çöltekin, and John Nerbonne. Detecting shibboleths. In Proceedings of the EACL 2012 Joint Workshop of LINGVIS & UNCLH, pages 72–80. Association for Computational Linguistics, 2012.
[5] John Nerbonne, Rinke Colen, Charlotte Gooskens, Peter Kleiweg, and Therese Leinonen. Gabmap: a web application for dialectology. Dialectologia: revista electrònica, pages 65–89, 2011.
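As a rough illustration of the combined representativeness and distinctiveness measure, the following is a minimal sketch and not the authors' implementation. It rests on several assumptions not stated in the poster: the per-feature distance d_f is taken to be the absolute difference in relative frequency between two documents, the within-author sum counts each unordered document pair once, standardization uses the population standard deviation over all document pairs, and the function name `rd_scores` and the toy frequencies are hypothetical.

```python
from itertools import combinations
from statistics import mean, pstdev


def rd_scores(freqs, labels, target):
    """Return one combined RD score per feature for the author `target`.

    freqs  : list of per-document feature vectors (relative frequencies)
    labels : author label for each document
    target : author whose document set D is scored (needs >= 2 documents)
    """
    docs = range(len(freqs))
    inside = [i for i in docs if labels[i] == target]    # document set D
    outside = [i for i in docs if labels[i] != target]   # documents not in D
    scores = []
    for f in range(len(freqs[0])):
        def dist(i, j):
            # Assumed d_f: absolute difference in relative frequency.
            return abs(freqs[i][f] - freqs[j][f])

        # Representativeness: mean distance over unordered pairs inside D.
        within = mean(dist(i, j) for i, j in combinations(inside, 2))
        # Distinctiveness: mean distance between D and outside documents.
        between = mean(dist(i, j) for i in inside for j in outside)
        # Standardize with all pairwise distance values for this feature.
        all_d = [dist(i, j) for i, j in combinations(docs, 2)]
        mu, sd = mean(all_d), pstdev(all_d)
        scores.append(0.0 if sd == 0 else (between - mu) / sd - (within - mu) / sd)
    return scores


# Toy data (made up): two documents per author, two features.
freqs = [[0.9, 0.5], [0.8, 0.1], [0.2, 0.4], [0.1, 0.6]]
labels = ["Dickens", "Dickens", "Collins", "Collins"]
scores = rd_scores(freqs, labels, "Dickens")
# Features ordered from most to least characteristic of the target author.
ranking = sorted(range(len(scores)), key=lambda f: scores[f], reverse=True)
```

In this toy setting the first feature is used consistently within the target author's documents and differently outside them, so it ranks above the second feature, which varies more within the author than across authors. A high score therefore combines a small within-author distance with a large between-author distance.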