Characterizing Stylistic Elements in Syntactic Structure
Song Feng, Ritwik Banerjee, Yejin Choi
Stony Brook University

The Beginning of Stylometry
Donation of Constantine: Emperor Constantine I supposedly transferred authority over Rome and the western part of the Roman Empire to the Pope by a decree.
Lorenzo Valla: In 1439, Lorenzo Valla proved that the decree was a forgery, based on a comparison of the Latin used in it.

15th C. ... 21st C.: CFG analysis

Outline
Related work
Sentence types
Sentence outlines
Tree topology
Beyond production rules
Experiments
Conclusion

Why use deep syntax for authorship attribution?
Rhetorical and compositional theories: deep syntactic elements
◦ Bain, 1887
◦ Kemper, 1987
◦ Strunk and White, 2008
Computational stylometric analysis and authorship attribution: shallow lexicosyntactic patterns
◦ Stamatatos et al., 2001
◦ Baayen et al., 2002
◦ Koppel and Schler, 2003

PCFG models for stylometry
Detecting distributional differences in sentence structures
◦ Raghavan et al., 2010 (authorship attribution)
◦ Sarawgi et al., 2011 (gender attribution)
◦ Wong and Dras, 2011 (native language identification)
But these models fall short of providing clues about salient styles of sentence usage.
What are the stylistic elements in sentence structures that characterize individual authors?

Sentence Type - I
◦ Loose (cumulative): the main clause comes first, followed by the supporting clauses/phrases.
“Christopher Columbus finally reached the shores of San Salvador after months of uncertainty at sea, the threat of mutiny, and a shortage of food and water.”
◦ Periodic: the supporting clauses/phrases come first, followed by the main clause.
“After months of uncertainty at sea, the threat of mutiny, and a shortage of food and water, Christopher Columbus finally reached the shores of San Salvador.”

Sentence Type Classification
Type-I classification: occurrence of main & supporting clauses
◦ Loose
◦ Periodic
Type-II classification: occurrence of independent & dependent clauses
◦ Simple
◦ Complex
◦ Compound
◦ Complex-compound

Type-II Classification
(# ICs = number of independent clauses, # DCs = number of dependent clauses)
Type | Sentence | # ICs | # DCs
Simple | Jeju is a beautiful island. | 1 | 0
Complex | Jeju is so beautiful that we decided to stay for a few more days. | 1 | ≥1
Compound | Jeju island is so beautiful and the food here is great too. | ≥2 | 0
Complex-compound | Although I want to climb Halla, I haven't had the time, and haven't found anyone to go with. | ≥2 | ≥1

Datasets
Scientific papers
◦ ACL Anthology Reference Corpus (Bird et al., 2008)
◦ 10 authors, 8 single-author papers per author
Novels
◦ 5 novelists
◦ 5 novels for each author
◦ First 3,000 sentences taken from each novel

Using parse trees to discover sentence outlines
Outline: S PP , VP

Comparing sentence outlines
Hobbs | Joshi | Lin
S S CC S . | S ADVP PP NP VP . | S SBAR NP VP .
S CC NP VP . | S PP NP ADVP VP . | FRAG NP : S .
S S VP . | S NP VP . | S NP VP .
S NP NP VP . | S S S CC S . | S PP VP .
S PP NP VP . | S ADVP NP VP . | S NP ADVP VP .

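The Type-II table and the sentence outlines above can be illustrated with a small sketch. It is not the authors' implementation. The function names `type2_class` and `sentence_outline`, the nested-tuple tree format, and the toy parse are assumptions made for illustration. The clause counts are taken as given inputs, since the slides do not say how clauses are counted from a parse. The outline is assumed to be the root label followed by the labels of its immediate children, which is what examples such as "S PP NP VP ." suggest.

```python
def type2_class(n_ic, n_dc):
    """Map counts of independent (IC) and dependent (DC) clauses to a Type-II class."""
    if n_ic < 1 or n_dc < 0:
        raise ValueError("need at least one independent clause and a non-negative DC count")
    if n_ic == 1:
        return "simple" if n_dc == 0 else "complex"
    return "compound" if n_dc == 0 else "complex-compound"


def sentence_outline(tree):
    """Return the root label followed by the labels of its immediate children.

    Assumed tree format: (label, child, ...), where a word is a plain string.
    """
    label, *children = tree
    return " ".join([label] + [c if isinstance(c, str) else c[0] for c in children])


# Hypothetical, heavily shortened parse of a periodic sentence (not parser output).
toy_tree = ("S",
            ("PP", ("IN", "After"), ("NP", ("NNS", "months"))),
            (",", ","),
            ("NP", ("NNP", "Columbus")),
            ("VP", ("VBD", "reached"), ("NP", ("NNS", "shores"))),
            (".", "."))

print(type2_class(1, 0))           # one IC, no DC
print(type2_class(2, 1))           # two ICs, one DC
print(sentence_outline(toy_tree))  # top-level structure of the toy parse
```

Counting how often each class and each outline string occurs per author would give distributions of the kind compared in the slides.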
Tree topology
“For processing free texts, hand-crafted grammars are neither practical nor reliable.”
“These algorithms cannot deal with words for which classifiers have not been learned.”

Tree topology: metrics
Leaf height
Furcation height
Level width
Horizontal imbalance
Vertical imbalance

Tree topology: leaf height
Leaf height(“texts”) = 6

Tree topology: furcation height
Furcation height(VP2) = 3

Tree topology: level width
Level width(level 3) = 8

Tree topology: imbalance
Vertical imbalance(PP) = |height(IN) − height(S2)| = |2 − 6| = 4
Horizontal imbalance(PP) = |width(IN) − width(S2)| = |1 − 3| = 2

Tree topology metrics: novelists
Metric | Charles Dickens | Edward Bulwer-Lytton | Jane Austen | Thomas Hardy | Walter Scott
Sentence length | 24.1 | 26.7 | 31.4 | 21.5 | 34.1
Leaf height | 4.7 | 5.0 | 5.4 | 4.9 | 5.9
Furcation height | 1.9 | 1.9 | 2.1 | 1.9 | 2.1
Level width | 4.1 | 4.4 | 4.7 | 3.8 | 4.9
Horizontal imbalance | 1.1 | 1.1 | 1.3 | 1.2 | 1.4
Vertical imbalance | 1.0 | 1.1 | 1.2 | 1.0 | 1.4

PCFG: production rules
Pr: VP1 → VBG NP1

Beyond PCFG production rules
Pr^: VP1 ^ S2 → VBG NP1
Pr*: NNS1 → “texts”
Pr^*: NNS1 ^ NP1 → “texts”
Syn↑: VP1 S PP
Syn↓: VP1 VBG , VP1 NP1

Experiments
SVM classifier (LIBLINEAR)
5-fold cross validation
◦ 80% training, 20% testing
◦ 20% training, 80% testing: sufficient training data may not be available in practical scenarios, e.g., forensics (Luyckx and Daelemans, 2008)
Features
◦ PCFG rule-based
◦ STYLE11: 6 parameters from the distribution of sentence types, plus 5 topological metrics
STYLE11 sentence-type parameters: % of sentences that are
1. Simple
2. Complex
3. Compound
4. Complex-compound
5. Loose
6. Periodic
STYLE11 topological metrics
1. Leaf height
2. Furcation height
3. Level width
4. Horizontal imbalance
5. Vertical imbalance

Experimental results
Scientific Papers: 20% training data
[Bar chart: parse-tree features vs. parse-tree + STYLE11 features, for unigrams, pr^*, syn↕*, syn*v+h]
Best unlexicalized feature (pr^): 60.6%
Novels: 20% training data
[Bar chart: parse-tree features vs. parse-tree + STYLE11 features, for unigrams, pr^*, syn↕*, syn*v+h]
Best unlexicalized feature (synv+h): 73.2%

Unlexicalized features across domains
[Bar chart, "Training vs. performance: unlexicalized features": trained on 20% data vs. trained on 80% data, for Scientific Papers and Novels; chart annotations: 17.0%, 32.9%]

Conclusions
Analyzed writing styles with an interpretable characterization of stylistic elements.
Even without lexical elements, features derived from sentence structures alone can predict authorship with high accuracy.
Using topological features of parse trees in conjunction with features derived from production rules provides the best results in authorship attribution.
Even with little training data, our techniques provide reasonably good performance.
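
The production-rule variants (Pr, Pr^, Pr*) and two of the tree-topology metrics from the talk can be sketched in a few lines. This is a minimal illustration, not the authors' code. The nested-tuple tree format, the toy parse, and the function names are hypothetical. The topology conventions are assumptions, since the slides give only worked values: leaf depth is counted in edges from the root, and words are counted as nodes when measuring level width.

```python
from collections import Counter

# Assumed format: a parse tree is (label, child, ...); a word is a plain string.
# TREE is a toy example written by hand, not the output of a parser.
TREE = ("S",
        ("NP", ("DT", "These"), ("NNS", "algorithms")),
        ("VP", ("MD", "can"), ("VP", ("VB", "fail"))),
        (".", "."))


def production_rules(node, parent=None, lexicalized=False, annotate=False):
    """Yield rules as 'LHS -> RHS' strings.

    lexicalized=True also emits POS -> word rules (the Pr* variant);
    annotate=True writes the LHS as 'label^parent' (the Pr^ variant).
    """
    is_lexical = len(node) == 2 and isinstance(node[1], str)
    if is_lexical and not lexicalized:
        return  # skip POS -> word rules in the unlexicalized variants
    label = node[0]
    lhs = f"{label}^{parent}" if annotate and parent is not None else label
    rhs = " ".join(c if isinstance(c, str) else c[0] for c in node[1:])
    yield f"{lhs} -> {rhs}"
    for child in node[1:]:
        if not isinstance(child, str):
            yield from production_rules(child, label, lexicalized, annotate)


def leaf_depths(node, depth=0):
    """Depth of each word in edges from the root (assumed convention)."""
    if isinstance(node, str):
        return [depth]
    return [d for child in node[1:] for d in leaf_depths(child, depth + 1)]


def level_widths(node, depth=0, widths=None):
    """Count the nodes at each depth, words included (assumed convention)."""
    widths = Counter() if widths is None else widths
    widths[depth] += 1
    if not isinstance(node, str):
        for child in node[1:]:
            level_widths(child, depth + 1, widths)
    return widths


print(list(production_rules(TREE)))                 # Pr
print(list(production_rules(TREE, annotate=True)))  # Pr^
print(Counter(production_rules(TREE, lexicalized=True)).most_common(3))  # Pr* counts
print(leaf_depths(TREE), dict(level_widths(TREE)))
```

Rule counts of this kind could serve as sparse features for a linear classifier, and per-sentence averages of the topology values as dense features. How the original system normalizes or combines them is not stated in the slides.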