Characterizing Stylistic Elements in Syntactic Structure

Song Feng, Ritwik Banerjee, Yejin Choi
Stony Brook University
The Beginning of Stylometry
Donation of Constantine
Emperor Constantine I supposedly transferred
authority over Rome and the western part of
the Roman Empire to the Pope by a decree.
In 1439, Lorenzo Valla proved that the decree was
a forgery by analyzing its Latin, which did not
match the Latin of Constantine's time.
From the 15th C. to the 21st C.: CFG Analysis
Outline
◦ Related work
◦ Sentence types
◦ Sentence outlines
◦ Tree topology
◦ Beyond production rules
◦ Experiments
◦ Conclusion

Why use deep syntax for authorship attribution?

Rhetorical and compositional theories: deep syntactic elements
◦ Bain, 1887
◦ Kemper, 1987
◦ Strunk and White, 2008

Computational stylometric analysis and authorship attribution: shallow lexicosyntactic patterns
◦ Stamatatos et al., 2001
◦ Baayen et al., 2002
◦ Koppel and Schler, 2003

PCFG models for stylometry

Detecting distributional differences in
sentence structures
◦ Raghavan et al., 2010 (authorship attribution)
◦ Sarawgi et al., 2011 (gender attribution)
◦ Wong and Dras, 2011 (native language identification)

But these models fall short of providing clues about
the salient styles of sentence usage.

What are the stylistic elements in
sentence structures that
characterize individual authors?

Sentence Type - I

◦ Loose (cumulative): the main clause comes first, followed by the supporting clauses/phrases.
“Christopher Columbus finally reached the shores of
San Salvador after months of uncertainty at sea, the
threat of mutiny, and a shortage of food and water.”
◦ Periodic: the supporting clauses/phrases come first, and the main clause closes the sentence.
“After months of uncertainty at sea, the threat of
mutiny, and a shortage of food and water, Christopher
Columbus finally reached the shores of San Salvador.”

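The loose/periodic distinction above can be approximated from the top-level constituents of a parse tree. The sketch below is a hypothetical heuristic, not the procedure used by the authors: the function name `type_one`, the label set `FRONTED_MODIFIERS`, and the rule "a modifier before the subject means periodic" are all assumptions made for illustration.

```python
# Hypothetical heuristic (an assumption, not the authors' algorithm):
# decide loose vs. periodic from the top-level constituent labels of a sentence.

FRONTED_MODIFIERS = {"PP", "SBAR", "ADVP", "S"}  # assumed labels for supporting material
PUNCTUATION = {",", ".", ":", ";"}


def type_one(outline):
    """Classify a sentence from the right-hand side of its top-level production.

    `outline` is a list of labels such as ["PP", ",", "NP", "VP", "."].
    """
    labels = [label for label in outline if label not in PUNCTUATION]
    for label in labels:
        if label in ("NP", "VP"):
            return "loose"      # main clause starts before any supporting phrase
        if label in FRONTED_MODIFIERS:
            return "periodic"   # supporting phrase precedes the main clause
    return "unknown"            # fragments and other outlines stay unclassified


# Outlines assumed for the two Columbus sentences.
print(type_one(["NP", "ADVP", "VP", "."]))             # loose
print(type_one(["PP", ",", "NP", "ADVP", "VP", "."]))  # periodic
```

The heuristic only looks at the top level of the tree, so supporting phrases attached inside the VP (as in the loose example) do not affect the decision.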
Sentence Type Classification

Type-I classification (occurrence of main & supporting clauses)
◦ Loose
◦ Periodic

Type-II classification (occurrence of independent & dependent clauses)
◦ Simple
◦ Complex
◦ Compound
◦ Complex-compound

Type-II Classification
(ICs = independent clauses, DCs = dependent clauses)

Type               # ICs   # DCs   Sentence
Simple             1       0       Jeju is a beautiful island.
Complex            1       ≥1      Jeju is so beautiful that we decided to stay for a few more days.
Compound           ≥2      0       Jeju island is so beautiful and the food here is great too.
Complex-compound   ≥2      ≥1      Although I want to climb Halla, I haven't had the time, and haven't found anyone to go with.

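The Type-II table maps clause counts directly to a label. A minimal sketch of that mapping follows; the function name `type_two` is hypothetical, and the clause counts are assumed to be given (counting clauses from a parse tree is not shown here).

```python
def type_two(num_independent, num_dependent):
    """Map clause counts to a Type-II sentence label, following the table above."""
    if num_independent < 1 or num_dependent < 0:
        raise ValueError("need at least one independent clause and a non-negative DC count")
    if num_independent == 1:
        return "simple" if num_dependent == 0 else "complex"
    return "compound" if num_dependent == 0 else "complex-compound"


# Minimal clause counts for the four rows of the table.
for ics, dcs in [(1, 0), (1, 1), (2, 0), (2, 1)]:
    print(ics, dcs, type_two(ics, dcs))
```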
Datasets

Scientific Papers
◦ ACL Anthology Reference Corpus (Bird et al., 2008)
◦ 10 authors, 8 single-author papers per author

Novels
◦ 5 novelists
◦ 5 novels for each author
◦ First 3,000 sentences taken from each novel
Using parse trees to discover sentence outlines
Outline: S → PP , VP

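A sentence outline is the production at the top of the parse tree. The sketch below reads a bracketed parse string and prints that top-level production. The helper names `parse_tree` and `sentence_outline`, the example sentence, and its hand-written parse are assumptions for illustration; a real pipeline would take the bracketed output of a parser.

```python
def parse_tree(text):
    """Parse a bracketed tree into (label, children) tuples; words stay plain strings."""
    tokens = text.replace("(", " ( ").replace(")", " ) ").split()
    pos = 0

    def read():
        nonlocal pos
        token = tokens[pos]
        pos += 1
        if token != "(":
            return token                 # a word
        label = tokens[pos]
        pos += 1
        children = []
        while tokens[pos] != ")":
            children.append(read())
        pos += 1                         # consume the closing ")"
        return (label, children)

    return read()


def sentence_outline(tree):
    """Return the top-level production, skipping a wrapping ROOT node if present."""
    label, children = tree
    if label == "ROOT" and len(children) == 1 and isinstance(children[0], tuple):
        label, children = children[0]
    rhs = " ".join(c[0] if isinstance(c, tuple) else c for c in children)
    return f"{label} -> {rhs}"


# Hand-written parse (an assumption) of "After lunch, we left."
PARSE = "(ROOT (S (PP (IN After) (NP (NN lunch))) (, ,) (NP (PRP we)) (VP (VBD left)) (. .)))"
print(sentence_outline(parse_tree(PARSE)))
```

Counting these outline strings per author gives the kind of comparison shown on the next slide.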
Comparing sentence outlines

Hobbs               Joshi                   Lin
S → S CC S .        S → ADVP PP NP VP .     S → SBAR NP VP .
S → CC NP VP .      S → PP NP ADVP VP .     FRAG → NP : S .
S → S VP .          S → NP VP .             S → NP VP .
S → NP NP VP .      S → S S CC S .          S → PP VP .
S → PP NP VP .      S → ADVP NP VP .        S → NP ADVP VP .

Tree topology
“For processing free texts, hand-crafted
grammars are neither practical nor reliable.”
“These algorithms cannot deal with words
for which classifiers have not been learned.”
Tree topology: metrics
◦ Leaf height
◦ Furcation height
◦ Level width
◦ Horizontal imbalance
◦ Vertical imbalance

Tree topology: leaf height
Leaf height(“texts”) = 6

Tree topology: furcation height
Furcation height(VP2) = 3

Tree topology: level width
Level width(level 3) = 8

Tree topology: imbalance
Vertical imbalance(PP) = |height(IN) – height(S2)| = |2 – 6| = 4
Horizontal imbalance(PP) = |width(IN) – width(S2)| = |1 – 3| = 2

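Several of the metrics above can be computed with short recursive functions over a parse tree. The sketch below covers leaf height, level width, and the two imbalance measures. Its conventions are assumptions and may differ from those behind the numbers on the slides: depth is counted in edges from the root, words are the leaves, subtree width is the number of words, and imbalance is only defined for nodes with exactly two children. Furcation height is left out because the slides show it only by example. The toy tree and all function names are hypothetical.

```python
from collections import Counter

# Hand-built toy tree (an assumption): (label, [children]); words are plain strings.
TREE = ("S", [
    ("NP", [("PRP", ["We"])]),
    ("VP", [("VBD", ["saw"]),
            ("NP", [("DT", ["the"]), ("NN", ["island"])])]),
])


def leaf_heights(node, depth=0):
    """Return (word, depth) pairs, with depth counted in edges from the root."""
    if isinstance(node, str):
        return [(node, depth)]
    pairs = []
    for child in node[1]:
        pairs.extend(leaf_heights(child, depth + 1))
    return pairs


def level_widths(node, depth=0, counts=None):
    """Count the nodes (including words) found at each depth."""
    counts = Counter() if counts is None else counts
    counts[depth] += 1
    if not isinstance(node, str):
        for child in node[1]:
            level_widths(child, depth + 1, counts)
    return counts


def height(node):
    """Edges on the longest path from this node down to a word."""
    return 0 if isinstance(node, str) else 1 + max(height(c) for c in node[1])


def width(node):
    """Number of words under this node."""
    return 1 if isinstance(node, str) else sum(width(c) for c in node[1])


def imbalance(node):
    """Return (horizontal, vertical) imbalance between the two children of a node."""
    children = node[1]
    if len(children) != 2:
        raise ValueError("imbalance is only sketched for nodes with two children")
    left, right = children
    return abs(width(left) - width(right)), abs(height(left) - height(right))


print(leaf_heights(TREE))
print(dict(level_widths(TREE)))
print(imbalance(TREE))
```

Averaging such values over all sentences of an author gives per-author figures of the kind reported in the table that follows.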
Tree topology metrics: novelists

Tree-topology metrics    Charles    Edward           Jane      Thomas    Walter
(novels)                 Dickens    Bulwer-Lytton    Austen    Hardy     Scott
Sentence Length          24.1       26.7             31.4      21.5      34.1
Leaf Height              4.7        5.0              5.4       4.9       5.9
Furcation Height         1.9        1.9              2.1       1.9       2.1
Level Width              4.1        4.4              4.7       3.8       4.9
Horizontal Imbalance     1.1        1.1              1.3       1.2       1.4
Vertical Imbalance       1.0        1.1              1.2       1.0       1.4

PCFG: production rules
Pr: VP1 → VBG NP1

Beyond PCFG production rules
Pr^: VP1 ^ S2 → VBG NP1
Pr*: NNS1 → “texts”
Pr^*: NNS1 ^ NP1 → “texts”
Syn↑: VP1 → S → PP
Syn↓: VP1 → VBG, VP1 → NP1

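The feature families above can all be read off a parse tree in a single traversal. The sketch below is one possible reading of the slide examples, not the authors' implementation: `pr` is a plain production, `pr^` adds the parent label, `pr*` and `pr^*` are the lexicalized rules at a POS tag, `syn-up` chains a node to its parent and grandparent, and `syn-down` pairs a node with each child separately. The function name `features`, the ASCII feature prefixes, and the hand-built subtree for "For processing free texts" (with node indices dropped) are assumptions.

```python
# Hand-built subtree (an assumption) consistent with the rules shown on the slides.
TREE = ("PP", [
    ("IN", ["For"]),
    ("S", [("VP", [("VBG", ["processing"]),
                   ("NP", [("JJ", ["free"]), ("NNS", ["texts"])])])]),
])


def features(node, parent=None, grandparent=None, out=None):
    """Collect rule-based features as strings from a (label, children) tree."""
    out = [] if out is None else out
    label, children = node
    if all(isinstance(child, str) for child in children):
        # POS tag directly above a word: lexicalized rules.
        word = children[0]
        out.append(f'pr*: {label} -> "{word}"')
        if parent is not None:
            out.append(f'pr^*: {label} ^ {parent} -> "{word}"')
        return out
    rhs = " ".join(child[0] for child in children)
    out.append(f"pr: {label} -> {rhs}")
    if parent is not None:
        out.append(f"pr^: {label} ^ {parent} -> {rhs}")
    if grandparent is not None:
        out.append(f"syn-up: {label} -> {parent} -> {grandparent}")
    for child in children:
        out.append(f"syn-down: {label} -> {child[0]}")
        features(child, label, parent, out)
    return out


for feature in features(TREE):
    print(feature)
```

Feature counts collected this way over an author's sentences can then be fed to a classifier as a sparse vector.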
Experiments

SVM classifiers built with LIBLINEAR
5-fold cross validation
◦ 80% training, 20% testing
◦ 20% training, 80% testing: sufficient training data may not
be available in practical scenarios (e.g., forensics)
(Luyckx and Daelemans, 2008)

Features
◦ PCFG rule-based
◦ STYLE11
   6 parameters from distribution of sentence types: % of sentences that are
   (1) simple, (2) complex, (3) compound, (4) complex-compound, (5) loose, (6) periodic
   5 topological metrics: (1) leaf height, (2) furcation height, (3) level width,
   (4) horizontal imbalance, (5) vertical imbalance

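The two evaluation settings can both be produced from the same five folds: in the 80/20 setting one fold is held out for testing, and in the 20/80 setting the roles are swapped so that one fold trains and the other four test. The sketch below shows one plausible reading of that setup; the slides do not give the exact splitting procedure, and the function name `five_fold_splits` is hypothetical.

```python
def five_fold_splits(items, train_fraction):
    """Return five (train, test) pairs built from the same five folds.

    train_fraction=0.8 trains on four folds and tests on one;
    train_fraction=0.2 trains on one fold and tests on the other four.
    """
    if train_fraction not in (0.8, 0.2):
        raise ValueError("this sketch only supports 0.8 or 0.2")
    folds = [items[i::5] for i in range(5)]
    splits = []
    for held in range(5):
        single = folds[held]
        rest = [x for i, fold in enumerate(folds) if i != held for x in fold]
        splits.append((rest, single) if train_fraction == 0.8 else (single, rest))
    return splits


documents = list(range(10))  # stand-ins for document ids
for train, test in five_fold_splits(documents, 0.2):
    print("train:", train, "test:", test)
```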
Experimental results
Scientific Papers: 20% training data
[Figure: results for the feature sets unigrams, pr^*, syn↕*, syn*v+h; series: parse-tree features vs. parse-tree + Style11 features]
Best unlexicalized feature (pr^): 60.6%

Experimental results
Novels: 20% training data
[Figure: results for the feature sets unigrams, pr^*, syn↕*, syn*v+h; series: parse-tree features vs. parse-tree + Style11 features]
Best unlexicalized feature (synv+h): 73.2%

Unlexicalized features across domains
[Figure: Training v/s Performance: unlexicalized features. Scientific Papers and Novels, trained on 20% data vs. trained on 80% data; chart annotations: 17.0%, 32.9%]

Conclusions
◦ Analyzed writing styles with interpretable
characterization of stylistic elements.
◦ Even without lexical elements, features derived
from sentence structures alone can predict
authorship with high accuracy.
◦ Using topological features of parse trees in
conjunction with features derived from
production rules provides the best results in
authorship attribution.
◦ Even with little training data, our techniques
provide reasonably good performance.